SRE Daily Topic: ELK Log Collection and Analysis

Date: 2026-03-11
Topic number: 11 (11 % 12 = 11)
Difficulty: ⭐⭐⭐⭐
Use case: centralized log management in production environments


1. Architecture Overview

1.1 ELK Stack Components

┌─────────────┐    ┌─────────────┐    ┌───────────────┐
│  Filebeat   │───▶│  Logstash   │───▶│ Elasticsearch │
│ (collector) │    │ (processor) │    │   (storage)   │
└─────────────┘    └─────────────┘    └───────────────┘
                                              │
                                              ▼
                                      ┌─────────────┐
                                      │   Kibana    │
                                      │    (UI)     │
                                      └─────────────┘

1.2 Recommended Deployment Architecture (Production)

Application cluster (many hosts)
    │
    ├── Filebeat ──┐
    ├── Filebeat ──┼──▶ Kafka (buffer) ──▶ Logstash cluster ──▶ Elasticsearch cluster
    ├── Filebeat ──┘
    │
    └───▶ Kibana (behind a load balancer)

2. Elasticsearch Production Configuration

2.1 Kernel Parameter Tuning

# /etc/sysctl.conf
vm.max_map_count=262144
fs.file-max=655360
vm.swappiness=1
net.ipv4.tcp_retries2=5
net.core.somaxconn=65535

# /etc/security/limits.conf
elasticsearch  -  nofile  65536
elasticsearch  -  nproc   65536
elasticsearch  -  memlock unlimited
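
To apply and verify (a sketch; the limits take effect for new sessions of the
elasticsearch user):

sysctl -p                                      # reload /etc/sysctl.conf
sysctl vm.max_map_count                        # expect 262144
su -s /bin/bash elasticsearch -c 'ulimit -n'   # expect 65536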

2.2 JVM Configuration

# /etc/elasticsearch/jvm.options
-Xms16g
-Xmx16g
-XX:+UseG1GC
-XX:G1HeapRegionSize=4m
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ParallelRefProcEnabled
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
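
After a restart, heap size and compressed-oops status can be verified through
the nodes API (add credentials and the CA on a secured cluster):

curl -s "localhost:9200/_nodes/jvm?pretty" | \
  grep -E 'heap_max|using_compressed_ordinary_object_pointers'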

2.3 Elasticsearch Main Configuration

# /etc/elasticsearch/elasticsearch.yml
cluster.name: production-elk
node.name: es-node-01
node.roles: [ master, data, ingest ]

path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts: ["es-node-01", "es-node-02", "es-node-03"]
cluster.initial_master_nodes: ["es-node-01", "es-node-02", "es-node-03"]

# Production security settings
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/http.p12

# Performance tuning
indices.memory.index_buffer_size: 20%
thread_pool.write.queue_size: 1000
thread_pool.search.queue_size: 1000

# Slow log thresholds are dynamic index-level settings in ES 5+ and cannot
# be set in elasticsearch.yml (the node will refuse to start); apply them
# per index or in the index template (see the API call below)
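A sketch of applying the slow log thresholds at runtime (or bake them into the
template in section 2.4):

PUT /logs-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s"
}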

2.4 Index Template

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "5s",
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs",
      "codec": "best_compression"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "host": {
          "type": "object",
          "properties": {
            "name": { "type": "keyword" },
            "ip": { "type": "ip" }
          }
        },
        "service": { "type": "keyword" },
        "level": { "type": "keyword" },
        "message": { 
          "type": "text",
          "analyzer": "standard"
        },
        "trace_id": { "type": "keyword" },
        "span_id": { "type": "keyword" },
        "duration_ms": { "type": "long" },
        "status_code": { "type": "integer" }
      }
    }
  },
  "priority": 100
}
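
To confirm which template, settings, and mappings a new index would pick up
(ES 7.9+), the simulate API gives a quick check:

POST /_index_template/_simulate_index/logs-000001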

2.5 ILM Lifecycle Policy

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "set_priority": { "priority": 50 },
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "set_priority": { "priority": 0 },
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
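
Note: the freeze action was deprecated in 7.x and is a no-op/removed on ES 8.x;
on 8.x drop it from the cold phase (set_priority alone is fine, or use
searchable snapshots). Rollover also needs an initial write index behind the
alias. A minimal bootstrap sketch (the Logstash output in section 3.2 can also
create this automatically when ILM is enabled):

PUT logs-000001
{
  "aliases": {
    "logs": { "is_write_index": true }
  }
}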

3. Logstash Configuration

3.1 Main Configuration File

# /etc/logstash/logstash.yml
http.host: "0.0.0.0"
http.port: 9600
log.level: info
path.logs: /var/log/logstash
pipeline.workers: 4
pipeline.batch.size: 125
pipeline.batch.delay: 50
queue.type: persisted
queue.max_bytes: 4gb
queue.checkpoint.writes: 1024
dead_letter_queue.enable: true
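
Once Logstash is up, queue depth and event throughput are visible through its
monitoring API on the port configured above:

curl -s "localhost:9600/_node/stats/pipelines?pretty"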

3.2 Pipeline Configuration (split across multiple files)

# /etc/logstash/conf.d/01-inputs.conf
input {
  kafka {
    bootstrap_servers => "kafka-01:9092,kafka-02:9092,kafka-03:9092"
    topics => ["app-logs", "system-logs", "nginx-logs"]
    group_id => "logstash-consumer"
    consumer_threads => 4
    decorate_events => true
    auto_offset_reset => "latest"
    codec => json
  }

  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/logstash/certs/logstash.crt"
    ssl_key => "/etc/logstash/certs/logstash.key"
  }
}

# /etc/logstash/conf.d/02-filters.conf
filter {
  # Parse timestamps (assumes a top-level "timestamp" field, e.g. from the
  # JSON codec on the Kafka input; timestamps extracted by grok below would
  # need their own date filter)
  date {
    match => [ "timestamp", "ISO8601", "yyyy-MM-dd HH:mm:ss", "UNIX" ]
    target => "@timestamp"
    timezone => "Asia/Shanghai"
  }

  # Parse JSON-formatted messages
  if [message] =~ /^\{.*\}$/ {
    json {
      source => "message"
      target => "json_data"
    }
  }

  # Grok patterns for common log formats
  grok {
    match => {
      "message" => [
        "%{COMBINEDAPACHELOG}",
        "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:host} %{DATA:program}: %{GREEDYDATA:log_message}",
        "%{TIMESTAMP_ISO8601:log_timestamp} %{LOGLEVEL:level} %{DATA:logger} - %{GREEDYDATA:log_message}"
      ]
    }
    tag_on_failure => ["_grokparsefailure"]
  }

  # Add environment labels
  mutate {
    add_field => {
      "environment" => "production"
      "datacenter" => "cn-east-1"
    }
  }

  # Drop fields we don't need (note: this removes host.*, which the index
  # template in 2.4 maps — keep it if you rely on host.name/host.ip)
  mutate {
    remove_field => ["host", "agent", "ecs", "input_type"]
  }

  # Route by log level
  if [level] in ["ERROR", "FATAL", "CRITICAL"] {
    mutate {
      add_tag => ["high_priority"]
    }
  }
}

# /etc/logstash/conf.d/03-outputs.conf
output {
  elasticsearch {
    hosts => ["https://es-node-01:9200", "https://es-node-02:9200", "https://es-node-03:9200"]
    # with ilm_enabled the plugin writes through ilm_rollover_alias, and
    # this index setting is effectively ignored
    index => "logs-%{+YYYY.MM.dd}"
    user => "logstash_writer"
    password => "${ES_PASSWORD}"
    ssl => true
    ssl_certificate_verification => true
    cacert => "/etc/logstash/certs/ca.crt"
    manage_template => false
    ilm_enabled => true
    ilm_rollover_alias => "logs"
    ilm_pattern => "000001"
    ilm_policy => "logs-policy"
    action => "create"
  }

  # Send high-priority logs to a separate index
  if "high_priority" in [tags] {
    elasticsearch {
      hosts => ["https://es-node-01:9200"]
      index => "alerts-%{+YYYY.MM.dd}"
      user => "logstash_writer"
      password => "${ES_PASSWORD}"
      ssl => true
      ssl_certificate_verification => true
      cacert => "/etc/logstash/certs/ca.crt"
    }
  }

  # Debug output (keep disabled in production)
  # stdout { codec => rubydebug }
}
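
Before (re)starting the service, the pipeline files can be syntax-checked
(paths assume a package install):

/usr/share/logstash/bin/logstash --path.settings /etc/logstash --config.test_and_exit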

4. Filebeat Configuration

4.1 Application Log Collection

# /etc/filebeat/filebeat.yml
filebeat.inputs:
  # Nginx access logs
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log
    tags: ["nginx", "access"]
    fields:
      service: nginx
      log_type: access
    # Join continuation lines (anything not starting with an IP) onto the
    # previous event; the json.* options do not belong on this input —
    # nginx access logs are plain text (they are set on the JSON input below)
    multiline.pattern: '^\d+\.\d+\.\d+\.\d+'
    multiline.negate: true
    multiline.match: after

  # Application logs (JSON format)
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    tags: ["application"]
    fields:
      service: myapp
      log_type: application
    json.keys_under_root: true
    json.overwrite_keys: true
    json.add_error_key: true
    # include_lines entries are regular expressions; listing every level
    # is effectively a no-op — trim this to what you actually want
    include_lines: ['ERROR', 'WARN', 'INFO', 'DEBUG']

  # System logs
  - type: syslog
    enabled: true
    protocol.udp:
      host: "localhost:514"
    fields:
      service: syslog
      log_type: system

  # Kubernetes container logs
  - type: container
    enabled: true
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

# Output to Logstash
output.logstash:
  hosts: ["logstash-01:5044", "logstash-02:5044"]
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/filebeat/certs/ca.crt"]
  ssl.certificate: "/etc/filebeat/certs/filebeat.crt"
  ssl.key: "/etc/filebeat/certs/filebeat.key"
  loadbalance: true
  worker: 2

# Internal monitoring
monitoring.enabled: true
monitoring.elasticsearch:
  hosts: ["https://es-node-01:9200"]
  username: "filebeat_internal"
  password: "${ES_PASSWORD}"
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/filebeat/certs/ca.crt"]

# Filebeat's own logging
logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat.log
  keepfiles: 7
  permissions: 0644
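
To validate the configuration and output connectivity before starting the
service:

filebeat test config -c /etc/filebeat/filebeat.yml
filebeat test output -c /etc/filebeat/filebeat.yml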

4.2 Using Filebeat Modules

# Enable common modules
filebeat modules enable nginx
filebeat modules enable mysql
filebeat modules enable redis
filebeat modules enable kafka
filebeat modules enable system

# Configure a module
cat > /etc/filebeat/modules.d/nginx.yml << EOF
- module: nginx
  access:
    enabled: true
    var.paths: ["/var/log/nginx/access.log"]
  error:
    enabled: true
    var.paths: ["/var/log/nginx/error.log"]
  ingress_controller:
    enabled: false
EOF
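
To confirm which modules ended up enabled:

filebeat modules list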

5. Kibana Configuration

5.1 Basic Configuration

# /etc/kibana/kibana.yml
server.port: 5601
server.host: "0.0.0.0"
server.name: "kibana"
elasticsearch.hosts: ["https://es-node-01:9200", "https://es-node-02:9200"]
elasticsearch.username: "kibana_system"
elasticsearch.password: "${ES_PASSWORD}"
elasticsearch.ssl.verificationMode: certificate
elasticsearch.ssl.certificateAuthorities: ["/etc/kibana/certs/ca.crt"]

# Note: xpack.security.enabled is no longer a valid Kibana setting in 8.x;
# security is driven from the Elasticsearch side plus these encryption keys
xpack.encryptedSavedObjects.encryptionKey: "<random key of at least 32 characters>"
xpack.reporting.encryptionKey: "<random key of at least 32 characters>"
xpack.security.encryptionKey: "<random key of at least 32 characters>"

# Performance tuning
elasticsearch.requestTimeout: 300000
elasticsearch.pingTimeout: 1500
elasticsearch.shardTimeout: 30000

# Chinese UI localization
i18n.locale: "zh-CN"
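
A quick liveness check once Kibana is up (expect HTTP 200):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:5601/api/status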

5.2 Common Query Examples

# Find error logs
level: ERROR AND @timestamp > now-1h

# Filter by service
service: "payment-service" AND level: ERROR

# Find slow requests
duration_ms > 1000

# Look up a specific trace ID
trace_id: "abc123def456"

# Aggregation via the ES API (error count per service)
POST /logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "term": { "level": "ERROR" } }
      ]
    }
  },
  "aggs": {
    "services": {
      "terms": { "field": "service" }
    }
  }
}

6. Monitoring and Alerting

6.1 Elasticsearch Health Checks

# Cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Node status
curl -X GET "localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,load_1m,node.role"

# Index status
curl -X GET "localhost:9200/_cat/indices?v&s=store.size:desc"

# Shard distribution
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,docs,store,node"

# Long-running tasks (a proxy for slow queries)
curl -X GET "localhost:9200/_tasks?detailed&pretty"

# Cluster-wide statistics
curl -X GET "localhost:9200/_cluster/stats?pretty"

6.2 Key Monitoring Metrics

Metric              Threshold     Notes
Cluster status      red/yellow    red = data loss, yellow = missing replicas
CPU usage           > 80%         sustained high load; plan to scale out
Heap usage          > 75%         GC becomes frequent
Disk usage          > 85%         hits the high disk watermark
Query latency       P95 > 1s      degraded user experience
Indexing latency    P95 > 500ms   write-side bottleneck
Write rejections    > 0           write queue is full
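
Several of these can be spot-checked straight from the _cat APIs:

curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,disk.used_percent"
curl -s "localhost:9200/_cluster/health?pretty" | grep status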

6.3 Prometheus Scrape Configuration

# prometheus.yml — scrape_configs
# Note: stock Elasticsearch/Logstash/Filebeat expose JSON stats APIs, not
# Prometheus metrics; the paths below assume exporter plugins or sidecars
# (see the elasticsearch_exporter sketch after this block)
- job_name: 'elasticsearch'
  static_configs:
    - targets: ['es-node-01:9200', 'es-node-02:9200']
  metrics_path: /_prometheus/metrics
  scheme: https
  basic_auth:
    username: prometheus
    # Prometheus does not expand environment variables; use password_file
    password_file: /etc/prometheus/es_password
  tls_config:
    ca_file: /etc/prometheus/ca.crt

- job_name: 'logstash'
  static_configs:
    - targets: ['logstash-01:9600', 'logstash-02:9600']
  metrics_path: /_node/stats/prometheus

- job_name: 'filebeat'
  static_configs:
    - targets: ['filebeat-exporter:5066']
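
Since stock Elasticsearch has no Prometheus endpoint, a common setup is the
standalone elasticsearch_exporter (a sketch; image, port, and flags per its
README — the scrape target then becomes es-exporter:9114 with the default
/metrics path):

docker run -d --name es-exporter -p 9114:9114 \
  -v /etc/prometheus/ca.crt:/certs/ca.crt:ro \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  '--es.uri=https://prometheus:<password>@es-node-01:9200' \
  --es.ca=/certs/ca.crt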

6.4 Alerting Rules (Prometheus Alertmanager)

# alerting-rules.yml
# Metric names below follow elasticsearch_exporter / logstash_exporter
# conventions; adjust to whatever your exporters actually emit
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ES cluster status is RED"

      - alert: ElasticsearchHighHeapUsage
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ES heap usage above 85%"

      - alert: ElasticsearchDiskWatermarkHigh
        expr: elasticsearch_indices_store_size_bytes / elasticsearch_filesystem_data_size_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ES disk usage above 85%"

      - alert: LogstashPipelineBackpressure
        expr: rate(logstash_events_out[5m]) < rate(logstash_events_in[5m]) * 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Logstash pipeline is backing up"

7. Troubleshooting Guide

7.1 Common Problems and Fixes

Problem 1: Write rejections

Symptom: logs show "rejected execution of processing ..." errors

Diagnosis:

# Check write thread pool queues and rejection counts
curl -X GET "localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected"

# Find the largest (hot) shards
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,docs,store,node" | sort -k4 -rn | head -20

Fixes:

  1. Increase the write thread pool queue: thread_pool.write.queue_size: 2000
  2. Add primary shards or reduce replicas
  3. Tune bulk batch sizes (5-15MB per request is a good starting point)
  4. Check for unusually large documents

Problem 2: Query timeouts

Symptom: Kibana queries time out or return partial results

Diagnosis:

# List long-running search tasks
curl -X GET "localhost:9200/_tasks?detailed&actions=*search"

# Check whether shards are evenly distributed
curl -X GET "localhost:9200/_cat/shards/{index}?v&h=index,shard,prirep,docs,store,node"

Fixes:

  1. Optimize the query; avoid scanning everything
  2. Use filter context instead of query context — filters are cacheable (see the sketch below)
  3. Increase the query timeout
  4. Map frequently queried fields appropriately (e.g. keyword for exact matches)
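
A sketch of fix 2 — moving conditions into filter context so ES can cache them
and skip scoring (field names follow the template in section 2.4):

POST /logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}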

Problem 3: Disk watermark alerts

Symptom: cluster turns yellow or red and writes are blocked

Diagnosis:

# Check per-node disk usage
curl -X GET "localhost:9200/_cat/allocation?v"

# List indices by size
curl -X GET "localhost:9200/_cat/indices?v&s=store.size:desc"

Fixes:

  1. Temporarily raise the watermarks (emergency measure only):
    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "transient": {
        "cluster.routing.allocation.disk.watermark.low": "90%",
        "cluster.routing.allocation.disk.watermark.high": "95%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
      }
    }'

  2. Delete old indices or shorten the ILM retention window (see below)
  3. Add data nodes
  4. Adopt a hot-warm architecture
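
A hedged example of fix 2 (the index pattern is illustrative — always preview
what matches first; ES 8.x also blocks wildcard deletes unless
action.destructive_requires_name is set to false):

curl -X GET "localhost:9200/_cat/indices/logs-2026.01.*?v"
curl -X DELETE "localhost:9200/logs-2026.01.*"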

Problem 4: Filebeat losing logs

Symptom: some logs never make it into ES

Diagnosis:

# Check the service and output connectivity
systemctl status filebeat
filebeat test output

# Watch Filebeat's own logs
tail -f /var/log/filebeat/filebeat.log | grep -i error

# Inspect the registry (harvester read offsets)
cat /var/lib/filebeat/registry/filebeat/log.json

Fixes:

  1. Check log file permissions
  2. Tune scan_frequency and harvester_limit
  3. Verify the multiline configuration
  4. Increase queue.mem.events and output.logstash.worker

7.2 Performance Diagnostics

# Hot thread analysis (CPU)
curl -X GET "localhost:9200/_nodes/hot_threads"

# JVM memory stats
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

# Indexing and search stats
curl -X GET "localhost:9200/_nodes/stats/indices?pretty"

# Recovery status
curl -X GET "localhost:9200/_cat/recovery?v&active_only=true"

# Segment status (merge pressure shows up as many small segments)
curl -X GET "localhost:9200/_cat/segments?v"

8. Best Practices

8.1 Index Design

  1. Time-based indices: logs-2026.03.11-style names make management and deletion easy
  2. Right-size shards: aim for 20-50GB per shard; too many shards hurts performance
  3. Use aliases: switch indices behind an alias with no client changes
  4. Disable fields you never search: "enabled": false cuts storage — see the mapping sketch below
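
A minimal mapping sketch for item 4 (the debug_payload field name is
illustrative); with "enabled": false the object is kept in _source but is
neither parsed nor indexed:

PUT logs-000001/_mapping
{
  "properties": {
    "debug_payload": {
      "type": "object",
      "enabled": false
    }
  }
}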

8.2 Write Optimization

  1. Use the Bulk API: batching writes cuts network overhead (example below)
  2. Tune batch sizes: 5-15MB or 1,000-5,000 documents per request
  3. Write asynchronously: don't block on acknowledgements; raises throughput
  4. Buffer through Kafka: absorbs traffic spikes and protects ES
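
A minimal Bulk API sketch for item 1 (two documents here; real batches should
be far larger, and "create" matches the Logstash output's action in 3.2):

POST /_bulk
{ "create": { "_index": "logs" } }
{ "@timestamp": "2026-03-11T10:00:00Z", "service": "myapp", "level": "INFO", "message": "started" }
{ "create": { "_index": "logs" } }
{ "@timestamp": "2026-03-11T10:00:01Z", "service": "myapp", "level": "ERROR", "message": "db timeout" }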

8.3 Query Optimization

  1. Use filter context: cacheable and skips scoring (see the example in 7.1)
  2. Avoid wildcard queries: patterns like *abc* perform terribly
  3. Use routing: direct queries to specific shards
  4. Pre-compute aggregations: materialize expensive aggregations ahead of time

8.4 Security Hardening

  1. Enable X-Pack Security: enforce authentication and encryption
  2. Least privilege: give each role only the permissions it needs
  3. Network isolation: never expose the ES cluster to the public internet
  4. Rotate passwords regularly: automate credential management
  5. Audit logging: record who accessed what

8.5 Capacity Planning

Daily log volume   ES node spec            Storage (30-day retention)
10GB               3 nodes (8C/16G)        900GB (incl. replicas)
50GB               5 nodes (16C/32G)       4.5TB
100GB              9 nodes (16C/64G)       9TB
500GB+             20+ nodes (32C/128G)    45TB+

Formula:

Total storage = daily volume × retention days × (1 + replica count) × 1.5 (overhead factor)

e.g. 10GB/day × 30 days × (1 + 1 replica) × 1.5 = 900GB, matching the first row above.

9. Quick Deployment Scripts

9.1 Docker Compose Quick Start (test environments only)

# docker-compose.yml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"
      - "9600:9600"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.11.0
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    depends_on:
      - logstash

volumes:
  es_data:
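
To bring the stack up and smoke-test it (assumes the Docker Compose v2 CLI):

docker compose up -d
curl -s http://localhost:9200 | grep cluster_name                           # ES answers
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:5601/api/status   # expect 200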

9.2 Production Deployment Checklist

  • Kernel parameters tuned
  • JVM heap configured (no more than 32GB)
  • SSL/TLS certificates generated and configured
  • Users and roles created (least privilege)
  • ILM policy configured
  • Index templates created
  • Monitoring and alerting in place
  • Backup strategy configured (snapshots — see the sketch below)
  • Network firewall rules configured
  • Log rotation configured
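
A minimal snapshot repository sketch for the backup item (the /backups/es
path is illustrative and must be listed under path.repo on every node):

PUT _snapshot/backup_repo
{
  "type": "fs",
  "settings": { "location": "/backups/es" }
}

PUT _snapshot/backup_repo/snapshot-2026.03.11?wait_for_completion=false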

Document version: v1.0
Last updated: 2026-03-11
Maintainer: SRE Team
