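SRE Topic Guide: SLO/SLI Service Level Objectives in Practice

This guide targets production environments and covers the core SRE practices around service levels: defining SLOs, SLIs, and SLAs, managing error budgets, designing alerting strategies, and integrating with observability tooling.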


1. SLO/SLI/SLA Concepts and Framework

1.1 Core Concepts

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SLO/SLI/SLA Relationship                             │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SLA (Service Level Agreement)                        │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                     SLO (Service Level Objective)                     │  │
│  │                                                                       │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │                  SLI (Service Level Indicator)                  │  │  │
│  │  │                                                                 │  │  │
│  │  │  • Availability                                                 │  │  │
│  │  │  • Latency                                                      │  │  │
│  │  │  • Throughput                                                   │  │  │
│  │  │  • Error rate                                                   │  │  │
│  │  │  • Saturation                                                   │  │  │
│  │  │                                                                 │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │                                                                       │  │
│  │  Targets: 99.9% availability, P99 latency < 500 ms                    │  │
│  │                                                                       │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  Breach consequences: service refunds, compensation clauses                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│  Error Budget                                                               │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  99.9% availability = allowed downtime per month:                     │  │
│  │  30 days × 24 hours × 60 minutes × (1 - 0.999) = 43.2 minutes         │  │
│  │                                                                       │  │
│  │  Error budget = 1 - SLO = 0.1% of time allowed to fail                │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

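1.2 SLO Definition Framework

┌─────────────────────────────────────────────────────────────────────────────┐
│                           SLO Definition Workflow                           │
└─────────────────────────────────────────────────────────────────────────────┘

Step 1: Identify the critical service path
        ┌───────────┐     ┌───────────┐     ┌───────────┐
        │   User    │────▶│    API    │────▶│ Database  │
        │  request  │     │  gateway  │     │           │
        └───────────┘     └───────────┘     └───────────┘
              │
              ▼
Step 2: Define the SLIs
        • Request success rate = successful requests / total requests
        • Latency distribution = P50, P95, P99
        • Throughput = requests per second
        • Error rate = 5xx responses / total responses
              │
              ▼
Step 3: Set the SLO targets
        • Availability SLO: 99.9%
        • Latency SLO: P99 < 500 ms
        • Error rate SLO: < 0.1%
              │
              ▼
Step 4: Compute the error budget
        • Monthly error budget = 43.2 minutes
        • Daily error budget = 1.44 minutes
        • Also express the budget in terms of request volume
              │
              ▼
Step 5: Establish the alerting strategy
        • Error budget burn-rate alerts
        • Budget exhaustion alerts
        • Multi-window alerting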
2. SLI Design

2.1 Common SLI Definitions

# sli-definitions.yaml - SLI definitions

# ==============================================================================
# Availability SLI
# ==============================================================================
availability_sli:
  name: "Request availability"
  description: "Proportion of requests handled successfully"
  formula: "successful requests / total requests"
  measurement:
    source: "prometheus"
    query: |
      sum(rate(http_requests_total{status!~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
  exclusion:
    - "Health check requests"
    - "Staging environment requests"
    - "Internal monitoring requests"
  slo_target: 0.999  # 99.9%

# ==============================================================================
# Latency SLI
# ==============================================================================
# Buckets are summed by `le` so that each quantile covers all series together
latency_sli:
  name: "Request latency"
  description: "Distribution of request response times"
  measurement:
    - name: "P50 latency"
      query: 'histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
      slo_target: 0.1  # 100ms

    - name: "P95 latency"
      query: 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
      slo_target: 0.3  # 300ms

    - name: "P99 latency"
      query: 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
      slo_target: 0.5  # 500ms

  exclusion:
    - "Large file upload requests"
    - "Bulk export requests"

# ==============================================================================
# Error Rate SLI
# ==============================================================================
error_rate_sli:
  name: "Error rate"
  description: "Proportion of failed requests"
  formula: "failed requests / total requests"
  measurement:
    source: "prometheus"
    query: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
  slo_target: 0.001  # 0.1%

# ==============================================================================
# Throughput SLI
# ==============================================================================
throughput_sli:
  name: "Throughput"
  description: "Request-processing capacity of the system"
  measurement:
    source: "prometheus"
    query: "sum(rate(http_requests_total[5m]))"
  slo_target: 10000  # 10000 req/s

# ==============================================================================
# Saturation SLI
# ==============================================================================
saturation_sli:
  # Note: this query returns average CPU cores in use per series; divide by the
  # container CPU limit to get a ratio comparable with the 0.8 target.
  - name: "CPU utilization"
    query: 'avg(rate(container_cpu_usage_seconds_total[5m]))'
    slo_target: 0.8  # 80%

  - name: "Memory utilization"
    query: 'container_memory_usage_bytes / container_spec_memory_limit_bytes'
    slo_target: 0.85  # 85%

  # node_exporter exposes available and total bytes; usage is derived from them
  - name: "Disk utilization"
    query: '1 - node_filesystem_avail_bytes / node_filesystem_size_bytes'
    slo_target: 0.85  # 85%
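
The latency queries above depend on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following Python function is a simplified sketch of that idea, not Prometheus's actual implementation. It assumes the buckets are sorted by upper bound, the counts are cumulative, and the last bucket is `+Inf`; the bucket boundaries and counts in the example are invented for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank lands in the overflow bucket: report the highest finite bound
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Linear interpolation between the bucket's lower and upper bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Cumulative counts for le = 0.1s, 0.3s, 0.5s, 1.0s, +Inf (made-up data)
    buckets = [(0.1, 500), (0.3, 900), (0.5, 990), (1.0, 999), (math.inf, 1000)]
    for q in (0.50, 0.95, 0.99):
        print(f"P{int(q * 100)}: {histogram_quantile(q, buckets):.3f}s")
```

Because the estimate is interpolated, its accuracy depends on the bucket layout: placing a bucket boundary exactly at the SLO threshold (here 0.5 s) keeps the comparison against the target precise.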

2.2 SLI Templates by Service Type

# ==============================================================================
# Web API service SLIs
# ==============================================================================
web_api_sli:
  critical_path:
    - endpoint: "/api/v1/users"
      availability_slo: 0.999
      latency_p99_slo: 0.2

    - endpoint: "/api/v1/orders"
      availability_slo: 0.9999  # orders are more critical
      latency_p99_slo: 0.5

  non_critical_path:
    - endpoint: "/api/v1/analytics"
      availability_slo: 0.99
      latency_p99_slo: 2.0  # analytics may be slower

# ==============================================================================
# Database service SLIs
# ==============================================================================
database_sli:
  - name: "Query latency"
    query_type: "select"
    p99_slo: 0.05  # 50ms

  - name: "Write latency"
    query_type: "insert"
    p99_slo: 0.1   # 100ms

  - name: "Connection pool availability"
    slo: 0.999

# ==============================================================================
# Message queue service SLIs
# ==============================================================================
message_queue_sli:
  - name: "Message delivery latency"
    description: "Time from message production to consumption"
    p99_slo: 5.0  # 5 seconds

  - name: "Message delivery success rate"
    slo: 0.9999

  - name: "Consumer backlog"
    threshold: 100000  # maximum backlog size

# ==============================================================================
# Cache service SLIs
# ==============================================================================
cache_sli:
  - name: "Cache hit rate"
    slo: 0.9  # 90% hit rate

  - name: "Cache latency"
    p99_slo: 0.01  # 10ms

3. Error Budget Management

3.1 Calculating the Error Budget

#!/usr/bin/env python3
# error_budget_calculator.py - error budget calculator

from datetime import timedelta
from typing import Dict

class ErrorBudgetCalculator:
    """Error budget calculator."""

    # Supported time windows
    TIME_WINDOWS = {
        'daily': timedelta(days=1),
        'weekly': timedelta(weeks=1),
        'monthly': timedelta(days=30),
        'quarterly': timedelta(days=90),
        'yearly': timedelta(days=365)
    }

    def __init__(self, slo_target: float):
        """
        Initialize the error budget calculator.

        Args:
            slo_target: SLO target as a fraction (0.0 - 1.0), e.g. 0.999 for 99.9%
        """
        if not 0.0 < slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        self.slo_target = slo_target
        self.error_budget = 1 - slo_target
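    def calculate_downtime_budget(self, window: str = 'monthly') -> Dict[str, float]:
        """
        Compute the downtime budget for the given time window.

        Unknown window names fall back to 'monthly'.

        Returns:
            Dict with the budget expressed in several time units
        """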
        duration = self.TIME_WINDOWS.get(window, self.TIME_WINDOWS['monthly'])
        total_seconds = duration.total_seconds()
        budget_seconds = total_seconds * self.error_budget

        return {
            'window': window,
            'total_seconds': total_seconds,
            'budget_seconds': budget_seconds,
            'budget_minutes': budget_seconds / 60,
            'budget_hours': budget_seconds / 3600,
            'slo_target': self.slo_target,
            'slo_percentage': f"{self.slo_target * 100:.3f}%"
        }

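    def calculate_request_budget(self, total_requests: int) -> Dict[str, float]:
        """
        Compute the error budget in terms of request count.

        Args:
            total_requests: total number of requests

        Returns:
            Error budget details
        """
        # Round before truncating so float noise (e.g. 99.99999999) does not drop a request
        allowed_failures = int(round(total_requests * self.error_budget, 6))

        return {
            'total_requests': total_requests,
            'allowed_failures': allowed_failures,
            'slo_target': self.slo_target
        }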

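    def calculate_burn_rate(self,
                           consumed_budget: float,
                           time_elapsed_hours: float,
                           window_hours: float = 720) -> float:
        """
        Compute the error budget burn rate.

        Args:
            consumed_budget: fraction of the budget already consumed (0.0 - 1.0)
            time_elapsed_hours: hours elapsed since the window started
            window_hours: budget window in hours (default: 30 days)

        Returns:
            Burn-rate multiple; 1.0 means the budget lasts exactly one window
        """
        if time_elapsed_hours <= 0:
            raise ValueError("time_elapsed_hours must be positive")

        expected_burn_rate = 1.0 / window_hours  # budget fraction per hour at an even pace
        actual_burn_rate = consumed_budget / time_elapsed_hours  # budget fraction per hour observed

        return actual_burn_rate / expected_burn_rate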

    def calculate_slo_targets(self) -> Dict[str, Dict]:
        """
        Compute the downtime budget for common availability targets.

        Returns:
            Downtime budget per availability level (minutes; hours for 'yearly')
        """
        targets = [0.99, 0.999, 0.9999, 0.99999]
        results = {}

        for target in targets:
            calculator = ErrorBudgetCalculator(target)
            # ':g' keeps the keys tidy (e.g. '99.9%') even if the product carries float noise
            results[f"{target * 100:g}%"] = {
                'daily': calculator.calculate_downtime_budget('daily')['budget_minutes'],
                'weekly': calculator.calculate_downtime_budget('weekly')['budget_minutes'],
                'monthly': calculator.calculate_downtime_budget('monthly')['budget_minutes'],
                'yearly': calculator.calculate_downtime_budget('yearly')['budget_hours']
            }

        return results


# Usage example
if __name__ == "__main__":
    import json

    # 99.9% SLO
    calculator = ErrorBudgetCalculator(0.999)

    # Monthly downtime budget
    monthly_budget = calculator.calculate_downtime_budget('monthly')
    print(f"Monthly downtime budget: {monthly_budget['budget_minutes']:.2f} minutes")

    # Request budget
    request_budget = calculator.calculate_request_budget(1000000)
    print(f"Allowed failures per 1,000,000 requests: {request_budget['allowed_failures']}")

    # Reference table for common SLO levels
    print("\nSLO reference table:")
    print(json.dumps(calculator.calculate_slo_targets(), indent=2))
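
A burn rate is often easier to act on once it is converted into the time left before the budget runs out. The helper below is a standalone sketch, not part of the calculator above: `hours_until_exhaustion` is a hypothetical name, and it assumes the burn rate stays constant over a 30-day (720-hour) window.

```python
def hours_until_exhaustion(burn_rate: float,
                           budget_remaining: float = 1.0,
                           window_hours: float = 720.0) -> float:
    """Hours until the remaining budget is spent at a constant burn rate.

    burn_rate: multiple of the sustainable rate (1.0 lasts the whole window)
    budget_remaining: fraction of the budget still available (0.0 - 1.0)
    """
    if burn_rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_hours * max(budget_remaining, 0.0) / burn_rate


if __name__ == "__main__":
    print(hours_until_exhaustion(2))        # 360.0 hours (15 days)
    print(hours_until_exhaustion(5))        # 144.0 hours (6 days)
    print(hours_until_exhaustion(10, 0.5))  # 36.0 hours with half the budget left
```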

3.2 Monitoring Error Budget Consumption

# error_budget_monitoring.yaml

# ==============================================================================
# Prometheus rules - error budget calculation
# ==============================================================================
groups:
  - name: error_budget_rules
    interval: 1m
    rules:
      # 30-day availability
      - record: slo:availability:30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d]))
          / sum(rate(http_requests_total[30d]))

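      # Error budget remaining (1 = untouched, 0 = exhausted, negative = SLO violated)
      - record: slo:error_budget:remaining
        expr: |
          (slo:availability:30d - 0.999) / (1 - 0.999)

      # Error budget burn rate: 1h error rate relative to the rate the SLO allows
      # (1 = the budget lasts exactly 30 days; 2 = it is gone in 15 days)
      - record: slo:error_budget:burn_rate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          )
          / (1 - 0.999)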

      # 1-hour availability
      - record: slo:availability:1h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[1h]))
          / sum(rate(http_requests_total[1h]))

      # 5-minute availability
      - record: slo:availability:5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))

# ==============================================================================
# Grafana dashboard variables
# ==============================================================================
dashboard_variables:
  - name: "SLO Target"
    query: "0.99,0.999,0.9999"

  - name: "Time Window"
    query: "1h,6h,24h,7d,30d"
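
The arithmetic behind `slo:error_budget:remaining` can be checked offline with a small Python sketch. The function names are illustrative and not part of any library; the sketch assumes the same 99.9% target and 30-day window as the rules above.

```python
def error_budget_remaining(availability: float, slo_target: float = 0.999) -> float:
    """Fraction of the error budget left; negative once the SLO is violated."""
    return (availability - slo_target) / (1.0 - slo_target)


def downtime_minutes_used(availability: float, window_days: int = 30) -> float:
    """Unavailability expressed as equivalent minutes of full downtime."""
    return (1.0 - availability) * window_days * 24 * 60


if __name__ == "__main__":
    for availability in (0.9995, 0.999, 0.998):
        remaining = error_budget_remaining(availability)
        used = downtime_minutes_used(availability)
        print(f"availability {availability:.4f}: "
              f"{remaining:.0%} of budget left, {used:.1f} downtime-minutes used")
```

An availability of 99.95% leaves half the budget, 99.9% leaves none, and 99.8% overspends it by a full budget, which is why the remaining value can go negative.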

4. SLO Alerting Strategy

4.1 Multi-Window, Multi-Burn-Rate Alerts

# slo-alerting.yaml - SLO-based alerting strategy

# ==============================================================================
# Multi-window, multi-burn-rate alerts (recommended)
# ==============================================================================
# Principle: check the error rate over a short and a long window at the same
# time, so that brief blips do not trigger false alarms
# Reference: Google, The Site Reliability Workbook, "Alerting on SLOs"

groups:
  - name: slo-alerts
    rules:
      # ----------------------------------------------------------------------
      # Availability SLO alerts
      # ----------------------------------------------------------------------

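      # 1h window, 2x burn rate (one sustained hour at 2x consumes about 0.28% of the 30-day budget)
      - alert: SLOErrorBudgetBurnRate2x
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 2
        for: 5m
        labels:
          severity: warning
          slo: availability
          burn_rate: "2x"
        annotations:
          summary: "Availability SLO burn rate 2x"
          description: |
            The error rate over the past hour exceeds 2x the rate allowed by the SLO.
            Current error rate: {{ $value | printf "%.4f" }}
            Sustained for one hour, this consumes about 0.28% of the 30-day error budget.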

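      # 6h window, 5x burn rate (six sustained hours at 5x consume about 4.2% of the 30-day budget)
      - alert: SLOErrorBudgetBurnRate5x
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (1 - 0.999) * 5
        for: 30m
        labels:
          severity: warning
          slo: availability
          burn_rate: "5x"
        annotations:
          summary: "Availability SLO burn rate 5x"
          description: |
            The error rate over the past 6 hours exceeds 5x the rate allowed by the SLO.
            Current error rate: {{ $value | printf "%.4f" }}
            Sustained for 6 hours, this consumes about 4.2% of the 30-day error budget.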

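      # 1d window, 10x burn rate (one sustained day at 10x consumes about 33% of the 30-day budget)
      - alert: SLOErrorBudgetBurnRate10x
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1d]))
            / sum(rate(http_requests_total[1d]))
          ) > (1 - 0.999) * 10
        for: 2h
        labels:
          severity: critical
          slo: availability
          burn_rate: "10x"
        annotations:
          summary: "Availability SLO burn rate 10x"
          description: |
            The error rate over the past 24 hours exceeds 10x the rate allowed by the SLO.
            Current error rate: {{ $value | printf "%.4f" }}
            Sustained for 24 hours, this consumes about 33% of the 30-day error budget.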

      # ----------------------------------------------------------------------
      # Latency SLO alerts
      # ----------------------------------------------------------------------

      - alert: SLOLatencyP99BurnRate
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
            / sum(rate(http_request_duration_seconds_count[1h]))
          ) < 0.99  # at least 99% of requests should complete within 500ms
        for: 5m
        labels:
          severity: warning
          slo: latency
        annotations:
          summary: "Latency SLO violated"
          description: |
            Over the past hour, fewer than 99% of requests completed within 500ms.
            Share of requests within 500ms: {{ $value | humanizePercentage }}

      # ----------------------------------------------------------------------
      # Error budget nearly exhausted
      # ----------------------------------------------------------------------

      - alert: SLOErrorBudgetNearlyExhausted
        expr: slo:error_budget:remaining < 0.1
        for: 5m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Error budget nearly exhausted"
          description: |
            Less than 10% of the 30-day error budget remains.
            Currently remaining: {{ $value | humanizePercentage }}
            Consider pausing new feature releases and focusing on reliability work.

      - alert: SLOErrorBudgetExhausted
        expr: slo:error_budget:remaining < 0
        for: 1m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Error budget exhausted"
          description: "The SLO has been violated; act immediately"

# ==============================================================================
# Multi-window alert configuration (reduces false positives)
# ==============================================================================
# Alert only when both the short-window and the long-window conditions hold

  - name: multi-window-alerts
    rules:
      - alert: SLOAvailabilityCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > 0.01
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Availability severely degraded"
          description: |
            Short-term (5-minute) error rate: {{ $value | humanizePercentage }} (threshold 1%).
            The long-term (1-hour) error rate is also above 0.1%.
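
The logic of the multi-window rule, and the share of the budget that a sustained burn rate consumes, can be sketched in Python. This is an illustration only: `MultiWindowRule` and both functions are hypothetical names, the thresholds mirror `SLOAvailabilityCritical` (10x on the 5-minute window, 1x on the 1-hour window), and the budget period is assumed to be 30 days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiWindowRule:
    """Burn-rate thresholds for a short and a long window (illustrative names)."""
    short_burn_rate: float
    long_burn_rate: float


def budget_fraction_consumed(burn_rate: float, window_hours: float,
                             period_hours: float = 720.0) -> float:
    """Share of the period's budget spent if burn_rate holds for window_hours."""
    return burn_rate * window_hours / period_hours


def should_alert(rule: MultiWindowRule, short_error_rate: float,
                 long_error_rate: float, slo_target: float = 0.999) -> bool:
    """Fire only when both windows exceed their burn-rate thresholds."""
    budget = 1.0 - slo_target
    return (short_error_rate > rule.short_burn_rate * budget
            and long_error_rate > rule.long_burn_rate * budget)


if __name__ == "__main__":
    rule = MultiWindowRule(short_burn_rate=10, long_burn_rate=1)
    # A brief spike: the 5m window is hot, the 1h window is still healthy
    print(should_alert(rule, short_error_rate=0.02, long_error_rate=0.0005))  # False
    # A sustained problem: both windows are burning
    print(should_alert(rule, short_error_rate=0.02, long_error_rate=0.003))   # True

    for burn_rate, hours in ((2, 1), (5, 6), (10, 24)):
        share = budget_fraction_consumed(burn_rate, hours)
        print(f"{burn_rate}x for {hours}h consumes {share:.2%} of a 30-day budget")
```

Requiring both windows trades a little detection delay for far fewer pages on short blips, since the long window only crosses its threshold when the problem persists.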

4.2 Alert Tiering Strategy

# alerting-tiering.yaml - alert tiering strategy

# ==============================================================================
# Alert priority matrix
# ==============================================================================
#
#                          │ Short (5m)        │ Medium (1h)       │ Long (1d)         │
#    ──────────────────────┼───────────────────┼───────────────────┼───────────────────┤
#    High burn rate (10x)  │ P0 immediate      │ P0 immediate      │ P1 within 1 hour  │
#    Medium burn rate (5x) │ P1 within 1 hour  │ P1 within 1 hour  │ P2 within 4 hours │
#    Low burn rate (2x)    │ P2 within 4 hours │ P2 within 4 hours │ P3 business hours │
#

alert_routing:
  # P0 - respond immediately (within 5 minutes)
  - match:
      severity: critical
      burn_rate: "10x"
    actions:
      - type: pagerduty
        priority: high
      - type: slack
        channel: "#incidents-critical"
      - type: sms
        to: oncall_team

  # P1 - respond within 1 hour
  - match:
      severity: critical
    actions:
      - type: pagerduty
        priority: medium
      - type: slack
        channel: "#incidents"

  # P2 - respond within 4 hours
  - match:
      severity: warning
      burn_rate: "5x"
    actions:
      - type: slack
        channel: "#sre-alerts"
      - type: email
        to: sre-team@example.com

  # P3 - handle during business hours
  - match:
      severity: warning
    actions:
      - type: slack
        channel: "#sre-alerts"
      - type: jira
        priority: low
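
The routing table reads naturally as a first-match list: the more specific P0 entry has to come before the broader P1 entry, or a critical 10x alert would never reach it. The Python sketch below illustrates that reading. First-match semantics are an assumption here, since the configuration above does not state how entries are evaluated, and `ROUTES` and `route` are hypothetical names.

```python
from typing import Dict, List, Optional

# Same order and match labels as the alert_routing entries above
ROUTES: List[Dict] = [
    {"priority": "P0", "match": {"severity": "critical", "burn_rate": "10x"}},
    {"priority": "P1", "match": {"severity": "critical"}},
    {"priority": "P2", "match": {"severity": "warning", "burn_rate": "5x"}},
    {"priority": "P3", "match": {"severity": "warning"}},
]


def route(labels: Dict[str, str], routes: List[Dict] = ROUTES) -> Optional[str]:
    """Return the priority of the first route whose labels all match."""
    for entry in routes:
        if all(labels.get(key) == value for key, value in entry["match"].items()):
            return entry["priority"]
    return None  # no route matched


if __name__ == "__main__":
    print(route({"severity": "critical", "burn_rate": "10x"}))  # P0
    print(route({"severity": "critical", "slo": "availability"}))  # P1
    print(route({"severity": "warning", "burn_rate": "5x"}))    # P2
    print(route({"severity": "warning", "burn_rate": "2x"}))    # P3
```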

5. SLO Dashboard Design

5.1 Grafana Dashboard JSON

{
  "dashboard": {
    "title": "SLO Dashboard",
    "uid": "slo-dashboard",
    "panels": [
      {
        "title": "Service Availability (30d)",
        "type": "gauge",
        "gridPos": {"x": 0, "y": 0, "w": 6, "h": 6},
        "targets": [
          {"expr": "slo:availability:30d", "legendFormat": "Availability"}
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0.99,
            "max": 1,
            "unit": "percentunit",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": 0.99},
                {"color": "yellow", "value": 0.999},
                {"color": "green", "value": 0.9999}
              ]
            }
          }
        }
      },
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "gridPos": {"x": 6, "y": 0, "w": 6, "h": 6},
        "targets": [
          {"expr": "slo:error_budget:remaining * 100", "legendFormat": "Budget %"}
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0,
            "max": 100,
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": 0},
                {"color": "yellow", "value": 10},
                {"color": "green", "value": 30}
              ]
            }
          }
        }
      },
      {
        "title": "Burn Rate",
        "type": "stat",
        "gridPos": {"x": 12, "y": 0, "w": 6, "h": 3},
        "targets": [
          {"expr": "slo:error_budget:burn_rate", "legendFormat": "Current"}
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 2},
                {"color": "red", "value": 5}
              ]
            }
          }
        }
      },
      {
        "title": "Downtime Budget Used (30d)",
        "type": "stat",
        "gridPos": {"x": 18, "y": 0, "w": 6, "h": 3},
        "targets": [
          {"expr": "(1 - slo:availability:30d) * 30 * 24 * 60", "legendFormat": "Minutes Used"}
        ]
      },
      {
        "title": "Availability Trend",
        "type": "graph",
        "gridPos": {"x": 0, "y": 6, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "slo:availability:1h",
            "legendFormat": "1h Window"
          },
          {
            "expr": "slo:availability:30d",
            "legendFormat": "30d Window"
          }
        ]
      },
      {
        "title": "Error Budget Burn Down",
        "type": "graph",
        "gridPos": {"x": 12, "y": 6, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "slo:error_budget:remaining * 100",
            "legendFormat": "Budget Remaining %"
          }
        ]
      },
      {
        "title": "Error Rate by Service",
        "type": "graph",
        "gridPos": {"x": 0, "y": 14, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Latency P99 by Service",
        "type": "graph",
        "gridPos": {"x": 12, "y": 14, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{ service }}"
          }
        ]
      }
    ]
  }
}

6. SLO Report Generation

6.1 Automating the Weekly Report

#!/usr/bin/env python3
# slo_report_generator.py - SLO weekly report generator

from datetime import datetime, timedelta
from typing import Dict, List
import requests

class SLOReportGenerator:
    """SLO weekly report generator."""

    def __init__(self, prometheus_url: str, services: List[str]):
        self.prometheus_url = prometheus_url
        self.services = services
        self.slo_targets = {
            'availability': 0.999,
            'latency_p99': 0.5,  # 500ms
            'error_rate': 0.001
        }

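    def query_prometheus(self, query: str) -> Dict:
        """Run an instant query against the Prometheus HTTP API."""
        response = requests.get(
            f"{self.prometheus_url}/api/v1/query",
            params={'query': query},
            timeout=10  # avoid hanging on an unresponsive server
        )
        response.raise_for_status()
        return response.json()

    def calculate_availability(self, service: str, days: int = 7) -> Dict:
        """Compute the availability of a service over the past `days` days."""
        query = f'''
            sum(rate(http_requests_total{{service="{service}",status!~"5.."}}[{days}d]))
            / sum(rate(http_requests_total{{service="{service}"}}[{days}d]))
        '''
        result = self.query_prometheus(query)
        samples = result['data']['result']
        if not samples:
            raise ValueError(f"no request data returned for service {service!r}")
        availability = float(samples[0]['value'][1])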

        return {
            'service': service,
            'availability': availability,
            'target': self.slo_targets['availability'],
            'met': availability >= self.slo_targets['availability'],
            'gap': availability - self.slo_targets['availability']
        }

    def calculate_error_budget(self, service: str, days: int = 7) -> Dict:
        """Compute error budget consumption over the past `days` days."""
        # Total number of requests
        total_query = f'sum(increase(http_requests_total{{service="{service}"}}[{days}d]))'
        total_samples = self.query_prometheus(total_query)['data']['result']
        total_requests = float(total_samples[0]['value'][1]) if total_samples else 0.0

        # Number of failed (5xx) requests; an empty result means no 5xx series exist
        error_query = f'sum(increase(http_requests_total{{service="{service}",status=~"5.."}}[{days}d]))'
        error_samples = self.query_prometheus(error_query)['data']['result']
        error_requests = float(error_samples[0]['value'][1]) if error_samples else 0.0

        # Budget consumption
        allowed_errors = total_requests * (1 - self.slo_targets['availability'])
        budget_consumed = error_requests / allowed_errors if allowed_errors > 0 else 0

        return {
            'service': service,
            'total_requests': int(total_requests),
            'error_requests': int(error_requests),
            'allowed_errors': int(allowed_errors),
            'budget_consumed_percent': budget_consumed * 100,
            'budget_remaining_percent': (1 - budget_consumed) * 100
        }

    def generate_weekly_report(self) -> Dict:
        """Generate the weekly report."""
        report = {
            'period': {
                'start': (datetime.now() - timedelta(days=7)).isoformat(),
                'end': datetime.now().isoformat()
            },
            'services': []
        }

        for service in self.services:
            service_report = {
                'name': service,
                'availability': self.calculate_availability(service),
                'error_budget': self.calculate_error_budget(service),
                'incidents': [],  # to be populated from the alerting system
                'recommendations': []
            }

            # Generate recommendations
            if service_report['availability']['gap'] < 0:
                service_report['recommendations'].append(
                    f"Availability is {abs(service_report['availability']['gap'])*100:.2f} percentage points below target; optimization recommended"
                )

            if service_report['error_budget']['budget_remaining_percent'] < 20:
                service_report['recommendations'].append(
                    "Less than 20% of the error budget remains; consider pausing new feature releases"
                )

            report['services'].append(service_report)

        return report

    def format_markdown_report(self, report: Dict) -> str:
        """Format the report as Markdown."""
        md = f"""# SLO Weekly Report

**Reporting period**: {report['period']['start']} to {report['period']['end']}

## Service Overview

| Service | Availability | Target | Status | Budget remaining |
|---------|--------------|--------|--------|------------------|
"""
        for service in report['services']:
            status = "✅" if service['availability']['met'] else "❌"
            md += f"| {service['name']} | "
            md += f"{service['availability']['availability']*100:.3f}% | "
            md += f"{service['availability']['target']*100:.3f}% | "
            md += f"{status} | "
            md += f"{service['error_budget']['budget_remaining_percent']:.1f}% |\n"

        md += "\n## Detailed Analysis\n\n"
        for service in report['services']:
            md += f"### {service['name']}\n\n"
            md += f"- Total requests: {service['error_budget']['total_requests']:,}\n"
            md += f"- Failed requests: {service['error_budget']['error_requests']:,}\n"
            md += f"- Budget consumed: {service['error_budget']['budget_consumed_percent']:.1f}%\n"

            if service['recommendations']:
                md += "\n**Recommendations**:\n"
                for rec in service['recommendations']:
                    md += f"- {rec}\n"
            md += "\n"

        return md


# Usage example
if __name__ == "__main__":
    generator = SLOReportGenerator(
        prometheus_url="http://prometheus:9090",
        services=["api-gateway", "order-service", "payment-service"]
    )

    report = generator.generate_weekly_report()
    print(generator.format_markdown_report(report))

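7. Best Practices

7.1 Principles for Setting SLOs

| Principle    | Description                                       |
|--------------|---------------------------------------------------|
| Start small  | Begin with the core metrics of your core services |
| Iterate      | Adjust targets based on historical data           |
| Prioritize   | Critical paths warrant stricter SLOs              |
| User-centric | Define SLIs from the user's perspective           |
| Achievable   | SLOs should be challenging but attainable         |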

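7.2 Error Budget Usage Policy

| Budget remaining | Recommended action                                        |
|------------------|-----------------------------------------------------------|
| > 50%            | Release cadence can be accelerated                        |
| 20-50%           | Release normally while keeping an eye on reliability      |
| 10-20%           | Pause new features and focus on stability                 |
| 0-10%            | Stop all releases and work on reliability across the board |
| < 0%             | SLO violated; emergency reliability fixes                 |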
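
The policy table can be encoded as a small lookup so that tooling such as a release gate or a weekly report applies it consistently. The function below is a sketch with a hypothetical name; how the boundary values (exactly 10%, 20%, or 50%) are assigned is an assumption, because the table does not say which tier they belong to.

```python
def release_policy(budget_remaining_percent: float) -> str:
    """Map the remaining error budget (in percent) to the recommended action."""
    if budget_remaining_percent < 0:
        return "SLO violated; emergency reliability fixes"
    if budget_remaining_percent < 10:
        return "Stop all releases and work on reliability across the board"
    if budget_remaining_percent < 20:
        return "Pause new features and focus on stability"
    if budget_remaining_percent <= 50:
        return "Release normally while keeping an eye on reliability"
    return "Release cadence can be accelerated"


if __name__ == "__main__":
    for remaining in (75.0, 35.0, 15.0, 5.0, -20.0):
        print(f"{remaining:6.1f}% -> {release_policy(remaining)}")
```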


Document version: 1.0 | Last updated: 2024-01-15 | Environment: Prometheus + Grafana
