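SRE Topic Guide: SLO/SLI Service Level Objective Practices

This guide targets production environments and covers core SRE practices: defining SLOs, SLIs, and SLAs, managing error budgets, designing alerting strategy, and integrating with observability tooling.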

1. SLO/SLI/SLA Concepts and Framework

1.1 Core Concepts

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SLO/SLI/SLA Relationship                             │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SLA (Service Level Agreement)                        │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                     SLO (Service Level Objective)                     │  │
│  │                                                                       │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │                  SLI (Service Level Indicator)                  │  │  │
│  │  │                                                                 │  │  │
│  │  │  • Availability                                                 │  │  │
│  │  │  • Latency                                                      │  │  │
│  │  │  • Throughput                                                   │  │  │
│  │  │  • Error Rate                                                   │  │  │
│  │  │  • Saturation                                                   │  │  │
│  │  │                                                                 │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │                                                                       │  │
│  │  Targets: 99.9% availability, P99 latency < 500ms                     │  │
│  │                                                                       │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  Breach consequences: service credits, compensation clauses                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│  Error Budget                                                               │
│                                                                             │
│  99.9% availability allows this much downtime per 30-day month:             │
│  30 days × 24 hours × 60 minutes × (1 - 0.999) = 43.2 minutes               │
│                                                                             │
│  Error budget = 1 - SLO = 0.1% of the window may be spent on failures       │
└─────────────────────────────────────────────────────────────────────────────┘

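1.2 SLO Definition Framework

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SLO Definition Process                               │
└─────────────────────────────────────────────────────────────────────────────┘

Step 1: Identify the critical service path
        ┌───────────┐     ┌───────────┐     ┌───────────┐
        │   User    │────▶│    API    │────▶│ Database  │
        │  request  │     │  gateway  │     │           │
        └───────────┘     └───────────┘     └───────────┘
              │
              ▼
Step 2: Define the SLIs
        • Request success rate = successful requests / total requests
        • Latency distribution = P50, P95, P99
        • Throughput = requests per second
        • Error rate = 5xx responses / total responses
              │
              ▼
Step 3: Set the SLO targets
        • Availability SLO: 99.9%
        • Latency SLO: P99 < 500ms
        • Error rate SLO: < 0.1%
              │
              ▼
Step 4: Calculate the error budget
        • Monthly error budget = 43.2 minutes (30-day window)
        • Daily error budget = 1.44 minutes
        • Request-based budget = total requests × (1 - SLO)
              │
              ▼
Step 5: Establish the alerting strategy
        • Error budget burn-rate alerts
        • Budget exhaustion alerts
        • Multi-window alerting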
2. SLI Design

2.1 Common SLI Definitions

# sli-definitions.yaml - SLI definitions

# ==============================================================================
# Availability SLI
# ==============================================================================
availability_sli:
  name: "Request availability"
  description: "Proportion of requests handled successfully"
  formula: "successful requests / total requests"
  measurement:
    source: "prometheus"
    query: |
      sum(rate(http_requests_total{status!~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
  exclusion:
    - "Health check requests"
    - "Staging environment requests"
    - "Internal monitoring requests"
  slo_target: 0.999  # 99.9%

# ==============================================================================
# Latency SLI
# ==============================================================================
# Buckets are summed by `le` so that each quantile is computed across all
# instances instead of once per series.
latency_sli:
  name: "Request latency"
  description: "Distribution of request response times"
  measurement:
    - name: "P50 latency"
      query: 'histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
      slo_target: 0.1  # 100ms

    - name: "P95 latency"
      query: 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
      slo_target: 0.3  # 300ms

    - name: "P99 latency"
      query: 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
      slo_target: 0.5  # 500ms

  exclusion:
    - "Large file upload requests"
    - "Bulk export requests"
# ==============================================================================
# Error Rate SLI
# ==============================================================================
error_rate_sli:
  name: "Error rate"
  description: "Proportion of failed requests"
  formula: "failed requests / total requests"
  measurement:
    source: "prometheus"
    query: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
  slo_target: 0.001  # 0.1%

# ==============================================================================
# Throughput SLI
# ==============================================================================
throughput_sli:
  name: "Throughput"
  description: "Rate at which the system handles requests"
  measurement:
    source: "prometheus"
    query: "sum(rate(http_requests_total[5m]))"
  slo_target: 10000  # 10000 req/s
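# ==============================================================================
# Saturation SLI
# ==============================================================================
saturation_sli:
  - name: "CPU utilization"
    # CPU seconds used per second, divided by the CPU limit (quota / period),
    # so the result is a ratio comparable to the 0.8 target.
    query: 'sum(rate(container_cpu_usage_seconds_total[5m])) / sum(container_spec_cpu_quota / container_spec_cpu_period)'
    slo_target: 0.8  # 80%

  - name: "Memory utilization"
    query: 'container_memory_usage_bytes / container_spec_memory_limit_bytes'
    slo_target: 0.85  # 85%

  - name: "Disk utilization"
    # node_exporter exposes available and total bytes, not a "usage" metric.
    query: '1 - node_filesystem_avail_bytes / node_filesystem_size_bytes'
    slo_target: 0.85  # 85%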
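
To make the latency queries above less opaque, the following sketch estimates a quantile from cumulative histogram buckets using linear interpolation inside the bucket that contains the target rank. It is a simplified approximation of how `histogram_quantile` treats classic histograms, not Prometheus code; the function name `estimate_quantile` and the bucket counts are assumptions for illustration.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket,
    whose cumulative count is the total number of observations.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical request durations in seconds: (le, cumulative count).
sample_buckets = [(0.1, 500), (0.3, 900), (0.5, 980), (1.0, 998), (math.inf, 1000)]
p50 = estimate_quantile(0.50, sample_buckets)
p99 = estimate_quantile(0.99, sample_buckets)
print(f"P50 ~ {p50:.3f}s, P99 ~ {p99:.3f}s")
```

In this sample only 98% of requests finish within 0.5s, so the estimated P99 lands inside the 0.5s to 1.0s bucket and the 500ms target would be missed. The estimate is only as precise as the bucket boundaries, which is why a bucket edge placed exactly at the SLO threshold is useful.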

2.2 SLI Templates by Service Type

# ==============================================================================
# Web API service SLIs
# ==============================================================================
web_api_sli:
  critical_path:
    - endpoint: "/api/v1/users"
      availability_slo: 0.999
      latency_p99_slo: 0.2

    - endpoint: "/api/v1/orders"
      availability_slo: 0.9999  # orders are more critical
      latency_p99_slo: 0.5

  non_critical_path:
    - endpoint: "/api/v1/analytics"
      availability_slo: 0.99
      latency_p99_slo: 2.0  # analytics can tolerate higher latency

# ==============================================================================
# Database service SLIs
# ==============================================================================
database_sli:
  - name: "Query latency"
    query_type: "select"
    p99_slo: 0.05  # 50ms

  - name: "Write latency"
    query_type: "insert"
    p99_slo: 0.1   # 100ms

  - name: "Connection pool availability"
    slo: 0.999

# ==============================================================================
# Message queue service SLIs
# ==============================================================================
message_queue_sli:
  - name: "Message delivery latency"
    description: "Time from message production to consumption"
    p99_slo: 5.0  # 5 seconds

  - name: "Message delivery success rate"
    slo: 0.9999

  - name: "Consumer backlog"
    threshold: 100000  # maximum backlog (messages)

# ==============================================================================
# Cache service SLIs
# ==============================================================================
cache_sli:
  - name: "Cache hit ratio"
    slo: 0.9  # 90% hit ratio

  - name: "Cache latency"
    p99_slo: 0.01  # 10ms
3. Error Budget Management

3.1 Error Budget Calculation

#!/usr/bin/env python3
# error_budget_calculator.py - error budget calculator

from datetime import timedelta
from typing import Dict

class ErrorBudgetCalculator:
    """Error budget calculator."""

    # Supported budget windows
    TIME_WINDOWS = {
        'daily': timedelta(days=1),
        'weekly': timedelta(weeks=1),
        'monthly': timedelta(days=30),
        'quarterly': timedelta(days=90),
        'yearly': timedelta(days=365)
    }

    def __init__(self, slo_target: float):
        """
        Initialize the error budget calculator.

        Args:
            slo_target: SLO target strictly between 0.0 and 1.0, e.g. 0.999 for 99.9%
        """
        if not 0 < slo_target < 1:
            raise ValueError("slo_target must be strictly between 0 and 1")
        self.slo_target = slo_target
        self.error_budget = 1 - slo_target

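    def calculate_downtime_budget(self, window: str = 'monthly') -> Dict[str, object]:
        """
        Calculate the downtime budget for a time window.

        Args:
            window: One of the keys in TIME_WINDOWS

        Returns:
            Dictionary with the budget expressed in several time units
        """
        if window not in self.TIME_WINDOWS:
            # Fail loudly rather than silently reporting monthly numbers
            # under an unknown window name.
            raise ValueError(f"unknown window: {window!r}")
        duration = self.TIME_WINDOWS[window]
        total_seconds = duration.total_seconds()
        budget_seconds = total_seconds * self.error_budget

        return {
            'window': window,
            'total_seconds': total_seconds,
            'budget_seconds': budget_seconds,
            'budget_minutes': budget_seconds / 60,
            'budget_hours': budget_seconds / 3600,
            'slo_target': self.slo_target,
            'slo_percentage': f"{self.slo_target * 100:.3f}%"
        }

    def calculate_request_budget(self, total_requests: int) -> Dict[str, float]:
        """
        Calculate the error budget in terms of request count.

        Args:
            total_requests: Total number of requests

        Returns:
            Error budget details
        """
        # Round before truncating: 1 - 0.9999 is slightly below 0.0001 in
        # floating point, so a bare int() would return 99 instead of 100.
        allowed_failures = int(round(total_requests * self.error_budget, 6))

        return {
            'total_requests': total_requests,
            'allowed_failures': allowed_failures,
            'slo_target': self.slo_target
        }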

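    def calculate_burn_rate(self,
                           consumed_budget: float,
                           time_elapsed_hours: float,
                           window_hours: float = 720) -> float:
        """
        Calculate the error budget burn rate.

        Args:
            consumed_budget: Fraction of the budget already consumed (0.0 - 1.0)
            time_elapsed_hours: Hours elapsed since the window started
            window_hours: Budget window in hours, 30 days by default

        Returns:
            Burn rate as a multiple of the sustainable rate; 1.0 means the
            budget is exhausted exactly at the end of the window
        """
        if time_elapsed_hours <= 0:
            raise ValueError("time_elapsed_hours must be positive")

        expected_burn_rate = 1.0 / window_hours                  # budget fraction per hour at 1x
        actual_burn_rate = consumed_budget / time_elapsed_hours  # observed budget fraction per hour

        return actual_burn_rate / expected_burn_rate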

    def calculate_slo_targets(self) -> Dict[str, Dict]:
        """
        Calculate the allowed downtime for common availability targets.

        Returns:
            Allowed downtime per availability level (minutes for daily,
            weekly, and monthly; hours for yearly)
        """
        targets = [0.99, 0.999, 0.9999, 0.99999]
        results = {}

        for target in targets:
            calculator = ErrorBudgetCalculator(target)
            # ':g' keeps keys tidy ("99.999%") despite floating-point noise.
            results[f"{target * 100:g}%"] = {
                'daily': calculator.calculate_downtime_budget('daily')['budget_minutes'],
                'weekly': calculator.calculate_downtime_budget('weekly')['budget_minutes'],
                'monthly': calculator.calculate_downtime_budget('monthly')['budget_minutes'],
                'yearly': calculator.calculate_downtime_budget('yearly')['budget_hours']
            }

        return results


# Usage example
if __name__ == "__main__":
    # 99.9% SLO
    calculator = ErrorBudgetCalculator(0.999)

    # Monthly downtime budget
    monthly_budget = calculator.calculate_downtime_budget('monthly')
    print(f"Monthly downtime budget: {monthly_budget['budget_minutes']:.2f} minutes")

    # Request-based budget
    request_budget = calculator.calculate_request_budget(1000000)
    print(f"Allowed failures per 1,000,000 requests: {request_budget['allowed_failures']}")

    # Downtime reference table for common SLO levels
    print("\nSLO level reference table:")
    import json
    print(json.dumps(calculator.calculate_slo_targets(), indent=2))
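
A burn rate is easier to act on once it is converted into time remaining. The sketch below is an illustration under stated assumptions: the helper name `hours_until_exhaustion` is hypothetical, and the burn rate is assumed to stay constant from now on.

```python
def hours_until_exhaustion(budget_remaining: float,
                           burn_rate: float,
                           window_hours: float = 720) -> float:
    """Hours until the error budget reaches zero at a constant burn rate.

    budget_remaining: fraction of the budget still available (0.0 - 1.0)
    burn_rate: multiple of the sustainable rate (1.0 lasts the whole window)
    window_hours: budget window in hours, 30 days by default
    """
    if burn_rate <= 0:
        return float("inf")  # nothing is being burned
    return budget_remaining * window_hours / burn_rate


# Full budget at 1x lasts the whole 30-day window.
print(hours_until_exhaustion(1.0, 1.0))
# Half the budget left at 2x: 0.5 * 720 / 2 hours remain.
print(hours_until_exhaustion(0.5, 2.0))
```

Putting this figure next to the raw burn rate on a dashboard or in an alert description tells the on-call engineer how urgent the situation is without mental arithmetic.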

3.2 Error Budget Burn Monitoring

# error_budget_monitoring.yaml

# ==============================================================================
# Prometheus rules - error budget calculation
# ==============================================================================
groups:
  - name: error_budget_rules
    interval: 1m
    rules:
      # 30-day availability
      - record: slo:availability:30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d]))
          / sum(rate(http_requests_total[30d]))

      # Remaining error budget (1 = untouched, 0 = exhausted, negative = SLO breached)
      - record: slo:error_budget:remaining
        expr: |
          (slo:availability:30d - 0.999) / (1 - 0.999)

      # Burn rate over the past hour, as a multiple of the sustainable rate
      # (1 = the budget lasts exactly 30 days)
      - record: slo:error_budget:burn_rate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          )
          / (1 - 0.999)

      # Hourly availability
      - record: slo:availability:1h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[1h]))
          / sum(rate(http_requests_total[1h]))

      # 5-minute availability
      - record: slo:availability:5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))

# ==============================================================================
# Grafana dashboard variables
# ==============================================================================
# Illustrative only. Keep this outside the Prometheus rule file, which
# accepts only the `groups` key at the top level.
dashboard_variables:
  - name: "SLO Target"
    query: "0.99,0.999,0.9999"

  - name: "Time Window"
    query: "1h,6h,24h,7d,30d"

4. SLO Alerting Strategy

4.1 Multi-Window, Multi-Burn-Rate Alerts

# slo-alerting.yaml - SLO-based alerting

# ==============================================================================
# Multi-window, multi-burn-rate alerts (recommended)
# ==============================================================================
# Idea: evaluate the error rate over both short and long windows so that
# brief spikes do not page anyone.
# Reference: Google SRE Workbook, "Alerting on SLOs"

groups:
  - name: slo-alerts
    rules:
      # ----------------------------------------------------------------------
      # Availability SLO alerts
      # ----------------------------------------------------------------------
      # 1h window, 2x burn rate (2 × 1h / 720h ≈ 0.28% of the 30-day budget)
      - alert: SLOErrorBudgetBurnRate2x
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 2
        for: 5m
        labels:
          severity: warning
          slo: availability
          burn_rate: "2x"
        annotations:
          summary: "Availability SLO burn rate above 2x"
          description: |
            The error rate over the past hour exceeds 2x the rate the SLO allows.
            Current error rate: {{ $value | printf "%.4f" }}
            An hour at this rate burns about 0.28% of the 30-day error budget.

      # 6h window, 5x burn rate (5 × 6h / 720h ≈ 4.2% of the 30-day budget)
      - alert: SLOErrorBudgetBurnRate5x
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (1 - 0.999) * 5
        for: 30m
        labels:
          severity: warning
          slo: availability
          burn_rate: "5x"
        annotations:
          summary: "Availability SLO burn rate above 5x"
          description: |
            The error rate over the past 6 hours exceeds 5x the rate the SLO allows.
            Current error rate: {{ $value | printf "%.4f" }}
            Six hours at this rate burns about 4.2% of the 30-day error budget.

      # 1d window, 10x burn rate (10 × 24h / 720h ≈ 33% of the 30-day budget)
      - alert: SLOErrorBudgetBurnRate10x
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1d]))
            / sum(rate(http_requests_total[1d]))
          ) > (1 - 0.999) * 10
        for: 2h
        labels:
          severity: critical
          slo: availability
          burn_rate: "10x"
        annotations:
          summary: "Availability SLO burn rate above 10x"
          description: |
            The error rate over the past 24 hours exceeds 10x the rate the SLO allows.
            Current error rate: {{ $value | printf "%.4f" }}
            A day at this rate burns about 33% of the 30-day error budget.

      # ----------------------------------------------------------------------
      # Latency SLO alerts
      # ----------------------------------------------------------------------

      - alert: SLOLatencyP99BurnRate
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
            / sum(rate(http_request_duration_seconds_count[1h]))
          ) < 0.99  # at least 99% of requests should complete within 500ms
        for: 5m
        labels:
          severity: warning
          slo: latency
        annotations:
          summary: "Latency SLO violated"
          description: |
            Over the past hour, fewer than 99% of requests completed within 500ms.
            Share of requests within 500ms: {{ $value | humanizePercentage }}

      # ----------------------------------------------------------------------
      # Error budget nearly exhausted or exhausted
      # ----------------------------------------------------------------------

      - alert: SLOErrorBudgetLow
        expr: slo:error_budget:remaining < 0.1
        for: 5m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Error budget nearly exhausted"
          description: |
            Less than 10% of the 30-day error budget remains.
            Remaining: {{ $value | humanizePercentage }}
            Consider pausing feature releases and focusing on reliability work.

      - alert: SLOErrorBudgetExhausted
        expr: slo:error_budget:remaining < 0
        for: 1m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Error budget exhausted"
          description: "The SLO has been breached and needs immediate attention."

# ==============================================================================
# Multi-window alert configuration (reduces false positives)
# ==============================================================================
# Fires only when both the short-window and long-window conditions hold.

  - name: multi-window-alerts
    rules:
      - alert: SLOAvailabilityCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > 0.01
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Availability severely degraded"
          # $value carries only the left-hand side of `and`, i.e. the 5m error rate.
          description: |
            Short-term (5m) error rate: {{ $value | humanizePercentage }}
            The long-term (1h) error rate is also above 0.1%.
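
Two pieces of arithmetic sit behind these rules: how much of the budget a given burn rate consumes over a window, and the short-and-long window condition. A minimal sketch follows; the function names are hypothetical, a 30-day (720-hour) budget period is assumed, and the default thresholds mirror the `SLOAvailabilityCritical` expression.

```python
def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = 720) -> float:
    """Fraction of the error budget burned by `burn_rate` held for `window_hours`."""
    return burn_rate * window_hours / period_hours


def multi_window_fires(short_ratio: float, long_ratio: float,
                       short_threshold: float = 0.01,
                       long_threshold: float = 0.001) -> bool:
    """True only when both the short and the long window exceed their thresholds."""
    return short_ratio > short_threshold and long_ratio > long_threshold


# Budget burned by a few burn-rate and window combinations.
for rate, hours in [(2, 1), (5, 6), (10, 24), (14.4, 1)]:
    print(f"{rate}x for {hours}h burns {budget_consumed(rate, hours):.2%} of the budget")

# A brief spike that the 1h window has not confirmed does not fire.
print(multi_window_fires(short_ratio=0.05, long_ratio=0.0005))
# A sustained problem visible in both windows does.
print(multi_window_fires(short_ratio=0.05, long_ratio=0.004))
```

The 14.4x row is the pairing the Google SRE Workbook uses for its fastest page: one hour at that rate consumes 2% of a 30-day budget.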

4.2 Alert Tiering

# alerting-tiering.yaml - alert tiering policy

# ==============================================================================
# Alert priority matrix
# ==============================================================================
#
#                         │ Short (5m)    │ Medium (1h)   │ Long (1d)         │
#    ─────────────────────┼───────────────┼───────────────┼───────────────────┤
#    High burn (10x)      │ P0 immediate  │ P0 immediate  │ P1 within 1h      │
#    Medium burn (5x)     │ P1 within 1h  │ P1 within 1h  │ P2 within 4h      │
#    Low burn (2x)        │ P2 within 4h  │ P2 within 4h  │ P3 business hours │
#
# The routing schema below is illustrative, not Alertmanager syntax. Routes are
# assumed to be evaluated top to bottom, with the first match winning.

alert_routing:
  # P0 - respond immediately (within 5 minutes)
  - match:
      severity: critical
      burn_rate: "10x"
    actions:
      - type: pagerduty
        priority: high
      - type: slack
        channel: "#incidents-critical"
      - type: sms
        to: oncall_team

  # P1 - respond within 1 hour
  - match:
      severity: critical
    actions:
      - type: pagerduty
        priority: medium
      - type: slack
        channel: "#incidents"

  # P2 - respond within 4 hours
  - match:
      severity: warning
      burn_rate: "5x"
    actions:
      - type: slack
        channel: "#sre-alerts"
      - type: email
        to: sre-team@example.com

  # P3 - handle during business hours
  - match:
      severity: warning
    actions:
      - type: slack
        channel: "#sre-alerts"
      - type: jira
        priority: low
5. SLO Dashboard Design

5.1 Grafana Dashboard JSON

{
  "dashboard": {
    "title": "SLO Dashboard",
    "uid": "slo-dashboard",
    "panels": [
      {
        "title": "Service Availability (30d)",
        "type": "gauge",
        "gridPos": {"x": 0, "y": 0, "w": 6, "h": 6},
        "targets": [
          {"expr": "slo:availability:30d", "legendFormat": "Availability"}
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0.99,
            "max": 1,
            "unit": "percentunit",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": 0.99},
                {"color": "yellow", "value": 0.999},
                {"color": "green", "value": 0.9999}
              ]
            }
          }
        }
      },
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "gridPos": {"x": 6, "y": 0, "w": 6, "h": 6},
        "targets": [
          {"expr": "slo:error_budget:remaining * 100", "legendFormat": "Budget %"}
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0,
            "max": 100,
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "red", "value": 0},
                {"color": "yellow", "value": 10},
                {"color": "green", "value": 30}
              ]
            }
          }
        }
      },
      {
        "title": "Burn Rate",
        "type": "stat",
        "gridPos": {"x": 12, "y": 0, "w": 6, "h": 3},
        "targets": [
          {"expr": "slo:error_budget:burn_rate", "legendFormat": "Current"}
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 2},
                {"color": "red", "value": 5}
              ]
            }
          }
        }
      },
      {
        "title": "Downtime Used (30d, minutes)",
        "type": "stat",
        "gridPos": {"x": 18, "y": 0, "w": 6, "h": 3},
        "targets": [
          {"expr": "(1 - slo:availability:30d) * 30 * 24 * 60", "legendFormat": "Minutes Used"}
        ]
      },
      {
        "title": "Availability Trend",
        "type": "graph",
        "gridPos": {"x": 0, "y": 6, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "slo:availability:1h",
            "legendFormat": "1h Window"
          },
          {
            "expr": "slo:availability:30d",
            "legendFormat": "30d Window"
          }
        ]
      },
      {
        "title": "Error Budget Burn Down",
        "type": "graph",
        "gridPos": {"x": 12, "y": 6, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "slo:error_budget:remaining * 100",
            "legendFormat": "Budget Remaining %"
          }
        ]
      },
      {
        "title": "Error Rate by Service",
        "type": "graph",
        "gridPos": {"x": 0, "y": 14, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ]
      },
      {
        "title": "Latency P99 by Service",
        "type": "graph",
        "gridPos": {"x": 12, "y": 14, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{ service }}"
          }
        ]
      }
    ]
  }
}

6. SLO Report Generation

6.1 Weekly Report Automation

#!/usr/bin/env python3
# slo_report_generator.py - SLO weekly report generator

import json
from datetime import datetime, timedelta
from typing import Dict, List
import requests

class SLOReportGenerator:
    """Weekly SLO report generator."""

    def __init__(self, prometheus_url: str, services: List[str]):
        self.prometheus_url = prometheus_url
        self.services = services
        self.slo_targets = {
            'availability': 0.999,
            'latency_p99': 0.5,  # 500ms
            'error_rate': 0.001
        }

    def query_prometheus(self, query: str) -> Dict:
        """Run an instant query against the Prometheus HTTP API."""
        response = requests.get(
            f"{self.prometheus_url}/api/v1/query",
            params={'query': query},
            timeout=30,  # do not let an unresponsive server hang the report job
        )
        response.raise_for_status()
        return response.json()

    def calculate_availability(self, service: str, days: int = 7) -> Dict:
        """Calculate service availability over the past `days` days."""
        query = f'''
            sum(rate(http_requests_total{{service="{service}",status!~"5.."}}[{days}d]))
            / sum(rate(http_requests_total{{service="{service}"}}[{days}d]))
        '''
        result = self.query_prometheus(query)
        series = result['data']['result']
        if not series:
            # An empty result means the service reported no traffic at all.
            raise ValueError(f"no request data for service {service!r}")
        availability = float(series[0]['value'][1])

        return {
            'service': service,
            'availability': availability,
            'target': self.slo_targets['availability'],
            'met': availability >= self.slo_targets['availability'],
            'gap': availability - self.slo_targets['availability']
        }

    def calculate_error_budget(self, service: str, days: int = 7) -> Dict:
        """Calculate error budget consumption over the past `days` days."""
        # Total requests
        total_query = f'sum(increase(http_requests_total{{service="{service}"}}[{days}d]))'
        total_series = self.query_prometheus(total_query)['data']['result']
        if not total_series:
            raise ValueError(f"no request data for service {service!r}")
        total_requests = float(total_series[0]['value'][1])

        # Failed requests; no 5xx series at all means zero errors
        error_query = f'sum(increase(http_requests_total{{service="{service}",status=~"5.."}}[{days}d]))'
        error_series = self.query_prometheus(error_query)['data']['result']
        error_requests = float(error_series[0]['value'][1]) if error_series else 0.0

        # Budget consumption
        allowed_errors = total_requests * (1 - self.slo_targets['availability'])
        budget_consumed = error_requests / allowed_errors if allowed_errors > 0 else 0

        return {
            'service': service,
            'total_requests': int(total_requests),
            'error_requests': int(error_requests),
            'allowed_errors': int(allowed_errors),
            'budget_consumed_percent': budget_consumed * 100,
            'budget_remaining_percent': (1 - budget_consumed) * 100
        }

    def generate_weekly_report(self) -> Dict:
        """Generate the weekly report."""
        now = datetime.now()
        report = {
            'period': {
                'start': (now - timedelta(days=7)).isoformat(),
                'end': now.isoformat()
            },
            'services': []
        }

        for service in self.services:
            service_report = {
                'name': service,
                'availability': self.calculate_availability(service),
                'error_budget': self.calculate_error_budget(service),
                'incidents': [],  # to be filled from the alerting system
                'recommendations': []
            }

            # Build recommendations
            if service_report['availability']['gap'] < 0:
                service_report['recommendations'].append(
                    f"Availability is {abs(service_report['availability']['gap'])*100:.3f} "
                    "percentage points below target; investigate and improve"
                )

            if service_report['error_budget']['budget_remaining_percent'] < 20:
                service_report['recommendations'].append(
                    "Less than 20% of the error budget remains; consider pausing feature releases"
                )

            report['services'].append(service_report)

        return report

    def format_markdown_report(self, report: Dict) -> str:
        """Format the report as Markdown."""
        md = f"""# SLO Weekly Report

**Reporting period**: {report['period']['start']} to {report['period']['end']}

## Service Overview

| Service | Availability | Target | Status | Budget Remaining |
|---------|--------------|--------|--------|------------------|
"""
        for service in report['services']:
            status = "✅" if service['availability']['met'] else "❌"
            md += f"| {service['name']} | "
            md += f"{service['availability']['availability']*100:.3f}% | "
            md += f"{service['availability']['target']*100:.3f}% | "
            md += f"{status} | "
            md += f"{service['error_budget']['budget_remaining_percent']:.1f}% |\n"

        md += "\n## Detailed Analysis\n\n"
        for service in report['services']:
            md += f"### {service['name']}\n\n"
            md += f"- Total requests: {service['error_budget']['total_requests']:,}\n"
            md += f"- Failed requests: {service['error_budget']['error_requests']:,}\n"
            md += f"- Budget consumed: {service['error_budget']['budget_consumed_percent']:.1f}%\n"

            if service['recommendations']:
                md += "\n**Recommendations**:\n"
                for rec in service['recommendations']:
                    md += f"- {rec}\n"
            md += "\n"

        return md


# Usage example
if __name__ == "__main__":
    generator = SLOReportGenerator(
        prometheus_url="http://prometheus:9090",
        services=["api-gateway", "order-service", "payment-service"]
    )

    report = generator.generate_weekly_report()
    print(generator.format_markdown_report(report))
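
A report job like this one is fragile wherever it indexes directly into a Prometheus response, because a service with no matching series returns an empty result list. The sketch below shows one defensive way to read an instant-query response. The helper name `first_sample_value` is hypothetical, and the sample payloads are hand-written to follow the documented response shape of `/api/v1/query`.

```python
from typing import Optional


def first_sample_value(response: dict, default: Optional[float] = None) -> Optional[float]:
    """Return the first sample of an instant-query response, or `default` if it is empty."""
    if response.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {response.get('error', 'unknown error')}")
    series = response.get("data", {}).get("result", [])
    if not series:
        return default
    # Each sample is a [timestamp, "value"] pair; the value arrives as a string.
    return float(series[0]["value"][1])


# Hand-written sample payloads for illustration.
ok = {
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {}, "value": [1705312800, "0.9995"]}]},
}
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}

print(first_sample_value(ok))
print(first_sample_value(empty, default=0.0))
```

Passing `default=0.0` suits counters such as 5xx errors, where an absent series means nothing went wrong. For total traffic, leaving the default as `None` forces the caller to decide how to report a service with no data.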

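7. Best Practices

7.1 Principles for Setting SLOs

Principle         Description
Start small       Begin with the core metrics of core services
Iterate           Adjust targets based on historical data
Prioritize        Critical paths get stricter SLOs
User-centric      Define SLIs from the user's point of view
Achievable        SLOs should be challenging but attainable

7.2 Error Budget Policy

Budget remaining  Recommended action
> 50%             Release cadence can be accelerated
20-50%            Release normally, keep an eye on reliability
10-20%            Pause new features, focus on stability
0-10%             Stop all releases, work on reliability across the board
< 0%              SLO breached, emergency reliability fixes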
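
The policy table translates directly into a lookup that a release pipeline or the weekly report could call. This is a sketch under stated assumptions: the function name `release_policy` and the short action labels are hypothetical, and boundary values (exactly 10%, 20%, or 50%) are assigned to the band where they appear as the lower bound, or to the 20-50% band in the case of 50%, since the table leaves them open.

```python
def release_policy(budget_remaining_percent: float) -> str:
    """Map the remaining error budget (in percent) to a release action."""
    if budget_remaining_percent < 0:
        return "emergency-fix"      # SLO breached
    if budget_remaining_percent < 10:
        return "freeze-releases"    # stop all releases
    if budget_remaining_percent < 20:
        return "pause-features"     # stability work only
    if budget_remaining_percent <= 50:
        return "release-normally"   # keep an eye on reliability
    return "release-faster"         # plenty of budget left


for remaining in (75, 35, 15, 5, -3):
    print(f"{remaining:>4}% remaining -> {release_policy(remaining)}")
```

Encoding the policy once keeps dashboards, reports, and deployment gates from drifting apart on where the thresholds sit.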

8. References

Google SRE Book, Service Level Objectives: https://sre.google/sre-book/service-level-objectives/
Google SRE Workbook, Implementing SLOs: https://sre.google/workbook/implementing-slos/
OpenSLO: https://github.com/OpenSLO/OpenSLO
SLO Generator: https://github.com/google/slo-generator

Document version: 1.0. Last updated: 2024-01-15. Target environment: Prometheus + Grafana.
