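SRE in Production: Defining and Practicing SLOs, SLIs, and SLAs

Document version: v1.0
Generated: 2026-04-07
Audience: engineers with some operations experience, SRE practitioners, platform engineers
Tags: SRE, SLO, SLI, SLA, reliability, availability, SLO engineering practice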

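1. Overview and Background

1.1 What is SRE, and why start from SLO/SLI/SLA?

SRE (Site Reliability Engineering) is an approach Google introduced in 2003. It combines software engineering thinking with operations practice so that reliability targets for large-scale systems become measurable and achievable. The core goal of SRE in one sentence: find the best balance between reliability and development velocity.

Within the SRE methodology, SLO/SLI/SLA are the foundation of reliability engineering. They answer three fundamental questions:

Concept	Question	Essence
SLI (Service Level Indicator)	How do we measure reliability?	A quantifiable metric
SLO (Service Level Objective)	What reliability target are we aiming for?	A target threshold
SLA (Service Level Agreement)	What reliability do we promise externally?	A contract or commitment

The three layers form a funnel: internal measurement → internal target → external commitment. The SLO sits at the center and is the most important concept in SRE practice.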

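1.2 Why SLO practice is necessary

Teams without an SLO framework often run into the following problems:

Whether the service is "healthy" or "unhealthy" is a human judgment call with no objective standard
Alert flooding: every metric has an alert, and the real failures get buried
Team disputes: developers say the service is fine, operations says it is not, and there is no way to align
Vague external commitments: the business quotes 99.99% availability, but engineers do not know how it is calculated
Releases without a baseline: teams either avoid releasing or release too aggressively, with no way to quantify the risk

An SLO framework addresses each of these problems systematically.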

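1.3 Core principles behind SLOs

SLO practice rests on the following core ideas (from the Google SRE Book):

Do not target 100% reliability: it is unattainable, and the return on cost is extremely poor
Focus on what users actually experience: choose indicators that represent user perception rather than infrastructure metrics
The SLO determines the Error Budget: the tolerance left over beyond the SLO is the Error Budget, the "fuel" for releases
Let the data speak: base every decision on actual SLI data rather than intuition
Iterate continuously: an SLO is not set once and forgotten; adjust it as the business and user feedback evolve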

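2. Core Principles

2.1 SLI: defining service level indicators

An SLI is a precise, quantitative measurement of service reliability. Ideally, an SLI directly reflects the performance and availability that users experience.

2.1.1 Four categories of SLI

Type	Description	Common measurement
Availability	Proportion of time or requests for which the service is usable	Successful requests / total requests, or uptime probes
Latency	Whether responses are fast enough	P50/P90/P95/P99 latency distribution
Throughput	Processing capacity of the service	QPS, RPS, transactions per second
Correctness	Whether returned results are correct	Error rate, data consistency checks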

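2.1.2 Measurement boundaries for an SLI

An SLI measurement must state the following boundaries explicitly:

Request window:
  User sends request ──► Service processes ──► Response returned
       │                                        │
       └──── measurement start ─────────────────┴──── measurement end

Key questions:

Is the scope end to end (user → service → database) or component level (the service alone)?
What counts as a "valid request"? How are retries, CDN cache hits, OPTIONS requests, and similar traffic handled?
Where is the measurement point: server side, client side, or CDN edge?

Recommended practice: measure primarily end to end from the user's point of view, and add component-level metrics for root cause analysis.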
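As a rough illustration of the measurement questions above, the sketch below computes an availability SLI and a latency SLI from a batch of request records. The `Request` record, the choice to count only 5xx responses as failures, and the exclusion of OPTIONS requests are assumptions made for this example; a real service has to make its own decisions on each of them.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record captured at the measurement point."""
    method: str
    status: int
    latency_ms: float


def valid_requests(requests):
    # Assumption: OPTIONS preflight requests are not counted as user traffic.
    return [r for r in requests if r.method != "OPTIONS"]


def availability_sli(requests):
    """Fraction of valid requests that did not fail with a 5xx status."""
    valid = valid_requests(requests)
    if not valid:
        return None  # no traffic: the SLI is undefined for this window
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)


def latency_sli(requests, threshold_ms):
    """Fraction of valid requests answered within threshold_ms."""
    valid = valid_requests(requests)
    if not valid:
        return None
    fast = sum(1 for r in valid if r.latency_ms <= threshold_ms)
    return fast / len(valid)


sample = [
    Request("GET", 200, 120.0),
    Request("GET", 500, 80.0),
    Request("POST", 201, 640.0),
    Request("OPTIONS", 204, 5.0),
    Request("GET", 404, 30.0),
]
print("availability:", availability_sli(sample))
print("latency <= 500ms:", latency_sli(sample, 500))
```

Note that the 404 counts as a good event here because the service answered correctly; whether 4xx responses should count against availability is exactly the kind of boundary decision this section asks you to write down.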

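2.1.3 SLI data sources

Data source	Advantages	Disadvantages
Black-box monitoring (synthetic monitoring)	No changes to the service; exercises the paths real users take	Cannot cover every user scenario
White-box monitoring (application metrics)	Precise data that can be correlated with internal state	Requires instrumenting the code
RUM (Real User Monitoring)	Data from real user experience	Privacy compliance burden and collection cost
Log analysis	Rich detail	High latency and large storage cost
Manual sampling tests by SREs	Flexible; can uncover monitoring blind spots	Does not scale

Recommended approach: rely mainly on application metrics collected by Prometheus, with black-box probes as a fallback check.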

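2.2 SLO: setting service level objectives

An SLO is the target threshold for an SLI. It defines the level of reliability we want the service to reach.

2.2.1 Expressing an SLO mathematically

SLO attainment = good requests (those meeting the SLI threshold) / total requests × 100%, measured over the target time window

For example:
  Monthly SLO = 99.9% (about 43 minutes of downtime per month)
  Calculation: 30 days × 24 hours × 60 minutes × (1 - 0.999) = 43.2 minutes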
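The downtime calculation above generalizes to any objective and window. A minimal sketch, assuming a 30-day window by default (the function name is illustrative):

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full outage an SLO permits over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% over 30 days -> {allowed_downtime_minutes(slo):.1f} minutes")
```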

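2.2.2 SLO hierarchy

Company-level SLO
  │
  ├── Product-level SLO (e.g. "search service 99.5%")
  │     │
  │     ├── Service-level SLO (e.g. "search API 99.9%")
  │     │     │
  │     │     ├── Component-level SLO (e.g. "index service 99.95%")
  │     │     └── Dependency SLO (e.g. "MySQL 99.99%")

Set a service's internal SLO one notch higher than the external commitment (for example, 99.9% externally and 99.95% internally) to leave a safety buffer.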

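2.2.3 Error Budget

Error Budget = 1 - SLO. It is the amount of "things going wrong" that the SLO still permits.

Monthly SLO of 99.9% → Error Budget = 0.1%
Error Budget = 30 days × 24 hours × 60 minutes × 0.001 = 43.2 minutes

These 43.2 minutes are the month's allowance for failure.
They can be spent on shipping new features, experimental changes, or carrying some technical debt.

Error Budget consumption status:

Status	Consumption	Policy
🟢 Green (< 50% consumed)	Normal	Keep the regular release cadence
🟡 Yellow (50-100% consumed)	Elevated	Reduce release frequency and prioritize reliability fixes
🔴 Red (budget exhausted, SLO violated)	Out of control	Stop non-essential changes; the whole team fixes bugs

This is one of the most valuable outputs of an SLO/SLI framework: objective data decides whether you can still ship today.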
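The status table can be turned into a small check. The sketch below assumes a request-based SLO and takes good and total request counts for the current window; the function name and return format are illustrative.

```python
def budget_status(slo_percent: float, good: int, total: int) -> tuple[str, float]:
    """Return (status, fraction of the Error Budget consumed) for the window."""
    allowed_bad = total * (1 - slo_percent / 100)  # failures the SLO permits
    bad = total - good
    if allowed_bad <= 0:
        # A 100% objective or an empty window leaves no budget to spend.
        return ("red" if bad > 0 else "green", float("inf") if bad > 0 else 0.0)
    consumed = bad / allowed_bad
    if consumed < 0.5:
        return "green", consumed
    if consumed <= 1.0:
        return "yellow", consumed
    return "red", consumed


status, consumed = budget_status(99.9, good=999_200, total=1_000_000)
print(f"{status}: {consumed:.0%} of the budget consumed")
```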

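2.2.4 Rules of thumb for setting SLO targets

Start from user needs, not from technical convenience
Use industry benchmarks as a reference:
- 99% (two nines): basic availability; leaves room for routine maintenance windows
- 99.9% (three nines): high availability; downtime goes unnoticed in everyday use
- 99.99% (four nines): mission-critical systems (financial payments, healthcare)
- 99.999% (five nines): extremely high reliability (telecom network core)
Do not start with the highest target; begin at the current actual level and raise it step by step
Cost climbs steeply as the SLO tightens: a common rule of thumb is that each additional nine (99% → 99.9%, then 99.9% → 99.99%) costs roughly ten times more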

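2.3 SLA: service level agreements

An SLA is a legal or commercial contract with external parties. It is usually derived from the SLO, but the external commitment should be more conservative than the internal target.

Internal SLO = 99.95% (the target we work toward)
External SLA = 99.9% (the floor promised to customers)
The 0.05% difference is the buffer

Breaching an SLA usually triggers financial compensation (SLA credits), so set SLAs conservatively.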

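3. Main Functions and Use Cases

3.1 Core uses of SLOs in production

Scenario 1: Release decision engine

The SLO and its Error Budget drive the release policy directly:

# Sketch: release decision based on the remaining Error Budget
def can_deploy(error_budget_remaining: float, change_risk: str = "low") -> tuple[bool, str]:
    """error_budget_remaining is the fraction of the monthly budget left (0.0 to 1.0)."""
    if error_budget_remaining <= 0:
        return False, "Error Budget exhausted, deployments are blocked"
    if error_budget_remaining < 0.25:  # less than 25% of the monthly budget left
        return False, f"Only {error_budget_remaining:.1%} of the Error Budget remains, risk is too high"
    if error_budget_remaining < 0.5 and change_risk == "high":
        return False, "Error Budget is low and the change is high risk"
    return True, f"Error Budget is sufficient ({error_budget_remaining:.1%}), OK to deploy"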

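Scenario 2: Alert consolidation and tiering

The problem with traditional alerting: too many metrics and too many alerts, so the failures that matter get buried.

An SLO-based alerting scheme:

SLO health (decides whether an alert is needed)
  │
  ├── Burn Rate Alert
  │     Core idea: alert when the Error Budget is being consumed faster than the sustainable rate
  │     Example: a burn rate of 14.4 sustained for 1 hour consumes 2% of the 30-day Error Budget → alert immediately
  │
  ├── Traditional metric alerts (kept, but at lower priority)
  │     Example: CPU > 80%, memory > 85% (supporting signals, non-blocking)
  │
  └── Runbook linkage
        Every SLO alert links automatically to its incident runbook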
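To make the burn rate idea concrete: a burn rate of 1 spends the budget exactly over the SLO period, and higher rates spend it proportionally faster. The sketch below assumes a 30-day period and uses 14.4 as the paging threshold, the fast-burn value used in this guide's alert configuration. Requiring both a long and a short window to exceed the threshold is the multi-window approach described in the Google SRE Workbook. The function names are illustrative.

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    budget = 1 - slo_percent / 100
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's Error Budget spent at this rate over the window."""
    return rate * window_hours / (period_days * 24)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_percent: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The short window makes the alert stop quickly once the problem is fixed.
    """
    return (burn_rate(long_window_error_ratio, slo_percent) >= threshold
            and burn_rate(short_window_error_ratio, slo_percent) >= threshold)


print(f"{budget_consumed(14.4, window_hours=1):.1%} of the budget in one hour")
print(should_page(0.02, 0.02, slo_percent=99.9))   # both windows burning fast
print(should_page(0.02, 0.001, slo_percent=99.9))  # already recovering
```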

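Scenario 3: Reliability assessment across service dependencies

In a microservice architecture, a single service's SLO ≠ overall availability:

Given:
  - User service SLO = 99.9% (monthly error budget 43.2 minutes)
  - Order service SLO = 99.9% (monthly error budget 43.2 minutes)
  - Payment service SLO = 99.95% (monthly error budget 21.6 minutes)

The overall availability of the checkout path (which depends on all three, assuming independent failures) is:
  Overall availability ≈ 99.9% × 99.9% × 99.95% ≈ 99.75%
  Corresponding monthly error budget ≈ 108 minutes

Note: availabilities multiply rather than add. The more dependencies in the path, the lower the overall availability.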
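The multiplication can be checked with a few lines. This sketch assumes the services fail independently and that every request on the path needs all of them; the function name is illustrative.

```python
import math


def serial_availability(slos_percent: list[float]) -> float:
    """Availability (in percent) of a path that needs every listed service."""
    return 100 * math.prod(s / 100 for s in slos_percent)


overall = serial_availability([99.9, 99.9, 99.95])
budget_minutes = 30 * 24 * 60 * (1 - overall / 100)
print(f"overall availability: {overall:.2f}%")
print(f"monthly error budget: {budget_minutes:.0f} minutes")
```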

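Scenario 4: SLO compliance reporting (business review)

Publish an SLO health report to business stakeholders and technical leads on a regular schedule:

Contents of the monthly SLO report:
  1. SLO attainment per service (bar chart)
  2. Error Budget consumption curve
  3. List of incidents that caused missed SLOs
  4. SLO adjustment plan for next month (tighten or relax)
  5. Cost-benefit analysis of reliability investment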

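4. Deployment and Installation

4.1 Technology choices

We build the SLO infrastructure on Prometheus plus Sloth:

Component	Purpose	Why chosen
Prometheus	Metric collection and storage	The de facto standard for cloud-native monitoring
Sloth	Declarative SLO management; generates Prometheus rules automatically	Handles multi-window burn rate alerts for you
Grafana	Dashboards	Integrates closely with Prometheus
Alertmanager	Alert routing and grouping	Supports dedicated routing for SLO alerts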

4.2 Environment preparation

Prerequisite: a Kubernetes cluster (1.21+ recommended)

# Confirm the state of the Kubernetes cluster
kubectl cluster-info
kubectl get nodes

# Confirm Helm is installed (this guide installs Sloth with Helm)
helm version
# Version output: version.BuildInfo{Version:"v3.x.x"...}

# If Helm is not installed, run:
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

4.3 Install Prometheus (using the Prometheus Operator)

# Add the Prometheus Community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (bundles Prometheus, Grafana, and Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi

# Wait for the pods to become ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=prometheus -n monitoring --timeout=300s
kubectl get pods -n monitoring

4.4 Install Sloth (the core SLO management tool)

# Add the Sloth Helm repository
helm repo add sloth https://slok.github.io/sloth
helm repo update

# Install Sloth (the chart ships the PrometheusServiceLevel CRD and the controller)
helm install sloth sloth/sloth \
  --namespace monitoring \
  --create-namespace

# Verify the installation
kubectl get pods -n monitoring -l app.kubernetes.io/name=sloth
# Expected: a sloth-xxx-xxx pod in Running state

# Install the Sloth CLI (used to validate SLO files locally)
# Download the binary for your platform from https://github.com/slok/sloth/releases

# Verify the CLI
sloth version

4.5 Define your first SLO

In Kubernetes mode Sloth reads SLOs from declarative PrometheusServiceLevel resources. Each SLI is expressed as an error query and a total query, and {{.window}} is filled in by Sloth for every alert window. Sloth's default SLO period is 30 days. A complete example:

# frontend-api-slo.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: frontend-api
  namespace: monitoring
spec:
  service: frontend-api
  labels:
    team: platform
  slos:
    - name: availability
      objective: 99.5
      description: >
        Frontend API availability of 99.5%, based on the share of requests answered with 2xx/3xx.
        Monthly Error Budget = 3.6 hours.
      sli:
        events:
          # Bad events: every request that is not 2xx/3xx
          errorQuery: |
            sum(rate(http_requests_total{job="frontend-api",status!~"2..|3.."}[{{.window}}]))
          # All events
          totalQuery: |
            sum(rate(http_requests_total{job="frontend-api"}[{{.window}}]))
      alerting:
        name: FrontendAPIAvailability
        labels:
          service: frontend-api
        annotations:
          summary: "Frontend API is burning its availability Error Budget too fast"
          runbook_url: "https://runbooks.example.com/frontend-api-slo"
        # Sloth generates multi-window burn rate alerts for both severities
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: warning
4.6 More SLO examples

Example 2: API latency SLO (99.5% of requests within 500 ms)

# api-latency-slo.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-gateway
  namespace: monitoring
spec:
  service: api-gateway
  labels:
    team: backend
  slos:
    - name: latency
      objective: 99.5
      description: >
        99.5% of API requests complete within 500 ms; monthly Error Budget = 3.6 hours
      sli:
        events:
          # Bad events: all requests minus those in the le="0.5" histogram bucket
          errorQuery: |
            sum(rate(http_request_duration_seconds_count{job="api-gateway"}[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{job="api-gateway",le="0.5"}[{{.window}}]))
          totalQuery: |
            sum(rate(http_request_duration_seconds_count{job="api-gateway"}[{{.window}}]))
      alerting:
        name: APILatency
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: warning

Example 3: Database availability SLO

# mysql-availability-slo.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: mysql-primary
  namespace: monitoring
spec:
  service: mysql-primary
  labels:
    team: dba
  slos:
    - name: availability
      objective: 99.99
      description: >
        MySQL primary availability of 99.99% (about 4.3 minutes of downtime per month)
      sli:
        # mysql_up is a 0/1 gauge, so rate() does not apply to it.
        # Use the share of samples where the instance was down as the error ratio.
        raw:
          errorRatioQuery: |
            1 - avg(avg_over_time(mysql_up{job="mysql-primary"}[{{.window}}]))
      alerting:
        name: MySQLAvailability
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: warning

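4.7 Apply the SLO configuration to the cluster

# Create the SLO directory and place the YAML files above in it
mkdir -p sre-docs/examples

# Validate the SLO specs with the Sloth CLI before applying them
sloth validate -i sre-docs/examples

# Preview the Prometheus rules Sloth would generate
sloth generate -i sre-docs/examples/frontend-api-slo.yaml | head -100

# Apply the SLO files to the cluster
kubectl apply -f sre-docs/examples/

# List the PrometheusServiceLevel resources
kubectl get prometheusservicelevels -n monitoring

# List the PrometheusRule objects generated by Sloth
kubectl get prometheusrules -n monitoring -l app.kubernetes.io/managed-by=sloth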

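4.8 Deploy the Grafana SLO dashboards

# Sloth publishes two dashboards on grafana.com:
# Dashboard ID: 14643 (high-level SLO overview)
# Dashboard ID: 14348 (SLO detail)

# 1. Open Grafana (default: http://localhost:3000)
# 2. Dashboards → Import → enter dashboard ID 14643
# 3. Select the Prometheus data source
# 4. Click Import

# Or import by script. The import API expects the dashboard JSON rather than just its ID,
# so download it from grafana.com first.
GRAFANA_URL="http://localhost:3000"
GRAFANA_TOKEN="your-api-token"

curl -fsSL https://grafana.com/api/dashboards/14643/revisions/latest/download -o sloth-overview.json

# Assumption: the input name DS_PROMETHEUS and the data source name "Prometheus"
# must match the "__inputs" section of the downloaded JSON and your Grafana setup.
jq '{dashboard: ., overwrite: true, inputs: [{name: "DS_PROMETHEUS", type: "datasource", pluginId: "prometheus", value: "Prometheus"}]}' sloth-overview.json |
curl -X POST \
  -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @- \
  "${GRAFANA_URL}/api/dashboards/import"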

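4.9 Verify the SLO pipeline end to end

# Look up the Prometheus pod created by the operator
PROM_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o name | head -n 1)

# 1. Check that Prometheus has the SLO recording rules generated by Sloth
kubectl exec -n monitoring "$PROM_POD" -c prometheus -- \
  promtool query instant http://localhost:9090 \
  'slo:period_error_budget_remaining:ratio'

# 2. Check that the PrometheusRule objects were created
kubectl get prometheusrules -n monitoring -l app.kubernetes.io/managed-by=sloth -o yaml

# 3. Check the Sloth logs (for troubleshooting)
kubectl logs -n monitoring -l app.kubernetes.io/name=sloth --tail=50

# 4. Check the syntax of the rule files loaded by Prometheus
kubectl exec -n monitoring "$PROM_POD" -c prometheus -- \
  sh -c 'promtool check rules /etc/prometheus/rules/*/*.yaml'

# 5. Open the SLO dashboard in Grafana
# You should see:
#   - SLO attainment per service
#   - Remaining Error Budget
#   - Burn rate curves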

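5. Common Problems and Troubleshooting

Problem 1: The Sloth CRD is not available after installation

Symptom: kubectl get prometheusservicelevels fails with "the server doesn't have a resource type", which means the CRD is not registered

Diagnosis and fix:

# Check whether the Sloth CRD exists
kubectl get crd | grep sloth

# If it is missing, install the CRD manually:
kubectl apply -f https://raw.githubusercontent.com/slok/sloth/main/pkg/kubernetes/gen/crd/sloth.slok.dev_prometheusservicelevels.yaml

# Reinstall Sloth (force replacement):
helm upgrade --install sloth sloth/sloth \
  --namespace monitoring \
  --create-namespace \
  --force

# Check the Sloth pod logs:
kubectl logs -n monitoring deployment/sloth -f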

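Problem 2: Prometheus has no SLO metrics

Symptom: the SLO panels in Grafana show no data

Troubleshooting steps:

# Step 1: confirm that Prometheus has the raw metrics
PROM_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o name | head -n 1)
kubectl exec -n monitoring "$PROM_POD" -c prometheus -- \
  promtool query instant http://localhost:9090 'http_requests_total{job="frontend-api"}'

# Step 2: check which PrometheusRule objects Prometheus is configured to load
kubectl get prometheus -n monitoring -o yaml | grep -A 10 ruleSelector

# Step 3: by default kube-prometheus-stack only loads rules labelled with its own release.
# Either add that label to the rules Sloth generates or let Prometheus select all rules:
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false

# Step 4: check the Prometheus Operator logs:
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-operator --tail=100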

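Problem 3: Burn rate alerts fire frequently although the SLO is actually healthy

Symptom: burn rate alerts fire often, yet actual SLO attainment stays close to the target

Fix: raise the burn rate thresholds to suit the service. Sloth derives them from alert window definitions, which can be overridden with a custom file:

# Custom alert windows, loaded with --slo-period-windows-path (suited to services with steady traffic)
# Verify the field names against the Sloth documentation for your version.
apiVersion: sloth.slok.dev/v1
kind: AlertWindows
spec:
  sloPeriod: 30d
  page:
    quick: {errorBudgetPercent: 5, shortWindow: 5m, longWindow: 1h}    # 1h: burn rate 36
    slow: {errorBudgetPercent: 10, shortWindow: 30m, longWindow: 6h}   # 6h: burn rate 12
  ticket:
    quick: {errorBudgetPercent: 20, shortWindow: 2h, longWindow: 1d}   # 1d: burn rate 6
    slow: {errorBudgetPercent: 30, shortWindow: 6h, longWindow: 3d}    # 3d: burn rate 3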

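Problem 4: The SLO target is too high and the Error Budget runs out quickly

Symptom: the monthly Error Budget is exhausted within the first week and the SLO is missed persistently

Fix:

# 1. Derive a realistic SLO from historical data first
#    (run where Prometheus is reachable, e.g. after: kubectl port-forward -n monitoring svc/prometheus-operated 9090)
promtool query instant http://localhost:9090 '
  1 - (
    sum(increase(http_requests_total{job="frontend-api",status=~"5.."}[30d]))
    /
    sum(increase(http_requests_total{job="frontend-api"}[30d]))
  )
'

# 2. If there is not enough history, start with a loose SLO and run it for one month
objective: 95.0

# 3. After a month, tighten the target based on real data and keep iterating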

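Problem 5: SLO attainment is misleading when many services depend on each other

Symptom: every service meets its SLO, yet end-to-end user requests fail at a high rate

Fix: define an SLO for the end-to-end user journey:

# End-to-end checkout journey SLO
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout-journey
  namespace: monitoring
spec:
  service: checkout-journey
  slos:
    - name: success
      objective: 99.0
      description: >
        End-to-end checkout success (user starts checkout → payment completed) with no error anywhere on the path
      sli:
        events:
          errorQuery: |
            sum(rate(checkout_flow_total{status!="success"}[{{.window}}]))
          totalQuery: |
            sum(rate(checkout_flow_total[{{.window}}]))
      alerting:
        name: CheckoutJourneySuccess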

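Problem 6: Alert notifications are too frequent and on-call engineers are fatigued

Fix: tier the alerts

Alert level	Trigger	Notification	Response requirement
P1 - SLO burn alert	Burn rate exceeds the threshold	Immediate (phone call + SMS)	Respond within 15 minutes
P2 - Metric anomaly	Component metric is abnormal	Slack channel	Respond within 30 minutes
P3 - Trend warning	Capacity trend approaches a threshold	Email	Handle during working hours
P4 - Informational	Routine notification	Logged only	No response needed

# alertmanager-config.yaml
route:
  # The root route needs a default receiver for alerts no child route matches
  receiver: slack-platform
  group_by: ['slo', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # P1: SLO burn rate alerts → phone notification
    # slo_burn is a custom label: add it to the alert labels in the SLO spec
    - match:
        severity: page
        slo_burn: "true"
      receiver: oncall-phone
      group_wait: 10s
      repeat_interval: 1h

    # P2: metric alerts → Slack channel
    - match:
        severity: warning
      receiver: slack-platform
      group_wait: 1m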

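6. Production Recommendations

6.1 SLO design and governance

Recommendation 1: design SLOs from the user journey, not from the technical architecture

❌ Poor: CPU < 80%, memory < 85%, disk < 90%
✅ Good: checkout success rate 99.5%, API P99 latency < 500ms, search results returned in < 2s

Core principle: an indicator must be tied directly to what users experience.
If users cannot perceive a metric at all, do not make it an SLO, even if it matters technically.

Recommendation 2: keep the number of SLOs manageable (3-7 SLOs for core services is a reasonable range)

At most 2-3 SLOs per service (availability plus the main latency target)
Cover the most frequently used and most business-critical scenarios first
Avoid so many SLOs that maintenance cost explodes

Recommendation 3: every SLO needs a clear owner (SLO Owner)

| SLO | Owner | Team | Review cadence |
|-----|-------|------|---------|
| Frontend API availability 99.5% | @张三 | Platform engineering | Quarterly |
| Search service P99 latency < 500ms | @李四 | Search team | Quarterly |
| Payment path end-to-end success rate 99.9% | @王五 | Payments team | Monthly |

Recommendation 4: review SLOs every quarter and update the targets

Use historical data to decide whether each target should be tightened or relaxed
Major business changes (such as large promotions or major feature launches) call for temporary SLO adjustments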

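6.2 Error Budget policy and practice

Recommendation 5: build an automated Error Budget dashboard

Required panels for an Error Budget dashboard:
1. Remaining Error Budget as a percentage (gauge)
2. Error Budget consumption curve (vs. the planned rate)
3. Projected exhaustion time (based on the current consumption rate)
4. Heatmap of historical consumption (to locate the periods when problems cluster)

The Grafana dashboards that ship with Sloth (overview ID: 14643) are a good starting point and work out of the box.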
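The projected exhaustion time in item 3 follows from the remaining budget and the current burn rate: at a burn rate of 1, a full budget lasts exactly one SLO period. A minimal sketch, assuming a 30-day period (the function name is illustrative):

```python
import math


def hours_until_exhausted(remaining_fraction: float,
                          current_burn_rate: float,
                          period_days: int = 30) -> float:
    """Hours until the Error Budget reaches zero at the current burn rate.

    remaining_fraction is the share of the period's budget still unspent (0.0 to 1.0).
    """
    if remaining_fraction <= 0:
        return 0.0
    if current_burn_rate <= 0:
        return math.inf  # nothing is being consumed right now
    return remaining_fraction * period_days * 24 / current_burn_rate


# Half the budget left while burning twice as fast as sustainable.
print(hours_until_exhausted(0.5, 2.0))
```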

Recommendation 6: define the release policy from the Error Budget

# Example Error Budget policy
policies:
  - name: fast-track-release
    conditions:
      error_budget_remaining: "> 0.75"
      recent_change_risk: "low"
    actions:
      auto_approve: true

  - name: cautious-release
    conditions:
      error_budget_remaining: "0.25-0.75"
    actions:
      require_review: ["team-lead"]
      canary_required: true
      canary_traffic: 5

  - name: release-freeze
    conditions:
      error_budget_remaining: "< 0.25"
    actions:
      require_review: ["vp-engineering"]
      message: "Error Budget is below 25%; releases require VP approval"

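Recommendation 7: a review after spending Error Budget is for learning, not punishment

The Google SRE view: the Error Budget is a safety cushion that allows you to make mistakes, and spending it is better than hoarding it.

An unspent Error Budget means you are being overly conservative and giving up too much development velocity.

Once budget has been spent, though, always review what was learned.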

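6.3 Integrating SLOs with the monitoring stack

Recommendation 8: SLO alerts are the top priority for on-call

On-call engineers are paged only for SLO/burn rate alerts (P1)
Other metric alerts are handled asynchronously through Slack (P2-P4)
The first step of the alert response SOP is always to check the Error Budget status

Recommendation 9: keep SLO data traceable for at least 90 days

# kube-prometheus-stack values:
prometheus:
  prometheusSpec:
    retention: 90d
    # SLO alerts need at least a 30d window of data to compute the monthly SLO accurately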

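6.4 SLO culture and organizational practice

Recommendation 10: make SLO data visible to development teams too

SLO dashboards are visible to all engineers (not only on-call)
Every postmortem analyzes the incident from the SLO perspective

Recommendation 11: include SLO indicators in team OKRs

Example OKR:
Objective: improve service reliability and reduce the impact of incidents on users
  KR1: raise the API service monthly SLO from 99.5% to 99.7%
  KR2: reduce average monthly Error Budget consumption by 50%
  KR3: reduce mean time to recovery (MTTR) from 45 minutes to 20 minutes

Recommendation 12: pilot on a small scope first, then roll out gradually

Choose 1-2 key services for the pilot (prefer stable traffic, a clear SLI, and sufficient historical data)
Summarize the lessons after 1-2 quarters of piloting, then expand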

7. References

Official documentation and books

Google SRE Book (Site Reliability Engineering)
- Website: https://sre.google/sre-book/
- Recommended chapters: Chapter 3 - Embracing Risk, Chapter 4 - Service Level Objectives
Google SRE Workbook
- Website: https://sre.google/workbook/
- Recommended chapter: Implementing SLOs
Sloth documentation
- GitHub: https://github.com/slok/sloth
- Documentation: https://sloth.dev/
Prometheus documentation
- Website: https://prometheus.io/docs/introduction/overview/
Grafana dashboards for Sloth
- https://grafana.com/grafana/dashboards/14643

Industry practice

Cloudflare: How We Built SLO Monitoring at Scale
- https://blog.cloudflare.com/how-we-built-slo-monitoring-at-scale/
Netflix: Global SLO Planning and Reliability Culture
- https://netflixtechblog.com/
PagerDuty: SLO-Based Alerting Best Practices
- https://PagerDuty.com/resources/guides/slo-alerting/

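Open-source tooling

Tool	Purpose	GitHub
Sloth	Declarative SLO management; generates Prometheus rules	github.com/slok/sloth
OpenSLO	Standard specification format for SLOs	github.com/OpenSLO/OpenSLO
Thanos	Long-term storage and global view for SLO data	github.com/thanos-io/thanos
Keep	Alert management and automation	github.com/keephq/keep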

Video resources

SREcon conference recordings
- https://www.usenix.org/conference/srecon
Google Cloud SLO Workshop
- https://cloud.google.com/sts/docs/slo-workshop
