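SRE Topic Guide: SLO/SLI Service Level Objectives in Practice
This guide targets production environments and covers core SRE practices: defining SLOs, SLIs, and SLAs, managing error budgets, alerting strategy, and observability integration.
1. SLO/SLI/SLA Concepts and Framework
1.1 Core Concepts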
┌──────────────────────────────────────────────────────────┐
│ SLO/SLI/SLA relationship diagram                         │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ SLA (Service Level Agreement)                            │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ SLO (Service Level Objective)                        │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ SLI (Service Level Indicator)                    │ │ │
│ │ │   • Availability                                 │ │ │
│ │ │   • Latency                                      │ │ │
│ │ │   • Throughput                                   │ │ │
│ │ │   • Error rate                                   │ │ │
│ │ │   • Saturation                                   │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ │ Targets: 99.9% availability, P99 latency < 500 ms    │ │
│ └──────────────────────────────────────────────────────┘ │
│ Breach consequences: service credits, compensation terms │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ Error Budget                                             │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ 99.9% availability = allowed downtime per month:     │ │
│ │ 30 d × 24 h × 60 min × (1 - 0.999) = 43.2 min        │ │
│ │                                                      │ │
│ │ Error budget = 1 - SLO = 0.1% of the window          │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
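1.2 SLO-Setting Framework
┌──────────────────────────────────────────────────────────┐
│ SLO-setting workflow                                     │
└──────────────────────────────────────────────────────────┘
Step 1: Identify the critical service path
  ┌───────────┐     ┌───────────┐     ┌───────────┐
  │   User    │────▶│    API    │────▶│ Database  │
  │  request  │     │  gateway  │     │           │
  └───────────┘     └───────────┘     └───────────┘
        │
        ▼
Step 2: Define the SLIs
  • Request success rate = successful requests / total requests
  • Latency distribution = P50, P95, P99
  • Throughput = requests per second
  • Error rate = 5xx responses / total responses
        │
        ▼
Step 3: Set the SLO targets
  • Availability SLO: 99.9%
  • Latency SLO: P99 < 500 ms
  • Error rate SLO: < 0.1%
        │
        ▼
Step 4: Compute the error budget
  • Monthly error budget = 43.2 minutes
  • Daily error budget = 1.44 minutes
  • Also express the budget as a number of failed requests
        │
        ▼
Step 5: Establish the alerting strategy
  • Alerts on error budget burn rate
  • Alerts on error budget exhaustion
  • Multi-window alerting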
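The Step 2 indicators can be sketched in a few lines of Python. This is a minimal illustration, assuming request records are already available in memory; the `Request` class, `percentile`, and `compute_slis` are hypothetical names, and the nearest-rank percentile is one of several reasonable definitions. In production these values would normally come from Prometheus queries rather than raw records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # response time in seconds


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceiling division on integers
    return ordered[max(rank, 1) - 1]


def compute_slis(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Compute the Step 2 SLIs for a non-empty batch of requests."""
    total = len(requests)
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    durations = [r.duration_s for r in requests]
    return {
        "success_rate": (total - errors) / total,
        "error_rate": errors / total,
        "throughput_rps": total / window_s,
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }
```

Comparing `success_rate` against 0.999 and `p99_s` against 0.5 then gives the Step 3 pass/fail checks.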
2. SLI Design
2.1 Common SLI Definitions
# sli-definitions.yaml - SLI definitions
# ==============================================================================
# Availability SLI
# ==============================================================================
availability_sli:
  name: "Request availability"
  description: "Fraction of requests handled successfully"
  formula: "successful requests / total requests"
  measurement:
    source: "prometheus"
    query: |
      sum(rate(http_requests_total{status!~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
  exclusion:
    - "health-check requests"
    - "staging-environment requests"
    - "internal monitoring requests"
  slo_target: 0.999  # 99.9%
# ==============================================================================
# Latency SLI
# ==============================================================================
latency_sli:
  name: "Request latency"
  description: "Distribution of request response times"
  # Buckets are summed by `le` so the quantile covers all instances together.
  measurement:
    - name: "P50 latency"
      query: 'histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
      slo_target: 0.1  # 100 ms
    - name: "P95 latency"
      query: 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
      slo_target: 0.3  # 300 ms
    - name: "P99 latency"
      query: 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
      slo_target: 0.5  # 500 ms
  exclusion:
    - "large file uploads"
    - "bulk export requests"
# ==============================================================================
# Error Rate SLI
# ==============================================================================
error_rate_sli:
  name: "Error rate"
  description: "Fraction of requests that fail"
  formula: "failed requests / total requests"
  measurement:
    source: "prometheus"
    query: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
  slo_target: 0.001  # 0.1%
# ==============================================================================
# Throughput SLI
# ==============================================================================
throughput_sli:
  name: "Throughput"
  description: "Request-handling capacity of the system"
  measurement:
    source: "prometheus"
    query: "sum(rate(http_requests_total[5m]))"
  slo_target: 10000  # 10,000 req/s
# ==============================================================================
# Saturation SLI
# ==============================================================================
saturation_sli:
  - name: "CPU utilization"
    # Note: this expression yields average CPU cores used per container;
    # divide by the container's CPU limit to obtain a utilization ratio.
    query: 'avg(rate(container_cpu_usage_seconds_total[5m]))'
    slo_target: 0.8  # 80%
  - name: "Memory utilization"
    query: 'container_memory_usage_bytes / container_spec_memory_limit_bytes'
    slo_target: 0.85  # 85%
  - name: "Disk utilization"
    # node_exporter exposes size and available bytes, not a "usage" metric.
    query: '1 - node_filesystem_avail_bytes / node_filesystem_size_bytes'
    slo_target: 0.85  # 85%
2.2 SLI Templates by Service Type
# ==============================================================================
# Web API service SLIs
# ==============================================================================
web_api_sli:
  critical_path:
    - endpoint: "/api/v1/users"
      availability_slo: 0.999
      latency_p99_slo: 0.2
    - endpoint: "/api/v1/orders"
      availability_slo: 0.9999  # orders are more critical
      latency_p99_slo: 0.5
  non_critical_path:
    - endpoint: "/api/v1/analytics"
      availability_slo: 0.99
      latency_p99_slo: 2.0  # analytics can tolerate slower responses
# ==============================================================================
# Database service SLIs
# ==============================================================================
database_sli:
  - name: "Query latency"
    query_type: "select"
    p99_slo: 0.05  # 50 ms
  - name: "Write latency"
    query_type: "insert"
    p99_slo: 0.1  # 100 ms
  - name: "Connection pool availability"
    slo: 0.999
# ==============================================================================
# Message queue service SLIs
# ==============================================================================
message_queue_sli:
  - name: "Message delivery latency"
    description: "Time from message production to consumption"
    p99_slo: 5.0  # 5 seconds
  - name: "Message delivery success rate"
    slo: 0.9999
  - name: "Consumer backlog"
    threshold: 100000  # maximum backlog, in messages
# ==============================================================================
# Cache service SLIs
# ==============================================================================
cache_sli:
  - name: "Cache hit ratio"
    slo: 0.9  # 90% hit ratio
  - name: "Cache latency"
    p99_slo: 0.01  # 10 ms
3. Error Budget Management
3.1 Calculating the Error Budget
#!/usr/bin/env python3
# error_budget_calculator.py - error budget calculator
import json
from datetime import timedelta
from typing import Dict


class ErrorBudgetCalculator:
    """Error budget calculator."""

    # Supported time windows
    TIME_WINDOWS = {
        'daily': timedelta(days=1),
        'weekly': timedelta(weeks=1),
        'monthly': timedelta(days=30),
        'quarterly': timedelta(days=90),
        'yearly': timedelta(days=365)
    }

    def __init__(self, slo_target: float):
        """
        Args:
            slo_target: SLO target between 0.0 and 1.0, e.g. 0.999 for 99.9%
        """
        self.slo_target = slo_target
        self.error_budget = 1 - slo_target

    def calculate_downtime_budget(self, window: str = 'monthly') -> Dict[str, float]:
        """
        Compute the downtime budget for a time window.

        Unknown window names fall back to 'monthly'.

        Returns:
            A dict with the budget expressed in several time units
        """
        duration = self.TIME_WINDOWS.get(window, self.TIME_WINDOWS['monthly'])
        total_seconds = duration.total_seconds()
        budget_seconds = total_seconds * self.error_budget
        return {
            'window': window,
            'total_seconds': total_seconds,
            'budget_seconds': budget_seconds,
            'budget_minutes': budget_seconds / 60,
            'budget_hours': budget_seconds / 3600,
            'slo_target': self.slo_target,
            'slo_percentage': f"{self.slo_target * 100:.3f}%"
        }

    def calculate_request_budget(self, total_requests: int) -> Dict[str, float]:
        """
        Compute the error budget as a number of requests.

        Args:
            total_requests: total number of requests in the window
        Returns:
            Error budget details
        """
        allowed_failures = int(total_requests * self.error_budget)
        return {
            'total_requests': total_requests,
            'allowed_failures': allowed_failures,
            'slo_target': self.slo_target
        }

    def calculate_burn_rate(self,
                            consumed_budget: float,
                            time_elapsed_hours: float,
                            window_hours: float = 720) -> float:
        """
        Compute the error budget burn rate.

        Args:
            consumed_budget: fraction of the budget already consumed (0.0 - 1.0)
            time_elapsed_hours: hours elapsed so far
            window_hours: budget window in hours, 30 days by default
        Returns:
            Burn rate as a multiple of the sustainable rate (1.0 means the
            budget runs out exactly at the end of the window)
        """
        expected_burn_rate = 1.0 / window_hours  # budget fraction per hour at 1x
        actual_burn_rate = consumed_budget / time_elapsed_hours  # observed fraction per hour
        return actual_burn_rate / expected_burn_rate

    def calculate_slo_targets(self) -> Dict[str, Dict]:
        """
        Compute the allowed downtime for common availability targets.

        Returns:
            Allowed downtime per availability level (minutes, except hours for 'yearly')
        """
        targets = [0.99, 0.999, 0.9999, 0.99999]
        results = {}
        for target in targets:
            calculator = ErrorBudgetCalculator(target)
            # ':g' avoids float noise such as 99.99900000000001 in the label
            results[f"{target * 100:g}%"] = {
                'daily': calculator.calculate_downtime_budget('daily')['budget_minutes'],
                'weekly': calculator.calculate_downtime_budget('weekly')['budget_minutes'],
                'monthly': calculator.calculate_downtime_budget('monthly')['budget_minutes'],
                'yearly': calculator.calculate_downtime_budget('yearly')['budget_hours']
            }
        return results


# Usage example
if __name__ == "__main__":
    # 99.9% SLO
    calculator = ErrorBudgetCalculator(0.999)
    # Monthly downtime budget
    monthly_budget = calculator.calculate_downtime_budget('monthly')
    print(f"Monthly downtime budget: {monthly_budget['budget_minutes']:.2f} minutes")
    # Request budget
    request_budget = calculator.calculate_request_budget(1000000)
    print(f"Allowed failures per 1,000,000 requests: {request_budget['allowed_failures']}")
    # Reference table for all SLO levels
    print("\nSLO level reference table:")
    print(json.dumps(calculator.calculate_slo_targets(), indent=2))
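A burn rate is easiest to act on when converted into time remaining. The sketch below assumes the usual definition, where a burn rate of 1 consumes the whole budget in exactly one window; `hours_until_exhaustion` is a hypothetical helper, not part of the calculator above.

```python
def hours_until_exhaustion(remaining_fraction: float,
                           burn_rate: float,
                           window_hours: float = 720.0) -> float:
    """Hours until the budget reaches zero if the current burn rate persists.

    remaining_fraction: share of the budget still unspent (1.0 = untouched)
    burn_rate: multiple of the sustainable consumption rate
    window_hours: SLO window length, 30 days by default
    """
    if burn_rate <= 0:
        return float("inf")  # nothing is being consumed
    # At burn rate 1 the full budget lasts window_hours; scale from there.
    return max(remaining_fraction, 0.0) * window_hours / burn_rate
```

For example, with half the budget left and a 2x burn, the result is 180 hours, or 7.5 days.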
3.2 Monitoring Error Budget Consumption
# error_budget_monitoring.yaml
# ==============================================================================
# Prometheus recording rules - error budget
# ==============================================================================
groups:
  - name: error_budget_rules
    interval: 1m
    rules:
      # 30-day availability
      - record: slo:availability:30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d]))
          / sum(rate(http_requests_total[30d]))
      # Fraction of the error budget remaining
      # (1 = untouched, 0 = exhausted, negative = SLO breached)
      - record: slo:error_budget:remaining
        expr: |
          (slo:availability:30d - 0.999) / (1 - 0.999)
      # Current burn rate: 1-hour error rate / allowed error rate
      # (1 = the budget lasts exactly the 30-day window)
      - record: slo:error_budget:burn_rate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          )
          / (1 - 0.999)
      # Hourly availability
      - record: slo:availability:1h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[1h]))
          / sum(rate(http_requests_total[1h]))
      # 5-minute availability
      - record: slo:availability:5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))
# ==============================================================================
# Grafana dashboard variables (documentation only; keep these out of the
# Prometheus rule file itself)
# ==============================================================================
dashboard_variables:
  - name: "SLO Target"
    query: "0.99,0.999,0.9999"
  - name: "Time Window"
    query: "1h,6h,24h,7d,30d"
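The arithmetic behind the `slo:error_budget:remaining` rule can be checked by hand. The following is a small sketch with hypothetical helper names; it mirrors the rule's formula, (availability - target) / (1 - target), and the common definition of burn rate as observed error rate divided by allowed error rate.

```python
def error_budget_remaining(availability: float, slo_target: float) -> float:
    """Fraction of the budget left: 1 = untouched, 0 = exhausted, < 0 = breached."""
    return (availability - slo_target) / (1.0 - slo_target)


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate as a multiple of the rate the SLO allows."""
    return error_rate / (1.0 - slo_target)
```

With a 99.9% target, a measured availability of 99.95% leaves half the budget, and 99.8% means the budget is overspent by 100%.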
4. SLO Alerting Strategy
4.1 Multi-Window, Multi-Burn-Rate Alerts
# slo-alerting.yaml - SLO-based alerting
# ==============================================================================
# Multi-window, multi-burn-rate alerts (recommended)
# ==============================================================================
# Idea: check the error rate over both a short and a long window so that brief
# blips do not trigger alerts.
# Reference: Google SRE Workbook, "Alerting on SLOs"
# Budget consumed by a sustained burn = burn rate x window / SLO period (30 d = 720 h).
groups:
  - name: slo-alerts
    rules:
      # ----------------------------------------------------------------------
      # Availability SLO alerts
      # ----------------------------------------------------------------------
      # 1-hour window, 2x burn rate (about 0.28% of the 30-day budget)
      - alert: SLOErrorBudgetBurnRate2x
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 2
        for: 5m
        labels:
          severity: warning
          slo: availability
          burn_rate: "2x"
        annotations:
          summary: "Availability SLO burn rate 2x"
          description: |
            The error rate over the past hour is more than 2x the SLO threshold.
            Current error rate: {{ $value | printf "%.4f" }}
            One hour at this rate consumes about 0.28% of the monthly error budget.
      # 6-hour window, 5x burn rate (about 4.2% of the 30-day budget)
      - alert: SLOErrorBudgetBurnRate5x
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (1 - 0.999) * 5
        for: 30m
        labels:
          severity: warning
          slo: availability
          burn_rate: "5x"
        annotations:
          summary: "Availability SLO burn rate 5x"
          description: |
            The error rate over the past 6 hours is more than 5x the SLO threshold.
            Current error rate: {{ $value | printf "%.4f" }}
            Six hours at this rate consume about 4.2% of the monthly error budget.
      # 1-day window, 10x burn rate (about 33% of the 30-day budget)
      - alert: SLOErrorBudgetBurnRate10x
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1d]))
            / sum(rate(http_requests_total[1d]))
          ) > (1 - 0.999) * 10
        for: 2h
        labels:
          severity: critical
          slo: availability
          burn_rate: "10x"
        annotations:
          summary: "Availability SLO burn rate 10x"
          description: |
            The error rate over the past 24 hours is more than 10x the SLO threshold.
            Current error rate: {{ $value | printf "%.4f" }}
            One day at this rate consumes about 33% of the monthly error budget.
      # ----------------------------------------------------------------------
      # Latency SLO alert
      # ----------------------------------------------------------------------
      - alert: SLOLatencyP99BurnRate
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
            / sum(rate(http_request_duration_seconds_count[1h]))
          ) < 0.99  # at least 99% of requests should finish within 500 ms
        for: 5m
        labels:
          severity: warning
          slo: latency
        annotations:
          summary: "Latency SLO violated"
          description: |
            Over the past hour, fewer than 99% of requests finished within 500 ms.
            Fraction within 500 ms: {{ $value | humanizePercentage }}
      # ----------------------------------------------------------------------
      # Error budget nearly exhausted
      # ----------------------------------------------------------------------
      - alert: SLOErrorBudgetNearlyExhausted
        expr: slo:error_budget:remaining < 0.1
        for: 5m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Error budget nearly exhausted"
          description: |
            Less than 10% of the 30-day error budget remains.
            Remaining: {{ $value | humanizePercentage }}
            Recommended: halt releases and focus on reliability improvements.
      - alert: SLOErrorBudgetExhausted
        expr: slo:error_budget:remaining < 0
        for: 1m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Error budget exhausted"
          description: "The SLO has been violated; act immediately."
  # ==============================================================================
  # Multi-window alert (reduces false positives)
  # ==============================================================================
  # Fires only when both the short-window and the long-window condition hold.
  - name: multi-window-alerts
    rules:
      - alert: SLOAvailabilityCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > 0.01
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Availability severely degraded"
          # $value carries only the left-hand side of `and`, i.e. the 5-minute rate.
          description: |
            Short-window (5 min) error rate: {{ $value | humanizePercentage }}
            The 1-hour error rate is also above 0.1%.
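To sanity-check burn-rate thresholds, it helps to compute how much budget a sustained burn actually consumes, and to see the two-window condition as plain logic. The sketch below uses hypothetical function names and assumes a 30-day (720-hour) SLO period; the 14.4x example in the comment is the page-level threshold suggested in the Google SRE Workbook for a 1-hour window.

```python
def budget_consumed(burn_rate: float,
                    window_hours: float,
                    period_hours: float = 720.0) -> float:
    """Fraction of the period's error budget used by burning at
    `burn_rate` for `window_hours`."""
    return burn_rate * window_hours / period_hours


def should_alert(short_error_rate: float,
                 long_error_rate: float,
                 slo_target: float,
                 burn_rate: float) -> bool:
    """Multi-window check: both windows must exceed the burn-rate threshold."""
    threshold = (1.0 - slo_target) * burn_rate
    return short_error_rate > threshold and long_error_rate > threshold


# budget_consumed(14.4, 1) is about 0.02, i.e. 2% of the budget in one hour,
# while budget_consumed(2, 1) is only about 0.0028.
```

Requiring the short window as well means the alert stops firing soon after the errors stop, instead of lingering until the long window drains.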
4.2 Alert Tiering
# alerting-tiering.yaml - alert tiering policy
# ==============================================================================
# Alert priority matrix
# ==============================================================================
#
#                   │ Short (5m)      │ Medium (1h)     │ Long (1d)       │
# ──────────────────┼─────────────────┼─────────────────┼─────────────────┤
# High burn (10x)   │ P0 immediately  │ P0 immediately  │ P1 within 1 h   │
# Medium burn (5x)  │ P1 within 1 h   │ P1 within 1 h   │ P2 within 4 h   │
# Low burn (2x)     │ P2 within 4 h   │ P2 within 4 h   │ P3 business hrs │
#
alert_routing:
  # P0 - respond immediately (within 5 minutes)
  - match:
      severity: critical
      burn_rate: "10x"
    actions:
      - type: pagerduty
        priority: high
      - type: slack
        channel: "#incidents-critical"
      - type: sms
        to: oncall_team
  # P1 - respond within 1 hour
  - match:
      severity: critical
    actions:
      - type: pagerduty
        priority: medium
      - type: slack
        channel: "#incidents"
  # P2 - respond within 4 hours
  - match:
      severity: warning
      burn_rate: "5x"
    actions:
      - type: slack
        channel: "#sre-alerts"
      - type: email
        to: sre-team@example.com
  # P3 - handle during business hours
  - match:
      severity: warning
    actions:
      - type: slack
        channel: "#sre-alerts"
      - type: jira
        priority: low
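The routing table above only works if rules are evaluated top to bottom and the first match wins, since the P1 and P3 rules are catch-alls for their severity. That evaluation order is an assumption about whatever router consumes this file. A minimal sketch of the idea, with a hypothetical `ROUTES` list and `route` function:

```python
from typing import Dict, Optional

# (labels that must all match, resulting priority), most specific first
ROUTES = [
    ({"severity": "critical", "burn_rate": "10x"}, "P0"),
    ({"severity": "critical"}, "P1"),
    ({"severity": "warning", "burn_rate": "5x"}, "P2"),
    ({"severity": "warning"}, "P3"),
]


def route(labels: Dict[str, str]) -> Optional[str]:
    """Return the priority of the first route whose labels all match."""
    for match, priority in ROUTES:
        if all(labels.get(key) == value for key, value in match.items()):
            return priority
    return None  # no route matched
```

Swapping the first two entries would send every critical alert to P1, which is why the more specific rule has to come first.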
5. SLO Dashboard Design
5.1 Grafana Dashboard JSON
{
"dashboard": {
"title": "SLO Dashboard",
"uid": "slo-dashboard",
"panels": [
{
"title": "Service Availability (30d)",
"type": "gauge",
"gridPos": {"x": 0, "y": 0, "w": 6, "h": 6},
"targets": [
{"expr": "slo:availability:30d", "legendFormat": "Availability"}
],
"fieldConfig": {
"defaults": {
"min": 0.99,
"max": 1,
"unit": "percentunit",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": 0.99},
{"color": "yellow", "value": 0.999},
{"color": "green", "value": 0.9999}
]
}
}
}
},
{
"title": "Error Budget Remaining",
"type": "gauge",
"gridPos": {"x": 6, "y": 0, "w": 6, "h": 6},
"targets": [
{"expr": "slo:error_budget:remaining * 100", "legendFormat": "Budget %"}
],
"fieldConfig": {
"defaults": {
"min": 0,
"max": 100,
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": 0},
{"color": "yellow", "value": 10},
{"color": "green", "value": 30}
]
}
}
}
},
{
"title": "Burn Rate",
"type": "stat",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 3},
"targets": [
{"expr": "slo:error_budget:burn_rate", "legendFormat": "Current"}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 2},
{"color": "red", "value": 5}
]
}
}
}
},
{
"title": "Downtime Budget Used (30d)",
"type": "stat",
"gridPos": {"x": 18, "y": 0, "w": 6, "h": 3},
"targets": [
{"expr": "(1 - slo:availability:30d) * 30 * 24 * 60", "legendFormat": "Minutes Used"}
]
},
{
"title": "Availability Trend",
"type": "graph",
"gridPos": {"x": 0, "y": 6, "w": 12, "h": 8},
"targets": [
{
"expr": "slo:availability:1h",
"legendFormat": "1h Window"
},
{
"expr": "slo:availability:30d",
"legendFormat": "30d Window"
}
]
},
{
"title": "Error Budget Burn Down",
"type": "graph",
"gridPos": {"x": 12, "y": 6, "w": 12, "h": 8},
"targets": [
{
"expr": "slo:error_budget:remaining * 100",
"legendFormat": "Budget Remaining %"
}
]
},
{
"title": "Error Rate by Service",
"type": "graph",
"gridPos": {"x": 0, "y": 14, "w": 12, "h": 8},
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)",
"legendFormat": "{{ service }}"
}
]
},
{
"title": "Latency P99 by Service",
"type": "graph",
"gridPos": {"x": 12, "y": 14, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
"legendFormat": "{{ service }}"
}
]
}
]
}
}
6. SLO Report Generation
6.1 Automated Weekly Report
#!/usr/bin/env python3
# slo_report_generator.py - weekly SLO report generator
from datetime import datetime, timedelta
from typing import Dict, List

import requests


class SLOReportGenerator:
    """Weekly SLO report generator."""

    def __init__(self, prometheus_url: str, services: List[str]):
        self.prometheus_url = prometheus_url
        self.services = services
        self.slo_targets = {
            'availability': 0.999,
            'latency_p99': 0.5,  # 500 ms
            'error_rate': 0.001
        }

    def query_prometheus(self, query: str) -> Dict:
        """Run an instant query against the Prometheus HTTP API."""
        response = requests.get(
            f"{self.prometheus_url}/api/v1/query",
            params={'query': query},
            timeout=30
        )
        response.raise_for_status()
        return response.json()

    def query_value(self, query: str) -> float:
        """Return the first sample of an instant query; fail loudly if it is empty."""
        result = self.query_prometheus(query)['data']['result']
        if not result:
            raise ValueError(f"Prometheus returned no data for: {query}")
        return float(result[0]['value'][1])

    def calculate_availability(self, service: str, days: int = 7) -> Dict:
        """Compute the availability of one service."""
        query = (
            f'sum(rate(http_requests_total{{service="{service}",status!~"5.."}}[{days}d]))'
            f' / sum(rate(http_requests_total{{service="{service}"}}[{days}d]))'
        )
        availability = self.query_value(query)
        target = self.slo_targets['availability']
        return {
            'service': service,
            'availability': availability,
            'target': target,
            'met': availability >= target,
            'gap': availability - target
        }

    def calculate_error_budget(self, service: str, days: int = 7) -> Dict:
        """Compute error budget consumption for one service."""
        # Total requests
        total_requests = self.query_value(
            f'sum(increase(http_requests_total{{service="{service}"}}[{days}d]))'
        )
        # Failed requests
        error_requests = self.query_value(
            f'sum(increase(http_requests_total{{service="{service}",status=~"5.."}}[{days}d]))'
        )
        # Budget consumption
        allowed_errors = total_requests * (1 - self.slo_targets['availability'])
        budget_consumed = error_requests / allowed_errors if allowed_errors > 0 else 0
        return {
            'service': service,
            'total_requests': int(total_requests),
            'error_requests': int(error_requests),
            'allowed_errors': int(allowed_errors),
            'budget_consumed_percent': budget_consumed * 100,
            'budget_remaining_percent': (1 - budget_consumed) * 100
        }

    def generate_weekly_report(self) -> Dict:
        """Build the weekly report."""
        end = datetime.now()
        report = {
            'period': {
                'start': (end - timedelta(days=7)).isoformat(),
                'end': end.isoformat()
            },
            'services': []
        }
        for service in self.services:
            service_report = {
                'name': service,
                'availability': self.calculate_availability(service),
                'error_budget': self.calculate_error_budget(service),
                'incidents': [],  # to be filled from the alerting system
                'recommendations': []
            }
            # Generate recommendations
            gap = service_report['availability']['gap']
            if gap < 0:
                service_report['recommendations'].append(
                    f"Availability is {abs(gap) * 100:.3f} percentage points below target; "
                    "investigate and remediate"
                )
            if service_report['error_budget']['budget_remaining_percent'] < 20:
                service_report['recommendations'].append(
                    "Less than 20% of the error budget remains; consider pausing feature releases"
                )
            report['services'].append(service_report)
        return report

    def format_markdown_report(self, report: Dict) -> str:
        """Render the report as Markdown."""
        md = "# Weekly SLO Report\n\n"
        md += f"**Reporting period**: {report['period']['start']} to {report['period']['end']}\n\n"
        md += "## Service Overview\n\n"
        md += "| Service | Availability | Target | Status | Budget remaining |\n"
        md += "|---------|--------------|--------|--------|------------------|\n"
        for service in report['services']:
            status = "✅" if service['availability']['met'] else "❌"
            md += f"| {service['name']} | "
            md += f"{service['availability']['availability'] * 100:.3f}% | "
            md += f"{service['availability']['target'] * 100:.3f}% | "
            md += f"{status} | "
            md += f"{service['error_budget']['budget_remaining_percent']:.1f}% |\n"
        md += "\n## Details\n\n"
        for service in report['services']:
            md += f"### {service['name']}\n\n"
            md += f"- Total requests: {service['error_budget']['total_requests']:,}\n"
            md += f"- Failed requests: {service['error_budget']['error_requests']:,}\n"
            md += f"- Budget consumed: {service['error_budget']['budget_consumed_percent']:.1f}%\n"
            if service['recommendations']:
                md += "\n**Recommendations**:\n"
                for rec in service['recommendations']:
                    md += f"- {rec}\n"
            md += "\n"
        return md


# Usage example
if __name__ == "__main__":
    generator = SLOReportGenerator(
        prometheus_url="http://prometheus:9090",
        services=["api-gateway", "order-service", "payment-service"]
    )
    report = generator.generate_weekly_report()
    print(generator.format_markdown_report(report))
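7. Best Practices
7.1 Principles for Setting SLOs
| Principle | Description |
|---|---|
| Start small | Begin with the core indicators of core services |
| Iterate | Adjust targets based on historical data |
| Prioritize | Hold critical paths to stricter SLOs |
| User-centric | Define SLIs from the user's point of view |
| Achievable | SLOs should be challenging but attainable |
7.2 Error Budget Policy
| Budget remaining | Recommended action |
|---|---|
| > 50% | Release cadence can be accelerated |
| 20-50% | Release normally; keep an eye on reliability |
| 10-20% | Pause new features; focus on stability |
| 0-10% | Halt all releases; invest fully in reliability work |
| < 0% | SLO breached; emergency reliability fixes |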
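The policy table translates directly into a lookup that a release pipeline could consult. This is a sketch only: `release_policy` and its return values are hypothetical, and the handling of the exact boundaries (10, 20, 50) is an assumption, since the table does not say which row they belong to.

```python
def release_policy(budget_remaining_percent: float) -> str:
    """Map the remaining error budget (in percent) to a release stance."""
    if budget_remaining_percent < 0:
        return "emergency"        # SLO breached: emergency reliability fixes
    if budget_remaining_percent < 10:
        return "halt_releases"    # stop all releases, reliability work only
    if budget_remaining_percent < 20:
        return "pause_features"   # pause new features, focus on stability
    if budget_remaining_percent <= 50:
        return "normal"           # release normally, watch reliability
    return "accelerate"           # budget to spare: releases can speed up
```

A deployment gate could, for instance, refuse to proceed unless the result is `"normal"` or `"accelerate"`.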
8. References
- Google SRE Book, "Service Level Objectives": https://sre.google/sre-book/service-level-objectives/
- Google SRE Workbook, "Implementing SLOs": https://sre.google/workbook/implementing-slos/
- OpenSLO: https://github.com/OpenSLO/OpenSLO
- SLO Generator: https://github.com/google/slo-generator
Document version: 1.0 | Last updated: 2024-01-15 | Applies to: Prometheus + Grafana