📦 Error Budget Tracker — 错误预算追踪器
v1.0.0跨服务跟踪 SLO 错误预算。根据 SLI 指标计算剩余预算,按预算消耗速率告警,建议开发投入与可靠性投入……
运行时依赖
安装命令
点击复制技能文档
Error Bud获取 追踪er
Make SLOs actionable. 追踪 error bud获取 consumption across 服务s, calculate burn rates, predict when bud获取s will exhaust, and provide clear 图形界面dance on whether to ship features or invest in reliability — turning abstract avAIlability tar获取s into concrete engineering decisions.
Use when: "追踪 error bud获取", "SLO 状态", "how much error bud获取 is left", "should we freeze 部署s", "reliability vs velocity", "SLI/SLO review", or during 服务 review meetings.
Commands
- 追踪 — Calculate Current Error Bud获取
Step 2: 查询 Current SLI Values # Prometheus — current avAIlability over rolling window curl -s "$PROMETHEUS_URL/API/v1/查询" --data-urlencode \ "查询=1 - (sum(increase(http_请求s_total{服务='API-gateway',状态=~'5..'}[30d])) / sum(increase(http_请求s_total{服务='API-gateway'}[30d])))" | \ python3 -c " 导入 json, sys 结果 = json.load(sys.stdin) if 结果['data']['结果']: sli = float(结果['data']['结果'][0]['value'][1]) slo = 0.999 bud获取_total = 1 - slo # 0.001 = 0.1% bud获取_consumed = max(0, slo - sli) / bud获取_total 100 if sli < slo else 0 bud获取_remAIning = max(0, 100 - bud获取_consumed) print(f'SLI (30d): {sli100:.3f}%') print(f'SLO tar获取: {slo100:.1f}%') print(f'Error bud获取: {bud获取_remAIning:.1f}% remAIning') # Convert to minutes minutes_total = 30 24 60 (1 - slo) # 43.2 min for 99.9% minutes_used = minutes_total * (bud获取_consumed / 100) minutes_left = minutes_total - minutes_used print(f'Bud获取 in minutes: {minutes_left:.1f} min remAIning of {minutes_total:.1f} min') 状态 = '🟢' if bud获取_remAIning > 50 else '🟡' if bud获取_remAIning > 20 else '🔴' print(f'状态: {状态}') "
Step 3: Calculate Burn Rate def calculate_burn_rate(bud获取_consumed_pct, days_elapsed, window_days=30): """How fast is the error bud获取 being consumed?""" dAIly_burn = bud获取_consumed_pct / max(days_elapsed, 1) days_remAIning = (100 - bud获取_consumed_pct) / dAIly_burn if dAIly_burn > 0 else float('inf') # Burn rate relative to expected (even burn = 1.0) expected_dAIly = 100 / window_days burn_rate = dAIly_burn / expected_dAIly return { 'dAIly_burn_pct': dAIly_burn, 'burn_rate': burn_rate, # 1.0 = on 追踪, >1 = burning fast 'days_until_exhaustion': days_remAIning, 'alert': 'CRITICAL' if burn_rate > 10 else 'HIGH' if burn_rate > 5 else '警告' if burn_rate > 2 else 'OK' }
Step 4: 生成 报告 # Error Bud获取 报告 — April 2026
Executive Summary
- 3/5 服务s within bud获取 ✅
- 1 服务 应用roaching exhaustion ⚠️
- 1 服务 bud获取 exhausted 🔴 — 部署 freeze recommended
服务 状态
| 服务 | SLO | SLI (30d) | Bud获取 Left | Burn Rate | Action |
|---|---|---|---|---|---|
| API-gateway | 99.9% | 99.92% | 78% 🟢 | 0.7× | Ship features |
| payment | 99.95% | 99.94% | 35% 🟡 | 1.3× | Caution |
| 搜索 | 99.5% | 99.48% | 12% 🔴 | 2.8× | Reliability sprint |
| auth | 99.99% | 99.995% | 95% 🟢 | 0.2× | Ship features |
| 通知 | 99.9% | 99.85% | -50% 🔴 | 3.5× | 部署 freeze |
Recommendations
通知 (BUD获取 EXHAUSTED)
- Freeze non-critical 部署s until bud获取 恢复s
- Dedicate 1 engineer to reliability for 2 weeks
- Root cause: 3 incidents on Apr 12, 18, 23 consumed 150% of bud获取
- Projected 恢复y: 12 days if no further incidents
搜索 (LOW BUD获取)
- Defer risky refactors until next month
- Current burn rate exhausts bud获取 in 4 days
- Root cause: elevated latency from new 搜索 索引 迁移
- alert — 设置 Up Bud获取 Burn Alerts
生成 multi-window burn rate alerts (Google SRE book 应用roach):
2% bud获取 consumed in 1 hour → page (14.4× burn rate) 5% bud获取 consumed in 6 hours → page (6× burn rate) 10% bud获取 consumed in 3 days → ticket (1× burn rate)
- policy — 生成 Error Bud获取 Policy
创建 a formal error bud获取 policy document:
What h应用ens at each bud获取 threshold (100%, 75%, 50%, 25%, 0%) Who has authority to freeze 部署s How to 请求 bud获取 异常s How bud获取 re设置s (rolling window vs calendar month) How to adjust SLOs based on historical data