📦 Error Budget Tracker

v1.0.0

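Track SLO error budgets across services. Calculate the remaining budget from SLI metrics, alert on budget burn rate, and recommend how to split effort between feature development and reliability…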


by @charlie-morrison



Runtime dependencies

No special dependencies

Install command


Official: npx clawhub@latest install error-budget-tracker

Mirror: npx clawhub@latest install error-budget-tracker --registry https://cn.longxiaskill.com


Skill documentation

Error Budget Tracker

Make SLOs actionable. Track error budget consumption across services, calculate burn rates, predict when budgets will be exhausted, and provide clear guidance on whether to ship features or invest in reliability, turning abstract availability targets into concrete engineering decisions.

Use when: "track error budget", "SLO status", "how much error budget is left", "should we freeze deploys", "reliability vs velocity", "SLI/SLO review", or during service review meetings.

Commands

track: Calculate Current Error Budget

Step 1: Define SLOs

# SLO definitions (store in repo as slo.yaml)
services:
  api-gateway:
    slos:
      - name: Availability
        target: 99.9%  # 43.8 min downtime budget per average month (43.2 min over 30d)
        sli: "1 - (sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])))"
        window: 30d  # Rolling 30-day window
      - name: Latency P99
        target: 99%  # 99% of requests under 500ms
        sli: "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) < 0.5"
        window: 30d
  payment-service:
    slos:
      - name: Availability
        target: 99.95%  # 21.9 min downtime budget per average month (21.6 min over 30d)
        sli: "..."
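The downtime figures quoted in the SLO comments follow from simple arithmetic: the window length times the allowed error fraction. The sketch below is only an illustration; the function name `downtime_budget_minutes` is hypothetical and not part of the skill. It assumes the budget is spent entirely as full downtime.

```python
def downtime_budget_minutes(target_pct: float, window_days: float = 30.0) -> float:
    """Minutes of full downtime allowed by an availability target over a window.

    Hypothetical helper for illustration; not part of the skill itself.
    """
    if not 0.0 < target_pct <= 100.0:
        raise ValueError("target_pct must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - target_pct / 100.0)


if __name__ == "__main__":
    # Compare a rolling 30-day window with an average calendar month.
    average_month_days = 365.25 / 12
    for target in (99.9, 99.95):
        rolling = downtime_budget_minutes(target)
        monthly = downtime_budget_minutes(target, average_month_days)
        print(f"{target}%: {rolling:.1f} min over 30d, {monthly:.1f} min per average month")
```

The two window choices differ slightly, which is why a 99.9% target is quoted as both 43.2 and 43.8 minutes depending on whether a 30-day window or an average month is meant.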

Step 2: Query Current SLI Values

# Prometheus: current availability over the rolling 30-day window
curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
  "query=1 - (sum(increase(http_requests_total{service='api-gateway',status=~'5..'}[30d])) / sum(increase(http_requests_total{service='api-gateway'}[30d])))" | \
python3 -c "
import json, sys

result = json.load(sys.stdin)
if result['data']['result']:
    sli = float(result['data']['result'][0]['value'][1])
    slo = 0.999
    budget_total = 1 - slo  # 0.001 = 0.1%
    # Share of the budget spent: observed error rate over allowed error rate
    budget_consumed = (1 - sli) / budget_total * 100
    budget_remaining = 100 - budget_consumed  # negative means the budget is overspent
    print(f'SLI (30d): {sli * 100:.3f}%')
    print(f'SLO target: {slo * 100:.1f}%')
    print(f'Error budget: {budget_remaining:.1f}% remaining')
    # Convert to minutes
    minutes_total = 30 * 24 * 60 * (1 - slo)  # 43.2 min for 99.9%
    minutes_used = minutes_total * (budget_consumed / 100)
    minutes_left = minutes_total - minutes_used
    print(f'Budget in minutes: {minutes_left:.1f} min remaining of {minutes_total:.1f} min')
    status = '🟢' if budget_remaining > 50 else '🟡' if budget_remaining > 20 else '🔴'
    print(f'Status: {status}')
"
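For use outside a shell pipeline, the same budget arithmetic can be wrapped in a small function. This is a minimal sketch, assuming the conventional definition of consumed budget as observed error rate divided by allowed error rate, (1 - SLI) / (1 - SLO). The name `error_budget_status` and the returned keys are hypothetical; the 50% and 20% status cut-offs are taken from the script above.

```python
def error_budget_status(sli: float, slo: float) -> dict:
    """Summarise error budget usage for one SLO.

    sli and slo are fractions, e.g. 0.9985 and 0.999.
    Hypothetical helper for illustration; not part of the skill itself.
    """
    if not 0.0 <= slo < 1.0:
        raise ValueError("slo must be in [0, 1); a 100% target leaves no budget")
    budget_total = 1.0 - slo  # allowed error fraction
    consumed_pct = (1.0 - sli) / budget_total * 100.0
    remaining_pct = 100.0 - consumed_pct  # negative means overspent
    if remaining_pct > 50:
        status = "green"
    elif remaining_pct > 20:
        status = "yellow"
    else:
        status = "red"
    return {
        "consumed_pct": consumed_pct,
        "remaining_pct": remaining_pct,
        "status": status,
    }


if __name__ == "__main__":
    # A 99.85% SLI against a 99.9% SLO overspends the budget by half.
    print(error_budget_status(sli=0.9985, slo=0.999))
```

Leaving `remaining_pct` unclamped lets an overspent budget show up as a negative value rather than being hidden at zero.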

Step 3: Calculate Burn Rate

def calculate_burn_rate(budget_consumed_pct, days_elapsed, window_days=30):
    """How fast is the error budget being consumed?"""
    daily_burn = budget_consumed_pct / max(days_elapsed, 1)
    # Clamp at 0 so an already exhausted budget does not report negative days
    days_remaining = max(0.0, (100 - budget_consumed_pct) / daily_burn) if daily_burn > 0 else float('inf')
    # Burn rate relative to expected (even burn = 1.0)
    expected_daily = 100 / window_days
    burn_rate = daily_burn / expected_daily
    return {
        'daily_burn_pct': daily_burn,
        'burn_rate': burn_rate,  # 1.0 = on track, >1 = burning fast
        'days_until_exhaustion': days_remaining,
        'alert': 'CRITICAL' if burn_rate > 10 else 'HIGH' if burn_rate > 5 else 'WARNING' if burn_rate > 2 else 'OK'
    }
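To turn a burn rate into a calendar date for review meetings, the days-until-exhaustion figure can be added to the current date. The sketch below assumes a constant daily burn going forward, which real incident patterns rarely follow; `project_exhaustion` is a hypothetical name, not part of the skill.

```python
from datetime import date, timedelta
from typing import Optional


def project_exhaustion(
    budget_consumed_pct: float, days_elapsed: float, today: date
) -> Optional[date]:
    """Estimate the date the error budget runs out at the current average burn.

    Returns None when nothing has been consumed, and today when the budget
    is already exhausted. Hypothetical helper for illustration only.
    """
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    daily_burn = budget_consumed_pct / days_elapsed
    if daily_burn <= 0:
        return None  # no burn observed, so no projected exhaustion
    days_left = max(0.0, (100.0 - budget_consumed_pct) / daily_burn)
    # Adding a timedelta to a date uses whole days only.
    return today + timedelta(days=days_left)


if __name__ == "__main__":
    # 40% consumed after 10 days means 4% per day, so 15 more days remain.
    print(project_exhaustion(40.0, 10, date(2026, 4, 10)))
```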

Step 4: Generate Report

# Error Budget Report (April 2026)

Executive Summary

3/5 services within budget ✅
1 service approaching exhaustion ⚠️
1 service budget exhausted 🔴: deploy freeze recommended

Service Status

Service	SLO	SLI (30d)	Budget Left	Burn Rate	Action
api-gateway	99.9%	99.92%	78% 🟢	0.7×	Ship features
payment	99.95%	99.94%	35% 🟡	1.3×	Caution
search	99.5%	99.48%	12% 🔴	2.8×	Reliability sprint
auth	99.99%	99.995%	95% 🟢	0.2×	Ship features
notifications	99.9%	99.85%	-50% 🔴	3.5×	Deploy freeze

Recommendations

notifications (BUDGET EXHAUSTED)

Freeze non-critical deploys until budget recovers
Dedicate 1 engineer to reliability for 2 weeks
Root cause: 3 incidents on Apr 12, 18, 23 consumed 150% of budget
Projected recovery: 12 days if no further incidents

search (LOW BUDGET)

Defer risky refactors until next month
Current burn rate exhausts budget in 4 days
Root cause: elevated latency from new search index migration
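The Action column of a report like this can be filled in mechanically from the remaining budget. The following is a minimal sketch: the function name `recommend_action` is hypothetical, and the cut-offs (50%, 20%, 0%) are assumptions borrowed from the status colours used when tracking the budget, not thresholds the skill is known to apply.

```python
def recommend_action(budget_left_pct: float) -> str:
    """Map remaining error budget (percent) to a suggested action.

    Thresholds are assumptions for illustration; adjust them to match
    the team's own error budget policy.
    """
    if budget_left_pct <= 0:
        return "Deploy freeze"
    if budget_left_pct <= 20:
        return "Reliability sprint"
    if budget_left_pct <= 50:
        return "Caution"
    return "Ship features"


if __name__ == "__main__":
    for service, left in [("api-gateway", 78), ("payment", 35),
                          ("search", 12), ("notifications", -50)]:
        print(f"{service}: {left}% left, {recommend_action(left)}")
```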

alert: Set Up Budget Burn Alerts

Generate multi-window burn rate alerts (Google SRE Workbook approach):

2% budget consumed in 1 hour → page (14.4× burn rate)
5% budget consumed in 6 hours → page (6× burn rate)
10% budget consumed in 3 days → ticket (1× burn rate)
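The burn rate multipliers above can be derived rather than memorised: a burn rate is the fraction of budget consumed, scaled by how many alert windows fit into the SLO window. The sketch below assumes a 30-day SLO window; both function names are hypothetical and not part of the skill.

```python
def burn_rate_threshold(
    budget_fraction: float, alert_window_hours: float, slo_window_days: float = 30.0
) -> float:
    """Burn rate at which budget_fraction of the budget is spent in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_rate_threshold(burn_rate: float, slo: float) -> float:
    """Error rate (as a fraction) that corresponds to a burn rate for a given SLO."""
    return burn_rate * (1.0 - slo)


if __name__ == "__main__":
    # (budget fraction, alert window in hours) for the three alert tiers
    for fraction, hours in [(0.02, 1), (0.05, 6), (0.10, 72)]:
        rate = burn_rate_threshold(fraction, hours)
        limit = error_rate_threshold(rate, slo=0.999)
        print(f"{fraction:.0%} in {hours}h: burn rate {rate:.1f}x, "
              f"error rate above {limit:.2%} for a 99.9% SLO")
```

For a 99.9% SLO, the 14.4× tier corresponds to an error rate above 1.44% sustained over the one-hour window.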

policy: Generate Error Budget Policy

Create a formal error budget policy document:

What happens at each budget threshold (100%, 75%, 50%, 25%, 0%)
Who has authority to freeze deploys
How to request budget exceptions
How budget resets (rolling window vs calendar month)
How to adjust SLOs based on historical data

Data source: ClawHub · Chinese localization: 龙虾技能库