📦 Canary Deployment Analyzer
v1.0.0 Analyze canary deployments by comparing metrics between canary and baseline. Provide data-driven promotion/rollback recommendations based on error rates, lat...
Canary Deployment Analyzer
Analyze canary deployments to decide whether to promote or roll back. Compare error rates, latency distributions, business metrics, and log patterns between canary and baseline populations, then give a data-driven recommendation.
Use when: "analyze canary", "should we promote this canary", "compare canary metrics", "canary vs baseline", "is this deploy safe to promote", "canary health check", or during progressive delivery decisions.
Commands
- analyze — Full Canary Analysis
Identify the metrics source (Prometheus, Datadog, CloudWatch, custom):

    # Prometheus query examples
    # Error rate: canary vs stable
    curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
      'query=sum(rate(http_requests_total{status=~"5..",deployment="canary"}[5m])) / sum(rate(http_requests_total{deployment="canary"}[5m]))' | \
      python3 -c "import json,sys;r=json.load(sys.stdin);print(f'Canary error rate: {r[\"data\"][\"result\"][0][\"value\"][1] if r[\"data\"][\"result\"] else \"no data\"}')"

    # Same for baseline
    curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
      'query=sum(rate(http_requests_total{status=~"5..",deployment="stable"}[5m])) / sum(rate(http_requests_total{deployment="stable"}[5m]))' | \
      python3 -c "import json,sys;r=json.load(sys.stdin);print(f'Baseline error rate: {r[\"data\"][\"result\"][0][\"value\"][1] if r[\"data\"][\"result\"] else \"no data\"}')"

    # Latency p50/p95/p99
    for q in 50 95 99; do
      curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
        "query=histogram_quantile(0.${q}, sum(rate(http_request_duration_seconds_bucket{deployment=\"canary\"}[5m])) by (le))"
    done

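If Prometheus is not available, check for:

- Datadog: `curl -s -G "https://api.datadoghq.com/api/v1/query" -H "DD-API-KEY: $DD_API_KEY" --data-urlencode "query=avg:http.request.duration{deployment:canary}"` (the endpoint also expects `from` and `to` timestamps and an application key)
- CloudWatch: `aws cloudwatch get-metric-statistics --namespace MyApp --metric-name ErrorRate --dimensions Name=deployment,Value=canary` (plus the required `--start-time`, `--end-time`, `--period`, and `--statistics` arguments)
- Application logs: parse error counts from structured logs

Step 2: Statistical Comparison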
For each metric, calculate:
- Absolute difference: canary_value - baseline_value
- Relative change: (canary - baseline) / baseline × 100%
- Statistical significance: for rates, use a two-proportion z-test; for latencies, use Welch's t-test, or Mann-Whitney U if the distributions are skewed
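
The rate comparison above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the analyzer's actual implementation: the function name `compare_rates` and the request counts in the usage note are hypothetical, and it covers only the two-proportion z-test for rates (the latency tests would need raw samples and a statistics library).

```python
import math

def compare_rates(canary_errors, canary_total, baseline_errors, baseline_total):
    """Return (absolute diff, relative change in %, z statistic, two-sided p-value)."""
    p_canary = canary_errors / canary_total
    p_baseline = baseline_errors / baseline_total

    abs_diff = p_canary - p_baseline
    if p_baseline > 0:
        rel_change = abs_diff / p_baseline * 100.0
    else:
        # Relative change is undefined against a zero baseline.
        rel_change = float("inf") if abs_diff > 0 else 0.0

    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (canary_errors + baseline_errors) / (canary_total + baseline_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / baseline_total))
    z = abs_diff / std_err if std_err > 0 else 0.0

    # Two-sided p-value from the standard normal distribution:
    # 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return abs_diff, rel_change, z, p_value
```

For example, `compare_rates(30, 10_000, 200, 190_000)` compares a hypothetical canary with 30 errors in 10,000 requests against a baseline with 200 errors in 190,000 requests. A small p-value suggests the difference is unlikely to be noise, but with low canary traffic the test has little power, so a large p-value does not prove the canary is safe.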
Decision thresholds (configurable):
- Error rate increase > 0.1% absolute OR > 10% relative → FAIL
- p95 latency increase > 50ms OR > 15% relative → WARNING
- p99 latency increase > 200ms OR > 25% relative → FAIL
- Business metric (conversion, throughput) decrease > 5% → WARNING

Step 3: Log Analysis

    # Compare error log patterns
    # Canary errors
    kubectl logs -l deployment=canary --since=1h 2>/dev/null | grep -i "error\|exception\|panic\|fatal" | \
      sort | uniq -c | sort -rn | head -20

    # Baseline errors
    kubectl logs -l deployment=stable --since=1h 2>/dev/null | grep -i "error\|exception\|panic\|fatal" | \
      sort | uniq -c | sort -rn | head -20

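
Raw log lines rarely match exactly between canary and baseline because of timestamps and IDs, so comparing the two outputs by eye is error-prone. The sketch below shows one possible way to compare them programmatically. It assumes that replacing every digit run with `N` is an adequate normalization, which is a simplification, and the helper names (`normalize`, `error_patterns`, `new_in_canary`) are hypothetical.

```python
import re
from collections import Counter

# Same keywords as the grep filter, matched case-insensitively.
KEYWORDS = re.compile(r"error|exception|panic|fatal", re.IGNORECASE)

def normalize(line):
    """Collapse digit runs so timestamps and IDs do not split one error type into many."""
    return re.sub(r"\d+", "N", line.strip())

def error_patterns(lines):
    """Count normalized error lines."""
    return Counter(normalize(line) for line in lines if KEYWORDS.search(line))

def new_in_canary(canary_lines, baseline_lines):
    """Return error patterns (with counts) that occur in the canary but never in the baseline."""
    canary = error_patterns(canary_lines)
    baseline = error_patterns(baseline_lines)
    return {pattern: count for pattern, count in canary.most_common() if pattern not in baseline}
```

The inputs can be the lines of `kubectl logs` output saved to files, for example `new_in_canary(open("canary.log"), open("stable.log"))`, where both file names are placeholders.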
Look for:
- New error types in canary that don't appear in baseline (strongest signal)
- Error rate spike in existing error types
- Timeout patterns or connection refused (infrastructure issues vs code issues)

Step 4: Generate Verdict

# Canary Analysis Report
Verdict: PROMOTE / ROLLBACK / HOLD
Metrics Comparison (last 30 min)
| Metric | Baseline | Canary | Delta | Status |
|---|---|---|---|---|
| Error rate | 0.12% | 0.14% | +0.02% | ✅ Pass |
| p50 latency | 45ms | 48ms | +3ms | ✅ Pass |
| p95 latency | 180ms | 210ms | +30ms | ✅ Pass |
| p99 latency | 450ms | 620ms | +170ms | ⚠️ Warning |
| Throughput | 1200 rps | 1180 rps | -1.7% | ✅ Pass |
New Errors in Canary
NullPointerException in UserService.getProfile (23 occurrences)
Traffic Split
- Canary: 5% (60 rps)
- Baseline: 95% (1140 rps)
- Observation window: 30 min (sufficient for 5% traffic)
Recommendation
[PROMOTE] Metrics within acceptable thresholds. p99 latency elevated but within warning range. Monitor p99 closely after full promotion. Investigate the NullPointerException: non-blocking but should be tracked.

- thresholds — Configure Promotion Criteria
Help define canary promotion thresholds based on SLOs:
- If the team has SLOs → derive thresholds from the remaining error budget
- If no SLOs → suggest industry defaults (99.9% availability = 0.1% error budget)
- Generate a config file for Argo Rollouts, Flagger, or a custom canary controller
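
The error-budget arithmetic behind these options can be sketched as follows. The budget calculation itself (1 minus the availability SLO) follows from the 99.9% example above. The `spend_share` parameter is a hypothetical policy knob, not something the analyzer defines: it caps how much of the remaining budget a single canary may consume.

```python
def error_budget(slo_availability):
    """Allowed error fraction for an availability SLO, e.g. 0.999 -> roughly 0.001."""
    if not 0.0 < slo_availability < 1.0:
        raise ValueError("slo_availability must be strictly between 0 and 1")
    return 1.0 - slo_availability

def remaining_budget(slo_availability, total_requests, failed_requests):
    """Fraction of the SLO window's error budget still unspent (negative if overspent)."""
    allowed_failures = error_budget(slo_availability) * total_requests
    return (allowed_failures - failed_requests) / allowed_failures

def canary_error_threshold(slo_availability, remaining_fraction, spend_share=0.1):
    """Maximum absolute error-rate increase tolerated for the canary.

    spend_share is an assumed policy: the share of the remaining budget
    one canary is allowed to burn.
    """
    remaining_fraction = max(0.0, min(1.0, remaining_fraction))
    return error_budget(slo_availability) * remaining_fraction * spend_share
```

With a 99.9% SLO, one million requests in the window, and 400 failures so far, `remaining_budget(0.999, 1_000_000, 400)` is about 0.6, and `canary_error_threshold(0.999, 0.6)` allows an increase of about 0.006 percentage points. When the budget is exhausted, the threshold drops to zero, which argues for holding deployments rather than promoting.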
- progressive — Design Progressive Delivery Strategy
Given a service profile (traffic volume, criticality, deployment frequency), recommend:

- Traffic split stages (1% → 5% → 25% → 50% → 100%)
- Observation window per stage
- Automated vs manual promotion gates
- Rollback trigger conditions
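
One way to turn traffic volume into an observation window per stage is to require a minimum number of canary requests before each promotion gate. The sketch below uses the stage percentages listed above. The `min_requests` and `min_window_minutes` defaults are illustrative assumptions, not recommendations from this document, and the `Stage` and `plan_stages` names are hypothetical.

```python
import math
from dataclasses import dataclass

STAGES_PERCENT = (1, 5, 25, 50, 100)  # traffic split stages from the list above

@dataclass
class Stage:
    percent: int
    canary_rps: float
    window_minutes: int

def plan_stages(total_rps, min_requests=50_000, min_window_minutes=10):
    """Size each stage's observation window so the canary sees at least min_requests.

    min_requests and min_window_minutes are assumed defaults; tune them to the
    service's criticality and deployment frequency.
    """
    if total_rps <= 0:
        raise ValueError("total_rps must be positive")
    stages = []
    for percent in STAGES_PERCENT:
        canary_rps = total_rps * percent / 100
        # Minutes needed to collect min_requests at this stage's traffic share.
        minutes = math.ceil(min_requests / canary_rps / 60)
        stages.append(Stage(percent, canary_rps, max(minutes, min_window_minutes)))
    return stages
```

Calling `plan_stages(1200)` for a service at 1200 rps yields longer windows for the early, low-traffic stages and falls back to the minimum window once the canary receives enough traffic. Low-volume services get proportionally longer windows, which is usually the point where a manual gate becomes more practical than an automated one.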