📦 Canary Deployment Analyzer

v1.0.0

Analyze canary deployments by comparing metrics between canary and baseline. Provide data-driven promotion/rollback recommendations based on error rates, lat...


Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install canary-deployment-analyzer
Mirror: npx clawhub@latest install canary-deployment-analyzer --registry https://cn.longxiaskill.com

Skill documentation

Canary Deployment Analyzer

Analyze canary deployments to decide whether to promote or roll back. Compare error rates, latency distributions, business metrics, and log patterns between canary and baseline populations, then give a data-driven recommendation.

Use when: "analyze canary", "should we promote this canary", "compare canary metrics", "canary vs baseline", "is this deploy safe to promote", "canary health check", or during progressive delivery decisions.

Commands

  • analyze: Full Canary Analysis
Step 1: Collect Metrics

Identify the metrics source (Prometheus, Datadog, CloudWatch, custom):

# Prometheus query examples
# Error rate: canary vs stable
curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
  'query=sum(rate(http_requests_total{status=~"5..",deployment="canary"}[5m])) / sum(rate(http_requests_total{deployment="canary"}[5m]))' | \
  python3 -c "import json,sys;r=json.load(sys.stdin);print(f'Canary error rate: {r[\"data\"][\"result\"][0][\"value\"][1] if r[\"data\"][\"result\"] else \"no data\"}')"

# Same for baseline
curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
  'query=sum(rate(http_requests_total{status=~"5..",deployment="stable"}[5m])) / sum(rate(http_requests_total{deployment="stable"}[5m]))' | \
  python3 -c "import json,sys;r=json.load(sys.stdin);print(f'Baseline error rate: {r[\"data\"][\"result\"][0][\"value\"][1] if r[\"data\"][\"result\"] else \"no data\"}')"

# Latency p50/p95/p99
for q in 50 95 99; do
  curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
    "query=histogram_quantile(0.${q}, sum(rate(http_request_duration_seconds_bucket{deployment=\"canary\"}[5m])) by (le))"
done
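The inline python3 one-liners above print the raw string value and do not distinguish an empty result from a NaN (which PromQL returns when the canary has received no traffic and the ratio is 0/0). A minimal sketch of a more defensive parser follows, assuming the standard Prometheus instant-query response shape; the function name `instant_value` is hypothetical and not part of the skill.

```python
import json


def instant_value(response_text):
    """Return the first sample of an instant-query response as a float, or None.

    Assumes the usual vector response:
    {"status": "success", "data": {"result": [{"value": [ts, "0.0014"]}]}}
    """
    body = json.loads(response_text)
    if body.get("status") != "success":
        return None
    result = body.get("data", {}).get("result", [])
    if not result:
        return None  # no series matched the query
    value = float(result[0]["value"][1])
    # NaN compares unequal to itself; treat it as "no data", not as zero errors
    return None if value != value else value
```

It could be used by piping the curl output into a small script and calling `instant_value(sys.stdin.read())`; a `None` return is a signal to wait for more traffic rather than to treat the canary as healthy.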

If no Prometheus, check for:

  • Datadog: curl -s "https://api.datadoghq.com/api/v1/query" -H "DD-API-KEY: $DD_API_KEY" --data-urlencode "query=avg:http.request.duration{deployment:canary}"
  • CloudWatch: aws cloudwatch get-metric-statistics --namespace MyApp --metric-name ErrorRate --dimensions Name=deployment,Value=canary
  • Application logs: parse error counts from structured logs

Step 2: Statistical Comparison

For each metric, calculate:

  • Absolute difference: canary_value - baseline_value
  • Relative change: (canary - baseline) / baseline × 100%
  • Statistical significance: for rates, use a two-proportion z-test; for latencies, use Welch's t-test, or Mann-Whitney U if distributions are skewed
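For error rates, the three quantities above can be computed from raw request counts. The following is a minimal sketch using only the standard library, assuming you have error and total request counts for each population over the same window; the function name and the returned dictionary keys are illustrative choices, not part of the skill.

```python
import math


def two_proportion_z_test(canary_errors, canary_total, baseline_errors, baseline_total):
    """Compare two error rates with a pooled two-proportion z-test (two-sided)."""
    p_c = canary_errors / canary_total
    p_b = baseline_errors / baseline_total
    abs_diff = p_c - p_b
    # Relative change is undefined when the baseline rate is zero
    rel_change_pct = None if p_b == 0 else abs_diff / p_b * 100.0

    pooled = (canary_errors + baseline_errors) / (canary_total + baseline_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / baseline_total))
    if se == 0:
        z, p_value = 0.0, 1.0  # both populations are all-success or all-failure
    else:
        z = abs_diff / se
        # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
        p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"abs_diff": abs_diff, "rel_change_pct": rel_change_pct, "z": z, "p_value": p_value}
```

With a small canary share, the canary sample is much smaller than the baseline sample, so a difference that looks large in relative terms may still have a high p-value; that is usually a reason to extend the observation window rather than to promote.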

Decision thresholds (configurable):

  • Error rate increase > 0.1% absolute OR > 10% relative → FAIL
  • p95 latency increase > 50ms OR > 15% relative → WARN
  • p99 latency increase > 200ms OR > 25% relative → FAIL
  • Business metric (conversion, throughput) decrease > 5% → WARN

Step 3: Log Analysis

# Compare error log patterns
# Canary errors
kubectl logs -l deployment=canary --since=1h 2>/dev/null | grep -i "error\|exception\|panic\|fatal" | \
  sort | uniq -c | sort -rn | head -20

# Baseline errors
kubectl logs -l deployment=stable --since=1h 2>/dev/null | grep -i "error\|exception\|panic\|fatal" | \
  sort | uniq -c | sort -rn | head -20
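Because `sort | uniq -c` groups only byte-identical lines, timestamps and request IDs tend to make every error look unique. One way to diff the two outputs is sketched below, assuming the raw log lines from both commands have been saved and that replacing numbers with a placeholder is an acceptable normalization; that normalization rule and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Same keywords as the grep filter, matched case-insensitively
ERROR_RE = re.compile(r"error|exception|panic|fatal", re.IGNORECASE)


def signature(line):
    """Collapse variable parts (hex ids, numbers) so similar errors group together."""
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "N", line).strip()


def error_counts(lines):
    """Count error lines by normalized signature."""
    return Counter(signature(line) for line in lines if ERROR_RE.search(line))


def new_in_canary(canary_lines, baseline_lines):
    """Return {signature: count} for error signatures seen only in the canary."""
    canary = error_counts(canary_lines)
    baseline = error_counts(baseline_lines)
    return {sig: n for sig, n in canary.most_common() if sig not in baseline}
```

A non-empty result from `new_in_canary` is a candidate list of new error types, ordered by frequency.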

Look for:

  • New error types in canary that don't appear in baseline (strongest signal)
  • Error rate spike in existing error types
  • Timeout patterns or connection refused (infrastructure issues vs code issues)

Step 4: Generate Verdict

# Canary Analysis Report

Verdict: PROMOTE / ROLLBACK / HOLD

Metrics Comparison (last 30 min)

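| Metric      | Baseline | Canary   | Delta  | Status  |
|-------------|----------|----------|--------|---------|
| Error rate  | 0.12%    | 0.14%    | +0.02% | ✅ Pass |
| p50 latency | 45ms     | 48ms     | +3ms   | ✅ Pass |
| p95 latency | 180ms    | 210ms    | +30ms  | ✅ Pass |
| p99 latency | 450ms    | 620ms    | +170ms | ⚠️ Warn |
| Throughput  | 1200 rps | 1180 rps | -1.7%  | ✅ Pass |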

New Errors in Canary

  • NullPointerException in UserService.getProfile (23 occurrences)
→ Not present in baseline, likely a regression

Traffic Split

  • Canary: 5% (60 rps)
  • Baseline: 95% (1140 rps)
  • Observation window: 30 min (sufficient for 5% traffic)

Recommendation

[PROMOTE] Metrics within acceptable thresholds. p99 latency elevated but within WARN range. Monitor p99 closely after full promotion. Investigate the NullPointerException (non-blocking, but should be tracked).
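The decision thresholds from Step 2 can be applied mechanically before the report is written. The sketch below is one possible encoding, assuming metrics are passed as plain dictionaries with the hypothetical keys `error_rate_pct`, `p95_ms`, `p99_ms`, and `conversion_pct`. The mapping from statuses to a verdict (any FAIL means ROLLBACK, otherwise PROMOTE with warnings listed) is also an assumption; HOLD is left to human judgment.

```python
def relative_pct(baseline, canary):
    """Relative change in percent; infinite if the baseline is zero and the canary is not."""
    if baseline == 0:
        return float("inf") if canary > 0 else 0.0
    return (canary - baseline) / baseline * 100.0


def grade(baseline, canary, abs_limit, rel_limit_pct, level):
    """Return `level` if the increase exceeds either the absolute or the relative limit."""
    exceeded = (canary - baseline) > abs_limit or relative_pct(baseline, canary) > rel_limit_pct
    return level if exceeded else "PASS"


def evaluate(baseline, canary):
    """Apply the default thresholds and derive a verdict (assumed mapping, see lead-in)."""
    statuses = {
        "error_rate": grade(baseline["error_rate_pct"], canary["error_rate_pct"], 0.1, 10, "FAIL"),
        "p95": grade(baseline["p95_ms"], canary["p95_ms"], 50, 15, "WARN"),
        "p99": grade(baseline["p99_ms"], canary["p99_ms"], 200, 25, "FAIL"),
        # Business metrics are graded on decrease, not increase
        "conversion": "WARN"
        if relative_pct(baseline["conversion_pct"], canary["conversion_pct"]) < -5
        else "PASS",
    }
    verdict = "ROLLBACK" if "FAIL" in statuses.values() else "PROMOTE"
    return verdict, statuses
```

Because each threshold is an OR of an absolute and a relative limit, a metric can fail on the relative limit alone even when the absolute delta looks small.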

  • thresholds: Configure Promotion Criteria

Help define canary promotion thresholds based on SLOs:

  • If team has SLOs → derive thresholds from error budget remaining
  • If no SLOs → suggest industry defaults (99.9% availability = 0.1% error budget)
  • Generate a config file for Argo Rollouts, Flagger, or custom canary controller
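As a rough illustration of deriving a threshold from the remaining error budget, the sketch below computes how much of the budget is unspent in the current SLO window and scales the tolerated canary error rate accordingly. The scaling policy (tolerated rate = budget × remaining fraction) and both function names are assumptions made for this example; teams may prefer a different rule.

```python
def error_budget_remaining(slo_target_pct, total_requests, failed_requests):
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (100.0 - slo_target_pct) / 100.0
    if allowed_failures == 0:
        return 0.0  # a 100% SLO or an empty window leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


def canary_error_threshold_pct(slo_target_pct, remaining_fraction):
    """Tolerated canary error rate in percent, scaled by the remaining budget (assumed policy)."""
    budget_pct = 100.0 - slo_target_pct
    return budget_pct * max(0.0, remaining_fraction)
```

For a 99.9% SLO with 1,000,000 requests and 250 failures in the window, about 75% of the budget remains, so this policy would tolerate a canary error rate of roughly 0.075%.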

  • progressive: Design Progressive Delivery Strategy

Given a service profile (traffic volume, criticality, deployment frequency), recommend:

  • Traffic split stages (1% → 5% → 25% → 50% → 100%)
  • Observation window per stage
  • Automated vs manual promotion gates
  • Rollback trigger conditions
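One way to size the observation window per stage is to require a minimum number of canary requests before each promotion gate. The sketch below follows that idea; the `min_requests` and `min_minutes` defaults are placeholder values chosen for illustration and should be tuned per service.

```python
import math


def plan_stages(total_rps, stages_pct=(1, 5, 25, 50, 100), min_requests=10_000, min_minutes=5):
    """Return one entry per stage with the canary traffic and a suggested window in minutes."""
    if total_rps <= 0:
        raise ValueError("total_rps must be positive")
    plan = []
    for pct in stages_pct:
        canary_rps = total_rps * pct / 100.0
        # Minutes needed to observe min_requests at this stage, with a fixed floor
        needed = math.ceil(min_requests / canary_rps / 60.0)
        plan.append({
            "traffic_pct": pct,
            "canary_rps": canary_rps,
            "observe_min": max(min_minutes, needed),
        })
    return plan
```

For a service at 1200 rps, the 1% stage sees 12 rps and needs about 14 minutes to reach 10,000 requests, while later stages fall back to the 5-minute floor. Low-traffic services will get much longer early stages, which is the point of sizing by sample count rather than by wall-clock time.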

Data source: ClawHub · Chinese localization: 龙虾技能库