📦 Chaos Test Designer
v1.0.0
Design chaos engineering experiments to test system resilience. Generate failure injection scenarios, define steady-state hypotheses, blast radius controls,...
Chaos Test Designer
Design chaos engineering experiments that safely test your system's resilience. Define steady-state hypotheses, inject controlled failures (service crashes, network partitions, resource exhaustion, dependency outages), measure impact, and produce runnable experiment definitions for Chaos Monkey, Litmus, Gremlin, or plain scripts.
Use when: "design chaos test", "test system resilience", "what happens if this service dies", "failure injection", "game day planning", "chaos engineering", "test our fallbacks", or before declaring a service production-ready.
Commands
- design: Create Chaos Experiment
Step 1: Map the Dependency Graph
- Which services call which?
- What are the single points of failure?
- Where are the circuit breakers, retries, and fallbacks?
- What external dependencies exist (databases, caches, queues, third-party APIs)?
Step 2: Define Steady-State Hypothesis
Before breaking anything, define what "normal" looks like:
Steady-State Hypothesis
- Homepage loads in < 500ms (p95)
- API error rate < 0.1%
- Orders processed within 30 seconds of submission
- Background job backlog < 100 items
- All health check endpoints return 200
This is the baseline you'll compare against during the experiment.
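As an illustration, the hypothesis above can be written down as data and checked mechanically before and after an experiment. This is a minimal sketch, not part of the skill itself: the metric names (`homepage_p95_ms`, `api_error_rate`, and so on) and the `violations` helper are hypothetical, and it assumes you already collect these measurements somewhere and can pass them in as a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Threshold:
    """One steady-state condition: a metric name, a predicate, and a label."""
    metric: str
    holds: Callable[[float], bool]
    description: str


# Limits come from the hypothesis above; the metric names are hypothetical.
STEADY_STATE: List[Threshold] = [
    Threshold("homepage_p95_ms", lambda v: v < 500, "Homepage p95 < 500 ms"),
    Threshold("api_error_rate", lambda v: v < 0.001, "API error rate < 0.1%"),
    Threshold("order_processing_s", lambda v: v <= 30, "Orders processed within 30 s"),
    Threshold("job_backlog", lambda v: v < 100, "Background job backlog < 100"),
    Threshold("unhealthy_endpoints", lambda v: v == 0, "All health checks return 200"),
]


def violations(measured: Dict[str, float]) -> List[str]:
    """Return the label of every condition that fails or has no measurement."""
    failed = []
    for threshold in STEADY_STATE:
        value = measured.get(threshold.metric)
        # A missing measurement counts as a violation: the baseline is unknown.
        if value is None or not threshold.holds(value):
            failed.append(threshold.description)
    return failed
```

An empty list from `violations(...)` before injection means the system is in its steady state and the experiment can start. The same call after the experiment shows which parts of the hypothesis did not hold.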
Step 3: Select Failure Mode
Common failure modes, ordered from least to most severe:
Level 1: Service Failures (start here)
- Kill a single pod/container instance
- Restart a service with delay
- Reduce replica count to 1
Level 2: Network Failures
- Add latency (100ms, 500ms, 2000ms) to inter-service calls
- Drop 10% of packets to a specific service
- DNS resolution failures
- Block traffic to a specific dependency
Level 3: Resource Exhaustion
- Fill disk to 95%
- Consume all available memory (OOM scenarios)
- Saturate CPU
- Exhaust database connection pool
- Fill message queue to capacity
Level 4: Dependency Failures
- External API returns 500 for all requests
- Database becomes read-only
- Cache becomes unavailable
- Message broker stops accepting messages
Level 5: Infrastructure Failures (advanced)
- Availability zone failure (kill all resources in one AZ)
- Region failover
- Complete network partition between services
Step 4: Generate Experiment Definition
# Chaos Experiment: [Service] [Failure Type]
# Generated by chaos-test-designer
experiment:
  name: "payment-service-pod-kill"
  description: "Kill payment service pod to verify retry logic and circuit breaker"
  steady_state:
    - probe: http
      url: "http://payment-service:8080/health"
      expect_status: 200
    - probe: prometheus
      query: "rate(http_requests_total{service='payment',status='5xx'}[1m])"
      expect: "< 0.01"
  method:
    - action: kill-pod
      target:
        namespace: production
        label_selector: "app=payment-service"
        count: 1
  rollback:
    - action: scale
      target:
        namespace: production
        deployment: payment-service
        replicas: 3
  controls:
    blast_radius: "single pod in production"
    duration: "5 minutes"
    abort_conditions:
      - "error_rate > 5%"
      - "p99_latency > 10s"
    business_hours_only: true
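The abort conditions in the definition above are plain strings, so whatever runs the experiment has to interpret them. The following is one possible sketch of that step, not the behavior of any particular chaos tool. It assumes conditions always have the form `<metric> <operator> <number><unit>`, that error rates are supplied as fractions (0.05 for 5%), and that latencies are supplied in seconds. The function name `should_abort` is hypothetical.

```python
import operator
import re
from typing import Dict, List

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
# Assumed unit conventions: percentages become fractions, durations become seconds.
_UNIT_SCALE = {"%": 0.01, "ms": 0.001, "s": 1.0, "": 1.0}
_PATTERN = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)\s*(%|ms|s)?\s*$")


def should_abort(conditions: List[str], metrics: Dict[str, float]) -> List[str]:
    """Return the abort conditions that the current metrics satisfy."""
    triggered = []
    for condition in conditions:
        match = _PATTERN.match(condition)
        if match is None:
            raise ValueError(f"unsupported abort condition: {condition!r}")
        name, op, number, unit = match.groups()
        limit = float(number) * _UNIT_SCALE[unit or ""]
        # Metrics that were not measured are skipped in this sketch;
        # a stricter runner might treat a missing metric as a reason to abort.
        if name in metrics and _OPS[op](metrics[name], limit):
            triggered.append(condition)
    return triggered
```

A runner could call `should_abort(["error_rate > 5%", "p99_latency > 10s"], current_metrics)` on each polling interval and start the rollback as soon as the returned list is non-empty.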
For Kubernetes (Litmus):
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: FORCE
              value: "false"
For plain bash:
#!/usr/bin/env bash
# Chaos: Kill payment-service pod
set -euo pipefail

# PROMETHEUS must hold the Prometheus base URL; fail early with a clear message if unset.
: "${PROMETHEUS:?Set PROMETHEUS to the Prometheus base URL}"

echo "📊 Capturing steady state..."
BASELINE_ERROR_RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
  'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
  python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
echo "Baseline error rate: $BASELINE_ERROR_RATE"

echo "💥 Injecting failure: killing one payment-service pod..."
POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" --grace-period=0

echo "⏱️ Observing for 5 minutes..."
sleep 300

echo "📊 Measuring impact..."
POST_ERROR_RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
  'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
  python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
echo "Post-chaos error rate: $POST_ERROR_RATE"

echo "✅ Verifying recovery..."
kubectl get pods -l app=payment-service
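The python3 one-liner in the script indexes `['result'][0]` directly, which raises an exception when the query matches no series (for example, when no 5xx responses were recorded in the window). A slightly more defensive parser might look like the sketch below. It relies only on the standard shape of a Prometheus instant-query response. The function name `first_sample_value` is hypothetical, and returning `None` for an empty result is an assumption about how you want to treat "no data".

```python
import json
from typing import Optional


def first_sample_value(response_body: str) -> Optional[float]:
    """Extract the first sample from a Prometheus instant-query response.

    Returns None when the query succeeded but matched no series.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each vector sample is [timestamp, "value-as-string"].
    return float(result[0]["value"][1])
```

Saved as a small script that reads standard input, this could replace the inline `python3 -c` call, with the caller deciding whether `None` should be read as a zero error rate or as missing data.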
- gameday: Plan a Game Day
Generate a full game day