📦 Chaos Test Designer

v1.0.0

Design chaos engineering experiments to test system resilience. Generate failure injection scenarios, define steady-state hypotheses, blast radius controls,...


Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install chaos-test-designer
Mirror: npx clawhub@latest install chaos-test-designer --registry https://cn.longxiaskill.com

Skill documentation

Chaos Test Designer

Design chaos engineering experiments that safely test your system's resilience. Define steady-state hypotheses, inject controlled failures (service crashes, network partitions, resource exhaustion, dependency outages), measure impact, and produce runnable experiment definitions for Chaos Monkey, Litmus, Gremlin, or plain scripts.

Use when: "design chaos test", "test system resilience", "what happens if this service dies", "failure injection", "game day planning", "chaos engineering", "test our fallbacks", or before declaring a service production-ready.

Commands

  • design: Create a Chaos Experiment

Step 1: Understand the system

    # Discover services and dependencies
    kubectl get deployments -A 2>/dev/null | grep -v kube-system
    docker compose config --services 2>/dev/null
    # Or read architecture docs
    find . -maxdepth 3 -name "*.md" | xargs grep -li "architecture\|topology\|dependency" 2>/dev/null

Map the dependency graph:

  • Which services call which?
  • What are the single points of failure?
  • Where are the circuit breakers, retries, fallbacks?
  • What external dependencies exist (databases, caches, queues, third-party APIs)?

Step 2: Define Steady-State Hypothesis

Before breaking anything, define what "normal" looks like:

Steady-State Hypothesis

  • Homepage loads in < 500ms (p95)
  • API error rate < 0.1%
  • Orders processed within 30 seconds of submission
  • Background job backlog < 100 items
  • All health check endpoints return 200

This is the baseline you'll compare against during the experiment.
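The hypothesis above can be encoded as data so that the same check runs before and after the experiment. The following is a minimal sketch in Python, not part of the skill itself: the `Threshold` class, the metric names (`homepage_p95_ms` and so on), and the idea of passing measurements in as a dictionary are all assumptions made for illustration. In practice the values would come from your monitoring system.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical encoding of the steady-state hypothesis listed above.
# Metric names are illustrative assumptions, not names the skill defines.
@dataclass(frozen=True)
class Threshold:
    metric: str
    op: str       # one of "<", "<=", "=="
    limit: float

OPS = {"<": operator.lt, "<=": operator.le, "==": operator.eq}

STEADY_STATE = [
    Threshold("homepage_p95_ms", "<", 500),
    Threshold("api_error_rate_pct", "<", 0.1),
    Threshold("order_processing_s", "<=", 30),
    Threshold("job_backlog_items", "<", 100),
    Threshold("unhealthy_endpoints", "==", 0),
]

def violations(measurements: Dict[str, float]) -> List[str]:
    """Return the metrics that break the hypothesis.

    A metric that is missing from `measurements` counts as a violation,
    because an unobserved system cannot be called steady.
    """
    failed = []
    for t in STEADY_STATE:
        value = measurements.get(t.metric)
        if value is None or not OPS[t.op](value, t.limit):
            failed.append(t.metric)
    return failed
```

Calling `violations(...)` on the baseline measurements should return an empty list before any failure is injected; a non-empty list means the system is not in a steady state and the experiment should not start.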

Step 3: Select Failure Mode

Common failure modes, ordered by increasing severity:

Level 1: Service Failures (start here)

  • Kill a single pod/container instance
  • Restart a service with delay
  • Reduce replica count to 1

Level 2: Network Failures

  • Add latency (100ms, 500ms, 2000ms) to inter-service calls
  • Drop 10% of packets to a specific service
  • DNS resolution failures
  • Block traffic to a specific dependency

Level 3: Resource Exhaustion

  • Fill disk to 95%
  • Consume all available memory (OOM scenarios)
  • Saturate CPU
  • Exhaust database connection pool
  • Fill message queue to capacity

Level 4: Dependency Failures

  • External API returns 500 for all requests
  • Database becomes read-only
  • Cache becomes unavailable
  • Message broker stops accepting messages

Level 5: Infrastructure Failures (advanced)

  • Availability zone failure (kill all resources in one AZ)
  • Region failover
  • Complete network partition between services

Step 4: Generate Experiment Definition

    # Chaos Experiment: [Service] [Failure Type]
    # Generated by chaos-test-designer

    experiment:
      name: "payment-service-pod-kill"
      description: "Kill payment service pod to verify retry logic and circuit breaker"
      steady_state:
        - probe: http
          url: "http://payment-service:8080/health"
          expect_status: 200
        - probe: prometheus
          query: "rate(http_requests_total{service='payment',status='5xx'}[1m])"
          expect: "< 0.01"
      method:
        - action: kill-pod
          target:
            namespace: production
            label_selector: "app=payment-service"
          count: 1
      rollback:
        - action: scale
          target:
            namespace: production
            deployment: payment-service
          replicas: 3
      controls:
        blast_radius: "single pod in production"
        duration: "5 minutes"
        abort_conditions:
          - "error_rate > 5%"
          - "p99_latency > 10s"
        business_hours_only: true
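The `abort_conditions` entries in the definition above are plain strings, so whatever runs the experiment has to parse and evaluate them while the failure is active. The sketch below shows one way this could work in Python. It is an assumption about how a runner might interpret the strings, not the behavior of any particular tool; `parse_condition` and `should_abort` are hypothetical helpers, and the metrics dictionary is assumed to carry values in the same unit as the condition (percent for `error_rate`, seconds for `p99_latency`).

```python
import operator
import re

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

# Matches "<metric> <op> <number><optional unit>", e.g. "error_rate > 5%".
_PATTERN = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*([\d.]+)\s*(%|ms|s)?\s*$")

def parse_condition(text):
    """Split a condition string into (metric, op, limit, unit)."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unsupported abort condition: {text!r}")
    metric, op, number, unit = match.groups()
    return metric, op, float(number), unit or ""

def should_abort(conditions, metrics):
    """Return the first condition that fires, or None if none do.

    A metric that is absent from `metrics` does not trigger an abort here.
    """
    for text in conditions:
        metric, op, limit, _unit = parse_condition(text)
        value = metrics.get(metric)
        if value is not None and _OPS[op](value, limit):
            return text
    return None

if __name__ == "__main__":
    conditions = ["error_rate > 5%", "p99_latency > 10s"]
    print(should_abort(conditions, {"error_rate": 1.2, "p99_latency": 12.5}))
```

Ignoring a missing metric keeps the sketch short. A stricter runner might instead abort whenever it cannot observe a metric, since losing visibility during an experiment is itself a risk.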

For Kubernetes (Litmus):

    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: payment-chaos
    spec:
      appinfo:
        appns: production
        applabel: app=payment-service
      chaosServiceAccount: litmus-admin
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                - name: TOTAL_CHAOS_DURATION
                  value: "300"
                - name: CHAOS_INTERVAL
                  value: "60"
                - name: FORCE
                  value: "false"

For plain bash:

    #!/usr/bin/env bash
    # Chaos: Kill payment-service pod
    # Expects PROMETHEUS to hold the Prometheus base URL.
    set -euo pipefail

    echo "📊 Capturing steady state..."
    BASELINE_ERROR_RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
      'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
      python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
    echo "Baseline error rate: $BASELINE_ERROR_RATE"

    echo "💥 Injecting failure: killing one payment-service pod..."
    POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
    # kubectl only accepts --grace-period=0 together with --force
    kubectl delete pod "$POD" --grace-period=0 --force

    echo "⏱️ Observing for 5 minutes..."
    sleep 300

    echo "📊 Measuring impact..."
    POST_ERROR_RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
      'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
      python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
    echo "Post-chaos error rate: $POST_ERROR_RATE"

    echo "✅ Verifying recovery..."
    kubectl get pods -l app=payment-service
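The inline `python3 -c` snippet in the script reads `['result'][0]`, which raises an `IndexError` when Prometheus returns an empty result (for example, when no 5xx series exists yet) and ignores any series after the first. A slightly more defensive version of that parsing step, plus a simple pass/fail comparison, might look like the sketch below. The function names and the `tolerance` parameter are hypothetical, and summing across all returned series is a deliberate change from the script's behavior rather than something the original specifies.

```python
import json

def error_rate_from_response(body: str) -> float:
    """Sum the sample values of a Prometheus instant-query response.

    An empty result vector is treated as a rate of 0.0 instead of an error.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    samples = payload["data"]["result"]
    # Each sample's "value" is a [timestamp, "<number as string>"] pair.
    return float(sum(float(sample["value"][1]) for sample in samples))

def verdict(baseline: float, post: float, tolerance: float = 0.01) -> str:
    """Compare rates; `tolerance` is an assumed allowed absolute increase."""
    return "PASS" if post - baseline <= tolerance else "FAIL"
```

In the bash script, the same logic could replace the one-liner by saving this file and piping the `curl` output into a small wrapper that calls `error_rate_from_response(sys.stdin.read())`.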

  • gameday: Plan a Game Day

Generate a full game day

Data source: ClawHub · Chinese localization: 龙虾技能库