📦 dead-letter-queue-analyzer — Dead Letter Queue Analyzer
v1.0.0 — Analyze dead letter queue (DLQ) messages to identify failure patterns, root causes, and remediation strategies. Supports AWS SQS, RabbitMQ, Kafka, Azure Serv...
Dead Letter Queue Analyzer
Stop ignoring your dead letter queue. Analyze DLQ messages to find failure patterns, identify root causes, determine which messages are replayable, and generate remediation plans — turning your DLQ from a black hole into an actionable error stream.
Use when: "analyze DLQ", "dead letter queue growing", "why are messages failing", "replay failed messages", "DLQ backlog", "message processing failures", or when unprocessed messages accumulate.
Commands
- analyze — Categorize DLQ Messages
AWS SQS:
```bash
aws sqs receive-message \
  --queue-url "$DLQ_URL" \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All | python3 -c "
import json, sys
msgs = json.load(sys.stdin).get('Messages', [])
for m in msgs:
    body = json.loads(m['Body']) if m['Body'].startswith('{') else m['Body']
    attrs = m.get('Attributes', {})
    print(f'ID: {m[\"MessageId\"]}')
    print(f'  Received count: {attrs.get(\"ApproximateReceiveCount\", \"?\")}')
    print(f'  First received: {attrs.get(\"ApproximateFirstReceiveTimestamp\", \"?\")}')
    print(f'  Body preview: {str(body)[:200]}')
    print()
"
```
```bash
# Count total DLQ depth
aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
  --attribute-names ApproximateNumberOfMessages | python3 -c "
import json, sys
attrs = json.load(sys.stdin)['Attributes']
print(f'DLQ depth: {attrs[\"ApproximateNumberOfMessages\"]} messages')
"
```
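`ApproximateFirstReceiveTimestamp` comes back as epoch milliseconds; a tiny helper (a sketch, not part of the skill) turns it into a message age for staleness checks:

```python
def message_age_hours(first_receive_ts_ms, now_ms):
    """Age of a DLQ message given SQS's epoch-millisecond timestamps."""
    return (int(now_ms) - int(first_receive_ts_ms)) / 3_600_000

# A message first received one hour before "now" is 1.0 hours old
age = message_age_hours("1700000000000", 1700003600000)
```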
RabbitMQ:
```bash
# List DLQ queues
rabbitmqctl list_queues name messages | grep -i "dead\|dlq\|error"

# Peek at messages
rabbitmqadmin get queue="dead_letter_queue" count=10 2>/dev/null
```
Kafka:
```bash
# Read from DLT (dead letter topic)
kafka-console-consumer --bootstrap-server $KAFKA_BROKER \
  --topic "$DLT_TOPIC" --from-beginning --max-messages 20 \
  --property print.headers=true --property print.timestamp=true
```
Step 2: Classify Failure Causes
Group DLQ messages by failure reason:
| Category | Signal | Replayable? | Action |
|---|---|---|---|
| Schema error | Validation failure, missing field | After fix | Fix producer or consumer schema |
| Timeout | Processing exceeded deadline | Yes | Increase timeout or optimize processing |
| Dependency down | Connection refused, 503 | Yes | Wait for recovery, then replay |
| Poison message | Crash/exception on processing | No | Fix handler, then replay |
| Data integrity | FK violation, duplicate key | Maybe | Fix data, then replay |
| Permissions | Auth error, access denied | After fix | Fix credentials, then replay |
| Deserialization | Invalid JSON/Protobuf/Avro | No | Discard or fix producer |

```python
# Group messages by error pattern
from collections import Counter

errors = Counter()
for msg in dlq_messages:
    # Extract error reason from message attributes or headers
    error = msg.get('error_reason', msg.get('x-death-reason', 'unknown'))
    errors[error] += 1

for error, count in errors.most_common(10):
    print(f'{count:>5}x {error}')
```
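The grouping snippet assumes a `dlq_messages` list is already in scope from one of the reads above. A self-contained run against a few hypothetical messages (the field names and values here are illustrative, not from any real queue) looks like:

```python
from collections import Counter

# Hypothetical DLQ messages; real ones come from the SQS/RabbitMQ/Kafka reads
dlq_messages = [
    {'error_reason': 'timeout'},
    {'error_reason': 'timeout'},
    {'x-death-reason': 'rejected'},
    {},  # no failure reason recorded at all
]

errors = Counter()
for msg in dlq_messages:
    errors[msg.get('error_reason', msg.get('x-death-reason', 'unknown'))] += 1

for error, count in errors.most_common(10):
    print(f'{count:>5}x {error}')
```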
Step 3: Generate Report

# DLQ Analysis Report
Summary
- Queue: orders-processing-dlq
- Total messages: 1,247
- Oldest message: 3 days ago
- Growth rate: ~400/day (increasing)
Failure Categories
| Category | Count | % | Replayable | Root Cause |
|---|---|---|---|---|
| Timeout | 823 | 66% | ✅ | DB slow queries since Tuesday deployment |
| Schema error | 312 | 25% | ✅ (after fix) | New field `currency` not in consumer schema |
| Poison message | 67 | 5% | ❌ | NullPointer in price calculation |
| Permissions | 45 | 4% | ✅ (after fix) | Expired service account token |
Root Cause
Primary: DB slow queries causing processing timeouts (66% of failures)
- Started: Tuesday 14:30 UTC (correlates with deployment)
- Impact: 823 orders stuck in DLQ
Remediation Plan
- Fix DB performance — add missing index on orders.status (immediate)
- Replay timeout messages (823) — safe, operations are idempotent
- Update consumer schema to accept `currency` field (312 messages)
- Rotate service account token (45 messages)
- Fix NullPointer in OrderPriceCalculator.java:67 (67 messages — investigate first)
- Set up DLQ depth alerting (threshold: 50 messages)
- replay — Generate Replay Script
```bash
# Or selective replay (only timeout errors)
# Read, filter, re-send
```
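A selective-replay sketch in Python, assuming SQS via boto3. The `DLQ_URL`/`MAIN_URL` environment variables, the `FailureCategory` message attribute, and the helper names are all hypothetical; adapt them to whatever failure metadata your consumers actually record:

```python
import os

def is_replayable(msg, category='Timeout'):
    """True if the DLQ message's recorded failure category matches."""
    attrs = msg.get('MessageAttributes', {})
    return attrs.get('FailureCategory', {}).get('StringValue') == category

def replay_batch(messages, send, delete, category='Timeout'):
    """Re-send matching messages via send(), then delete() them from the DLQ."""
    replayed = 0
    for msg in messages:
        if is_replayable(msg, category):
            send(msg['Body'])
            delete(msg['ReceiptHandle'])
            replayed += 1
    return replayed

if __name__ == '__main__':
    import boto3  # only needed for the live wiring below
    sqs = boto3.client('sqs')
    dlq_url, main_url = os.environ['DLQ_URL'], os.environ['MAIN_URL']
    resp = sqs.receive_message(QueueUrl=dlq_url, MaxNumberOfMessages=10,
                               MessageAttributeNames=['All'])
    n = replay_batch(
        resp.get('Messages', []),
        send=lambda body: sqs.send_message(QueueUrl=main_url, MessageBody=body),
        delete=lambda rh: sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=rh),
    )
    print(f'Replayed {n} messages')
```

Deleting only after a successful re-send keeps the operation safe to rerun, which is why the report above flags idempotency before replaying.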
- monitor — Set Up DLQ Alerts
Generate a CloudWatch alarm / Prometheus alert for DLQ depth:
- Alert when DLQ depth > 0 (any message is a signal)
- Alert when growth rate > N/hour (active problem)
- Alert when oldest message > 24h (messages going stale)
- Dashboard showing DLQ depth over time + categorization
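The conditions above can be sketched as a single evaluation function; the threshold defaults and alert names here are assumptions, not part of the skill:

```python
def dlq_alerts(depth, growth_per_hour, oldest_age_hours,
               growth_threshold=50, stale_hours=24):
    """Return which of the DLQ alert conditions currently fire."""
    alerts = []
    if depth > 0:
        alerts.append('non-empty')          # any message is a signal
    if growth_per_hour > growth_threshold:  # active problem
        alerts.append('growing')
    if oldest_age_hours > stale_hours:      # messages going stale
        alerts.append('stale')
    return alerts
```

Feeding in the report's numbers (1,247 deep, ~400/day, oldest 3 days) fires the non-empty and stale alerts but not the growth one at the default threshold.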
- prevent — Improve Message Handling
Recommend changes to prevent future DLQ accumulation:
- Add retry with backoff before sending to DLQ
- Add idempotency keys for safe replay
- Add dead letter reason headers for faster triage
- Add message TTL to prevent infinite accumulation
- Add schema validation before publishing (catch at source)
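Two of those recommendations — idempotency keys and pre-publish schema validation — fit in a few lines. The required-field set below is a hypothetical example, not a real schema:

```python
import hashlib
import json

REQUIRED_FIELDS = {'order_id', 'currency', 'amount'}  # hypothetical schema

def idempotency_key(payload):
    """Stable key from the payload contents, so replays can be deduplicated."""
    canonical = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def missing_fields(payload):
    """Fields the schema requires but the payload lacks (empty = publishable)."""
    return sorted(REQUIRED_FIELDS - payload.keys())
```

Hashing a canonical (key-sorted) serialization means the same logical payload always produces the same key, regardless of dict ordering at the producer.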