📦 dead-letter-queue-analyzer — Dead Letter Queue Analyzer

v1.0.0

Analyze dead letter queue (DLQ) messages to identify failure patterns, root causes, and remediation strategies. Supports AWS SQS, RabbitMQ, Kafka, Azure Serv...


Runtime Dependencies

No special dependencies

Install Command

Official: npx clawhub@latest install dead-letter-queue-analyzer
Mirror (accelerated): npx clawhub@latest install dead-letter-queue-analyzer --registry https://cn.longxiaskill.com

Skill Documentation

Dead Letter Queue Analyzer

Stop ignoring your dead letter queue. Analyze DLQ messages to find failure patterns, identify root causes, determine which messages are replayable, and generate remediation plans — turning your DLQ from a black hole into an actionable error stream.

Use when: "analyze DLQ", "dead letter queue growing", "why are messages failing", "replay failed messages", "DLQ backlog", "message processing failures", or when unprocessed messages accumulate.

Commands

  • analyze — Categorize DLQ Messages
Step 1: Read DLQ Messages

AWS SQS:

aws sqs receive-message \
  --queue-url "$DLQ_URL" \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All | python3 -c "
import json, sys
msgs = json.load(sys.stdin).get('Messages', [])
for m in msgs:
    body = json.loads(m['Body']) if m['Body'].startswith('{') else m['Body']
    attrs = m.get('Attributes', {})
    print(f'ID: {m[\"MessageId\"]}')
    print(f'  Received count: {attrs.get(\"ApproximateReceiveCount\", \"?\")}')
    print(f'  First received: {attrs.get(\"ApproximateFirstReceiveTimestamp\", \"?\")}')
    print(f'  Body preview: {str(body)[:200]}')
    print()
"

# Count total DLQ depth
aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
  --attribute-names ApproximateNumberOfMessages | python3 -c "
import json, sys
attrs = json.load(sys.stdin)['Attributes']
print(f'DLQ depth: {attrs[\"ApproximateNumberOfMessages\"]} messages')
"
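The same response parsing can be kept as a standalone, testable helper. This is a minimal sketch; `summarize_dlq_messages` and the sample payload are hypothetical, while the keys (`Messages`, `Body`, `Attributes`, `ApproximateReceiveCount`) are the real SQS response fields.

```python
import json

def summarize_dlq_messages(response):
    """Return (message_id, receive_count, body_preview) per message
    from an SQS ReceiveMessage response dict."""
    rows = []
    for m in response.get('Messages', []):
        attrs = m.get('Attributes', {})
        body = m.get('Body', '')
        rows.append((
            m.get('MessageId', '?'),
            int(attrs.get('ApproximateReceiveCount', 0)),
            body[:200],  # preview only, like the CLI one-liner above
        ))
    return rows

# Made-up sample payload for illustration
sample = {
    'Messages': [{
        'MessageId': 'abc-123',
        'Body': json.dumps({'order_id': 42}),
        'Attributes': {'ApproximateReceiveCount': '5'},
    }]
}
print(summarize_dlq_messages(sample))
```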

RabbitMQ:

# List DLQ queues
rabbitmqctl list_queues name messages | grep -i "dead\|dlq\|error"

# Peek at messages
rabbitmqadmin get queue="dead_letter_queue" count=10 2>/dev/null

Kafka:

# Read from DLT (dead letter topic)
kafka-console-consumer --bootstrap-server $KAFKA_BROKER \
  --topic "$DLT_TOPIC" --from-beginning --max-messages 20 \
  --property print.headers=true --property print.timestamp=true

Step 2: Classify Failure Causes

Group DLQ messages by failure reason:

| Category | Signal | Replayable? | Action |
| --- | --- | --- | --- |
| Schema error | Validation failure, missing field | After fix | Fix producer or consumer schema |
| Timeout | Processing exceeded deadline | Yes | Increase timeout or optimize processing |
| Dependency down | Connection refused, 503 | Yes | Wait for recovery, then replay |
| Poison message | Crash/exception on processing | No | Fix handler, then replay |
| Data integrity | FK violation, duplicate key | Maybe | Fix data, then replay |
| Permissions | Auth error, access denied | After fix | Fix credentials, then replay |
| Deserialization | Invalid JSON/Protobuf/Avro | No | Discard or fix producer |

# Group messages by error pattern
from collections import Counter

errors = Counter()
for msg in dlq_messages:
    # Extract error reason from message attributes or headers
    error = msg.get('error_reason', msg.get('x-death-reason', 'unknown'))
    errors[error] += 1

for error, count in errors.most_common(10):
    print(f'{count:>5}x {error}')
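Raw error strings can then be mapped to the table's categories with a small rule-based matcher. A minimal sketch — the substring patterns are illustrative assumptions, tune them to your actual error payloads:

```python
# Substring patterns per category; purely illustrative examples.
CATEGORY_PATTERNS = [
    ('schema error', ('validation failed', 'missing field')),
    ('timeout', ('timeout', 'deadline exceeded')),
    ('dependency down', ('connection refused', '503')),
    ('permissions', ('access denied', 'unauthorized')),
    ('deserialization', ('invalid json', 'malformed')),
]

def categorize(error_reason):
    """Return the first category whose pattern appears in the reason."""
    reason = error_reason.lower()
    for category, patterns in CATEGORY_PATTERNS:
        if any(p in reason for p in patterns):
            return category
    return 'unknown'

print(categorize('Connection refused by payments-db'))      # dependency down
print(categorize('ValidationError: missing field currency'))  # schema error
```

Feeding each message's `error_reason` through `categorize` before counting yields the per-category totals used in the report below.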

Step 3: Generate a Report

# DLQ Analysis Report

Summary

  • Queue: orders-processing-dlq
  • Total messages: 1,247
  • Oldest message: 3 days ago
  • Growth rate: ~400/day (increasing)
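The growth-rate figure in a summary like this can be derived from periodic depth samples. A minimal sketch, with made-up sample data chosen to match the report's numbers:

```python
def growth_per_day(samples):
    """samples: list of (hour, depth) pairs, oldest first.
    Returns average growth in messages per day."""
    (t0, d0), (t1, d1) = samples[0], samples[-1]
    return (d1 - d0) / (t1 - t0) * 24

# Hypothetical depth samples taken every 12 hours over 3 days
samples = [(0, 47), (12, 240), (24, 451), (36, 648),
           (48, 852), (60, 1047), (72, 1247)]
print(f'~{growth_per_day(samples):.0f}/day')  # ~400/day
```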

Failure Categories

| Category | Count | % | Replayable | Root Cause |
| --- | --- | --- | --- | --- |
| Timeout | 823 | 66% | ✅ | DB slow queries since Tuesday deploy |
| Schema error | 312 | 25% | ✅ (after fix) | New field currency not in consumer schema |
| Poison message | 67 | 5% | ❌ | NullPointer in price calculation |
| Permissions | 45 | 4% | ✅ (after fix) | Expired service account token |

Root Cause

Primary: DB slow queries causing processing timeouts (66% of failures)
  • Started: Tuesday 14:30 UTC (correlates with deploy)
  • Impact: 823 orders stuck in DLQ

Remediation Plan

  • Fix DB performance — add missing index on orders.status (immediate)
  • Replay timeout messages (823) — safe, operations are idempotent
  • Update consumer schema to accept currency field (312 messages)
  • Rotate service account token (45 messages)
  • Fix NullPointer in OrderPriceCalculator.java:67 (67 messages — investigate first)
  • Set up DLQ depth alert (threshold: 50 messages)
  • replay — Generate Replay Script

# SQS: move messages from DLQ back to main queue
aws sqs start-message-move-task \
  --source-arn "$DLQ_ARN" \
  --destination-arn "$MAIN_QUEUE_ARN" \
  --max-number-of-messages-per-second 10

# Or selective replay (only timeout errors)
# Read, filter, re-send
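The filter step of a selective replay can be sketched as a pure function: keep only messages whose failure category the analysis marked replayable. The message dicts and the `category` field are hypothetical; the re-send to the main queue would follow with your broker client of choice.

```python
# Categories marked "Yes" in the classification table above
REPLAYABLE = {'timeout', 'dependency down'}

def filter_replayable(messages):
    """Keep only messages safe to re-send to the main queue."""
    return [m for m in messages if m.get('category') in REPLAYABLE]

# Hypothetical pre-categorized batch
batch = [
    {'id': 'a', 'category': 'timeout'},
    {'id': 'b', 'category': 'poison message'},
    {'id': 'c', 'category': 'dependency down'},
]
print([m['id'] for m in filter_replayable(batch)])  # ['a', 'c']
```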

  • monitor — Set Up DLQ Alerts

Generate a CloudWatch alarm / Prometheus alert for DLQ depth:

  • Alert when DLQ depth > 0 (any message is a signal)
  • Alert when growth rate > N/hour (active problem)
  • Alert when oldest message > 24h (messages going stale)
  • Dashboard showing DLQ depth over time + categorization
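The three alert rules can be expressed as one pure check, kept separate from the monitoring backend. A sketch under stated assumptions — the alert names and the growth threshold are illustrative, and wiring to CloudWatch or Prometheus is left to the reader:

```python
def check_dlq_alerts(depth, growth_per_hour, oldest_age_hours,
                     growth_threshold=50):
    """Evaluate the three DLQ alert rules; returns triggered alert names."""
    alerts = []
    if depth > 0:
        alerts.append('dlq-not-empty')       # any message is a signal
    if growth_per_hour > growth_threshold:
        alerts.append('dlq-growing')         # active problem
    if oldest_age_hours > 24:
        alerts.append('dlq-stale-messages')  # messages going stale
    return alerts

# Roughly the numbers from the example report above
print(check_dlq_alerts(depth=1247, growth_per_hour=17, oldest_age_hours=72))
```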

  • prevent — Improve Message Handling

Recommend changes to prevent future DLQ accumulation:

  • Add retry with backoff before sending to DLQ
  • Add idempotency keys for safe replay
  • Add dead letter reason headers for faster triage
  • Add message TTL to prevent infinite accumulation
  • Add schema validation before publishing (catch at source)
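The first recommendation — retry with backoff before dead-lettering — commonly uses capped exponential delays. A minimal, deterministic sketch; the base and cap values are illustrative (production code would typically add jitter):

```python
def backoff_delays(max_retries, base=1.0, cap=30.0):
    """Delay (seconds) before each retry attempt: exponential, capped."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]

# Only after these retries are exhausted does the message go to the DLQ
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```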

Source: ClawHub · Chinese localization: 龙虾技能库