📦 database-replication-advisor
v1.0.0
Analyze database replication topology, detect lag, and recommend a replication strategy based on CAP tradeoffs.
Database Replication Advisor
Analyze the health and design of database replication setups. This skill teaches an AI agent to inspect replication lag, evaluate topology choices (single-leader, multi-leader, leaderless), assess failover readiness, and recommend replication strategies grounded in CAP theorem tradeoffs and real operational constraints.
Use when: "check replication lag", "replication health", "failover readiness", "replication topology", "CAP tradeoffs", "design replication", "replica drift", "split-brain risk", "failover drill"
Commands
- assess -- Check current replication health
Inspect the running replication state, measure lag, detect divergence, and flag risks.
Step 1: Identify the database engine and topology

    # PostgreSQL: check whether this node is a primary or a standby
    psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT pg_is_in_recovery();"

    # PostgreSQL: list replication slots and whether each one is active
    psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
      SELECT slot_name, slot_type, active, restart_lsn
      FROM pg_replication_slots;
    "

    # MySQL: check replication status on a replica
    mysql -h "$REPLICA_HOST" -u "$DB_USER" -p"$DB_PASS" -e "SHOW REPLICA STATUS\G"

    # Redis: check replication info
    redis-cli -h "$REDIS_HOST" info replication
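The Redis check above prints plain `key:value` text, which an agent has to interpret before it can name the topology. A minimal sketch of that interpretation follows. The helper names `parse_redis_replication_info` and `describe_redis_role` and the sample text are illustrative assumptions; the field names (`role`, `connected_slaves`, `master_host`, `master_link_status`) are the ones Redis reports in its replication section.

```python
def parse_redis_replication_info(raw: str) -> dict:
    """Parse the text printed by `redis-cli info replication` into a dict."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def describe_redis_role(info: dict) -> str:
    """Summarize whether the node is a primary or a replica (hypothetical helper)."""
    role = info.get("role", "unknown")
    if role == "master":
        return f"primary with {info.get('connected_slaves', '0')} connected replica(s)"
    if role == "slave":
        host = info.get("master_host", "?")
        link = info.get("master_link_status", "?")
        return f"replica of {host} (link {link})"
    return f"unknown role: {role}"


# Hypothetical sample output, not captured from a real server.
sample = """# Replication
role:master
connected_slaves:2
master_repl_offset:1048576
"""

print(describe_redis_role(parse_redis_replication_info(sample)))
```

In practice the raw text would come from the `redis-cli` call itself, for example via `subprocess.run`, rather than from a hard-coded sample.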
Step 2: Measure replication lag
Lag is the most critical replication health metric. Measure it from both the database internals and from app-level probes.

    # PostgreSQL: lag in bytes and as a time interval for each standby
    psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
      SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
             pg_wal_lsn_diff(sent_lsn, replay_lsn) AS byte_lag,
             replay_lag
      FROM pg_stat_replication;
    "

    # MySQL: per-worker apply timestamps (lag = end-apply time minus original commit time)
    mysql -h "$REPLICA_HOST" -u "$DB_USER" -p"$DB_PASS" -e "
      SELECT CHANNEL_NAME, WORKER_ID,
             LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP,
             APPLYING_TRANSACTION,
             LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP
      FROM performance_schema.replication_applier_status_by_worker;
    "

    # App-level heartbeat probe (write a timestamp to the primary, read it back from the replica)
    python3 -c "
    import time, psycopg2

    primary = psycopg2.connect(host='$PRIMARY_HOST', dbname='$DB_NAME', user='$DB_USER')
    replica = psycopg2.connect(host='$REPLICA_HOST', dbname='$DB_NAME', user='$DB_USER')

    # Write heartbeat to primary
    with primary.cursor() as cur:
        cur.execute('CREATE TABLE IF NOT EXISTS _repl_heartbeat (id int PRIMARY KEY, ts timestamptz)')
        cur.execute('INSERT INTO _repl_heartbeat VALUES (1, now()) ON CONFLICT (id) DO UPDATE SET ts = now()')
        primary.commit()
        cur.execute('SELECT ts FROM _repl_heartbeat WHERE id = 1')
        write_ts = cur.fetchone()[0]

    time.sleep(0.5)

    # Read heartbeat from replica
    with replica.cursor() as cur:
        cur.execute('SELECT ts FROM _repl_heartbeat WHERE id = 1')
        read_ts = cur.fetchone()[0]

    # If the replica has already replayed the heartbeat, the timestamps match and lag is 0.
    # Otherwise the replica still holds the previous heartbeat and the difference is reported.
    lag = (write_ts - read_ts).total_seconds() if write_ts > read_ts else 0
    print(f'App-level replication lag: {lag:.3f}s')
    print(f'Assessment: {\"HEALTHY\" if lag < 1 else \"WARNING\" if lag < 10 else \"CRITICAL\"}')
    "
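The probe above maps a measured lag to a verdict using two thresholds (under 1 s, under 10 s, otherwise critical). A small sketch of that classification as a reusable function may make the thresholds easier to tune and test. The function names and the default parameters `warn_at` and `critical_at` are assumptions introduced here for illustration; the threshold values are taken from the probe.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_lag_seconds(write_ts: datetime, read_ts: datetime) -> float:
    """Lag between the heartbeat written on the primary and the one seen on the replica.

    Clamped at zero, mirroring the probe: a replica can never be ahead of the primary.
    """
    return max((write_ts - read_ts).total_seconds(), 0.0)


def classify_lag(lag_seconds: float, warn_at: float = 1.0, critical_at: float = 10.0) -> str:
    """Map a lag in seconds to HEALTHY / WARNING / CRITICAL (thresholds from the probe)."""
    if lag_seconds < warn_at:
        return "HEALTHY"
    if lag_seconds < critical_at:
        return "WARNING"
    return "CRITICAL"


# Hypothetical timestamps: the replica is 8.2 seconds behind the primary.
written = datetime(2024, 1, 1, 12, 0, 10, 200000, tzinfo=timezone.utc)
seen = written - timedelta(seconds=8.2)

lag = heartbeat_lag_seconds(written, seen)
print(f"{lag:.1f}s -> {classify_lag(lag)}")
```

Keeping the thresholds as parameters lets a workload with relaxed freshness requirements pass larger values without touching the probe itself.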
Step 3: Check for replication conflicts and errors

    # PostgreSQL: check for replication conflicts (queries cancelled on the standby)
    psql -h "$REPLICA_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
      SELECT datname, confl_tablespace, confl_lock, confl_snapshot,
             confl_bufferpin, confl_deadlock
      FROM pg_stat_database_conflicts
      WHERE datname = '$DB_NAME';
    "

    # PostgreSQL: check WAL archiving health on the primary
    psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
      SELECT archived_count, failed_count, last_archived_wal,
             last_archived_time, last_failed_time
      FROM pg_stat_archiver;
    "

    # MySQL: check for replication errors
    mysql -h "$REPLICA_HOST" -u "$DB_USER" -p"$DB_PASS" -e "
      SELECT LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
      FROM performance_schema.replication_applier_status_by_worker
      WHERE LAST_ERROR_NUMBER != 0;
    "
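The PostgreSQL conflict query above returns one counter per cancellation cause, and a report usually needs a single total plus the dominant cause. The following sketch shows one way to reduce a row to that summary. The function `summarize_conflicts` and the sample row are hypothetical; the column names are those selected from `pg_stat_database_conflicts` in the query.

```python
CONFLICT_COLUMNS = (
    "confl_tablespace",
    "confl_lock",
    "confl_snapshot",
    "confl_bufferpin",
    "confl_deadlock",
)


def summarize_conflicts(row: dict) -> dict:
    """Reduce one pg_stat_database_conflicts row to a total and its largest contributor."""
    # Treat missing or NULL counters as zero.
    counts = {col: int(row.get(col) or 0) for col in CONFLICT_COLUMNS}
    total = sum(counts.values())
    dominant = max(counts, key=counts.get) if total else None
    return {"total": total, "dominant": dominant}


# Hypothetical row, shaped like the query result for one database.
sample_row = {
    "datname": "appdb",
    "confl_tablespace": 0,
    "confl_lock": 2,
    "confl_snapshot": 9,
    "confl_bufferpin": 1,
    "confl_deadlock": 0,
}

print(summarize_conflicts(sample_row))
```

The counters are cumulative since the last statistics reset, so comparing two summaries taken some time apart says more about current behavior than a single reading.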
Step 4: Evaluate network and disk bottlenecks

    # PostgreSQL: total WAL written so far and WAL still pending replay per standby
    # (sample twice and divide by the interval to derive a generation rate)
    psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
      SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') / (1024*1024*1024) AS total_wal_gb,
             pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / (1024*1024) AS pending_mb
      FROM pg_stat_replication;
    "

    # Check disk I/O on the replica
    iostat -x 1 3 | tail -20

    # Check network latency between primary and replica
    ping -c 5 "$REPLICA_HOST" | tail -1
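A WAL generation rate needs two LSN readings taken a known interval apart, since a single reading only gives a cumulative position. The sketch below converts PostgreSQL's textual `X/Y` LSN format (two hexadecimal halves of a 64-bit position) to an integer and derives a rate in MiB/s. The function names and the sample LSNs are illustrative assumptions; the readings would come from `SELECT pg_current_wal_lsn();` on the primary.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position.

    The part before the slash is the high 32 bits, the part after is the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_rate_mib_per_s(lsn_start: str, lsn_end: str, interval_seconds: float) -> float:
    """WAL generated between two samples, in MiB per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    generated_bytes = lsn_to_int(lsn_end) - lsn_to_int(lsn_start)
    return generated_bytes / (1024 * 1024) / interval_seconds


# Hypothetical samples taken 10 seconds apart.
rate = wal_rate_mib_per_s("0/3000000", "0/8000000", 10)
print(f"WAL generation rate: {rate:.1f} MiB/s")
```

Comparing this rate with the network throughput between primary and replica, and with the replica's disk write throughput from `iostat`, indicates whether the replica can keep up in principle or is structurally falling behind.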
Report template
Replication Health Assessment
Date: YYYY-MM-DD
Engine: PostgreSQL 16 / MySQL 8 / Redis 7
Topology: Single-leader with 2 async standbys
Replication Status
| Replica | State | Byte Lag | Time Lag | Conflicts | Verdict |
|---|---|---|---|---|---|
| replica-1 | streaming | 1.2 MB | 0.3s | 0 | HEALTHY |
| replica-2 | streaming | 45 MB | 8.2s | 12 | WARNING |
Risk Assessment
- Data loss window (RPO): ~8s (worst replica lag)
- Failover time estimate (RTO):