数据库复制顾问
分析数据库复制设置的健康状况和设计。该技能教会AI代理检查复制延迟、评估拓扑选择(单主、多主、无主)、评估故障转移准备情况,并根据CAP定理权衡和实际操作约束推荐复制策略。
使用时: "检查复制延迟"、 "复制健康状况"、 "故障转移准备情况"、 "复制拓扑"、 "CAP权衡"、 "设计复制"、 "副本漂移"、 "分脑风险"、 "故障转移演练"
命令
检查运行中的复制状态、测量延迟、检测偏差并标记风险。
步骤 1:识别数据库引擎和拓扑
# PostgreSQL:检查是否为主节点或备节点
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT pg_is_in_recovery();"
# PostgreSQL:列出复制槽和连接的备节点
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;"
# MySQL:检查复制状态
mysql -h "$REPLICA_HOST" -u "$DB_USER" -p"$DB_PASS" -e "SHOW REPLICA STATUS\G"
# Redis:检查复制信息
redis-cli -h "$REDIS_HOST" INFO replication
步骤 2:测量复制延迟
延迟是最关键的复制健康指标。从数据库内部和应用程序级别探测器测量它。
# PostgreSQL:每个备节点的延迟(字节和秒)
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS byte_lag, replay_lag FROM pg_stat_replication;"
# MySQL:主节点落后时间(秒)
mysql -h "$REPLICA_HOST" -u "$DB_USER" -p"$DB_PASS" -e "SELECT CHANNEL_NAME, SOURCE_UUID, LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP, APPLYING_TRANSACTION, LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP FROM performance_schema.replication_applier_status_by_worker;"
# 应用程序级别心跳探测(写入时间戳到主节点,读取时间戳从备节点)
python3 -c "import time, psycopg2
primary = psycopg2.connect(host='$PRIMARY_HOST', dbname='$DB_NAME', user='$DB_USER')
replica = psycopg2.connect(host='$REPLICA_HOST', dbname='$DB_NAME', user='$DB_USER')
# 写入心跳到主节点
with primary.cursor() as cur:
cur.execute('CREATE TABLE IF NOT EXISTS _repl_heartbeat (id int PRIMARY KEY, ts timestamptz)')
cur.execute('INSERT INTO _repl_heartbeat VALUES (1, now()) ON CONFLICT (id) DO UPDATE SET ts = now()')
primary.commit()
cur.execute('SELECT ts FROM _repl_heartbeat WHERE id = 1')
write_ts = cur.fetchone()[0]
time.sleep(0.5)
# 从备节点读取心跳
with replica.cursor() as cur:
cur.execute('SELECT ts FROM _repl_heartbeat WHERE id = 1')
read_ts = cur.fetchone()[0]
lag = (write_ts - read_ts).total_seconds() if write_ts > read_ts else 0
print(f'应用程序级别复制延迟:{lag:.3f}s')
print(f'评估:{\"HEALTHY\" if lag < 1 else \"WARNING\" if lag < 10 else \"CRITICAL\"}')"
步骤 3:检查复制冲突和错误
# PostgreSQL:检查复制冲突(备节点上的查询被取消)
psql -h "$REPLICA_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT datname, confl_tablespace, confl_lock, confl_snapshot, confl_bufferpin, confl_deadlock FROM pg_stat_database_conflicts WHERE datname = '$DB_NAME';"
# PostgreSQL:检查WAL归档健康状况
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT archived_count, failed_count, last_archived_wal, last_archived_time, last_failed_time FROM pg_stat_archiver;"
# MySQL:检查复制错误
mysql -h "$REPLICA_HOST" -u "$DB_USER" -p"$DB_PASS" -e "SELECT LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP FROM performance_schema.replication_applier_status_by_worker WHERE LAST_ERROR_NUMBER != 0;"
步骤 4:评估网络和磁盘瓶颈
# 检查WAL生成率(PostgreSQL)
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') / (1024
10241024) AS total_wal_gb, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / (1024*1024) AS pending_mb FROM pg_stat_replication;"
# 检查备节点磁盘I/O
iostat -x 1 3 | tail -20
# 检查主节点和备节点之间的网络延迟
ping -c 5 "$REPLICA_HOST" | tail -1
报告模板
复制健康评估
日期: YYYY-MM-DD
引擎: PostgreSQL 16 / MySQL 8 / Redis 7
拓扑: 单主节点带2个异步备节点
复制状态
| 备节点 | 状态 | 字节延迟 | 时间延迟 | 冲突 | 结论 |
|---------|-------|----------|----------|-----------|---------|
| replica-1 | streaming | 1.2 MB | 0.3s | 0 | HEALTHY |
| replica-2 | streaming | 45 MB | 8.2s | 12 | WARNING |
风险评估
- 数据丢失窗口(RPO): ~8s(最坏的复制延迟)
- 故障转移时间估计(RTO):