Example Device Health - IOS-XE Health Check

Name: Example Device Health - IOS-XE Health Check
Author: Vahagn Madatyan

v1.0.0

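Provides a health-check and troubleshooting procedure for Cisco IOS-XE routers and switches. It runs read-only show commands covering CPU, memory, and interfaces in one pass and produces a structured report with threshold tables and escalation guidance. Suited to outages, audits, and post-change verification.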


by @vahagn-madatyan (Vahagn Madatyan)·MIT-0

Tags: network tools, system tools, testing tools, productivity tools, workflow

License

MIT-0

Last updated

2026/3/22

Security scan

VirusTotal

Pending

OpenClaw

Safe

high confidence

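This skill is an instruction-only, read-only Cisco IOS-XE troubleshooting checklist. Its commands and scope match its stated purpose; it requests no extra credentials and installs nothing, though the metadata is slightly inconsistent about the SSH requirement.

Assessment recommendations

This skill is an instruction-only Cisco IOS-XE troubleshooting checklist, consistent with its stated purpose. Before installing or using it: 1) decide how SSH/console credentials will be supplied securely (the skill needs device access but declares no credential requirement), and avoid pasting credentials into untrusted channels; 2) note that output such as "show tech-support" may contain sensitive configuration and logs, so share it only through authorized support channels; 3) obtain change-control or maintenance-window approval before running it on production devices; 4) the SKILL.md metadata mentions "ssh" while the registry metadata does not, which is most likely harmless, but confirm with the skill author if you need strict inventory or attestation. To make sure the skill does not send data elsewhere, confirm there are no hidden outbound endpoints (none are declared), and prefer running it locally or from a dedicated management jump host. ...

Detailed analysis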

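✓ Purpose and capability

The name and description (IOS-XE health and troubleshooting) match the content: only IOS show commands, thresholds, and escalation guidance. The procedure needs SSH or console access to the target device. One minor inconsistency: the metadata embedded in SKILL.md lists "ssh" under requires.bins, while the registry-level requirements list no required binaries. This looks like a metadata mismatch rather than malicious behavior.

✓ Instruction scope

SKILL.md directs the user to run read-only IOS show commands, collect counters, environment, and routing information, and generate a structured report. It does not direct reading local files or environment variables, or contacting external endpoints. It recommends collecting large outputs (such as "show tech-support") for escalation, which is standard TAC practice but may include sensitive device data.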

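✓ Install mechanism

No install specification and no code files; instructions only. The skill itself does not download anything or write to disk, which is the lowest-risk install posture.

ℹ Credential requirements

The skill does not declare or request environment variables or credentials, but in practice it needs SSH/console credentials (device management credentials, jump-host keys) to run the checks it describes. That need is proportionate to the task, but the skill does not state how credentials are supplied or handled, so verify this yourself before use.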

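✓ Persistence and permissions

always is false; the skill is user-invoked and not force-included. It contains no instructions that modify other skills, agent configuration, or system-level settings.

Security is layered; review the code before running it.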

The MIT-0 license permits free use, modification, and redistribution without attribution.

Runtime dependencies

No special dependencies

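Versions

latest: v1.0.0 (2026/3/22)

example-device-health 1.0.0, initial release:
- Provides a step-by-step triage procedure for Cisco IOS-XE routers and switches.
- Covers CPU, memory, interface, routing protocol, and platform environment health checks.
- Includes recommended show commands, escalation criteria, and troubleshooting decision trees.
- Provides threshold tables for interpreting health metrics.
- Generates a structured device health report with severity levels and actionable recommendations.
- Suited to outages, audits, post-change verification, and incident response.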

Install command

Official: npx clawhub@latest install example-device-health

Mirror: npx clawhub@latest install example-device-health --registry https://cn.clawhub-mirror.com

Skill documentation

Structured triage procedure for assessing Cisco IOS-XE device health. Produces a prioritized findings report with severity classifications and recommended actions.

When to Use

Device is reported as slow, unresponsive, or dropping traffic
Scheduled health audit of IOS-XE routers or switches
Post-change verification after configuration or software updates
Capacity planning data collection for CPU, memory, and interface utilization
Incident response when a device is suspected as the fault domain

Prerequisites

SSH or console access to the target IOS-XE device (privilege level 1 minimum)
Device running IOS-XE 16.x or 17.x (commands validated against 17.3+)
Network reachability confirmed (ping/traceroute to management IP succeeds)
Knowledge of the device's normal baseline (typical CPU, memory, traffic levels)
Change control approval if performing checks during a maintenance window

Procedure

Follow this sequence. Each step produces data for the final report. Do not skip steps unless the device is unresponsive (jump to Step 6 for crash recovery).

Step 1: Establish Baseline Context

Collect device identity and uptime to frame the health check.

show version | include uptime|Version|bytes of memory
show inventory | include PID
show clock

Record: hostname, software version, uptime, hardware model, current time. Flag if uptime is unexpectedly short — indicates recent reload or crash.
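The uptime flag can be automated once the command output has been captured. The sketch below is a minimal Python example, assuming the usual "<hostname> uptime is 1 week, 2 days, ..." line in show version output. The function names and the 24-hour default are illustrative assumptions, not part of the skill.

```python
import re

# Minutes per unit; a year is treated as 365 days (assumption).
_UNIT_MINUTES = {"year": 525600, "week": 10080, "day": 1440, "hour": 60, "minute": 1}


def uptime_minutes(show_version_output: str) -> int:
    """Return the device uptime in minutes parsed from 'show version' text."""
    match = re.search(r"uptime is (.+)", show_version_output)
    if match is None:
        raise ValueError("no 'uptime is' line found")
    total = 0
    for count, unit in re.findall(r"(\d+)\s+(year|week|day|hour|minute)s?", match.group(1)):
        total += int(count) * _UNIT_MINUTES[unit]
    return total


def uptime_is_suspicious(show_version_output: str, min_expected_minutes: int = 1440) -> bool:
    """Flag uptime shorter than the expected minimum (hypothetical default: 24 hours)."""
    return uptime_minutes(show_version_output) < min_expected_minutes
```

The right minimum depends on the device's known baseline, for example the time since the last planned reload.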

Step 2: CPU Utilization Assessment

show processes cpu sorted | head 20
show processes cpu history
show processes cpu platform sorted 5sec

Compare 5-second, 1-minute, and 5-minute averages against thresholds. If 5-second average exceeds 80%, identify the top process immediately.

Key processes to watch:

IP Input — high values indicate traffic processing overload
Crypto IKMP — VPN negotiation storms
SNMP ENGINE — aggressive polling
BGP Router — large table churn or route oscillation
IOSD — general control plane congestion
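To compare the three averages programmatically, the summary line at the top of show processes cpu output can be parsed. This is a sketch under the assumption that the line has the common "CPU utilization for five seconds: X%/Y%; one minute: Z%; five minutes: W%" form; the function names are hypothetical.

```python
import re

# Assumed summary line format from 'show processes cpu'.
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def parse_cpu_summary(output: str) -> dict:
    """Extract total/interrupt 5-second load and the 1- and 5-minute averages."""
    match = CPU_LINE.search(output)
    if match is None:
        raise ValueError("CPU utilization summary line not found")
    five_sec, interrupt, one_min, five_min = map(int, match.groups())
    return {
        "five_sec": five_sec,
        "interrupt": interrupt,
        "one_min": one_min,
        "five_min": five_min,
    }


def needs_top_process_check(cpu: dict, limit: int = 80) -> bool:
    """True when the 5-second value exceeds the 80% trigger described above."""
    return cpu["five_sec"] > limit
```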

Step 3: Memory Utilization Assessment

show memory statistics
show memory platform information
show processes memory sorted | head 15

Calculate used percentage: (Total - Free) / Total * 100. Check for memory fragmentation: compare Largest Free block to Total Free. If largest free block is less than 10% of total free, fragmentation is a concern.
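The two memory calculations are simple enough to express directly. The following Python sketch assumes the Total, Free, and Largest Free values have already been read from the command output as integers; the function names are illustrative.

```python
def memory_used_percent(total: int, free: int) -> float:
    """Used percentage: (Total - Free) / Total * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - free) / total * 100


def fragmentation_concern(largest_free: int, total_free: int, min_ratio: float = 0.10) -> bool:
    """True when the largest free block is below 10% of total free memory."""
    if total_free <= 0:
        # No free memory at all is treated as a concern (assumption).
        return True
    return largest_free / total_free < min_ratio
```

For example, a pool with 1000 units total and 250 free is 75% used, which falls in the warning band of the threshold table.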

Step 4: Interface Health

show interfaces summary
show interfaces counters errors
show interfaces | include line protocol|drops|error|CRC|collision

For each interface with errors:

Calculate error rate: errors / (input packets + output packets) * 100
Error rate above 0.1% is warning, above 1% is critical
CRC errors suggest Layer 1 issues (cabling, optics, SFP)
Input errors with no CRC suggest buffer or overrun issues
Output drops indicate congestion — check QoS policy
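The error-rate rule above can be sketched as follows. The default limits are the 0.1% and 1% values stated in this step; they are parameters because the threshold table elsewhere in this document uses different bands. Function names are hypothetical.

```python
def error_rate_percent(errors: int, input_packets: int, output_packets: int) -> float:
    """errors / (input packets + output packets) * 100."""
    total = input_packets + output_packets
    if total == 0:
        # An idle interface has no meaningful rate; report zero (assumption).
        return 0.0
    return errors / total * 100


def classify_error_rate(rate: float, warning: float = 0.1, critical: float = 1.0) -> str:
    """Map an error rate in percent to NORMAL, WARNING, or CRITICAL."""
    if rate > critical:
        return "CRITICAL"
    if rate > warning:
        return "WARNING"
    return "NORMAL"
```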

Step 5: Routing Table Health

show ip route summary
show ip bgp summary (if BGP is configured)
show ip ospf neighbor (if OSPF is configured)
show ip eigrp neighbors (if EIGRP is configured)

Verify: expected number of routes present, no unexpected route withdrawals, all routing protocol neighbors in established/full state.

Flag: neighbor state changes in the last hour, route count significantly different from baseline, any routes via unexpected next-hops.
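As one way to check the "established" condition for BGP, the neighbor table in show ip bgp summary can be scanned. This sketch assumes the classic ten-column layout ending in State/PfxRcd, where a number in the last column means the session is established and a word (Idle, Active, ...) means it is not. The function name is hypothetical.

```python
def bgp_peers_not_established(summary_output: str) -> dict:
    """Return {neighbor: state} for peers whose State/PfxRcd is not a prefix count."""
    peers = {}
    in_table = False
    for line in summary_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Neighbor":
            in_table = True  # rows after this header are neighbor entries
            continue
        if in_table and len(fields) >= 10:
            if not fields[9].isdigit():
                # Keep multi-word states such as "Idle (Admin)" intact.
                peers[fields[0]] = " ".join(fields[9:])
    return peers
```

An empty result means every listed peer reported a prefix count; it does not confirm that the count matches the baseline.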

Step 6: Platform and Environment

show environment all
show platform software status control-processor brief
show logging | include %|Error|Warning|traceback (last 50 lines)

Check: power supply status, fan status, temperature readings. Any environmental alarm is an immediate escalation trigger. Review recent syslog for crash signatures (traceback, CPUHOG, MALLOCFAIL).

Threshold Tables

Reference: references/threshold-tables.md for detailed per-parameter thresholds.

Parameter	Normal	Warning	Critical
CPU 5-min avg	< 40%	40–70%	> 70%
CPU 5-sec spike	< 80%	80–90%	> 90%
Memory used	< 70%	70–85%	> 85%
Memory fragmentation	> 10% largest/total	5–10%	< 5%
Interface error rate	< 0.01%	0.01–0.1%	> 0.1%
Interface output drops	< 100/hr	100–1000/hr	> 1000/hr
Routing neighbors	All established	Flapping	Down
Temperature	Within spec	Within 5°C of max	At or above max
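The numeric rows of the table can be encoded as data so that every metric is classified the same way. The sketch below covers only the rows where a higher value is worse; memory fragmentation, routing neighbors, and temperature need their own logic and are left out. The metric keys are invented for illustration.

```python
# metric: (warning starts at, critical above), taken from the table above
THRESHOLDS = {
    "cpu_5min_pct": (40, 70),
    "cpu_5sec_pct": (80, 90),
    "memory_used_pct": (70, 85),
    "interface_error_rate_pct": (0.01, 0.1),
    "output_drops_per_hr": (100, 1000),
}


def classify(metric: str, value: float) -> str:
    """Return NORMAL, WARNING, or CRITICAL for a metric listed in THRESHOLDS."""
    warning_from, critical_above = THRESHOLDS[metric]
    if value > critical_above:
        return "CRITICAL"
    if value >= warning_from:
        return "WARNING"
    return "NORMAL"
```

Keeping the limits in one dictionary makes it easier to swap in the per-platform values from references/threshold-tables.md.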

Decision Trees

Triage Priority

Is the device reachable?
├── No → Escalate immediately. Check console access, power, environment.
└── Yes
    ├── CPU critical? → Identify top process → Apply mitigation per process
    │   ├── IP Input → Check for traffic storm, ACL optimization
    │   ├── BGP Router → Check for route churn, peer flap, table size
    │   └── Other → Collect 'show tech-support' for TAC escalation
    ├── Memory critical? → Check for memory leak
    │   ├── Largest free < 5% of total → Likely fragmentation, schedule reload
    │   └── Steady growth over time → Memory leak, collect 'show mem alloc'
    ├── Interface errors? → Classify error type
    │   ├── CRC/input errors → Layer 1 (cable, optic, SFP)
    │   └── Output drops → QoS policy or congestion
    └── All within thresholds → Document clean health, schedule next check
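The tree above can be read as a function from observations to a list of next actions. The following Python sketch is one possible encoding; the parameter names are assumptions, and it returns every branch that applies, in the tree's priority order, rather than stopping at the first match.

```python
from typing import List, Optional


def triage(reachable: bool, cpu_critical: bool = False, top_process: str = "",
           memory_critical: bool = False, largest_free_pct: float = 100.0,
           memory_growing: bool = False,
           interface_error: Optional[str] = None) -> List[str]:
    """Return recommended actions following the triage priority tree."""
    if not reachable:
        return ["Escalate immediately. Check console access, power, environment."]
    actions: List[str] = []
    if cpu_critical:
        if top_process == "IP Input":
            actions.append("Check for traffic storm, ACL optimization")
        elif top_process == "BGP Router":
            actions.append("Check for route churn, peer flap, table size")
        else:
            actions.append("Collect 'show tech-support' for TAC escalation")
    if memory_critical:
        if largest_free_pct < 5:
            actions.append("Likely fragmentation, schedule reload")
        elif memory_growing:
            actions.append("Memory leak, collect 'show mem alloc'")
        else:
            actions.append("Check for memory leak")
    if interface_error in ("crc", "input"):
        actions.append("Layer 1 (cable, optic, SFP)")
    elif interface_error == "output_drops":
        actions.append("QoS policy or congestion")
    return actions or ["Document clean health, schedule next check"]
```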

Escalation Criteria

Escalate to senior engineer or TAC when any of these conditions are met:

CPU sustained above 90% for more than 15 minutes with no identifiable cause
Memory below 15% free with no recent change to explain consumption
Traceback or CPUHOG messages in logs within last 24 hours
Environmental alarm (power, fan, temperature) present
More than 3 routing neighbor state changes in last hour
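These criteria translate naturally into a checklist function. The sketch below assumes the observations have already been gathered into a small record; the field names are hypothetical and the defaults describe a healthy device.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observations:
    minutes_cpu_above_90: int = 0
    cpu_cause_identified: bool = True
    memory_free_pct: float = 100.0
    recent_change_explains_memory: bool = False
    traceback_or_cpuhog_last_24h: bool = False
    environmental_alarm: bool = False
    neighbor_state_changes_last_hour: int = 0


def escalation_reasons(obs: Observations) -> List[str]:
    """Return the escalation criteria that are met; empty means no escalation."""
    reasons = []
    if obs.minutes_cpu_above_90 > 15 and not obs.cpu_cause_identified:
        reasons.append("CPU above 90% for more than 15 minutes, cause unknown")
    if obs.memory_free_pct < 15 and not obs.recent_change_explains_memory:
        reasons.append("Memory below 15% free with no explaining change")
    if obs.traceback_or_cpuhog_last_24h:
        reasons.append("Traceback or CPUHOG in logs within last 24 hours")
    if obs.environmental_alarm:
        reasons.append("Environmental alarm present")
    if obs.neighbor_state_changes_last_hour > 3:
        reasons.append("More than 3 routing neighbor state changes in last hour")
    return reasons
```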

Report Template

Generate a structured report with these sections:

DEVICE HEALTH REPORT
====================
Device: [hostname]
Model: [PID from inventory]
Software: [version]
Uptime: [uptime string]
Check Time: [timestamp]
Performed By: [operator/agent]
SUMMARY: [HEALTHY | WARNING | CRITICAL]
FINDINGS:
[Severity] [Component] — [Description]
   Observed: [metric value]
   Threshold: [normal/warning/critical range]
   Action: [recommended action]
...
RECOMMENDATIONS:
[Prioritized list of actions]
NEXT CHECK: [scheduled date based on findings severity]

Severity levels for findings:

INFO — within normal thresholds, noted for baseline
WARNING — approaching threshold, monitor closely
CRITICAL — threshold exceeded, action required
EMERGENCY — device at risk of failure, immediate action
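A small sketch of how the SUMMARY and FINDINGS sections might be rendered from collected findings. The Finding class and function names are hypothetical, and mapping EMERGENCY findings to a CRITICAL summary is an assumption, since the template lists only three summary values.

```python
from dataclasses import dataclass
from typing import List

SEVERITY_ORDER = ["INFO", "WARNING", "CRITICAL", "EMERGENCY"]


@dataclass
class Finding:
    severity: str
    component: str
    description: str
    observed: str
    threshold: str
    action: str


def summary_status(findings: List[Finding]) -> str:
    """Derive HEALTHY, WARNING, or CRITICAL from the worst finding."""
    worst = max((SEVERITY_ORDER.index(f.severity) for f in findings), default=0)
    if worst >= SEVERITY_ORDER.index("CRITICAL"):
        return "CRITICAL"
    if worst == SEVERITY_ORDER.index("WARNING"):
        return "WARNING"
    return "HEALTHY"


def render_findings(findings: List[Finding]) -> str:
    """Render the SUMMARY and FINDINGS sections, most severe finding first."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER.index(f.severity), reverse=True)
    lines = [f"SUMMARY: {summary_status(findings)}", "FINDINGS:"]
    for f in ordered:
        lines.append(f"[{f.severity}] [{f.component}] - {f.description}")
        lines.append(f"   Observed: {f.observed}")
        lines.append(f"   Threshold: {f.threshold}")
        lines.append(f"   Action: {f.action}")
    return "\n".join(lines)
```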

Troubleshooting

Device Unresponsive to SSH

Try console access. If console is also unresponsive, check power and environment remotely (smart PDU, out-of-band management). If the device has crashed, collect the crashinfo files after recovery; dir crashinfo: lists them.

CPU Spikes During Health Check

SNMP polling or show commands themselves can briefly spike CPU. Wait 30 seconds after connecting before collecting CPU data. Use terminal length 0 to avoid paging pauses that extend session time.

Inconsistent Memory Readings

Memory values fluctuate during normal operation. Collect three samples at 30-second intervals and average them. Check show memory dead for memory that is allocated but unreachable (leak indicator).

Interface Counter Interpretation

Counters are cumulative since last clear. Use show interfaces [name] to see the last clear time. For rate calculations, collect counters twice with a known interval: (counter2 - counter1) / interval_seconds.
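The rate formula above, written out as a minimal Python sketch. The function names are illustrative; the check for a decreasing counter is an added safeguard for the case where counters were cleared or wrapped between samples.

```python
def counter_rate(counter1: int, counter2: int, interval_seconds: float) -> float:
    """Per-second rate: (counter2 - counter1) / interval_seconds."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if counter2 < counter1:
        raise ValueError("counter decreased; it was cleared or wrapped between samples")
    return (counter2 - counter1) / interval_seconds


def per_hour(rate_per_second: float) -> float:
    """Scale a per-second rate to per-hour, e.g. for the output-drops threshold."""
    return rate_per_second * 3600
```

For instance, output drops going from 1000 to 1600 over 60 seconds is 10 drops per second, or 36000 per hour.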

Routing Protocol Neighbor Issues

If OSPF neighbors are stuck in INIT/2WAY, check MTU mismatch and area configuration. If BGP peers show "Active" state, verify TCP connectivity on port 179 and check for ACL blocking. EIGRP stuck-in-active indicates a convergence problem downstream.

Data source: ClawHub · Chinese localization: 龙虾技能库 (Lobster Skill Library)
