Infra Monitoring: infrastructure monitoring for server health, availability, and resource utilization

Author: GitCanadaBrett

v0.1.0

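Monitors server health, availability, resource utilization, and SSL certificate expiry, and detects incidents, for small teams and self-hosters. Produces clear status reports that highlight what needs attention.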

by @gitcanadabrett (GitCanadaBrett)·MIT-0

Tags: monitoring tools, developer tools, system tools, automation

Last updated: 2026/4/11

Security scan

VirusTotal: harmless

OpenClaw: safe (high confidence)

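The skill's requested capabilities, instructions, and required resources are consistent with its stated purpose: it is an instruction-only monitoring and reporting skill that asks the user to supply metrics or endpoints, and it does not request unrelated credentials or install code.

Assessment recommendations

The skill appears consistent with its stated monitoring purpose. Before use: (1) avoid pasting sensitive information; (2) provide only domains and endpoints you control; (3) confirm before any automated remediation; (4) treat LLM-based analysis as guidance, not as automated remediation. ...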

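Detailed analysis

✓ Purpose and capabilities

The name and description match SKILL.md and the reference files. The skill does not request unrelated binaries, environment variables, or config paths.

✓ Instruction scope

Runtime instructions stay in scope: parse user-provided command output, run HTTP/HTTPS checks against provided endpoints, classify metrics, and generate reports.

✓ Install mechanism

No install spec and no code files. This is an instruction-only skill, which reduces the risk from disk writes and external installs.

✓ Credential requirements

The skill declares no required environment variables, primary credentials, or config paths. SKILL.md documents the "no access without explicit configuration" boundary.

✓ Persistence and privileges

always is false, and model invocation is allowed by default. The skill does not request persistent presence or changes to other skills or system settings.

Security is layered; review the code before running it.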

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

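Versions

latest: v0.1.0 (2026/4/11)

v0.1.0: First public release. Infrastructure health reporting for solo operators and small teams. Server health checks, availability monitoring, SSL certificate expiry tracking, incident logging, severity classification, and compound-signal detection. Attention-first reporting: it tells you what needs action, not 47 numbers. Scored 43/45 in QA across 12 test cases.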

Install command

Official: npx clawhub@latest install infra-monitoring

Mirror: npx clawhub@latest install infra-monitoring --registry https://cn.clawhub-mirror.com

Skill documentation

(The original content is long, so an abridged version follows. See the original document for the full version.)

name: infra-monitoring ...

Infrastructure and uptime monitoring

Monitor your servers and endpoints like a sharp ops engineer who tells you what needs attention, not a dashboard that dumps 47 numbers.

Trigger conditions

Activate this skill when the user:

Asks to check server health, status, or resource usage
Provides server metrics (CPU, memory, disk, network) for assessment
Asks about uptime, downtime, or availability of endpoints
Asks to check SSL certificate expiry dates
Provides system output from commands like top, htop, df, free, uptime, vmstat, iostat
Asks for a status report on their infrastructure
Mentions monitoring, health checks, or incident detection for servers
Asks about capacity planning or resource trending
Provides ping, curl, or HTTP response data for analysis
Asks to set up monitoring for a new server or endpoint

Do NOT activate when:

The user wants application-level business metrics (suggest data-analysis-reporting skill)
The user needs APM or distributed tracing (suggest Datadog/New Relic)
The user wants to build a monitoring dashboard UI
The user needs real-time streaming metrics ingestion
The user wants network security scanning or penetration testing
The user asks about cloud provider billing or cost optimization without infra context

Work the request in this order

1. Understand the scope. Before running any checks, understand what the user is monitoring and why:

- "What servers or endpoints do you need checked?"
- "Is this a routine check or are you investigating a specific issue?"
- "What does 'healthy' look like for your setup?"

If the user provides clear context (a paste of system metrics, a specific endpoint to check), skip to step 2.

2. Gather the data. Collect or parse the infrastructure data:

- Parse user-provided system command output (top, df, free, uptime, etc.)
- Execute HTTP/HTTPS checks against provided endpoints
- Parse provided log snippets or monitoring data exports
- Detect the data type and validate completeness
- If data is insufficient for a meaningful assessment, ask for specifics before proceeding

3. Assess health. Evaluate each metric against thresholds:

- Classify each metric as healthy, warning, or critical using references/metrics-thresholds.md
- Consider context: 85% disk on a 20GB volume is more urgent than on a 2TB volume
- Detect change direction: is the metric stable, climbing, or dropping
- Identify correlations: high CPU + high swap often means memory pressure, not a CPU problem
- Check for compound risk: multiple warnings that individually are fine but together signal trouble

4. Build the status report. Produce structured output following the default format below:

- Lead with what needs attention, not what's fine
- Group related issues (don't spam 5 separate disk alerts for one full server)
- Include temporal context: is this new, ongoing, or recurring
- Note what has improved since the last check (if prior context is available)

5. Recommend actions. Give concrete next steps prioritized by urgency:

- Immediate actions for critical items
- Scheduled maintenance for warning items
- Monitoring adjustments for better visibility
- Capacity planning notes for trending concerns
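
As a rough illustration of step 2, the sketch below parses pasted df output into structured records and sets aside lines it cannot parse instead of guessing at them. It assumes POSIX-format output (df -P), where each filesystem occupies one line. The names parse_df and DiskUsage and the sample values are hypothetical and not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class DiskUsage:
    filesystem: str
    size_kb: int
    used_kb: int
    use_percent: int
    mount: str


def parse_df(output: str) -> tuple[list[DiskUsage], list[str]]:
    """Parse `df -P` output. Returns (records, unparsed_lines)."""
    records, unparsed = [], []
    lines = [ln for ln in output.strip().splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        parts = line.split(None, 5)  # the mount point may contain spaces
        try:
            fs, size, used, _avail, pct, mount = parts
            records.append(
                DiskUsage(fs, int(size), int(used), int(pct.rstrip("%")), mount)
            )
        except ValueError:
            # Wrong column count or a non-numeric field: report it, don't guess.
            unparsed.append(line)
    return records, unparsed


# Made-up sample input, including one deliberately incomplete line.
sample = """Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 20511312 17434615 2011697 90% /
tmpfs 1024000 0 1024000 0% /dev/shm
/dev/sdb1 garbage"""

records, unparsed = parse_df(sample)
print(records[0].mount, records[0].use_percent)
print(unparsed)
```

Returning the unparsed lines separately keeps the completeness check explicit: the caller can assess what parsed cleanly and ask the user about the rest.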

Default output structure

Use this structure unless the user clearly wants a different format:

Attention required. The critical and warning items, sorted by severity then urgency. Each item:

- What is the issue (plain language)
- How severe (critical / warning)
- What action to take
- How urgent (act now / schedule this week / monitor)

If nothing needs attention: "All systems healthy. No action required."

Server health summary. Per-server overview:

- Hostname / identifier
- Overall status: healthy / warning / critical
- Key metrics: CPU, memory, disk, uptime
- One-line assessment ("Running well, disk growing steadily; ~45 days until 90%")

Endpoint status. Per-endpoint overview:

- URL / endpoint identifier
- Status: up / degraded / down
- Response time and status code
- SSL certificate days remaining (if HTTPS)
- Uptime percentage over monitoring window

Resource trends. Directional indicators for key metrics:

- Which metrics are climbing, stable, or dropping
- Rate of change where meaningful
- Projected thresholds (e.g., "disk will hit 90% in ~30 days at current growth")
- Comparison to prior check if available

Incident timeline. Recent events, if any:

- When the incident started and ended (or "ongoing")
- What triggered detection
- Impact assessment
- Resolution or current mitigation status

Recommended actions. Three concrete next steps:

- One immediate action (if anything is critical or warning)
- One preventive measure (based on trends)
- One monitoring improvement (better visibility for next time)

System details. Raw metric values for reference:

- Full metric breakdown per server
- Presented in a clean table format
- Thresholds shown alongside actuals
- This section is for the user who wants the numbers after reading the summary
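
The ordering rule for the "Attention required" section (severity first, then urgency) can be sketched as below. This is a minimal illustration under stated assumptions: the AttentionItem fields mirror the four bullets for each item, and the class name, function name, line format, and sample items are all hypothetical.

```python
from dataclasses import dataclass

# Lower number sorts first.
SEVERITY_ORDER = {"critical": 0, "warning": 1}
URGENCY_ORDER = {"act now": 0, "schedule this week": 1, "monitor": 2}


@dataclass
class AttentionItem:
    issue: str
    severity: str  # "critical" or "warning"
    action: str
    urgency: str   # "act now", "schedule this week", or "monitor"


def attention_section(items: list[AttentionItem]) -> str:
    """Render the lead section: worst items first, or the all-clear line."""
    if not items:
        return "All systems healthy. No action required."
    ordered = sorted(
        items,
        key=lambda i: (SEVERITY_ORDER[i.severity], URGENCY_ORDER[i.urgency]),
    )
    return "\n".join(
        f"[{i.severity.upper()}] {i.issue} -> {i.action} ({i.urgency})"
        for i in ordered
    )


# Made-up items for illustration only.
items = [
    AttentionItem("Memory usage elevated", "warning", "Review top consumers", "monitor"),
    AttentionItem("Root volume nearly full", "critical", "Free space or expand volume", "act now"),
    AttentionItem("Certificate expiring soon", "warning", "Renew certificate", "schedule this week"),
]
print(attention_section(items))
```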

Health assessment logic

Apply thresholds from references/metrics-thresholds.md with these principles:

Context matters more than absolute numbers. A web server at 70% CPU during peak hours is different from 70% at 3 AM.
Trends matter more than snapshots. 60% disk usage climbing 2% per day is more urgent than 80% stable for months.
Compound signals. High CPU + high memory + high disk I/O together = investigate. Any one alone at warning level = monitor.
Volume-aware thresholds. Percentage thresholds must account for absolute capacity. 90% of 10GB needs action sooner than 90% of 1TB.
Uptime context. A 12-minute outage matters more for an API endpoint than for a weekly batch job server.
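
To make the "trends matter more than snapshots" principle concrete, here is a small sketch that projects how many days remain until a usage threshold is reached. It assumes linear growth measured in percentage points per day and uses 90% as the default threshold, echoing the examples in this document; the function name days_until is hypothetical.

```python
import math
from typing import Optional


def days_until(
    current_pct: float,
    growth_pct_per_day: float,
    threshold_pct: float = 90.0,
) -> Optional[int]:
    """Days until usage reaches the threshold under linear growth.

    Returns 0 if the threshold is already reached and None if usage
    is flat or falling (the threshold is never reached).
    """
    if current_pct >= threshold_pct:
        return 0
    if growth_pct_per_day <= 0:
        return None
    return math.ceil((threshold_pct - current_pct) / growth_pct_per_day)


# The two cases from the principle above.
print(days_until(60, 2.0))  # 60% climbing 2 points/day -> 15
print(days_until(80, 0.0))  # 80% and stable -> None
```

Under these assumptions the 60% volume reaches 90% in 15 days, while the stable 80% volume never does, which is why the first one is the more urgent of the two.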

Severity classification

Read references/alert-severity.md for the full classification system. Summary:

Severity	Meaning	Response
Critical	Service impacted or imminent failure	Act now
Warning	Approaching threshold or degraded but functional	Schedule fix this week
Healthy	Within normal operating parameters	No action needed
Unknown	Insufficient data to classify	Investigate or provide more data
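
The four-level table can be sketched as a small classifier. The thresholds live in references/metrics-thresholds.md, which is not reproduced here, so the warning_at and critical_at values in the usage lines are placeholders and not the skill's real thresholds. The names Severity and classify are hypothetical.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    # The value of each member is the response from the table above.
    CRITICAL = "Act now"
    WARNING = "Schedule fix this week"
    HEALTHY = "No action needed"
    UNKNOWN = "Investigate or provide more data"


def classify(value: Optional[float], warning_at: float, critical_at: float) -> Severity:
    """Classify a 'higher is worse' metric; a missing value is UNKNOWN."""
    if value is None:
        return Severity.UNKNOWN
    if value >= critical_at:
        return Severity.CRITICAL
    if value >= warning_at:
        return Severity.WARNING
    return Severity.HEALTHY


# Placeholder thresholds (80 / 90), chosen only for illustration.
for reading in (92.0, 84.5, 41.0, None):
    level = classify(reading, warning_at=80, critical_at=90)
    print(reading, level.name, "->", level.value)
```

Treating a missing reading as Unknown, not as Healthy, keeps a gap in the data from being reported as good news.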

SSL certificate monitoring

When checking HTTPS endpoints:

Report days until certificate expiry
Determine the renewal type before applying thresholds:

Auto-renew certs (Let's Encrypt, managed cloud certs, etc.):

Critical: <7 days remaining (renewal has almost certainly failed)
Warning: <14 days remaining (renewal should have triggered — investigate)
Healthy: ≥14 days remaining

Manual renewal certs (purchased certs, enterprise CA, self-managed):

Critical: <14 days remaining (not enough lead time for procurement/install)
Warning: <45 days remaining (start renewal process now)
Healthy: ≥45 days remaining

Unknown renewal type (cannot determine auto vs. manual):

Critical: <7 days remaining
Warning: <30 days remaining
Healthy: ≥30 days remaining

How to determine renewal type: check the certificate issuer. Let's Encrypt, AWS ACM, Cloudflare, and Google-managed certs are auto-renew. Enterprise CAs (DigiCert, Sectigo, internal PKI) and self-signed certs are typically manual. When in doubt, classify as unknown and note the ambiguity.

Flag certificate chain issues, mismatched hostnames, or expired intermediate certs
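
The renewal-type thresholds above can be sketched as a lookup table plus a classifier. The day counts come from this section. The issuer substrings are illustrative assumptions (real issuer strings vary), and a certificate that sits exactly on a boundary day is treated as the less severe level, which is also an assumption. The function names are hypothetical.

```python
# (critical_below, warning_below) in days, taken from the thresholds above.
THRESHOLDS = {
    "auto": (7, 14),
    "manual": (14, 45),
    "unknown": (7, 30),
}

# Illustrative issuer substrings only; real issuer strings vary.
AUTO_HINTS = ("let's encrypt", "amazon", "cloudflare", "google")
MANUAL_HINTS = ("digicert", "sectigo")


def renewal_type(issuer: str) -> str:
    """Guess the renewal type from the issuer name."""
    name = issuer.lower()
    if any(hint in name for hint in AUTO_HINTS):
        return "auto"
    if any(hint in name for hint in MANUAL_HINTS):
        return "manual"
    return "unknown"  # when in doubt, classify as unknown


def classify_cert(days_remaining: int, issuer: str) -> tuple[str, str]:
    """Return (severity, renewal_type) for a certificate."""
    kind = renewal_type(issuer)
    critical_below, warning_below = THRESHOLDS[kind]
    if days_remaining < critical_below:
        return "critical", kind
    if days_remaining < warning_below:
        return "warning", kind
    return "healthy", kind


# The same 10 days remaining reads differently per renewal type.
print(classify_cert(10, "Let's Encrypt"))
print(classify_cert(10, "DigiCert Inc"))
print(classify_cert(20, "Example Internal CA"))
```

Returning the renewal type alongside the severity lets the report note the ambiguity when the type could not be determined.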

Incident detection and grouping

When multiple alerts fire for the same root cause:

Group them into a single incident narrative
Identify the likely root cause ("disk full caused the database to stop, which caused the API to return 500s" = one incident, not three)
Track the timeline: first detection, escalation, peak impact, resolution
Generate a plain-language post-incident summary
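
One way to group alerts by root cause is to walk a dependency map upward for as long as the upstream component is also alerting. The sketch below assumes such a map is available; the DEPENDS_ON map, the component names, and the alert messages are hypothetical, and the skill itself does not prescribe this method.

```python
from collections import defaultdict

# Hypothetical dependency map: component -> the component it depends on.
DEPENDS_ON = {"api": "database", "database": "disk"}


def root_cause(component: str, alerting: set[str]) -> str:
    """Follow dependencies upward while the upstream component is also alerting."""
    seen = {component}
    while True:
        upstream = DEPENDS_ON.get(component)
        # Stop at the top of the chain, at a healthy upstream, or on a cycle.
        if upstream is None or upstream not in alerting or upstream in seen:
            return component
        seen.add(upstream)
        component = upstream


def group_incidents(alerts: list[tuple[str, str]]) -> dict[str, list[str]]:
    """alerts: (component, message) pairs in detection order."""
    alerting = {component for component, _ in alerts}
    incidents = defaultdict(list)
    for component, message in alerts:
        incidents[root_cause(component, alerting)].append(message)
    return dict(incidents)


# Made-up alerts mirroring the disk -> database -> API example above.
alerts = [
    ("disk", "disk full"),
    ("database", "database stopped"),
    ("api", "API returning 500s"),
    ("cache", "cache hit rate low"),
]
print(group_incidents(alerts))
```

With these inputs the disk, database, and API alerts collapse into one incident keyed by "disk", and the unrelated cache alert stays separate. Because each message list keeps detection order, it can seed the incident timeline.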

Sparse-data and partial-check handling

When the user provides incomplete data:

Assess what you can — don't refuse the whole check because one metric is missing
Name the gaps — "I can assess CPU and memory but you didn't provide disk usage — want me to check?"
Adjust confidence — partial data gets a qualified assessment, not a definitive one
Suggest the full picture — "For a complete health check, I'd also need: [list]"
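
A partial check can be sketched as a comparison between the metrics provided and the metrics a full check expects. The expected set below reuses the key metrics named in the server health summary (CPU, memory, disk, uptime). The function name, the returned fields, and the confidence labels are hypothetical.

```python
# Key metrics named in the server health summary.
FULL_CHECK = ("cpu", "memory", "disk", "uptime")


def assess_partial(provided: dict[str, float]) -> dict:
    """Report what can be assessed, what is missing, and how confident to be."""
    present = [m for m in FULL_CHECK if m in provided]
    missing = [m for m in FULL_CHECK if m not in provided]
    if not present:
        confidence = "no data"    # nothing to assess: ask, don't invent metrics
    elif missing:
        confidence = "qualified"  # partial data gets a qualified assessment
    else:
        confidence = "full"
    note = ""
    if missing:
        note = "For a complete health check, I'd also need: " + ", ".join(missing)
    return {
        "assessed": present,
        "missing": missing,
        "confidence": confidence,
        "note": note,
    }


# Made-up readings: CPU and memory provided, disk and uptime missing.
result = assess_partial({"cpu": 42.0, "memory": 63.5})
print(result["confidence"])
print(result["note"])
```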

No-data gate

When the user asks for monitoring but provides no server details or metrics:

Ask what they're monitoring (server, endpoint, or both)
Suggest the minimum viable check: provide hostname/IP and what services run on it
Offer a starter monitoring checklist from references/monitoring-checklists.md
Provide a sample health check output so they know what to expect

Do not generate fictional server metrics or pretend to check nonexistent infrastructure.

Boundaries

No access without explicit configuration. Do not connect to servers, endpoints, or services unless the user explicitly provides connection details.
No credential storage in skill files. Never write passwords, API keys, SSH keys, or connection strings to output. Reference environment variables or secret managers only.
No direct remediation — guidance only. This skill provides monitoring, diagnosis, and guidance. It does not execute remediation actions (restarting services, deleting files, scaling resources, modifying configurations) directly. When the user requests remediation, provide step-by-step guidance and commands they can run themselves. Explain risks before each step.
Monitoring data vs. diagnostic speculation. Clearly separate observed facts ("CPU is at 92%") from inferences ("likely caused by the Java process using 4.2GB heap"). Label each.
No real-time detection guarantee. The skill runs on-demand or at check intervals. It is not a kernel-level monitor or hardware watchdog. State this clearly.
No PII in monitoring output. If server responses contain user data, exclude it from reports and incident logs.
Scope: infrastructure, not application. This skill monitors servers and endpoints, not business-level KPIs. Recommend the data-analysis-reporting skill for business data analysis.

Data source: ClawHub · Chinese localization: 龙虾技能库
