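Runtime dependencies
Version
v0.1.0: First public release. Infrastructure health reporting for solo operators and small teams. Server health checks, availability monitoring, SSL certificate expiry tracking, incident logging, severity classification, and compound signal detection. Attention-first reporting: it tells you what needs action, not 47 numbers. Scored 43/45 in QA across 12 test cases.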
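Skill documentation
(The original content is long, so what follows is an abridged version; see the original document for the full text.)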
name: infra-monitoring ...
# Infrastructure and Availability Monitoring
Monitor your servers and endpoints like a sharp ops engineer who tells you what needs attention, not a dashboard that dumps 47 numbers.
Trigger conditions
Activate this skill when the user:
- Asks to check server health, status, or resource usage
- Provides server metrics (CPU, memory, disk, network) for assessment
- Asks about uptime, downtime, or availability of endpoints
- Asks to check SSL certificate expiry dates
- Provides system output from commands like top, htop, df, free, uptime, vmstat, iostat
- Asks for a status report on their infrastructure
- Mentions monitoring, health checks, or incident detection for servers
- Asks about capacity planning or resource trending
- Provides ping, curl, or HTTP response data for analysis
- Asks to set up monitoring for a new server or endpoint
Do NOT activate when:
- The user wants application-level business metrics (suggest data-analysis-reporting skill)
- The user needs APM or distributed tracing (suggest Datadog/New Relic)
- The user wants to build a monitoring dashboard UI
- The user needs real-time streaming metrics ingestion
- The user wants network security scanning or penetration testing
- The user asks about cloud provider billing or cost optimization without infra context
Work the request in this order
- Understand the scope — before running any checks, understand what the user is monitoring and why:
- Gather the data — collect or parse the infrastructure data:
- Assess health (evaluate each metric against the thresholds in references/metrics-thresholds.md):
- Consider context: 85% disk on a 20GB volume is more urgent than on a 2TB volume
- Detect change direction: is the metric stable, climbing, or dropping
- Identify correlations: high CPU + high swap often means memory pressure, not CPU problem
- Check for compound risk: multiple warnings that individually are fine but together signal trouble
- Build the status report (structured output following the default format below):
- Recommend actions — concrete next steps prioritized by urgency:
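As an illustration of the "gather the data" and "consider context" steps, here is a minimal sketch that parses `df -Pk` output and reports absolute free space next to the percentage. The `Filesystem` class, the `parse_df_posix` helper, and the sample text are hypothetical and not part of the skill; the sketch assumes the POSIX portable format with 1024-byte blocks.

```python
from dataclasses import dataclass


@dataclass
class Filesystem:
    mount: str
    size_kb: int
    used_kb: int
    avail_kb: int

    @property
    def used_pct(self) -> float:
        # Share of usable space (used + available) that is consumed.
        total = self.used_kb + self.avail_kb
        return 100.0 * self.used_kb / total if total else 0.0

    @property
    def avail_gib(self) -> float:
        # Absolute headroom, which a percentage alone hides.
        return self.avail_kb / (1024 * 1024)


def parse_df_posix(text: str) -> list[Filesystem]:
    """Parse `df -Pk` output. Assumes device names contain no spaces."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 5)  # mount point is the last field
        if len(parts) < 6:
            continue
        _device, size, used, avail, _capacity, mount = parts
        rows.append(Filesystem(mount, int(size), int(used), int(avail)))
    return rows


# Illustrative sample only: a 20 GiB volume and a 2 TiB volume, both at 85%.
SAMPLE = """Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 20971520 17825792 3145728 85% /
/dev/sdb1 2147483648 1825361101 322122547 85% /data
"""
```

Both sample volumes report 85% used, but the first has about 3 GiB left while the second has roughly 307 GiB, which is why the smaller volume deserves attention first.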
Default output structure
Use this structure unless the user clearly wants a different format:
- Attention required — the critical and warning items, sorted by severity then urgency. Each item:
If nothing needs attention: "All systems healthy. No action required."
- Server health summary — per-server overview:
- Endpoint status — per-endpoint overview:
- Resource trends — directional indicators for key metrics:
- Incident timeline — recent events if any:
- Recommended actions — 3 concrete next steps:
- System details — raw metric values for reference:
Health assessment logic
Apply thresholds from references/metrics-thresholds.md with these principles:
- Context matters more than absolute numbers. A web server at 70% CPU during peak hours is different from 70% at 3 AM.
- Trends matter more than snapshots. 60% disk usage climbing 2% per day is more urgent than 80% stable for months.
- Compound signals. High CPU + high memory + high disk I/O together = investigate. Any one alone at warning level = monitor.
- Volume-aware thresholds. Percentage thresholds must account for absolute capacity. 90% of 10GB needs action sooner than 90% of 1TB.
- Uptime context. A 12-minute outage matters more for an API endpoint than for a weekly batch job server.
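The "trends matter more than snapshots" principle can be made concrete with a simple linear projection. This is only a sketch: the `days_until_full` name is hypothetical, and real growth is rarely perfectly linear, so the result is a rough estimate rather than a forecast.

```python
from typing import Optional


def days_until_full(used_pct: float,
                    growth_pct_per_day: float,
                    limit_pct: float = 100.0) -> Optional[float]:
    """Linear projection of days until usage reaches limit_pct.

    Returns None when usage is flat or shrinking, because no
    exhaustion date can be projected from the trend.
    """
    if growth_pct_per_day <= 0:
        return None
    return max(0.0, (limit_pct - used_pct) / growth_pct_per_day)
```

With the numbers from the example above, `days_until_full(60, 2)` gives 20.0 days, while `days_until_full(80, 0)` returns `None`: the volume at 60% is the one with a deadline.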
Severity classification
Read references/alert-severity.md for the full classification system. Summary:
| Severity | Meaning | Response |
|---|---|---|
| Critical | Service impacted or imminent failure | Act now |
| Warning | Approaching threshold or degraded but functional | Schedule fix this week |
| Healthy | Within normal operating parameters | No action needed |
| Unknown | Insufficient data to classify | Investigate or provide more data |
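One way to represent the table in code is an ordered enum plus a lookup of responses, which also supports the "sorted by severity then urgency" rule of the attention list. The sketch below is hypothetical: the `Finding` class and the `hours_to_impact` urgency field are assumptions for illustration, and where Unknown sits in the sort order is a choice the source does not specify.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower value sorts first. The position of UNKNOWN is an assumption.
    CRITICAL = 0
    WARNING = 1
    UNKNOWN = 2
    HEALTHY = 3


# Responses copied from the severity table.
RESPONSE = {
    Severity.CRITICAL: "Act now",
    Severity.WARNING: "Schedule fix this week",
    Severity.HEALTHY: "No action needed",
    Severity.UNKNOWN: "Investigate or provide more data",
}


@dataclass
class Finding:
    target: str
    summary: str
    severity: Severity
    hours_to_impact: float = float("inf")  # hypothetical urgency measure


def attention_required(findings: list[Finding]) -> list[Finding]:
    """Keep critical and warning items, sorted by severity, then urgency."""
    items = [f for f in findings
             if f.severity in (Severity.CRITICAL, Severity.WARNING)]
    return sorted(items, key=lambda f: (f.severity, f.hours_to_impact))
```

If `attention_required` returns an empty list, the report would fall back to the "All systems healthy. No action required." line.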
SSL certificate monitoring
When checking HTTPS endpoints:
- Report days until certificate expiry
- Determine the renewal type before applying thresholds:
Auto-renew certs (Let's Encrypt, managed cloud certs, etc.):
- Critical: <7 days remaining (renewal has almost certainly failed)
- Warning: <14 days remaining (renewal should have triggered — investigate)
- Healthy: >14 days remaining
Manual renewal certs (purchased certs, enterprise CA, self-managed):
- Critical: <14 days remaining (not enough lead time for procurement/install)
- Warning: <45 days remaining (start renewal process now)
- Healthy: >45 days remaining
Unknown renewal type (cannot determine auto vs. manual):
- Critical: <7 days remaining
- Warning: <30 days remaining
- Healthy: >30 days remaining
How to determine renewal type: check the certificate issuer. Let's Encrypt, AWS ACM, Cloudflare, and Google-managed certs are auto-renew. Enterprise CAs (DigiCert, Sectigo, internal PKI) and self-signed certs are typically manual. When in doubt, classify as unknown and note the ambiguity.
- Flag certificate chain issues, mismatched hostnames, or expired intermediate certs
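The threshold rules above can be sketched as a small lookup. Everything named here is hypothetical: the issuer substrings are rough hints based on the names in the text (actual issuer strings in certificates may differ), and treating a value that lands exactly on a threshold as the less severe class is an assumption, since the text only defines "<" and ">".

```python
# (critical_below, warning_below) in days, taken from the lists above.
THRESHOLDS = {
    "auto": (7, 14),
    "manual": (14, 45),
    "unknown": (7, 30),
}

# Illustrative substrings only; real issuer names vary.
AUTO_HINTS = ("let's encrypt", "amazon", "cloudflare", "google")
MANUAL_HINTS = ("digicert", "sectigo")


def renewal_type(issuer: str, self_signed: bool = False) -> str:
    """Guess the renewal type from the issuer; fall back to 'unknown'."""
    name = issuer.lower()
    if self_signed or any(hint in name for hint in MANUAL_HINTS):
        return "manual"
    if any(hint in name for hint in AUTO_HINTS):
        return "auto"
    return "unknown"


def classify_cert(days_remaining: int, renewal: str) -> str:
    """Map days until expiry to a severity for the given renewal type."""
    critical_below, warning_below = THRESHOLDS[renewal]
    if days_remaining < critical_below:
        return "critical"
    if days_remaining < warning_below:
        return "warning"
    return "healthy"  # assumption: exactly at the threshold is healthy
```

The same 10 days remaining yields a warning for an auto-renew certificate but critical for a manually renewed one, which is the reason for determining the renewal type first.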
Incident detection and grouping
When multiple alerts fire for the same root cause:
- Group them into a single incident narrative
- Identify the likely root cause ("disk full caused the database to stop, which caused the API to return 500s" = one incident, not three)
- Track the timeline: first detection, escalation, peak impact, resolution
- Generate a plain-language post-incident summary
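A minimal sketch of the grouping step, assuming that alerts firing close together in time belong to one incident. The `Alert` and `Incident` classes and the 10-minute window are hypothetical, and treating the earliest alert as the likely root cause is only a heuristic starting point, not a confirmed diagnosis.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    at: datetime
    source: str
    message: str


@dataclass
class Incident:
    alerts: list = field(default_factory=list)

    @property
    def likely_root_cause(self) -> Alert:
        # Heuristic: the first alert in the chain is a candidate cause.
        return self.alerts[0]

    @property
    def timeline(self) -> list:
        return [(a.at, f"{a.source}: {a.message}") for a in self.alerts]


def group_alerts(alerts, window=timedelta(minutes=10)):
    """Chain alerts into one incident when each follows the previous
    alert within `window`; otherwise start a new incident."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.at):
        if incidents and alert.at - incidents[-1].alerts[-1].at <= window:
            incidents[-1].alerts.append(alert)
        else:
            incidents.append(Incident([alert]))
    return incidents
```

Fed the disk, database, and API alerts from the example above within a few minutes of each other, this returns one incident with three timeline entries rather than three separate incidents.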
Sparse-data and partial-check handling
When the user provides incomplete data:
- Assess what you can — don't refuse the whole check because one metric is missing
- Name the gaps — "I can assess CPU and memory but you didn't provide disk usage — want me to check?"
- Adjust confidence — partial data gets a qualified assessment, not a definitive one
- Suggest the full picture — "For a complete health check, I'd also need: [list]"
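The partial-check behaviour can be sketched as follows. The metric names, the `assess_partial` helper, and the threshold values passed in are hypothetical; the actual thresholds live in references/metrics-thresholds.md and are not reproduced here.

```python
EXPECTED_METRICS = ("cpu_pct", "mem_pct", "disk_pct")


def assess_partial(metrics: dict, warn_at: dict) -> dict:
    """Assess whatever was provided and name what is missing."""
    provided = {k: v for k, v in metrics.items()
                if k in EXPECTED_METRICS and v is not None}
    missing = [k for k in EXPECTED_METRICS if k not in provided]
    over = sorted(k for k, v in provided.items() if v >= warn_at[k])
    return {
        "assessed": sorted(provided),
        "missing": missing,
        "over_threshold": over,
        # A partial check gets a qualified verdict, never a definitive one.
        "confidence": "full" if not missing else "partial",
    }
```

For example, passing CPU and memory readings without disk usage yields `"missing": ["disk_pct"]` and `"confidence": "partial"`, which maps directly onto the "name the gaps" and "adjust confidence" rules.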
No-data gate
When the user asks for monitoring but provides no server details or metrics:
- Ask what they're monitoring (server, endpoint, or both)
- Suggest the minimum viable check: provide hostname/IP and what services run on it
- Offer a starter monitoring checklist from references/monitoring-checklists.md
- Provide a sample health check output so they know what to expect
Do not generate fictional server metrics or pretend to check nonexistent infrastructure.
Boundaries
- No access without explicit configuration. Do not connect to servers, endpoints, or services unless the user explicitly provides connection details.
- No credential storage in skill files. Never write passwords, API keys, SSH keys, or connection strings to output. Reference environment variables or secret managers only.
- No direct remediation — guidance only. This skill provides monitoring, diagnosis, and guidance. It does not execute remediation actions (restarting services, deleting files, scaling resources, modifying configurations) directly. When the user requests remediation, provide step-by-step guidance and commands they can run themselves. Explain risks before each step.
- Monitoring data vs. diagnostic speculation. Clearly separate observed facts ("CPU is at 92%") from inferences ("likely caused by the Java process using 4.2GB heap"). Label each.
- No real-time detection guarantee. The skill runs on-demand or at check intervals. It is not a kernel-level monitor or hardware watchdog. State this clearly.
- No PII in monitoring output. If server responses contain user data, exclude it from reports and incident logs.
- Scope: infrastructure, not application. This skill monitors servers and endpoints, not business-level KPIs. Recommend the data-analysis-reporting skill for business data analysis.
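The "observed facts vs. inferences" boundary is easy to enforce if every statement carries its label from the start. A small sketch, with a hypothetical `Statement` class and label wording:

```python
from dataclasses import dataclass

LABELS = {"observed": "Observed", "inferred": "Inference"}


@dataclass
class Statement:
    kind: str  # "observed" or "inferred"
    text: str


def render(statements: list[Statement]) -> str:
    """Prefix each line with its label so facts and guesses never blend."""
    return "\n".join(f"[{LABELS[s.kind]}] {s.text}" for s in statements)
```

Because `render` looks the label up by key, a statement with an unrecognized kind raises a `KeyError` instead of slipping into the report unlabeled.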