Runtime Requirements
This skill package provides Kubernetes/OpenShift cluster management capabilities. Credentials are modular - only configure what you need for your specific use case.
Always Required
| Requirement | Description | Environment Variable |
|---|---|---|
| Kubeconfig | Valid kubeconfig with cluster access | KUBECONFIG or ~/.kube/config |
| kubectl | Kubernetes CLI | Must be in PATH |
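
Both requirements can be checked before a session starts. The following is a minimal sketch, not part of the skill package: the function names `kubeconfig_path` and `check_prereqs` are illustrative, and it assumes `KUBECONFIG` holds a single path (kubectl also accepts a colon-separated list, which this sketch does not handle).

```shell
# Hypothetical preflight helpers; not shipped with the skill package.

# Print the kubeconfig path to check, assuming KUBECONFIG holds a single path.
kubeconfig_path() {
  printf '%s\n' "${KUBECONFIG:-$HOME/.kube/config}"
}

# Return 0 only if kubectl is on PATH and the kubeconfig file is readable.
check_prereqs() {
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found in PATH" >&2
    return 1
  fi
  if [ ! -r "$(kubeconfig_path)" ]; then
    echo "kubeconfig not readable: $(kubeconfig_path)" >&2
    return 1
  fi
  return 0
}
```

Calling `check_prereqs || exit 1` at the top of a wrapper script would stop early with a readable message instead of failing inside a later kubectl call.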
Conditional - Enable Only As Needed
| Platform | Enable If... | Credentials |
|---|---|---|
| AWS/EKS/ROSA | Managing AWS-hosted Kubernetes | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY |
| Azure/ARO | Managing Azure-hosted Kubernetes | AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID |
| GCP/GKE | Managing GCP-hosted Kubernetes | GOOGLE_APPLICATION_CREDENTIALS |
| ArgoCD | Using GitOps agent | ARGOCD_AUTH_TOKEN, ARGOCD_SERVER |
| Vault | Using secrets management | VAULT_TOKEN |
| GitHub | Pushing to git repositories | GITHUB_TOKEN |
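
Because credentials are modular, it can help to confirm that only the variables for the platforms in use are set. The sketch below is one way to do that under stated assumptions: the platform keys (`aws`, `azure`, `gcp`, `argocd`, `vault`, `github`) and the function names are hypothetical, while the variable names come from the table above.

```shell
# Hypothetical helper: map a platform key to the variables listed in the table.
required_vars() {
  case "$1" in
    aws)    echo "AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY" ;;
    azure)  echo "AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_TENANT_ID" ;;
    gcp)    echo "GOOGLE_APPLICATION_CREDENTIALS" ;;
    argocd) echo "ARGOCD_AUTH_TOKEN ARGOCD_SERVER" ;;
    vault)  echo "VAULT_TOKEN" ;;
    github) echo "GITHUB_TOKEN" ;;
    *)      return 1 ;;
  esac
}

# Print the names (never the values) of unset or empty variables, one per line.
# Returns 2 for an unknown platform key.
missing_vars() {
  names=$(required_vars "$1") || { echo "unknown platform: $1" >&2; return 2; }
  for name in $names; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
  unset value
}
```

For example, `missing_vars azure` prints nothing when all three Azure variables are set. Only names are printed, so the output is safe to log.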
Session Setup
Before using the agents, you MUST set up a session context:
# Set up session context for your environment
bash skills/orchestrator/scripts/setup-session.sh [context-name]
# Environments: dev, qa, staging, prod
# Note: prod requires human approval for all modifications
Security Considerations
- Agents operate with least privilege by default
- All credential access is logged
- Production modifications require human approval
- Secrets are never logged or stored in code
Security Assessment - Read Before Installing
Source Verification
- This skill pulls code from a third-party GitHub repository
- Verify the source URL before installing:
https://github.com/kcns008/cluster-agent-swarm-skills
- Pin to a specific version - never use the main branch in production:
git clone https://github.com/kcns008/cluster-agent-swarm-skills.git
cd cluster-agent-swarm-skills
git fetch --tags
git checkout v1.0.0 # Use verified release tag or commit hash
Third-Party Script Execution Warning
- This is a scripted skill - it will write executable bash scripts to disk
- Scripts perform cluster operations including: deployments, scaling, scanning, configuration
- Some scripts can be destructive - review before running:
  - Scripts with -delete or -cleanup in the name may remove resources
  - Scripts with -promote or -deploy in the name modify cluster state
- Always test in non-production first
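
The naming hints above can be turned into a quick triage pass before any script is run. This is a sketch under assumptions: `script_risk` and `list_script_risks` are hypothetical helpers, the labels are illustrative, and a name-based check is no substitute for reading each script.

```shell
# Hypothetical helper: label a script by the name patterns called out above.
script_risk() {
  case "$(basename "$1")" in
    *-delete*|*-cleanup*) echo "destructive" ;;
    *-promote*|*-deploy*) echo "modifies-state" ;;
    *)                    echo "review" ;;
  esac
}

# Print "<label><TAB><path>" for every .sh file under a directory.
list_script_risks() {
  find "$1" -type f -name '*.sh' | sort | while IFS= read -r path; do
    printf '%s\t%s\n' "$(script_risk "$path")" "$path"
  done
}
```

Running `list_script_risks skills` from the repository root would list every script with its label, so the destructive ones can be reviewed first.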
Install Mechanism
- Installing via npx skills add downloads and executes code from GitHub
- The skill cannot verify integrity of external scripts
- Audit all scripts locally before running in production
- Consider maintaining a verified, offline copy of trusted scripts
- ALWAYS PIN TO VERIFIED COMMIT HASH for production - NEVER use floating URLs like tree/main or untagged branches
- Use manual git clone with verified checkout for highest security
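
A small guard can reject floating URLs before they reach an install command. The sketch below assumes that "pinned" means the URL contains `tree/` followed by a full 40-character lowercase hex commit hash. The function name `is_pinned_url` is hypothetical.

```shell
# Hypothetical check: succeed only if the URL contains /tree/<40 hex characters>,
# optionally followed by a sub-path.
is_pinned_url() {
  printf '%s\n' "$1" | grep -Eq '/tree/[0-9a-f]{40}(/|$)'
}
```

A wrapper could run `is_pinned_url "$url" || exit 1` before invoking the installer. This only checks the shape of the URL, so the commit itself still has to be reviewed.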
Persistence & Blast Radius
- Agents maintain persistent state across sessions via:
  - WORKING.md - session progress tracking
  - LOGS.md - action audit trail
  - MEMORY.md - long-term learnings
- Agents are configured to commit changes to these files as part of normal operation
- This persistence increases blast radius if misused - limit repository write access if concerned
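
One way to limit the blast radius is to check that an agent commit touches only the three state files. The following sketch assumes the files are identified by base name wherever they live in the repository. The function name `only_state_files` is hypothetical.

```shell
# Hypothetical guard: read changed file paths on stdin (one per line) and
# fail if any path is not one of the three state files named above.
only_state_files() {
  while IFS= read -r path; do
    case "$(basename "$path")" in
      WORKING.md|LOGS.md|MEMORY.md) ;;
      *) echo "unexpected change: $path" >&2; return 1 ;;
    esac
  done
  return 0
}
```

It could be fed from `git diff --cached --name-only | only_state_files` in a pre-commit hook or CI job that runs outside the agent's control.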
Human Approval Enforcement
- The skill documentation states that human approval is required for production changes
- This is a procedural control, NOT a technical enforcement
- Your platform MUST enforce an approval gate before allowing production operations
- Do not rely on agent self-restriction for production safety
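
To illustrate the difference between a procedural control and an enforced gate, here is a minimal sketch of a check a platform could run before any production operation. `APPROVED_BY` is an illustrative variable, not something the skill defines. A real gate would live in CI/CD or admission control, where the agent cannot set the value itself.

```shell
# Hypothetical gate: non-prod contexts pass; prod passes only when an approver
# is recorded in APPROVED_BY (an illustrative variable, not defined by the skill).
require_approval() {
  case "$1" in
    dev|qa|staging) return 0 ;;
    prod)
      if [ -n "${APPROVED_BY:-}" ]; then
        echo "prod change approved by $APPROVED_BY" >&2
        return 0
      fi
      echo "prod change blocked: no human approval recorded" >&2
      return 1
      ;;
    *) echo "unknown environment: $1" >&2; return 2 ;;
  esac
}
```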
Principle of Least Privilege - Required
- DO NOT provide owner/root-level cloud credentials
- Create dedicated, minimal-permission service accounts for:
  - Kubernetes namespace-level access (not cluster-admin)
  - AWS IAM roles with limited EKS permissions
  - Azure service principals with limited subscription access
  - GCP service accounts with limited project permissions
- Never provide production credentials until you have audited the code in non-production
Sandbox Before Production
- Run this skill in an isolated/non-production environment first
- Manually step through scripts to understand their behavior
- Pay special attention to:
  - -cleanup.sh scripts - may delete resources
  - -promote.sh scripts - may promote artifacts
  - -delete.sh scripts - explicitly destructive
- Verify no unexpected network calls to external endpoints
Supply Chain Tools
- Scripts may download binaries (syft, cosign, trivy, etc.)
- Only allow downloads from trusted release sources (official GitHub releases, package managers)
- Consider curating offline toolchains if your environment requires it
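
Downloaded binaries can be checked against a published digest before use. This is a sketch, assuming the expected SHA-256 value comes from the tool's official release page and that either `sha256sum` or `shasum` is installed. The function names are hypothetical.

```shell
# Hypothetical helper: print the SHA-256 digest of a file.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 "$1" | cut -d ' ' -f 1
  fi
}

# Succeed only if the file's digest equals the expected value.
verify_sha256() {
  actual=$(sha256_of "$1")
  if [ "$actual" = "$2" ]; then
    return 0
  fi
  echo "checksum mismatch for $1" >&2
  return 1
}
```

Usage would look like `verify_sha256 ./downloaded-binary "$EXPECTED_DIGEST" || exit 1`, where `EXPECTED_DIGEST` is a placeholder for the published value.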
Additional Documentation
- OPERATIONAL_RISKS.md - Complete documentation of operational risks, inconsistencies, and mitigations
- SECURITY.md - Security policy, external dependencies, and verification requirements
This is the complete cluster-agent-swarm skill package. When you add this skill, you get access to ALL 7 specialized agents working together as a coordinated swarm.
Installation
Security Warning - Read Before Installing
⚠️ CRITICAL SECURITY WARNING

The installation commands below use GitHub URLs that fetch and execute code on your system.
This is a supply chain risk - you must verify the repository and commit before use.

For production deployments:
1. ALWAYS pin to a specific, verified commit hash
2. Review the commit: git show
3. Verify GPG signatures if available: git verify-commit
4. Use the manual clone method below for highest security

NEVER use floating URLs (tree/main, main branch) in production.
Install All Skills (Development Only)
⚠️ NOT FOR PRODUCTION: Uses floating URL without commit pinning.
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills
Install All Skills (Production - Pinned)
✅ RECOMMENDED: Pins to verified commit hash.
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/tree/91c362dba2911f7523f179e7dcc374cf4335814e
Verification steps:
# Verify the commit before installing
git clone https://github.com/kcns008/cluster-agent-swarm-skills
cd cluster-agent-swarm-skills
git checkout 91c362dba2911f7523f179e7dcc374cf4335814e
git show --stat # Review what changed
# Then install using the pinned URL above
Install Individual Skills
⚠️ ALWAYS PIN TO VERIFIED COMMIT - Do not use tree/main in production.
# Orchestrator - Jarvis (task routing, coordination)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/tree/91c362dba2911f7523f179e7dcc374cf4335814e/skills/orchestrator
# Cluster Ops - Atlas (cluster lifecycle, nodes, upgrades)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/tree/91c362dba2911f7523f179e7dcc374cf4335814e/skills/cluster-ops
# GitOps - Flow (ArgoCD, Helm, Kustomize)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/tree/91c362dba2911f7523f179e7dcc374cf4335814e/skills/gitops
# Security - Shield (RBAC, policies, CVEs)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/tree/91c362dba2911f7523f179e7dcc374cf4335814e/skills/security
# Observability - Pulse (metrics, alerts, incidents)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/tree/91c362dba2911f7523f179e7dcc374cf4335814e/skills/observability
# Artifacts - Cache (registries, SBOM, promotions)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/tree/91c362dba2911f7523f179e7dcc374cf4335814e/skills/artifacts
# Developer Experience - Desk (namespaces, onboarding)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/tree/91c362dba2911f7523f179e7dcc374cf4335814e/skills/developer-experience
Manual Installation (Highest Security)
✅ MOST SECURE: No remote code execution, full audit trail.
# Clone and verify
git clone https://github.com/kcns008/cluster-agent-swarm-skills
cd cluster-agent-swarm-skills
# Checkout verified commit
git checkout 91c362dba2911f7523f179e7dcc374cf4335814e
# Verify (optional, if GPG signed)
git verify-commit 91c362dba2911f7523f179e7dcc374cf4335814e
# Review scripts BEFORE copying
# ls skills//scripts/
# cat skills//scripts/.sh
# Copy manually reviewed scripts
cp -r skills/orchestrator ~/.claude/skills/
cp -r skills/cluster-ops ~/.claude/skills/
# ... add other skills as needed
The Swarm — Agent Roster
| Agent | Code Name | Session Key | Domain |
|---|---|---|---|
| Orchestrator | Jarvis | agent:platform:orchestrator | Task routing, coordination, standups |
| Cluster Ops | Atlas | agent:platform:cluster-ops | Cluster lifecycle, nodes, upgrades |
| GitOps | Flow | agent:platform:gitops | ArgoCD, Helm, Kustomize, deploys |
| Security | Shield | agent:platform:security | RBAC, policies, secrets, scanning |
| Observability | Pulse | agent:platform:observability | Metrics, logs, alerts, incidents |
| Artifacts | Cache | agent:platform:artifacts | Registries, SBOM, promotion, CVEs |
| Developer Experience | Desk | agent:platform:developer-experience | Namespaces, onboarding, support |
Agent Capabilities Summary
What Agents CAN Do
- Read cluster state (kubectl get, kubectl describe, oc get)
- Deploy via GitOps (argocd app sync, Flux reconciliation)
- Create documentation and reports
- Investigate and triage incidents
- Provision standard resources (namespaces, quotas, RBAC)
- Run health checks and audits
- Scan images and generate SBOMs
- Query metrics and logs
- Execute pre-approved runbooks
What Agents CANNOT Do (Human-in-the-Loop Required)
- Delete production resources (kubectl delete in prod)
- Modify cluster-wide policies (NetworkPolicy, OPA, Kyverno cluster policies)
- Make direct changes to secrets without rotation workflow
- Modify network routes or service mesh configuration
- Scale beyond defined resource limits
- Perform irreversible cluster upgrades
- Approve production deployments (can prepare, human approves)
- Change RBAC at cluster-admin level
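
The two lists can be approximated as a default-deny lookup. The sketch below covers only the verbs the lists name explicitly. Treating `delete` outside prod as allowed is an assumption made for illustration, and `guardrail` is a hypothetical function, not part of the skill.

```shell
# Hypothetical classifier for a kubectl/oc verb in a given environment.
# Prints "allow" or "human".
guardrail() {
  env_name=$1
  verb=$2
  case "$verb" in
    get|describe) echo "allow" ;;   # read-only verbs from the CAN list
    delete)
      # Assumption: delete is only restricted in prod.
      if [ "$env_name" = "prod" ]; then echo "human"; else echo "allow"; fi
      ;;
    *) echo "human" ;;              # default-deny: anything unlisted goes to a human
  esac
}
```

For example, `guardrail prod delete` prints `human`, while `guardrail prod get` prints `allow`.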
Communication Patterns
@Mentions
Agents communicate via @mentions in shared task comments:
@Shield Please review the RBAC for payment-service v3.2 before I sync.
@Pulse Is the CPU spike related to the deployment or external traffic?
@Atlas The staging cluster needs 2 more worker nodes.
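
Routing a comment like the ones above comes down to extracting the code names and mapping them to session keys. This is a minimal sketch: the mapping is taken from the roster table, the function names are hypothetical, and the pattern would also match the `@` in an email address.

```shell
# Hypothetical parser: print each @mentioned code name found in a comment,
# one per line, without the leading "@".
extract_mentions() {
  printf '%s\n' "$1" | grep -oE '@[A-Za-z]+' | cut -c 2-
}

# Map a code name to its session key, using the roster table.
session_key_for() {
  case "$1" in
    Jarvis) echo "agent:platform:orchestrator" ;;
    Atlas)  echo "agent:platform:cluster-ops" ;;
    Flow)   echo "agent:platform:gitops" ;;
    Shield) echo "agent:platform:security" ;;
    Pulse)  echo "agent:platform:observability" ;;
    Cache)  echo "agent:platform:artifacts" ;;
    Desk)   echo "agent:platform:developer-experience" ;;
    *)      return 1 ;;
  esac
}
```

Piping the output of `extract_mentions` through `session_key_for` one name at a time would give the list of sessions to notify.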
Thread Subscriptions
- Commenting on a task → auto-subscribe
- Being @mentioned → auto-subscribe
- Being assigned → auto-subscribe
- Once subscribed → receive ALL future comments on heartbeat
Escalation Path
- Agent detects issue
- Agent attempts resolution within guardrails
- If blocked → @mention another agent or escalate to human
- P1 incidents → all relevant agents auto-notified
Heartbeat Schedule
Agents wake on staggered intervals of 5, 10, or 15 minutes, depending on role:
*/5 * * * * Atlas (Cluster Ops - needs fast response for incidents)
*/5 * * * * Pulse (Observability - needs fast response for alerts)
*/5 * * * * Shield (Security - fast response for CVEs and threats)
*/10 * * * * Flow (GitOps - deployments can wait a few minutes)
*/10 * * * * Cache (Artifacts - promotions are scheduled)
*/15 * * * * Desk (DevEx - developer requests aren't usually urgent)
*/15 * * * * Orchestrator (Coordination - overview and standups)
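
The schedule can be expressed as a lookup plus a modulo check. The sketch below assumes each agent fires whenever the minute of the hour is a multiple of its interval. The function names are hypothetical, and the intervals are the 5, 10, and 15 minute values listed above.

```shell
# Wake interval in minutes per agent, taken from the schedule above.
interval_for() {
  case "$1" in
    Atlas|Pulse|Shield)        echo 5 ;;
    Flow|Cache)                echo 10 ;;
    Desk|Jarvis|Orchestrator)  echo 15 ;;
    *)                         return 1 ;;
  esac
}

# Succeed if the agent is due at the given minute of the hour (0-59).
# Assumption: an agent fires when minute % interval == 0.
is_due() {
  interval=$(interval_for "$1") || return 2
  minute=${2#0}            # strip one leading zero so "08" is not read as octal
  minute=${minute:-0}      # a bare "0" becomes empty above; restore it
  [ $(( minute % interval )) -eq 0 ]
}
```

A scheduler loop could call `is_due Flow "$(date +%M)"` once a minute and wake the agent only when it succeeds.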
Key Principles
- Roles over genericism — Each agent has a defined SOUL with exactly who they are
- Files over mental notes — Only files persist between sessions
- Staggered schedules — Don't wake all agents at once
- Shared context — One source of truth for tasks and communication
- Heartbeat, not always-on — Balance responsiveness with cost
- Human-in-the-loop — Critical actions require approval
- Guardrails over freedom — Define what agents can and cannot do
- Audit everything — Every action logged to activity feed
- Reliability first — System stability always wins over new features
- Security by default — Deny access, approve by exception
Detailed Agent Capabilities
Orchestrator (Jarvis)
- Task routing: determining which agent should handle which request
- Workflow orchestration: coordinating multi-agent operations
- Daily standups: compiling swarm-wide status reports
- Priority management: determining urgency and sequencing of work
- Cross-agent communication: facilitating collaboration
- Accountability: tracking what was promised vs what was delivered
Cluster Ops (Atlas)
- OpenShift/Kubernetes cluster operations (upgrades, scaling, patching)
- Node pool management and autoscaling
- Resource quota management and capacity planning
- Network troubleshooting (OVN-Kubernetes, Cilium, Calico)
- Storage class management and PVC/CSI issues
- etcd backup, restore, and health monitoring
- Multi-platform expertise (OCP, EKS, AKS, GKE, ROSA, ARO)
GitOps (Flow)
- ArgoCD application management (sync, rollback, sync waves, hooks)
- Helm chart development, debugging, and templating
- Kustomize overlays and patch generation
- ApplicationSet templates for multi-cluster deployments
- Deployment strategy management (canary, blue-green, rolling)
- Git repository management and branching strategies
- Drift detection and remediation
- Secrets management integration (Vault, Sealed Secrets, External Secrets)
Security (Shield)
- RBAC audit and management
- NetworkPolicy review and enforcement
- Security policy validation (OPA, Kyverno)
- Vulnerability scanning (image scanning, CVE triage)
- Secret rotation workflows
- Security incident investigation
- Compliance reporting
Observability (Pulse)
- Prometheus/Grafana metric queries
- Log aggregation and search (Loki, Elasticsearch)
- Alert triage and investigation
- SLO tracking and error budget monitoring
- Incident response coordination
- Dashboards and visualization
- Telemetry pipeline troubleshooting
Artifacts (Cache)
- Container registry management
- Image scanning and CVE analysis
- SBOM generation and tracking
- Artifact promotion workflows
- Version management
- Registry caching and proxying
Developer Experience (Desk)
- Namespace provisioning
- Resource quota and limit range management
- Developer onboarding
- Template generation
- Developer support and troubleshooting
- Documentation generation
File Structure
cluster-agent-swarm-skills/
├── SKILL.md # This file - combined swarm
├── AGENTS.md # Swarm configuration and protocols
├── skills/
│ ├── orchestrator/ # Jarvis - task routing
│ │ └── SKILL.md
│ ├── cluster-ops/ # Atlas - cluster operations
│ │ └── SKILL.md
│ ├── gitops/ # Flow - GitOps
│ │ └── SKILL.md
│ ├── security/ # Shield - security
│ │ └── SKILL.md
│ ├── observability/ # Pulse - monitoring
│ │ └── SKILL.md
│ ├── artifacts/ # Cache - artifacts
│ │ └── SKILL.md
│ └── developer-experience/ # Desk - DevEx
│ └── SKILL.md
├── scripts/ # Shared scripts
└── references/ # Shared documentation
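
After a manual clone, the layout above can be sanity-checked before any skill is copied. This is a sketch: `missing_skill_docs` is a hypothetical helper, and the directory names are the seven listed in the tree.

```shell
# Hypothetical check: print each expected skill directory that lacks a SKILL.md.
# $1 is the repository root.
missing_skill_docs() {
  for skill in orchestrator cluster-ops gitops security observability \
               artifacts developer-experience; do
    [ -f "$1/skills/$skill/SKILL.md" ] || echo "skills/$skill"
  done
}
```

An empty result from `missing_skill_docs .` means every skill directory has its SKILL.md in place.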
Reference Documentation
For detailed capabilities of each agent, refer to individual SKILL.md files:
skills/orchestrator/SKILL.md - Full Orchestrator documentation
skills/cluster-ops/SKILL.md - Full Cluster Ops documentation
skills/gitops/SKILL.md - Full GitOps documentation
skills/security/SKILL.md - Full Security documentation
skills/observability/SKILL.md - Full Observability documentation
skills/artifacts/SKILL.md - Full Artifacts documentation
skills/developer-experience/SKILL.md - Full Developer Experience documentation