skill-evaluation
v1
Evaluate any AI skill's quality through step-by-step diagnosis: measure trigger accuracy, per-step execution (completion, correctness, quality), efficiency, and safety, then produce a structured report with Bad Cases highlighted and actionable fixes. Supports iterative optimization with version control. Use this skill whenever someone wants to test a skill, evaluate prompt quality, benchmark a skill, diagnose why a skill underperforms, compare skill versions, check if a skill is production-ready, or get a quality assessment for any AI skill or prompt.
Install command: openclaw skills install skill-evaluation
Skill documentation
Skill Eval
A diagnostic instrument for AI skills. Feed it any skill and get back a structured report that tells you exactly what's working, what's broken (Bad Cases), and what to fix, then iterate until the skill passes.
Philosophy
Most skill testing today is vibes-based: run a couple of examples, eyeball the output, ship it. That's fine for prototyping. It's not fine for anything you're putting in front of real users.
Skill Eval treats evaluation as diagnosis:
- Low-score system, not percentages: Steps are scored on 2-point or 3-point scales, not 100-point scales. Simple and honest: did it complete? Was it correct? Did it follow the rules?
- Three independent scores per step: Completion (0/1), Correctness (0/1/2), Execution quality (0/1/2). Never combined into a weighted total. Each tells you something different.
- Bad Cases first: The report leads with failures, not averages. A 1.8/2 average means nothing if one case completely breaks.
- Iterative optimization: Test, find Bad Cases, fix the skill, re-test. Track versions. Stop when Bad Cases reach zero.
- Expected results before execution: Every test case defines what SHOULD happen, step by step. Scoring compares actual vs expected. No post-hoc rationalization.
- Baseline proves value: Run the same cases without the skill to prove it actually helps. A skill that doesn't beat the bare model has no reason to exist.
- Scoring stability is verifiable: In Deep Eval mode, the Judge scores 3 times. If scores diverge, the result is marked UNCERTAIN and requires human arbitration.
- Code checks before LLM checks: Every must_contain item has a check_type (exact, regex, or semantic). Exact and regex are checked by code. Only semantic uses LLM judgment.

When to Use This Skill
- You've written a skill and want to know if it's actually good before sharing it
- You've made changes and want to verify that version N+1 fixed the Bad Cases from version N
- A skill "works sometimes" and you need to find out exactly which steps fail and why
- You need a quality gate before deploying a skill to production
- Someone asks "is this prompt/skill any good?" and you want a rigorous answer

When NOT to Use This Skill
- You want to create a skill from scratch (use skill-creator for that)
- The "skill" is a single-line prompt with no structure (but see Phase 0: we can still infer steps from unstructured prompts)

The Evaluation Pipeline

Input skill
    |
    v
[Phase 0] Structure Assessment ──> Testability level + eval strategy
    |
    v
[Phase 1] Skill Dissection ──> Step list + operation types + expected outputs
    |
    v
[Phase 2] Test Case Design ──> Cases with per-step expected results + check_types
    |
    v
[Phase 3] Execute & Record ──> Per-step actual behavior + Baseline comparison run
    |
    v
[Phase 4] Score & Verify ──> Three scores per step + stability verification
    |
    v
[Phase 5] Report & Iterate ──> Bad Cases + averages + Baseline gain + optimization
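As a rough illustration of the scoring principles listed under Philosophy, the Python sketch below keeps the three per-step scores separate, treats a step as UNCERTAIN when repeated Judge runs disagree, and routes exact and regex checks through code. The names `StepScore`, `resolve_judge_runs`, `run_check`, and `semantic_judge` are hypothetical and not part of the skill. Two behaviors are assumptions: "diverge" is read as "the runs are not all identical", and an exact check is read as a substring match.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class StepScore:
    """Three independent scores for one step; deliberately no combined total."""
    completion: int   # 0 or 1
    correctness: int  # 0, 1, or 2
    quality: int      # execution quality: 0, 1, or 2

    def __post_init__(self) -> None:
        if self.completion not in (0, 1):
            raise ValueError("completion must be 0 or 1")
        if self.correctness not in (0, 1, 2):
            raise ValueError("correctness must be 0, 1, or 2")
        if self.quality not in (0, 1, 2):
            raise ValueError("quality must be 0, 1, or 2")


def resolve_judge_runs(runs: Sequence[StepScore]) -> Optional[StepScore]:
    """Return the agreed score, or None (meaning UNCERTAIN) if the runs differ.

    Assumption: any difference between runs counts as divergence.
    """
    if not runs:
        raise ValueError("at least one judge run is required")
    first = runs[0]
    return first if all(run == first for run in runs) else None


def run_check(
    check_type: str,
    expected: str,
    output: str,
    semantic_judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Evaluate one must_contain item; only 'semantic' defers to a judge."""
    if check_type == "exact":
        # Assumption: 'exact' means a case-sensitive substring match.
        return expected in output
    if check_type == "regex":
        return re.search(expected, output) is not None
    if check_type == "semantic":
        if semantic_judge is None:
            raise ValueError("semantic checks need a judge callable")
        # Hypothetical hook standing in for the LLM judgment.
        return semantic_judge(expected, output)
    raise ValueError(f"unknown check_type: {check_type!r}")


# Three identical runs agree; one differing run makes the step UNCERTAIN.
stable = resolve_judge_runs([StepScore(1, 2, 2)] * 3)
unstable = resolve_judge_runs(
    [StepScore(1, 2, 2), StepScore(1, 1, 2), StepScore(1, 2, 2)]
)
```

Returning `None` instead of picking a majority score keeps the human-arbitration requirement visible to the caller.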
Phase 0: Structure Assessment
Before doing anything else, evaluate the skill's testability:
Run the structure checklist (see references/methodology.md §1B):
- Has explicit step numbers/titles?
- Each step has an input/output description?
- Has method/tool specifications?
- Has constraints?
- Has an expected deliverable description?
Determine the structure level:
- High (6/6 checks): Proceed directly to Phase 1
- Medium (3-5/6): Supplement step expectations, then proceed
- Low (0-2/6): Infer steps from the prompt text using verb-based decomposition
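The thresholds above can be sketched as a small lookup. This is a minimal illustration that assumes a six-item checklist, as the "/6" thresholds imply; `structure_level` and `STRATEGY` are hypothetical names, and only the cut-offs and strategy wording come from the text.

```python
# Strategy per level, worded as in the Phase 0 description.
STRATEGY = {
    "high": "Proceed directly to Phase 1",
    "medium": "Supplement step expectations, then proceed",
    "low": "Infer steps from the prompt text using verb-based decomposition",
}


def structure_level(checks_passed: int, total_checks: int = 6) -> str:
    """Map the number of passed structure checks to a level.

    Assumption: the checklist has six items, matching the '/6' thresholds.
    """
    if not 0 <= checks_passed <= total_checks:
        raise ValueError("checks_passed must be between 0 and total_checks")
    if checks_passed == total_checks:
        return "high"
    if checks_passed >= 3:
        return "medium"
    return "low"


level = structure_level(4)
next_action = STRATEGY[level]
```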
For low-structure skills, infer steps:
- Identify verb instructions: search, analyze, extract, generate, output...
- Order by logical dependency
- Each verb = one inferred step
- Mark all inferred steps with "step_source": "inferred"
Phase 1: Skill Dissection
Read the target skill and build a structured profile:
- Read the skill's SKILL.md (or .mdc, or plain text prompt)
- Extract frontmatter: name, description, version
- Identify the skill's claimed steps: the logical phases it asks the model to go through. Look for numbered lists, markdown headers like ### Step N, or sequential instructions. If the skill doesn't have explicit steps, infer them from the instruction flow.
- For each step, identify the operation type:
  - Data reading (files, databases, user input)
  - API calling (REST, GraphQL, third-party services)
  - Web scraping (page fetching, data extraction)
  - Page manipulation (clicking, form filling, navigation)
  - Data processing (calculation, transformation, aggregation)
  - Content generation (writing, code generation, report creation)
  - File output (saving, exporting)
  - Conditional logic (branching, error handling)
  - Tool invocation (search, external scripts)
- Identify output expectations: what format, what artifacts, what deliverables.
- Note any safety-relevant instructions: file system access, network calls, secrets.
Save this as the skill profile.
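One possible shape for the saved profile is sketched below in Python. The class and field names (`SkillProfile`, `Step`, `OperationType`, `safety_notes`) are hypothetical, and the text does not prescribe a storage format, so JSON is an assumption. The `step_source` value "inferred" comes from Phase 0, while "explicit" is an assumed counterpart for steps the skill states directly.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List


class OperationType(str, Enum):
    """The nine operation types listed in Phase 1."""
    DATA_READING = "data_reading"
    API_CALLING = "api_calling"
    WEB_SCRAPING = "web_scraping"
    PAGE_MANIPULATION = "page_manipulation"
    DATA_PROCESSING = "data_processing"
    CONTENT_GENERATION = "content_generation"
    FILE_OUTPUT = "file_output"
    CONDITIONAL_LOGIC = "conditional_logic"
    TOOL_INVOCATION = "tool_invocation"


@dataclass
class Step:
    index: int
    title: str
    operation_type: OperationType
    expected_output: str
    # "inferred" per Phase 0; "explicit" is an assumed default.
    step_source: str = "explicit"


@dataclass
class SkillProfile:
    name: str
    description: str
    version: str
    steps: List[Step] = field(default_factory=list)
    safety_notes: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # OperationType subclasses str, so json serializes its value.
        return json.dumps(asdict(self), indent=2)


profile = SkillProfile(
    name="demo-skill",  # illustrative values, not from a real skill
    description="Example profile",
    version="v1",
    steps=[
        Step(1, "Search", OperationType.TOOL_INVOCATION,
             "List of candidate results", step_source="inferred"),
    ],
    safety_notes=["network calls"],
)
profile_json = profile.to_json()
```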
Phase 2: Test Case Design
Based on the skill profile, design test cases. Each test case has:
A task prompt — what a real use