skill-evaluation

Evaluate any AI skill's quality through step-by-step diagnosis: measure trigger accuracy, per-step execution (completion/correctness/quality), efficiency, and safety, then produce a structured report with Bad Cases highlighted and actionable fixes. Supports iterative optimization with version control. Use this skill whenever someone wants to test a skill, evaluate prompt quality, benchmark a skill, diagnose why a skill underperforms, compare skill versions, check if a skill is production-ready, or get a quality assessment for any AI skill or prompt.


by @rivin-dong (Rivin-Dong)·MIT-0

Data & API


License

MIT-0


Free to use, modify, and redistribute. No attribution required.


Runtime dependencies

No special dependencies.

Install command


Official: npx clawhub@latest install skill-evaluation

Mirror: npx clawhub@latest install skill-evaluation --registry https://cn.longxiaskill.com (mirror available)

Localization notes

skill-evaluation install note. Install command: openclaw skills install skill-evaluation


Skill documentation

Skill Eval

A diagnostic instrument for AI skills. Feed it any skill and get back a structured report that tells you exactly what's working, what's broken (Bad Cases), and what to fix. Then iterate until the skill passes.

Philosophy

Most skill testing today is vibes-based: run a couple of examples, eyeball the output, ship it. That's fine for prototyping. It's not fine for anything you're putting in front of real users.

Skill Eval treats evaluation as diagnosis:

- Low-score system, not percentage: Steps are scored on 2-point or 3-point scales, not 100-point scales. Simple and honest: did it complete? Was it correct? Did it follow the rules?
- Three independent scores per step: Completion (0/1), Correctness (0/1/2), Execution quality (0/1/2). Never combined into a weighted total. Each tells you something different.
- Bad Cases first: The report leads with failures, not averages. A 1.8/2 average means nothing if one case completely breaks.
- Iterative optimization: Test, find Bad Cases, fix the skill, re-test. Track versions. Stop when Bad Cases reach zero.
- Expected results before execution: Every test case defines what SHOULD happen, step by step. Scoring compares actual vs expected. No post-hoc rationalization.
- Baseline proves value: Run the same cases without the skill to prove it actually helps. A skill that doesn't beat the bare model has no reason to exist.
- Scoring stability is verifiable: In Deep Eval mode, the Judge scores 3 times. If scores diverge, the result is marked UNCERTAIN and requires human arbitration.
- Code checks before LLM checks: Every must_contain item has a check_type (exact, regex, or semantic). Exact and regex are checked by code. Only semantic uses LLM judgment.

When to Use This Skill

- You've written a skill and want to know if it's actually good before sharing it
- You've made changes and want to verify version N+1 fixed the Bad Cases from version N
- A skill "works sometimes" and you need to find out exactly which steps fail and why
- You need a quality gate before deploying a skill to production
- Someone asks "is this prompt/skill any good?" and you want a rigorous answer

When NOT to Use This Skill

- You want to create a skill from scratch (use skill-creator for that)
- The "skill" is a single-line prompt with no structure (but see Phase 0: we can still infer steps from unstructured prompts)

The Evaluation Pipeline

    Input Skill
        |
        v
    [Phase 0] Structure Assessment ──> Testability level + eval strategy
        |
        v
    [Phase 1] Skill Dissection ──> Step list + operation types + expected outputs
        |
        v
    [Phase 2] Test Case Design ──> Cases with per-step expected results + check_types
        |
        v
    [Phase 3] Execute & Record ──> Per-step actual behavior + Baseline comparison run
        |
        v
    [Phase 4] Score & Verify ──> Three scores per step + stability verification
        |
        v
    [Phase 5] Report & Iterate ──> Bad Cases + averages + Baseline gain + optimization
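The scoring principles from the Philosophy section can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation. The names `check_item`, `StepScore`, and `stable_score`, the `value` key on a must_contain item, and the Bad Case rule (a zero on completion or correctness) are all assumptions made for the example. "Diverge" is read here as any disagreement between the three judge runs.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Union


def check_item(output: str, item: dict,
               semantic_judge: Optional[Callable[[str, str], bool]] = None) -> Optional[bool]:
    """Check one must_contain item: code handles exact/regex, only semantic needs an LLM."""
    kind, value = item["check_type"], item["value"]  # "value" is a hypothetical key name
    if kind == "exact":
        return value in output
    if kind == "regex":
        return re.search(value, output) is not None
    if kind == "semantic":
        # Deferred to an LLM judge; None means "not judged yet".
        return semantic_judge(output, value) if semantic_judge else None
    raise ValueError(f"unknown check_type: {kind!r}")


@dataclass
class StepScore:
    completion: int   # 0 or 1
    correctness: int  # 0, 1, or 2
    quality: int      # 0, 1, or 2 (execution quality)

    def is_bad_case(self) -> bool:
        # ASSUMPTION: a zero on completion or correctness flags a Bad Case.
        return self.completion == 0 or self.correctness == 0


def stable_score(judge: Callable[[], int], runs: int = 3) -> Union[int, str]:
    """Call the judge `runs` times; any disagreement yields UNCERTAIN."""
    scores = [judge() for _ in range(runs)]
    return scores[0] if len(set(scores)) == 1 else "UNCERTAIN"
```

The three scores stay in separate fields and are never summed, so a report can list every step where `is_bad_case()` is true before it shows any average.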

Phase 0: Structure Assessment

Before doing anything else, evaluate the skill's testability:

Run the structure checklist (see references/methodology.md §1B):

- Has explicit step numbers/titles?
- Each step has input/output description?
- Has method/tool specifications?
- Has constraints?
- Has expected deliverable description?

Determine structure level:

- High (6/6 checks): Proceed directly to Phase 1
- Medium (3-5/6): Supplement step expectations, then proceed
- Low (0-2/6): Infer steps from the prompt text using verb-based decomposition
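The level thresholds above map directly to a small lookup. The sketch below assumes the checklist result arrives as a count of passed checks out of six. The function and constant names are hypothetical and only illustrate the decision rule.

```python
# Next action for each structure level, taken from the thresholds in the text.
NEXT_ACTION = {
    "high": "proceed directly to Phase 1",
    "medium": "supplement step expectations, then proceed",
    "low": "infer steps with verb-based decomposition",
}


def structure_level(checks_passed: int) -> str:
    """Map the number of passed checklist items (0-6) to a structure level."""
    if not 0 <= checks_passed <= 6:
        raise ValueError("checks_passed must be between 0 and 6")
    if checks_passed == 6:
        return "high"
    if checks_passed >= 3:
        return "medium"
    return "low"


def eval_strategy(checks_passed: int) -> tuple:
    """Return (level, next action) for a checklist result."""
    level = structure_level(checks_passed)
    return level, NEXT_ACTION[level]
```

For example, `eval_strategy(4)` returns the medium level together with the instruction to supplement step expectations before moving on.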

For low-structure skills, infer steps:

- Identify verb instructions: search, analyze, extract, generate, output...
- Order by logical dependency
- Each verb = one inferred step
- Mark all inferred steps with "step_source": "inferred"

Phase 1: Skill Dissection

Read the target skill and build a structured profile:

- Read the skill's SKILL.md (or .mdc, or plain-text prompt).
- Extract frontmatter: name, description, version.
- Identify the skill's claimed steps: the logical phases it asks the model to go through. Look for numbered lists, markdown headers like ### Step N, or sequential instructions. If the skill doesn't have explicit steps, infer them from the instruction flow.
- For each step, identify the operation type:
  - Data reading (files, databases, user input)
  - API calling (REST, GraphQL, third-party services)
  - Web scraping (page fetching, data extraction)
  - Page manipulation (clicking, form filling, navigation)
  - Data processing (calculation, transformation, aggregation)
  - Content generation (writing, code generation, report creation)
  - File output (saving, exporting)
  - Conditional logic (branching, error handling)
  - Tool invocation (search, external scripts)
- Identify output expectations: what format, what artifacts, what deliverables.
- Note any safety-relevant instructions: file system access, network calls, secrets.

Save this as the skill profile.
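A rough sketch of how such a profile might be assembled is shown here, covering frontmatter extraction, explicit step detection, and the verb-based fallback from Phase 0. Everything in it is an assumption made for illustration: the function names, the flat `key: value` frontmatter parsing, the short verb list, the `"explicit"` label, and the use of order of appearance as a stand-in for logical dependency. Operation types and safety notes are left out to keep it short.

```python
import re

# Hypothetical verb list; the text names these as examples of instruction verbs.
VERBS = ("search", "analyze", "extract", "generate", "output")


def parse_frontmatter(text: str):
    """Split a leading ----delimited block of `key: value` lines from the body."""
    match = re.match(r"---\s*\n(.*?)\n---\s*(?:\n|$)", text, re.DOTALL)
    if not match:
        return {}, text
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields, text[match.end():]


def find_steps(body: str):
    """Prefer explicit steps; otherwise treat each instruction verb as one inferred step."""
    explicit = re.findall(r"^(?:#+\s*Step\s+\d+[:.]?\s*|\d+\.\s+)(.+)$", body, re.MULTILINE)
    if explicit:
        return [{"title": t.strip(), "step_source": "explicit"} for t in explicit]
    pattern = r"\b(" + "|".join(VERBS) + r")\b"
    # ASSUMPTION: order of appearance approximates logical dependency.
    return [{"title": m.group(1).lower(), "step_source": "inferred"}
            for m in re.finditer(pattern, body, re.IGNORECASE)]


def build_profile(text: str) -> dict:
    """Combine frontmatter fields and the step list into one profile dict."""
    fields, body = parse_frontmatter(text)
    return {
        "name": fields.get("name"),
        "description": fields.get("description"),
        "version": fields.get("version"),
        "steps": find_steps(body),
    }
```

Keeping `step_source` on every step lets later phases tell which expectations came from the skill author and which were inferred by the evaluator.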

Phase 2: Test Case Design

Based on the skill profile, design test cases. Each test case has:

A task prompt — what a real use
