Benchmark Model Provider — AI 模型基准测试与评估

Name: Benchmark Model Provider — AI 模型基准测试与评估
Author: tankisstank

tankisstank

📊 Benchmark Model Provider — AI 模型基准测试与评估

v1.0.5

根据用户的具体用途、领域和使用频率，构建基准测试套件，评估和排名 AI 提供商/模型。帮助用户选择最适合其工作流的模型，提供可审阅、可分享的报告。

0· 115·0 当前·0 累计

by @tankisstank·MIT-0

AI模型访问 API工具开发工具自动化

下载技能包

License

MIT-0

最后更新

2026/4/2

安全扫描

VirusTotal

无害

查看报告

OpenClaw

安全

high confidence

该技能内部一致：通过向用户配置的 OpenAI 兼容端点发送提示来基准测试模型，仅要求 python3 和 BENCHMARK_API_KEY，其脚本和指令与此目的相符。

评估建议

该技能看似合理用于模型基准测试，但会将提示、输出和 BENCHMARK_API_KEY 发送到配置的 base_url。运行前，请（1）验证 base_url 为信任的 OpenAI 兼容端点，（2）先使用非敏感提示进行测试，（3）在隔离环境中运行并从 requirements.txt 安装 PyYAML/reportlab，（4）仅在明确需要自动发布时提供 Vercel/Netlify/GitHub 令牌。如果需要更严格的防护，请查看 run_benchmark.py 和 publish_report.py 以确认凭证和工件的使用/存储方式。...

详细分析 ▾

✓ 用途与能力

Name/description, required binary (python3), required env (BENCHMARK_API_KEY), example specs, and scripts all align with a benchmarking tool that calls OpenAI‑compatible endpoints. The listed optional publishing helpers (Vercel/Netlify) are consistent with the report-publishing feature.

ℹ 指令范围

SKILL.md and scripts explicitly perform network I/O to the base_url from a benchmark spec and use the BENCHMARK_API_KEY for auth. This is expected for the stated purpose, but means prompts, model outputs, and the API key will be sent to whichever endpoint the user configures — the skill warns about this. The instructions do not ask for unrelated secrets or arbitrary system files.

ℹ 安装机制

There is no platform install spec (no remote downloads). The repo includes Python scripts and a small requirements.txt (PyYAML, reportlab). This is low risk; packages are standard and the code is shipped with the skill. Users should still install dependencies in an isolated environment before running.

✓ 凭证需求

Only BENCHMARK_API_KEY is required (declared as primary). References mention an optional VERCEL_TOKEN for non-interactive publishing, but that is not required by default. No unrelated credentials or excessive env requests are present.

✓ 持久化与权限

The skill does not request always:true or system-wide privileges. It stores run artifacts (raw outputs, metrics, reports) locally for audit/reranking — consistent with its purpose. Publishing to web hosts is explicit and documented; it only occurs when the user chooses that step.

安全有层次，运行前请审查代码。

License

MIT-0

可自由使用、修改和再分发，无需署名。

查看条款 ↗

运行时依赖

无特殊依赖

版本

latestv1.0.52026/4/1

安全强化：移除 Vercel 自动部署；添加 base_url 安全检查；更新文档以包含静态托管建议。

● 无害

安装命令点击复制

官方npx clawhub@latest install benchmark-model-provider

镜像加速npx clawhub@latest install benchmark-model-provider --registry https://cn.clawhub-mirror.com

技能文档

（由于原始内容中 SKILL.md 文档部分已经包含中文说明和英文原文，以下仅提供必要的中文摘要，如果需要完整的中文 SKILL.md 请另外提供原始英文 SKILL.md 文件）

中文说明 当用户想知道“哪个模型更聪明、更便宜、更适合日常工作流、更适合研究/写报告/编程”时，使用这个技能。它不会给出泛泛而谈的“最佳模型”建议，而是根据用户自己的实际任务构建基准测试，保留原始结果、重新排序，并生成可审阅、可分享的报告。

Use this skill to help users choose the most suitable model for their own workflow instead of giving generic “best model” advice.

Tiếng Việt Dùng skill này khi Boss muốn biết model nào thật sự đáng dùng cho workflow hằng ngày: model nào research tốt hơn, viết báo cáo ổn hơn, code ngon hơn, rẻ hơn, nhanh hơn, hay đáng dùng lâu dài hơn. Skill này không trả lời kiểu cảm tính, mà dựng benchmark theo đúng nhu cầu thực tế của người dùng rồi chấm, rerank và xuất report rõ ràng.

中文说明 当用户想知道“哪个模型更聪明、更便宜、更适合日常工作流、更适合研究/写报告/编程”时，使用这个技能。它不会给出泛泛而谈的“最佳模型”建议，而是根据用户自己的实际任务构建基准测试，保留原始结果、重新排序，并生成可审阅、可分享的报告。

Treat the benchmark as a personal decision framework:

derive the benchmark from the user's real work
keep the run auditable
preserve raw outputs for reranking
generate outputs that can be reviewed, shared, and published cleanly

What this skill is for

People often ask questions like:

Which model is smarter?
Which model is cheaper to run daily?
Which model is deeper or more useful for my job?
Should I use a local model or a service model?

This skill exists to answer those questions with a repeatable benchmark process, not with vague preferences.

Core operating flow

Collect benchmark context

- purpose - domain - usage frequency

Build or select a benchmark spec with 5–10 domain-specific questions
List currently available providers/models from trusted local OpenClaw context when allowed
Ask whether the user wants to use the current list or add more models
Verify every user-supplied model before running; if the name does not match, ask again or suggest the closest valid model id
Run each model independently on the same benchmark set
Preserve raw outputs and metrics so the run can be audited and reranked later
Score results across quality, depth, cost, and speed metrics
Build reports in markdown / HTML / PDF
Optionally suggest simple ways to publish the generated HTML report (Vercel, Netlify, Cloudflare Pages, GitHub Pages) if the user wants a shareable link

Default decisions

Area	Default
Benchmark mode	`prompt_only`
Overall scoring	quality + depth + cost
Speed handling	measured and reported, excluded from default overall
Execution strategy	`sequential` unless orchestration is needed
Web publish target	(no built-in publish) — suggest Vercel / Netlify / Cloudflare Pages / GitHub Pages

Workflow rules

Benchmark input rules

Default to prompt_only unless the user explicitly wants agent_context.
In prompt_only, send only the raw prompt.
Do not inject extra context, memory, few-shot examples, or hidden scaffolding in prompt_only mode.
In agent_context, use one fixed shared system/context layer for all compared models and record it in metadata.

Execution rules

Support both sequential and subagent_orchestrated execution strategies.
Allow bounded parallel execution for subagents (for example --max-parallel 4) when the endpoint can tolerate it.
Treat rerank as a first-class operation; do not rerun models when only the scoring formula changes.
Report progress at every major step so the user never feels the process is hanging.
During batch execution, surface a clear update whenever one agent/model finishes.
Normalize model ids before calling the endpoint when the provider catalog exposes raw model ids but the user/runtime spec may contain provider-prefixed names.
If the endpoint returns naming/provider mismatch errors, explain the mismatch clearly instead of leaving only a raw 502/unknown-provider error.

Output rules

Mark every estimated metric clearly.
Rewrite reports/landing pages to the newest snapshot.
Do not append patch fragments to stale output.
Reports should include: ranking table, cost table, executive summary, overall assessment, recommended model selection, and full answer details.
Default the report language to the user's current conversation language.
Only switch the report language when the user explicitly asks for a different language or a bilingual output.
PDF output must use Unicode-capable fonts so Vietnamese, Chinese, and multilingual content render correctly.
Multilingual support means the renderer can display multiple languages correctly; it does not mean the skill should arbitrarily change the report language.
Ask before delivering externally via Vercel or other web publishing.

Safety and trust boundary

This skill may perform network I/O depending on how the benchmark spec is configured.

Safe-by-design intent

Example specs should use placeholder endpoints, not a private hardcoded runtime.
The user should supply only trusted API endpoints and credentials.
Publishing should happen only when the user explicitly wants delivery.

Important runtime notes

run_benchmark.py sends prompts to the base_url configured in the benchmark spec.
This skill does not publish to Vercel/Netlify/Cloudflare/GitHub automatically. It only generates local HTML/PDF artifacts.
If you want a shareable link, publish the generated HTML folder using one of these services: Vercel, Netlify, Cloudflare Pages, or GitHub Pages.
Only run the skill with endpoints, tokens, and outputs you trust.

For detailed runtime assumptions, read:

references/runtime-safety.md
references/environment-vars.md
references/pricing-sources.md

What to read

Read only what you need:

references/initial-project-spec.md — authoritative design baseline
references/benchmark-schema.md — benchmark spec structure, run artifacts, file layout
references/scoring-rubric.md — scoring model, normalization rules, default weights
references/pricing-sources.md — pricing precedence and estimation policy
references/execution-modes.md — benchmark modes, execution strategies, operational modes
references/output-modes.md — delivery choices, publish rules, progress feedback rules
references/runtime-safety.md — trust boundaries, network behavior, safe usage guidance
references/environment-vars.md — expected environment variables and dependency notes
examples/*.yaml — benchmark context templates and ready-made examples in multiple languages

Scripts

Script	Purpose
`scripts/build_benchmark_spec.py`	Build a benchmark spec from benchmark context
`scripts/run_benchmark.py`	Execute benchmark runs and write raw outputs/metrics
`scripts/estimate_tokens.py`	Estimate token counts when provider usage is missing
`scripts/resolve_pricing.py`	Resolve pricing sources and compute estimated/official pricing
`scripts/score_models.py`	Combine raw metrics and rubric scores into rankings
`scripts/build_report.py`	Build markdown, HTML, and PDF report artifacts
`scripts/publish_report.py`	No deployment automation. Export/copy PDF and print suggested static hosting options (Vercel/Netlify/Cloudflare Pages/GitHub Pages).

Output contract

Try to produce these artifacts whenever possible:

versioned benchmark spec
raw per-model answer files
raw metrics JSON
score breakdown JSON
markdown summary report
HTML landing page
PDF output when requested
publish result metadata when delivery occurs

数据来源：ClawHub ↗ · 中文优化：龙虾技能库

OpenClaw 技能定制 / 插件定制 / 私有工作流定制

免费技能或插件可能存在安全风险，如需更匹配、更安全的方案，建议联系付费定制

了解定制服务

License

运行时依赖

版本

安装命令 点击复制

技能文档

What this skill is for

Core operating flow

Default decisions

Workflow rules

Benchmark input rules

Execution rules

Output rules

Safety and trust boundary

Safe-by-design intent

Important runtime notes

What to read

Scripts

Output contract

安装命令点击复制