Llm Eval Router: cloud evaluation router for local LLM models

Name: Llm Eval Router (cloud evaluation router for local LLM models)
Author: Nissan Dookeran


v1.2.2

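This skill shadow-tests local Ollama models against a cloud baseline model using a multi-judge ensemble. Once a model is statistically shown to be equivalent, it is promoted automatically, cutting API costs on the basis of evidence.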


by @nissan (Nissan Dookeran)·MIT-0

Tags: AI model access, cloud service automation, API tools, testing tools


License

MIT-0

Last updated

2026/4/12

Security scan

VirusTotal

Suspicious

OpenClaw

Safe

high confidence

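The tools and credentials the skill requests are consistent with its stated purpose (shadow-evaluating local Ollama models against a cloud baseline). It does send prompts to cloud APIs and stores scoring data locally, but what it requests is proportionate to that purpose.

Assessment recommendations

The skill matches its stated purpose, but before installing, note that: (1) sample prompts and judge calls are sent to Anthropic/OpenAI (and possibly Gemini), and these providers may log requests, so avoid sending sensitive data; (2) if you enable Gemini or Langfuse, be prepared to supply additional credentials; (3) review and control the local storage path (data/scores/*.json); (4) check billing and rate limits on your Anthropic/OpenAI accounts; (5) run on a trusted machine, since local model inference and the score files both live locally. ...

Detailed analysis

✓ Purpose and capabilities

The name and description call for local Ollama inference plus cloud judging and baselines. The declared binaries (ollama, python3) and environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY) are consistent with that purpose.

ℹ Instruction scope

SKILL.md instructs the agent to run local inference, run validators, call Anthropic/OpenAI/Gemini for sampled judging, and write the JSON of scored runs to data/scores/*.json. That scope matches the description, but note that prompts are sent to cloud providers for ground truth and judging, which may expose task prompts to those providers (the skill's text claims "no telemetry", but that does not stop cloud providers from logging requests).

✓ Install mechanism

No install spec (instructions only). From an installation standpoint this is the lowest risk: apart from whatever the agent runtime does while following SKILL.md, the install step downloads and writes nothing.

ℹ Credential requirements

Requesting Anthropic and OpenAI API keys is reasonable for ground-truth and judge calls. Minor inconsistency: SKILL.md references Gemini as an optional tiebreaker and Langfuse for observability, but Gemini credentials (or Google authentication) and Langfuse connection details are not listed in requires.env, so enabling those features will require additional credentials that the skill does not declare up front.

✓ Persistence and permissions

always is false, no config paths are requested, and the skill does not request permanent agent-wide permissions. As part of normal operation it stores scoring data as data/scores/*.json.

Security is layered; review the code before running.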

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

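Runtime dependencies

No special dependencies

Versions

latest: v1.2.2 (2026/2/26)

Added security notes: no telemetry, all API calls use the user's own keys, and local Ollama never sends data externally.

● Suspicious

Install command

Official: npx clawhub@latest install llm-eval-router

Mirror (accelerated): npx clawhub@latest install llm-eval-router --registry https://cn.clawhub-mirror.com

Skill documentation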


Set up a production-quality shadow evaluation pipeline that automatically promotes local Ollama models when they statistically prove they match cloud model quality — reducing inference costs with evidence, not hope.

The core idea

Run every task through your best local model (shadow) in parallel with your cloud baseline (ground truth). A lightweight judge ensemble scores the local output. After 200+ runs, if the local model hits 0.95 mean score, promote it to handle that task type in production. Demote it automatically if quality drops.

When to use

You're paying for Claude/GPT API calls on tasks that don't need that quality
You have Ollama running locally with capable models (qwen2.5, phi4, mistral, etc.)
You want evidence-based cost reduction, not blind routing
You have defined task types: summarize, classify, extract, format, analyze, RAG

When NOT to use

Tasks that require real-time web knowledge (use cloud)
Tasks with strict latency requirements < 2 seconds (local models on CPU are slow)
Tasks with high safety stakes (always use cloud with safety filters)
You don't have Ollama or a Mac/Linux machine with enough RAM (8GB+ per model)

Prerequisites

Ollama installed and running (ollama.com)
At least one capable model: ollama pull qwen2.5 or ollama pull phi4
Python 3.10+
API keys: Anthropic (ground truth) + OpenAI (judge) — Gemini optional (tiebreaker)
Langfuse for observability (self-hosted or cloud) — optional but strongly recommended

Network & Privacy

This skill makes outbound API calls to:

Anthropic API — to generate ground truth baseline responses (every accumulation cycle)
OpenAI API — for judge scoring (sampled at 15% of runs)
Google Gemini API — tiebreaker judge only (when primary judges disagree by ≥0.20)

What stays local:

All Ollama model inference runs entirely on your device
Scored run data is stored on disk in data/scores/*.json
No telemetry, analytics, or data collection of any kind
No data is sent anywhere other than the explicit API calls above

Langfuse (optional) can be self-hosted or cloud. If self-hosted, all observability data stays on your network.

Core concepts

6-Dimension Evaluation

Every response is scored on:

Dimension	Default weight	Analyze weight	What it measures
Structural	25%	10%	Format compliance, required keys present
Semantic	25%	40%	Meaning equivalence to ground truth
Factual	20%	25%	No hallucinated facts/numbers/entities
Completion	15%	18%	Task fully addressed
Tool use	10%	4%	Correct tool/format selection
Latency	5%	3%	Within acceptable bounds

Important: Use per-task weight overrides. The default 25/25 split treats structural accuracy equally with semantic similarity — which works for extract/classify/format tasks (where exact format matters) but is wrong for open-ended analysis. difflib.SequenceMatcher on two prose analyses of the same question scores ~0.29 even when they're semantically identical. With structural weight at 25%, this alone caps analyze scores at ~0.59.

# src/evaluator.py — per-task weight profiles
TASK_WEIGHT_OVERRIDES = {
    "analyze": {
        "structural_accuracy": 0.10,   # difflib is NOT meaningful for prose
        "semantic_similarity": 0.40,   # cosine over embeddings captures meaning
        "factual_drift": 0.25,
        "task_completion": 0.18,
        "tool_correctness": 0.04,
        "latency_score": 0.03,
    },
    "code_transform": {
        "structural_accuracy": 0.15,
        "semantic_similarity": 0.35,
        "factual_drift": 0.20,
        "task_completion": 0.20,
        "tool_correctness": 0.07,
        "latency_score": 0.03,
    },
}
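To make the effect of a weight profile concrete, here is a minimal sketch of how per-dimension scores could be folded into one composite. `DEFAULT_WEIGHTS` mirrors the default column of the table above and `ANALYZE_WEIGHTS` mirrors the analyze override; the `composite_score` helper and the per-dimension scores are hypothetical, made up purely for illustration.

```python
# Hypothetical helper: weighted sum of per-dimension scores (not the skill's actual code).
DEFAULT_WEIGHTS = {
    "structural_accuracy": 0.25,
    "semantic_similarity": 0.25,
    "factual_drift": 0.20,
    "task_completion": 0.15,
    "tool_correctness": 0.10,
    "latency_score": 0.05,
}
ANALYZE_WEIGHTS = {
    "structural_accuracy": 0.10,
    "semantic_similarity": 0.40,
    "factual_drift": 0.25,
    "task_completion": 0.18,
    "tool_correctness": 0.04,
    "latency_score": 0.03,
}


def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted sum of scores; weights must sum to 1.0."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[dim] * scores[dim] for dim in weights)


# Invented scores for one prose analysis: low difflib similarity, good meaning match.
example_scores = {
    "structural_accuracy": 0.29,
    "semantic_similarity": 0.90,
    "factual_drift": 0.80,
    "task_completion": 0.85,
    "tool_correctness": 1.00,
    "latency_score": 0.70,
}

default_total = composite_score(example_scores, DEFAULT_WEIGHTS)
analyze_total = composite_score(example_scores, ANALYZE_WEIGHTS)
print(f"default profile: {default_total:.3f}, analyze profile: {analyze_total:.3f}")
```

With these invented inputs, the same response scores higher under the analyze profile, because the low structural score carries less weight.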

Also: For analyze tasks, constrain output structure via system_prompt so GT and candidates produce comparably-formatted responses (Finding/Recommendation/Confidence/Reasoning). This reduces Layer 2 drift and improves difflib scores even at reduced weight.

Judge ensemble

Primary judges (15% sampling rate): Claude Sonnet + gpt-4o-mini score independently
Tiebreaker (only when |score_A - score_B| ≥ 0.20): Gemini 2.5-flash
Unsampled runs (85%): Layer 1+2 validators only (deterministic, free)
Promotion gates always trigger full judge evaluation regardless of sampling rate
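The sampling and tiebreak rules above can be sketched as two small predicates. The function names and the `at_promotion_gate` flag are assumptions for illustration; only the 15% rate and the 0.20 disagreement threshold come from the text.

```python
import random

JUDGE_SAMPLE_RATE = 0.15    # primary judges run on roughly 15% of runs
TIEBREAK_THRESHOLD = 0.20   # disagreement that triggers the tiebreaker judge


def should_run_judges(at_promotion_gate: bool, rng: random.Random) -> bool:
    """Promotion gates always get a full judge pass; other runs are sampled."""
    if at_promotion_gate:
        return True
    return rng.random() < JUDGE_SAMPLE_RATE


def needs_tiebreaker(score_a: float, score_b: float) -> bool:
    """True when the two primary judges disagree by at least the threshold."""
    return abs(score_a - score_b) >= TIEBREAK_THRESHOLD


rng = random.Random(42)  # seeded only to make the demo repeatable
sampled = sum(should_run_judges(False, rng) for _ in range(10_000))
print(f"judged {sampled} of 10000 ordinary runs")
```

Passing in the random generator keeps the sampling decision testable, which matters when judge calls cost money.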

Layer 1+2 validators (free, deterministic)

Layer 1: JSON validity, required key presence, forbidden pattern check
Layer 2: Drift detection — novel entities/numbers/URLs not in ground truth

These run on every response at zero cost. Judges only run when L1+L2 pass and the sampling rate triggers.
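A rough sketch of what the two deterministic layers might look like follows. The function names, parameters, regexes, and per-item penalty are all assumptions; the real validators take a task type and may work differently. Entity detection is left out because it would need an NER step, so only novel numbers and URLs are checked here.

```python
import json
import re

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def layer1_check(response: str, required_keys=(), forbidden_patterns=(),
                 require_json: bool = True) -> float:
    """Return 1.0 on pass and 0.0 on a hard fail (hypothetical scoring)."""
    if any(re.search(pattern, response) for pattern in forbidden_patterns):
        return 0.0
    if require_json:
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            return 0.0
        if not isinstance(parsed, dict):
            return 0.0
        if any(key not in parsed for key in required_keys):
            return 0.0
    return 1.0


def novel_items(candidate: str, ground_truth: str) -> set[str]:
    """Numbers and URLs that appear in the candidate but not in the ground truth."""
    found: set[str] = set()
    for regex in (URL_RE, NUMBER_RE):
        found |= set(regex.findall(candidate)) - set(regex.findall(ground_truth))
    return found


def layer2_drift(candidate: str, ground_truth: str, penalty_per_item: float = 0.1) -> float:
    """Start at 1.0 and subtract an assumed penalty per novel item, floored at 0.0."""
    return max(0.0, 1.0 - penalty_per_item * len(novel_items(candidate, ground_truth)))


print(layer1_check('{"label": "spam"}', required_keys=("label",)))
print(novel_items("Revenue rose 12% to 340", "Revenue rose 12%"))
```

Both checks are pure string operations, which is what lets them run on every response at no API cost.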

Promotion / Demotion

Promote: 200+ runs, rolling mean ≥ 0.95 for a model/task pair
Demote: rolling 7-day pass rate < 0.92
Control floor: one model (phi4, granite4, or similar) serves as the measured floor; any model scoring below it should be flagged, not promoted
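The three rules above could be expressed roughly as follows. This is a sketch under assumptions: the function names and the "promote"/"flag"/"hold" labels are invented, and the 0.85 per-run pass mark is borrowed from the Step 4 snippet rather than stated in this list.

```python
from datetime import datetime, timedelta
from typing import Optional

PROMOTE_MIN_RUNS = 200
PROMOTE_MIN_MEAN = 0.95
DEMOTE_PASS_RATE = 0.92
PASS_THRESHOLD = 0.85  # assumed per-run pass mark, taken from the Step 4 snippet


def promotion_decision(n: int, mean: float, floor_mean: float) -> str:
    """Return 'flag' below the control floor, 'promote' at the gate, else 'hold'."""
    if mean < floor_mean:
        return "flag"
    if n >= PROMOTE_MIN_RUNS and mean >= PROMOTE_MIN_MEAN:
        return "promote"
    return "hold"


def seven_day_pass_rate(runs: list[tuple[datetime, float]],
                        now: datetime) -> Optional[float]:
    """Share of runs in the last 7 days that pass; None if there are no recent runs."""
    cutoff = now - timedelta(days=7)
    recent = [score for ts, score in runs if ts >= cutoff]
    if not recent:
        return None
    return sum(1 for s in recent if s >= PASS_THRESHOLD) / len(recent)


def should_demote(runs: list[tuple[datetime, float]], now: datetime) -> bool:
    rate = seven_day_pass_rate(runs, now)
    return rate is not None and rate < DEMOTE_PASS_RATE


print(promotion_decision(n=250, mean=0.96, floor_mean=0.90))
```

Checking the floor before the promotion gate means a candidate that trails the control model is never promoted, even if it clears the run count.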

Implementation steps

Step 1 — Define your task types

Create config/task_types.yaml:

tasks:
  - id: summarize
    description: "Summarize a document in N sentences"
    require_json: false
    judge_dimensions: [semantic, factual, completion]
  - id: classify
    description: "Classify text into one of N categories"
    require_json: true  # response must be valid JSON
    judge_dimensions: [structural, semantic, completion]
  - id: extract
    description: "Extract structured data from unstructured text"
    require_json: true
    judge_dimensions: [structural, factual, completion]
  - id: format
    description: "Reformat content to match a template"
    require_json: false
    judge_dimensions: [structural, semantic, completion]

Step 2 — Set up the router

The router assigns each task to a model using a round-robin strategy during burn-in (building n), then switches to confidence-weighted routing after promotion.

# src/router.py (simplified version)
from __future__ import annotations  # ConfidenceTracker is defined elsewhere; keep the hint lazy
from collections import defaultdict
class Router:
    def __init__(self, candidates: list[str], control_floor: str):
        self.candidates = candidates
        self.control_floor = control_floor
        self._rr_counters = defaultdict(int)  # round-robin position per task type
    def route(self, task_type: str, confidence_tracker: ConfidenceTracker) -> str:
        """Return the best model for this task type."""
        promoted = confidence_tracker.get_promoted(task_type)
        if promoted:
            return promoted  # use promoted model directly
        # Round-robin during burn-in for fair exposure
        idx = self._rr_counters[task_type] % len(self.candidates)
        self._rr_counters[task_type] += 1
        return self.candidates[idx]

Step 3 — Ground truth comparison

For each task, run it through BOTH the local model (candidate) and the cloud baseline (ground truth). Never use the ground truth response in production — it's only for evaluation.

import random
from statistics import median
JUDGE_SAMPLE_RATE = 0.15  # fraction of runs sent to the judge ensemble
# validators, the judge_* coroutines and weighted_score are defined elsewhere in the project
async def evaluate_pair(prompt: str, local_response: str, gt_response: str,
                        task_type: str) -> float:
    # Layer 1: deterministic
    l1_score = validators.layer1(local_response, task_type)
    if l1_score == 0.0:
        return 0.0  # hard fail: safety or format violation
    # Layer 2: heuristic drift
    l2_score = validators.layer2(local_response, gt_response)
    # Sample judges (15%)
    if random.random() < JUDGE_SAMPLE_RATE:
        sonnet_score = await judge_sonnet(prompt, local_response, gt_response)
        mini_score = await judge_gpt4o_mini(prompt, local_response, gt_response)
        if abs(sonnet_score - mini_score) >= 0.20:
            # Primary judges disagree: bring in the tiebreaker and take the median
            gemini_score = await judge_gemini(prompt, local_response, gt_response)
            final = median([sonnet_score, mini_score, gemini_score])
        else:
            final = (sonnet_score + mini_score) / 2
        return weighted_score(l1_score, l2_score, final)
    else:
        # Unsampled run: deterministic layers only
        return weighted_score(l1_score, l2_score, judge_score=None)
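The snippet above calls `weighted_score` without showing it. One possible shape is sketched here, purely as an assumption: the three blend weights are invented, and renormalising over the deterministic layers when no judge ran is just one reasonable way to handle `judge_score=None`.

```python
from typing import Optional

# Invented blend weights; the source does not specify how the layers are combined.
L1_WEIGHT = 0.2
L2_WEIGHT = 0.3
JUDGE_WEIGHT = 0.5


def weighted_score(l1_score: float, l2_score: float,
                   judge_score: Optional[float] = None) -> float:
    """Blend layer scores; when no judge ran, renormalise over Layer 1 and Layer 2."""
    deterministic = L1_WEIGHT * l1_score + L2_WEIGHT * l2_score
    if judge_score is None:
        return deterministic / (L1_WEIGHT + L2_WEIGHT)
    return deterministic + JUDGE_WEIGHT * judge_score


print(round(weighted_score(1.0, 0.5), 3))        # unsampled run
print(round(weighted_score(1.0, 1.0, 0.8), 3))   # judged run
```

Renormalising keeps unsampled and judged runs on the same 0 to 1 scale, so a missing judge score is never treated as a zero.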

Step 4 — Confidence tracker

Track scores per model/task pair on disk (so restarts don't lose data):

# src/scoring/confidence.py (simplified)
from dataclasses import dataclass
@dataclass
class ModelStats:
    model_id: str
    task_type: str
    scores: list[float]   # all scores (None excluded)
    promoted: bool = False
    demoted: bool = False
    @property
    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
    @property
    def n(self) -> int:
        return len(self.scores)
    def should_promote(self) -> bool:
        return self.n >= 200 and self.mean >= 0.95 and not self.promoted
    def should_demote(self) -> bool:
        recent = self.scores[-50:]  # last 50
        if not recent:
            return False  # no runs yet, so nothing to demote (avoids dividing by zero)
        pass_rate = sum(1 for s in recent if s >= 0.85) / len(recent)
        return pass_rate < 0.92 and not self.demoted
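The on-disk part is not shown in the simplified tracker. A minimal sketch of persisting one model/task pair as JSON might look like this; the file naming scheme and the `score_file`, `save_scores`, and `load_scores` helpers are hypothetical, and only the data/scores/*.json location comes from the text.

```python
import json
import re
import tempfile
from pathlib import Path


def score_file(base_dir: Path, model_id: str, task_type: str) -> Path:
    """Build a filesystem-safe path such as data/scores/phi4_latest__summarize.json."""
    safe_model = re.sub(r"[^A-Za-z0-9._-]", "_", model_id)
    return base_dir / f"{safe_model}__{task_type}.json"


def save_scores(path: Path, model_id: str, task_type: str, scores: list) -> None:
    """Write to a temp file first, then rename, so a crash never leaves a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"model_id": model_id, "task_type": task_type, "scores": scores}
    tmp_path = path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(payload), encoding="utf-8")
    tmp_path.replace(path)


def load_scores(path: Path) -> list:
    """Return the stored scores, or an empty list on first run."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))["scores"]


# Demo in a throwaway directory instead of the real data/scores folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = score_file(Path(tmp_dir) / "data" / "scores", "phi4:latest", "summarize")
    save_scores(demo_path, "phi4:latest", "summarize", [0.9, None, 1.0])
    print(demo_path.name, load_scores(demo_path))
```

JSON stores `None` as `null` and restores it on load, so unsampled runs survive a restart without turning into zeros.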

Step 5 — Accumulator loop

Run this on a cron (every 10-20 minutes via launchd/systemd):

# run_accumulate.py
async def accumulate():
    task_type = pick_next_task()  # round-robin across task types
    prompt, gt_response = generate_task(task_type)  # call cloud baseline
    for candidate in router.get_candidates(task_type):
        local_response = await ollama_client.complete(candidate, prompt)
        score = await evaluate_pair(prompt, local_response, gt_response, task_type)
        confidence_tracker.record(candidate, task_type, score)
        if confidence_tracker.should_promote(candidate, task_type):
            router.promote(candidate, task_type)
            langfuse.log_promotion(candidate, task_type, confidence_tracker.stats(candidate, task_type))

Step 6 — Routing policy

# config/routing_policy.yaml
control_floor_model: phi4:latest  # never promote below this model's score
task_policies:
  policy_check_high_risk:
    never_local: true  # these tasks always use cloud model
  summarize:
    min_score_for_routing: 0.85
    fallback_chain: [qwen2.5, llama3.1, phi4]
  classify:
    min_score_for_routing: 0.90  # higher bar for classification
    fallback_chain: [qwen2.5, granite4, llama3.1]

Step 7 — API

Expose a simple HTTP API (FastAPI):

POST /run          — route a task through the best available model
GET  /health       — service status + promoted models + ollama connectivity
GET  /status       — full scoreboard (model × task × mean × n)
GET  /report       — cost heatmap + efficiency analysis

Key lessons learned (from 900+ production runs)

What worked:

phi4 as control floor: a measured floor model prevents "promoted because everyone else is also bad" errors. If the floor model beats a candidate, flag it; don't promote.
Thinking token stripping: CoT models (deepseek-r1, qwen2.5-coder with reasoning) must have ... blocks stripped before evaluation. Otherwise Layer 2 drift detection flags the reasoning chain as hallucinated content.
None ≠ 0.0 for unsampled runs: a run where no judge scored is not a failing run. Store None, exclude from mean. Mixing None with 0.0 poisons the mean.
require_json: False for plain-text tasks: classify and extract tasks that return formatted text (not JSON objects) will fail Layer 1 if you require JSON. Separate the "is the format correct" check from "is it valid JSON."
Per-task weight overrides: do not use one weight profile for all task types. Structural accuracy (difflib) is wrong for prose analysis; use semantic similarity as the primary signal for open-ended tasks. This lifted analyze mean from 0.44–0.59 to 0.70.
Structured output prompts for analyze tasks: add a system_prompt that specifies an exact output format (Finding/Recommendation/Confidence/Reasoning). Both GT and candidates follow the same template, improving structural alignment and reducing drift penalty. Without this, Layer 2 drift fires on differently-phrased but correct analyses.
MCP server for agentic access: expose CP as MCP tools (run_task, get_status, get_champions, get_promotion_timeline, get_cost_heatmap). Lets an LLM agent query evaluation state without bespoke integration work.
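Two of the lessons above lend themselves to a short sketch: stripping reasoning blocks before evaluation and keeping None out of the mean. The code assumes the reasoning is wrapped in `<think>` tags, as deepseek-r1 emits; other models may use different delimiters, and both helper names are hypothetical.

```python
import re
from typing import Optional, Sequence

# Assumption: reasoning is delimited by <think>...</think>; adjust for other models.
THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_thinking(text: str) -> str:
    """Remove reasoning blocks so drift detection only sees the final answer."""
    return THINK_BLOCK_RE.sub("", text).strip()


def mean_excluding_none(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Average only the runs that were actually scored; None if there are none."""
    judged = [s for s in scores if s is not None]
    return sum(judged) / len(judged) if judged else None


raw = "<think>The user wants a total.\n2 + 40 = 42</think>\nAnswer: 42"
print(strip_thinking(raw))
print(mean_excluding_none([1.0, None, 0.5]))
```

Treating the unsampled run as 0.0 in the second call would drag the mean from 0.75 down to 0.5, which is the poisoning effect described above.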

What didn't work:

Large models (>9GB): gpt-oss:20b and similar required 39+ second inference, and the latency dimension alone tanks the composite score. Practical ceiling is ~9GB models on 24GB unified memory to avoid GPU memory swapping.
100% judge sampling: running the full Claude+GPT+Gemini panel on every evaluation costs more in judge API fees than you save by routing locally. Sample at 15%.
Chroma 1.5.1 with Python 3.14: Pydantic V1 BaseSettings incompatibility. Use qdrant or numpy cosine store instead.
One-size-fits-all weight profiles: defining global weights at system init and never overriding per task type led to all analyze evals silently failing for 112+ runs. Lesson: evaluate your evaluator's scores by task type early. If a whole task type caps at a suspicious ceiling (e.g. 0.59), the metric is wrong, not the models.
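The last lesson, checking scores per task type for a suspicious ceiling, can be automated with a small report. This is a sketch under assumptions: the helper names, the 30-run minimum, and the 0.70 ceiling are illustrative values, not thresholds from the original project.

```python
from collections import defaultdict
from typing import Optional


def score_summary_by_task(runs: list[tuple[str, Optional[float]]]) -> dict[str, dict]:
    """Group scored runs by task type and report count, mean, and best score."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for task_type, score in runs:
        if score is not None:  # unsampled runs carry no score
            grouped[task_type].append(score)
    return {
        task: {"n": len(vals), "mean": sum(vals) / len(vals), "max": max(vals)}
        for task, vals in grouped.items()
    }


def suspicious_ceilings(summary: dict[str, dict], min_runs: int = 30,
                        ceiling: float = 0.70) -> list[str]:
    """Task types where, after enough runs, no model has ever scored above the ceiling."""
    return sorted(
        task for task, stats in summary.items()
        if stats["n"] >= min_runs and stats["max"] < ceiling
    )


# Synthetic data: analyze never gets above ~0.56, summarize scores normally.
demo_runs = [("analyze", 0.44 + 0.003 * i) for i in range(40)]
demo_runs += [("summarize", 0.96)] * 40 + [("summarize", None)]
print(suspicious_ceilings(score_summary_by_task(demo_runs)))
```

A task type that shows up in this list is a prompt to inspect the metric before blaming the candidate models.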

Expected timeline

With a 20-minute accumulator cadence and 9 candidates × 7 task types:

First 50 runs per model: ~5 hours
First promotions (200 runs): ~1-2 days per model/task pair
Stable routing layer: 1-2 weeks

Cost estimate

Per accumulation cycle (one task, one model):

Ground truth: ~$0.002 (Claude Sonnet, ~500 input + 200 output tokens)
Judge sample (15%): ~$0.003 (Sonnet + GPT-4o-mini)
Local model: $0 (Ollama, on-device)

At 6 runs/hour × 24 hours: ~$0.70/day during burn-in. After first promotions: drops to ~$0.10/day (90%+ of task volume local).
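The burn-in figure can be reproduced with a few lines of arithmetic. This assumes the ~$0.003 judge figure is already the average per-cycle cost with the 15% sampling applied; under that reading the total lands close to the quoted ~$0.70/day.

```python
GROUND_TRUTH_COST = 0.002     # USD per cycle, Claude Sonnet ground truth
JUDGE_COST_PER_CYCLE = 0.003  # USD per cycle, assumed to be the sampled average
RUNS_PER_HOUR = 6
HOURS_PER_DAY = 24

cycles_per_day = RUNS_PER_HOUR * HOURS_PER_DAY
burn_in_daily_cost = cycles_per_day * (GROUND_TRUTH_COST + JUDGE_COST_PER_CYCLE)

print(f"{cycles_per_day} cycles/day, about ${burn_in_daily_cost:.2f}/day during burn-in")
```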

Data source: ClawHub · Chinese localization: 龙虾技能库
