Eval

v1.1.1

Evaluate everything the PA agent manages — tasks, skills, PA network health, billing, calendar connections, and memory quality. Use when: owner asks for an e...

by @netanel-abergel (Netanel Abergel)·MIT-0
License
MIT-0
Last updated
2026/4/5
Security scan
VirusTotal
Pending
OpenClaw
Suspicious
high confidence
The skill's instructions read many local files and secrets (API keys, token files, billing and workspace files) but the skill metadata declares no required credentials or config paths — this mismatch is incoherent and could expose sensitive data unless run in a tightly controlled environment.
Assessment recommendations
This skill will read many local files and tokens (billing-status.json, workspace files, .context, $HOME/.credentials/monday-api-token.txt, and environment variables like the Anthropic API key) but the package metadata doesn't declare those requirements — that's a red flag. Before installing or enabling this skill: 1) Ask the publisher to explicitly list required env vars and config paths and justify why each is needed. 2) Inspect the .context file and any referenced credential files to see what ...
Detailed analysis
Purpose and capability
The skill's stated purpose is to evaluate the agent's tasks, integrations, billing, calendar, and memory; that purpose legitimately requires checking local state and integration tokens. However, the skill metadata declares no required environment variables, credentials, or config paths while the instructions reference many local files and tokens (e.g., /opt/ocana/... files, $HOME/.credentials/monday-api-token.txt, ANTHROPIC_API_KEY). The omission of these required inputs in metadata is disproportionate and misleading.
Instruction scope
SKILL.md explicitly instructs the agent to source a local .context file and to read numerous files and run shell/python/curl/git commands against paths like /opt/ocana/openclaw/workspace/* and $HOME/.openclaw/workspace/*, and to read token files and env vars to test APIs. These actions go beyond a simple checklist and access potentially sensitive credentials and owner data (owner phone, tokens, billing JSON). The instructions grant broad discretion to read system state and secrets without documenting limits.
Install mechanism
There is no install spec and no code files — the skill is instruction-only. That reduces the risk of arbitrary code being fetched or executed from untrusted sources.
Credential requirements
Although the skill metadata lists no required env vars or config paths, the runtime steps depend on env vars and files (e.g., ANTHROPIC_API_KEY, $HOME/.credentials/monday-api-token.txt, various workspace files and .context values like GOG_CREDS or MONDAY_TOKEN_FILE). Requesting access to multiple local credential sources without declaring them is disproportionate and increases the potential for secret exposure.
Persistence and privileges
The skill is not always-on and is user-invocable (defaults). It does not request permanent inclusion or declare modifications to other skills or system-wide settings. Autonomous invocation is allowed by platform default, which increases blast radius in general, but this skill does not request extra persistence privileges.
Security is layered; review the code before running.

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Versions

latest · v1.1.1 · 2026/4/1

**Minor update with behavioral checks and context loading improvements.**

- Added explicit "Load Local Context" step for sourcing environment variables.
- Introduced "Pre-Eval Behavioral Checks" section, including automatic reactions when eval is triggered and completed.
- Clarified use of PA directory file and requirement to use direct calendar API.
- Removed non-English trigger phrases for clarity and consistency.
- No structural changes to main eval logic or scoring.


Install command

Official: npx clawhub@latest install eval
Mirror: npx clawhub@latest install eval --registry https://cn.clawhub-mirror.com

Skill documentation

Load Local Context

CONTEXT_FILE="/opt/ocana/openclaw/workspace/skills/eval/.context"
[ -f "$CONTEXT_FILE" ] && source "$CONTEXT_FILE"
# Then use: $OWNER_PHONE, $WORKSPACE, $TASKS_FILE, $MONDAY_TOKEN_FILE, $GOG_CREDS, etc.

# Eval Skill

Structured evaluation of everything the agent manages.


When to Use

Trigger phrases:

  • "run eval"
  • "what's working and what isn't"
  • "rate yourself"
  • "check everything"

Pre-Eval Behavioral Checks (Always)

  • React 👍 when owner triggers eval
  • React ✅ when report is complete
  • PA directory source: /opt/ocana/openclaw/workspace/PA_LIST.md
  • Calendar check: use direct API (NOT gog CLI)

Eval Report Format

📋 Full Eval — [DATE]

━━━ SELF PERFORMANCE ━━━
Execution: [1-5] [comment]
Accuracy: [1-5] [comment]
Memory: [1-5] [comment]
Proactivity: [1-5] [comment]
Communication: [1-5] [comment]
TOTAL: [X]/25

━━━ ACTIVE TASKS ━━━
✅ Done today: [count]
🟡 In progress: [count]
❌ Stalled: [count] - [list stalled tasks]

━━━ PA NETWORK ━━━
✅ Working: [list]
⚠️ Issues: [list with issue]
❌ Down: [list]

━━━ SKILLS ━━━
Installed: [count]
Used today: [list]
Unused (7+ days): [list]

━━━ INTEGRATIONS ━━━
Calendar (owner): [connected ✅ / broken ❌ / unknown ?]
monday.com: [connected ✅ / broken ❌]
Email (gog): [connected ✅ / broken ❌]
GitHub backup: [last push: X ago]
WhatsApp: [connected ✅ / disconnected ❌]

━━━ MEMORY HEALTH ━━━
Daily notes: [today's file exists? ✅/❌]
Long-term: [MEMORY.md size: OK / bloated]
Learnings: [count this week]
Last backup: [X ago]

━━━ RECOMMENDATIONS ━━━

  • [Most important thing to fix]
  • [Second priority]
  • [Optional improvement]

Running the Eval

Step 1 — Self Performance Score

Score each dimension 1–5 based on today's activity:

Execution (1–5):

  • 5: All tasks completed without reminders
  • 3: Most tasks done, some follow-up needed
  • 1: Multiple tasks missed or forgotten

Accuracy (1–5):

  • 5: No corrections from owner
  • 3: 1–2 corrections
  • 1: Multiple errors or wrong outputs

Memory (1–5):

  • 5: Recalled context correctly every time
  • 3: Missed some context but caught on
  • 1: Repeated the same mistakes

Proactivity (1–5):

  • 5: Acted before being asked, on multiple occasions
  • 3: Responded to requests, minimal initiative
  • 1: Only reacted, no proactive actions

Communication (1–5):

  • 5: Clear, concise, no unnecessary narration
  • 3: Occasionally verbose or unclear
  • 1: Shared reasoning, listed options, narrated steps
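
The five dimension scores feed the `TOTAL: [X]/25` line of the report. A minimal Python sketch of that roll-up follows, assuming the scores arrive as a dict keyed by dimension name; the helper name `total_self_score` and the dict shape are illustrative assumptions, not part of the skill.

```python
DIMENSIONS = ("Execution", "Accuracy", "Memory", "Proactivity", "Communication")


def total_self_score(scores: dict[str, int]) -> str:
    """Validate five 1-5 scores and format the TOTAL line (hypothetical helper)."""
    for name in DIMENSIONS:
        value = scores.get(name)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    total = sum(scores[name] for name in DIMENSIONS)
    return f"TOTAL: {total}/{5 * len(DIMENSIONS)}"


if __name__ == "__main__":
    print(total_self_score({
        "Execution": 4, "Accuracy": 5, "Memory": 3,
        "Proactivity": 3, "Communication": 4,
    }))
```

Rejecting missing or out-of-range scores up front keeps a half-filled rubric from producing a misleading total.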

Step 2 — Task Audit

TASKS_FILE="$HOME/.openclaw/workspace/memory/tasks.md"

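# Count completed tasks. grep -c prints 0 itself when nothing matches,
# so only default to 0 when the file is missing.
DONE=$(grep -c "\[x\]" "$TASKS_FILE" 2>/dev/null)
echo "Tasks done: ${DONE:-0}"

OPEN=$(grep -c "\[ \]" "$TASKS_FILE" 2>/dev/null)
echo "Tasks in progress: ${OPEN:-0}"

# Stalled = in progress for 2+ days (assumes each task line carries a YYYY-MM-DD date)
TODAY=$(date -u +%Y-%m-%d)
# GNU date first, BSD/macOS date as fallback
YESTERDAY=$(date -u -d '1 day ago' +%Y-%m-%d 2>/dev/null || date -u -v-1d +%Y-%m-%d)
echo "Stalled tasks (2+ days old):"
grep "\[ \]" "$TASKS_FILE" 2>/dev/null | grep -v -e "$TODAY" -e "$YESTERDAY" || echo "none"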
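
The same audit can be sketched in Python, which makes the stalled-task rule easier to test. This assumes, as the shell filter does, that tasks.md uses `[x]` / `[ ]` checkboxes and that each open task line contains a `YYYY-MM-DD` date; the function name `audit_tasks` and the sample task text are hypothetical.

```python
from datetime import date, timedelta


def audit_tasks(text: str, today: date) -> dict:
    """Count done/open tasks and list open ones not dated today or yesterday."""
    recent = (today.isoformat(), (today - timedelta(days=1)).isoformat())
    done, in_progress, stalled = 0, 0, []
    for line in text.splitlines():
        if "[x]" in line:
            done += 1
        if "[ ]" in line:
            in_progress += 1
            # Mirror the grep -v filter: stalled unless the line mentions
            # today's or yesterday's date.
            if not any(day in line for day in recent):
                stalled.append(line.strip())
    return {"done": done, "in_progress": in_progress, "stalled": stalled}


if __name__ == "__main__":
    sample = (
        "- [x] 2026-04-03 Send invoice\n"
        "- [ ] 2026-04-05 Book flights\n"
        "- [ ] 2026-04-01 Renew domain\n"
    )
    print(audit_tasks(sample, date(2026, 4, 5)))
```

Passing `today` in as a parameter, rather than reading the clock inside the function, keeps the check deterministic.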

Step 3 — PA Network Health

BILLING_FILE="$HOME/.openclaw/workspace/memory/billing-status.json"

echo "PA Network Status:"
# Pass the path through the environment so the script reads the file named in BILLING_FILE
BILLING_FILE="$BILLING_FILE" python3 << 'PYEOF'
import json
import os

with open(os.environ["BILLING_FILE"]) as f:
    data = json.load(f)

for pa in data["issues"]:
    status = "✅" if pa["status"] == "resolved" else "⚠️"
    print(f"  {status} {pa['pa']} ({pa['owner']}): {pa['status']}")
PYEOF

Step 4 — Skills Audit

SKILLS_DIR="$HOME/.openclaw/workspace/skills"

echo "Installed skills:"
ls "$SKILLS_DIR" | grep -v README | wc -l

echo "Skills list:"
ls "$SKILLS_DIR" | grep -v README

Step 5 — Integration Health

# Test Anthropic billing
API_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "x-api-key: ${ANTHROPIC_API_KEY:-none}" \
  -H "anthropic-version: 2023-06-01" \
  https://api.anthropic.com/v1/models 2>/dev/null)

# Interpret result
if [ "$API_STATUS" = "200" ]; then
  echo "Billing: ✅ OK"
elif [ "$API_STATUS" = "402" ]; then
  echo "Billing: ❌ OUT OF CREDITS"
elif [ "$API_STATUS" = "401" ]; then
  echo "Billing: ❌ Invalid key"
else
  echo "Billing: ? HTTP $API_STATUS"
fi

# Test GitHub backup (reports the age of the latest commit in the workspace repo)
LAST_PUSH=$(git -C "$HOME/.openclaw/workspace" log -1 --format="%ar" 2>/dev/null)
echo "Last backup: $LAST_PUSH"

# Test monday.com
if [ -f "$HOME/.credentials/monday-api-token.txt" ]; then
  MONDAY_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -X POST https://api.monday.com/v2 \
    -H "Authorization: $(cat "$HOME/.credentials/monday-api-token.txt")" \
    -H "Content-Type: application/json" \
    -d '{"query": "{ me { id } }"}' 2>/dev/null)
  [ "$MONDAY_STATUS" = "200" ] && echo "monday.com: ✅" || echo "monday.com: ❌ ($MONDAY_STATUS)"
else
  echo "monday.com: ? (no token found)"
fi
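
The status-code branches used for the Anthropic check above can be pulled into a pure function so the mapping is testable without any network call. This is a sketch only: the labels are copied from the shell snippet, and the function name `billing_label` is a hypothetical choice.

```python
def billing_label(http_status: str) -> str:
    """Map an HTTP status code string to the report label used in Step 5."""
    labels = {
        "200": "Billing: ✅ OK",
        "402": "Billing: ❌ OUT OF CREDITS",
        "401": "Billing: ❌ Invalid key",
    }
    # Anything else is reported verbatim, matching the shell fallback branch.
    return labels.get(http_status, f"Billing: ? HTTP {http_status}")


if __name__ == "__main__":
    for code in ("200", "402", "401", "503"):
        print(billing_label(code))
```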

Step 6 — Memory Health

TODAY=$(date -u +%Y-%m-%d)
WORKSPACE="$HOME/.openclaw/workspace"

# Check daily notes exist
[ -f "$WORKSPACE/memory/$TODAY.md" ] \
  && echo "Daily notes: ✅" \
  || echo "Daily notes: ❌ not created yet"

# Check MEMORY.md size (warn if >200 lines)
MEMORY_LINES=$(wc -l < "$WORKSPACE/MEMORY.md" 2>/dev/null || echo 0)
if [ "$MEMORY_LINES" -gt 200 ]; then
  echo "MEMORY.md: ⚠️ Large ($MEMORY_LINES lines) - consider pruning"
else
  echo "MEMORY.md: ✅ ($MEMORY_LINES lines)"
fi

# Count logged learnings (all "##" headings in the file, not only this week's).
# grep -c prints 0 itself on no match, so only default when the file is missing.
LEARNINGS=$(grep -c "^##" "$WORKSPACE/.learnings/LEARNINGS.md" 2>/dev/null)
echo "Total learnings logged: ${LEARNINGS:-0}"
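
For reference, the daily-notes and MEMORY.md checks can also be expressed as one Python function that returns data instead of printing. This is a sketch under the same layout assumptions as the shell version (`memory/<date>.md` and `MEMORY.md` under the workspace); the name `memory_health` and the returned keys are hypothetical.

```python
from pathlib import Path


def memory_health(workspace: Path, today: str, max_lines: int = 200) -> dict:
    """Report daily-note presence and MEMORY.md size (hypothetical helper)."""
    memory_md = workspace / "MEMORY.md"
    # Count newline characters, which is what `wc -l` reports.
    lines = memory_md.read_text(encoding="utf-8").count("\n") if memory_md.is_file() else 0
    return {
        "daily_notes": (workspace / "memory" / f"{today}.md").is_file(),
        "memory_lines": lines,
        "bloated": lines > max_lines,
    }
```

A call such as `memory_health(Path.home() / ".openclaw" / "workspace", "2026-04-05")` would then feed the MEMORY HEALTH section of the report.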


Recommendations Logic

After running all steps, generate recommendations:

If any PA has billing_error AND status != resolved:
  → "Fix billing for [PA list] — they can't function"

If any task has status in_progress for 2+ days:
  → "Follow up on stalled task: [task name]"

If MEMORY.md > 200 lines:
  → "Prune MEMORY.md: it's getting bloated"

If daily notes don't exist:
  → "Create today's memory file"

If last backup > 6 hours ago:
  → "Run git backup"

If API billing = 402:
  → "My own API key is out of credits: alert the admin immediately"
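
The rules above translate directly into a small rule function. The sketch below assumes the earlier steps have been collected into a single `state` dict; every key name (`billing_unresolved`, `stalled_tasks`, `memory_lines`, `daily_notes`, `hours_since_backup`, `api_status`) is a hypothetical choice for illustration, not something the skill defines.

```python
def recommendations(state: dict) -> list[str]:
    """Apply the if/then rules in document order and return the messages."""
    recs = []
    if state.get("billing_unresolved"):
        pas = ", ".join(state["billing_unresolved"])
        recs.append(f"Fix billing for {pas}: they can't function")
    for task in state.get("stalled_tasks", []):
        recs.append(f"Follow up on stalled task: {task}")
    if state.get("memory_lines", 0) > 200:
        recs.append("Prune MEMORY.md: it's getting bloated")
    if not state.get("daily_notes", True):
        recs.append("Create today's memory file")
    if state.get("hours_since_backup", 0) > 6:
        recs.append("Run git backup")
    if state.get("api_status") == "402":
        recs.append("My own API key is out of credits: alert the admin immediately")
    return recs


if __name__ == "__main__":
    print(recommendations({
        "stalled_tasks": ["Renew domain"],
        "memory_lines": 240,
        "daily_notes": False,
        "hours_since_backup": 2,
    }))
```

Missing keys are treated as healthy here, so a partial eval does not invent problems; whether that default is right depends on how strict the owner wants the report to be.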


Scheduling

Run eval:

  • On demand — when owner asks
  • Weekly — every Sunday at 09:00
  • After major incidents — billing crisis, WA disconnect, etc.
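
The weekly slot (Sunday at 09:00) can be computed with the standard library. The source does not say which time zone applies, so this sketch works on whatever naive `datetime` it is given; the function name `next_weekly_run` is hypothetical.

```python
from datetime import datetime, timedelta


def next_weekly_run(now: datetime) -> datetime:
    """Return the next Sunday 09:00 strictly after `now`."""
    days_ahead = (6 - now.weekday()) % 7  # Monday is 0, Sunday is 6
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=9, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        # Already at or past this Sunday's slot, so move to next week.
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_weekly_run(datetime.now()))
```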

Cost Tips

  • Cheap: Reading files, scoring, formatting — any small model
  • Expensive: Summarizing large memory files — skip if not asked
  • Avoid: Running all API health checks every hour — cache for 30 min
  • Batch: Run all health checks in one pass, not one at a time
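
One way to honor the 30-minute caching tip is a small time-to-live cache around each health check. This is a sketch, not the skill's mechanism: the class name `HealthCheckCache` and the injectable `clock` parameter are assumptions made for illustration and testability.

```python
import time
from typing import Callable


class HealthCheckCache:
    """Reuse a health-check result until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float = 30 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, name: str, check: Callable[[], str]) -> str:
        now = self._clock()
        cached = self._store.get(name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh, skip the API call
        result = check()
        self._store[name] = (now, result)
        return result
```

Calling `cache.get("monday", run_monday_check)` repeatedly would then hit the API at most once per 30 minutes, where `run_monday_check` stands for whatever function performs the real request.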

Minimum Model

Any model that can:

  • Read files
  • Apply if/then scoring rules
  • Format a structured report

No advanced reasoning needed.


PA Performance Scoring (Merged from pa-eval skill)

Use this section when evaluating individual PA agents (weekly self-eval or on-demand when owner gives feedback).

Scoring Dimensions (1–5 each, max 40 points)

| Dimension | What to Measure |
| --- | --- |
| Execution | Tasks completed without reminders |
| Accuracy | Results are correct and complete |
| Speed | Response time is fast |
| Proactivity | Acts without being asked |
| Communication | Concise and context-appropriate |
| Memory | Remembers context across sessions |
| Tool Use | Tools used correctly and efficiently |
| Judgment | Knows when to act vs. when to ask |

Grade: A (36–40), B (28–35), C (20–27), D (<20)
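
The grade bands can be encoded as a short lookup. A minimal sketch, assuming the total is the sum of eight integer scores of 1 to 5 each (so 8 to 40); the function name `grade` is hypothetical.

```python
def grade(total: int) -> str:
    """Map a PA performance total (8-40) to the letter grade bands above."""
    if not 8 <= total <= 40:
        raise ValueError(f"total must be between 8 and 40, got {total}")
    if total >= 36:
        return "A"
    if total >= 28:
        return "B"
    if total >= 20:
        return "C"
    return "D"


if __name__ == "__main__":
    print(grade(31))
```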

Owner Feedback Signals

Log these automatically when detected:

| Signal | Action |
| --- | --- |
| 👍 reaction / "thanks" / "great" | Log +1 positive |
| 👎 reaction / "wrong" / "not good" | Log -1, record the correction |
| Owner re-asks the same question | Log -1 memory gap |
| Owner does the task themselves | Log -1 initiative gap |
| Owner surprised by proactive action | Log +2 proactivity |

Rule: Log feedback signals immediately; don't batch them.
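
A possible shape for immediate signal logging is sketched below. The point values come from the table above; the signal keys (`positive`, `negative`, `re_ask`, `owner_did_task`, `proactive_surprise`), the entry fields, and the in-memory list are all hypothetical, and a real implementation would presumably append to a file under `.learnings/`.

```python
from datetime import datetime, timezone

# Signal key -> (score delta, label). Keys are illustrative names only.
SIGNALS = {
    "positive": (1, "positive"),
    "negative": (-1, "correction"),
    "re_ask": (-1, "memory gap"),
    "owner_did_task": (-1, "initiative gap"),
    "proactive_surprise": (2, "proactivity"),
}


def log_signal(log: list, signal: str, detail: str = "") -> dict:
    """Append one feedback entry right away instead of batching."""
    delta, label = SIGNALS[signal]
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "signal": signal,
        "delta": delta,
        "label": label,
        "detail": detail,
    }
    log.append(entry)
    return entry


if __name__ == "__main__":
    feedback_log: list = []
    log_signal(feedback_log, "negative", "wrong meeting time")
    log_signal(feedback_log, "proactive_surprise", "pre-booked the room")
    print(sum(e["delta"] for e in feedback_log))
```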

Weekly Eval File

Save to .learnings/eval/YYYY-MM-DD.md with: scores table, owner feedback, tasks completed/failed, what went well, what to improve, actions for next week.

Benchmark Tests (Run Monthly)

  • Task Completion Rate: completed / assigned × 100% — Target: >90%
  • Accuracy Rate: (tasks - corrections) / tasks × 100% — Target: >95%
  • Memory Retention: Ask about something discussed 7+ days ago — Target: >80% recall
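
The first two benchmarks are simple ratios. A worked sketch of the formulas follows, with hypothetical function names and made-up example counts; the targets (>90% and >95%) are the ones stated above.

```python
def completion_rate(completed: int, assigned: int) -> float:
    """completed / assigned × 100%."""
    if assigned <= 0:
        raise ValueError("assigned must be positive")
    return 100 * completed / assigned


def accuracy_rate(tasks: int, corrections: int) -> float:
    """(tasks - corrections) / tasks × 100%."""
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    return 100 * (tasks - corrections) / tasks


if __name__ == "__main__":
    # Example counts are invented for illustration.
    completion = completion_rate(completed=19, assigned=20)
    accuracy = accuracy_rate(tasks=40, corrections=2)
    print(f"Completion: {completion:.1f}% (target met: {completion > 90})")
    print(f"Accuracy: {accuracy:.1f}% (target met: {accuracy > 95})")
```

Because both targets are strict inequalities, a result that lands exactly on the threshold does not count as meeting it.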

Source: ClawHub · Chinese localization: 龙虾技能库