Eval

Name: Eval
Author: Netanel Abergel

Netanel Abergel

Eval

v1.1.1

Evaluate everything the PA agent manages — tasks, skills, PA network health, billing, calendar connections, and memory quality. Use when: owner asks for an e...

0· 100·1 当前·1 累计

by @netanel-abergel (Netanel Abergel)·MIT-0

下载技能包

License

MIT-0

最后更新

2026/4/5

安全扫描

VirusTotal

Pending

查看报告

OpenClaw

可疑

high confidence

The skill's instructions read many local files and secrets (API keys, token files, billing and workspace files) but the skill metadata declares no required credentials or config paths — this mismatch is incoherent and could expose sensitive data unless run in a tightly controlled environment.

评估建议

This skill will read many local files and tokens (billing-status.json, workspace files, .context, $HOME/.credentials/monday-api-token.txt, and environment variables like the Anthropic API key) but the package metadata doesn't declare those requirements — that's a red flag. Before installing or enabling this skill: 1) Ask the publisher to explicitly list required env vars and config paths and justify why each is needed. 2) Inspect the .context file and any referenced credential files to see what ...

详细分析 ▾

⚠ 用途与能力

The skill's stated purpose is to evaluate the agent's tasks, integrations, billing, calendar, and memory — that purpose legitimately requires checking local state and integration tokens. However, the skill metadata declares no required environment variables, credentials, or config paths while the instructions reference many local files and tokens (e.g., /opt/ocana/... files, $HOME/.credentials/monday-api-token.txt, ANT HROPIC_API_KEY). The omission of these required inputs in metadata is disproportionate and misleading.

⚠ 指令范围

SKILL.md explicitly instructs the agent to source a local .context file and to read numerous files and run shell/python/curl/git commands against paths like /opt/ocana/openclaw/workspace/* and $HOME/.openclaw/workspace/*, and to read token files and env vars to test APIs. These actions go beyond a simple checklist and access potentially sensitive credentials and owner data (owner phone, tokens, billing JSON). The instructions grant broad discretion to read system state and secrets without documenting limits.

✓ 安装机制

There is no install spec and no code files — the skill is instruction-only. That reduces the risk of arbitrary code being fetched or executed from untrusted sources.

⚠ 凭证需求

Although the skill metadata lists no required env vars or config paths, the runtime steps depend on env vars and files (e.g., ANT HROPIC_API_KEY, $HOME/.credentials/monday-api-token.txt, various workspace files and .context values like GOG_CREDS or MONDAY_TOKEN_FILE). Requesting access to multiple local credential sources without declaring them is disproportionate and increases the potential for secret exposure.

✓ 持久化与权限

The skill is not always-on and is user-invocable (defaults). It does not request permanent inclusion or declare modifications to other skills or system-wide settings. Autonomous invocation is allowed by platform default, which increases blast radius in general, but this skill does not request extra persistence privileges.

安全有层次，运行前请审查代码。

License

MIT-0

可自由使用、修改和再分发，无需署名。

查看条款 ↗

运行时依赖

无特殊依赖

版本

latestv1.1.12026/4/1

**Minor update with behavioral checks and context loading improvements.** - Added explicit "Load Local Context" step for sourcing environment variables. - Introduced "Pre-Eval Behavioral Checks" section, including automatic reactions when eval is triggered and completed. - Clarified use of PA directory file and requirement to use direct calendar API. - Removed non-English trigger phrases for clarity and consistency. - No structural changes to main eval logic or scoring.

● Pending

安装命令点击复制

官方npx clawhub@latest install eval

镜像加速npx clawhub@latest install eval --registry https://cn.clawhub-mirror.com

技能文档

Load Local Context

CONTEXT_FILE="/opt/ocana/openclaw/workspace/skills/eval/.context"
[ -f "$CONTEXT_FILE" ] && source "$CONTEXT_FILE"
# Then use: $OWNER_PHONE, $WORKSPACE, $TASKS_FILE, $MONDAY_TOKEN_FILE, $GOG_CREDS, etc.

# Eval Skill

Structured evaluation of everything the agent manages.

When to Use

Trigger phrases:

"run eval"
"what's working and what isn't"
"rate yourself"
"check everything"

Pre-Eval Behavioral Checks (Always)

React 👍 when owner triggers eval
React ✅ when report is complete
PA directory source: /opt/ocana/openclaw/workspace/PA_LIST.md
Calendar check: use direct API (NOT gog CLI)

Eval Report Format

📋 Full Eval — [DATE]
━━━ SELF PERFORMANCE ━━━
Execution:      [1-5] [comment]
Accuracy:       [1-5] [comment]
Memory:         [1-5] [comment]
Proactivity:    [1-5] [comment]
Communication:  [1-5] [comment]
TOTAL: [X]/25
━━━ ACTIVE TASKS ━━━
✅ Done today:   [count]
🟡 In progress:  [count]
❌ Stalled:      [count] — [list stalled tasks]
━━━ PA NETWORK ━━━
✅ Working:  [list]
⚠️ Issues:   [list with issue]
❌ Down:     [list]
━━━ SKILLS ━━━
Installed: [count]
Used today: [list]
Unused (7+ days): [list]
━━━ INTEGRATIONS ━━━
Calendar (owner):     [connected ✅ / broken ❌ / unknown ?]
monday.com:           [connected ✅ / broken ❌]
Email (gog):          [connected ✅ / broken ❌]
GitHub backup:        [last push: X ago]
WhatsApp:             [connected ✅ / disconnected ❌]
━━━ MEMORY HEALTH ━━━
Daily notes:     [today's file exists? ✅/❌]
Long-term:       [MEMORY.md size — OK / bloated]
Learnings:       [count this week]
Last backup:     [X ago]
━━━ RECOMMENDATIONS ━━━
[Most important thing to fix]
[Second priority]
[Optional improvement]

Running the Eval

Step 1 — Self Performance Score

Score each dimension 1–5 based on today's activity:

Execution (1–5):
5: All tasks completed without reminders
3: Most tasks done, some follow-up needed
1: Multiple tasks missed or forgotten
Accuracy (1–5):
5: No corrections from owner
3: 1–2 corrections
1: Multiple errors or wrong outputs
Memory (1–5):
5: Recalled context correctly every time
3: Missed some context, caught on
1: Repeated the same mistakes
Proactivity (1–5):
5: Acted before being asked multiple times
3: Responded to requests, minimal initiative
1: Only reacted, no proactive actions
Communication (1–5):
5: Clear, concise, no unnecessary narration
3: Occasionally verbose or unclear
1: Shared reasoning, listed options, narrated steps

Step 2 — Task Audit

TASKS_FILE="$HOME/.openclaw/workspace/memory/tasks.md"
echo "Tasks done:"
grep -c "\[x\]" "$TASKS_FILE" 2>/dev/null || echo 0
echo "Tasks in progress:"
grep -c "\[ \]" "$TASKS_FILE" 2>/dev/null || echo 0# Stalled = in progress for 2+ days
echo "Stalled tasks (2+ days old):"
grep "\[ \]" "$TASKS_FILE" | grep -v "$(date +%Y-%m-%d)" | grep -v "$(date -u -d '1 day ago' +%Y-%m-%d 2>/dev/null)" || echo "none"

Step 3 — PA Network Health

BILLING_FILE="$HOME/.openclaw/workspace/memory/billing-status.json"echo "PA Network Status:"
python3 << 'PYEOF'
import json
data = json.load(open('/opt/ocana/openclaw/workspace/memory/billing-status.json'))
for pa in data['issues']:
    status = "✅" if pa['status'] == 'resolved' else "⚠️"
    print(f"  {status} {pa['pa']} ({pa['owner']}): {pa['status']}")
PYEOF

Step 4 — Skills Audit

SKILLS_DIR="$HOME/.openclaw/workspace/skills" echo "Installed skills:" ls "$SKILLS_DIR" | grep -v README | wc -l

echo "Skills list:" ls "$SKILLS_DIR" | grep -v README

Step 5 — Integration Health

# Test Anthropic billing
API_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "x-api-key: ${ANTHROPIC_API_KEY:-none}" \
  -H "anthropic-version: 2023-06-01" \
  https://api.anthropic.com/v1/models 2>/dev/null)
# Interpret result
if [ "$API_STATUS" = "200" ]; then echo "Billing: ✅ OK"
elif [ "$API_STATUS" = "402" ]; then echo "Billing: ❌ OUT OF CREDITS"
elif [ "$API_STATUS" = "401" ]; then echo "Billing: ❌ Invalid key"
else echo "Billing: ? HTTP $API_STATUS"
fi
# Test GitHub backup
LAST_PUSH=$(git -C "$HOME/.openclaw/workspace" log -1 --format="%ar" 2>/dev/null)
echo "Last backup: $LAST_PUSH"# Test monday.com
if [ -f "$HOME/.credentials/monday-api-token.txt" ]; then
  MONDAY_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -X POST https://api.monday.com/v2 \
    -H "Authorization: $(cat $HOME/.credentials/monday-api-token.txt)" \
    -H "Content-Type: application/json" \
    -d '{"query": "{ me { id } }"}' 2>/dev/null)
  [ "$MONDAY_STATUS" = "200" ] && echo "monday.com: ✅" || echo "monday.com: ❌ ($MONDAY_STATUS)"
else
  echo "monday.com: ? (no token found)"
fi

Step 6 — Memory Health

TODAY=$(date -u +%Y-%m-%d)
WORKSPACE="$HOME/.openclaw/workspace"
# Check daily notes exist
[ -f "$WORKSPACE/memory/$TODAY.md" ] \
  && echo "Daily notes: ✅" \
  || echo "Daily notes: ❌ not created yet"
# Check MEMORY.md size (warn if >200 lines)
MEMORY_LINES=$(wc -l < "$WORKSPACE/MEMORY.md" 2>/dev/null || echo 0)
if [ "$MEMORY_LINES" -gt 200 ]; then
  echo "MEMORY.md: ⚠️ Large ($MEMORY_LINES lines) — consider pruning"
else
  echo "MEMORY.md: ✅ ($MEMORY_LINES lines)"
fi# Count learnings this week
LEARNINGS=$(grep -c "^##" "$WORKSPACE/.learnings/LEARNINGS.md" 2>/dev/null || echo 0)
echo "Total learnings logged: $LEARNINGS"

Recommendations Logic

After running all steps, generate recommendations:

If any PA has billing_error AND status != resolved: → "Fix billing for [PA list] — they can't function" If any task has status in_progress for 2+ days: → "Follow up on stalled task: [task name]" If MEMORY.md > 200 lines: → "Prune MEMORY.md — it's getting bloated" If daily notes don't exist: → "Create today's memory file" If last backup > 6 hours ago: → "Run git backup"

If API billing = 402: → "My own API key is out of credits — alert the admin immediately"

Scheduling

Run eval:

On demand — when owner asks
Weekly — every Sunday at 09:00
After major incidents — billing crisis, WA disconnect, etc.

Cost Tips

Cheap: Reading files, scoring, formatting — any small model
Expensive: Summarizing large memory files — skip if not asked
Avoid: Running all API health checks every hour — cache for 30 min
Batch: Run all health checks in one pass, not one at a time

Minimum Model

Any model that can:

Read files
Apply if/then scoring rules
Format a structured report

No advanced reasoning needed.

PA Performance Scoring (Merged from pa-eval skill)

Use this section when evaluating individual PA agents (weekly self-eval or on-demand when owner gives feedback).

Scoring Dimensions (1–5 each, max 40 points)

Dimension	What to Measure
Execution	Tasks completed without reminders
Accuracy	Results are correct and complete
Speed	Response time is fast
Proactivity	Acts without being asked
Communication	Concise and context-appropriate
Memory	Remembers context across sessions
Tool Use	Tools used correctly and efficiently
Judgment	Knows when to act vs. when to ask

Grade: A (36–40), B (28–35), C (20–27), D (<20)

Owner Feedback Signals

Log these automatically when detected:

Signal	Action
👍 reaction / "thanks" / "great"	Log +1 positive
👎 reaction / "wrong" / "not good"	Log -1, record the correction
Owner re-asks the same question	Log -1 memory gap
Owner does the task themselves	Log -1 initiative gap
Owner surprised by proactive action	Log +2 proactivity

Rule: Log feedback signals immediately — don't batch them.

Weekly Eval File

Save to .learnings/eval/YYYY-MM-DD.md with: scores table, owner feedback, tasks completed/failed, what went well, what to improve, actions for next week.

Benchmark Tests (Run Monthly)

Task Completion Rate: completed / assigned × 100% — Target: >90%
Accuracy Rate: (tasks - corrections) / tasks × 100% — Target: >95%
Memory Retention: Ask about something discussed 7+ days ago — Target: >80% recall

数据来源：ClawHub ↗ · 中文优化：龙虾技能库

OpenClaw 技能定制 / 插件定制 / 私有工作流定制

免费技能或插件可能存在安全风险，如需更匹配、更安全的方案，建议联系付费定制

了解定制服务

License

运行时依赖

版本

安装命令 点击复制

技能文档

Load Local Context

When to Use

Pre-Eval Behavioral Checks (Always)

Eval Report Format

Running the Eval

Step 1 — Self Performance Score

Step 2 — Task Audit

Step 3 — PA Network Health

Step 4 — Skills Audit

Step 5 — Integration Health

Step 6 — Memory Health

Recommendations Logic

Scheduling

Cost Tips

Minimum Model

PA Performance Scoring (Merged from pa-eval skill)

Scoring Dimensions (1–5 each, max 40 points)

Owner Feedback Signals

Weekly Eval File

Benchmark Tests (Run Monthly)

安装命令点击复制