# 📦 Vision Helper — AI Image Analysis

v1.0.0 — Analyze images using local or cloud vision models via Ollama to identify content, UI elements, and screenshots, or extract text with OCR support.
## 📸 Vision Helper — Image Analysis

Analyze images using vision models via Ollama, with extended timeout support for cloud-based models.
### Why Not Use the Built-in Image Tool?

The built-in image tool has limited timeout settings that cause failures with cloud vision models, which often need 40–120 seconds. This skill calls the Ollama API directly with a 180-second timeout, supporting both local and cloud models reliably.

It also bypasses the built-in tool's file path restrictions, allowing analysis of images from any readable directory.
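As a rough sketch of what "calling the Ollama API directly with a long timeout" can look like, the snippet below posts a base64-encoded image to Ollama's standard `/api/chat` endpoint. The function names are illustrative, not the skill's actual internals, and the env-var defaults mirror the ones documented later in this page:

```python
import base64
import json
import os
import urllib.request

# Defaults match this skill's documented environment variables.
OLLAMA_API_URL = os.environ.get("OLLAMA_API_URL", "http://localhost:11434/api/chat")
VISION_MODEL = os.environ.get("VISION_MODEL", "gemma4:31b")
VISION_TIMEOUT = int(os.environ.get("VISION_TIMEOUT", "180"))


def build_payload(image_path: str, prompt: str, model: str = VISION_MODEL) -> dict:
    """Build an Ollama chat request with one base64-encoded image attached."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": prompt, "images": [image_b64]}],
    }


def analyze_image(image_path: str, prompt: str) -> str:
    """POST the payload with a generous timeout so slow cloud models can finish."""
    req = urllib.request.Request(
        OLLAMA_API_URL,
        data=json.dumps(build_payload(image_path, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=VISION_TIMEOUT) as resp:
        return json.loads(resp.read())["message"]["content"]
```

The long `timeout=` on the HTTP request is the whole trick: the built-in tool's shorter limit is what kills cloud-model calls.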
### Usage

Basic:

```bash
# Analyze an image (default: English description)
python3 <skill-dir>/scripts/analyze_image.py /path/to/image.png

# With a custom prompt
python3 <skill-dir>/scripts/analyze_image.py /path/to/image.png "Is this a chess game? Describe the board state"

# With a specific model
python3 <skill-dir>/scripts/analyze_image.py /path/to/image.png "Describe content" kimi-k2.5:cloud
```
`<skill-dir>` resolves to your OpenClaw skill installation directory, typically `~/.OpenClaw/workspace/skills/vision-helper/`.
### In Conversation

When you need to analyze an image, use the exec tool:

exec: python3 <skill-dir>/scripts/analyze_image.py /path/to/image.png "What do you see?"

Important: set the exec timeout to 120–180 seconds, as cloud vision models are slow.
### Screenshot + Analysis Workflow

Option A: Browser screenshot → analyze

- browser(action="screenshot") → get the screenshot path (MEDIA: xxx)
- exec("<skill-dir>/scripts/analyze_image.py 'Describe this UI'")
- Act on the analysis result
Option B: Desktop screenshot → analyze

macOS:

- exec("screencapture -x /tmp/screen.png")
- exec("<skill-dir>/scripts/analyze_image.py /tmp/screen.png 'Describe the desktop'")

Linux:

- exec("gnome-screenshot -f /tmp/screen.png")
- exec("<skill-dir>/scripts/analyze_image.py /tmp/screen.png 'Describe the desktop'")
Option C: Game/app UI → analyze → act

- Screenshot the current screen
- Use vision-helper to identify UI elements, buttons, and text
- Execute clicks/inputs based on the analysis
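The screenshot-then-analyze loop from Options B and C can be wrapped in a small driver. This is a sketch, not part of the skill: `SKILL_DIR` is a placeholder you would point at your own install, and the command lines simply mirror the ones above:

```python
import platform
import subprocess

SKILL_DIR = "~/.OpenClaw/workspace/skills/vision-helper"  # placeholder install path


def screenshot_command(out_path: str) -> list[str]:
    """Pick the platform's screenshot CLI, matching Option B above."""
    if platform.system() == "Darwin":
        return ["screencapture", "-x", out_path]
    return ["gnome-screenshot", "-f", out_path]


def analyze_command(image_path: str, prompt: str) -> list[str]:
    """Invoke the skill's analyzer script exactly as the docs show."""
    return ["python3", f"{SKILL_DIR}/scripts/analyze_image.py", image_path, prompt]


def screenshot_and_analyze(prompt: str, out_path: str = "/tmp/screen.png") -> str:
    subprocess.run(screenshot_command(out_path), check=True)
    result = subprocess.run(
        analyze_command(out_path, prompt),
        capture_output=True, text=True, check=True,
        timeout=180,  # generous timeout for cloud models
    )
    return result.stdout
```

An agent loop for Option C would call `screenshot_and_analyze(...)`, parse the description, then issue clicks/inputs before repeating.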
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| VISION_MODEL | gemma4:31b | Default vision model |
| VISION_TIMEOUT | 180 | Request timeout in seconds |
| OLLAMA_API_URL | http://localhost:11434/api/chat | Ollama API endpoint |

### Supported Models

| Model | Vision | Speed | Recommendation |
|---|---|---|---|
| gemma4:31b | ✅ | Local, fast | ⭐ Primary (privacy-first, no API needed; runs offline) |
| kimi-k2.6:cloud | ✅ | 40–120s | 🔬 Advanced (high quality, cloud) |
| kimi-k2.5:cloud | ✅ | 40–90s | Alternative cloud option |
| qwen3.5:cloud | ✅ | 30–60s | Fast cloud recognition |
| qwen3.5:397b-cloud | ✅ | 40–90s | High-quality cloud |
Note: Cloud models require the model to be available in your Ollama instance. Use the VISION_MODEL env var to switch.
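Since cloud models must already be pulled into your Ollama instance, it can help to check before switching. A hedged sketch using Ollama's standard model-listing endpoint (`GET /api/tags`); the helper names are illustrative:

```python
import json
import urllib.request


def available_models(base_url: str = "http://localhost:11434") -> set[str]:
    """Return the names of models currently available in this Ollama instance."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        data = json.loads(resp.read())
    return {m["name"] for m in data.get("models", [])}


def pick_model(preferred: str, models: set[str], fallback: str = "gemma4:31b") -> str:
    """Fall back to the local default when the preferred model isn't pulled."""
    return preferred if preferred in models else fallback
```

For example, `pick_model("kimi-k2.6:cloud", available_models())` quietly degrades to the local model when the cloud one is missing, instead of failing mid-request.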
### FAQ

Q: Can I use the built-in image tool instead?

A: It works for local models but will time out on cloud vision models. Always prefer this skill's script for reliable results.

Q: What image formats are supported?

A: PNG, JPG, JPEG, GIF, WebP, BMP, TIFF, SVG. Maximum file size: 20 MB.
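A small pre-flight check matching the limits stated above can look like this. It is a sketch; the skill's real script may enforce these limits differently:

```python
from pathlib import Path

# Extensions and size cap taken from the FAQ above.
SUPPORTED_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff", ".svg"}
MAX_BYTES = 20 * 1024 * 1024  # 20 MB


def validate_image(path: str) -> None:
    """Raise ValueError when the file's extension or size is out of spec."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_EXTS:
        raise ValueError(f"unsupported format: {p.suffix}")
    if p.stat().st_size > MAX_BYTES:
        raise ValueError(f"file exceeds {MAX_BYTES} bytes")
```

Calling `validate_image("/tmp/screen.png")` before invoking the analyzer avoids burning a slow cloud round-trip on a file the model can't read.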
Q: Where should I save screenshots?
A: Any readable directory works — /tmp/, your workspace, etc. This script has no path restrictions.
Q: How do I use a Chinese prompt?

A: Pass it as the second argument: python3 <skill-dir>/scripts/analyze_image.py /tmp/img.png "请描述这张图片的内容" ("Please describe the content of this image")
### Automation Ideas

- Game automation: Screenshot → analyze game state → decide next action
- Browser verification: Screenshot → verify the page loaded correctly
- Desktop monitoring: Periodic screenshots → detect changes
- UI testing: Screenshot → verify rendered output
- OCR: Extract text content from images