Pdf Vision
v1提取 text content from image-based/扫描ned PDFs using multiple vision APIs with automatic fallback. Supports Xflow (qwen3-vl-plus) and ZhipuAI (GLM-4.6V-Flash, GLM-5) vision 模型s. This 技能 converts PDF pages to images and uses AI vision capabilities to 提取 structured text, tables, and content from 扫描ned documents that cannot be processed with traditional text 提取ion methods.
运行时依赖
安装命令
点击复制技能文档
PDF Vision 提取ion 技能 (Enhanced) Overview
This 技能 handles image-based or 扫描ned PDFs that contAIn no selectable text. It supports multiple vision APIs with automatic fallback:
Primary 模型s Xflow: qwen3-vl-plus (your primary vision 模型) ZhipuAI: glm-4.6v-flash (free vision 模型 with fallback support) Fallback: glm-5 (text-only, but may work with some image prompts)
Unlike traditional PDF text 提取ion 工具s (pdftotext, pdfplumber) which only work on text-based PDFs, this 技能 can process:
扫描ned documents Image-only PDFs Photographed documents Handwritten notes (with limitations) Complex layouts with tables and 格式化ting Supported 模型s Vision-Capable 模型s 提供者 模型 Type 上下文 Free Xflow qwen3-vl-plus Vision + Text 131K ❌ ZhipuAI glm-4.6v-flash Vision + Text 32K ✅ ZhipuAI glm-5 Text-only 128K ❌ 添加itional Text 模型s (for fallback) 提供者 模型 上下文 Free ZhipuAI glm-4-flash-250414 128K ✅ ZhipuAI cogview-3-flash 32K ✅
Note: glm-5 is primarily text-only but may handle image prompts in some cases.
Prerequisites
- API Configuration
Your OpenClaw must be 配置d with 机器人h 提供者s:
Xflow Configuration (already 设置 up):
模型s.提供者s.openAI.baseUrl: https://APIs.iflow.cn/v1 模型s.提供者s.openAI.APIKey: Your Xflow API key
ZhipuAI Configuration (更新 令牌):
模型s.提供者s.zhipuAI.baseUrl: https://open.big模型.cn/API/paas/v4 模型s.提供者s.zhipuAI.APIKey: Your ZhipuAI API 令牌
- Required 系统 工具s
- Python Libraries (already 安装ed)
Usage Automatic Fallback Mode (Default)
Uses Xflow first, falls back to ZhipuAI if needed:
./scripts/pdf_vision.py --pdf-path /path/to/document.pdf
Specific 模型 Selection
Force a specific 模型 for cost or performance reasons:
# Use free GLM-4.6V-Flash 模型 ./scripts/pdf_vision.py --pdf-path document.pdf --模型 zhipuAI/glm-4.6v-flash
# Use specific Xflow 模型 ./scripts/pdf_vision.py --pdf-path document.pdf --模型 openAI/qwen3-vl-plus
# Short form (auto-检测s 提供者) ./scripts/pdf_vision.py --pdf-path document.pdf --模型 glm-4.6v-flash
Structured Data 提取ion ./scripts/pdf_vision.py --pdf-path invoice.pdf --prompt "提取 as JSON: vendor, date, total" --模型 glm-4.6v-flash
Multi-page PDF Handling # Process page 3 specifically ./scripts/pdf_vision.py --pdf-path book.pdf --page 3 --输出 page3.txt
Configuration 环境 Variables
The 技能 reads configuration from your OpenClaw config file (~/.OpenClaw/OpenClaw.json):
模型s.提供者s.openAI.baseUrl & APIKey 模型s.提供者s.zhipuAI.baseUrl & APIKey 输出 格式化
Returns 提取ed text content as a string. For structured data 请求s, the AI 模型 will 格式化 输出 according to your prompt instructions.
Examples Cost-优化d 提取ion (Free 模型)
Command: --模型 glm-4.6v-flash Use case: When you want to use free vision capabilities 结果: Good 质量 提取ion at no cost
High-质量 提取ion (Premium 模型)
Command: --模型 qwen3-vl-plus Use case: When you need maximum accuracy and complex layout understanding 结果: Best possible 提取ion 质量
Automatic Fallback (Recommended)
Command: No --模型 flag Use case: Production 环境s where reliability is key 结果: Uses best avAIlable 模型, falls back gracefully
模型 Comparison GLM-4.6V-Flash (Free) ✅ Completely free ✅ Good Chinese text recognition ✅ Decent table structure preservation ⚠️ Lower 上下文 window (32K vs 131K) ⚠️ May struggle with very complex layouts Qwen3-VL-Plus (Premium) ✅ Superior image understanding ✅ Excellent table and structure recognition ✅ Larger 上下文 window (131K) ✅ Better handling of mixed languages ❌ Requires pAId API 访问 Limitations Single page processing: Currently processes one page at a time Image 质量: Better 结果s with higher resolution 扫描s Complex layouts: May struggle with very dense or overl应用ing text Handwriting: Limited accuracy with handwritten content File size: Large PDFs may exceed API 令牌 limits Technical Implementation
The 技能 follows this 工作流:
PDF to Image: Converts specified PDF page to PNG using pypdfium2 模型 Selection: Chooses 模型 based on user preference or fallback 记录ic API Call: 发送s image + prompt to selected vision API 端点 响应 Parsing: 提取s and returns the AI-生成d text content Fallback: If primary 模型 fAIls, tries alternative 模型s
For 调试ging, temporary files are 创建d in /tmp/:
/tmp/pdf_vision_page.png - converted image /tmp/pdf_vision_payload_.json - API 请求 payload /tmp/pdf_vision_响应_.json - API 响应 Integration Notes
This 技能 complements the standard pdf 技能:
Use pdf 技能 for text-based PDFs (faster, no API cost) Use pdf-vision 技能 for image-based/扫描ned PDFs (requires vision API)
机器人h 技能s can be used to获取her in a fallback pattern:
Try pdf 技能 first If no text 提取ed, fall back to pdf-visi