Pdf Vision

提取 text content from image-based/扫描ned PDFs using multiple vision APIs with automatic fallback. Supports Xflow (qwen3-vl-plus) and ZhipuAI (GLM-4.6V-Flash, GLM-5) vision 模型s. This 技能 converts PDF pages to images and uses AI vision capabilities to 提取 structured text, tables, and content from 扫描ned documents that cannot be processed with traditional text 提取ion methods.

0· 149·0 当前·0 累计

by @lpq6·MIT-0

文档工具 API开发文件处理 AI模型访问图像处理

下载技能包

License

MIT-0

License

MIT-0

可自由使用、修改和再分发，无需署名。

查看条款 ↗

运行时依赖

无特殊依赖

安装命令

点击复制

官方npx clawhub@latest install pdf-vision

镜像加速npx clawhub@latest install pdf-vision --registry https://cn.longxiaskill.com 镜像可用

需要定制？告诉我你的需求 →

技能文档

PDF Vision 提取ion 技能 (Enhanced) Overview

This 技能 handles image-based or 扫描ned PDFs that contAIn no selectable text. It supports multiple vision APIs with automatic fallback:

Primary 模型s Xflow: qwen3-vl-plus (your primary vision 模型) ZhipuAI: glm-4.6v-flash (free vision 模型 with fallback support) Fallback: glm-5 (text-only, but may work with some image prompts)

Unlike traditional PDF text 提取ion 工具s (pdftotext, pdfplumber) which only work on text-based PDFs, this 技能 can process:

扫描ned documents Image-only PDFs Photographed documents Handwritten notes (with limitations) Complex layouts with tables and 格式化ting Supported 模型s Vision-Capable 模型s 提供者模型 Type 上下文 Free Xflow qwen3-vl-plus Vision + Text 131K ❌ ZhipuAI glm-4.6v-flash Vision + Text 32K ✅ ZhipuAI glm-5 Text-only 128K ❌ 添加itional Text 模型s (for fallback) 提供者模型上下文 Free ZhipuAI glm-4-flash-250414 128K ✅ ZhipuAI cogview-3-flash 32K ✅

Note: glm-5 is primarily text-only but may handle image prompts in some cases.

Prerequisites

API Configuration

Your OpenClaw must be 配置d with 机器人h 提供者s:

Xflow Configuration (already 设置 up):

模型s.提供者s.openAI.baseUrl: https://APIs.iflow.cn/v1 模型s.提供者s.openAI.APIKey: Your Xflow API key

ZhipuAI Configuration (更新令牌):

模型s.提供者s.zhipuAI.baseUrl: https://open.big模型.cn/API/paas/v4 模型s.提供者s.zhipuAI.APIKey: Your ZhipuAI API 令牌

Required 系统工具s

pypdfium2 Python 库 (for PDF to image conversion) curl (for API calls) base64 (for image encoding)

Python Libraries (already 安装ed)

pypdfium2

Usage Automatic Fallback Mode (Default)

Uses Xflow first, falls back to ZhipuAI if needed:

./scripts/pdf_vision.py --pdf-path /path/to/document.pdf

Specific 模型 Selection

Force a specific 模型 for cost or performance reasons:

# Use free GLM-4.6V-Flash 模型 ./scripts/pdf_vision.py --pdf-path document.pdf --模型 zhipuAI/glm-4.6v-flash

# Use specific Xflow 模型 ./scripts/pdf_vision.py --pdf-path document.pdf --模型 openAI/qwen3-vl-plus

# Short form (auto-检测s 提供者) ./scripts/pdf_vision.py --pdf-path document.pdf --模型 glm-4.6v-flash

Structured Data 提取ion ./scripts/pdf_vision.py --pdf-path invoice.pdf --prompt "提取 as JSON: vendor, date, total" --模型 glm-4.6v-flash

Multi-page PDF Handling # Process page 3 specifically ./scripts/pdf_vision.py --pdf-path book.pdf --page 3 --输出 page3.txt

Configuration 环境 Variables

The 技能 reads configuration from your OpenClaw config file (~/.OpenClaw/OpenClaw.json):

模型s.提供者s.openAI.baseUrl & APIKey 模型s.提供者s.zhipuAI.baseUrl & APIKey 输出格式化

Returns 提取ed text content as a string. For structured data 请求s, the AI 模型 will 格式化输出 according to your prompt instructions.

Examples Cost-优化d 提取ion (Free 模型)

Command: --模型 glm-4.6v-flash Use case: When you want to use free vision capabilities 结果: Good 质量提取ion at no cost

High-质量提取ion (Premium 模型)

Command: --模型 qwen3-vl-plus Use case: When you need maximum accuracy and complex layout understanding 结果: Best possible 提取ion 质量

Automatic Fallback (Recommended)

Command: No --模型 flag Use case: Production 环境s where reliability is key 结果: Uses best avAIlable 模型, falls back gracefully

模型 Comparison GLM-4.6V-Flash (Free) ✅ Completely free ✅ Good Chinese text recognition ✅ Decent table structure preservation ⚠️ Lower 上下文 window (32K vs 131K) ⚠️ May struggle with very complex layouts Qwen3-VL-Plus (Premium) ✅ Superior image understanding ✅ Excellent table and structure recognition ✅ Larger 上下文 window (131K) ✅ Better handling of mixed languages ❌ Requires pAId API 访问 Limitations Single page processing: Currently processes one page at a time Image 质量: Better 结果s with higher resolution 扫描s Complex layouts: May struggle with very dense or overl应用ing text Handwriting: Limited accuracy with handwritten content File size: Large PDFs may exceed API 令牌 limits Technical Implementation

The 技能 follows this 工作流:

PDF to Image: Converts specified PDF page to PNG using pypdfium2 模型 Selection: Chooses 模型 based on user preference or fallback 记录ic API Call: 发送s image + prompt to selected vision API 端点响应 Parsing: 提取s and returns the AI-生成d text content Fallback: If primary 模型 fAIls, tries alternative 模型s

For 调试ging, temporary files are 创建d in /tmp/:

/tmp/pdf_vision_page.png - converted image /tmp/pdf_vision_payload_.json - API 请求 payload /tmp/pdf_vision_响应_.json - API 响应 Integration Notes

This 技能 complements the standard pdf 技能:

Use pdf 技能 for text-based PDFs (faster, no API cost) Use pdf-vision 技能 for image-based/扫描ned PDFs (requires vision API)

机器人h 技能s can be used to获取her in a fallback pattern:

Try pdf 技能 first If no text 提取ed, fall back to pdf-visi

License

运行时依赖

安装命令

技能文档

相关技能推荐