Pdf Ocr: Scanned PDF to Word Document (Chinese OCR supported)

Name: Pdf Ocr: Scanned PDF to Word Document (Chinese OCR supported)
Rating: 1 (1 review)
Author: lijie420461340

v1.0.0

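This skill converts scanned PDFs into Word documents. It supports Chinese OCR, automatically crops headers and footers, and preserves illustrations and color chapter cover pages. It uses the Baidu OCR API (free quota of 1,000 calls per month) and is intended for cases where a scanned PDF needs to be converted to text or Word.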

Tags: File Processing, Security, API Tools, Document Tools, Automation

License

MIT-0

Last updated

2026/1/30

Security scan

VirusTotal: Suspicious

OpenClaw: Suspicious (medium confidence)

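The skill largely implements the PDF-to-OCR-to-DOCX workflow, but it has inconsistencies and worrying items, in particular Baidu API credentials hardcoded in both the code and the documentation, and claims in SKILL.md about server-side storage and auto-save that are not implemented.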

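Assessment recommendations

Most of the skill's code and documentation match a local PDF-to-OCR workflow, but several red flags need to be addressed:
- The hardcoded Baidu API key/secret should be replaced with your own or removed entirely.
- The SKILL.md claims that the server retains the original file and that progress is auto-saved every 50 pages are not implemented in the code.
- Because credentials are included, the account's quota or billing may be abused.
- Before use, test with non-sensitive sample PDFs in an isolated environment.
- If you need auto-save or explicit server behavior, request a code update that implements these features transparently.
- If you do not want to depend on Baidu, modify the code to use your own OCR provider and remove the embedded secrets.
...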

Detailed analysis

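ℹ Purpose and capabilities

The skill name and description (PDF OCR using Baidu) match the provided script: it renders PDF pages, crops headers and footers, calls Baidu OCR, and generates a .docx. However, behaviors claimed in SKILL.md and skill.json (for example, that the server retains the original file, or that progress is auto-saved every 50 pages) do not exist in the script, which is an inconsistency.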
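The header and footer cropping step can be pictured with a small sketch. This is not taken from the skill's script: the function name `crop_box` and the default ratios of 8% are assumptions chosen for illustration. The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts.

```python
def crop_box(width, height, header_ratio=0.08, footer_ratio=0.08):
    """Return (left, top, right, bottom) of the page body in pixels.

    header_ratio and footer_ratio are hypothetical defaults: the fraction of
    the page height to cut from the top and bottom respectively.
    """
    if header_ratio < 0 or footer_ratio < 0:
        raise ValueError("ratios must be non-negative")
    if header_ratio + footer_ratio >= 1:
        raise ValueError("header and footer together would remove the whole page")
    top = round(height * header_ratio)
    bottom = height - round(height * footer_ratio)
    return (0, top, width, bottom)
```

With a rendered page held as a Pillow image, the call would presumably look like `page_image.crop(crop_box(*page_image.size))`; the ratios would need tuning per document.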

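⚠ Instruction scope

SKILL.md instructs the user to run the provided script and documents QPS limits and other behaviors. It also explicitly embeds the Baidu API key/secret in the README. The instructions state that "the original high-definition version is kept on the server" and that "progress is auto-saved every 50 pages", but the executable script only processes files locally and saves the final docx at the end (no auto-save, no upload). This mismatch may mislead users about where files are stored and what the skill does with input PDFs.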
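For readers wondering what the documented but unimplemented auto-save might look like, here is a minimal sketch of a page loop that paces requests and checkpoints every 50 pages. Everything in it is hypothetical: `process_pages`, `ocr_page`, and `save_progress` are illustrative names, and the default of 2 requests per second is an assumption, since the actual QPS limit is not quoted in this listing.

```python
import time


def process_pages(pages, ocr_page, save_progress, every=50, qps=2.0, sleep=time.sleep):
    """OCR pages in order, pausing between requests and saving periodically.

    pages:         iterable of page objects (whatever the renderer yields)
    ocr_page:      callable returning the recognized text for one page
    save_progress: callable receiving the list of texts collected so far
    every:         checkpoint interval in pages
    qps:           assumed request rate limit; a fixed pause of 1/qps is used
    """
    results = []
    for index, page in enumerate(pages, start=1):
        if index > 1:
            sleep(1.0 / qps)  # crude pacing between consecutive requests
        results.append(ocr_page(page))
        if index % every == 0:
            save_progress(list(results))  # intermediate checkpoint
    if len(results) % every != 0:
        save_progress(list(results))  # final save when the last batch is partial
    return results
```

The fixed pause is the simplest way to stay under a rate limit; it ignores the time each request takes, so the effective rate ends up somewhat lower than `qps`.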

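✓ Install mechanism

No install spec; this is instructions plus a local Python script. The dependencies are common Python libraries (pymupdf, python-docx, pillow). No external downloads or archive extraction were observed.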

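⚠ Credential requirements

The skill metadata claims that no environment variables are required, but the code reads BAIDU_API_KEY and BAIDU_SECRET_KEY from the environment, with defaults set to literal API credentials. The same credentials are published in SKILL.md. Publishing working API credentials in a public skill is a security/privacy problem (credential leakage, quota abuse, potential billing or misuse). If the developer intended to include demo keys, they should still be documented as such and not presented as the only credential option.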
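One way to address this finding is to read the two variables with no literal fallback and fail loudly when either is missing. The sketch below is an illustration, not the skill's code: only the variable names BAIDU_API_KEY and BAIDU_SECRET_KEY come from the analysis above, while `load_baidu_credentials` and `MissingCredentialError` are hypothetical names.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required credential is absent from the environment."""


def load_baidu_credentials(env=None):
    """Return (api_key, secret_key) read from the environment.

    There are deliberately no default values: a missing or empty variable
    is treated as an error instead of falling back to an embedded secret.
    """
    env = os.environ if env is None else env
    values = []
    for name in ("BAIDU_API_KEY", "BAIDU_SECRET_KEY"):
        value = env.get(name, "").strip()
        if not value:
            raise MissingCredentialError(f"Set {name} before running the script")
        values.append(value)
    return tuple(values)
```

Accepting an optional mapping keeps the function easy to exercise without touching the real process environment.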

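✓ Persistence and permissions

The skill does not request persistent or always-on permissions, does not modify other skills, and is not configured for forced inclusion. It runs a local script and writes output files to the chosen output directory.

Security is layered; review the code before running it.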

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Versions

latest	v1.0.0	2026/1/30

● Suspicious

Install command

Official: npx clawhub@latest install pdf-ocr

Mirror: npx clawhub@latest install pdf-ocr --registry https://cn.clawhub-mirror.com

Skill documentation

name: PDF OCR Extraction ...

# PDF OCR Extraction ...

Extract text from scanned documents and image-based PDFs using OCR technology.

Overview

This skill helps you:

Extract text from scanned documents
Make image PDFs searchable
Digitize paper documents
Process handwritten text (limited)
Batch process multiple documents

How to Use

Basic OCR

"Extract text from this scanned PDF"
"OCR this document image"
"Make this PDF searchable"

With Options

"Extract text from pages 1-10, English language"
"OCR this document, preserve layout"
"Extract and output as structured data"

Document Types

OCR Quality by Document Type

Document Type	Expected Quality	Tips
Typed documents	⭐⭐⭐⭐⭐ 95%+	Best results
Printed books	⭐⭐⭐⭐ 90%+	Watch for aged or yellowed pages
Forms	⭐⭐⭐⭐ 85%+	Checkboxes may need manual review
Tables/Data	⭐⭐⭐ 80%+	Structure may need fixing
Handwritten (neat)	⭐⭐ 60-80%	Variable results
Handwritten (cursive)	⭐ 30-60%	Often needs manual review
Mixed content	⭐⭐⭐ 75%+	Depends on complexity

Output Formats

Plain Text Extraction

## OCR Result: [Document Name]
Pages Processed: [X]
Language: [Detected/Specified]
Confidence: [X]%
[Extracted text content here]
Notes
[Any issues or uncertainties]
[Characters that may be incorrect]

Structured Extraction

## OCR Extraction: [Document Name]
Document Info
Field	Value
Title	[Extracted or inferred]
Date	[If found]
Author	[If found]
Content by Section
[Header 1]
[Content under this header]
[Header 2]
[Content under this header]
Tables Found
Column 1	Column 2	Column 3
[Data]	[Data]	[Data]
Uncertain Text
Page	Original	Confidence	Possible
3	"teh"	70%	"the"
5	"l0ve"	65%	"love"

Searchable PDF Output

## OCR to Searchable PDF
Source: [filename.pdf]
Output: [filename_searchable.pdf]
Processing Summary
Metric	Value
Pages	[X]
Words extracted	[Y]
Average confidence	[Z]%
Processing time	[T] seconds
Quality Report
[X] pages with 95%+ confidence
[Y] pages with 80-94% confidence
[Z] pages with <80% confidence (review recommended)
Searchability
✅ Document is now text-searchable
✅ Original images preserved
✅ Text layer added behind images
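The three bands in the Quality Report can be computed from per-page confidence scores. The following is a small sketch under the assumption that each page yields one confidence value on a 0 to 100 scale; `quality_report` and the band names are illustrative, not part of the skill.

```python
def quality_report(page_confidences):
    """Count pages per band: 95%+, 80% up to 95%, and below 80%."""
    report = {"high": 0, "medium": 0, "low": 0}
    for confidence in page_confidences:
        if confidence >= 95:
            report["high"] += 1
        elif confidence >= 80:
            report["medium"] += 1
        else:
            report["low"] += 1  # review recommended
    return report


def average_confidence(page_confidences):
    """Return the mean confidence, or None when there are no pages."""
    scores = list(page_confidences)
    return sum(scores) / len(scores) if scores else None
```

Fractional scores such as 94.5 fall into the middle band here, which is one reasonable reading of the "80-94%" label.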

Pre-Processing Tips

Image Quality Checklist

Before OCR, ensure:

[ ] Resolution: 300 DPI minimum (600 for small text)
[ ] Contrast: Clear black text on white background
[ ] Alignment: Document is straight (not skewed)
[ ] Completeness: No cut-off edges
[ ] Cleanliness: No stains, marks, or shadows
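When pages are rasterized from a PDF instead of scanned, the resolution item in the checklist translates into a zoom factor: PDF page sizes are expressed in points, with 72 points per inch, so rendering at a target DPI means scaling by DPI / 72. The helper names below are hypothetical; with PyMuPDF the factor would typically be passed as `fitz.Matrix(zoom, zoom)` to `page.get_pixmap`.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are given in points


def zoom_for_dpi(dpi):
    """Return the scale factor that renders a PDF page at the given DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt, height_pt, dpi=300):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)
```

For a page of 595 x 842 points (roughly A4), this gives about 2479 x 3508 pixels at 300 DPI, and twice that in each dimension at 600 DPI.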

Common Pre-Processing Steps

Issue	Solution
Low resolution	Upscale image first
Skewed/rotated	Auto-deskew
Poor contrast	Adjust levels/threshold
Noise/specks	Apply noise reduction
Shadows	Flatten lighting
Color document	Convert to grayscale
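Two of the steps in the table, grayscale conversion and thresholding, are simple enough to show on plain nested lists. This is an illustrative sketch only: a real pipeline would more likely use an imaging library, and the fixed threshold of 128 is an assumption that rarely suits every scan. The luma weights are the ITU-R BT.601 coefficients.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def binarize(gray_rows, threshold=128):
    """Map each pixel to 255 (white) or 0 (black) using a fixed threshold."""
    return [[255 if value >= threshold else 0 for value in row] for row in gray_rows]
```

Chaining the two, `binarize(to_grayscale(pixels))`, yields a black-and-white image; adaptive thresholding generally copes better with shadows and uneven lighting.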

Language Support

Supported Languages

Excellent: English, Spanish, French, German, Italian
Good: Chinese (Simplified/Traditional), Japanese, Korean
Moderate: Arabic, Hebrew (RTL support), Hindi
Basic: Many others with varying quality

Multi-Language Documents

"OCR this document, detect language automatically"
"Extract text, primary: English, secondary: Chinese"

Handling Specific Content

Forms and Checkboxes

## Form Extraction: [Form Name]
Field Values
Field	Value	Confidence
Name	John Smith	98%
Date	01/15/2026	95%
Address	123 Main St	92%
Checkboxes
Question	Checked
Option A	☑️ Yes
Option B	☐ No
Option C	☑️ Yes
Signature
[Signature detected on page X - cannot extract text]

Tables

## Table Extraction
Table 1 (Page 2)
Header A	Header B	Header C
Value 1	Value 2	Value 3
Value 4	Value 5	Value 6
Table confidence: 85%
Note: Column 3 may have alignment issues

Handwritten Text

## Handwritten Text Extraction
Legibility Assessment: [Good/Fair/Poor]
Recommended: Manual review
Extracted Text (Confidence: 65%)
[Extracted text with uncertain words marked]
Uncertain Words
Original	Best Guess	Alternatives
[image]	"meeting"	"meeting", "meaning"
[image]	"Tuesday"	"Tuesday", "Thursday"
⚠️ Low confidence extraction - please verify manually

Batch Processing

Batch OCR Job

## Batch OCR Processing
Folder: [Path]
Total Documents: [X]
Status: [In Progress/Complete]
Results
File	Pages	Confidence	Status
doc1.pdf	5	96%	✅ Complete
doc2.pdf	12	88%	✅ Complete
doc3.pdf	3	72%	⚠️ Review
doc4.pdf	8	-	❌ Failed
Issues
doc3.pdf: Pages 2-3 have handwriting
doc4.pdf: File corrupted
Summary
Successful: [X]
Need Review: [Y]
Failed: [Z]
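The Summary counts in the batch template can be derived from per-file results. The sketch below assumes each file maps to an average confidence, or to `None` when processing failed, and reuses the 80% review threshold from the quality bands; the function names and the threshold parameter are illustrative.

```python
def classify(confidence, review_below=80):
    """Return "failed", "review", or "complete" for one document."""
    if confidence is None:
        return "failed"  # no result at all, e.g. a corrupted file
    return "review" if confidence < review_below else "complete"


def summarize(results):
    """Tally document statuses from a {filename: confidence or None} mapping."""
    summary = {"complete": 0, "review": 0, "failed": 0}
    for confidence in results.values():
        summary[classify(confidence)] += 1
    return summary
```

Applied to the four files in the example table, this yields two complete documents, one needing review, and one failure.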

Tool Recommendations

Cloud Services

Google Cloud Vision (excellent accuracy)
Amazon Textract (good for forms)
Azure Computer Vision (balanced)
Adobe Acrobat (integrated)

Desktop Software

ABBYY FineReader (best accuracy)
Adobe Acrobat Pro (reliable)
Readiris (good value)
Tesseract (free, open source)

Programming Libraries

pytesseract (Python + Tesseract)
EasyOCR (Python, multi-language)
PaddleOCR (Python, good for Asian languages)

Limitations

Cannot guarantee 100% accuracy
Handwritten text has low accuracy
Very small text may not extract well
Decorative fonts are problematic
Background images reduce quality
Cannot read text in complex graphics
Processing time increases with pages

Data source: ClawHub · Chinese localization: 龙虾技能库
