pdf-ocr-extraction

Name: pdf-ocr-extraction
Rating: 2

v1.0.3

Extract text from image-based or scanned PDFs using Tesseract OCR.


by @bilicen700·MIT-0

File processing · Image processing


License

MIT-0

Free to use, modify, and redistribute; no attribution required.


Runtime dependencies

No special dependencies

Install command


Official: npx clawhub@latest install pdf-ocr-extraction

Mirror: npx clawhub@latest install pdf-ocr-extraction --registry https://cn.longxiaskill.com (mirror available)


Skill documentation

PDF OCR Extractor

Use this skill to extract text from scanned or image-based PDFs that lack a native text layer. It is free, calls no third-party APIs, and has no usage limits. It renders each PDF page to an image and runs optical character recognition (OCR) on the result.

Dependencies

This skill requires:

System binary: tesseract (along with the required language data packs, such as chi_sim or eng).

Python packages: pypdfium2, pytesseract, and Pillow.

Note: Do not run automated pip install commands at runtime. Rely on the user or the environment to pre-install the dependencies defined in the metadata block.
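A preflight check can report missing dependencies without installing anything. The following is a minimal sketch, not part of the original skill: the helper names (parse_list_langs, missing_languages, check_environment) are hypothetical, and it assumes the tesseract CLI supports the --list-langs flag and prints one language code per line after a "List of ..." header.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set:
    """Parse `tesseract --list-langs` output into a set of language codes."""
    lines = (line.strip() for line in output.splitlines())
    # Skip blank lines and the "List of available languages ..." header
    return {line for line in lines if line and not line.startswith("List of")}


def missing_languages(requested: str, installed: set) -> list:
    """Return codes from a 'chi_sim+eng'-style string that are not installed."""
    return [code for code in requested.split("+") if code and code not in installed]


def check_environment(requested: str = "chi_sim+eng") -> list:
    """Return a list of problems to report to the user; empty means ready."""
    if shutil.which("tesseract") is None:
        return ["tesseract binary not found on PATH"]
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Some tesseract versions print the list to stderr instead of stdout
    installed = parse_list_langs(result.stdout or result.stderr)
    return [f"missing language pack: {code}"
            for code in missing_languages(requested, installed)]
```

Calling check_environment() before running OCR gives a list of messages to pass on to the user, which matches the rule of notifying rather than installing.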

Quick Start

Create a Python script (e.g., extract.py) in a temporary directory to handle the extraction safely:

    import os
    import sys

    import pypdfium2 as pdfium
    import pytesseract
    from PIL import Image


    def extract(pdf_path):
        doc = pdfium.PdfDocument(pdf_path)
        full_text = []
        for i, page in enumerate(doc):
            # Render the page at 2x scale for a higher-resolution image
            bitmap = page.render(scale=2)
            tmp_img = f"/tmp/page_{i}.png"
            bitmap.to_pil().save(tmp_img)
            try:
                # Run OCR (assumes the English and Simplified Chinese packs are installed)
                text = pytesseract.image_to_string(Image.open(tmp_img), lang="chi_sim+eng")
                full_text.append(text)
            finally:
                # Clean up the temporary file even if OCR fails
                os.remove(tmp_img)
        return "\n".join(full_text)


    if __name__ == "__main__":
        if len(sys.argv) > 1:
            print(extract(sys.argv[1]))

Then execute the script:

python3 extract.py /path/to/document.pdf

Security & Sandbox Constraints

Write temporary images only to /tmp/ and clean them up immediately after extraction. Do not attempt to dynamically download or install language packs via shell commands; notify the user if a specific language is missing.
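The temporary-file rule can be sketched as a small context manager. This is an illustrative helper rather than part of the original skill: the name temp_page_image and its parameters are hypothetical, and it assumes a writable /tmp directory. It uses only the standard library, creates a uniquely named file so concurrent runs do not collide, and removes the file even if OCR raises.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_page_image(index: int, tmp_dir: str = "/tmp"):
    """Yield a unique PNG path under tmp_dir and delete the file on exit."""
    fd, path = tempfile.mkstemp(prefix=f"page_{index}_", suffix=".png", dir=tmp_dir)
    os.close(fd)  # the caller reopens the path by name to write the image
    try:
        yield path
    finally:
        # Runs on success and on error, so no page image is left behind
        if os.path.exists(path):
            os.remove(path)
```

Inside the per-page loop, the rendered image would be saved to the yielded path and passed to OCR within the with block, for example `with temp_page_image(i) as tmp_img:` followed by the save and OCR calls.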
