pdf-ocr-extraction
v1.0.3
Extract text from image-based or scanned PDFs using Tesseract OCR.
PDF OCR Extractor
Use this skill to extract text from scanned or image-based PDFs that lack a native text layer. It is free, calls no third-party APIs, and has no usage limits. It renders each PDF page to an image and runs optical character recognition (OCR) on it.
Dependencies
This skill requires:
System binary: tesseract (along with the required language data packs, such as chi_sim or eng).
Python packages: pypdfium2, pytesseract, and Pillow.
Note: Do not run automated pip install commands at runtime. Rely on the user or the environment to pre-install the dependencies defined in the metadata block.
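Because nothing may be installed at runtime, it can help to check up front that the dependencies are present and report what is missing. The following is a minimal sketch using only the standard library; the function name missing_dependencies and the message wording are illustrative assumptions, not part of the skill.

```python
import importlib.util
import shutil

# Import names, not pip names: Pillow is imported as "PIL".
REQUIRED_MODULES = ("pypdfium2", "pytesseract", "PIL")


def missing_dependencies(modules=REQUIRED_MODULES, binary="tesseract"):
    """Return descriptions of dependencies that cannot be found.

    Hypothetical helper: it only checks availability and never installs anything.
    """
    missing = [
        f"Python module: {name}"
        for name in modules
        if importlib.util.find_spec(name) is None
    ]
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(binary) is None:
        missing.append(f"system binary: {binary}")
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Please install the following before running OCR:")
        for item in problems:
            print(f"  {item}")
    else:
        print("All dependencies found.")
```

If the list is non-empty, the skill should stop and tell the user what to install rather than attempt the installation itself.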
Quick Start
Create a Python script (e.g., extract.py) in a temporary directory to handle the extraction safely:
import os
import sys

import pypdfium2 as pdfium
import pytesseract
from PIL import Image


def extract(pdf_path):
    doc = pdfium.PdfDocument(pdf_path)
    full_text = []
    for i, page in enumerate(doc):
        # Render the page to a high-resolution image
        bitmap = page.render(scale=2)
        tmp_img = f"/tmp/page_{i}.png"
        bitmap.to_pil().save(tmp_img)
        try:
            # Run OCR (assumes the English and Simplified Chinese packs are installed)
            text = pytesseract.image_to_string(Image.open(tmp_img), lang="chi_sim+eng")
            full_text.append(text)
        finally:
            # Clean up the temporary file, even if OCR fails
            os.remove(tmp_img)
    return "\n".join(full_text)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(extract(sys.argv[1]))
Then run the script:
python3 extract.py /path/to/document.pdf
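The script uses a fixed file name per page under /tmp, which could collide if two extractions run at the same time. One possible variant, sketched below with the standard tempfile module, creates a uniquely named file and removes it when the block exits. The helper name temp_png is a hypothetical choice for illustration.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_png(directory="/tmp"):
    """Yield the path of a uniquely named .png file and delete it afterwards.

    Hypothetical helper: the directory defaults to /tmp to match the
    skill's sandbox rule.
    """
    fd, path = tempfile.mkstemp(suffix=".png", dir=directory)
    os.close(fd)  # Only the path is needed; the caller writes the image itself.
    try:
        yield path
    finally:
        # Remove the file even if rendering or OCR raised an exception.
        if os.path.exists(path):
            os.remove(path)
```

Inside the page loop, the save and OCR steps would then sit in a `with temp_png() as tmp_img:` block, replacing the hard-coded path and the explicit removal.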
Security & Sandbox Constraints
Write temporary images only to /tmp/ and clean them up immediately after extraction.
Do not attempt to dynamically download or install language packs via shell commands; notify the user if a specific language is missing.
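To notify the user about a missing language instead of installing it, the script can compare the requested languages with what Tesseract reports. This is a rough sketch that assumes the usual output of `tesseract --list-langs` (a header line followed by one language code per line); the function names are hypothetical.

```python
import subprocess


def parse_list_langs(output):
    """Parse `tesseract --list-langs` output into a set of language codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the header line starts with "List of"; every other line is a code.
    return {line for line in lines if not line.lower().startswith("list of")}


def missing_languages(requested, available):
    """Return the codes in a 'chi_sim+eng' style string that are not available."""
    return [code for code in requested.split("+") if code and code not in available]


def installed_languages():
    """Ask the local tesseract binary which language packs it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions may print the list to stderr, so both streams are parsed.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    missing = missing_languages("chi_sim+eng", installed_languages())
    if missing:
        print("Missing Tesseract language packs: " + ", ".join(missing))
        print("Please install them and run the extraction again.")
```

The check only reads the installed list and prints a message; downloading or installing the packs is left to the user.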