PDF OCR Parse
v1
Extract text from scanned PDFs using Tesseract OCR. Supports multiple languages, page selection, DPI control, and word-level bounding boxes.
What It Does
Rasterises each selected page of a PDF at the given DPI, then runs Tesseract OCR on each page image. Returns per-page text with confidence scores, and optionally per-word bounding boxes.

When to Use
Extract text from scanned PDF documents
OCR invoices, receipts, or legacy documents in PDF format
Extract digits-only data (invoice amounts) with char_whitelist
Process multi-language documents

Required Inputs
Provide one of:
url: URL to a scanned PDF
base64_pdf: base64-encoded PDF
Multipart upload with file field

Authentication
Send your API key in the CLIENT-API-KEY header.
Get your free API key at https://pdfapihub.com. Full API documentation is available at https://pdfapihub.com/docs.
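
The input choices and the key header described above can be combined into a single request. The following is a minimal sketch using only the Python standard library, not an official client. It assumes the endpoint is the URL from this page's curl example with its apparent garbling repaired (https://pdfapihub.com/api/v1/pdf/ocr/parse), that exactly one input form is sent per request, and that the body is JSON. The names build_request and pdf_bytes are hypothetical, and the multipart upload form is not covered.

```python
import base64
import json
import urllib.request

# Assumption: endpoint reconstructed from this page's curl example.
OCR_URL = "https://pdfapihub.com/api/v1/pdf/ocr/parse"


def build_request(api_key, url=None, pdf_bytes=None, **options):
    """Build (but do not send) a POST request for the OCR endpoint.

    Exactly one of `url` or `pdf_bytes` must be given; requiring exactly
    one is this sketch's reading of "Provide one of".
    Extra keyword arguments (lang, dpi, pages, ...) are copied into the body.
    """
    if (url is None) == (pdf_bytes is None):
        raise ValueError("provide exactly one of url or pdf_bytes")

    body = dict(options)
    if url is not None:
        body["url"] = url
    else:
        # base64_pdf carries the raw PDF bytes as base64 text.
        body["base64_pdf"] = base64.b64encode(pdf_bytes).decode("ascii")

    return urllib.request.Request(
        OCR_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "CLIENT-API-KEY": api_key,  # header name from the Authentication section
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. The response layout is not described in this section, so parsing it is left out.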

Use Cases
Scanned Invoice Processing: OCR scanned PDF invoices to extract text for accounting systems
Legacy Document Digitization: Convert old scanned paper documents into searchable text
Insurance Claims: Extract text from scanned claim forms and medical documents
Legal Discovery: OCR scanned legal documents for full-text search and review
Multi-Language Documents: Process documents in Hindi, French, German, etc. with language-specific models
Form Digitization: Extract filled field values from scanned paper forms

Tesseract Configuration
Param            Default   Description
lang             eng       Language code(s), + separated
psm              3         Page segmentation mode (0–13)
oem              3         OCR engine mode (0=legacy, 1=LSTM, 3=default)
dpi              200       Rasterisation DPI (72–400)
char_whitelist   (none)    Restrict to specific characters

Example Usage
curl -X POST https://pdfapihub.com/api/v1/pdf/ocr/parse \
  -H "CLIENT-API-KEY: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://pdfapihub.com/sample-pdfinvoice-with-image.pdf",
    "pages": "1-3",
    "lang": "eng",
    "dpi": 300,
    "detail": "words"
  }'
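
The ranges in the Tesseract Configuration table can be checked on the client before a request like the one above is sent. The sketch below is one possible way to do that, under a few assumptions: tesseract_options is a hypothetical helper name, char_whitelist is the assumed spelling of the character-restriction parameter, and oem accepts 0 to 3 because Tesseract itself also defines mode 2 (legacy plus LSTM). Whether this API accepts mode 2 is not stated here.

```python
def tesseract_options(lang="eng", psm=3, oem=3, dpi=200, char_whitelist=None):
    """Validate OCR options against the documented ranges and return a dict.

    Defaults mirror the configuration table; the returned dict can be
    merged into the JSON body of a request.
    """
    # lang is one or more language codes joined by "+", e.g. "eng+hin".
    if not lang or not all(lang.split("+")):
        raise ValueError(f"lang must be '+'-separated codes, got {lang!r}")
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be in 0-13, got {psm}")
    # Assumption: 2 is allowed because Tesseract defines it, although the
    # table only lists 0, 1 and 3.
    if oem not in (0, 1, 2, 3):
        raise ValueError(f"oem must be in 0-3, got {oem}")
    if not 72 <= dpi <= 400:
        raise ValueError(f"dpi must be in 72-400, got {dpi}")

    options = {"lang": lang, "psm": psm, "oem": oem, "dpi": dpi}
    if char_whitelist:
        # Only sent when set, e.g. "0123456789." for invoice amounts.
        options["char_whitelist"] = char_whitelist
    return options
```

For digits-only extraction from an English and Hindi invoice, a call such as tesseract_options(lang="eng+hin", dpi=300, char_whitelist="0123456789.") returns a dict that can be added to the request body alongside url or base64_pdf.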