PDF Text Extractor

Name: PDF Text Extractor
Rating: 20

v1.0.0

Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.


by @michael-laffin (Michael-laffin)·MIT

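Categories: file processing, document tools, deployment and operations

Use cases: processing PDFs, editing Excel files, generating Word documents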


License

MIT


Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install pdf-text-extractor

Mirror (accelerated): npx clawhub@latest install pdf-text-extractor --registry https://cn.longxiaskill.com (mirror available)

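Localization notes

PDF Text Extractor installation notes. Install with either of these commands: openclaw skills install pdf-text-extractor, or clawhub install pdf-text-extractor. A mirror for users in mainland China is supported; pass --registry https://cn.longxiaskill.com to speed up the download. This skill is used for document-processing operations and may require a corresponding platform account or API key.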


Skill documentation

PDF-Text-Extractor - Extract Text from PDFs

Vernox utility skill - Perfect for document digitization.

Overview

PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

Features

✅ Text Extraction
- Extract text from PDFs without external tools
- Support for both text-based and scanned PDFs
- Preserve document structure and formatting
- Fast extraction (milliseconds for text-based PDFs)

✅ OCR Support
- Use Tesseract.js for scanned documents
- Support multiple languages (English, Spanish, French, German)
- Configurable OCR quality/speed
- Fallback to text extraction when possible

✅ Batch Processing
- Process multiple PDFs at once
- Batch extraction for document workflows
- Progress tracking for large files
- Error handling and retry logic

✅ Output Options
- Plain text output
- JSON output with metadata
- Markdown conversion
- HTML output (preserving links)

✅ Utility Features
- Page-by-page extraction
- Character/word counting
- Language detection
- Metadata extraction (author, title, creation date)

Installation

    clawhub install pdf-text-extractor

Quick Start

Extract Text from PDF

    const result = await extractText({
      pdfPath: './document.pdf',
      options: {
        outputFormat: 'text',
        ocr: true,
        language: 'eng'
      }
    });

    console.log(result.text);
    console.log(`Pages: ${result.pages}`);
    console.log(`Words: ${result.wordCount}`);

Batch Extract Multiple PDFs

    const results = await extractBatch({
      pdfFiles: [
        './document1.pdf',
        './document2.pdf',
        './document3.pdf'
      ],
      options: {
        outputFormat: 'json',
        ocr: true
      }
    });

    console.log(`Extracted ${results.length} PDFs`);

Extract with OCR

    const result = await extractText({
      pdfPath: './scanned-document.pdf',
      options: {
        ocr: true,
        language: 'eng',
        ocrQuality: 'high'
      }
    });

    // OCR will be used (scanned document detected)
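The fallback behaviour hinted at in the comment above can be pictured with a small wrapper. This is a minimal sketch, not the skill's actual implementation: `extractWithFallback`, the injected `extract` function, the stub, and the "empty text layer means scanned" heuristic are all assumptions made for illustration.

```javascript
// Hypothetical helper: try the embedded text layer first, then fall back to OCR.
// `extract` is injected so the sketch runs without the skill installed; in real
// use it would presumably be the skill's extractText function.
async function extractWithFallback(extract, pdfPath, language = 'eng') {
  // First pass without OCR: the fast path for text-based PDFs.
  const direct = await extract({ pdfPath, options: { ocr: false, language } });

  // Assumption: a non-empty text layer means OCR is unnecessary.
  if (direct.text && direct.text.trim().length > 0) {
    return direct;
  }

  // Second pass with OCR enabled, for scanned documents.
  return extract({ pdfPath, options: { ocr: true, language } });
}

// Stub extractor standing in for extractText (illustration only).
async function stubExtract({ pdfPath, options }) {
  const scanned = pdfPath.includes('scanned');
  if (scanned && !options.ocr) {
    return { text: '', method: 'text' };
  }
  return { text: 'hello world', method: options.ocr ? 'ocr' : 'text' };
}
```

Calling `extractWithFallback(stubExtract, './scanned-document.pdf')` resolves to the stub's OCR result, while a text-based path returns after the first pass.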

Tool Functions

extractText

Extract text content from a single PDF file.

Parameters:

- pdfPath (string, required): Path to PDF file
- options (object, optional): Extraction options
  - outputFormat (string): 'text' | 'json' | 'markdown' | 'html'
  - ocr (boolean): Enable OCR for scanned docs
  - language (string): OCR language code ('eng', 'spa', 'fra', 'deu')
  - preserveFormatting (boolean): Keep headings/structure
  - minConfidence (number): Minimum OCR confidence score (0-100)
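A caller may want to check an options object against the documented values before passing it on. The sketch below is one way to do that. `normalizeOptions` is a hypothetical helper, and every default in `ASSUMED_DEFAULTS` is an assumption, since the documentation does not state any defaults. The language check covers only the four codes listed above; the skill may accept more.

```javascript
// Values taken from the documented parameter list.
const OUTPUT_FORMATS = ['text', 'json', 'markdown', 'html'];
const OCR_LANGUAGES = ['eng', 'spa', 'fra', 'deu'];

// Hypothetical defaults: the documentation does not state any.
const ASSUMED_DEFAULTS = {
  outputFormat: 'text',
  ocr: false,
  language: 'eng',
  preserveFormatting: false,
  minConfidence: 0
};

// Hypothetical helper: merge caller options over the assumed defaults and
// reject values outside the documented sets and ranges.
function normalizeOptions(options = {}) {
  const merged = { ...ASSUMED_DEFAULTS, ...options };

  if (!OUTPUT_FORMATS.includes(merged.outputFormat)) {
    throw new RangeError(`Unsupported outputFormat: ${merged.outputFormat}`);
  }
  if (!OCR_LANGUAGES.includes(merged.language)) {
    throw new RangeError(`Unsupported language: ${merged.language}`);
  }
  if (!Number.isFinite(merged.minConfidence) ||
      merged.minConfidence < 0 || merged.minConfidence > 100) {
    throw new RangeError('minConfidence must be a number between 0 and 100');
  }
  return merged;
}
```

For example, `normalizeOptions({ ocr: true })` returns the assumed defaults with `ocr` switched on, while `normalizeOptions({ outputFormat: 'pdf' })` throws a `RangeError`.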

Returns:

- text (string): Extracted text content
- pages (number): Number of pages processed
- wordCount (number): Total word count
- charCount (number): Total character count
- language (string): Detected language
- metadata (object): PDF metadata (title, author, creation date)
- method (string): 'text' or 'ocr' (extraction method)

extractBatch

Extract text from multiple PDF files at once.

Parameters:

- pdfFiles (array, required): Array of PDF file paths
- options (object, optional): Same as extractText
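The feature list mentions error handling and retry logic for batches, without describing how it works. As a rough illustration, a batch loop could collect failures instead of aborting, filling the summary fields documented for this function. `runBatch`, its `maxAttempts` parameter, the injected `extract` function, and the shape of each `errors` entry are assumptions, not the skill's real internals.

```javascript
// Hypothetical batch wrapper: runs an injected single-file extractor over each
// path, retries failed files, and records errors rather than throwing.
async function runBatch(extract, pdfFiles, options = {}, maxAttempts = 2) {
  const summary = {
    results: [],
    totalPages: 0,
    successCount: 0,
    failureCount: 0,
    errors: []
  };

  for (const pdfPath of pdfFiles) {
    let succeeded = false;
    let lastError = null;

    // Try each file up to maxAttempts times before giving up on it.
    for (let attempt = 1; attempt <= maxAttempts && !succeeded; attempt += 1) {
      try {
        const result = await extract({ pdfPath, options });
        summary.results.push(result);
        summary.totalPages += result.pages || 0;
        summary.successCount += 1;
        succeeded = true;
      } catch (err) {
        lastError = err;
      }
    }

    if (!succeeded) {
      summary.failureCount += 1;
      summary.errors.push({ pdfPath, message: lastError.message });
    }
  }
  return summary;
}
```

Processing files one at a time keeps the sketch simple; a real implementation might run several extractions concurrently.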

Returns:

- results (array): Array of extraction results
- totalPages (number): Total pages across all PDFs
- successCount (number): Successfully extracted
- failureCount (number): Failed extractions
- errors (array): Error details for failures

countWords

Count words in extracted text.

Parameters:

- text (string, required): Text to count
- options (object, optional):
  - minWordLength (number): Minimum characters per word (default: 3)
  - excludeNumbers (boolean): Don't count numbers as words
  - countByPage (boolean): Return word count per page
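To make the three options concrete, here is a small sketch of how such a counter could behave. It is not the skill's code. It assumes words are split on whitespace (so trailing punctuation counts toward a word's length) and that pages are separated by form-feed characters; `countWordsSketch` is a hypothetical name.

```javascript
// Sketch of the documented countWords options.
// Assumptions: whitespace-delimited words, pages separated by '\f'.
function countWordsSketch(text, options = {}) {
  const { minWordLength = 3, excludeNumbers = false, countByPage = false } = options;
  const isNumber = (word) => /^\d+([.,]\d+)*$/.test(word);

  // Count the words in one chunk of text that pass both filters.
  const countIn = (chunk) =>
    chunk
      .split(/\s+/)
      .filter((word) => word.length >= Math.max(1, minWordLength))
      .filter((word) => !(excludeNumbers && isNumber(word)))
      .length;

  const result = { wordCount: countIn(text), charCount: text.length };

  if (countByPage) {
    result.pageCounts = text.split('\f').map(countIn);
    result.averageWordsPerPage = result.wordCount / result.pageCounts.length;
  }
  return result;
}
```

With the documented default of `minWordLength: 3`, short words such as "on" or "A" are not counted, which is worth keeping in mind when comparing totals against other tools.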

Returns:

- wordCount (number): Total word count
- charCount (number): Total character count
- pageCounts (array): Word count per page
- averageWordsPerPage (number): Average words per page

detectLanguage

Detect the language of extracted text.

Parameters:

- text (string, required): Text to analyze
- minConfidence (number): Minimum confidence for detection
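The documentation does not say how detection works. One simple technique, shown here purely as an illustration and not as the skill's method, is to count common stop words per language. The word lists, the `detectLanguageSketch` name, the confidence formula, and the choice to return `null` below `minConfidence` are all assumptions.

```javascript
// Tiny stop-word lists for the four documented language codes (illustrative only).
const STOP_WORDS = {
  eng: { name: 'English', words: ['the', 'and', 'is', 'of', 'to', 'with'] },
  spa: { name: 'Spanish', words: ['el', 'los', 'y', 'que', 'por', 'con'] },
  fra: { name: 'French', words: ['le', 'les', 'et', 'est', 'une', 'dans'] },
  deu: { name: 'German', words: ['der', 'die', 'das', 'und', 'ist', 'nicht'] }
};

// Stop-word heuristic: the language whose list matches the most tokens wins.
// Confidence is that language's share of all stop-word hits, scaled to 0-100.
function detectLanguageSketch(text, minConfidence = 0) {
  const tokens = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);

  let totalHits = 0;
  let best = { code: null, hits: 0 };
  for (const [code, { words }] of Object.entries(STOP_WORDS)) {
    const hits = tokens.filter((token) => words.includes(token)).length;
    totalHits += hits;
    if (hits > best.hits) {
      best = { code, hits };
    }
  }

  const confidence = totalHits === 0 ? 0 : Math.round((100 * best.hits) / totalHits);
  if (best.code === null || confidence < minConfidence) {
    return { language: null, languageName: null, confidence };
  }
  return { language: best.code, languageName: STOP_WORDS[best.code].name, confidence };
}
```

A heuristic this small is only meaningful on longer passages; short or mixed-language text will produce low or misleading confidence values.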

Returns:

- language (string): Detected language code
- languageName (string): Full language name
- confidence (number): Confidence score (0-100)

Use Cases

Document Digitization
- Convert paper documents to digital text
- Process invoices and receipts
- Digitize contracts and agreements
- Archive physical documents

Content Analysis
- Extract text for analysis tools
- Prepare content for LLM processing
- Clean up scanned documents
- Parse PDF-based reports

Data Extraction
- Extract data from PDF reports
- Parse tables from PDFs
- Pull structured data
- Automate document workflows

Text Processing
- Prepare content for translation
- Clean up OCR output
- Extract specific sections
- Search within PDF content

Performance

Text-Based PDFs
- Speed: ~100ms for 10-page PDF
- Accuracy: 100% (exact text)
- Memory: ~10MB for typical document

OCR Processing
- Speed: ~1-3s per page (high quality)
- Accuracy: 85-95% (depends on scan quality)
- Memory: ~50-100MB peak during OCR

Technical Details

PDF Parsing
- Uses native PDF.js library
- Extracts text layer directly (no OCR needed)
- Preserves document structure
- Handles password-protected PDFs

OCR Engine
- Tesseract.js under the hood
- Supports 100+ languages
- Adjustable quality/speed tradeoff
- Confidence scoring for accuracy

Dependencies

ZERO exte
