Version
Initial release with streamlined document parsing and table extraction.
- Added support for parsing .docx, .pdf, and .txt files to extract plain text and tables.
- New command-line script `scripts/parse_document.py` for parsing documents and generating JSON/text output.
- Sample usage examples and quick-start documentation provided.
- Separate installation scripts for Linux/macOS and Windows environments.
- Updated output format to include text, tables, and basic document metadata.
- Removed legacy multi-language documentation and configuration files.
Skill Documentation
Extract text and tables from documents (.docx, .pdf, .txt) for analysis and question-answering.
Quick Start
Parse a document:
python scripts/parse_document.py /path/to/document.pdf
Output is JSON with extracted text, tables, and metadata.
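To drive the parser from another Python program, one option is to run the script as a subprocess and decode what it prints. The sketch below assumes the script writes the JSON document to standard output and exits with a non-zero status on failure; neither detail is stated here. `build_command` and `run_parser` are hypothetical helper names, not part of the skill.

```python
import json
import subprocess
import sys


def build_command(document_path, script="scripts/parse_document.py"):
    """Return the argv list for the documented command-line call."""
    return [sys.executable, script, document_path]


def run_parser(document_path):
    """Run the parser and decode its output (assumes JSON on stdout)."""
    completed = subprocess.run(
        build_command(document_path),
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Calling `run_parser("/path/to/document.pdf")` would then return a dictionary with the `text`, `tables`, and `metadata` keys described under Output Format, provided the assumptions above hold.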
Installation
First use only: Install dependencies by running:
- Linux/macOS: `bash scripts/install_dependencies.sh`
- Windows: `scripts\install_dependencies.bat`
This installs: python-docx, PyPDF2, pdfplumber
Supported Formats
| Format | Text | Tables | Notes |
|---|---|---|---|
| .txt | ✅ | ❌ | Direct text extraction |
| .docx | ✅ | ✅ | Paragraphs + structured tables |
| .pdf | ✅ | ✅ | Page-by-page extraction |
Workflow
- Parse the document using `scripts/parse_document.py`
- Analyze the output (text and tables in JSON)
- Answer the user's question using extracted content
Example: Answering questions about a document
User: "What's the total revenue in quarterly_report.docx?"
Steps:
- Run: `python scripts/parse_document.py quarterly_report.docx`
- Locate tables in output
- Find revenue column and calculate total
- Reply with answer
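The "find revenue column and calculate total" step can be sketched in a few lines once a table has been picked out of the JSON output. This is only an illustration: it assumes the first row of the table is the header, that the column is literally named "Revenue", and that cells are strings that may carry thousands separators or a dollar sign. The sample table is made up, and `column_total` is a hypothetical helper.

```python
def column_total(table, column_name):
    """Sum the numeric cells of one column; the first row is treated as the header."""
    header, *rows = table
    index = header.index(column_name)  # raises ValueError if the column is absent
    total = 0.0
    for row in rows:
        raw = row[index]
        # Empty cells may arrive as None or as an empty string; skip both.
        cell = "" if raw is None else str(raw).replace(",", "").replace("$", "").strip()
        if cell:
            total += float(cell)
    return total


# Illustrative data only, shaped like one entry of the "tables" list.
sample = [["Quarter", "Revenue"], ["Q1", "1,200"], ["Q2", "$800.50"], ["Q3", None]]
print(column_total(sample, "Revenue"))  # 2000.5
```

Real reports often include a pre-computed "Total" row, which this sketch would count twice, so inspect the rows before trusting the sum.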
Output Format
Default JSON output:
{
"text": "Full document text...",
"tables": [
[["Header 1", "Header 2"], ["Data 1", "Data 2"]]
],
"metadata": {
"format": "pdf",
"pages": 3,
"tables": 1
}
}
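Each table in this structure is a list of rows, so it can be turned into header-keyed records for easier lookups. The sketch below loads the sample shown above and assumes the first row of every table is its header, which the sample suggests but the format does not guarantee. `table_to_records` is a hypothetical helper name.

```python
import json

# The sample output from above, as a JSON string.
raw = """
{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {"format": "pdf", "pages": 3, "tables": 1}
}
"""


def table_to_records(table):
    """Convert [[header...], [row...], ...] into a list of dicts keyed by header."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


result = json.loads(raw)
records = [table_to_records(table) for table in result["tables"]]
print(records[0])  # [{'Header 1': 'Data 1', 'Header 2': 'Data 2'}]
```

In this sample, `metadata["tables"]` matches the length of the `tables` list, which suggests it is a count of extracted tables.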
Human-readable format (add --format text):
==========================================================
EXTRACTED TEXT:
==========================================================
Document content here...
==========================================================
TABLES FOUND: 2
==========================================================
Table 1:
Name | Age | City
John | 30 | NYC
Jane | 25 | LA
Advanced Usage
For detailed examples and edge cases, see references/usage_examples.md.
Error Handling
If dependencies are missing, the script returns an error with installation instructions. Run the appropriate install script to resolve.
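A caller can also check for the three packages up front, before invoking the script. The sketch below is one way to do that with the standard library; it is not necessarily how `scripts/parse_document.py` performs its own check, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Package name on PyPI -> module name used in an import statement.
# python-docx is imported as "docx"; the other two share the same name.
REQUIRED = {"python-docx": "docx", "PyPDF2": "PyPDF2", "pdfplumber": "pdfplumber"}


def missing_packages(required=REQUIRED):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]
```

If `missing_packages()` returns a non-empty list, running the install script for the current platform should resolve it.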
Notes
- Large PDFs: Processing may be slow for documents longer than 50 pages
- Scanned PDFs: OCR not supported; text must be selectable
- Complex tables: PDF table extraction works best with clear borders
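Because OCR is not supported, a scanned PDF will most likely come back with pages but little or no text. The heuristic below flags that case from the parsed JSON output so the user can be told why nothing was extracted. It assumes the `metadata` keys shown under Output Format and that scanned input yields an empty `text` field rather than an error, which is not confirmed here. `looks_scanned` is a hypothetical helper.

```python
def looks_scanned(result):
    """Heuristic: a PDF that has pages but no extracted text is probably image-only."""
    metadata = result.get("metadata", {})
    is_pdf = metadata.get("format") == "pdf"
    has_pages = metadata.get("pages", 0) > 0
    has_text = bool(result.get("text", "").strip())
    return is_pdf and has_pages and not has_text
```

A `True` result is only a hint: a PDF made entirely of blank pages would trigger it as well.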