
document-parser (Skill Tool)

v1.0.0

Parse and extract content from .docx, .pdf, and .txt documents. Extracts plain text and tables for analysis. Use when the user uploads a document file...

by @mjk39966-glitch · MIT-0
License
MIT-0
Last updated
2026/3/24
Security scan
VirusTotal
Benign
OpenClaw
Suspicious
medium confidence
The skill's stated purpose (document parsing) matches the included scripts, but there are inconsistencies and obvious bugs in the package (broken main.py, mismatched entry point vs runtime instructions) that indicate sloppy or incomplete packaging and warrant caution before use.
Assessment recommendations
This skill appears to implement a legitimate document parser, but packaging issues make it risky to install and run blindly. Before using: (1) inspect main.py — it currently contains a syntax error (the if __name__ guard is malformed) and differs from parse_document.py; (2) prefer running scripts/parse_document.py (the README and SKILL.md point to it) in a sandboxed environment to confirm behavior; (3) run the install script in a virtualenv rather than system Python; (4) test with non-sensitive ...
Detailed analysis
Purpose and capabilities
The name/description match the included code: parse_document.py, README, and SKILL.md all describe extracting text and tables from .docx/.pdf/.txt. However, skill.yaml lists main.py as the entry point while the runtime instructions and README instruct using scripts/parse_document.py. main.py provides a much more limited parser (returns only text and status) and contains a syntax error, so the declared entry conflicts with the actual usable parser.
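The syntax error reported for main.py can be confirmed without executing any of the package's code. The following is a minimal sketch that uses Python's standard `ast` module; the helper name `syntax_error_in` is hypothetical and not part of the skill.

```python
import ast
from pathlib import Path


def syntax_error_in(path):
    """Return a short description of the first syntax error in a file, or None."""
    source = Path(path).read_text(encoding="utf-8")
    try:
        # ast.parse only builds a syntax tree; the file's code is never run.
        ast.parse(source, filename=str(path))
    except SyntaxError as exc:
        return f"{path}:{exc.lineno}: {exc.msg}"
    return None
```

Calling `syntax_error_in("main.py")` and `syntax_error_in("scripts/parse_document.py")` from the unpacked package directory would show which of the two entry points can be loaded at all.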
Instruction scope
SKILL.md instructs running the local parse script and an included install script. The instructions stay within the stated purpose (parsing docs) and do not request unrelated files, credentials, or external endpoints. They accurately explain limitations (no OCR) and how to install required Python packages.
Install mechanism
No external downloads or obscure install hosts. The included scripts/install_dependencies.sh runs pip to install python-docx, PyPDF2, and pdfplumber — standard public Python packages. This is low-to-moderate risk and matches the skill's needs.
Credential requirements
The skill requests no environment variables, no credentials, and references no config paths. All required capabilities align with parsing local document files.
Persistence and privileges
The `always` flag is false, and nothing indicates that the skill requests a permanent system presence or modifies other skills or configuration. Autonomous invocation is allowed by default, but it is not combined with other concerning flags.
Security is layered; review the code before running it.

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Versions

latest · v1.0.0 · 2026/3/24

Initial release with streamlined document parsing and table extraction.

  • Added support for parsing .docx, .pdf, and .txt files to extract plain text and tables.
  • New command-line script: `scripts/parse_document.py` for parsing documents and generating JSON/text output.
  • Sample usage examples and quick-start documentation provided.
  • Separate installation scripts for Linux/macOS and Windows environments.
  • Updated output format to include text, tables, and basic document metadata.
  • Removed legacy multi-language documentation and configuration files.

● Benign

Install command

Official: npx clawhub@latest install mjk39966-document-parser
Mirror: npx clawhub@latest install mjk39966-document-parser --registry https://cn.clawhub-mirror.com

Skill documentation

Extract text and tables from documents (.docx, .pdf, .txt) for analysis and question-answering.

Quick Start

Parse a document:

python scripts/parse_document.py /path/to/document.pdf

Output is JSON with extracted text, tables, and metadata.
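The same call can be made from Python when the result is needed programmatically. This is a sketch under stated assumptions: the script is assumed to print its JSON to standard output, to exit with a non-zero status on failure, and to be run from the skill's root directory. The function names `build_command` and `parse_document` are hypothetical.

```python
import json
import subprocess
import sys


def build_command(document_path, output_format=None):
    """Build the argument list for the documented command line."""
    command = [sys.executable, "scripts/parse_document.py", str(document_path)]
    if output_format is not None:
        # "--format text" is the only format flag the documentation mentions.
        command += ["--format", output_format]
    return command


def parse_document(document_path):
    """Run the parser and decode its default JSON output (assumed to be on stdout)."""
    completed = subprocess.run(
        build_command(document_path),
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError on a non-zero exit status
    )
    return json.loads(completed.stdout)
```

With this in place, `parse_document("/path/to/document.pdf")["tables"]` would give the extracted tables as nested lists, provided the assumptions above hold for the installed version.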

Installation

First use only: Install dependencies by running:

  • Linux/macOS: bash scripts/install_dependencies.sh
  • Windows: scripts\install_dependencies.bat

This installs: python-docx, PyPDF2, pdfplumber
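The security assessment above recommends installing into a virtualenv rather than system Python. One way to do that, sketched here with the standard `venv` module, is to install the three listed packages directly instead of running the bundled script. The directory name `.venv` and both function names are illustrative assumptions, and this does not reproduce whatever else the bundled install scripts may do.

```python
import os
import subprocess
import venv
from pathlib import Path

# The packages named in the installation notes.
PACKAGES = ["python-docx", "PyPDF2", "pdfplumber"]


def venv_python(env_dir):
    """Return the path of the Python interpreter inside a virtual environment."""
    env_dir = Path(env_dir)
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_into_venv(env_dir=".venv"):
    """Create an isolated environment and install the parser's dependencies into it."""
    venv.EnvBuilder(with_pip=True).create(env_dir)
    subprocess.run(
        [str(venv_python(env_dir)), "-m", "pip", "install", *PACKAGES],
        check=True,
    )
```

After `install_into_venv()`, the parser would be run with the interpreter returned by `venv_python(".venv")` so that it sees the isolated packages.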

Supported Formats

Format | Text | Tables | Notes
.txt | | | Direct text extraction
.docx | | | Paragraphs + structured tables
.pdf | | | Page-by-page extraction

Workflow

  • Parse the document using scripts/parse_document.py
  • Analyze the output (text and tables in JSON)
  • Answer the user's question using extracted content

Example: Answering questions about a document

User: "What's the total revenue in quarterly_report.docx?"

Steps:

  • Run: python scripts/parse_document.py quarterly_report.docx
  • Locate tables in output
  • Find revenue column and calculate total
  • Reply with answer
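The "find revenue column and calculate total" step could look like the sketch below. It assumes the first row of each table is a header row and that amounts arrive as strings, possibly with thousands separators or a currency sign. The table contents and the helper name `column_total` are hypothetical.

```python
def column_total(table, column_name):
    """Sum the numeric cells of one column; the first row is treated as the header."""
    header, *rows = table
    index = header.index(column_name)  # raises ValueError if the column is absent
    total = 0.0
    for row in rows:
        if index >= len(row):
            continue  # ragged row with no cell in this column
        cell = (row[index] or "").replace(",", "").replace("$", "").strip()
        try:
            total += float(cell)
        except ValueError:
            continue  # skip blanks and non-numeric cells such as "n/a"
    return total


# Hypothetical parser output for quarterly_report.docx.
parsed = {
    "tables": [
        [["Quarter", "Revenue"], ["Q1", "1,200.50"], ["Q2", "980"], ["Q3", "n/a"]]
    ]
}
print(column_total(parsed["tables"][0], "Revenue"))  # 2180.5
```

Skipping non-numeric cells silently is a design choice that suits a quick answer; a stricter version might report which rows were ignored.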

Output Format

Default JSON output:

{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {
    "format": "pdf",
    "pages": 3,
    "tables": 1
  }
}
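Because each table is a list of rows with the header first, it converts easily into records keyed by column name. The sketch below applies that to the sample output above; `table_to_records` is a hypothetical helper, and treating the first row as a header is an assumption based on the sample.

```python
import json


def table_to_records(table):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


# The sample output from the documentation, used here as literal input.
sample = json.loads("""
{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {"format": "pdf", "pages": 3, "tables": 1}
}
""")

for table in sample["tables"]:
    print(table_to_records(table))  # [{'Header 1': 'Data 1', 'Header 2': 'Data 2'}]
```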

Human-readable format (add --format text):

==========================================================
EXTRACTED TEXT:
==========================================================
Document content here...

==========================================================
TABLES FOUND: 2
==========================================================

Table 1:
Name | Age | City
John | 30 | NYC
Jane | 25 | LA

Advanced Usage

For detailed examples and edge cases, see references/usage_examples.md.

Error Handling

If dependencies are missing, the script returns an error with installation instructions. Run the appropriate install script to resolve.
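Missing dependencies can also be detected before the script is run. This sketch checks whether each package is importable; the import names (`docx` for python-docx, `PyPDF2`, `pdfplumber`) are the ones those packages publish, while the helper `missing_dependencies` is hypothetical and is not how the script itself performs the check.

```python
import importlib.util

# Distribution name (as passed to pip) mapped to its import name.
REQUIRED = {"python-docx": "docx", "PyPDF2": "PyPDF2", "pdfplumber": "pdfplumber"}


def missing_dependencies(required=REQUIRED):
    """Return the pip names of required packages that cannot be imported."""
    return [
        dist
        for dist, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_dependencies()
if missing:
    print("Missing packages:", ", ".join(missing))
```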

Notes

  • Large PDFs: Processing may take time for documents >50 pages
  • Scanned PDFs: OCR not supported; text must be selectable
  • Complex tables: PDF table extraction works best with clear borders
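The first two notes can be turned into simple checks on the parser's output. This is a heuristic sketch only: it assumes that a scanned PDF yields empty text while still reporting pages, which the documentation does not state explicitly. The function name `parse_warnings` is hypothetical.

```python
def parse_warnings(parsed, large_pdf_pages=50):
    """Return human-readable warnings derived from the parser's JSON output."""
    warnings = []
    metadata = parsed.get("metadata", {})
    if metadata.get("format") != "pdf":
        return warnings
    pages = metadata.get("pages", 0)
    if pages > large_pdf_pages:
        warnings.append(f"Large PDF ({pages} pages): parsing may be slow.")
    # Assumption: a PDF with pages but no extracted text is probably a scan.
    if pages > 0 and not parsed.get("text", "").strip():
        warnings.append("No selectable text found: the PDF may be scanned (OCR is not supported).")
    return warnings
```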
Data source: ClawHub · Chinese localization: 龙虾技能库