document-parser (Skill Tool)

Name: document-parser (Skill Tool)
Author: mjk39966-glitch

v1.0.0

[Auto-translated] Parse and extract content from .docx, .pdf, and .txt documents. Extracts plain text and tables for analysis. Use when the user uploads a document file...

by @mjk39966-glitch · MIT-0

Tags: document tools, file processing, AI model access, data analysis, network tools

License

MIT-0

Last updated

2026/3/24

Security scan

VirusTotal

Benign

OpenClaw

Suspicious

medium confidence

The skill's stated purpose (document parsing) matches the included scripts, but there are inconsistencies and obvious bugs in the package (broken main.py, mismatched entry point vs runtime instructions) that indicate sloppy or incomplete packaging and warrant caution before use.

Assessment recommendations

This skill appears to implement a legitimate document parser, but packaging issues make it risky to install and run blindly. Before using: (1) inspect main.py — it currently contains a syntax error (the if __name__ guard is malformed) and differs from parse_document.py; (2) prefer running scripts/parse_document.py (the README and SKILL.md point to it) in a sandboxed environment to confirm behavior; (3) run the install script in a virtualenv rather than system Python; (4) test with non-sensitive ...

Detailed analysis

ℹ Purpose and capability

The name/description match the included code: parse_document.py, README, and SKILL.md all describe extracting text and tables from .docx/.pdf/.txt. However, skill.yaml lists main.py as the entry point while the runtime instructions and README instruct using scripts/parse_document.py. main.py provides a much more limited parser (returns only text and status) and contains a syntax error, so the declared entry conflicts with the actual usable parser.

✓ Instruction scope

SKILL.md instructs running the local parse script and an included install script. The instructions stay within the stated purpose (parsing docs) and do not request unrelated files, credentials, or external endpoints. They accurately explain limitations (no OCR) and how to install required Python packages.

✓ Install mechanism

No external downloads or obscure install hosts. The included scripts/install_dependencies.sh runs pip to install python-docx, PyPDF2, and pdfplumber — standard public Python packages. This is low-to-moderate risk and matches the skill's needs.

✓ Credential requirements

The skill requests no environment variables, no credentials, and references no config paths. All required capabilities align with parsing local document files.

✓ Persistence and privileges

always is false and there is no indication the skill requests permanent system presence or modifies other skills/config. Autonomous invocation is allowed by default but not combined with other concerning flags.

Security is layered; review the code before running it.

License

MIT-0

Free to use, modify, and redistribute without attribution.

Runtime dependencies

No special dependencies

Versions

latest · v1.0.0 · 2026/3/24

Initial release with streamlined document parsing and table extraction.
- Added support for parsing .docx, .pdf, and .txt files to extract plain text and tables.
- New command-line script: `scripts/parse_document.py` for parsing documents and generating JSON/text output.
- Sample usage examples and quick-start documentation provided.
- Separate installation scripts for Linux/macOS and Windows environments.
- Updated output format to include text, tables, and basic document metadata.
- Removed legacy multi-language documentation and configuration files.

● Benign

Install command

Official: npx clawhub@latest install mjk39966-document-parser

Mirror: npx clawhub@latest install mjk39966-document-parser --registry https://cn.clawhub-mirror.com

Skill documentation

Extract text and tables from documents (.docx, .pdf, .txt) for analysis and question-answering.

Quick Start

Parse a document:

python scripts/parse_document.py /path/to/document.pdf

Output is JSON with extracted text, tables, and metadata.
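
As a minimal sketch of consuming that JSON from Python, the helper below runs the script as a subprocess and decodes its standard output. The function name `parse_document` and its `script` parameter are hypothetical, and the sketch assumes the script prints a single JSON document to stdout and exits with a non-zero code on failure, which this page does not confirm.

```python
import json
import subprocess
import sys
from pathlib import Path


def parse_document(path, script="scripts/parse_document.py"):
    """Run the parser script on `path` and return its decoded JSON output.

    Hypothetical helper: assumes the script prints one JSON document to stdout.
    """
    completed = subprocess.run(
        [sys.executable, str(script), str(Path(path))],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Using `sys.executable` keeps the call inside the same interpreter (and virtualenv) that has the parser's dependencies installed.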

Installation

First use only: Install dependencies by running:

Linux/macOS: bash scripts/install_dependencies.sh
Windows: scripts\install_dependencies.bat

This installs: python-docx, PyPDF2, pdfplumber
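
To confirm the installation worked before parsing anything, a check along the following lines may help. It is only a sketch: the `REQUIRED` mapping and `missing_packages` are hypothetical names, and the import names (`docx` for python-docx, `PyPDF2`, `pdfplumber`) are the ones those packages are normally imported under.

```python
import importlib.util

# Maps each pip package name to the module name it is imported under.
REQUIRED = {
    "python-docx": "docx",
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All parser dependencies are importable.")
```

Because `find_spec` only locates the module without importing it, the check has no side effects.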

Supported Formats

Format	Text	Tables	Notes
.txt	✅	❌	Direct text extraction
.docx	✅	✅	Paragraphs + structured tables
.pdf	✅	✅	Page-by-page extraction
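
The table above can be mirrored in code to reject unsupported files before invoking the parser. This is an illustrative sketch; `SUPPORTED` and `capabilities` are hypothetical names, not part of the skill.

```python
from pathlib import Path

# Mirrors the table above: which content types each extension yields.
SUPPORTED = {
    ".txt": {"text": True, "tables": False},
    ".docx": {"text": True, "tables": True},
    ".pdf": {"text": True, "tables": True},
}


def capabilities(path):
    """Return the capability entry for `path`, or None if unsupported."""
    return SUPPORTED.get(Path(path).suffix.lower())
```

A caller could skip table analysis when `capabilities(path)["tables"]` is false, as it is for .txt input.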

Workflow

1. Parse the document using scripts/parse_document.py
2. Analyze the output (text and tables in JSON)
3. Answer the user's question using extracted content

Example: Answering questions about a document

User: "What's the total revenue in quarterly_report.docx?"

Steps:

1. Run: python scripts/parse_document.py quarterly_report.docx
2. Locate tables in output
3. Find revenue column and calculate total
4. Reply with answer
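
The "find revenue column and calculate total" step could look roughly like the sketch below. It assumes each table's first row is a header and that the column is literally named "Revenue"; the sample data and the `column_total` helper are hypothetical, not taken from the skill.

```python
def column_total(tables, column_name):
    """Sum a named column across all tables whose first row is a header.

    Cells that cannot be read as numbers are skipped.
    """
    total = 0.0
    wanted = column_name.lower()
    for table in tables:
        if not table:
            continue
        header = [str(cell).strip().lower() for cell in table[0]]
        if wanted not in header:
            continue
        index = header.index(wanted)
        for row in table[1:]:
            if index >= len(row) or row[index] is None:
                continue
            cleaned = str(row[index]).replace(",", "").replace("$", "").strip()
            try:
                total += float(cleaned)
            except ValueError:
                pass  # non-numeric cell, such as a label or a blank


    return total


# Hypothetical parsed output for quarterly_report.docx
result = {
    "tables": [
        [["Quarter", "Revenue"], ["Q1", "1,200"], ["Q2", "1,300.50"]],
    ]
}
print(column_total(result["tables"], "Revenue"))
```

If the source table already contains its own "Total" row, that row would be counted as well, so real reports may need that row filtered out first.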

Output Format

Default JSON output:

{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {
    "format": "pdf",
    "pages": 3,
    "tables": 1
  }
}

Human-readable format (add --format text):

==========================================================
EXTRACTED TEXT:
==========================================================
Document content here...
==========================================================
TABLES FOUND: 2
==========================================================
Table 1:
Name | Age | City
John | 30 | NYC
Jane | 25 | LA
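
For readers working from the JSON output, the following sketch renders a result dict in a layout similar to the sample above. It is not the script's own formatter: `render_text` is a hypothetical helper, and the rule width is an assumption.

```python
RULE = "=" * 58  # rule width is an assumption, not taken from the script


def render_text(result):
    """Render a parsed result dict in a layout like the sample above."""
    tables = result.get("tables", [])
    lines = [RULE, "EXTRACTED TEXT:", RULE, result.get("text", "")]
    lines += [RULE, f"TABLES FOUND: {len(tables)}", RULE]
    for number, table in enumerate(tables, start=1):
        lines.append(f"Table {number}:")
        for row in table:
            # Empty (None) cells are rendered as empty strings.
            lines.append(" | ".join("" if c is None else str(c) for c in row))
    return "\n".join(lines)
```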

Advanced Usage

For detailed examples and edge cases, see references/usage_examples.md.

Error Handling

If dependencies are missing, the script returns an error with installation instructions. Run the appropriate install script to resolve.

Notes

Large PDFs: Processing may take time for documents >50 pages
Scanned PDFs: OCR not supported; text must be selectable
Complex tables: PDF table extraction works best with clear borders
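
Since scanned PDFs yield little or no text, a caller may want to detect that case and warn the user instead of answering from empty content. The heuristic below is a sketch under the assumption that the metadata carries `format` and `pages` as in the sample JSON output; `looks_scanned` and its threshold are hypothetical.

```python
def looks_scanned(result, min_chars_per_page=20):
    """Guess whether a parsed PDF is image-only (scanned).

    Heuristic only: flags PDFs whose extracted text is very short
    relative to the page count reported in the metadata.
    """
    metadata = result.get("metadata", {})
    if metadata.get("format") != "pdf":
        return False
    pages = metadata.get("pages") or 0
    text = (result.get("text") or "").strip()
    return pages > 0 and len(text) < pages * min_chars_per_page
```

The threshold is deliberately low so that short but genuine PDFs, such as a one-paragraph letter, are not flagged.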

Data source: ClawHub · Chinese localization: 龙虾技能库
