DOCX Toolkit

v1.0.0

Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication and filtering for extracted images.

by @zacjiang (Shihao Jiang (Zac))·MIT-0

Tags: document tools, file processing, image processing, WeChat

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install docx-toolkit

Mirror: npx clawhub@latest install docx-toolkit --registry https://cn.longxiaskill.com (mirror available)

Skill documentation

DOCX Toolkit

A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).

Capabilities

Text + Table Extraction (.docx)

python3 {baseDir}/scripts/extract_text.py input.docx output.txt

Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
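As an illustration of the pipe-delimited table format, here is a minimal sketch built on python-docx. The function names `rows_to_pipe_lines` and `extract_docx_text` are hypothetical and are not taken from the bundled script, which may differ (for example, this sketch emits all paragraphs first and tables afterwards rather than interleaving them in document order).

```python
from typing import Iterable, List


def rows_to_pipe_lines(rows: Iterable[Iterable[str]]) -> List[str]:
    """Format table rows as pipe-delimited lines, one line per row."""
    lines = []
    for row in rows:
        # Collapse internal whitespace so a multi-line cell stays on one row.
        cells = [" ".join(str(cell).split()) for cell in row]
        lines.append(" | ".join(cells))
    return lines


def extract_docx_text(path: str) -> str:
    """Return non-empty paragraph text followed by pipe-delimited tables.

    Assumption: paragraphs first, then tables; document order is not kept.
    """
    from docx import Document  # python-docx, imported lazily

    doc = Document(path)
    out = [p.text for p in doc.paragraphs if p.text.strip()]
    for table in doc.tables:
        out.append("")  # blank line before each table
        out.extend(
            rows_to_pipe_lines([cell.text for cell in row.cells] for row in table.rows)
        )
    return "\n".join(out)
```

Each output row can then be split back into cells with `line.split(" | ")`, provided no cell itself contains that separator.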

Text Extraction (Legacy .doc)

python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt

Handles the legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
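One way to pull Unicode text out of the WordDocument stream is to decode it as UTF-16LE and keep only runs of readable characters. The sketch below is a rough heuristic under that assumption, not the bundled script's method: it ignores the file's piece table, misses text stored as 8-bit characters or at odd byte offsets, and may pick up stray fragments. The names `utf16_text_runs` and `extract_doc_text` are hypothetical.

```python
import re
from typing import List


def utf16_text_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of readable characters found in UTF-16LE decoded bytes."""
    even = data[: len(data) - len(data) % 2]  # drop a trailing odd byte
    text = even.decode("utf-16-le", errors="replace")
    # Printable ASCII, tab/CR/LF, CJK punctuation and kana, CJK ideographs, Hangul.
    pattern = re.compile(
        r"[\t\r\n\x20-\x7e\u3000-\u30ff\u4e00-\u9fff\uac00-\ud7a3]{%d,}" % min_len
    )
    runs = (m.group().strip() for m in pattern.finditer(text))
    return [r for r in runs if r]


def extract_doc_text(path: str) -> str:
    """Read the WordDocument stream of a legacy .doc and return its text runs."""
    import olefile  # imported lazily

    ole = olefile.OleFileIO(path)
    try:
        if not ole.exists("WordDocument"):
            raise ValueError("no WordDocument stream; not a Word .doc file?")
        data = ole.openstream("WordDocument").read()
    finally:
        ole.close()
    return "\n".join(utf16_text_runs(data))
```

Reading the whole stream into memory keeps the code short, and it is consistent with large files needing a lot of RAM.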

Image Extraction (.docx)

python3 {baseDir}/scripts/extract_images.py input.docx output_dir/

Extracts all embedded images with:

Automatic deduplication (MD5 hash comparison)
Size filtering (skips tiny icons <5KB by default)
Sequential renaming (img_001.png, img_002.jpg, etc.)
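The three behaviors above can be sketched with the standard library alone, since a .docx file is a zip archive that stores embedded images under word/media/. This is an illustrative sketch, not the bundled script: the function name `extract_images`, the 5 * 1024 byte reading of "5KB", and the sorted archive order are all assumptions.

```python
import hashlib
import os
import zipfile
from typing import List


def extract_images(docx_path: str, out_dir: str, min_bytes: int = 5 * 1024) -> List[str]:
    """Copy unique, non-tiny images out of a .docx; return the new file names."""
    os.makedirs(out_dir, exist_ok=True)
    seen = set()   # MD5 digests of images already written
    written = []   # new file names, in output order
    with zipfile.ZipFile(docx_path) as zf:
        names = sorted(
            n for n in zf.namelist()
            if n.startswith("word/media/") and not n.endswith("/")
        )
        for name in names:
            data = zf.read(name)
            if len(data) < min_bytes:
                continue  # size filter: skip tiny icons
            digest = hashlib.md5(data).hexdigest()
            if digest in seen:
                continue  # exact duplicate of an earlier image
            seen.add(digest)
            # Keep the original extension so the image format is preserved.
            ext = os.path.splitext(name)[1].lower() or ".bin"
            new_name = "img_%03d%s" % (len(written) + 1, ext)
            with open(os.path.join(out_dir, new_name), "wb") as fh:
                fh.write(data)
            written.append(new_name)
    return written
```

Because the comparison is on an MD5 digest of the raw bytes, only byte-identical files are treated as duplicates; a re-encoded or resized copy of the same picture gets its own output file.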

Image Compression

python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]

Batch resize and compress images for API processing (saves 50-70% on vision API costs).
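To show what a width cap such as `--max-width 1024` could mean in practice, here is a minimal sketch using Pillow. It assumes that images wider than the cap are scaled down with their aspect ratio preserved and that narrower images are left at their original size; the bundled script may behave differently. `target_size` and `resize_image` are hypothetical names.

```python
from typing import Tuple


def target_size(width: int, height: int, max_width: int = 1024) -> Tuple[int, int]:
    """Return the (width, height) to resize to; never upscales."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    # Keep at least one pixel of height for extremely wide images.
    return max_width, max(1, round(height * scale))


def resize_image(src: str, dst: str, max_width: int = 1024) -> None:
    """Save a copy of src at dst, scaled down to max_width if needed."""
    from PIL import Image  # Pillow, imported lazily

    with Image.open(src) as img:
        new_size = target_size(img.width, img.height, max_width)
        if new_size != img.size:
            img = img.resize(new_size, Image.LANCZOS)
        img.save(dst)  # output format is inferred from dst's extension
```

For example, `target_size(2048, 1000)` gives `(1024, 500)`, which holds a quarter of the original pixel count.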

Dependencies

Python 3.6+
python-docx: for .docx processing
olefile: for legacy .doc processing
Pillow: for image resizing (optional, only needed for the resize script)

Install:

pip3 install python-docx olefile Pillow

Use Cases

Document analysis: extract text for AI review or summarization
Migration: pull content from Word docs into other formats
Image audit: extract and review all embedded images
Cost optimization: compress images before sending them to vision APIs
Batch processing: process multiple documents in a pipeline

Notes

Large .doc files (>200MB) may require significant RAM for olefile processing
Image extraction preserves the original format (png/jpg/gif/etc.)
Deduplication catches exact duplicates; near-duplicates still pass through
CJK (Chinese/Japanese/Korean) text is fully supported in both extractors
