DOCX Toolkit
A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).
Capabilities
- Text + Table Extraction (.docx)
Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
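
As an illustration of the paragraph and pipe-delimited table output described above, here is a minimal sketch built on python-docx. The function names `format_row` and `extract_docx` are hypothetical and are not taken from the toolkit's own scripts. The sketch returns paragraphs and tables as two separate lists, so it does not reproduce the order in which they are interleaved in the document.

```python
def format_row(cells):
    """Join cell texts into one pipe-delimited line (hypothetical helper)."""
    # Collapse newlines and repeated whitespace so each table row stays on one line.
    return " | ".join(" ".join(str(cell).split()) for cell in cells)


def extract_docx(path):
    """Return (paragraphs, tables) read from a .docx file.

    paragraphs: list of non-empty paragraph strings
    tables:     list of tables, each a list of pipe-delimited row strings
    """
    from docx import Document  # python-docx, imported lazily

    doc = Document(path)
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    tables = [
        [format_row(cell.text for cell in row.cells) for row in table.rows]
        for table in doc.tables
    ]
    return paragraphs, tables
```

A call such as `extract_docx("report.docx")` (the file name is only an example) would yield rows like `Name | Qty`. Cell text that itself contains a `|` character is not escaped in this sketch.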
- Text Extraction (Legacy .doc)
Handles the legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
- Image Extraction (.docx)
Extracts all embedded images with:
  - Automatic deduplication (MD5 hash comparison)
  - Size filtering (skips tiny icons <5KB by default)
  - Sequential renaming (img_001.png, img_002.jpg, etc.)
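
The three steps above (deduplicate, filter by size, rename) can be sketched with the standard library alone, since a .docx file is a ZIP archive. This is not necessarily how the toolkit's own script works. The function name `extract_images`, the `min_bytes` parameter, and the assumption that embedded images sit under `word/media/` in the archive are all illustrative choices.

```python
import hashlib
import os
import zipfile


def extract_images(docx_path, out_dir, min_bytes=5 * 1024):
    """Copy unique images out of a .docx, named img_001.<ext>, img_002.<ext>, ...

    Returns the list of paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    seen = set()   # MD5 digests of images already written
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in sorted(archive.namelist()):
            # Assumption: embedded images are stored under word/media/.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            data = archive.read(name)
            if len(data) < min_bytes:
                continue  # size filter: skip tiny icons
            digest = hashlib.md5(data).hexdigest()
            if digest in seen:
                continue  # exact duplicate of an image already saved
            seen.add(digest)
            ext = os.path.splitext(name)[1].lower()  # keep the original format
            target = os.path.join(out_dir, "img_{:03d}{}".format(len(written) + 1, ext))
            with open(target, "wb") as fh:
                fh.write(data)
            written.append(target)
    return written
```

Archive members are visited in sorted name order, so the numbering follows file names inside the archive and not the position of each image in the document.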
- Image Compression
Batch resize/compress images for API processing (saves 50-70% on vision API costs).
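
A resize-and-recompress step of this kind could look like the sketch below, which uses Pillow. The names `fit_within` and `compress_image`, the 1024-pixel limit, and the JPEG quality of 80 are assumptions chosen for illustration, not the toolkit's actual settings.

```python
def fit_within(width, height, max_side=1024):
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def compress_image(src, dst, max_side=1024, quality=80):
    """Resize one image and save it as a JPEG at the given quality."""
    from PIL import Image  # Pillow, imported lazily

    with Image.open(src) as img:
        size = fit_within(img.width, img.height, max_side)
        # JPEG has no alpha channel, so convert to RGB first.
        resized = img.convert("RGB").resize(size, Image.LANCZOS)
        resized.save(dst, "JPEG", quality=quality, optimize=True)
```

For batch use, `compress_image` can be called in a loop over a directory of extracted images. Converting to RGB discards transparency, so images that rely on an alpha channel may need different handling.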
Dependencies
- Python 3.6+
- python-docx: for .docx processing
- olefile: for legacy .doc processing
- Pillow: for image resizing (optional, only needed for the resize script)
Install:
pip3 install python-docx olefile Pillow
Use Cases
- Document analysis: Extract text for AI review/summarization
- Migration: Pull content from Word docs into other formats
- Image audit: Extract and review all embedded images
- Cost optimization: Compress images before sending to vision APIs
- Batch processing: Process multiple documents in a pipeline
Notes
- Large .doc files (>200MB) may require significant RAM for olefile processing
- Image extraction preserves the original format (png/jpg/gif/etc.)
- Deduplication catches exact duplicates; near-duplicates still pass through
- CJK (Chinese/Japanese/Korean) text is fully supported in both extractors