docx-pdf-knowledge-parser

v1.0

Parse local `.docx` and `.pdf` files into structured knowledge artifacts with detailed reports, tracking successes, failures, and summaries without auto-writ...


by @kaiasdobi·MIT-0

Document tools · Data analysis · Data visualization · File processing · AI model access


License

MIT-0

Free to use, modify, and redistribute without attribution.


Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install docx-pdf-knowledge-parser

Mirror: npx clawhub@latest install docx-pdf-knowledge-parser --registry https://cn.longxiaskill.com (mirror available)


Skill documentation

name: docx-pdf-knowledge-parser
description: Parse local docx and pdf files into report-first knowledge artifacts. Use when ChatGPT needs to extract text from uploaded or locally available attachments, generate ingest-report.md, kb-items.jsonl, failed-items.jsonl, and MEMORY.candidate.md without directly writing MEMORY.md.

Docx PDF Knowledge Parser

Use this skill to turn local or uploaded .docx and .pdf files into structured, reviewable knowledge outputs.

What this skill does

- Accept local or already-available .docx and .pdf files.
- Classify files into parseable, manual-review, or failed.
- Parse .docx and .pdf in v1.0.
- Produce report-first outputs instead of writing MEMORY.md directly.
- Preserve failures and uncertainty instead of guessing content.

Supported v1.0 scope

Inputs

- Local .docx file path
- Local .pdf file path
- A batch of local .docx and .pdf files in one directory

Parsing

- .docx
- .pdf

Outputs

- ingest-report.md
- kb-items.jsonl
- failed-items.jsonl
- MEMORY.candidate.md

Required behavior

- Only process files that are already available locally or have already been provided to the runtime.
- Do not claim file content was learned unless text was actually extracted.
- Default to report-first. Do not write MEMORY.md in v1.0.
- Record every failed file with a concrete reason.
- Prefer plain-text summaries over complex cards when reporting progress.

File routing rules

Parseable

Treat these as parseable in v1.0:

- .docx
- .pdf

Manual-review

Route here when the file is out of scope or low-confidence in v1.0:

- .pptx
- images
- scans with no extractable text
- archives
- unusual file types

Failed

Route here when the file cannot be opened, parsed, or extracted successfully.
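The routing rules above can be sketched in two steps: a first pass by availability and extension, and a second pass once extraction has been attempted. This is a minimal illustration, not the code shipped with the skill. The function names `initial_route` and `final_route`, and the convention that `None` means extraction failed while an empty string means nothing extractable was found, are assumptions made for this example.

```python
from pathlib import Path
from typing import Optional

# v1.0 scope taken from the routing rules; everything else is out of scope.
PARSEABLE_SUFFIXES = {".docx", ".pdf"}


def initial_route(path: str) -> str:
    """Route a path by availability and extension, before any extraction."""
    p = Path(path)
    if not p.is_file():
        return "failed"  # the file cannot be opened at all
    if p.suffix.lower() in PARSEABLE_SUFFIXES:
        return "parseable"
    return "manual-review"  # e.g. .pptx, images, archives, unusual types


def final_route(route: str, extracted_text: Optional[str]) -> str:
    """Adjust the route after extraction has been attempted.

    Assumed convention: None means extraction raised or could not run,
    an empty or whitespace-only string means no extractable text.
    """
    if route != "parseable":
        return route
    if extracted_text is None:
        return "failed"
    if not extracted_text.strip():
        return "manual-review"  # e.g. a scanned PDF with no text layer
    return "parseable"
```

Splitting the decision this way keeps the extension check cheap and lets a scan with no extractable text move to manual-review only after extraction has actually been tried, so nothing is guessed from the filename.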

Standard workflow

1. Resolve input type.
   - Single file path -> process one file
   - Directory path -> enumerate supported files
2. Create a batch record.
   - Generate batch_id
   - Record started_at
3. Build a manifest.
   - File name
   - File path
   - File type
   - Route decision
4. Attempt extraction.
   - .docx -> use parsers/parse_docx.py
   - .pdf -> use parsers/parse_pdf.py
5. Produce structured outputs.
   - Success -> append to kb-items.jsonl
   - Failure -> append to failed-items.jsonl
6. Summarize the batch.
   - Write ingest-report.md
   - Write MEMORY.candidate.md
7. Finish the batch.
   - Record finished_at
   - Never auto-write MEMORY.md

Output contracts

kb-items.jsonl

Write one JSON object per successfully extracted knowledge item with at least:

- batch_id
- source_file
- source_path
- file_type
- topic
- content_type
- summary
- extracted_at
- confidence

failed-items.jsonl

Write one JSON object per failed file with at least:

- batch_id
- source_file
- source_path
- file_type
- failure_reason
- error_detail
- suggested_action
- failed_at

MEMORY.candidate.md

Include:

- batch header (batch_id, started_at, finished_at, source_directory or source_file)
- grouped knowledge summaries
- source references
- confidence notes
- items needing review

ingest-report.md

Include:

- Batch summary
- Input scope
- File counts and routing counts
- Successful extraction summary
- Failures and risks
- Recommended next actions

Safety rules

- Never invent text that was not extracted.
- If parsing fails, say so plainly and record it.
- Treat filenames as hints only, never as proof of document contents.
- Keep sensitive data out of MEMORY.candidate.md unless the workflow explicitly allows it.

Included files

- run.py: minimal batch runner for local testing
- parsers/parse_docx.py: docx text extraction helper
- parsers/parse_pdf.py: pdf text extraction helper
- references/output_examples.md: sample output shapes and field guidance
- README.md: setup and usage notes
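The two JSONL output contracts described earlier could be produced along the following lines. This is a minimal sketch under stated assumptions, not the code in run.py: the helper names (`new_batch_id`, `append_jsonl`, `success_record`, `failure_record`), the UUID-based batch_id, the ISO 8601 UTC timestamps, and treating confidence as a free-form value are all hypothetical choices made for illustration. Only the field names come from the contracts.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def now_iso() -> str:
    """Current UTC time as an ISO 8601 string (assumed timestamp format)."""
    return datetime.now(timezone.utc).isoformat()


def new_batch_id() -> str:
    """Hypothetical batch_id scheme: a random hex UUID."""
    return uuid.uuid4().hex


def append_jsonl(path, record) -> None:
    """Append one JSON object as a single line, keeping non-ASCII readable."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def _source_fields(batch_id, source_path):
    """Fields shared by success and failure records."""
    p = Path(source_path)
    return {
        "batch_id": batch_id,
        "source_file": p.name,
        "source_path": str(p),
        "file_type": p.suffix.lower().lstrip("."),
    }


def success_record(batch_id, source_path, topic, content_type, summary, confidence):
    """Build one kb-items.jsonl record with the minimum required fields."""
    record = _source_fields(batch_id, source_path)
    record.update({
        "topic": topic,
        "content_type": content_type,
        "summary": summary,
        "extracted_at": now_iso(),
        "confidence": confidence,
    })
    return record


def failure_record(batch_id, source_path, failure_reason, error_detail, suggested_action):
    """Build one failed-items.jsonl record with the minimum required fields."""
    record = _source_fields(batch_id, source_path)
    record.update({
        "failure_reason": failure_reason,
        "error_detail": error_detail,
        "suggested_action": suggested_action,
        "failed_at": now_iso(),
    })
    return record
```

A batch runner would call `append_jsonl("kb-items.jsonl", success_record(...))` for each extracted item and `append_jsonl("failed-items.jsonl", failure_record(...))` for each failure. Appending one line per record means a crash partway through a batch still leaves every earlier result on disk.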
