Bohrium PDF Parser

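Parses PDF documents through open.bohrium.com. Use when: the user asks to extract text, tables, charts, formulas, or molecules from a PDF with Bohrium; the PDF can be submitted by URL or by file upload. Not for: file management, dataset management, or knowledge-base operations.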

by @sorrymaker0624 (Sorrymaker0624)·MIT-0

Tags: document tools, data and APIs, databases, data analysis, data visualization

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies.

Install command

Official: npx clawhub@latest install bohrium-pdf-parser

Mirror (accelerated): npx clawhub@latest install bohrium-pdf-parser --registry https://cn.longxiaskill.com

Skill documentation

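SKILL: Bohrium PDF Parser

Overview

Parses PDF documents with the PDF parsing service at open.bohrium.com. It extracts text, tables, charts, formulas, and molecular structures from PDFs.

Two submission methods:

- URL submission: provide a PDF download link (for example, an arXiv link)
- File upload: upload a local PDF file

There is no CLI support; all operations go through the HTTP API.

Authentication

ACCESS_KEY is read from the OpenClaw configuration file ~/.openclaw/openclaw.json:

    "bohrium-pdf-parser": {
      "enabled": true,
      "apiKey": "YOUR_ACCESS_KEY",
      "env": {
        "ACCESS_KEY": "YOUR_ACCESS_KEY"
      }
    }

OpenClaw automatically injects env.ACCESS_KEY into the runtime.

Common code template

    import os, time, requests

    AK = os.environ.get("ACCESS_KEY", "")
    BASE = "https://open.bohrium.com/openapi/v1/parse"
    HEADERS = {"accessKey": AK}
    # Auth header plus JSON content type, for requests with a JSON body
    HEADERS_JSON = {**HEADERS, "Content-Type": "application/json"}

Parsing workflow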

1. Submit the PDF (by URL or file upload) → receive a token

2. Query the result with the token → status == "success" when parsing is complete

Synchronous mode (sync=true) blocks until parsing finishes, but the response does not include the content; you still need to call get-result to retrieve it. Asynchronous mode (sync=false, the default) requires polling get-result until the status is success.

URL submission

    r = requests.post(f"{BASE}/trigger-url-async", headers=HEADERS_JSON, json={
        "url": "https://arxiv.org/pdf/2107.06922",
        "sync": False,
        "textual": True,
        "table": True,
        "molecule": True,
        "chart": True,
        "figure": False,
        "expression": True,
        "equation": True,
        "pages": [0],      # 0-indexed; omit to parse all pages
        "timeout": 1800
    })
    data = r.json()
    token = data["token"]
    print(f"Token: {token}, Status: {data['status']}")
    # Token: 57d12c5a-..., Status: undefined

Response fields:

Field          Description
token          Task identifier, used to query the result
status         Initially undefined
created_time   Creation time
time_dict      Time spent in each stage (currently only download_pdf)

File upload

    from pathlib import Path

    pdf_path = Path("./paper.pdf")
    with open(pdf_path, "rb") as f:
        r = requests.post(f"{BASE}/trigger-file-async",
            headers=HEADERS,  # no Content-Type needed; requests sets the multipart header itself
            files={"file": (pdf_path.name, f, "application/pdf")},
            data={
                "sync": "false",
                "textual": "true",
                "table": "true",
                "molecule": "true",
                "chart": "true",
                "figure": "false",
                "expression": "true",
                "equation": "true",
                "pages": 0,        # multipart accepts only a single integer
                "timeout": 1800
            })
    token = r.json()["token"]

Important: in multipart/form-data, pages accepts only a single integer (for example 0), not a JSON array such as [0]; an array causes an int_parsing error. In a JSON request body, arrays such as [0, 1, 2] are supported.

Querying the parse result

    r = requests.post(f"{BASE}/get-result", headers=HEADERS_JSON, json={
        "token": token,
        "content": True,      # return the extracted text
        "objects": False,     # return extracted objects (tables, images, etc.)
        "pages_dict": True    # return per-page results
    })
    data = r.json()
    print(f"Status: {data['status']}, Content length: {len(data.get('content', ''))}")

Response fields:

Field                          Description
status                         success / undefined (still processing) / failed
token                          Task identifier
content                        Extracted text (LaTeX markup format)
pages_dict                     Dictionary of per-page results
lang                           Detected language (en, zh, etc.)
proc_page / total_page         Pages processed / total pages
proc_textual / total_textual   Text blocks processed / total text blocks
proc_table / total_table       Tables processed / total tables
proc_mol / total_mol           Molecules processed / total molecules
proc_equa / total_equa         Equations processed / total equations
time_dict                      Detailed timing for each stage
cost                           Cost

Full async polling example

    import os, time, requests

    AK = os.environ.get("ACCESS_KEY", "")
    BASE = "https://open.bohrium.com/openapi/v1/parse"
    HEADERS = {"accessKey": AK}
    HEADERS_JSON = {**HEADERS, "Content-Type": "application/json"}

    # 1. Submit
    r = requests.post(f"{BASE}/trigger-url-async", headers=HEADERS_JSON, json={
        "url": "https://arxiv.org/pdf/2107.06922",
        "sync": False,
        "textual": True,
        "table": True,
        "molecule": False,
        "chart": False,
        "figure": False,
        "expression": True,
        "equation": True,
        "pages": [0],
        "timeout": 1800
    })
    submit = r.json()
    if submit.get("code"):
        # A non-empty "code" field signals that the submission was rejected
        print(f"Submit failed: {submit.get('message')}")
        raise SystemExit(1)
    token = submit["token"]
    print(f"Submitted, token={token}")

    # 2. Poll for the result (up to 30 attempts, 2 seconds apart)
    for attempt in range(30):
        time.sleep(2)
        r = requests.post(f"{BASE}/get-result", headers=HEADERS_JSON, json={
            "token": token,
            "content": True,
            "objects": False,
            "pages_dict": False
        })
        result = r.json()
        status = result.get("status", "")
        print(f"  [{attempt+1}] status={status}")
        if status == "success":
            print(f"Done! Content length: {len(result.get('content', ''))}")
            break  # stop polling once the result is ready
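
The polling loop above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of the skill itself: the names wait_for_result and ParseFailed are hypothetical, and it assumes that any status other than success or undefined is a terminal failure, which the field table suggests but does not spell out. The HTTP call is passed in as a function so the helper can be exercised without network access.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ParseFailed(RuntimeError):
    """Hypothetical error: the service reported a terminal, non-success status."""


def wait_for_result(
    fetch: Callable[[], Dict[str, Any]],
    max_attempts: int = 30,
    delay: float = 2.0,
    pending: Tuple[str, ...] = ("undefined", ""),
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch() until it reports status == "success".

    fetch is any zero-argument callable returning the decoded get-result
    JSON. Statuses listed in `pending` mean "keep polling"; treating every
    other non-success status as a failure is an assumption of this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        status = result.get("status") or ""
        if status == "success":
            return result
        if status not in pending:
            raise ParseFailed(f"parse ended with status={status!r}")
        if attempt < max_attempts:
            sleep(delay)  # wait before the next attempt
    raise TimeoutError(f"no result after {max_attempts} attempts")
```

With the template variables defined earlier, a call might look like wait_for_result(lambda: requests.post(f"{BASE}/get-result", headers=HEADERS_JSON, json={"token": token, "content": True}).json()). Injecting fetch and sleep keeps the retry logic separate from the HTTP details and makes it easy to test with canned responses.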
