Html2md
v1.0.1
Convert HTML pages to clean, agent-friendly markdown using Readability + Turndown. Strips navigation, ads, footers, cookie banners, and social CTAs. Supports URL fetch, local files, stdin, token budgeting, and output flags. Ideal for research tasks, content extraction, and web scraping in agent workflows.
html2md
Aggressive HTML-to-markdown converter for AI agents. Mozilla Readability isolates the main content, Turndown converts it to markdown, then heavy post-processing strips the remaining noise.
Full flag reference and advanced examples: references/usage.md
Setup
cd <skill-dir>/scripts
npm install
npm link # makes html2md globally available
Requires Node.js 22+.
Quick Start
html2md https://example.com # fetch + convert
html2md --file page.html # local HTML file
cat page.html | html2md --stdin # pipe from stdin
html2md --max-tokens 2000 https://example.com # budget-aware truncation
html2md --no-links https://example.com # strip hrefs, keep text
html2md --json https://example.com # JSON: {title, url, markdown, tokens}
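For programmatic use, the --json mode can be consumed from a Node.js script. The following is a minimal sketch, not part of html2md itself: the helper names parseHtml2mdJson and fetchAsMarkdown, the field types, and the 30-second wrapper timeout are all assumptions made for illustration. Only the key names {title, url, markdown, tokens} and the advice to use execFileSync come from this document.

```typescript
import { execFileSync } from "node:child_process";

// Keys as listed for --json output: {title, url, markdown, tokens}.
// The field types are assumed; the document only names the keys.
interface Html2mdResult {
  title: string;
  url: string;
  markdown: string;
  tokens: number;
}

// Hypothetical helper: parse and minimally validate the text printed by
// `html2md --json`.
function parseHtml2mdJson(stdout: string): Html2mdResult {
  const data: unknown = JSON.parse(stdout);
  if (typeof data !== "object" || data === null) {
    throw new Error("html2md --json did not return an object");
  }
  const rec = data as Record<string, unknown>;
  if (typeof rec.markdown !== "string" || typeof rec.tokens !== "number") {
    throw new Error("html2md --json output is missing markdown or tokens");
  }
  return {
    title: typeof rec.title === "string" ? rec.title : "",
    url: typeof rec.url === "string" ? rec.url : "",
    markdown: rec.markdown,
    tokens: rec.tokens,
  };
}

// Hypothetical wrapper: run html2md without a shell. Arguments are passed as
// an array, so the URL is never interpreted by a shell. execFileSync throws
// if the process exits with a non-zero code (html2md uses exit code 1).
function fetchAsMarkdown(url: string): Html2mdResult {
  const stdout = execFileSync("html2md", ["--json", url], {
    encoding: "utf8",
    timeout: 30_000, // assumed wrapper-level limit, not an html2md setting
  });
  return parseHtml2mdJson(stdout);
}
```

A caller might check result.tokens before deciding whether to pass result.markdown on to a model, which mirrors the jq-based token check in the Examples section.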
Key Features
- Readability extraction: kills navbars, sidebars, ads, and cookie banners. Falls back to cleaned HTML when Readability returns too little (e.g. HN's table layout).
- Token budgeting: --max-tokens N keeps all headings, fills the remaining budget in document order, and appends [truncated — N more tokens]. Uses a 1 token ≈ 4 chars heuristic.
- Post-processing: strips HTML comments, zero-width chars, social CTAs, breadcrumbs, and empty headings; collapses excess blank lines.
- Error handling: bad URLs, timeouts (15s), non-HTML content, and missing files all exit with code 1 and a descriptive message on stderr.
- Output modes: plain markdown, or --json for programmatic use.
When to Use vs web_fetch
| Use html2md when | Use web_fetch when |
| --- | --- |
| Reading pages in cron jobs / sub-agents | Quick one-off fetch in main session |
| Token budget matters (--max-tokens) | Page is a JSON/XML API endpoint |
| Heavy nav/ads/footers to strip | JS rendering not needed |
| Need JSON output | Simple pages |
Security Considerations
html2md fetches URLs and reads local files; that is its job. If you're passing untrusted input:
- URL fetching: the tool will fetch whatever URL it's given. Don't pass user-controlled URLs without validation if your threat model includes SSRF.
- File reading: --file reads any path the process can access. In agent workflows, the agent controls the path, which is equivalent to the agent using cat.
- No shell execution: the tool itself never spawns shells or runs commands. When calling from scripts, use execFileSync (not execSync) to avoid shell injection.
- No data exfiltration: output goes to stdout only. No network requests beyond the single URL fetch. No telemetry, no analytics, no phone-home.
- Dependencies: jsdom (pure-JavaScript DOM implementation), Readability (Mozilla content extractor), Turndown (HTML→markdown). All are widely used, open source libraries.
Examples
# Read a Paul Graham essay within 2000 tokens
html2md --max-tokens 2000 https://paulgraham.com/greatwork.html
# HN front page as clean text, no link noise
html2md --no-links --no-images https://news.ycombinator.com
# Get token count before committing
html2md --json https://example.com | jq .tokens
# Pipe to file
html2md https://docs.example.com/api > api-docs.md
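
The token budgeting described under Key Features (keep all headings, fill the remaining budget in document order, append a truncation marker, estimate 1 token ≈ 4 characters) can be illustrated with a short sketch. This is not html2md's actual implementation. The function names, the choice to split the markdown on blank lines, and the decision to stop filling at the first block that does not fit are assumptions made for illustration.

```typescript
// Rough estimate stated in Key Features: 1 token ≈ 4 characters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical sketch of heading-preserving truncation.
function truncateToBudget(markdown: string, maxTokens: number): string {
  // Assumption: blocks are separated by one or more blank lines.
  const blocks = markdown.split(/\n{2,}/).filter((b) => b.trim() !== "");
  const isHeading = (b: string): boolean => /^#{1,6}\s/.test(b);

  // Headings are always kept, so charge them against the budget first.
  let remaining = maxTokens;
  for (const b of blocks) {
    if (isHeading(b)) remaining -= estimateTokens(b);
  }

  const kept: string[] = [];
  let dropped = 0;
  let full = false;
  for (const b of blocks) {
    if (isHeading(b)) {
      kept.push(b);
      continue;
    }
    const cost = estimateTokens(b);
    if (!full && cost <= remaining) {
      kept.push(b);
      remaining -= cost;
    } else {
      // Assumption: once one block does not fit, later body blocks are
      // dropped too, so the kept text stays a prefix of each section flow.
      full = true;
      dropped += cost;
    }
  }

  if (dropped > 0) kept.push(`[truncated — ${dropped} more tokens]`);
  return kept.join("\n\n");
}
```

Because the estimate is character-based, the reported counts are approximate; a real tokenizer may disagree, particularly on code or non-English text.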