wechat-article-extraction-mp-weixin-qq-com news-webpage-cleaning blog-post-parsing metadata-extraction-title-author-date multiple-output-formats-markdown-json-plain-text batch-processing-support

v1.0.0

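Built on a three-engine design, this skill extracts clean content from WeChat articles, news pages, and blog posts. It supports title, author, and date metadata, multiple output formats, and batch processing.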


by @3511815125 (Yu Jia Li)·MIT-0

Categories: WeChat ecosystem, instant messaging, Tencent ecosystem, data and APIs

Use cases: sending WeChat messages, managing WeChat contacts, WeChat Pay, WeChat bots, QQ messages, QQ bots


License

MIT-0

Free to use, modify, and redistribute, with no attribution required.


Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install web-fetch-vx

Mirror (faster): npx clawhub@latest install web-fetch-vx --registry https://cn.longxiaskill.com

Localization notes

Installation: openclaw skills install web-fetch-vx

This skill is used for WeChat- and QQ-related operations and may require a corresponding platform account or API key.


Skill documentation

Web Content Extractor

Version: 2.0
Author: OpenClaw Team
Updated: 2026-03-15
License: MIT

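📦 Skill metadata

    name: web-content-extractor
    version: 2.0.0
    description: Extracts clean content from WeChat articles, blogs, and news pages, removing ads and sidebars
    category: Content processing
    tags: [web extraction, content cleaning, WeChat articles, Markdown]
    author: OpenClaw Team
    license: MIT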

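🎯 Overview

A web content extraction tool built on three engines (Readability, Firecrawl, and Defuddle) and optimized for Chinese-language content. It supports WeChat articles, news sites, blogs, and other sources, automatically strips ads, navigation, and sidebars, and outputs clean Markdown.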

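Core capabilities:

✅ WeChat article extraction (mp.weixin.qq.com)
✅ News page cleaning
✅ Blog post parsing
✅ Metadata extraction (title/author/date)
✅ Multiple output formats (Markdown/JSON/plain text)
✅ Batch processing support

🚀 Quick start

Basic call

    # OpenClaw tool call
    result = web_fetch(
        url="https://mp.weixin.qq.com/s/xxx",
        extractMode="markdown",
        maxChars=8000
    )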

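Full parameter list

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | str | ✅ | - | Page URL |
| extractMode | str | ❌ | "markdown" | Output format (markdown/text/json) |
| maxChars | int | ❌ | 8000 | Maximum number of characters |
| includeMetadata | bool | ❌ | true | Whether to include metadata |
| timeout | int | ❌ | 30 | Timeout (seconds) |

📤 Input and output

Input example

    {
      "url": "https://mp.weixin.qq.com/s/abcdefg",
      "extractMode": "markdown",
      "maxChars": 8000,
      "includeMetadata": true
    }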
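The parameter table above can be mirrored in a small request builder that fills in the documented defaults and rejects obviously invalid values before a call is made. This is a minimal sketch, not part of the skill: `FetchRequest` and `build_payload` are hypothetical names, the field names assume the un-localized spellings (`extractMode`, `maxChars`), and the validation rules are assumptions rather than documented behavior.

```python
from dataclasses import dataclass, asdict

# Output formats listed in the parameter table.
ALLOWED_MODES = {"markdown", "text", "json"}


@dataclass
class FetchRequest:
    """Hypothetical request model; defaults copied from the parameter table."""
    url: str
    extractMode: str = "markdown"
    maxChars: int = 8000
    includeMetadata: bool = True
    timeout: int = 30

    def __post_init__(self):
        # These checks are assumptions, not documented behavior of the tool.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"url must be http(s): {self.url!r}")
        if self.extractMode not in ALLOWED_MODES:
            raise ValueError(f"unsupported extractMode: {self.extractMode!r}")
        if self.maxChars <= 0 or self.timeout <= 0:
            raise ValueError("maxChars and timeout must be positive")


def build_payload(**kwargs) -> dict:
    """Return a plain dict with defaults applied, ready to serialize as JSON."""
    return asdict(FetchRequest(**kwargs))


if __name__ == "__main__":
    print(build_payload(url="https://mp.weixin.qq.com/s/abcdefg"))
```

Validating locally keeps malformed requests from consuming a slot under the rate limits described later in this document.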

Output example

    {
      "success": true,
      "url": "https://mp.weixin.qq.com/s/abcdefg",
      "title": "Article title",
      "author": "Author name",
      "publishDate": "2026-03-15",
      "content": "Body content in Markdown format...",
      "wordCount": 2500,
      "readTime": "10 minutes",
      "images": ["https://..."],
      "extractTime": 0.8
    }
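A response shaped like the output example can be parsed into a typed object before further processing. The sketch below is illustrative only: `Article`, `parse_result`, and `read_minutes` are hypothetical names, the status key is assumed to be spelled `success`, and the 250 words-per-minute rate is inferred from the example (2500 words shown as 10 minutes), not from any documented formula.

```python
import json
import math
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical container for the fields shown in the output example."""
    url: str
    title: str
    author: str
    publish_date: str
    content: str
    word_count: int
    images: list = field(default_factory=list)


def parse_result(raw: str) -> Article:
    """Parse a JSON response string; raise if extraction is reported as failed."""
    data = json.loads(raw)
    if not data.get("success"):  # key name is an assumption
        raise RuntimeError(f"extraction failed for {data.get('url')}")
    return Article(
        url=data["url"],
        title=data.get("title", ""),
        author=data.get("author", ""),
        publish_date=data.get("publishDate", ""),
        content=data.get("content", ""),
        word_count=int(data.get("wordCount", 0)),
        images=list(data.get("images", [])),
    )


def read_minutes(word_count: int, words_per_minute: int = 250) -> int:
    """Estimate reading time, rounded up; the rate is inferred, not documented."""
    return math.ceil(word_count / words_per_minute)
```

Using `.get` with defaults for the metadata fields keeps the parser usable when `includeMetadata` is false and those keys are absent.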

🔧 Technical architecture

Three-engine design

                         User request
                               ↓
                     ┌───────────────────┐
                     │   Routing layer   │
                     └───────────────────┘
                               ↓
            ┌──────────────────┼──────────────────┐
            ↓                  ↓                  ↓
    ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
    │   web_fetch   │  │   defuddle    │  │    browser    │
    │    (fast)     │  │ (specialized) │  │  (fallback)   │
    └───────────────┘  └───────────────┘  └───────────────┘
            ↓                  ↓                  ↓
                     ┌───────────────────┐
                     │ Aggregation layer │
                     └───────────────────┘
                               ↓
                        Return to user
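One simple way to read the diagram is as a fallback chain: try the fast engine first, then the specialized one, and only then the slow browser engine. The document does not describe how the routing and aggregation layers actually work, so the following is a sketch under that assumption. `extract_with_fallback` is a hypothetical helper, and each engine is modeled as a plain callable that takes a URL and returns text.

```python
from typing import Callable, List, Optional, Tuple

# An engine is any callable that takes a URL and returns extracted text (or None).
Engine = Callable[[str], Optional[str]]


def extract_with_fallback(
    url: str,
    engines: List[Tuple[str, Engine]],
    min_chars: int = 1,
) -> Tuple[str, str]:
    """Try engines in order; return (engine_name, content) from the first success."""
    errors = []
    for name, engine in engines:
        try:
            content = engine(url)
        except Exception as exc:  # one failing engine must not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if content and len(content.strip()) >= min_chars:
            return name, content
        errors.append(f"{name}: empty result")
    raise RuntimeError("all engines failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Stub engines standing in for the three boxes in the diagram.
    def fast(url):
        return ""  # e.g. a JavaScript-rendered page yields nothing

    def specialized(url):
        raise TimeoutError("timed out")

    def browser(url):
        return "# Rendered article"

    chain = [("web_fetch", fast), ("defuddle", specialized), ("browser", browser)]
    print(extract_with_fallback("https://example.com/article", chain))
```

Ordering the chain from fastest to slowest means the 5-10 second browser engine is only paid for when the sub-second engines return nothing usable.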

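Engine comparison

| Engine | Speed | Success rate | Best for |
|---|---|---|---|
| web_fetch | <1s | 70% | WeChat articles / general web pages |
| defuddle | <1s | 75% | Blogs / news sites |
| browser | 5-10s | 90% | Complex SPAs / dynamic pages |

📋 Usage scenarios

Scenario 1: WeChat article extraction

    result = web_fetch(
        url="https://mp.weixin.qq.com/s/xxx",
        extractMode="markdown"
    )
    print(result["content"])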

Scenario 2: Batch processing

    urls = ["url1", "url2", "url3"]
    results = [web_fetch(url=u) for u in urls]
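The list comprehension above sends requests back to back and stops at the first exception. The rate-limit table later in this document lists a 2-second interval and a concurrency of 1 for WeChat articles, so a batch helper may want to pace requests per host and record failures instead of aborting. This is a minimal sketch: `fetch_batch` is a hypothetical helper, `fetch` stands in for whatever callable performs the extraction, and the 1-second default for other hosts is an assumption based on the news and blog rows of that table.

```python
import time
from urllib.parse import urlparse

# Minimum seconds between requests to the same host.
# The WeChat value comes from the rate-limit table; the default is an assumption.
INTERVALS = {"mp.weixin.qq.com": 2.0}
DEFAULT_INTERVAL = 1.0


def fetch_batch(urls, fetch, sleep=time.sleep, clock=time.monotonic):
    """Fetch URLs one at a time, pacing requests per host and collecting errors."""
    last_call = {}  # host -> time the previous request to it was started
    results = []
    for url in urls:
        host = urlparse(url).netloc
        interval = INTERVALS.get(host, DEFAULT_INTERVAL)
        if host in last_call:
            wait = interval - (clock() - last_call[host])
            if wait > 0:
                sleep(wait)
        last_call[host] = clock()
        try:
            results.append({"url": url, "ok": True, "result": fetch(url)})
        except Exception as exc:  # keep going; report the failure per URL
            results.append({"url": url, "ok": False, "error": str(exc)})
    return results
```

Processing sequentially satisfies the strictest concurrency limit in the table (1 for WeChat) at the cost of throughput on hosts that allow more parallelism. The `sleep` and `clock` parameters exist so the pacing can be exercised without real delays.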

Scenario 3: Extraction with metadata

    result = web_fetch(
        url="https://example.com/article",
        includeMetadata=True
    )
    print(f"Title: {result['title']}")
    print(f"Author: {result['author']}")
    print(f"Word count: {result['wordCount']}")

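⚠️ Limitations and notes

Unsupported scenarios

❌ Pages that require login
❌ Paywalled content
❌ CAPTCHA-protected pages
❌ SPAs rendered purely with JavaScript (use the browser engine instead)

Rate limits

| Domain type | Request interval | Concurrency limit |
|---|---|---|
| WeChat articles | 2 s | 1 |
| News sites | 1 s | 3 |
| Blogs | 1 s | 5 |

Compliance requirements

- Extract only publicly accessible content
- Respect the robots.txt protocol
- Do not use for commercial purposes (unless authorized)
- Preserve the original author's attribution

🎛️ Advanced configuration

Custom User-Agent

    result = web_fetch(
        url="https://example.com",
        userAgent="Mozilla/5.0 ..."
    )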

Proxy configuration

    result = web_fetch(
        url="https://example.com",
        proxy="http://proxy:port"
    )

Cache control

    # Enable caching (1 hour)
    result = web_fetch(url, cache=True, ttl=3600)

    # Force a refresh
    result = web_fetch(url, cache=False)
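The cache options above follow a common time-to-live pattern: a stored result is reused until it is older than `ttl` seconds, and disabling the cache forces a new fetch. How the skill implements this internally is not documented, so the following in-memory sketch only illustrates the semantics. `TTLCache` and `get_or_fetch` are hypothetical names, and storing the fresh result after a forced refresh is an assumption.

```python
import time


class TTLCache:
    """Tiny in-memory cache keyed by URL with a fixed time-to-live in seconds."""

    def __init__(self, ttl=3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # url -> (stored_at, value)

    def get_or_fetch(self, url, fetch, use_cache=True):
        """Return a fresh cached value, otherwise call fetch(url) and store it."""
        now = self.clock()
        entry = self._store.get(url)
        if use_cache and entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = fetch(url)
        # A forced refresh also replaces the stored entry (an assumption).
        self._store[url] = (now, value)
        return value
```

A monotonic clock is used so that expiry is not affected by system clock adjustments. The cache is unbounded, which is acceptable for a short batch run but not for a long-lived process.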

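📊 Performance metrics

| Metric | Value |
|---|---|
| Average response time | 0.8 s |
| P95 response time | 2.5 s |
| Success rate | 85% |
| Cache hit rate | 60% |

🔍 Troubleshooting

Problem 1: Extracted content is empty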

Cause: The page requires JavaScript rendering.
Fix: Switch to the browser engine.

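Problem 2: WeChat article extraction fails

Cause: The link has expired, or anti-scraping measures are in place.
Fix:

1. Check that the link is still valid
2. Try the browser engine
3. Copy the content manually

Problem 3: Extracted content is incomplete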

Cause: The maxChars limit truncated the content.
Fix: Increase the maxChars parameter, or process the content in pages.
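The parameter table documents no offset or page parameter, so it is unclear how paged processing is meant to work inside the tool. The sketch below assumes the full text is already available (for example, after raising maxChars) and only needs to be handed downstream in pieces no larger than a limit. `split_into_pages` is a hypothetical helper that breaks on blank lines so Markdown paragraphs stay intact where possible.

```python
from typing import List


def split_into_pages(text: str, max_chars: int = 8000) -> List[str]:
    """Split text into pages of at most max_chars, preferring paragraph breaks."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    pages: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        # A single paragraph longer than the limit is hard-split.
        while len(para) > max_chars:
            if current:
                pages.append(current)
                current = ""
            pages.append(para[:max_chars])
            para = para[max_chars:]
        if not para:
            continue  # skip empty paragraphs (e.g. runs of blank lines)
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            pages.append(current)
            current = para
    if current:
        pages.append(current)
    return pages
```

Each returned page fits within the limit, and joining the pages with blank lines reproduces the paragraphs in order except where a hard split was needed.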

📚 Dependencies

    {
      "readability": "^0.4.4",
      "firecrawl": "^1.0.0",
      "defuddle": "^3.0.0"
    }

🤝 Contributing

1. Fork this repository
2. Create a feature branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

📄 License

MIT License - see LICENSE

📞 Support

Documentation: https://docs.openclaw.ai/skills/web-content-extractor
Issue tracker: https://github.com/OpenClaw/OpenClaw/issues
Community: https://discord.com/invite/clawd

Last updated: 2026-03-15
Maintenance status: ✅ Actively maintained
