test_skill: BBC News web crawler (multi-site support, deduplication)

Author: felixopt17

v1.0.9

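A powerful, general-purpose web crawler optimized for BBC News that can also crawl other news sites. It integrates Crawl4AI and Playwright to handle dynamic content and anti-bot measures, and supports multi-method extraction, hierarchical storage, local image archiving, and content filtering.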

Tags: browser automation, API tools, testing tools, documentation tools, AI model access

License

MIT-0

Last updated

2026/3/11

Security scan

VirusTotal: benign

OpenClaw: safe (medium confidence)

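The skill's files, install/run instructions, and requested resources are consistent with a web crawler for BBC and other news sites. Nothing in the package indicates covert data exfiltration or unrelated permission requests, but you should review the third-party dependencies and run the installation in an isolated environment.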

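Assessment recommendations

The package appears to be a coherent web crawler. Before installing or running it: 1) run pip and the Playwright browser install in a virtual environment or sandbox (not as root) to avoid system package conflicts; 2) review the requirements (especially 'crawl4ai') and verify their provenance and any credentials they may need; 3) follow legal and ethical rules: respect robots.txt and site terms, and avoid aggressive crawling by using delays and domain restrictions; 4) if you need higher assurance, inspect the full universal_crawler_v2.py (the provided file was truncated) and run the code in an isolated network environment to observe the outbound connections its dependencies make. ...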
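
Recommendation 3 (respect robots.txt, restrict domains) can be illustrated with Python's standard library. The following is a minimal sketch, not the crawler's actual logic: `ALLOWED_DOMAINS`, `USER_AGENT`, and the sample robots.txt text are hypothetical values chosen for the example, and the robots.txt content is assumed to have been downloaded already.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical policy values, used only for this illustration.
ALLOWED_DOMAINS = {"www.bbc.co.uk", "bbc.co.uk"}
USER_AGENT = "example-crawler"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(url: str, robots: RobotFileParser) -> bool:
    """Allow a URL only if it is on an allowed domain and robots.txt permits it."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_DOMAINS:
        return False
    return robots.can_fetch(USER_AGENT, url)


# Made-up robots.txt content for demonstration purposes.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""
robots = build_robots(SAMPLE_ROBOTS)
```

A crawler would call `may_fetch` before every request and skip any URL for which it returns `False`.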

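Detailed analysis

✓ Purpose and capabilities

The name and description (a BBC-focused general-purpose crawler with anti-bot fallbacks) match the included code and scripts: a multi-method crawler (crawl4ai, playwright, requests) with deduplication, image download, and Markdown output. One minor inconsistency, the README mentioning Python 3.8+ while SKILL.md says 3.9+, does not change the purpose.

✓ Instruction scope

SKILL.md only instructs the user to install the Python dependencies and run the crawler with CLI flags. It does not instruct reading unrelated local files or environment secrets, nor sending collected data to unexpected endpoints (the code crawls the target sites and writes to local files). The crawler is expected to make network requests to the target sites.

ℹ Install mechanism

The registry declares no platform install spec, but the repository includes install.py / install_dependencies.sh, which run pip install -r requirements.txt and 'python -m playwright install chromium'. Dependencies are fetched through pip and Playwright's browser installer (standard mechanisms). Note: crawl4ai is a third-party package (no pinned source), and Playwright downloads browser binaries from the web; verify the packages and run the installation in an isolated environment.

✓ Credential requirements

The skill declares no required environment variables, credentials, or configuration paths. The code does not read secrets or request unrelated credentials. Dependencies may later require credentials (for example, if certain optional third-party services are used), so check the upstream package documentation.

✓ Persistence and privileges

The skill is not always-on and does not request elevated platform privileges. It writes lock files and output data only under its working directory. It makes no modifications to other skills or global agent settings.

Security is layered; review the code before running it.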

License terms (MIT-0): free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

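Version

latest: v1.0.9 (2026/3/11)

bbccrawlermaxclaw version 1.0.9
- No file or documentation changes detected in this version.
- Features, usage, and instructions are unchanged from the previous version.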

Install commands

Official:
npx clawhub@latest install bbccrawlermaxclaw

Mirror:
npx clawhub@latest install bbccrawlermaxclaw --registry https://cn.clawhub-mirror.com

Skill documentation

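Description

A powerful, universal web crawler optimized for BBC News but capable of crawling other news sites. It integrates advanced scraping technologies, including Crawl4AI and Playwright, to handle dynamic content and anti-bot protections.

Features

Multi-method extraction:

- crawl4ai: primary method, using AsyncWebCrawler for high performance and accuracy.
- playwright: full browser rendering fallback for complex dynamic pages.
- requests: fast fallback for static content.
- auto: automatically detects the best method (prioritizes Crawl4AI).

Hierarchical storage: saves content in a structured layout, YYYY-MM-DD/Category/Title.md.
Local image archiving: downloads images locally, names them by MD5 hash, and updates the Markdown references.
Content filtering: extracts the main article content and relevant images using CSS selectors.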
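
The storage layout and image naming described above can be sketched as follows. This is an illustration under stated assumptions, not the crawler's own code: the helper names (`slugify`, `article_path`, `image_name`, `localize_images`), the `images` directory, the `.jpg` default extension, and the choice to hash the image URL rather than the image bytes are all hypothetical.

```python
import hashlib
import re
from datetime import date
from pathlib import PurePosixPath


def slugify(text: str, max_len: int = 80) -> str:
    """Replace characters that are unsafe in file names and trim the result."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", text.strip())
    return cleaned.strip("_")[:max_len] or "untitled"


def article_path(published: date, category: str, title: str) -> PurePosixPath:
    """Build YYYY-MM-DD/Category/Title.md relative to the output directory."""
    return PurePosixPath(published.isoformat(), slugify(category), slugify(title) + ".md")


def image_name(image_url: str, extension: str = ".jpg") -> str:
    """Name an image by the MD5 hex digest of its URL (assumption: URL, not bytes)."""
    digest = hashlib.md5(image_url.encode("utf-8")).hexdigest()
    return digest + extension


def localize_images(markdown: str, image_dir: str = "images") -> str:
    """Rewrite remote Markdown image links to point at the local archive."""
    pattern = re.compile(r"!\[([^\]]*)\]\((https?://[^)\s]+)\)")

    def repl(match: re.Match) -> str:
        return f"![{match.group(1)}]({image_dir}/{image_name(match.group(2))})"

    return pattern.sub(repl, markdown)
```

Naming files by hash means the same image referenced from several articles maps to a single local file, which also acts as a simple form of deduplication.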

Requirements

Python 3.9+
See requirements.txt for the Python packages.

Installation

# 1. Install dependencies
# Note: install.py supports passing arguments through to pip, e.g. --break-system-packages
python install.py
# Example: for environments that require --break-system-packages
python install.py --break-system-packages
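
After installation, a quick preflight check can confirm that the packages are importable and that the interpreter is running inside a virtual environment. The sketch below is not part of the skill; `in_virtualenv`, `find_missing`, and the `REQUIRED` list are hypothetical, with the module names taken from the three extraction methods named on this page.

```python
import importlib.util
import sys

# Import names of the three extraction backends mentioned in this document.
REQUIRED = ["crawl4ai", "playwright", "requests"]


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from base prefix)."""
    return sys.prefix != sys.base_prefix


def find_missing(modules):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not in_virtualenv():
        print("Warning: not running inside a virtual environment.")
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If any module is reported missing, rerunning python install.py inside the same environment is the first thing to try.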

Usage

Basic usage

python universal_crawler_v2.py --url https://www.bbc.co.uk/news --max-pages 50

Advanced usage

# Force Crawl4AI
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --method crawl4ai
# Force Playwright
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --method playwright
# Control depth and delay
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --depth 3 --delay 2.5
# Specify the output directory
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --output ./my_data
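
To make the interaction of --depth, --delay, and --max-pages concrete, here is a minimal breadth-first sketch with URL deduplication. It is an assumption about how such flags typically behave, not the logic of universal_crawler_v2.py: the `crawl` function and the `get_links` callback are hypothetical, and the start page is assumed to count as depth 0.

```python
import time
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 3,
    max_pages: int = 50,
    delay: float = 0.0,
) -> List[str]:
    """Visit pages breadth-first, skipping URLs that were already queued."""
    start = urldefrag(start_url).url  # drop "#fragment" so duplicates collapse
    seen = {start}
    queue = deque([(start, 0)])
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth < max_depth:  # only expand links while below the depth limit
            for link in get_links(url):
                link = urldefrag(link).url
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        if delay and queue:
            time.sleep(delay)  # polite pause between requests
    return visited


# A made-up link graph standing in for real pages.
SITE = {"a": ["b", "c", "a#top"], "b": ["c", "d"], "c": [], "d": ["e"]}


def site_links(url: str):
    return SITE.get(url, [])
```

Passing the link extractor as a callback keeps the traversal independent of the fetch method, so the same loop works whichever backend produces the links.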

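Troubleshooting

Import errors: if you see "No module named 'crawl4ai'" or a similar error, run python install.py again.
Empty responses: make sure you are running the latest version of the crawler. Some sites block specific IPs or user agents; try increasing the delay or switching methods.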
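
Switching methods on an empty response can also be automated. The sketch below shows one way a fallback chain might look; it is not the skill's implementation. `fetch_with_fallback` and the `Fetcher` type are hypothetical, the fetchers are stand-ins rather than real Crawl4AI or Playwright calls, and the order after crawl4ai is an assumption (the documentation only states that auto mode prioritizes Crawl4AI).

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# A fetcher takes a URL and returns page content (empty string on failure).
Fetcher = Callable[[str], str]


def fetch_with_fallback(
    url: str,
    fetchers: Dict[str, Fetcher],
    order: Sequence[str] = ("crawl4ai", "playwright", "requests"),
) -> Tuple[Optional[str], str]:
    """Try each method in order; return (method name, content) for the first non-empty result."""
    for name in order:
        fetcher = fetchers.get(name)
        if fetcher is None:
            continue  # method not installed or not configured
        try:
            content = fetcher(url)
        except Exception:
            continue  # treat any backend error as "try the next method"
        if content and content.strip():
            return name, content
    return None, ""


# Stand-in fetchers that simulate an empty response, an error, and a success.
def empty_fetcher(url: str) -> str:
    return ""


def failing_fetcher(url: str) -> str:
    raise RuntimeError("simulated backend failure")


def working_fetcher(url: str) -> str:
    return "<html>ok</html>"
```

Returning the method name alongside the content makes it easy to log which backend succeeded for a given site.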

Data source: ClawHub · Chinese localization: 龙虾技能库
