📦 Scrapling MCP: Web Scraping

v1.2.0

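An advanced web scraping skill built on Scrapling. It provides guidance on extraction, crawling, and anti-bot strategy over the MCP protocol, and supports calls through mcporter.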

0 · 1.7k · 15 current · 16 total

by @devbd1 (Burak)·MIT-0

Network tools · Automation · Data processing · Developer tools · API tools

License

MIT-0

Last updated

2026/3/6

Security scan

VirusTotal

Benign

OpenClaw

Safe

high confidence

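The skill's requirements, instructions, and bundled code are all consistent with a web scraping assistant that integrates with the Scrapling MCP server. No hidden or unrelated behavior was found.

Assessment recommendations

The skill behaves consistently for Scrapling web scraping use cases, but note the following before using it: 1) Make sure you have explicit authorization from the target site and comply with its terms of service and the law. The skill includes anti-bot/stealth features, which must not be used to bypass access controls; 2) Installing Playwright downloads browser binaries, so verify the install command and its source; 3) The example proxy URLs contain credentials. Real proxy credentials are sensitive and should be given only to trusted services; 4) The scripts write files (downloads, crawldir), so run them in a controlled directory and inspect the code; 5) If you run Scrapling as an MCP server, bind it to localhost and review its network exposure; 6) For higher assurance, verify the upstream Scrapling project (repository/documentation links are provided) and check the bundled Python scripts for local modifications. ...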

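Detailed analysis

✓ Purpose and capabilities

The name and description (Scrapling MCP web scraping) match the files and instructions: the skill provides guidance, recipes, MCP configuration, and two helper Python scripts that call Scrapling fetchers. It requests no unrelated credentials, binaries, or config paths.

✓ Instruction scope

SKILL.md focuses on scraping strategy, MCP configuration, and mcporter usage. The instructions reference only the Scrapling API, proxy examples, and a local crawl directory. Nothing instructs reading unrelated system files or exfiltrating sensitive information, and the skill repeatedly reminds users that anti-bot/stealth modes require authorization.

✓ Install mechanism

There is no install spec (instructions only), so nothing is downloaded or written automatically during installation. The README suggests optionally installing Scrapling and Playwright with pip, which matches what the skill is expected to do.

✓ Credential requirements

The skill declares no required environment variables, credentials, or config paths. The examples show an optional PYTHONPATH and proxy URLs (in an example user:pass format), which is reasonable for a scraping tool and proportionate to its stated purpose.

✓ Persistence and privileges

always:false is set and model invocation is not disabled (the default), which is normal. The skill requests no persistent platform privileges and does not attempt to modify other skills or global agent configuration. It only shows users how to add an MCP server entry for Scrapling.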

Security is layered; review the code before running it.

License

MIT-0

Free to use, modify, and redistribute. No attribution required.

Runtime dependencies

No special dependencies

Versions

latest · v1.2.0 · 2026/2/25

Renamed the display name and aligned the skill metadata to scrapling-mcp (slug unchanged)

● Benign

Install command

Official: npx clawhub@latest install scrapling-web-scraping

Mirror (accelerated): npx clawhub@latest install scrapling-web-scraping --registry https://cn.longxiaskill.com

Skill documentation

# Scrapling MCP: Web Scraping Guide

> Guidance layer + MCP integration
> Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.

## Quick Start (MCP)

### 1. Install Scrapling with MCP support

```bash
# Quote the extras so shells such as zsh do not expand the brackets
pip install "scrapling[mcp]"
# Or the full install:
pip install "scrapling[mcp,playwright]"
python -m playwright install chromium
```

### 2. Add it to the OpenClaw MCP config

```json
{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"]
    }
  }
}
```

### 3. Call it via mcporter

```
mcporter call scrapling fetch_page --url "https://example.com"
```

## Execution vs. Guidance

| Task | Tool | Example |
|------|------|---------|
| Fetch a page | mcporter | `mcporter call scrapling fetch_page --url URL` |
| Extract with CSS | mcporter | `mcporter call scrapling css_select --selector ".title::text"` |
| Which fetcher? | This skill | See "Fetcher Selection Guide" below |
| Anti-bot strategy? | This skill | See "Anti-Bot Escalation Ladder" |
| Complex crawl pattern? | This skill | See "Spider recipes" |

## Fetcher Selection Guide

```
┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Fetcher       │────▶│  DynamicFetcher  │────▶│  StealthyFetcher │
│   (HTTP)        │     │  (browser/JS)    │     │  (anti-bot)      │
└─────────────────┘     └──────────────────┘     └──────────────────┘
     Fastest               JS rendering             Cloudflare,
     Static pages          SPA, React/Vue           Turnstile, etc.
```

### Decision tree

1. Static HTML? → Fetcher (10–100x faster)
2. Needs JS execution? → DynamicFetcher
3. Getting blocked? → StealthyFetcher
4. Complex sessions? → Use the Session variants

### MCP fetch modes

- `fetch_page`: HTTP fetcher
- `fetch_dynamic`: browser-based, using Playwright
- `fetch_stealthy`: anti-bot bypass mode

## Anti-Bot Escalation Ladder

### Level 1: Polite HTTP

```python
# MCP call: fetch_page with options
{
    "url": "https://example.com",
    "headers": {"User-Agent": "..."},
    "delay": 2.0
}
```

### Level 2: Session persistence

```python
from scrapling.fetchers import FetcherSession

# Keep cookies/state across requests
FetcherSession(impersonate="chrome")  # TLS fingerprint impersonation
```

### Level 3: Stealth mode

```python
from scrapling.fetchers import StealthyFetcher

# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # Automatically solve Turnstile
    network_idle=True
)
```

### Level 4: Proxy rotation

See `references/proxy-rotation.md`

## Adaptive Scraping (Anti-Fragile)

Scrapling can survive site redesigns by using adaptive selectors:

```python
# First run: save the element fingerprint
products = page.css('.product', auto_save=True)

# Later runs: relocate the elements automatically if the DOM changed
products = page.css('.product', adaptive=True)
```

MCP usage:

```
mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true
```

## Spider Framework (Large-Scale Crawling)

When to use a Spider instead of direct fetching:

- ✅ Spider: 10+ pages, concurrency needed, resumable crawls, proxy rotation
- ✅ Direct: 1–5 pages, quick extraction, simple flow

### Basic Spider pattern

```python
from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0

    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }

        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Start with resume support
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")
```

### Advanced: multi-session Spider

```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected paths through the stealth session
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")
```

### Spider features

- Pause/resume: the `crawldir` parameter saves checkpoints
- Streaming: `async for item in spider.stream()` processes items in real time
- Auto-retry: configurable retries when blocked
- Export: built-in `to_json()` and `to_jsonl()`

## CLI and Interactive Shell

### Terminal extraction (zero code)

```bash
# Extract as markdown
scrapling extract get 'https://example.com' content.md

# Extract specific elements
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'

# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare
```

### Interactive shell

```bash
scrapling shell

# Inside the shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')
```

## Parser API (Beyond CSS/XPath)

### BeautifulSoup-style methods

```python
import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))

# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')

# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children

# Similar elements
similar = first.find_similar()   # Visually/structurally similar elements
below = first.below_elements()   # Elements below this one in the DOM
```

### Auto-generated selectors

```python
# Generate a stable selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector()  # Returns a stable CSS path
xpath = element.auto_xpath()
```

## Proxy Rotation

```python
from scrapling.spiders import ProxyRotator
from scrapling.fetchers import FetcherSession

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Works with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
```

## Common Recipes

### Pagination patterns

```python
# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

# Next-page button (inside a Spider parse method).
# Use `if`, not `while`: each response yields at most one follow-up request,
# and the next response is handled by the next parse call.
if next_page := response.css('.next a::attr(href)').get():
    yield response.follow(next_page)

# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
    page = session.fetch(url)
    page.scroll_to_bottom()
    items = page.css('.item').getall()
```

### Login sessions

```python
with StealthySession(headless=False) as session:
    # Log in
    login_page = session.fetch('https://example.com/login')
    login_page.fill('input[name="username"]', 'user')
    login_page.fill('input[name="password"]', 'pass')
    login_page.click('button[type="submit"]')

    # The session now carries the login cookies
    protected_page = session.fetch('https://example.com/dashboard')
```

### Next.js data extraction

```python
# Extract the JSON payload from __NEXT_DATA__
import json
import re

next_data = json.loads(
    re.search(
        # The closing </script> bounds the lazy group; without it the group matches nothing
        r'__NEXT_DATA__" type="application/json">(.*?)</script>',
        page.html_content,
        re.S
    ).group(1)
)
props = next_data['props']['pageProps']
```

## Output Formats

```python
# JSON (pretty-printed)
result.items.to_json('output.json')

# JSONL (streaming, one item per line)
result.items.to_jsonl('output.jsonl')

# Python objects
for item in result.items:
    print(item['title'])
```

## Performance Tips

1. Prefer the HTTP fetcher: it is 10–100x faster than a browser
2. Impersonate a browser: `impersonate='chrome'` applies TLS fingerprint impersonation
3. HTTP/3 support: `FetcherSession(http3=True)`
4. Limit resources: set `disable_resources=True` in Dynamic/Stealthy
5. Connection pooling: reuse sessions across requests

## Safety Guidelines (Must Follow)

- Only scrape content you are authorized to access
- Respect robots.txt and the ToS
- Add delays (`download_delay`) for large crawls
- Do not bypass paywalls or authentication without consent
- Never scrape personal or sensitive data

## References

- `references/mcp-setup.md`: detailed MCP configuration
- `references/anti-bot.md`: anti-bot handling strategies
- `references/proxy-rotation.md`: proxy setup and rotation
- `references/spider-recipes.md`: advanced crawl patterns
- `references/api-reference.md`: quick API reference
- `references/links.md`: links to the official documentation

## Scripts

- `scripts/scrapling_scrape.py`: quick one-off extraction
- `scripts/scrapling_smoke_test.py`: tests connectivity and anti-bot indicators
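
The four questions in the fetcher decision tree of this guide can be sketched as a small helper. This is an illustrative sketch only, not part of Scrapling: `choose_fetcher` and its three boolean flags are hypothetical names, and only the returned class names (`Fetcher`, `DynamicFetcher`, `StealthyFetcher`, and their Session counterparts) come from the guide itself. It returns names as strings, so it runs without Scrapling installed.

```python
def choose_fetcher(needs_js: bool = False, blocked: bool = False,
                   needs_session: bool = False) -> str:
    """Map the guide's decision-tree questions to a fetcher class name.

    Hypothetical helper: the flags are assumptions made for illustration.
    """
    if blocked:
        # Step 3: the site blocks plain requests, so escalate to stealth
        one_off, session = "StealthyFetcher", "StealthySession"
    elif needs_js:
        # Step 2: content is rendered by JavaScript
        one_off, session = "DynamicFetcher", "DynamicSession"
    else:
        # Step 1: static HTML, the fastest path
        one_off, session = "Fetcher", "FetcherSession"
    # Step 4: multi-step flows (login, cookies) use the Session variant
    return session if needs_session else one_off


print(choose_fetcher())                                  # Fetcher
print(choose_fetcher(needs_js=True))                     # DynamicFetcher
print(choose_fetcher(blocked=True, needs_session=True))  # StealthySession
```

Checking `blocked` before `needs_js` mirrors the escalation order in the guide: the stealth tier is treated as a superset of the browser tier, so a blocked site goes to stealth whether or not it needs JavaScript.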
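
The Next.js recipe in this guide can be tried offline with the standard library alone. The sketch below assumes a made-up `SAMPLE_HTML` string and a hypothetical helper name, `extract_page_props`; neither comes from Scrapling. It shows the same idea as the recipe: locate the `__NEXT_DATA__` script tag, parse its JSON body, and read `props.pageProps`, returning an empty dict when the tag is missing instead of raising.

```python
import json
import re

# Hypothetical sample page; a real Next.js page embeds a much larger payload.
SAMPLE_HTML = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Demo", "items": [1, 2, 3]}}}'
    '</script></body></html>'
)

# Match the script tag by id and capture everything up to its closing tag.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.S
)


def extract_page_props(html: str) -> dict:
    """Return props.pageProps from the __NEXT_DATA__ script, or {} if absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return {}
    data = json.loads(match.group(1))
    return data.get("props", {}).get("pageProps", {})


page_props = extract_page_props(SAMPLE_HTML)
print(page_props["title"])  # Demo
```

With a fetched page, the HTML string that the recipe reads from `page.html_content` would be passed in place of `SAMPLE_HTML`.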

Data source: ClawHub · Chinese localization: 龙虾技能库