Scrapling MCP: Web Scraping Assistant

Author: Burak

v1.2.0

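Scrapling MCP provides advanced web scraping capabilities, including extraction, crawling, and anti-bot handling strategies. With native MCP support, scraping tasks run by calling the `scrapling` MCP server through mcporter. The skill supplies strategies, recipes, and best practices.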

by @devbd1 (Burak)·MIT-0

Tags: browser automation, API tools, agents

License

MIT-0

Last updated

2026/3/6

Security scan

VirusTotal

Benign

OpenClaw

Safe

high confidence

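The skill's requirements, instructions, and code are consistent with a web scraping assistant that integrates the Scrapling MCP server; no hidden or unrelated behavior was found.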

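Evaluation recommendations

Make sure you have explicit permission to scrape the target site, and comply with its terms of service and applicable laws. The skill's anti-bot/stealth features should not be used to circumvent access controls.
Installing Playwright downloads browser binaries; verify the install command and its source.
The sample proxy examples contain credentials; treat real proxy credentials as sensitive and share them only with trusted providers.
The included scripts write files (downloads, crawldir); run them in a controlled directory and review the code.
If you run Scrapling as an MCP server, consider binding it to localhost and reviewing its network interfaces before exposing it.
For higher assurance, verify the upstream Scrapling project (provide repository/documentation links) and review any local modifications to the included Python scripts.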

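Detailed analysis

✓ Purpose and capabilities

The name and description match the files and instructions: the skill provides guidance, recipes, MCP setup, and two helper Python scripts that call Scrapling fetchers. No unrelated credentials, binaries, or config paths are requested.

✓ Instruction scope

SKILL.md focuses on scraping strategy, MCP configuration, and mcporter usage. The instructions reference only Scrapling APIs, proxy examples, and a local crawl directory. Nothing instructs the agent to read unrelated system files or leak secrets; the guidance repeatedly reminds users to use anti-bot/stealth modes only with permission.

✓ Install mechanism

There is no install spec (instructions only), so nothing is downloaded or written automatically during installation. The README suggests an optional pip install of Scrapling and Playwright, which is appropriate and expected for the skill's functionality.

✓ Credential requirements

The skill declares no required environment variables, credentials, or config paths. The examples show an optional PYTHONPATH and sample proxy URLs (including a sample user:password format), which is reasonable for a scraping tool and proportionate to the stated purpose.

✓ Persistence and permissions

always: false, and model invocation is not disabled (the default), which is normal. The skill requests no persistent platform permissions and does not attempt to modify other skills or the global agent configuration, beyond guiding the user on how to add an MCP server entry for Scrapling.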

Security is layered; review the code before running it.

MIT-0: free to use, modify, and redistribute, with no attribution required.

Runtime dependencies

No special dependencies

Version

latest v1.2.0 (2026/2/25)

Renamed the display name and aligned the skill metadata to scrapling-mcp (slug unchanged)

Install command

Official: npx clawhub@latest install scrapling-web-scraping

Mirror: npx clawhub@latest install scrapling-web-scraping --registry https://cn.clawhub-mirror.com

Skill documentation

(Abridged because of a character limit; see the original for the full version.)

name: scrapling-mcp
description: Advanced web scraping...

# Scrapling MCP: Web Scraping Guidance

Guidance Layer + MCP Integration
Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.

Quick Start (MCP)

1. Install Scrapling with MCP support

pip install "scrapling[mcp]"
# Or for full features:
pip install "scrapling[mcp,playwright]"
python -m playwright install chromium

2. Add to OpenClaw MCP config

{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"]
    }
  }
}
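When the MCP config already lists other servers, the entry can be merged in rather than overwritten. The following is a minimal sketch using only the standard library; the helper name `add_scrapling_server` is hypothetical, and the `command`/`args` values are simply the ones shown in the config above.

```python
import json


def add_scrapling_server(config: dict) -> dict:
    """Return a copy of `config` with the `scrapling` server entry added.

    Hypothetical helper: the entry mirrors the JSON shown above and
    leaves any other configured servers untouched.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["scrapling"] = {"command": "python", "args": ["-m", "scrapling.mcp"]}
    merged["mcpServers"] = servers
    return merged


# Example: a config that already contains one unrelated (made-up) server.
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
print(json.dumps(add_scrapling_server(existing), indent=2))
```

Returning a copy instead of mutating the input keeps the original config available if the result needs to be diffed or rolled back.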

3. Call via mcporter

mcporter call scrapling fetch_page --url "https://example.com"
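For scripted use, the same call can be assembled as an argument list. This sketch only builds the command and does not run it; `mcporter_command` is a hypothetical helper, and the assumption that every option maps to a `--kebab-case` flag is taken from the flag style used in this document's examples.

```python
import shlex


def mcporter_command(tool: str, **options: str) -> list[str]:
    """Build the argv for `mcporter call scrapling <tool> --key value ...`.

    Hypothetical helper: option names become `--kebab-case` flags.
    """
    argv = ["mcporter", "call", "scrapling", tool]
    for key, value in options.items():
        argv += [f"--{key.replace('_', '-')}", str(value)]
    return argv


argv = mcporter_command("fetch_page", url="https://example.com")
print(shlex.join(argv))
# To execute for real, pass the list to subprocess.run(argv, check=True).
```

Passing a list to `subprocess.run` avoids shell quoting problems with URLs and selectors.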

Execution vs Guidance

Task	Tool	Example
Fetch a page	mcporter	`mcporter call scrapling fetch_page --url URL`
Extract with CSS	mcporter	`mcporter call scrapling css_select --selector ".title::text"`
Which fetcher to use?	This skill	See "Fetcher Selection Guide" below
Anti-bot strategy?	This skill	See "Anti-Bot Escalation Ladder"
Complex crawl patterns?	This skill	See "Spider Recipes"

Fetcher Selection Guide

┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Fetcher       │────▶│ DynamicFetcher   │────▶│ StealthyFetcher  │
│   (HTTP)        │     │ (Browser/JS)     │     │ (Anti-bot)       │
└─────────────────┘     └──────────────────┘     └──────────────────┘
     Fastest              JS-rendered               Cloudflare, 
     Static pages         SPAs, React/Vue          Turnstile, etc.

Decision Tree

Static HTML? → Fetcher (10-100x faster)
Need JS execution? → DynamicFetcher
Getting blocked? → StealthyFetcher
Complex session? → Use Session variants
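The decision tree above can be written as a small function. This is an illustrative sketch only: `choose_fetcher` is a hypothetical helper that returns class names as strings, and giving "blocked" priority over "needs JS" is an assumption about how the rules combine.

```python
def choose_fetcher(needs_js: bool = False, blocked: bool = False,
                   multi_step: bool = False) -> str:
    """Map the decision tree to a fetcher name (returned as a string)."""
    if blocked:
        base = "StealthyFetcher"
    elif needs_js:
        base = "DynamicFetcher"
    else:
        base = "Fetcher"
    # Session variants, using the session class names that appear in this document.
    sessions = {
        "Fetcher": "FetcherSession",
        "DynamicFetcher": "DynamicSession",
        "StealthyFetcher": "StealthySession",
    }
    return sessions[base] if multi_step else base


print(choose_fetcher())                      # static page
print(choose_fetcher(needs_js=True))         # JS-rendered page
print(choose_fetcher(blocked=True, multi_step=True))
```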

MCP Fetch Modes

fetch_page — HTTP fetcher
fetch_dynamic — Browser-based with Playwright
fetch_stealthy — Anti-bot bypass mode

Anti-Bot Escalation Ladder

Level 1: Polite HTTP

# MCP call: fetch_page with options
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "delay": 2.0
}

Level 2: Session Persistence

# Use sessions for cookie/state across requests
FetcherSession(impersonate="chrome")  # TLS fingerprint spoofing

Level 3: Stealth Mode

# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # Auto-solve Turnstile
    network_idle=True
)

Level 4: Proxy Rotation

See references/proxy-rotation.md
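The ladder amounts to "try the cheapest level first and escalate only on failure". The sketch below shows that control flow with stand-in fetch functions; `fetch_with_escalation` and the two fake fetchers are hypothetical, and real levels would wrap the MCP tools or fetcher classes described above.

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetch function returns the page HTML, or None when the request was blocked.
FetchFn = Callable[[str], Optional[str]]


def fetch_with_escalation(url: str, ladder: Iterable[Tuple[str, FetchFn]]) -> Tuple[str, str]:
    """Try each level in order; return (level_name, html) from the first success."""
    for name, fetch in ladder:
        html = fetch(url)
        if html is not None:
            return name, html
    raise RuntimeError(f"all levels failed for {url}")


# Stand-in fetchers for illustration only.
def polite_http(url: str) -> Optional[str]:
    return None  # pretend level 1 was blocked


def stealth(url: str) -> Optional[str]:
    return "<html>ok</html>"  # pretend level 3 succeeded


level, html = fetch_with_escalation(
    "https://example.com",
    [("polite_http", polite_http), ("stealth", stealth)],
)
print(level)
```

Keeping the ladder as data makes it easy to cap escalation, for example by omitting the stealth level for sites where it is not authorized.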

Adaptive Scraping (Anti-Fragile)

Scrapling can survive website redesigns using adaptive selectors:

# First run: save fingerprints
products = page.css('.product', auto_save=True)

# Later runs: auto-relocate if DOM changed
products = page.css('.product', adaptive=True)

MCP usage:

mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true
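To give a feel for the idea behind adaptive selectors, here is a toy sketch that saves a fingerprint of an element and later picks the most similar candidate. It is not Scrapling's algorithm: elements are plain dicts, similarity comes from the standard library's `difflib`, and the names `fingerprint`, `relocate`, and the 0.5 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def fingerprint(element: dict) -> str:
    """Flatten tag, sorted attributes, and text into one comparable string."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(element["attrs"].items()))
    return f"{element['tag']} {attrs} {element['text']}"


def relocate(saved: dict, candidates: list[dict], threshold: float = 0.5):
    """Return the candidate most similar to the saved element, or None."""
    if not candidates:
        return None
    target = fingerprint(saved)
    scored = [(SequenceMatcher(None, target, fingerprint(c)).ratio(), c) for c in candidates]
    score, best = max(scored, key=lambda pair: pair[0])
    return best if score >= threshold else None


saved = {"tag": "div", "attrs": {"class": "product"}, "text": "Blue Widget $9.99"}
after_redesign = [
    {"tag": "nav", "attrs": {"class": "menu"}, "text": "Home About"},
    {"tag": "article", "attrs": {"class": "product-card"}, "text": "Blue Widget $9.99"},
]
print(relocate(saved, after_redesign)["tag"])
```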

Spider Framework (Large Crawls)

When to use Spiders vs direct fetching:

✅ Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
✅ Direct: 1-5 pages, quick extraction, simple flow

Basic Spider Pattern

from scrapling.spiders import Spider, Response


class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0
    
    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }
        
        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")

Advanced: Multi-Session Spider

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession


class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")

Spider Features

Pause/Resume: crawldir parameter saves checkpoints
Streaming: async for item in spider.stream() for real-time processing
Auto-retry: Configurable retry on blocked requests
Export: Built-in to_json(), to_jsonl()
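The pause/resume feature relies on persisting crawl state to disk. As a rough illustration of that idea (not Scrapling's actual checkpoint format), the sketch below keeps a URL queue in a JSON file; the `Frontier` class and the `checkpoint.json` file name are hypothetical.

```python
import json
import tempfile
from collections import deque
from pathlib import Path


class Frontier:
    """URL queue that persists its state so a crawl can resume after a stop."""

    def __init__(self, crawldir: str, start_urls: list[str]):
        self.path = Path(crawldir) / "checkpoint.json"
        if self.path.exists():  # resume from the saved state
            state = json.loads(self.path.read_text())
            self.pending, self.seen = deque(state["pending"]), set(state["seen"])
        else:  # fresh crawl
            self.pending, self.seen = deque(start_urls), set(start_urls)

    def add(self, url: str) -> None:
        """Queue a URL unless it was already seen."""
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def pop(self) -> str:
        return self.pending.popleft()

    def save(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        state = {"pending": list(self.pending), "seen": sorted(self.seen)}
        self.path.write_text(json.dumps(state))


crawldir = tempfile.mkdtemp()
first = Frontier(crawldir, ["https://example.com/"])
first.pop()                                   # "crawl" the start page
first.add("https://example.com/page/2")       # discovered link
first.save()                                  # simulate a pause
resumed = Frontier(crawldir, ["https://example.com/"])
print(list(resumed.pending))
```

Saving the seen set alongside the queue is what prevents a resumed crawl from revisiting pages it already processed.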

CLI & Interactive Shell

Terminal Extraction (No Code)

# Extract to markdown
scrapling extract get 'https://example.com' content.md

# Extract specific element
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'

# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare

Interactive Shell

scrapling shell
# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')

Parser API (Beyond CSS/XPath)

BeautifulSoup-Style Methods

import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))

# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')

# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children

# Similarity
similar = first.find_similar()  # Find visually/structurally similar elements
below = first.below_elements()  # Elements below in DOM
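The two regular expressions used above can be checked in isolation with the standard library before they are applied to a page. The sample strings here are made up for illustration.

```python
import re

PRICE = re.compile(r"\$\d+\.\d{2}")   # dollar amount with exactly two decimals
ITEM_ID = re.compile(r"item-\d+")     # ids such as "item-42"

text = "Widget $9.99, Gadget $120.50, Free sample $0"
prices = PRICE.findall(text)
print(prices)

ids = ["item-42", "item-x", "banner"]
matching_ids = [value for value in ids if ITEM_ID.fullmatch(value)]
print(matching_ids)
```

Note that "$0" is skipped because the pattern requires a decimal point followed by two digits.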

Auto-Generated Selectors

# Get robust selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector()  # Returns stable CSS path
xpath = element.auto_xpath()

Proxy Rotation

from scrapling.fetchers import FetcherSession
from scrapling.spiders import ProxyRotator

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
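Cyclic rotation simply walks the proxy list in order and wraps around at the end. The sketch below shows that behavior with the standard library; `CyclicRotator` is a hypothetical stand-in written for illustration, not the library's `ProxyRotator`.

```python
from itertools import cycle


class CyclicRotator:
    """Hand out proxies in a fixed order, wrapping around at the end."""

    def __init__(self, proxies: list[str]):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next(self) -> str:
        return next(self._pool)


rotator = CyclicRotator(["http://proxy1:8080", "http://proxy2:8080"])
picked = [rotator.next() for _ in range(3)]
print(picked)
```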

Common Recipes

Pagination Patterns

# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

# Next button (inside a spider's parse callback)
if next_page := response.css('.next a::attr(href)').get():
    yield response.follow(next_page)

# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
    page = session.fetch(url)
    page.scroll_to_bottom()
    items = page.css('.item').getall()
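For the page-number and next-button patterns, the URL handling can be done with the standard library alone. This is a small sketch; `page_urls` and `resolve_next` are hypothetical helper names, and the `page` query parameter is the one used in the example above.

```python
from urllib.parse import urlencode, urljoin


def page_urls(base: str, first: int, last: int) -> list[str]:
    """Build ?page=N URLs for an inclusive page range."""
    return [f"{base}?{urlencode({'page': n})}" for n in range(first, last + 1)]


def resolve_next(current_url: str, href: str) -> str:
    """Turn a relative 'next' link into an absolute URL."""
    return urljoin(current_url, href)


print(page_urls("https://example.com/products", 1, 3))
print(resolve_next("https://example.com/products?page=1", "?page=2"))
```

Resolving `href` values against the current URL matters because "next" links are often relative.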

Login Sessions

with StealthySession(headless=False) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('input[name="username"]', 'user')
    login_page.fill('input[name="password"]', 'pass')
    login_page.click('button[type="submit"]')
    
    # Now session has cookies
    protected_page = session.fetch('https://example.com/dashboard')

Next.js Data Extraction

# Extract JSON from __NEXT_DATA__
import json
import re

next_data = json.loads(
    re.search(
        r'__NEXT_DATA__" type="application/json">(.*?)</script>',
        page.html_content,
        re.S
    ).group(1)
)
props = next_data['props']['pageProps']
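The same extraction can be tried offline against a sample HTML string, with a guard for pages that have no `__NEXT_DATA__` script. This is a minimal sketch; `SAMPLE_HTML` and `extract_page_props` are made up for illustration.

```python
import json
import re

SAMPLE_HTML = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Hello"}}}'
    "</script></body></html>"
)

NEXT_DATA = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.S,
)


def extract_page_props(html: str):
    """Return props.pageProps from the embedded JSON, or None if absent."""
    match = NEXT_DATA.search(html)
    if match is None:
        return None
    return json.loads(match.group(1)).get("props", {}).get("pageProps")


print(extract_page_props(SAMPLE_HTML))
```

Returning `None` instead of raising lets a crawler skip pages that are not rendered by Next.js.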

Output Formats

# JSON (pretty)
result.items.to_json('output.json')
# JSONL (streaming, one per line)
result.items.to_jsonl('output.jsonl')

# Python objects
for item in result.items:
    print(item['title'])
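JSONL stores one JSON object per line, which is why it suits streaming output. The sketch below shows the format itself with the standard library; `write_jsonl` and `read_jsonl` are hypothetical helpers, not part of the library's export API.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def write_jsonl(items: Iterable[dict], fh: TextIO) -> int:
    """Write one JSON object per line; return the number of items written."""
    count = 0
    for item in items:
        fh.write(json.dumps(item, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(fh: TextIO) -> Iterator[dict]:
    """Yield items back one line at a time, skipping blank lines."""
    for line in fh:
        if line.strip():
            yield json.loads(line)


buffer = io.StringIO()  # stands in for an open file
written = write_jsonl([{"title": "A"}, {"title": "B"}], buffer)
buffer.seek(0)
titles = [item["title"] for item in read_jsonl(buffer)]
print(written, titles)
```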

Performance Tips

Use HTTP fetcher when possible — 10-100x faster than browser
Impersonate browsers — impersonate='chrome' for TLS fingerprinting
HTTP/3 support — FetcherSession(http3=True)
Limit resources — disable_resources=True in Dynamic/Stealthy
Connection pooling — Reuse sessions across requests

Guardrails (Always)

Only scrape content you're authorized to access
Respect robots.txt and ToS
Add delays (download_delay) for large crawls
Don't bypass paywalls or authentication without permission
Never scrape personal/sensitive data
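A robots.txt check can be automated with Python's standard `urllib.robotparser`. The sketch below parses an inline sample so that it runs offline; the rules in `ROBOTS_TXT` and the `my-crawler` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("my-crawler", "https://example.com/products")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
delay = parser.crawl_delay("my-crawler")
print(allowed, blocked, delay)
```

A crawl-delay value from robots.txt is a reasonable lower bound for the `download_delay` setting mentioned above.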

References

references/mcp-setup.md — Detailed MCP configuration
references/anti-bot.md — Anti-bot handling strategies
references/proxy-rotation.md — Proxy setup and rotation
references/spider-recipes.md — Advanced crawling patterns
references/api-reference.md — Quick API reference
references/links.md — Official docs links

Scripts

scripts/scrapling_scrape.py — Quick one-off extraction
scripts/scrapling_smoke_test.py — Test connectivity and anti-bot indicators

Data source: ClawHub · Chinese localization: 龙虾技能库
