
Scrapling MCP: Skill Tool

v0.1.2

[Auto-translated] Advanced web scraping with Scrapling: MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) to call the `scrapl...

by @devbd1 (Burak)·MIT-0
License: MIT-0
Last updated: 2026/3/6
Security scan
VirusTotal: Benign
OpenClaw: Safe (medium confidence)
The skill's files, instructions, and requirements are consistent with a web-scraping guidance + MCP integration tool; it asks for no unrelated credentials or elevated platform privileges, but its anti-bot/stealth guidance can be misused and should only be used with permission.
Assessment recommendations
This skill appears to be a legitimate guidance layer + helper scripts for using Scrapling via MCP. Before installing/using it: 1) Verify you have permission to scrape target sites — do not use stealth or proxy rotation to evade protections, bypass paywalls, or access private data. 2) Inspect any proxy strings or Authorization headers you paste into configs; never store real credentials in public places. 3) The skill instructs installing third‑party packages (scrapling, playwright) — prefer insta...
Detailed analysis
Purpose and capabilities
Name/description (Scrapling MCP guidance) align with the provided files and instructions: SKILL.md, reference docs, and helper scripts all focus on scraping, MCP setup, fetcher selection, spiders, proxies and anti-bot handling. No unrelated env vars, binaries, or platform access are requested.
Instruction scope
Runtime instructions and examples legitimately show installing scrapling/playwright, configuring an MCP server, calling mcporter, and using fetcher/stealthy/dynamic modes. The instructions include explicit guidance and examples for proxy rotation and 'solve_cloudflare' / stealthy fetchers; those are coherent for advanced scraping but can enable bypassing anti-bot measures if used without authorization — the docs repeatedly note 'use only when authorized', which mitigates but does not remove misuse risk.
Install mechanism
No install specification is included in the registry (instruction-only). The SKILL.md instructs pip installs from known packages (scrapling, playwright) and to run playwright install; helper scripts are shipped with the skill but there is no downloader or remote install URL that would write arbitrary code at runtime.
Credential requirements
The skill declares no required environment variables, no primary credential, and no config-path requirements. Example snippets show proxy URLs (including username:password examples) and an example API Authorization header in a recipe — these are examples only and not requested by the skill; exercise caution when inserting real credentials into proxy strings or requests.
Persistence and privileges
The skill is not marked always:true and does not request persistent or cross-skill configuration changes. It does not attempt to modify other skills or system-wide settings.
Security is layered; review the code before running it.

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Version

latest · v0.1.2 · 2026/3/6

Docs: add source repo link

● Benign

Install command

Official: npx clawhub@latest install scrapling-mcp
Mirror: npx clawhub@latest install scrapling-mcp --registry https://cn.clawhub-mirror.com

Skill documentation

Source repo: https://github.com/DevBD1/openclaw-skill-scrapling-mcp

Guidance Layer + MCP Integration
Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.

Quick Start (MCP)

1. Install Scrapling with MCP support

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "scrapling[mcp]"
# Or for full features:
pip install "scrapling[mcp,playwright]"
python -m playwright install chromium

2. Add to OpenClaw MCP config

{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"]
    }
  }
}

3. Call via mcporter

mcporter call scrapling fetch_page --url "https://example.com"
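
If the MCP config file already lists other servers, the entry from step 2 has to be merged in rather than pasted over the file. The following is a minimal sketch of that merge, assuming the config is a plain JSON object with an `mcpServers` key as shown above. The helper name `add_mcp_server` and the pre-existing `other` server are hypothetical.

```python
import json


def add_mcp_server(config: dict, name: str, command: str, args: list[str]) -> dict:
    """Return a copy of config with one mcpServers entry added or replaced."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # copy so the input is not mutated
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config that already defines another server
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}

# Same command and args as the config entry in step 2
updated = add_mcp_server(existing, "scrapling", "python", ["-m", "scrapling.mcp"])
print(json.dumps(updated, indent=2))
```

The helper copies the nested dictionary before editing it, so the original config object can still be compared against or written back if the merge is rejected.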

Execution vs Guidance

| Task | Tool | Example |
| --- | --- | --- |
| Fetch a page | mcporter | mcporter call scrapling fetch_page --url URL |
| Extract with CSS | mcporter | mcporter call scrapling css_select --selector ".title::text" |
| Which fetcher to use? | This skill | See "Fetcher Selection Guide" below |
| Anti-bot strategy? | This skill | See "Anti-Bot Escalation Ladder" |
| Complex crawl patterns? | This skill | See "Spider Recipes" |

Fetcher Selection Guide

┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Fetcher       │────▶│ DynamicFetcher   │────▶│ StealthyFetcher  │
│   (HTTP)        │     │ (Browser/JS)     │     │ (Anti-bot)       │
└─────────────────┘     └──────────────────┘     └──────────────────┘
     Fastest              JS-rendered               Cloudflare, 
     Static pages         SPAs, React/Vue          Turnstile, etc.

Decision Tree

  • Static HTML? → Fetcher (10-100x faster)
  • Need JS execution? → DynamicFetcher
  • Getting blocked? → StealthyFetcher
  • Complex session? → Use Session variants
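
The decision tree can be written down as a small function. This is only an illustrative sketch: `choose_fetcher` and its boolean parameters are hypothetical names, and it returns class names as strings rather than importing Scrapling, so it runs without the library installed.

```python
def choose_fetcher(needs_js: bool = False, blocked: bool = False,
                   needs_session: bool = False) -> str:
    """Map the decision-tree questions to a fetcher class name."""
    if blocked:
        base = "StealthyFetcher"      # anti-bot protection in the way
    elif needs_js:
        base = "DynamicFetcher"       # page needs a browser to render
    else:
        base = "Fetcher"              # plain HTTP is enough
    if needs_session:
        # Session variants keep cookies and state across requests
        session_names = {
            "Fetcher": "FetcherSession",
            "DynamicFetcher": "DynamicSession",
            "StealthyFetcher": "StealthySession",
        }
        return session_names[base]
    return base


print(choose_fetcher())                                # Fetcher
print(choose_fetcher(needs_js=True))                   # DynamicFetcher
print(choose_fetcher(blocked=True, needs_session=True))  # StealthySession
```

Checking `blocked` before `needs_js` reflects the ordering in the diagram, where the stealth fetcher is the last step and is also browser-based.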

MCP Fetch Modes

  • fetch_page — HTTP fetcher
  • fetch_dynamic — Browser-based with Playwright
  • fetch_stealthy — Anti-bot bypass mode

Anti-Bot Escalation Ladder

Level 1: Polite HTTP

# MCP call: fetch_page with options
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "delay": 2.0
}

Level 2: Session Persistence

# Use sessions for cookie/state across requests
FetcherSession(impersonate="chrome")  # TLS fingerprint spoofing

Level 3: Stealth Mode

# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # Auto-solve Turnstile
    network_idle=True
)

Level 4: Proxy Rotation

See references/proxy-rotation.md
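
The ladder amounts to "try the cheapest level first and move up only when it fails". A minimal sketch of that control flow follows, with stand-in fetch functions instead of real Scrapling or MCP calls. `fetch_with_escalation` and the three stub fetchers are hypothetical, and a return value of `None` is assumed to mean "blocked".

```python
from typing import Callable, Optional, Sequence, Tuple

FetchFn = Callable[[str], Optional[str]]


def fetch_with_escalation(
    url: str, levels: Sequence[Tuple[str, FetchFn]]
) -> Optional[Tuple[str, str]]:
    """Try each level in order; return (level_name, body) for the first success."""
    for name, fetch in levels:
        body = fetch(url)
        if body is not None:
            return name, body
    return None  # every level was blocked


attempts = []  # records which levels were actually tried


def polite_http(url: str) -> Optional[str]:
    attempts.append("http")
    return None  # simulate a blocked request


def stealth(url: str) -> Optional[str]:
    attempts.append("stealth")
    return "<html>ok</html>"


def proxy_rotation(url: str) -> Optional[str]:
    attempts.append("proxy")
    return "<html>ok</html>"


result = fetch_with_escalation(
    "https://example.com",
    [("http", polite_http), ("stealth", stealth), ("proxy", proxy_rotation)],
)
print(result, attempts)
```

Because the loop stops at the first success, the more expensive levels are never invoked for sites that do not need them.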

Adaptive Scraping (Anti-Fragile)

Scrapling can survive website redesigns using adaptive selectors:

# First run: save fingerprints
products = page.css('.product', auto_save=True)

# Later runs: auto-relocate if the DOM changed
products = page.css('.product', adaptive=True)

MCP usage:

mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true
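
To make the idea of "saving a fingerprint and relocating later" concrete, here is a toy sketch using only the standard library. It is not Scrapling's algorithm: elements are modelled as plain dictionaries, and `fingerprint`, `relocate`, and the 0.5 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional


def fingerprint(element: dict) -> str:
    """Flatten tag, attributes, and text into one comparable string."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(element.get("attrs", {}).items()))
    return f"{element['tag']}|{attrs}|{element.get('text', '')}"


def relocate(saved: str, candidates: list[dict], threshold: float = 0.5) -> Optional[dict]:
    """Return the candidate most similar to the saved fingerprint, if similar enough."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, saved, fingerprint(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


# First run: the element is found by its class and its fingerprint is stored
saved = fingerprint({"tag": "div", "attrs": {"class": "product"}, "text": "Blue Widget $9.99"})

# Later run: the class was renamed in a redesign, but the content is the same
new_dom = [
    {"tag": "nav", "attrs": {"class": "menu"}, "text": "Home About"},
    {"tag": "div", "attrs": {"class": "item-card"}, "text": "Blue Widget $9.99"},
]
match = relocate(saved, new_dom)
print(match["attrs"]["class"] if match else None)
```

A real implementation would weigh more signals (position in the tree, siblings, parent attributes), but the save-then-score shape is the same.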

Spider Framework (Large Crawls)

When to use Spiders vs direct fetching:

  • Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
  • Direct: 1-5 pages, quick extraction, simple flow

Basic Spider Pattern

from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0

    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }

        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")

Advanced: Multi-Session Spider

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")

Spider Features

  • Pause/Resume: crawldir parameter saves checkpoints
  • Streaming: async for item in spider.stream() for real-time processing
  • Auto-retry: Configurable retry on blocked requests
  • Export: Built-in to_json(), to_jsonl()
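
Pause/resume boils down to persisting the crawl frontier so a later run can pick up where the previous one stopped. The sketch below shows that idea with a single JSON file. It is an assumption-laden illustration, not Scrapling's on-disk format: `save_checkpoint`, `load_checkpoint`, and the `checkpoint.json` file name are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(crawldir: Path, pending: list[str], seen: set[str]) -> None:
    """Write the URLs still to visit and the URLs already visited."""
    crawldir.mkdir(parents=True, exist_ok=True)
    state = {"pending": pending, "seen": sorted(seen)}
    (crawldir / "checkpoint.json").write_text(json.dumps(state))


def load_checkpoint(crawldir: Path) -> tuple[list[str], set[str]]:
    """Return (pending, seen); both are empty when no checkpoint exists."""
    path = crawldir / "checkpoint.json"
    if not path.exists():
        return [], set()
    state = json.loads(path.read_text())
    return state["pending"], set(state["seen"])


with tempfile.TemporaryDirectory() as tmp:
    crawldir = Path(tmp) / "crawl_data"
    fresh = load_checkpoint(crawldir)          # nothing saved yet
    save_checkpoint(
        crawldir,
        pending=["https://example.com/products?page=3"],
        seen={"https://example.com/products?page=1",
              "https://example.com/products?page=2"},
    )
    resumed_pending, resumed_seen = load_checkpoint(crawldir)

print(resumed_pending, len(resumed_seen))
```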

CLI & Interactive Shell

Terminal Extraction (No Code)

# Extract to markdown
scrapling extract get 'https://example.com' content.md

# Extract a specific element
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'

# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare

Interactive Shell

scrapling shell

# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')

Parser API (Beyond CSS/XPath)

BeautifulSoup-Style Methods

import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))

# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')

# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children

# Similarity
similar = first.find_similar()  # Find visually/structurally similar elements
below = first.below_elements()  # Elements below in DOM
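
The price pattern used in the text-search example can be checked in isolation with the standard `re` module. This sketch only demonstrates what the pattern matches; the sample string is made up.

```python
import re

# Dollar sign, one or more digits, a dot, then exactly two digits
PRICE_RE = re.compile(r'\$\d+\.\d{2}')

text = "Widget $19.99, Gadget $5.00, Free item $0"
prices = PRICE_RE.findall(text)
print(prices)
```

The trailing `$0` is skipped because the pattern requires a decimal point and two digits after it.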

Auto-Generated Selectors

# Get robust selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector()  # Returns stable CSS path
xpath = element.auto_xpath()

Proxy Rotation

from scrapling.spiders import ProxyRotator

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
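
For readers unfamiliar with the "cyclic" strategy, the behaviour can be sketched with `itertools.cycle`. This stand-in class is hypothetical and is not the library's `ProxyRotator`; it only shows that proxies are handed out in order and wrap around at the end.

```python
from itertools import cycle


class CyclicProxyRotator:
    """Hand out proxies in a fixed order, wrapping around at the end."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(proxies)

    def next(self) -> str:
        return next(self._cycle)


rotator = CyclicProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
])
picked = [rotator.next() for _ in range(4)]
print(picked)
```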

Common Recipes

Pagination Patterns

# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

# Next button (inside a spider's parse method; `if`, not `while`,
# because the response does not change between iterations)
if next_page := response.css('.next a::attr(href)').get():
    yield response.follow(next_page)

# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
    page = session.fetch(url)
    page.scroll_to_bottom()
    items = page.css('.item').getall()
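
When the base URL already carries query parameters, building page-number URLs by string formatting gets fragile. A small sketch with `urllib.parse` that preserves existing parameters is shown below; `page_urls` and the `sort=price` parameter are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(base_url: str, first: int, last: int, param: str = "page"):
    """Yield base_url once per page number, keeping any existing query parameters."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    for number in range(first, last + 1):
        query[param] = str(number)
        yield urlunsplit(parts._replace(query=urlencode(query)))


urls = list(page_urls("https://example.com/products?sort=price", 1, 3))
for url in urls:
    print(url)
```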

Login Sessions

with StealthySession(headless=False) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('input[name="username"]', 'user')
    login_page.fill('input[name="password"]', 'pass')
    login_page.click('button[type="submit"]')
    
    # Now session has cookies
    protected_page = session.fetch('https://example.com/dashboard')

Next.js Data Extraction

# Extract JSON from __NEXT_DATA__
import json
import re

next_data = json.loads(
    re.search(
        r'__NEXT_DATA__" type="application/json">(.*?)</script>',
        page.html_content,
        re.S
    ).group(1)
)
props = next_data['props']['pageProps']
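
The same extraction can be tried without fetching anything by running it against an inline HTML string. This is a self-contained sketch: `SAMPLE_HTML` and `extract_page_props` are made up for the example, and the function returns `None` instead of raising when the script tag is missing.

```python
import json
import re
from typing import Optional

# Hypothetical page source containing a Next.js data script
SAMPLE_HTML = (
    '<html><body>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Hello"}}}'
    '</script>'
    '</body></html>'
)

# The lazy group needs the closing tag after it, otherwise it matches nothing
NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>', re.S
)


def extract_page_props(html: str) -> Optional[dict]:
    """Return props.pageProps from the __NEXT_DATA__ script, or None if absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))["props"]["pageProps"]


props = extract_page_props(SAMPLE_HTML)
print(props)
```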

Output Formats

# JSON (pretty)
result.items.to_json('output.json')

# JSONL (streaming, one per line)
result.items.to_jsonl('output.jsonl')

# Python objects
for item in result.items:
    print(item['title'])
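
JSONL is simply one JSON object per line, which is what makes it suitable for streaming. The sketch below writes and reads that format with the standard library only; `write_jsonl` and `read_jsonl` are hypothetical helpers, not the library's `to_jsonl()`.

```python
import io
import json


def write_jsonl(items, stream) -> int:
    """Write one JSON object per line; return the number of items written."""
    count = 0
    for item in items:
        stream.write(json.dumps(item, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(stream) -> list:
    """Parse every non-empty line as a separate JSON document."""
    return [json.loads(line) for line in stream if line.strip()]


buffer = io.StringIO()
written = write_jsonl([{"title": "A", "price": "1.00"}, {"title": "B", "price": "2.50"}], buffer)
buffer.seek(0)
items = read_jsonl(buffer)
print(written, items)
```

Because each line is independent, a consumer can start processing items before the crawl has finished writing the file.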

Performance Tips

  • Use the HTTP fetcher when possible: 10-100x faster than a browser
  • Impersonate browsers: impersonate='chrome' for TLS fingerprinting
  • HTTP/3 support: FetcherSession(http3=True)
  • Limit resources: disable_resources=True in Dynamic/Stealthy
  • Connection pooling: reuse sessions across requests

Guardrails (Always)

  • Only scrape content you're authorized to access
  • Respect robots.txt and ToS
  • Add delays (download_delay) for large crawls
  • Don't bypass paywalls or authentication without permission
  • Never scrape personal/sensitive data
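
The robots.txt guardrail can be checked in code before any request is sent. The sketch below uses Python's standard `urllib.robotparser` on an inline robots.txt so it runs offline; the rules, the `my-crawler` user agent, and the `build_parser` helper are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawl would download the site's file
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("my-crawler", "https://example.com/products")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
delay = parser.crawl_delay("my-crawler")
print(allowed, blocked, delay)
```

The `Crawl-delay` value, when present, is a natural lower bound for the `download_delay` setting mentioned above.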

References

  • references/mcp-setup.md — Detailed MCP configuration
  • references/anti-bot.md — Anti-bot handling strategies
  • references/proxy-rotation.md — Proxy setup and rotation
  • references/spider-recipes.md — Advanced crawling patterns
  • references/api-reference.md — Quick API reference
  • references/links.md — Official docs links

Scripts

  • scripts/scrapling_scrape.py — Quick one-off extraction
  • scripts/scrapling_smoke_test.py — Test connectivity and anti-bot indicators