📦 Scrapling Yoo: Skill Tool

v1.0.0

Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) for execution; this skill...

by @cccccqqqqq (C5Q5)·MIT-0

Web tools · Browser automation · API tools · Data analysis · Design tools

License

MIT-0

Last updated

2026/4/11

Security scan

VirusTotal

Suspicious

OpenClaw

Suspicious

medium confidence

The skill's files and instructions largely match a web-scraping guidance tool, but small metadata inconsistencies and guidance for proxy/captcha bypass (with external service references) warrant caution before installing or running it.

Assessment recommendations

This package appears to contain legitimate scraping guidance and helper scripts, but take these precautions before installing or running: 1) Verify provenance: check the upstream project (GitHub/PyPI) and confirm the maintainer matches the registry listing; the included _meta.json ownerId differs from the registry ownerId, which may indicate repackaging. 2) Inspect and pin the pip package: if you pip install scrapling, confirm the exact PyPI project name, version, and upstream repository. P...

Detailed analysis

ℹ Purpose and capabilities

Name/description match the included files: SKILL.md, reference docs, and helper scripts all describe and call a 'scrapling' Python package and MCP integration. However _meta.json contains a different ownerId/slug/version than the registry metadata, which suggests the package may have been copied or repackaged; verify provenance before trusting.

ℹ Instruction scope

Runtime instructions are focused on scraping, MCP integration, and anti-bot strategies. The guidance explicitly warns to avoid scraping private/paywalled data and includes escalation steps (stealthy fetchers, Cloudflare solving, proxy rotation). Those are within the stated purpose but expand the skill into potentially sensitive territory (anti-bot evasion). The SKILL.md and scripts do not instruct reading unrelated system files or exfiltrating data to hidden endpoints.

✓ Installation mechanism

This is an instruction-only skill with example pip install commands (pip install scrapling[mcp,etc.]). There is no bundled install script or downloaded binary. Relying on pip is typical, but you should confirm the package's identity on PyPI/GitHub before installing.

ℹ Credential requirements

The skill declares no required environment variables or credentials (good). But reference docs and recipes show examples that use proxies with credentials and API bearer tokens, and anti-bot instructions (solve_cloudflare) imply use of external solver services which commonly require keys—none are declared. That absence isn't malicious by itself but means the skill expects you to supply sensitive credentials when you choose to use features like proxy rotation or third-party solvers.

✓ Persistence and permissions

Skill is not always-enabled, has no requested persistent privileges, and does not modify other skills' configs. It simply contains guidance and helper scripts that run in the user's environment.

Security is layered; review the code before running it.

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Versions

latest · v1.0.0 · 2026/2/28

scrapling-yoo 1.0.0
- Initial release providing web scraping guidance, strategies, and best practices for the Scrapling framework integrated with MCP (mcporter/OpenClaw).
- Covers fetcher selection, anti-bot techniques, spider patterns, and adaptive scraping.
- Includes step-by-step quick start instructions for MCP integration and usage.
- Documents core concepts, spider framework usage, parser API, proxy rotation, CLI commands, and common scraping recipes.

Install command

Official: npx clawhub@latest install scrapling-yoo

Mirror (accelerated): npx clawhub@latest install scrapling-yoo --registry https://cn.longxiaskill.com

Skill documentation

Guidance Layer + MCP Integration
Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.

Quick Start (MCP)

1. Install Scrapling with MCP support

# Quote the extras so shells such as zsh do not expand the brackets
pip install "scrapling[mcp]"
# Or for full features:
pip install "scrapling[mcp,playwright]"
python -m playwright install chromium

2. Add to OpenClaw MCP config

{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"]
    }
  }
}
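
If the MCP config file already lists other servers, the entry above needs to be merged in, not pasted over the file. The following is a minimal sketch, assuming the config is a JSON object with an `mcpServers` key as shown above; the helper name `add_mcp_server` and the sample `existing` config are hypothetical and not part of OpenClaw or Scrapling.

```python
import json


def add_mcp_server(config: dict, name: str, command: str, args: list) -> dict:
    """Return a copy of config with one more entry under "mcpServers"."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config that already registers another server.
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}

# Same command and args as the JSON snippet above.
updated = add_mcp_server(existing, "scrapling", "python", ["-m", "scrapling.mcp"])

# Text that could be written back to the config file.
config_text = json.dumps(updated, indent=2)
```

The helper copies the input and leaves it unchanged, so a failed write does not leave a half-edited config in memory.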

3. Call via mcporter

mcporter call scrapling fetch_page --url "https://example.com"

Execution vs Guidance

Task	Tool	Example
Fetch a page	mcporter	`mcporter call scrapling fetch_page --url URL`
Extract with CSS	mcporter	`mcporter call scrapling css_select --selector ".title::text"`
Which fetcher to use?	This skill	See "Fetcher Selection Guide" below
Anti-bot strategy?	This skill	See "Anti-Bot Escalation Ladder"
Complex crawl patterns?	This skill	See "Spider Recipes"
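
When the mcporter calls in the table are issued from a script, building the argument list programmatically avoids shell-quoting mistakes. The sketch below only constructs the command; the helper `build_mcporter_call` is hypothetical, and the flag style (`--url`, `--auto-save`) is assumed from the examples in this document, not from mcporter's own reference.

```python
import shlex


def build_mcporter_call(server: str, tool: str, **options) -> list:
    """Build the argv list for `mcporter call <server> <tool> --key value ...`."""
    argv = ["mcporter", "call", server, tool]
    for key, value in options.items():
        # Python keyword auto_save becomes the flag --auto-save.
        argv += ["--" + key.replace("_", "-"), str(value)]
    return argv


argv = build_mcporter_call("scrapling", "fetch_page", url="https://example.com")

# Printable form for logs; pass argv itself to subprocess.run to execute it.
command_line = shlex.join(argv)
```

Passing `argv` as a list to `subprocess.run` (with mcporter installed) keeps URLs and selectors out of shell interpretation.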

Fetcher Selection Guide

┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Fetcher       │────▶│ DynamicFetcher   │────▶│ StealthyFetcher  │
│   (HTTP)        │     │ (Browser/JS)     │     │ (Anti-bot)       │
└─────────────────┘     └──────────────────┘     └──────────────────┘
     Fastest              JS-rendered               Cloudflare, 
     Static pages         SPAs, React/Vue          Turnstile, etc.

Decision Tree

Static HTML? → Fetcher (10-100x faster)
Need JS execution? → DynamicFetcher
Getting blocked? → StealthyFetcher
Complex session? → Use Session variants
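
The decision tree above can be written as a small pure function. This is an illustrative sketch, not part of Scrapling: `choose_fetcher` is a hypothetical helper that returns class names as strings, using only the fetcher and session names that appear elsewhere in this document.

```python
def choose_fetcher(needs_js: bool = False, blocked: bool = False,
                   needs_session: bool = False) -> str:
    """Map the decision tree to a fetcher (or session) class name."""
    if blocked:
        name = "StealthyFetcher"   # getting blocked outranks everything else
    elif needs_js:
        name = "DynamicFetcher"    # page needs JavaScript execution
    else:
        name = "Fetcher"           # static HTML, plain HTTP is enough
    if needs_session:
        # Session variants as named in this document.
        name = {
            "Fetcher": "FetcherSession",
            "DynamicFetcher": "DynamicSession",
            "StealthyFetcher": "StealthySession",
        }[name]
    return name
```

Checking `blocked` first reflects the ladder's ordering: once a site blocks requests, the stealth tier is needed whether or not the page also requires JavaScript.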

MCP Fetch Modes

fetch_page — HTTP fetcher
fetch_dynamic — Browser-based with Playwright
fetch_stealthy — Anti-bot bypass mode

Anti-Bot Escalation Ladder

Level 1: Polite HTTP

# MCP call: fetch_page with options
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "delay": 2.0
}

Level 2: Session Persistence

# Use sessions for cookie/state across requests
FetcherSession(impersonate="chrome")  # TLS fingerprint spoofing

Level 3: Stealth Mode

# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # Auto-solve Turnstile
    network_idle=True
)

Level 4: Proxy Rotation

See references/proxy-rotation.md
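
The ladder can be driven by a loop that moves to the next level only when the current one looks blocked. The sketch below is a minimal illustration under stated assumptions: each level is a plain callable returning `(status_code, body)`, the status codes in `BLOCK_STATUSES` are an assumed heuristic for "blocked", and the two stub functions stand in for real fetchers so the example runs offline. None of these names come from Scrapling.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetch step takes a URL and returns (status_code, body). Hypothetical type.
FetchStep = Callable[[str], Tuple[int, str]]

# Assumed "blocked" signals; tune per target site.
BLOCK_STATUSES = {403, 429, 503}


def fetch_with_escalation(
    url: str, ladder: Sequence[Tuple[str, FetchStep]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each level in order; return (level_name, body) for the first unblocked one."""
    for level_name, step in ladder:
        status, body = step(url)
        if status not in BLOCK_STATUSES:
            return level_name, body
    return None, None  # every level was blocked


# Stub steps standing in for real fetchers.
def polite_http(url: str) -> Tuple[int, str]:
    return 403, ""


def stealth(url: str) -> Tuple[int, str]:
    return 200, "<html>ok</html>"


level, body = fetch_with_escalation(
    "https://example.com", [("polite-http", polite_http), ("stealth", stealth)]
)
```

Starting at the cheapest level and escalating only on a block signal keeps most requests on the fast HTTP path.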

Adaptive Scraping (Anti-Fragile)

Scrapling can survive website redesigns using adaptive selectors:

# First run: save fingerprints
products = page.css('.product', auto_save=True)

# Later runs: auto-relocate if the DOM changed
products = page.css('.product', adaptive=True)

MCP usage:

mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true

Spider Framework (Large Crawls)

When to use Spiders vs direct fetching:

✅ Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
✅ Direct: 1-5 pages, quick extraction, simple flow

Basic Spider Pattern

from scrapling.spiders import Spider, Response


class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0
    
    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }
        
        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)


# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")

Advanced: Multi-Session Spider

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession


class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")

Spider Features

Pause/Resume: crawldir parameter saves checkpoints
Streaming: async for item in spider.stream() for real-time processing
Auto-retry: Configurable retry on blocked requests
Export: Built-in to_json(), to_jsonl()

CLI & Interactive Shell

Terminal Extraction (No Code)

# Extract to markdown
scrapling extract get 'https://example.com' content.md

# Extract specific element
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'

# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare

Interactive Shell

scrapling shell

# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')

Parser API (Beyond CSS/XPath)

BeautifulSoup-Style Methods

import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))

# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')

# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children

# Similarity
similar = first.find_similar()  # Find visually/structurally similar elements
below = first.below_elements()  # Elements below in DOM

Auto-Generated Selectors

# Get robust selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector()  # Returns stable CSS path
xpath = element.auto_xpath()

Proxy Rotation

from scrapling.fetchers import FetcherSession
from scrapling.spiders import ProxyRotator

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
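
To make the "cyclic" strategy concrete, here is a standard-library sketch of round-robin selection. It illustrates the idea only and is not Scrapling's `ProxyRotator` implementation; the class name `CyclicProxies` is hypothetical, and the proxy URLs are the placeholders used above.

```python
from itertools import cycle


class CyclicProxies:
    """Hand out proxies in round-robin order (illustration only)."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(proxies)

    def next(self) -> str:
        # Inside the method body, next() is the builtin, applied to the iterator.
        return next(self._cycle)


pool = CyclicProxies(["http://proxy1:8080", "http://proxy2:8080"])

# Three picks from a two-proxy pool wrap around to the first proxy.
picked = [pool.next() for _ in range(3)]
```

Round-robin spreads requests evenly across the pool, which helps when the per-proxy rate limit is the bottleneck.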

Common Recipes

Pagination Patterns

# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

# Next button (inside a spider's parse callback).
# Use `if`, not `while`: each response yields at most one follow-up request,
# and looping on the same response would never terminate.
next_page = response.css('.next a::attr(href)').get()
if next_page:
    yield response.follow(next_page)

# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
    page = session.fetch(url)
    page.scroll_to_bottom()
    items = page.css('.item').getall()
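
For the page-number pattern, building URLs with `urllib.parse` is safer than string formatting when the base URL already carries query parameters. A minimal sketch, assuming the site paginates with a `page` query parameter; the helper name `page_urls` and the `sort=price` parameter are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(base_url: str, first: int, last: int, param: str = "page"):
    """Yield base_url with param=N for N in first..last, keeping other query params."""
    parts = urlsplit(base_url)
    # Drop any existing page parameter so it is not duplicated.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    for n in range(first, last + 1):
        new_query = urlencode(query + [(param, str(n))])
        yield urlunsplit(parts._replace(query=new_query))


urls = list(page_urls("https://example.com/products?sort=price", 1, 3))
```

Each generated URL can then be passed to whichever fetcher or `mcporter call` fits the site.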

Login Sessions

with StealthySession(headless=False) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('input[name="username"]', 'user')
    login_page.fill('input[name="password"]', 'pass')
    login_page.click('button[type="submit"]')
    
    # Now session has cookies
    protected_page = session.fetch('https://example.com/dashboard')

Next.js Data Extraction

# Extract JSON from __NEXT_DATA__
import json
import re

next_data = json.loads(
    re.search(
        r'__NEXT_DATA__" type="application/json">(.*?)</script>',
        page.html_content,
        re.S
    ).group(1)
)
props = next_data['props']['pageProps']
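
The same extraction can be tried offline against a sample document. The sketch below uses only the standard library; the `html` string is a made-up page, and `extract_next_data` is a hypothetical helper that returns `None` when the tag is missing, so a page without the tag does not raise on `.group(1)`.

```python
import json
import re

# Hypothetical page source; a real page would come from page.html_content.
html = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Demo", "items": [1, 2]}}}'
    '</script></body></html>'
)

# Non-greedy capture up to the closing tag; re.S lets "." span newlines.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.S
)


def extract_next_data(source: str):
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    match = NEXT_DATA_RE.search(source)
    return json.loads(match.group(1)) if match else None


page_props = extract_next_data(html)["props"]["pageProps"]
```

Anchoring the non-greedy group on `</script>` matters: without a terminator the group matches the empty string and `json.loads` fails.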

Output Formats

# JSON (pretty)
result.items.to_json('output.json')

# JSONL (streaming, one per line)
result.items.to_jsonl('output.jsonl')

# Python objects
for item in result.items:
    print(item['title'])
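
For readers unfamiliar with JSONL, the format is one JSON object per line, which lets a consumer process items without loading the whole file. The sketch below shows the format with the standard library only; it is not how `to_jsonl()` is implemented, and `write_jsonl` and `read_jsonl` are hypothetical helpers. An in-memory buffer stands in for a file.

```python
import io
import json


def write_jsonl(items, stream) -> int:
    """Write one JSON object per line; return the number of lines written."""
    count = 0
    for item in items:
        stream.write(json.dumps(item, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(stream):
    """Yield one parsed object per non-empty line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


buffer = io.StringIO()  # stands in for an open file such as output.jsonl
written = write_jsonl([{"title": "A"}, {"title": "B"}], buffer)
buffer.seek(0)
titles = [item["title"] for item in read_jsonl(buffer)]
```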

Performance Tips

Use HTTP fetcher when possible — 10-100x faster than browser
Impersonate browsers — impersonate='chrome' for TLS fingerprinting
HTTP/3 support — FetcherSession(http3=True)
Limit resources — disable_resources=True in Dynamic/Stealthy
Connection pooling — Reuse sessions across requests

Guardrails (Always)

Only scrape content you're authorized to access
Respect robots.txt and ToS
Add delays (download_delay) for large crawls
Don't bypass paywalls or authentication without permission
Never scrape personal/sensitive data
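
The robots.txt guardrail can be checked in code before any request is sent. This is a minimal sketch using Python's standard `urllib.robotparser`; the rules in `ROBOTS_TXT` and the user agent `my-crawler` are made-up examples, and a live check would load the real file with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, user_agent: str = "my-crawler") -> bool:
    """True when the parsed robots.txt permits user_agent to fetch url."""
    return robots.can_fetch(user_agent, url)


public_ok = allowed("https://example.com/products")
private_ok = allowed("https://example.com/private/report")

# Crawl-delay, if declared, is a reasonable floor for download_delay.
delay = robots.crawl_delay("my-crawler")
```

A robots.txt check covers only one guardrail; authorization, terms of service, and personal-data limits still need a human decision.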

References

references/mcp-setup.md — Detailed MCP configuration
references/anti-bot.md — Anti-bot handling strategies
references/proxy-rotation.md — Proxy setup and rotation
references/spider-recipes.md — Advanced crawling patterns
references/api-reference.md — Quick API reference
references/links.md — Official docs links

Scripts

scripts/scrapling_scrape.py — Quick one-off extraction
scripts/scrapling_smoke_test.py — Test connectivity and anti-bot indicators

Data source: ClawHub · Chinese localization: 龙虾技能库