Installation

# Clone the repository
git clone https://github.com/influenza-0/facebook-scraper.git
cd facebook-scraper

# Install dependencies
pip install -r requirements.txt
Configuration

Environment Variables

# Optional: Google API credentials (for enhanced discovery)
export GOOGLE_API_KEY="your_api_key"
export GOOGLE_CSE_ID="your_cse_id"

# Optional: proxy
export HTTP_PROXY="http://proxy:port"
export HTTPS_PROXY="http://proxy:port"
Running Discovery

# Discover pages by location
python -m src.main discover --location "Los Angeles" --category "restaurant" --type page

# Discover groups by category
python -m src.main discover --location "San Francisco" --category "technology" --type group

Running a Scrape

# Scrape a page
python -m src.main scrape --page-name examplebusiness --type page

# Scrape a group
python -m src.main scrape --page-name examplegroup --type group
Output Formats

JSON Output

python -m src.main scrape --page-name examplebusiness --output json

Writes data/output/examplebusiness.json

CSV Output

python -m src.main scrape --page-name examplebusiness --output csv

Writes data/output/examplebusiness.csv
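Either export can then be consumed from Python. A minimal sketch, assuming only the output path convention above (the helper name itself is hypothetical):

```python
import json
from pathlib import Path

def load_scraped_page(page_name: str, output_dir: str = "data/output") -> dict:
    """Load one scraped entity's JSON export from {output_dir}/{page_name}.json."""
    path = Path(output_dir) / f"{page_name}.json"
    if not path.exists():
        raise FileNotFoundError(f"no scrape output at {path}; run the scrape first")
    return json.loads(path.read_text(encoding="utf-8"))
```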
Advanced Options

Thumbnail Download

python -m src.main scrape --page-name examplebusiness --download-thumbnails

Resume an Interrupted Session

python -m src.main scrape --page-name examplebusiness --resume

Filtering Options

# Skip low-like pages
python -m src.main discover --location "Miami" --category "restaurant" --min-likes 1000

# Skip private groups
python -m src.main discover --location "New York" --category "fitness" --skip-private
Notes

- This tool is intended only for scraping publicly accessible content
- Respect Facebook's Terms of Service
- Large-scale scraping can get your IP banned; consider using a proxy
- Run during off-peak hours when possible
Troubleshooting

Browser Fails to Launch

Make sure Chromium is installed:

playwright install chromium

Detected as a Bot

- Use --stealth mode
- Add delays: --delay 2
- Use a proxy: --proxy http://proxy:port

No Content Returned

- Check that the page/group name is correct
- Confirm the page/group is public
- Try a different location or category
# Facebook Page & Group Scraper
Part of ScrapeClaw — a suite of production-ready, agentic social media scrapers for Instagram, YouTube, X/Twitter, and Facebook built with Python & Playwright, no API keys required.
A browser-based Facebook page and group discovery and scraping tool.
---
name: facebook-scraper
description: Discover and scrape Facebook pages and public groups from your browser.
emoji: 📘
version: 1.0.0
author: influenza
tags:
- facebook
- scraping
- social-media
- page-discovery
- group-discovery
- business-pages
metadata:
clawdbot:
requires:
bins:
- python3
- chromium
config:
stateDirs:
- data/output
- data/queue
- thumbnails
outputFormats:
- json
- csv
---
Overview
This skill provides a two-phase Facebook scraping system:
- Page/Group Discovery
- Browser Scraping
Features
- 🔍 - Discover Facebook pages and groups by location and category
- 🌐 - Full browser simulation for accurate scraping
- 🛡️ - Browser fingerprinting, human behavior simulation, and stealth scripts
- 📊 - Page/group info, stats, images, and engagement data
- 💾 - JSON/CSV export with downloaded thumbnails
- 🔄 - Resume interrupted scraping sessions
- ⚡ - Auto-skip private groups, low-like pages, empty profiles
- 📂 - Supports pages, groups, and public profiles via --type flag
Getting Google API Credentials (Optional)
- Go to Google Cloud Console
- Create a new project or select existing
- Enable "Custom Search API"
- Create API credentials → API Key
- Go to Programmable Search Engine
- Create a search engine with
facebook.com as the site to search
- Copy the Search Engine ID
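With both credentials in hand, a discovery query can be sanity-checked by hand. The endpoint and the key/cx/q/num parameters below are Google's documented Custom Search JSON API; the exact query string the skill issues is an assumption:

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_cse_url(api_key: str, cse_id: str, location: str, category: str) -> str:
    """Compose a Custom Search request for public Facebook pages in one location/category."""
    params = {
        "key": api_key,   # API key from the Cloud Console
        "cx": cse_id,     # Programmable Search Engine ID
        "q": f'site:facebook.com "{category}" "{location}"',
        "num": 10,        # API maximum per request
    }
    return f"{CSE_ENDPOINT}?{urlencode(params)}"
```

Opening the resulting URL in a browser should return JSON search results; an error payload usually means the key or Search Engine ID is wrong, or the Custom Search API is not enabled.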
Usage
Agent Tool Interface
For OpenClaw agent integration, the skill provides JSON output:
# Discover Facebook pages (returns JSON)
discover --location "Miami" --category "restaurant" --type page --output json

# Discover Facebook groups (returns JSON)
discover --location "New York" --category "fitness" --type group --output json
# Scrape single page (returns JSON)
scrape --page-name examplebusiness --output json
# Scrape single group (returns JSON)
scrape --page-name examplegroup --type group --output json
Output Data
Page/Group Data Structure
{
"page_name": "example_business",
"display_name": "Example Business",
"entity_type": "page",
"category": "Restaurant",
"subcategory": "Italian Restaurant",
"about": "Family-owned Italian restaurant since 1985",
"followers": 45000,
"page_likes": 42000,
"location": "Miami, FL",
"address": "123 Main St, Miami, FL 33101",
"phone": "+1-555-0123",
"email": "info@example.com",
"website": "https://example.com",
"hours": "Mon-Sat 11AM-10PM",
"is_verified": false,
"page_tier": "mid",
"profile_pic_local": "thumbnails/example_business/profile_abc123.jpg",
"cover_photo_local": "thumbnails/example_business/cover_def456.jpg",
"recent_posts": [
{"post_url": "https://facebook.com/example_business/posts/123", "reactions": 320, "comments": 45, "shares": 12}
],
"scrape_timestamp": "2026-02-20T14:30:00"
}
Group Data Structure
{
"page_name": "example_group",
"display_name": "Miami Fitness Community",
"entity_type": "group",
"about": "A community for fitness enthusiasts in Miami",
"members": 15000,
"privacy": "Public",
"posts_per_day": 25,
"location": "Miami",
"page_tier": "mid",
"profile_pic_local": "thumbnails/example_group/profile_abc123.jpg",
"cover_photo_local": "thumbnails/example_group/cover_def456.jpg",
"scrape_timestamp": "2026-02-20T14:30:00"
}
Page Tiers
| Tier | Likes/Members Range |
|---|---|
| nano | < 1,000 |
| micro | 1,000 - 10,000 |
| mid | 10,000 - 100,000 |
| macro | 100,000 - 1M |
| mega | > 1,000,000 |
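The tier boundaries above can be expressed as a small classifier (a hypothetical helper for downstream filtering, not the skill's actual code):

```python
def page_tier(count: int) -> str:
    """Map a like/member count to the tier names used in the table above."""
    if count < 1_000:
        return "nano"
    if count < 10_000:
        return "micro"
    if count < 100_000:
        return "mid"
    if count < 1_000_000:
        return "macro"
    return "mega"
```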
File Outputs
- Queue files: data/queue/{location}_{category}_{type}_{timestamp}.json
- Scraped data: data/output/{page_name}.json
- Thumbnails: thumbnails/{page_name}/profile_.jpg, thumbnails/{page_name}/cover_.jpg
- Export files: data/export_{timestamp}.json, data/export_{timestamp}.csv
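The queue filename pattern can be generated like so; only the overall `{location}_{category}_{type}_{timestamp}.json` shape comes from the list above, while the slugging and timestamp format are assumptions:

```python
from datetime import datetime
from pathlib import Path

def queue_path(location: str, category: str, entity_type: str,
               base: str = "data/queue") -> Path:
    """Build a queue file path following the pattern listed above."""
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")   # assumed timestamp format
    slug = location.lower().replace(" ", "_")
    return Path(base) / f"{slug}_{category}_{entity_type}_{ts}.json"
```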
Configuration
Edit config/scraper_config.json:
{
"google_search": {
"enabled": true,
"api_key": "",
"search_engine_id": "",
"queries_per_location": 3
},
"scraper": {
"headless": false,
"min_likes": 1000,
"download_thumbnails": true,
"max_thumbnails": 6
},
"cities": ["New York", "Los Angeles", "Miami", "Chicago"],
"categories": ["restaurant", "retail", "fitness", "real-estate", "healthcare", "beauty"]
}
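A sketch of reading that file with fallbacks for missing scraper keys — the defaults mirror the values shown above, but the loader itself is hypothetical:

```python
import json
from pathlib import Path

SCRAPER_DEFAULTS = {
    "headless": False,
    "min_likes": 1000,
    "download_thumbnails": True,
    "max_thumbnails": 6,
}

def load_config(path: str = "config/scraper_config.json") -> dict:
    """Read scraper_config.json, merging defaults for any missing scraper keys."""
    p = Path(path)
    cfg = json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}
    cfg["scraper"] = {**SCRAPER_DEFAULTS, **cfg.get("scraper", {})}
    return cfg
```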
Filters Applied
The scraper automatically filters out:
- ❌ Private groups
- ❌ Pages with < 1,000 likes (configurable)
- ❌ Deactivated or removed pages
- ❌ Non-existent pages/groups
- ❌ Already scraped entries (deduplication)
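These rules amount to a predicate applied to each discovered entry. A sketch using the field names from the data structures above (the helper itself is hypothetical):

```python
def should_scrape(entry: dict, seen: set, min_likes: int = 1000) -> bool:
    """Apply the skip rules: duplicates, private groups, and low-like pages."""
    name = entry.get("page_name")
    if not name or name in seen:          # non-existent or already scraped
        return False
    if entry.get("entity_type") == "group" and entry.get("privacy") != "Public":
        return False                      # private group
    if entry.get("entity_type") == "page" and entry.get("page_likes", 0) < min_likes:
        return False                      # below the configurable like threshold
    seen.add(name)
    return True
```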
Troubleshooting
Login Issues
- Ensure credentials are correct
- Handle verification codes when prompted
- Wait if rate limited (the script will auto-retry)
No Pages Discovered
- Check Google API key and quota
- Verify Search Engine ID is configured for facebook.com
- Try different location/category combinations
Rate Limiting
- Reduce scraping speed (increase delays)
- Use multiple Facebook accounts
- Run during off-peak hours
- Use a residential proxy (see below)
🌐 Residential Proxy Support
Why Use a Residential Proxy?
Running a scraper at scale without a residential proxy will get your IP blocked fast. Here's why proxies are essential for long-running scrapes:
| Advantage | Description |
|---|---|
| Avoid IP Bans | Residential IPs look like real household users, not data-center bots. Facebook is far less likely to flag them. |
| Automatic IP Rotation | Each request (or session) gets a fresh IP, so rate-limits never stack up on one address. |
| Geo-Targeting | Route traffic through a specific country/city so scraped content matches the target audience's locale. |
| Sticky Sessions | Keep the same IP for a configurable window (e.g. 10 min) — critical for maintaining a Facebook login session. |
| Higher Success Rate | Rotating residential IPs deliver 95%+ success rates compared to ~30% with data-center proxies on Facebook. |
| Long-Running Scrapes | Scrape thousands of pages/groups over hours or days without interruption. |
| Concurrent Scraping | Run multiple browser instances across different IPs simultaneously. |
Recommended Proxy Providers
We have affiliate partnerships with top residential proxy providers. Using these links supports continued development of this skill:
Setup Steps
1. Get Your Proxy Credentials
Sign up with any provider above, then grab:
- Username (from your provider dashboard)
- Password (from your provider dashboard)
- Host and Port are pre-configured per provider (or use custom)
2. Configure Entirely via Environment Variables
export PROXY_ENABLED=true
export PROXY_PROVIDER=netnut # brightdata | iproyal | stormproxies | netnut | custom
export PROXY_USERNAME=your_user
export PROXY_PASSWORD=your_pass
export PROXY_COUNTRY=us # optional: two-letter country code
export PROXY_STICKY=true # optional: keep same IP per session
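On the Python side, those variables might be consumed roughly as follows — a sketch of the documented `from_env` behaviour; the exact parsing is an assumption:

```python
import os
from typing import Optional

def proxy_settings_from_env() -> Optional[dict]:
    """Collect PROXY_* variables; returns None when PROXY_ENABLED is not 'true'."""
    if os.environ.get("PROXY_ENABLED", "").lower() != "true":
        return None
    return {
        "provider": os.environ.get("PROXY_PROVIDER", "custom"),
        "username": os.environ.get("PROXY_USERNAME", ""),
        "password": os.environ.get("PROXY_PASSWORD", ""),
        "country": os.environ.get("PROXY_COUNTRY"),   # optional, may be None
        "sticky": os.environ.get("PROXY_STICKY", "false").lower() == "true",
    }
```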
3. Provider-Specific Host/Port Defaults
These are auto-configured when you set the provider name:
| Provider | Host | Port |
|---|---|---|
| Bright Data | brd.superproxy.io | 22225 |
| IProyal | proxy.iproyal.com | 12321 |
| Storm Proxies | rotating.stormproxies.com | 9999 |
| NetNut | gw-resi.netnut.io | 5959 |
Override with "host" and "port" in config, or the PROXY_HOST / PROXY_PORT env vars, if your plan uses a different gateway.
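The defaults table reduces to a lookup plus that override. Hosts and ports below are copied from the table; the resolver is a hypothetical sketch:

```python
import os

PROVIDER_DEFAULTS = {
    "brightdata":   ("brd.superproxy.io", 22225),
    "iproyal":      ("proxy.iproyal.com", 12321),
    "stormproxies": ("rotating.stormproxies.com", 9999),
    "netnut":       ("gw-resi.netnut.io", 5959),
}

def resolve_gateway(provider: str) -> tuple:
    """Return (host, port), letting PROXY_HOST / PROXY_PORT override the defaults."""
    host, port = PROVIDER_DEFAULTS.get(provider, ("", 0))
    host = os.environ.get("PROXY_HOST", host)
    port = int(os.environ.get("PROXY_PORT", port))
    return host, port
```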
4. Custom Proxy Provider
For any other proxy service, set provider to custom and supply host/port manually:
{
"proxy": {
"enabled": true,
"provider": "custom",
"host": "your.proxy.host",
"port": 8080,
"username": "user",
"password": "pass"
}
}
Running the Scraper with Proxy
Once configured, the scraper picks up the proxy automatically — no extra flags needed:
# Discover and scrape as usual — proxy is applied automatically
python main.py discover --location "Miami" --category "restaurant" --type page
python main.py scrape --page-name examplebusiness

# The log will confirm the proxy is active:
# INFO - Proxy enabled:
# INFO - Browser using proxy: netnut → gw-resi.netnut.io:5959
Using the Proxy Manager Programmatically
from proxy_manager import ProxyManager

# From config (auto-reads config/scraper_config.json)
pm = ProxyManager.from_config()
# From environment variables
pm = ProxyManager.from_env()
# Manual construction
pm = ProxyManager(
provider="netnut",
username="your_user",
password="your_pass",
country="us",
sticky=True
)
# For Playwright browser context
proxy = pm.get_playwright_proxy()
# → {"server": "http://gw-resi.netnut.io:5959", "username": "user-country-us-session-abc123", "password": "pass"}
# For requests / aiohttp
proxies = pm.get_requests_proxy()
# → {"http": "http://user:pass@host:port", "https": "http://user:pass@host:port"}
# Force new IP (rotates session ID)
pm.rotate_session()
# Debug info
print(pm.info())
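The session-suffixed username in the Playwright example suggests a pattern like the following. This is inferred purely from the sample output above; every provider has its own username grammar, so treat it as a sketch:

```python
import secrets
from typing import Optional

def build_proxy_username(base: str, country: Optional[str], sticky: bool) -> str:
    """Append country and, for sticky sessions, a session id to the proxy username."""
    parts = [base]
    if country:
        parts += ["country", country]
    if sticky:
        parts += ["session", secrets.token_hex(4)]  # new token => provider assigns a new IP
    return "-".join(parts)
```

Rotating the session is then just regenerating the token, which matches the `rotate_session()` behaviour described above.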
Best Practices for Long-Running Scrapes
- Always use sticky sessions — Facebook requires consistent IPs during a login session. Set "sticky": true.
- Target the right country — Set "country": "us" (or your target region) so Facebook serves content in the expected locale.
- Combine with existing anti-detection — This scraper already has fingerprinting, stealth scripts, and human behavior simulation. The proxy is the final layer.
- Rotate sessions between accounts — Call pm.rotate_session() when switching Facebook accounts to get a fresh IP.
- Use delays — Even with proxies, respect delay_between_profiles in config (default 5-10s) to avoid aggressive patterns.
- Monitor your proxy dashboard — All providers (Bright Data, IProyal, Storm Proxies, NetNut) have dashboards showing bandwidth usage and success rates.
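The delay advice can be implemented as a jittered sleep around `delay_between_profiles` — a hypothetical helper; randomizing the interval avoids the fixed request rhythm that detection systems look for:

```python
import random
import time

def polite_sleep(min_s: float = 5.0, max_s: float = 10.0) -> float:
    """Sleep a random interval in [min_s, max_s] and report what was slept."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```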