Cloudflare Crawl

Crawl 网页sites using Cloudflare's Browser Rendering API. Use when you need to scrape entire sites, build knowledge bases, 提取 content from multiple pages, or 创建 RAG data设置s. Works on Cloudflare-保护ed sites. Returns HTML, Markdown, or AI-提取ed JSON.

0· 128·0 当前·0 累计

by @wirelessjoe (Joe Alicata)·MIT-0

文档工具 API开发数据分析数据可视化网络工具

下载技能包

License

MIT-0

License

MIT-0

可自由使用、修改和再分发，无需署名。

查看条款 ↗

运行时依赖

无特殊依赖

安装命令

点击复制

官方npx clawhub@latest install cloudflare-crawl-skill

镜像加速npx clawhub@latest install cloudflare-crawl-skill --registry https://cn.longxiaskill.com 镜像可用

需要定制？告诉我你的需求 →

技能文档

Cloudflare Crawl

Crawl entire 网页sites using Cloudflare's Browser Rendering /crawl API. A同步 job-based crawling with JS rendering.

When to Use Scrape entire sites (not just single pages) Build knowledge bases or RAG data设置s Re搜索 across multiple pages Sites 保护ed by Cloudflare (CF won't block itself) Need Markdown or structured JSON 输出 Prerequisites

获取 Cloudflare API 凭证s:

Go to https://dash.cloudflare.com/性能分析/API-令牌s 创建令牌 with Account.Browser Rendering 权限获取 your Account ID from 仪表盘 URL

设置环境 variables:

导出 CLOUDFLARE_API_令牌="your_令牌" 导出 CLOUDFLARE_ACCOUNT_ID="your_account_id"

Quick 启动 # 启动 a crawl job node scripts/crawl.js 启动 https://example.com --limit 50

# 检查状态 node scripts/crawl.js 状态

# 获取结果s as markdown node scripts/crawl.js 结果s --格式化 markdown

API Overview

启动 Crawl Job

curl -X POST "https://API.cloudflare.com/命令行工具ent/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/browser-rendering/crawl" \ -H "Authorization: Bearer $CLOUDFLARE_API_令牌" \ -H "Content-Type: 应用/json" \ -d '{ "url": "https://example.com", "limit": 50, "depth": 3, "格式化s": ["markdown"] }'

Returns: { "成功": true, "结果": "job_id_here" }

Poll for Completion

curl "https://API.cloudflare.com/命令行工具ent/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/browser-rendering/crawl/$JOB_ID?limit=1" \ -H "Authorization: Bearer $CLOUDFLARE_API_令牌"

状态 values: 运行ning, completed, errored, cancelled_due_to_timeout

获取结果s

curl "https://API.cloudflare.com/命令行工具ent/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/browser-rendering/crawl/$JOB_ID" \ -H "Authorization: Bearer $CLOUDFLARE_API_令牌"

Parameters Parameter Type Default Description url string required 启动ing URL limit number 10 Max pages to crawl (max 100,000) depth number 100000 Max link depth from 启动 URL source string "all" URL discovery: all, sitemaps, links 格式化s array ["html"] 输出: html, markdown, json render boolean true 执行 JS (false = fast HTML only) 输出格式化s Markdown (best for AI) { "url": "https://example.com/page", "状态": "completed", "markdown": "# Page Title\n\nContent here...", "metadata": { "title": "Page Title", "状态": 200 } }

JSON (AI-提取ed)

Uses Workers AI to 提取 structured data. Requires jsonOptions.prompt:

{ "格式化s": ["json"], "jsonOptions": { "prompt": "提取 product name, price, and description" } }

Pricing Plan Free Tier Overage Workers Free 10 min/day N/A Workers PAId 10 hrs/month $0.09/hour Limits Max 100,000 pages per crawl 7 day max 运行time 结果s avAIlable 14 days Free plan: 10 concurrent, 100 pages max Example: Crawl for RAG // Crawl docs site for knowledge base const job = awAIt 启动Crawl({ url: 'https://docs.example.com', limit: 500, 格式化s: ['markdown'], source: 'sitemaps' // Use sitemap for efficiency });

// WAIt for completion const 结果s = awAIt wAItForCrawl(job.id);

// Save markdown files for RAG for (const page of 结果s.records) { if (page.状态 === 'completed') { fs.writeFile同步(docs/${slugify(page.url)}.md, page.markdown); } }

vs Browserbase/Stagehand Use Case Cloudflare Crawl Browserbase Full site scrape ✅ Best ❌ Manual Interactive (forms) ❌ No ✅ Best CF-保护ed sites ✅ Native ⚠️ Cloud bypass AI 提取ion ✅ Built-in ✅ Stagehand 会话 management ✅ A同步 jobs ❌ Manual Cost $0.09/hr Credits

Use Cloudflare Crawl for bulk content 提取ion. Use Browserbase for interactive 自动化.

License

运行时依赖

安装命令

技能文档

相关技能推荐