Crawl4ai — Crawl4AI

Name: Crawl4ai — Crawl4AI
Rating: 2

v1.0.0

AI-powered web scraping framework for extracting structured data from websites. Use when Codex needs to crawl, scrape, or extract data from web pages using AI-powered parsing, handle dynamic content, or work with complex HTML structures.

by @codylrn804·MIT-0

Developer Tools · Code Generation · API Development · Data Analysis · Data Visualization

License

MIT-0

Free to use, modify, and redistribute without attribution.

Runtime Dependencies

No special dependencies.

Install Command

Official: npx clawhub@latest install crawl4ai

Accelerated mirror (available): npx clawhub@latest install crawl4ai --registry https://cn.longxiaskill.com

Skill Documentation

Crawl4AI Overview

Crawl4AI is an AI-powered web scraping framework designed to extract structured data from websites efficiently. It combines traditional HTML parsing with AI to handle dynamic content, extract text intelligently, and clean and structure data from complex web pages.

When to Use This Skill

Use when Codex needs to:

Extract structured data from web pages (products, articles, forms, tables, etc.)
Scrape websites with dynamic content or complex JavaScript
Clean and normalize extracted data from various HTML structures
Work with APIs or web services that return HTML
Handle CORS limitations by scraping directly
Process web content reliably at scale

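Trigger phrases:

"Extract data from this website"
"Scrape this page for [specific data]"
"Parse this HTML"
"Get data from [URL]"
"Extract structured information from [website]"
"Scrape [website] for [data type]"
"Web scrape [URL]"

Quick Start

Basic Usage

from crawl4ai import AsyncWebCrawler, BrowserMode

async def scrape_page(url):
    # BrowserMode and the keyword arguments are reproduced from this skill;
    # verify them against the installed crawl4ai version.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            browser_mode=BrowserMode.LATEST,
            headless=True
        )
        # Return the page as markdown together with the cleaned HTML
        return result.markdown, result.cleaned_html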

Extracting Structured Data

from crawl4ai import AsyncWebCrawler
import json

async def extract_products(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            screenshot=True,
            javascript=True,
            bypass_cache=True
        )
        # extracted_content is a JSON string (or None), so parse it first
        items = json.loads(result.extracted_content) if result.extracted_content else []
        # Extract product data
        products = []
        for item in items:
            if item.get('type') == 'product':
                products.append({
                    'name': item['name'],
                    'price': item['price'],
                    'url': item['url']
                })
        return products
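The product filter above depends on the shape of extracted_content. As a minimal sketch, assuming the payload may arrive as a JSON string, an already-parsed list, a single object, or None, a small helper can normalize it before filtering. The names coerce_items and filter_products are hypothetical helpers, not part of Crawl4AI, and the sample payload is made up for illustration.

```python
import json
from typing import Any


def coerce_items(extracted: Any) -> list[dict]:
    """Return a list of dict items from a JSON string, a list, a dict, or None."""
    if not extracted:
        return []
    if isinstance(extracted, str):
        try:
            extracted = json.loads(extracted)
        except json.JSONDecodeError:
            return []  # not JSON: treat as "nothing extracted"
    if isinstance(extracted, dict):
        extracted = [extracted]  # a single object becomes a one-item list
    if not isinstance(extracted, list):
        return []  # e.g. a bare JSON number or string
    return [item for item in extracted if isinstance(item, dict)]


def filter_products(extracted: Any) -> list[dict]:
    """Keep only items tagged type == 'product', with name, price, and url."""
    return [
        {"name": item.get("name"), "price": item.get("price"), "url": item.get("url")}
        for item in coerce_items(extracted)
        if item.get("type") == "product"
    ]


# Hypothetical payload, for illustration only.
sample = json.dumps([
    {"type": "product", "name": "Widget", "price": "9.99", "url": "/w"},
    {"type": "article", "name": "News"},
])
products = filter_products(sample)
```

Using .get for the fields means an item with a missing key yields None for that field instead of raising KeyError, which tends to be more forgiving on pages with inconsistent markup.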

Common Tasks

Web Scraping Basics

Scenario: User wants to scrape a website for all article titles.

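from crawl4ai import AsyncWebCrawler
import json

async def scrape_articles(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            javascript=True,
            verbose=True
        )
        # extracted_content is a JSON string (or None), so parse it first
        articles = json.loads(result.extracted_content) if result.extracted_content else []
        # Take each item's 'name', falling back to 'text', as the article title
        titles = [item.get('name', item.get('text', '')) for item in articles]
        return titles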

Trigger: "Scrape this site for article titles" or "Get all titles from [URL]"

Dynamic Content Handling

Scenario: Website loads data via JavaScript.

from crawl4ai import AsyncWebCrawler

async def scrape_dynamic_site(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            javascript=True,   # Wait for JS execution
            wait_for="body",   # Wait for specific element
            delay=1.5,         # Wait time after load
            headless=True
        )
        return result.markdown

Trigger: "Scrape this dynamic website" or "This page needs JavaScript to load data"
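Pages that render through JavaScript sometimes come back empty on the first attempt. As a hedged sketch, the loop below wraps any async fetch function and retries while the returned text is empty. The names fetch_with_retry and fake_fetch are hypothetical; in practice the fetch argument could be a function such as scrape_dynamic_site above, and the retry counts and delays are arbitrary defaults.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_with_retry(
    fetch: Callable[[str], Awaitable[str]],
    url: str,
    attempts: int = 3,
    delay: float = 0.5,
) -> str:
    """Call fetch(url) up to `attempts` times until it returns non-empty text."""
    for attempt in range(attempts):
        text = await fetch(url)
        if text and text.strip():
            return text
        if attempt < attempts - 1:
            # Linear backoff: wait a little longer after each empty response.
            await asyncio.sleep(delay * (attempt + 1))
    return ""  # every attempt came back empty


# Stub standing in for a real scraper: empty twice, then content.
calls = {"count": 0}


async def fake_fetch(url: str) -> str:
    calls["count"] += 1
    return "" if calls["count"] < 3 else f"# Content of {url}"


markdown = asyncio.run(fetch_with_retry(fake_fetch, "https://example.com", delay=0.01))
```

Passing the fetch function as a parameter keeps the retry logic independent of any particular scraping library, so it can be exercised with a stub as shown.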

Structured Data Extraction

Scenario: Extract specific fields like prices, descriptions, etc.

from crawl4ai import AsyncWebCrawler

async def extract_product_details(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            screenshot=True,
            # Collect name, price, and link from every .product element
            js_code="""
            const products = document.querySelectorAll('.product');
            return Array.from(products).map(p => ({
                name: p.querySelector('.name')?.textContent,
                price: p.querySelector('.price')?.textContent,
                url: p.querySelector('a')?.href
            }));
            """
        )
        return result.extracted_content

Trigger: "Extract product details from this page" or "Get price and name from [URL]"
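Values read from textContent usually carry surrounding whitespace and currency symbols. The sketch below shows one way to normalize such records after extraction. The helpers parse_price and normalize_product are hypothetical, and the price format handled (optional currency symbol, comma thousands separators, dot decimal) is an assumption that will not fit every locale.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    """Parse strings like ' $1,299.00 ' into a Decimal; None if unparseable."""
    if not raw:
        return None
    # First run of digits, allowing comma separators and one decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if not match:
        return None
    try:
        return Decimal(match.group(0).replace(",", ""))
    except InvalidOperation:
        return None


def normalize_product(item: dict) -> dict:
    """Collapse whitespace in the name and convert the price to a Decimal."""
    name = " ".join((item.get("name") or "").split())
    return {
        "name": name or None,
        "price": parse_price(item.get("price")),
        "url": item.get("url"),
    }


# Hypothetical record, shaped like the objects built by the js_code above.
raw_item = {"name": "  Blue\n  Widget ", "price": " $1,299.00 ", "url": "https://example.com/w"}
clean_item = normalize_product(raw_item)
```

Decimal is used instead of float so that prices keep their exact decimal representation.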

HTML Cleaning and Parsing

Scenario: Clean messy HTML and extract clean text.

from crawl4ai import AsyncWebCrawler

async def clean_and_parse(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            remove_tags=['script', 'style', 'nav', 'footer', 'header'],
            only_main_content=True
        )
        # Return the cleaned HTML
        clean_text = result.cleaned_html
        return clean_text

Trigger: "Clean this HTML" or "Extract main content from this page"
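To illustrate what removing tags such as script, style, nav, footer, and header amounts to, here is a standard-library sketch that drops those elements and keeps the remaining text. It uses Python's html.parser rather than Crawl4AI, so it does not show how the library works internally. TextExtractor and clean_html_text are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "footer", "header"}


class TextExtractor(HTMLParser):
    """Collect text that is not nested inside any of SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []     # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def clean_html_text(html: str) -> str:
    """Return the visible text of `html` with SKIP_TAGS content removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = """
<html><head><style>p { color: red; }</style></head>
<body><nav>Home | About</nav>
<h1>Title</h1><p>First paragraph.</p>
<script>console.log("x");</script>
<footer>Copyright</footer></body></html>
"""
cleaned = clean_html_text(sample_html)
```

The depth counter handles nested skipped elements, for example a nav inside a header, without losing track of where the skipped region ends.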

Advanced Features

Custom JavaScript Injection

from crawl4ai import AsyncWebCrawler

async def custom_scrape(url, custom_js):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            js_code=custom_js,
            js_only=True  # Only execute JS, don't download resources
        )
        return result.extracted_content

Session Management

from crawl4ai import AsyncWebCrawler

async def multi_page_scrape(base_url, urls):
    async with AsyncWebCrawler() as crawler:
        results = []
