Crawl4AI
v1.0.0
AI-powered web scraping framework for extracting structured data from websites. Use when Codex needs to crawl, scrape, or extract data from web pages using AI-powered parsing, handle dynamic content, or work with complex HTML structures.
Crawl4AI Overview
Crawl4AI is an AI-powered web scraping framework designed to extract structured data from websites efficiently. It combines traditional HTML parsing with AI to handle dynamic content, extract text intelligently, and clean and structure data from complex web pages.
When to Use This Skill
Use when Codex needs to:
- Extract structured data from web pages (products, articles, forms, tables, etc.)
- Scrape websites with dynamic content or complex JavaScript
- Clean and normalize extracted data from various HTML structures
- Work with APIs or web services that return HTML
- Handle CORS limitations by scraping directly
- Process web content at scale with reliability
Trigger phrases:
- "Extract data from this website"
- "Scrape this page for [specific data]"
- "Parse this HTML"
- "Get data from [URL]"
- "Extract structured information from [website]"
- "Scrape [website] for [data type]"
- "Web scrape [URL]"

Quick Start
Basic Usage

    from crawl4ai import AsyncWebCrawler, BrowserMode

    async def scrape_page(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                browser_mode=BrowserMode.LATEST,
                headless=True
            )
            return result.markdown, result.clean_html
Extracting Structured Data

    import json

    from crawl4ai import AsyncWebCrawler

    async def extract_products(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                screenshot=True,
                javascript=True,
                bypass_cache=True
            )
            # extracted_content is a JSON string (or None), so decode it first
            items = json.loads(result.extracted_content or "[]")
            # Extract product data
            products = []
            for item in items:
                if item.get('type') == 'product':
                    products.append({
                        'name': item['name'],
                        'price': item['price'],
                        'url': item['url']
                    })
            return products
Common Tasks
Web Scraping Basics

Scenario: User wants to scrape a website for all article titles.

    import json

    from crawl4ai import AsyncWebCrawler

    async def scrape_articles(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                javascript=True,
                verbose=True
            )
            # extracted_content is a JSON string (or None); decode it to a list
            articles = json.loads(result.extracted_content) if result.extracted_content else []
            # Prefer the 'name' field and fall back to 'text'
            titles = [item.get('name', item.get('text', '')) for item in articles]
            return titles

Trigger: "Scrape this site for article titles" or "Get all titles from [URL]"
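Scraped title lists often contain stray whitespace, empty entries, and duplicates (for example, the same headline in a sidebar and in the main feed). The following is a minimal sketch of a post-processing step, assuming each item is a dict with an optional 'name' or 'text' key as in the example above. The helper name `normalize_titles` is hypothetical and not part of Crawl4AI.

```python
def normalize_titles(items):
    """Return unique, non-empty titles in first-seen order."""
    seen = set()
    titles = []
    for item in items:
        # Prefer 'name', fall back to 'text' (hypothetical item shape).
        raw = item.get("name") or item.get("text") or ""
        # Collapse runs of whitespace, including newlines, into single spaces.
        title = " ".join(raw.split())
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles
```

Calling `normalize_titles(articles)` on the decoded list keeps the original ordering while dropping repeats.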
Dynamic Content Handling
Scenario: Website loads data via JavaScript.

    from crawl4ai import AsyncWebCrawler

    async def scrape_dynamic_site(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                javascript=True,   # Wait for JS execution
                wait_for="body",   # Wait for specific element
                delay=1.5,         # Wait time after load
                headless=True
            )
            return result.markdown

Trigger: "Scrape this dynamic website" or "This page needs JavaScript to load data"
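Dynamic pages sometimes fail on the first attempt because content has not finished loading, so a retry with a growing delay can help when scraping at scale. The sketch below is one generic way to do that with the standard library only. `retry_async` and its parameters are hypothetical names, not Crawl4AI features, and the broad `except Exception` is an assumption you would likely narrow.

```python
import asyncio


async def retry_async(make_call, attempts=3, base_delay=0.5):
    """Await make_call() up to `attempts` times, doubling the delay between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception as exc:  # assumption: narrow this to the errors you expect
            last_error = exc
            if attempt < attempts - 1:
                # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
                await asyncio.sleep(base_delay * (2 ** attempt))
    # Every attempt failed: surface the most recent error to the caller.
    raise last_error
```

A call such as `await retry_async(lambda: scrape_dynamic_site(url))` would wrap the scraper above. The lambda matters because a coroutine object can only be awaited once, so each retry needs a fresh one.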
Structured Data Extraction

Scenario: Extract specific fields like prices, descriptions, etc.

    from crawl4ai import AsyncWebCrawler

    async def extract_product_details(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                screenshot=True,
                js_code="""
                    const products = document.querySelectorAll('.product');
                    return Array.from(products).map(p => ({
                        name: p.querySelector('.name')?.textContent,
                        price: p.querySelector('.price')?.textContent,
                        url: p.querySelector('a')?.href
                    }));
                """
            )
            return result.extracted_content

Trigger: "Extract product details from this page" or "Get price and name from [URL]"
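Values read from `textContent` usually arrive as raw strings such as " $1,299.99 ", so a normalization pass is typically needed before the data is usable. Here is a minimal sketch, assuming records shaped like the name/price/url objects in the example above and US-style number formatting (comma as thousands separator). `parse_price` and `normalize_product` are hypothetical helpers, not library functions.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Return the first number found in `text` as a Decimal, or None."""
    if not text:
        return None
    # A digit, then digits or commas, then an optional decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    try:
        return Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        return None


def normalize_product(raw):
    """Trim the name, parse the price, and pass the URL through unchanged."""
    return {
        "name": (raw.get("name") or "").strip(),
        "price": parse_price(raw.get("price")),
        "url": raw.get("url"),
    }
```

`Decimal` is used instead of `float` so that prices are not altered by binary rounding.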
HTML Cleaning and Parsing

Scenario: Clean messy HTML and extract clean text.

    from crawl4ai import AsyncWebCrawler

    async def clean_and_parse(url):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                remove_tags=['script', 'style', 'nav', 'footer', 'header'],
                only_main_content=True
            )
            # Return the cleaned HTML
            clean_text = result.clean_html
            return clean_text

Trigger: "Clean this HTML" or "Extract main content from this page"
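To make concrete what dropping tags such as script, style, nav, footer, and header does to a page, here is a small standard-library sketch that keeps only the text outside those elements. It illustrates the idea and is not how Crawl4AI implements cleaning. `TextExtractor` and `visible_text` are hypothetical names, and the sketch assumes the skipped tags are properly closed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the skipped tags."""

    def __init__(self, skip_tags):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self._skip_depth = 0  # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html, skip_tags=("script", "style", "nav", "footer", "header")):
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Running `visible_text` on a fetched page gives a quick way to compare plain-text output against the crawler's cleaned result.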
Advanced Features
Custom JavaScript Injection

    from crawl4ai import AsyncWebCrawler

    async def custom_scrape(url, custom_js):
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                js_code=custom_js,
                js_only=True  # Only execute JS, don't download resources
            )
            return result.extracted_content
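Hand-writing the `custom_js` string for every site is error-prone, so it can help to generate it from a selector mapping. The sketch below builds a snippet in the same shape as the product example earlier in this document. `build_extraction_js` is a hypothetical helper, and whether the crawler surfaces the script's return value depends on the library version, so treat that as an assumption.

```python
import json


def build_extraction_js(item_selector, fields):
    """Build a JS snippet that returns one object per matching element.

    `fields` maps an output key to a (CSS selector, DOM property) pair.
    """
    # json.dumps yields safely quoted string literals that are valid in JS.
    props = ",\n".join(
        f"    {json.dumps(key)}: el.querySelector({json.dumps(sel)})?.[{json.dumps(prop)}]"
        for key, (sel, prop) in fields.items()
    )
    return (
        f"const items = document.querySelectorAll({json.dumps(item_selector)});\n"
        "return Array.from(items).map(el => ({\n"
        f"{props}\n"
        "}));"
    )
```

For instance, `build_extraction_js(".product", {"name": (".name", "textContent"), "url": ("a", "href")})` produces a string that could be passed as `custom_js` to the function above.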
Session Management

    from crawl4ai import AsyncWebCrawler

    async def multi_page_scrape(base_url, urls):
        async with AsyncWebCrawler() as crawler:
            results = []