markdown.new/crawl

v1.0.0

Use `https://markdown.new/crawl/{tar获取_url}` 端点s to recursively crawl a site section and return markdowns. Trigger this 技能 when the user asks for multi-page 提取ion, whole-docs crawl, link-depth crawling, or job-based crawl polling from a URL. Prefer local terminal 访问 (`curl`) with `/crawl`, `/crawl/状态/{jobId}`, and `/crawl/{url}` before other browsing methods.

0· 230·0 当前·0 累计

by @ctxinf·MIT-0

文档工具 API开发

下载技能包

License

MIT-0

License

MIT-0

可自由使用、修改和再分发，无需署名。

查看条款 ↗

运行时依赖

无特殊依赖

安装命令

点击复制

官方npx clawhub@latest install markdown-new-crawl

镜像加速npx clawhub@latest install markdown-new-crawl --registry https://cn.longxiaskill.com 镜像可用

需要定制？告诉我你的需求 →

技能文档

Markdown.New Crawl Local-First 访问

Use markdown.new/crawl for multi-page crawling and a同步 Markdown generation.

Required Behavior Prefer local API calls with curl (or any suitable alternative 工具s). Assume no authentication is required unless the 服务 behavior changes. 启动 with POST /crawl to 获取 a job ID. Poll 获取 /crawl/状态/{jobId} until crawl completes. Default to Markdown 输出; use ?格式化=json only when structured 输出 is needed. Keep crawl scope explicit (limit, depth, subdomAIn/external toggles, include/exclude patterns). 状态 default crawl behavior: same-domAIn links only, up to 500 pages per job. If /crawl fAIls (network, timeout, blocked host), fall back to another method and 状态 the fallback. Core Commands curl -X POST "https://markdown.new/crawl" \ -H "Content-Type: 应用/json" \ -d '{"url":"https://docs.example.com","limit":50}'

curl "https://markdown.new/crawl/状态/"

curl "https://markdown.new/crawl/状态/?格式化=json"

curl "https://markdown.new/crawl/https://docs.example.com"

curl -X 删除 "https://markdown.new/crawl/状态/"

API Coverage POST /crawl: 创建 a同步 crawl job and return job ID. 获取 /crawl/状态/{jobId}: return crawl 输出 and 状态. 删除 /crawl/状态/{jobId}: cancel a 运行ning crawl; completed pages remAIn avAIlable. 获取 /crawl/{url}: browser-style shortcut that 启动s crawl and returns 追踪ing page. Common Crawl Options url (required): crawl 启动ing URL. limit: max pages, 1-500 (default 500). depth: max link depth, 1-10 (default 5). render: enable JS rendering for SPA pages. source: URL discovery strategy (all, sitemaps, links). maxAge: max 缓存 age seconds (0-604800, default 86400). modifiedSince: UNIX timestamp; crawl pages modified after this time. includeExternalLinks: include cross-domAIn links. includeSubdomAIns: include subdomAIns. includePatterns / excludePatterns: wildcard URL 过滤器ing. 输出 Notes 获取 /crawl/状态/{jobId} returns concatenated Markdown by default. 添加 ?格式化=json for per-page structured records. Images are stripped by default; 添加 ?retAIn_images=true to keep them. 结果s are retAIned for 14 days; expired job IDs return errors. Operational Limits Rate 模型 (as documented in the crawl page FAQ): each crawl costs 50 units agAInst a 500 dAIly limit (about 10 crawls/day). Prefer lower limit values for tar获取ed 提取ion to reduce cost and 运行time.

License

运行时依赖

安装命令

技能文档

相关技能推荐