Links to PDFs
v0.0.1
Scrape documents from Notion, DocSend, PDFs, and other sources into local PDF files. Use when the user needs to download, archive, or convert web documents to PDF format. Supports authentication flows for protected documents and session persistence via profiles. Returns local file paths to downloaded PDFs.
docs-scraper
CLI tool that scrapes documents from various sources into local PDF files using browser automation.
Installation
npm install -g docs-scraper
Quick start
Scrape any document URL to PDF:
docs-scraper scrape https://example.com/document
Returns local path: ~/.docs-scraper/output/1706123456-abc123.pdf
Basic scraping
Scrape with daemon (recommended, keeps browser warm):
docs-scraper scrape <url>
Scrape with named profile (for authenticated sites):
docs-scraper scrape <url> -p <profile-name>
Scrape with pre-filled data (e.g., email for DocSend):
docs-scraper scrape <url> -D email=user@example.com
Direct mode (single-shot, no daemon):
docs-scraper scrape <url> --no-daemon
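When the CLI is driven from a script, the options above can be assembled programmatically. The following is a minimal sketch, assuming the binary is named docs-scraper and that options follow the URL; build_scrape_command is a hypothetical helper, not part of the tool, and it uses only the flags shown in this section (-p, -D, --no-daemon).

```python
from typing import Optional


def build_scrape_command(
    url: str,
    profile: Optional[str] = None,
    data: Optional[dict] = None,
    use_daemon: bool = True,
) -> list:
    """Assemble an argv list for `docs-scraper scrape` (hypothetical helper)."""
    argv = ["docs-scraper", "scrape", url]
    if profile:
        # -p selects a named profile for authenticated sites
        argv += ["-p", profile]
    for key, value in (data or {}).items():
        # each pre-filled field becomes its own -D key=value pair
        argv += ["-D", f"{key}={value}"]
    if not use_daemon:
        # direct, single-shot mode
        argv.append("--no-daemon")
    return argv


if __name__ == "__main__":
    print(build_scrape_command(
        "https://docsend.com/view/abc123",
        data={"email": "user@example.com"},
    ))
```

The list form can be passed to subprocess.run without shell quoting concerns.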
Authentication workflow
When a document requires authentication (login, email verification, passcode):
Initial scrape returns a job ID:
docs-scraper scrape https://docsend.com/view/xxx
# Output: Scrape blocked
# Job ID: abc123
Retry with data:
docs-scraper update abc123 -D email=user@example.com
# or with password
docs-scraper update abc123 -D email=user@example.com -D password=1234
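The two-step flow above can be sketched in a script: read the job ID from the blocked output, then build the retry command. This is a rough illustration only. It assumes the blocked output contains a line of the form "Job ID: abc123", as in the example above; the real output format may differ. extract_job_id and build_update_command are hypothetical helpers.

```python
import re
from typing import Optional

# Assumed format, taken from the example output: "Job ID: abc123"
JOB_ID_PATTERN = re.compile(r"Job ID:\s*(\S+)")


def extract_job_id(output: str) -> Optional[str]:
    """Return the job ID found in the scrape output, or None if absent."""
    match = JOB_ID_PATTERN.search(output)
    return match.group(1) if match else None


def build_update_command(job_id: str, data: dict) -> list:
    """Assemble an argv list for `docs-scraper update` (hypothetical helper)."""
    argv = ["docs-scraper", "update", job_id]
    for key, value in data.items():
        argv += ["-D", f"{key}={value}"]
    return argv


if __name__ == "__main__":
    blocked_output = "Scrape blocked\nJob ID: abc123\n"
    job_id = extract_job_id(blocked_output)
    if job_id is not None:
        print(build_update_command(job_id, {"email": "user@example.com"}))
```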
Profile management
Profiles store session cookies for authenticated sites.
docs-scraper profiles list # List saved profiles
docs-scraper profiles clear # Clear all profiles
docs-scraper scrape <url> -p myprofile # Use a profile
Daemon management
The daemon keeps browser instances warm for faster scraping.
docs-scraper daemon status # Check status
docs-scraper daemon start # Start manually
docs-scraper daemon stop # Stop daemon
Note: The daemon auto-starts when running scrape commands.
Cleanup
PDFs are stored in ~/.docs-scraper/output/. The daemon automatically cleans up files older than 1 hour.
Manual cleanup:
docs-scraper cleanup # Delete all PDFs
docs-scraper cleanup --older-than 1h # Delete PDFs older than 1 hour
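Because output files are removed after an hour, a script that consumes them may want to check which PDFs are already past that age. The sketch below is not the daemon's implementation. It assumes file age is judged by modification time, and find_stale_pdfs is a hypothetical helper that only lists files and deletes nothing.

```python
import time
from pathlib import Path
from typing import Optional


def find_stale_pdfs(
    directory: Path,
    max_age_seconds: float = 3600.0,
    now: Optional[float] = None,
) -> list:
    """List PDFs in `directory` whose mtime is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    # Only top-level *.pdf files are considered; nothing is deleted here.
    return sorted(
        path for path in directory.glob("*.pdf")
        if path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    output_dir = Path.home() / ".docs-scraper" / "output"
    if output_dir.is_dir():
        for pdf in find_stale_pdfs(output_dir):
            print(pdf)
```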
Job management
docs-scraper jobs list # List blocked jobs awaiting auth
Supported sources
Direct PDF links - Downloads PDF directly
Notion pages - Exports Notion page to PDF
DocSend documents - Handles DocSend viewer
LLM fallback - Uses Claude API for any other webpage
Scraper Reference
Each scraper accepts specific -D data fields. Use the appropriate fields based on the URL type.
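To decide which -D fields apply, it helps to know which scraper a URL will reach. The sketch below approximates the "Handles" patterns listed for each scraper in this reference. It is an illustration rather than the tool's actual routing logic: the matching order, the treatment of www.notion.so, and the pick_scraper helper itself are assumptions.

```python
from urllib.parse import urlparse


def pick_scraper(url: str) -> str:
    """Guess which scraper handles `url`, based on the documented patterns."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path

    # URLs ending in .pdf are downloaded directly
    if path.lower().endswith(".pdf"):
        return "DirectPdfScraper"

    # docsend.com and its subdomains, under /view/ or /v/
    is_docsend_host = host == "docsend.com" or host.endswith(".docsend.com")
    if is_docsend_host and path.startswith(("/view/", "/v/")):
        return "DocSendScraper"

    # notion.so pages and *.notion.site custom domains
    # (accepting www.notion.so is an assumption)
    if host in ("notion.so", "www.notion.so") or host.endswith(".notion.site"):
        return "NotionScraper"

    # Anything else goes to the automatic fallback
    return "LlmFallbackScraper"


if __name__ == "__main__":
    for example in (
        "https://example.com/document.pdf",
        "https://org-a.docsend.com/view/abc123",
        "https://docs.company.notion.site/Page-abc123",
        "https://example.com/article",
    ):
        print(example, "->", pick_scraper(example))
```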
DirectPdfScraper
Handles: URLs ending in .pdf
Data fields: None (downloads directly)
Example:
docs-scraper scrape https://example.com/document.pdf
DocSendScraper
Handles: docsend.com/view/, docsend.com/v/, and subdomains (e.g., org-a.docsend.com)
URL patterns:
Documents: https://docsend.com/view/{id} or https://docsend.com/v/{id}
Folders: https://docsend.com/view/s/{id}
Subdomains: https://{subdomain}.docsend.com/view/{id}
Data fields:
Field     Type      Description
email     email     Email address for document access
password  password  Passcode/password for protected documents
name      text      Your name (required for NDA-gated documents)
Examples:
# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com
# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D password=secret123
# With NDA name requirement
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D name="John Doe"
# Retry blocked job
docs-scraper update abc123 -D email=user@example.com -D password=secret123
Notes:
DocSend may require any combination of email, password, and name.
Folders are scraped as a table-of-contents PDF with document links.
The scraper auto-checks NDA checkboxes when name is provided.
NotionScraper
Handles: notion.so/*, *.notion.site/*
Data fields:
Field     Type      Description
email     email     Notion account email
password  password  Notion account password
Examples:
# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123
# Private page with login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -D email=user@example.com -D password=mypassword
# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123
Notes:
Public Notion pages don't require authentication.
Toggle blocks are automatically expanded before PDF generation.
Uses session profiles to persist login across scrapes.
LlmFallbackScraper
Handles: Any URL not matched by other scrapers (automatic fallback)
Data fields: Dynamic - determined by Claude analyzing the page
The LLM scraper uses Claude to analyze the page HTML and detect:
Login forms (extracts field names dynamically)
Cookie banners (auto-dismisses)
Expandable content (auto-expands)
CAPTCHAs (reports as blocked)
Paywalls (reports as blocked)
Common dynamic fields:
Field     Type      Description
email     email     Login email (if detected)
password  password  Login password (if detected)
username  text      Username (if login uses username)
Examples:
# Generic webpage (no auth)
docs-scraper scrape https://example.com/article
# Webpage requiring login
docs-scraper scrape https://members.example.com/article \