PDF Processing Guide
Overview
Extract text and tables, create PDFs, merge and split files, and fill forms using Python libraries and command-line tools. Apply this skill for programmatic document processing and analysis. For advanced features or form filling, consult reference.md and forms.md.
Visual Enhancement with Scientific Schematics
When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.
If your document does not already contain schematics or diagrams:
Use the scientific-schematics skill to generate AI-powered, publication-quality diagrams. Simply describe your desired diagram in natural language; Nano Banana Pro will automatically generate, review, and refine the schematic.
For new documents: Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.
How to generate schematics:
python scripts/generate_schematic.py "your diagram description" -o figures/output.png
The AI will automatically:
Create publication-quality images with proper formatting
Review and refine through multiple iterations
Ensure accessibility (colorblind-friendly, high contrast)
Save outputs in the figures/ directory
When to add schematics:
PDF processing workflow diagrams
Document manipulation flowcharts
Form processing visualizations
Data extraction pipeline diagrams
Any complex concept that benefits from visualization
For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.
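The schematic command shown above can also be driven from a Python script. The following is a minimal sketch, assuming the generator lives at scripts/generate_schematic.py and accepts a description followed by -o as in the command above; the helper names build_schematic_command and generate_schematic are hypothetical and not part of the skill.

```python
import subprocess
import sys
from pathlib import Path


def build_schematic_command(description, output_path,
                            script="scripts/generate_schematic.py"):
    """Return the argument list for the schematic generator script.

    The script path and the -o flag mirror the command in this guide;
    nothing else about the script's interface is assumed.
    """
    return [sys.executable, script, description, "-o", str(output_path)]


def generate_schematic(description, output_path):
    """Run the generator; raises CalledProcessError if the script fails."""
    output_path = Path(output_path)
    # Make sure the target directory (for example figures/) exists first.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_schematic_command(description, output_path), check=True)
    return output_path
```

Passing the arguments as a list avoids shell quoting problems when the description contains spaces or quotes.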
Quick Start
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
Python Libraries
pypdf - Basic Operations
Merge PDFs
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
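When the input list comes from a directory listing rather than a hard-coded list, the merge order depends on how the filenames sort. The sketch below is one way to handle that, assuming numbered filenames such as doc2.pdf and doc10.pdf; natural_key, ordered_pdfs, and merge_pdfs are hypothetical helper names, and pypdf is imported inside merge_pdfs so the ordering helpers can be used on their own.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders 'doc2.pdf' before 'doc10.pdf'."""
    parts = re.split(r"(\d+)", Path(path).name.lower())
    return [int(p) if p.isdigit() else p for p in parts]


def ordered_pdfs(paths):
    """Return only the .pdf paths, in natural filename order."""
    pdfs = [Path(p) for p in paths if Path(p).suffix.lower() == ".pdf"]
    return sorted(pdfs, key=natural_key)


def merge_pdfs(paths, output_path):
    """Merge the given PDFs into output_path using pypdf."""
    from pypdf import PdfReader, PdfWriter  # imported lazily; requires pypdf

    writer = PdfWriter()
    for pdf_file in ordered_pdfs(paths):
        for page in PdfReader(pdf_file).pages:
            writer.add_page(page)
    with open(output_path, "wb") as output:
        writer.write(output)
```

A call such as merge_pdfs(Path("reports").iterdir(), "merged.pdf") would then merge every PDF in a (hypothetical) reports directory in numeric filename order.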
Split PDF
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
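The split example above writes one file per page. A variation is to split into fixed-size chunks of several pages each. This is a sketch under that assumption; chunk_ranges and split_pdf are hypothetical names, and the part_N.pdf output naming is chosen only for illustration.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return (start, stop) page-index pairs covering total_pages in chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]


def split_pdf(input_path, chunk_size, prefix="part"):
    """Write consecutive chunks of input_path to prefix_1.pdf, prefix_2.pdf, ..."""
    from pypdf import PdfReader, PdfWriter  # imported lazily; requires pypdf

    reader = PdfReader(input_path)
    outputs = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for number, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        name = f"{prefix}_{number}.pdf"
        with open(name, "wb") as output:
            writer.write(output)
        outputs.append(name)
    return outputs
```

Keeping the range arithmetic in its own function makes the off-by-one behaviour at the last chunk easy to check without a PDF on hand.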
Extract Metadata
reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
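Some PDFs carry no document information at all, in which case reader.metadata can be None and the attribute access above would fail; individual fields can also be missing. A small defensive wrapper, sketched here with a hypothetical metadata_summary helper and a stand-in object instead of a real PDF, could look like this:

```python
from types import SimpleNamespace


def metadata_summary(meta):
    """Collect common metadata fields, tolerating a missing object or field."""
    fields = ("title", "author", "subject", "creator")
    return {name: (getattr(meta, name, None) or "(not set)") for name in fields}


# Stand-in for reader.metadata, used only to demonstrate the helper.
sample = SimpleNamespace(title="Report", author=None)
summary = metadata_summary(sample)
print(summary["title"], summary["author"])
```

With a real file, the call would be metadata_summary(reader.metadata).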
Rotate Pages
reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
pdfplumber - Text and Table Extraction
Extract Text with Layout
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
Extract Tables
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
Advanced Table Extraction
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                # First row is treated as the header, the rest as data
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    # Writing .xlsx files requires an Excel engine such as openpyxl
    combined_df.to_excel("extracted_tables.xlsx", index=False)
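Using the first extracted row as column names works for clean tables, but pdfplumber returns None for empty cells, and real headers often contain line breaks or repeated labels, which can cause trouble when the tables are concatenated. The following sketch normalizes a header row before it is handed to pandas; clean_header is a hypothetical helper, and the column_N and _2 naming scheme is an arbitrary choice.

```python
def clean_header(row):
    """Turn a raw header row into unique, non-empty column names."""
    seen = {}
    names = []
    for position, cell in enumerate(row):
        # Collapse internal whitespace and line breaks; treat None as empty.
        name = " ".join(str(cell).split()) if cell is not None else ""
        name = name or f"column_{position + 1}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Repeated labels get a numeric suffix so every column is distinct.
        names.append(name if count == 0 else f"{name}_{count + 1}")
    return names
```

In the extraction loop this would be used as pd.DataFrame(table[1:], columns=clean_header(table[0])).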
reportlab - Create PDFs
Basic PDF Creation
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
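The height - 100 expressions above reflect the reportlab canvas coordinate system: the origin is the bottom-left corner of the page and units are points (72 per inch), so positions measured from the top of the page have to be subtracted from the page height. A small sketch of that arithmetic, with a hypothetical from_top helper and the letter page size written out as a constant so that it runs without reportlab installed:

```python
POINTS_PER_INCH = 72

# US letter in points (8.5 x 11 inches), matching reportlab.lib.pagesizes.letter.
LETTER = (612.0, 792.0)


def from_top(page_height, inches_from_top):
    """Convert a distance from the top edge (in inches) to a canvas y value."""
    return page_height - inches_from_top * POINTS_PER_INCH


width, height = LETTER
# One inch below the top edge of a letter page.
y = from_top(height, 1)
print(y)
```

A call such as c.drawString(100, from_top(height, 1), "Hello World!") would then place the text baseline one inch below the top edge.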
Create PDF with Multiple Pages
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)
Command-Line Tools
pdftotext (poppler-utils)
# Extract text
pdftotext input.pdf output.t