Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging and splitting documents, and handling forms. Use it when Claude needs to fill in a PDF form or to programmatically process, generate, or analyze PDF documents at scale.
PDF Processing Guide

Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. If you need to fill out a PDF form, read forms.md and follow its instructions.
Quick Start

    from pypdf import PdfReader, PdfWriter

    # Read a PDF
    reader = PdfReader("document.pdf")
    print(f"Pages: {len(reader.pages)}")

    # Extract text
    text = ""
    for page in reader.pages:
        text += page.extract_text()
Python Libraries

pypdf - Basic Operations

Merge PDFs

    from pypdf import PdfWriter, PdfReader

    writer = PdfWriter()
    for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
        reader = PdfReader(pdf_file)
        for page in reader.pages:
            writer.add_page(page)

    with open("merged.pdf", "wb") as output:
        writer.write(output)
Split PDF

    reader = PdfReader("input.pdf")
    for i, page in enumerate(reader.pages):
        # Write each page to its own single-page file
        writer = PdfWriter()
        writer.add_page(page)
        with open(f"page_{i+1}.pdf", "wb") as output:
            writer.write(output)
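When the goal is to pull out a selection of pages rather than one file per page, a small helper can turn a range string such as "1-3,5" into zero-based indices suitable for `reader.pages`. This is a minimal sketch: the name `parse_page_ranges` and the range syntax are assumptions made for illustration and are not part of pypdf.

```python
def parse_page_ranges(spec, page_count):
    """Turn a 1-based range string like '1-3,5' into sorted zero-based indices.

    Hypothetical helper for illustration; not a pypdf function.
    """
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"Invalid page range: {part!r}")
        # Convert the inclusive 1-based range to zero-based indices
        indices.update(range(start - 1, end))
    return sorted(indices)


print(parse_page_ranges("1-3,5", page_count=10))
```

Each returned index can be passed to `reader.pages[i]` and added to a single `PdfWriter`, following the same pattern as the split loop.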
Extract Metadata

    reader = PdfReader("document.pdf")
    meta = reader.metadata
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
    print(f"Subject: {meta.subject}")
    print(f"Creator: {meta.creator}")
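Metadata fields are frequently unset, and as far as the pypdf documentation indicates, `reader.metadata` itself may be `None` when a file carries no document information. The sketch below formats whatever is present without raising. `format_metadata` is a hypothetical helper name, and the stand-in object only mimics the four attributes used in the metadata example.

```python
from types import SimpleNamespace

FIELDS = ("title", "author", "subject", "creator")


def format_metadata(meta):
    """Return 'Field: value' lines, using '(not set)' for missing values.

    Hypothetical helper for illustration; not part of pypdf.
    """
    lines = []
    for field in FIELDS:
        # getattr with a default also covers the case where meta is None
        value = getattr(meta, field, None)
        lines.append(f"{field.capitalize()}: {value if value else '(not set)'}")
    return lines


# Stand-in object for demonstration; in practice pass PdfReader(...).metadata
sample = SimpleNamespace(title="Annual Report", author=None, subject="", creator="Writer")
for line in format_metadata(sample):
    print(line)
```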
Rotate Pages

    reader = PdfReader("input.pdf")
    writer = PdfWriter()

    page = reader.pages[0]
    page.rotate(90)  # Rotate 90 degrees clockwise
    writer.add_page(page)

    with open("rotated.pdf", "wb") as output:
        writer.write(output)
pdfplumber - Text and Table Extraction

Extract Text with Layout

    import pdfplumber

    with pdfplumber.open("document.pdf") as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            print(text)
Extract Tables

    with pdfplumber.open("document.pdf") as pdf:
        for i, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for j, table in enumerate(tables):
                print(f"Table {j+1} on page {i+1}:")
                for row in table:
                    print(row)
Advanced Table Extraction

    import pandas as pd
    import pdfplumber

    with pdfplumber.open("document.pdf") as pdf:
        all_tables = []
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                if table:  # Check that the table is not empty
                    # First row is treated as the header
                    df = pd.DataFrame(table[1:], columns=table[0])
                    all_tables.append(df)

    # Combine all tables
    if all_tables:
        combined_df = pd.concat(all_tables, ignore_index=True)
        combined_df.to_excel("extracted_tables.xlsx", index=False)
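Extracted tables tend to need cleanup before they are useful: cells may come back as `None`, values may carry stray whitespace, and header cells can be blank. The following sketch normalizes one table, given as a list of rows, into a list of dictionaries using only the standard library. `table_to_records` and the `column_N` fallback names are assumptions for illustration, not pdfplumber features.

```python
def table_to_records(table):
    """Convert a header-plus-rows table (list of lists) into a list of dicts.

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    if not table or len(table) < 2:
        return []

    def clean(cell):
        # Treat missing cells as empty strings and trim whitespace
        return "" if cell is None else str(cell).strip()

    # Blank header cells get a positional fallback name
    header = [clean(cell) or f"column_{i + 1}" for i, cell in enumerate(table[0])]

    records = []
    for row in table[1:]:
        cells = [clean(cell) for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        records.append(dict(zip(header, cells)))
    return records


sample_table = [
    ["Name", None, "Qty "],
    ["Widget", "A", " 3"],
    [None, None, None],
]
print(table_to_records(sample_table))
```

Keeping this step separate from the pandas code makes it easy to inspect the cleaned rows before building a DataFrame from them.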
reportlab - Create PDFs

Basic PDF Creation

    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    c = canvas.Canvas("hello.pdf", pagesize=letter)
    width, height = letter

    # Add text
    c.drawString(100, height - 100, "Hello World!")
    c.drawString(100, height - 120, "This is a PDF created with reportlab")

    # Add a line
    c.line(100, height - 140, 400, height - 140)

    # Save
    c.save()
Create PDF with Multiple Pages

    from reportlab.lib.pagesizes import letter
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
    from reportlab.lib.styles import getSampleStyleSheet

    doc = SimpleDocTemplate("report.pdf", pagesize=letter)
    styles = getSampleStyleSheet()
    story = []

    # Add content
    title = Paragraph("Report Title", styles['Title'])
    story.append(title)
    story.append(Spacer(1, 12))

    body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
    story.append(body)
    story.append(PageBreak())

    # Page 2
    story.append(Paragraph("Page 2", styles['Heading1']))
    story.append(Paragraph("Content for page 2", styles['Normal']))

    # Build PDF
    doc.build(story)
Command-Line Tools

pdftotext (poppler-utils)

    # Extract text
    pdftotext input.pdf output.txt

    # Extract text preserving layout
    pdftotext -layout input.pdf output.txt

    # Extract specific pages
    pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
qpdf

    # Merge PDFs
    qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

    # Split pages
    qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
    qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

    # Rotate pages
    qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

    # Remove password
    qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
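When a merge has to run from a Python script, the qpdf command can be assembled as an argument list and passed to `subprocess`. This is a minimal sketch under the assumption that `qpdf` is on the PATH; `build_qpdf_merge_command` and `merge_with_qpdf` are hypothetical helper names, and the flags are the ones used in the qpdf merge command in this guide.

```python
import shutil
import subprocess


def build_qpdf_merge_command(inputs, output):
    """Build the argument list for: qpdf --empty --pages <inputs> -- <output>."""
    if not inputs:
        raise ValueError("At least one input file is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def merge_with_qpdf(inputs, output):
    """Run the merge; raises if qpdf is missing or exits with a non-zero status."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf is not installed or not on PATH")
    # A list of arguments (no shell) avoids quoting problems with file names
    subprocess.run(build_qpdf_merge_command(inputs, output), check=True)


print(build_qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf"))
```

Note that qpdf exits with status 3 when it finishes with warnings, so `check=True` treats that case as a failure; relax the check if warnings are acceptable for your files.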
pdftk (if available)

    # Merge
    pdftk file1.pdf file2.pdf cat output merged.pdf

    # Split
    pdftk input.pdf burst

    # Rotate
    pdftk input.pdf rotate 1east output rotated.pdf
Common Tasks

Extract Text from Scanned PDFs

    # Requires: pip install pytesseract pdf2image
    import pytesseract
    from pdf2image import convert_from_path

    # Convert PDF to images
    images = convert_from_path('scanned.pdf')

    # OCR each page
    text = ""
    for i, image in enumerate(images):
        text += f"Page {i+1}:\n"
        text += pytesseract.image_to_string(image)
        text += "\n\n"

    print(text)
Add Watermark

    from pypdf import PdfReader, PdfWriter

    # Create watermark (or load existing)
    watermark = PdfReader("watermark.pdf").pages[0]

    # Apply to all pages
    reader = PdfReader("document.pdf")
    writer = PdfWriter()
for page