📄 MCP PDF
A FastMCP server for PDF processing
41 tools for text extraction, OCR, tables, forms, annotations, and more
Works great with MCP Office Tools
What It Does
MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.
Core capabilities:
- Text extraction via PyMuPDF, pdfplumber, or pypdf (auto-fallback)
- Table extraction via Camelot, pdfplumber, or Tabula (auto-fallback)
- OCR for scanned documents via Tesseract
- Form handling - extract, fill, and create PDF forms
- Document assembly - merge, split, reorder pages
- Annotations - sticky notes, highlights, stamps
- Vector graphics - extract to SVG for schematics and technical drawings
Quick Start
# Run the server straight from PyPI (uvx fetches it on demand)
uvx mcp-pdf
# Or add to Claude Code
claude mcp add pdf-tools uvx mcp-pdf
Development Installation
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync
# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript
# Verify
uv run python examples/verify_installation.py
Tools
Content Extraction
| Tool | What it does |
|---|---|
| extract_text | Pull text from PDF pages with automatic chunking for large files |
| extract_tables | Extract tables to JSON, CSV, or Markdown |
| extract_images | Extract embedded images |
| extract_links | Get all hyperlinks with page filtering |
| pdf_to_markdown | Convert PDF to Markdown, preserving structure |
| ocr_pdf | OCR scanned documents using Tesseract |
| extract_vector_graphics | Export vector graphics to SVG (schematics, charts, drawings) |
Document Analysis
| Tool | What it does |
|---|---|
| extract_metadata | Get title, author, creation date, page count, etc. |
| get_document_structure | Extract table of contents and bookmarks |
| analyze_layout | Detect columns, headers, footers |
| is_scanned_pdf | Check if PDF needs OCR |
| compare_pdfs | Diff two PDFs by text, structure, or metadata |
| analyze_pdf_health | Check for corruption and optimization opportunities |
| analyze_pdf_security | Report encryption, permissions, signatures |
Forms
| Tool | What it does |
|---|---|
| extract_form_data | Get form field names and values |
| fill_form_pdf | Fill form fields from JSON |
| create_form_pdf | Create new forms with text fields, checkboxes, dropdowns |
| add_form_fields | Add fields to existing PDFs |
Document Assembly
| Tool | What it does |
|---|---|
| merge_pdfs | Combine multiple PDFs with bookmark preservation |
| split_pdf_by_pages | Split by page ranges |
| split_pdf_by_bookmarks | Split at chapter/section boundaries |
| reorder_pdf_pages | Rearrange pages in custom order |
Annotations
| Tool | What it does |
|---|---|
| add_sticky_notes | Add comment annotations |
| add_highlights | Highlight text regions |
| add_stamps | Add Approved/Draft/Confidential stamps |
| extract_all_annotations | Export annotations to JSON |
How Fallbacks Work
The server tries multiple libraries for each operation, in the order listed:
Text extraction:
- PyMuPDF (fastest)
- pdfplumber (better for complex layouts)
- pypdf (most compatible)
Table extraction:
- Camelot (best accuracy, requires Ghostscript)
- pdfplumber (no dependencies)
- Tabula (requires Java)
If a PDF fails with one library, the next is tried automatically.
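The fallback chain described above can be sketched roughly as follows. This is an illustration of the pattern, not the server's actual code: `extract_with_fallback`, `ExtractionError`, and the three `fake_*` backends are hypothetical names, and treating empty output as a failure is an assumption.

```python
from typing import Callable, Sequence


class ExtractionError(Exception):
    """Raised when every backend in the chain has failed."""


def extract_with_fallback(
    path: str,
    backends: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each backend in order; return (backend_name, text) from the first success."""
    errors: list[str] = []
    for name, extract in backends:
        try:
            text = extract(path)
        except Exception as exc:  # a real server would likely catch narrower errors
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        # Assumption: empty output counts as a failure and triggers the next backend.
        errors.append(f"{name}: returned no text")
    raise ExtractionError("; ".join(errors))


# Hypothetical stand-ins for wrappers around PyMuPDF, pdfplumber, and pypdf.
def fake_pymupdf(path: str) -> str:
    raise ValueError("cannot open file")


def fake_pdfplumber(path: str) -> str:
    return ""


def fake_pypdf(path: str) -> str:
    return "Hello from page 1"


CHAIN = [
    ("pymupdf", fake_pymupdf),
    ("pdfplumber", fake_pdfplumber),
    ("pypdf", fake_pypdf),
]

backend, text = extract_with_fallback("report.pdf", CHAIN)
```

Returning the backend name alongside the text lets a caller see which library actually produced the result, which helps when debugging a difficult PDF.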
Token Management
Large PDFs can overflow MCP response limits. The server handles this:
- Automatic chunking splits large documents into page groups
- Table row limits prevent huge tables from blowing up responses
- Summary mode returns structure without full content
# Get first 10 pages
result = await extract_text("huge.pdf", pages="1-10")
# Limit table rows
tables = await extract_tables("data.pdf", max_rows_per_table=50)
# Structure only
tables = await extract_tables("data.pdf", summary_only=True)
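To make the chunking and row-limit ideas concrete, here is a minimal sketch of how page groups and truncated tables might be computed. The helper names `chunk_pages` and `limit_rows`, the chunk size, and the shape of the returned dictionary are all assumptions for illustration; the server's real logic and response format may differ.

```python
from typing import Iterator


def chunk_pages(total_pages: int, chunk_size: int) -> Iterator[tuple[int, int]]:
    """Yield inclusive, 1-based (start, end) page ranges covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(1, total_pages + 1, chunk_size):
        yield start, min(start + chunk_size - 1, total_pages)


def limit_rows(rows: list[list[str]], max_rows: int) -> dict:
    """Keep at most max_rows rows and report whether anything was cut."""
    kept = rows[:max_rows]
    return {
        "rows": kept,
        "total_rows": len(rows),
        "truncated": len(rows) > len(kept),
    }


# A 25-page document in groups of 10 gives three page-range strings.
ranges = [f"{start}-{end}" for start, end in chunk_pages(total_pages=25, chunk_size=10)]

# A three-row table limited to two rows.
table = limit_rows([["a", "1"], ["b", "2"], ["c", "3"]], max_rows=2)
```

Each string in `ranges` has the same form as the `pages="1-10"` argument shown above, so a client could request one group at a time.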
URL Processing
PDFs can be fetched directly from HTTPS URLs:
result = await extract_text("https://example.com/report.pdf")
Files are cached locally for subsequent operations.
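One common way to implement this kind of cache is to key the local file on a hash of the URL. The sketch below assumes that approach; `cache_path_for` and `fetch_cached` are hypothetical helpers, the rejection of non-HTTPS schemes is an assumption, and the `download` callable stands in for whatever HTTP client the server really uses.

```python
import hashlib
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map a URL to a stable local file name inside cache_dir."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        # Assumption: anything other than HTTPS is rejected outright.
        raise ValueError(f"only HTTPS URLs are accepted, got {parsed.scheme!r}")
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.pdf"


def fetch_cached(url: str, cache_dir: Path, download: Callable[[str], bytes]) -> Path:
    """Return the cached file for url, downloading it only on the first request."""
    target = cache_path_for(url, cache_dir)
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(download(url))
    return target
```

Passing the downloader in as a parameter keeps the caching logic testable without network access.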
System Dependencies
Some features require system packages:
| Feature | Dependency |
|---|---|
| OCR | tesseract-ocr |
| Camelot tables | ghostscript |
| Tabula tables | default-jre-headless |
| PDF to images | poppler-utils |
Ubuntu/Debian:
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
Configuration
Optional environment variables:
| Variable | Purpose |
|---|---|
| MCP_PDF_ALLOWED_PATHS | Colon-separated directories for file output |
| PDF_TEMP_DIR | Temp directory for processing (default: /tmp/mcp-pdf-processing) |
| TESSDATA_PREFIX | Tesseract language data location |
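
As an illustration of how a colon-separated allow-list such as `MCP_PDF_ALLOWED_PATHS` could be enforced, here is a small sketch. The functions `allowed_dirs` and `is_output_allowed` are hypothetical, and the server's actual validation rules (including what happens when the variable is unset) are not documented here, so treat this as one plausible approach.

```python
from pathlib import Path


def allowed_dirs(env_value: str) -> list[Path]:
    """Split a colon-separated value into resolved directory paths, skipping empties."""
    return [Path(part).resolve() for part in env_value.split(":") if part]


def is_output_allowed(target: str, dirs: list[Path]) -> bool:
    """Return True if target, once resolved, sits inside one of the allowed directories."""
    resolved = Path(target).resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    return any(resolved.is_relative_to(directory) for directory in dirs)


dirs = allowed_dirs("/srv/out:/data/pdfs")
inside = is_output_allowed("/srv/out/report.pdf", dirs)
escaped = is_output_allowed("/srv/out/../etc/passwd", dirs)
```

Resolving both sides before comparing means a path that uses `..` to climb out of an allowed directory is rejected.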
Development
# Run tests
uv run pytest
# With coverage
uv run pytest --cov=mcp_pdf
# Format
uv run black src/ tests/
# Lint
uv run ruff check src/ tests/
License
MIT