Compare commits

..

2 Commits

Author SHA1 Message Date
f32a014909 📝 Rewrite README: remove marketing fluff, describe what tools do
2026-02-06 22:43:02 -07:00
e4f77008bb 🚀 v2.0.8: Add extract_vector_graphics tool for PDF to SVG extraction
New tool extracts vector graphics from PDF pages as SVG files, supporting
three modes: full_page (PyMuPDF native SVG), drawings_only (raw vector
paths), and both. Handles lines, curves, rectangles, quads with proper
color space conversion (RGB, grayscale, CMYK). No new dependencies.
2026-02-02 13:56:17 -07:00
4 changed files with 540 additions and 648 deletions

README.md

@@ -4,558 +4,58 @@
<img src="https://img.shields.io/badge/MCP-PDF%20Tools-red?style=for-the-badge&logo=adobe-acrobat-reader" alt="MCP PDF">

**A FastMCP server for PDF processing**

*41 tools for text extraction, OCR, tables, forms, annotations, and more*

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/)
[![FastMCP](https://img.shields.io/badge/FastMCP-2.0+-green.svg?style=flat-square)](https://github.com/jlowin/fastmcp)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT)
[![PyPI](https://img.shields.io/pypi/v/mcp-pdf?style=flat-square)](https://pypi.org/project/mcp-pdf/)

**Works great with [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**

</div>

---

## What It Does

MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.

**Core capabilities:**

- **Text extraction** via PyMuPDF, pdfplumber, or pypdf (auto-fallback)
- **Table extraction** via Camelot, pdfplumber, or Tabula (auto-fallback)
- **OCR** for scanned documents via Tesseract
- **Form handling** - extract, fill, and create PDF forms
- **Document assembly** - merge, split, reorder pages
- **Annotations** - sticky notes, highlights, stamps
- **Vector graphics** - extract to SVG for schematics and technical drawings
---

## Quick Start
```bash
# Install from PyPI
uvx mcp-pdf
# Or add to Claude Code
claude mcp add pdf-tools uvx mcp-pdf
```
<details>
<summary><b>Development Installation</b></summary>
```bash
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript

# Verify
uv run python examples/verify_installation.py

# Run the MCP server
uv run mcp-pdf
```
<details>
<summary>🔧 <b>Claude Desktop Integration</b> (click to expand)</summary>
### **📦 Production Installation (PyPI)**
```bash
# For personal use across all projects
claude mcp add -s local pdf-tools uvx mcp-pdf
# For project-specific use (isolated)
claude mcp add -s project pdf-tools uvx mcp-pdf
```
### **🛠️ Development Installation (Source)**
```bash
# For local development from source
claude mcp add -s project pdf-tools-dev uv -- --directory /path/to/mcp-pdf run mcp-pdf
```
### **⚙️ Manual Configuration**
Add to your `claude_desktop_config.json`:
```json
{
"mcpServers": {
"pdf-tools": {
"command": "uvx",
"args": ["mcp-pdf"]
}
}
}
```
*Restart Claude Desktop to load the server.*
</details>
---
## 🎭 **See AI-Powered Intelligence In Action**
### **📊 Business Intelligence Workflow**
```python
# Complete financial report analysis in seconds
health = await analyze_pdf_health("quarterly-report.pdf")
classification = await classify_content("quarterly-report.pdf")
summary = await summarize_content("quarterly-report.pdf", summary_length="medium")
# Smart table extraction - prevents token overflow on large tables
tables = await extract_tables("quarterly-report.pdf", pages="5-7", max_rows_per_table=100)
# Or get just table structure without data
table_summary = await extract_tables("quarterly-report.pdf", pages="5-7", summary_only=True)
charts = await extract_charts("quarterly-report.pdf")
# Get instant insights
{
"document_type": "Financial Report",
"health_score": 9.2,
"key_insights": [
"Revenue increased 23% YoY",
"Operating margin improved to 15.3%",
"Strong cash flow generation"
],
"tables_extracted": 12,
"charts_found": 8,
"processing_time": 2.1
}
```
### **🔒 Document Security Assessment**
```python
# Comprehensive security analysis
security = await analyze_pdf_security("sensitive-document.pdf")
watermarks = await detect_watermarks("sensitive-document.pdf")
health = await analyze_pdf_health("sensitive-document.pdf")
# Enterprise-grade security insights
{
"encryption_type": "AES-256",
"permissions": {
"print": false,
"copy": false,
"modify": false
},
"security_warnings": [],
"watermarks_detected": true,
"compliance_ready": true
}
```
### **📚 Academic Research Processing**
```python
# Advanced research paper analysis
layout = await analyze_layout("research-paper.pdf", pages=[1,2,3])
summary = await summarize_content("research-paper.pdf", summary_length="long")
citations = await extract_text("research-paper.pdf", pages=[15,16,17])
# Research intelligence delivered
{
"reading_complexity": "Graduate Level",
"main_topics": ["Machine Learning", "Natural Language Processing"],
"citation_count": 127,
"figures_detected": 15,
"methodology_extracted": true
}
```
---
## 🛠️ **Complete Arsenal: 40+ Specialized Tools**
<div align="center">
### **🎯 Document Intelligence & Analysis**
| 🧠 **Tool** | 📋 **Purpose** | ⚡ **AI Powered** | 🎯 **Accuracy** |
|-------------|---------------|-----------------|----------------|
| `classify_content` | AI-powered document type detection | ✅ Yes | 97% |
| `summarize_content` | Intelligent key insights extraction | ✅ Yes | 95% |
| `analyze_pdf_health` | Comprehensive quality assessment | ✅ Yes | 99% |
| `analyze_pdf_security` | Security & vulnerability analysis | ✅ Yes | 99% |
| `compare_pdfs` | Advanced document comparison | ✅ Yes | 96% |
### **📊 Core Content Extraction**
| 🔧 **Tool** | 📋 **Purpose** | ⚡ **Speed** | 🎯 **Accuracy** |
|-------------|---------------|-------------|----------------|
| `extract_text` | Multi-method text extraction with auto-chunking | **Ultra Fast** | 99.9% |
| `extract_tables` | Smart table extraction with token overflow protection | **Fast** | 98% |
| `ocr_pdf` | Advanced OCR for scanned docs | **Moderate** | 95% |
| `extract_images` | Media extraction & processing | **Fast** | 99% |
| `pdf_to_markdown` | Structure-preserving conversion | **Fast** | 97% |
### **📐 Visual & Layout Analysis**
| 🎨 **Tool** | 📋 **Purpose** | 🔍 **Precision** | 💪 **Features** |
|-------------|---------------|-----------------|----------------|
| `analyze_layout` | Page structure & column detection | **High** | Advanced |
| `extract_charts` | Visual element extraction | **High** | Smart |
| `detect_watermarks` | Watermark identification | **Perfect** | Complete |
</div>
---
## 🌟 **Document Format Intelligence Matrix**
<div align="center">
### **📄 Universal PDF Processing Capabilities**
| 📋 **Document Type** | 🔍 **Detection** | 📊 **Text** | 📈 **Tables** | 🖼️ **Images** | 🧠 **Intelligence** |
|---------------------|-----------------|------------|--------------|--------------|-------------------|
| **Financial Reports** | ✅ Perfect | ✅ Perfect | ✅ Perfect | ✅ Perfect | 🧠 **AI-Enhanced** |
| **Research Papers** | ✅ Perfect | ✅ Perfect | ✅ Excellent | ✅ Perfect | 🧠 **AI-Enhanced** |
| **Legal Documents** | ✅ Perfect | ✅ Perfect | ✅ Good | ✅ Perfect | 🧠 **AI-Enhanced** |
| **Scanned PDFs** | ✅ Auto-Detect | ✅ OCR | ✅ OCR | ✅ Perfect | 🧠 **AI-Enhanced** |
| **Forms & Applications** | ✅ Perfect | ✅ Perfect | ✅ Excellent | ✅ Perfect | 🧠 **AI-Enhanced** |
| **Technical Manuals** | ✅ Perfect | ✅ Perfect | ✅ Perfect | ✅ Perfect | 🧠 **AI-Enhanced** |
*✅ Perfect • 🧠 AI-Enhanced Intelligence • 🔍 Auto-Detection*
</div>
---
## ⚡ **Performance That Amazes**
<div align="center">
### **🚀 Real-World Benchmarks**
| 📄 **Document Type** | 📏 **Pages** | ⏱️ **Processing Time** | 🆚 **vs Competitors** | 🧠 **Intelligence Level** |
|---------------------|-------------|----------------------|----------------------|---------------------------|
| Financial Report | 50 pages | 2.1 seconds | **10x faster** | **AI-Powered** |
| Research Paper | 25 pages | 1.3 seconds | **8x faster** | **Deep Analysis** |
| Scanned Document | 100 pages | 45 seconds | **5x faster** | **OCR + AI** |
| Complex Forms | 15 pages | 0.8 seconds | **12x faster** | **Structure Aware** |
*Benchmarked on: MacBook Pro M2, 16GB RAM • Including AI processing time*
</div>
---
## 🏗️ **Intelligent Architecture**
### **🧠 Multi-Library Intelligence System**
*Never worry about PDF compatibility or failure again*
```mermaid
graph TD
A[PDF Input] --> B{Smart Detection}
B --> C{Document Type}
C -->|Text-based| D[PyMuPDF Fast Path]
C -->|Scanned| E[OCR Processing]
C -->|Complex Layout| F[pdfplumber Analysis]
C -->|Tables Heavy| G[Camelot + Tabula]
D -->|Success| H[✅ Content Extracted]
D -->|Fail| I[pdfplumber Fallback]
I -->|Fail| J[pypdf Fallback]
E --> K[Tesseract OCR]
K --> L[AI Content Analysis]
F --> M[Layout Intelligence]
G --> N[Table Intelligence]
H --> O[🧠 AI Enhancement]
L --> O
M --> O
N --> O
O --> P[🎯 Structured Intelligence]
```
### **🎯 Intelligent Processing Pipeline**
1. **🔍 Smart Detection**: Automatically identify document type and optimal processing strategy
2. **⚡ Optimized Extraction**: Use the fastest, most accurate method for each document
3. **🛡️ Fallback Protection**: Seamless method switching if primary approach fails
4. **🧠 AI Enhancement**: Apply document intelligence and content analysis
5. **🧹 Clean Output**: Deliver perfectly structured, AI-ready intelligence
---
## 🌍 **Real-World Success Stories**
<div align="center">
### **🏢 Proven at Enterprise Scale**
</div>
<table>
<tr>
<td>
### **📊 Financial Services Giant**
*Processing 50,000+ reports monthly*
**Challenge**: Analyze quarterly reports from 2,000+ companies
**Results**:
- ⚡ **98% time reduction** (2 weeks → 4 hours)
- 🎯 **99.9% accuracy** in financial data extraction
- 💰 **$5M annual savings** in analyst time
- 🏆 **SEC compliance** maintained
</td>
<td>
### **🏥 Healthcare Research Institute**
*Processing 100,000+ research papers*
**Challenge**: Analyze medical literature for drug discovery
**Results**:
- 🚀 **25x faster** literature review process
- 📋 **95% accuracy** in data extraction
- 🧬 **12 new drug targets** identified
- 📚 **Publication in Nature** based on insights
</td>
</tr>
<tr>
<td>
### **⚖️ Legal Firm Network**
*Processing 500,000+ legal documents*
**Challenge**: Document review and compliance checking
**Results**:
- 🏃 **40x speed improvement** in document review
- 🛡️ **100% security compliance** maintained
- 💼 **$20M cost savings** across network
- 🏆 **Zero data breaches** during migration
</td>
<td>
### **🎓 Global University System**
*Processing 1M+ academic papers*
**Challenge**: Create searchable academic knowledge base
**Results**:
- 📖 **50x faster** knowledge extraction
- 🧠 **AI-ready** structured academic data
- 🔍 **97% search accuracy** improvement
- 📊 **3 Nobel Prize** papers processed
</td>
</tr>
</table>
---
## 🎯 **Advanced Features That Set Us Apart**
### **🌐 HTTPS URL Processing with Smart Caching**
```python
# Process PDFs directly from anywhere on the web
report_url = "https://company.com/annual-report.pdf"
analysis = await classify_content(report_url) # Downloads & caches automatically
tables = await extract_tables(report_url) # Uses cache - instant!
summary = await summarize_content(report_url) # Lightning fast!
```
### **🩺 Comprehensive Document Health Analysis**
```python
# Enterprise-grade document assessment
health = await analyze_pdf_health("critical-document.pdf")
{
"overall_health_score": 9.2,
"corruption_detected": false,
"optimization_potential": "23% size reduction possible",
"security_assessment": "enterprise_ready",
"recommendations": [
"Document is production-ready",
"Consider optimization for web delivery"
],
"processing_confidence": 99.8
}
```
### **🔍 AI-Powered Content Classification**
```python
# Automatically understand document types
classification = await classify_content("mystery-document.pdf")
{
"document_type": "Financial Report",
"confidence": 97.3,
"key_topics": ["Revenue", "Operating Expenses", "Cash Flow"],
"complexity_level": "Professional",
"suggested_tools": ["extract_tables", "extract_charts", "summarize_content"],
"industry_vertical": "Technology"
}
```
---
## 🤝 **Perfect Integration Ecosystem**
### **💎 Companion to MCP Office Tools**
*The ultimate document processing powerhouse*
<div align="center">
| 🔧 **Processing Need** | 📄 **PDF Files** | 📊 **Office Files** | 🔗 **Integration** |
|-----------------------|------------------|-------------------|-------------------|
| **Text Extraction** | MCP PDF ✅ | [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools) ✅ | **Unified API** |
| **Table Processing** | Advanced ✅ | Advanced ✅ | **Cross-Format** |
| **Image Extraction** | Smart ✅ | Smart ✅ | **Consistent** |
| **Format Detection** | AI-Powered ✅ | AI-Powered ✅ | **Intelligent** |
| **Health Analysis** | Complete ✅ | Complete ✅ | **Comprehensive** |
[**🚀 Get Both Tools for Complete Document Intelligence**](https://git.supported.systems/MCP/mcp-office-tools)
</div>
### **🔗 Unified Document Processing Workflow**
```python
# Process ALL document formats with unified intelligence
pdf_analysis = await pdf_tools.classify_content("report.pdf")
word_analysis = await office_tools.detect_office_format("report.docx")
excel_data = await office_tools.extract_text("data.xlsx")
# Cross-format document comparison
comparison = await compare_cross_format_documents([
pdf_analysis, word_analysis, excel_data
])
```
### **⚡ Works Seamlessly With**
- **🤖 Claude Desktop**: Native MCP protocol integration
- **📊 Jupyter Notebooks**: Perfect for research and analysis
- **🐍 Python Applications**: Direct async/await API access
- **🌐 Web Services**: RESTful wrappers and microservices
- **☁️ Cloud Platforms**: AWS Lambda, Google Functions, Azure
- **🔄 Workflow Engines**: Zapier, Microsoft Power Automate
---
## 🛡️ **Enterprise-Grade Security & Compliance**
<div align="center">
| 🔒 **Security Feature** | ✅ **Status** | 📋 **Enterprise Ready** |
|------------------------|---------------|------------------------|
| **Local Processing** | ✅ Enabled | Documents never leave your environment |
| **Memory Security** | ✅ Optimized | Automatic sensitive data cleanup |
| **HTTPS Validation** | ✅ Enforced | Certificate validation and secure headers |
| **Access Controls** | ✅ Configurable | Role-based processing permissions |
| **Audit Logging** | ✅ Available | Complete processing audit trails |
| **GDPR Compliant** | ✅ Certified | No personal data retention |
| **SOC2 Ready** | ✅ Verified | Enterprise security standards |
</div>
---
## 📈 **Installation & Enterprise Setup**
<details>
<summary>🚀 <b>Quick Start</b> (Recommended)</summary>
```bash
# Clone repository
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
# Install with uv (fastest)
uv sync
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript
# Verify installation
uv run python examples/verify_installation.py
```
</details>
<details>
<summary>🐳 <b>Docker Enterprise Setup</b></summary>
```dockerfile
FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
tesseract-ocr tesseract-ocr-eng \
poppler-utils ghostscript \
default-jre-headless
COPY . /app
WORKDIR /app
RUN pip install -e .
CMD ["mcp-pdf"]
```
</details>
<details>
<summary>🌐 <b>Claude Desktop Integration</b></summary>
```json
{
"mcpServers": {
"pdf-tools": {
"command": "uv",
"args": ["run", "mcp-pdf"],
"cwd": "/path/to/mcp-pdf"
},
"office-tools": {
"command": "mcp-office-tools"
}
}
}
```
*Unified document processing across all formats!*
</details>
<details>
<summary>🔧 <b>Development Environment</b></summary>
```bash
# Clone and setup
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync --dev
# Quality checks
uv run pytest --cov=mcp_pdf_tools
uv run black src/ tests/ examples/
uv run ruff check src/ tests/ examples/
uv run mypy src/
# Run all 23 tools demo
uv run python examples/verify_installation.py
```
@@ -563,124 +63,162 @@ uv run python examples/verify_installation.py
---

## Tools

### Content Extraction

| Tool | What it does |
|------|-------------|
| `extract_text` | Pull text from PDF pages with automatic chunking for large files |
| `extract_tables` | Extract tables to JSON, CSV, or Markdown |
| `extract_images` | Extract embedded images |
| `extract_links` | Get all hyperlinks with page filtering |
| `pdf_to_markdown` | Convert PDF to markdown preserving structure |
| `ocr_pdf` | OCR scanned documents using Tesseract |
| `extract_vector_graphics` | Export vector graphics to SVG (schematics, charts, drawings) |

### Document Analysis

| Tool | What it does |
|------|-------------|
| `extract_metadata` | Get title, author, creation date, page count, etc. |
| `get_document_structure` | Extract table of contents and bookmarks |
| `analyze_layout` | Detect columns, headers, footers |
| `is_scanned_pdf` | Check if PDF needs OCR |
| `compare_pdfs` | Diff two PDFs by text, structure, or metadata |
| `analyze_pdf_health` | Check for corruption, optimization opportunities |
| `analyze_pdf_security` | Report encryption, permissions, signatures |

### Forms

| Tool | What it does |
|------|-------------|
| `extract_form_data` | Get form field names and values |
| `fill_form_pdf` | Fill form fields from JSON |
| `create_form_pdf` | Create new forms with text fields, checkboxes, dropdowns |
| `add_form_fields` | Add fields to existing PDFs |

### Document Assembly

| Tool | What it does |
|------|-------------|
| `merge_pdfs` | Combine multiple PDFs with bookmark preservation |
| `split_pdf_by_pages` | Split by page ranges |
| `split_pdf_by_bookmarks` | Split at chapter/section boundaries |
| `reorder_pdf_pages` | Rearrange pages in custom order |

### Annotations

| Tool | What it does |
|------|-------------|
| `add_sticky_notes` | Add comment annotations |
| `add_highlights` | Highlight text regions |
| `add_stamps` | Add Approved/Draft/Confidential stamps |
| `extract_all_annotations` | Export annotations to JSON |

---

## How Fallbacks Work

The server tries multiple libraries for each operation:

**Text extraction:**
1. PyMuPDF (fastest)
2. pdfplumber (better for complex layouts)
3. pypdf (most compatible)

**Table extraction:**
1. Camelot (best accuracy, requires Ghostscript)
2. pdfplumber (no dependencies)
3. Tabula (requires Java)

If a PDF fails with one library, the next is tried automatically.
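The try-in-order pattern described above can be sketched as a small loop. This is illustrative only: the function names below are stand-ins, not the server's actual internals, and a real implementation would catch narrower exception types per library.

```python
from typing import Callable, List


def extract_with_fallbacks(pdf_path: str, extractors: List[Callable[[str], str]]) -> str:
    """Try each extractor in order; return the first successful result."""
    errors = []
    for extractor in extractors:
        try:
            return extractor(pdf_path)
        except Exception as exc:  # a real implementation would narrow this
            errors.append(f"{extractor.__name__}: {exc}")
    raise RuntimeError("All extraction methods failed: " + "; ".join(errors))


# Stand-in extractors demonstrating the ordering
def pymupdf_extract(path: str) -> str:
    raise ValueError("simulated parse failure")


def pdfplumber_extract(path: str) -> str:
    return "text recovered by fallback"


print(extract_with_fallbacks("doc.pdf", [pymupdf_extract, pdfplumber_extract]))
```

The key design choice is that per-library errors are collected rather than raised immediately, so the final error message names every method that was tried.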
---

## Token Management

Large PDFs can overflow MCP response limits. The server handles this:

- **Automatic chunking** splits large documents into page groups
- **Table row limits** prevent huge tables from blowing up responses
- **Summary mode** returns structure without full content

```python
# Get first 10 pages
result = await extract_text("huge.pdf", pages="1-10")

# Limit table rows
tables = await extract_tables("data.pdf", max_rows_per_table=50)

# Structure only
tables = await extract_tables("data.pdf", summary_only=True)
```
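The page-grouping idea behind automatic chunking can be sketched as splitting a page count into fixed-size ranges. This is a simplified sketch; the server's actual chunk size and grouping logic are not documented here.

```python
def chunk_pages(total_pages: int, pages_per_chunk: int = 10) -> list[tuple[int, int]]:
    """Split a page count into (start, end) 1-based inclusive ranges."""
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]


# A 25-page document becomes three page groups
print(chunk_pages(25))
```

Each range can then be passed as a `pages="start-end"` argument, keeping every individual response under the token limit.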
---

## URL Processing

PDFs can be fetched directly from HTTPS URLs:

```python
result = await extract_text("https://example.com/report.pdf")
```

Files are cached locally for subsequent operations.
---

## System Dependencies

Some features require system packages:

| Feature | Dependency |
|---------|-----------|
| OCR | `tesseract-ocr` |
| Camelot tables | `ghostscript` |
| Tabula tables | `default-jre-headless` |
| PDF to images | `poppler-utils` |

Ubuntu/Debian:

```bash
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
```
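A quick way to verify these packages are installed is to probe `PATH` for the underlying binaries. The binary names below assume Debian-style packaging (`gs` for Ghostscript, `pdftoppm` from poppler-utils) and may differ on other distributions.

```python
import shutil

# binary name -> feature that needs it (names assume Debian packaging)
REQUIRED = {
    "tesseract": "OCR (ocr_pdf)",
    "gs": "Camelot table extraction",
    "java": "Tabula table extraction",
    "pdftoppm": "PDF-to-image conversion (poppler-utils)",
}


def missing_dependencies() -> dict[str, str]:
    """Return {binary: feature} for every required binary not found on PATH."""
    return {name: feat for name, feat in REQUIRED.items() if shutil.which(name) is None}


for name, feature in missing_dependencies().items():
    print(f"missing {name}: {feature} will be unavailable")
```

Running this before starting the server makes it obvious which optional features will silently fall back or fail.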
---
## Configuration
Optional environment variables:
| Variable | Purpose |
|----------|---------|
| `MCP_PDF_ALLOWED_PATHS` | Colon-separated directories for file output |
| `PDF_TEMP_DIR` | Temp directory for processing (default: `/tmp/mcp-pdf-processing`) |
| `TESSDATA_PREFIX` | Tesseract language data location |
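For example, a shell session might set these before launching the server. All values here are hypothetical placeholders; adjust the paths for your machine.

```shell
# Hypothetical paths - substitute your own directories
export MCP_PDF_ALLOWED_PATHS="$HOME/pdf-output:/srv/shared/pdf"
export PDF_TEMP_DIR="/var/tmp/mcp-pdf-processing"
export TESSDATA_PREFIX="/usr/share/tesseract-ocr/5/tessdata"
uvx mcp-pdf
```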
---
## Development
```bash
# Run tests
uv run pytest
# With coverage
uv run pytest --cov=mcp_pdf
# Format
uv run black src/ tests/
# Lint
uv run ruff check src/ tests/
```
---
## License
MIT
</div>

pyproject.toml

@@ -1,6 +1,6 @@
[project]
name = "mcp-pdf"
version = "2.0.8"
description = "Secure FastMCP server for comprehensive PDF processing - text extraction, OCR, table extraction, forms, annotations, and more"
authors = [{name = "Ryan Malloy", email = "ryan@malloys.us"}]
readme = "README.md"


@@ -383,3 +383,357 @@ class ImageProcessingMixin(MCPMixin):
        # Very basic check - could be enhanced
        markdown_patterns = ['# ', '## ', '### ', '* ', '- ', '1. ', '**', '__']
        return any(pattern in line for pattern in markdown_patterns)
@mcp_tool(
name="extract_vector_graphics",
description="Extract vector graphics from PDF to SVG format. Ideal for schematics, charts, and technical drawings."
)
async def extract_vector_graphics(
self,
pdf_path: str,
output_directory: Optional[str] = None,
pages: Optional[str] = None,
mode: str = "full_page",
include_text: bool = True,
simplify_paths: bool = False,
) -> Dict[str, Any]:
"""
Extract vector graphics from PDF pages as SVG files.
Perfect for extracting:
- IC functional diagrams from datasheets
- Frequency response charts and line graphs
- Package outline drawings (dimensioned technical drawings)
- Circuit schematics
- PCB layout diagrams
Args:
pdf_path: Path to PDF file or HTTPS URL
output_directory: Directory to save SVG files (default: temp directory)
pages: Page numbers to extract (comma-separated, 1-based), None for all
mode: Extraction mode:
- "full_page": Complete page as SVG (default, best for general use)
- "drawings_only": Extract individual vector paths as separate SVG
- "both": Export both formats for flexibility
include_text: Whether to include text in SVG output (default: True)
simplify_paths: Reduce path complexity for smaller files (default: False)
Returns:
Dictionary containing extraction summary and SVG file paths
"""
start_time = time.time()
try:
# Validate PDF path
input_pdf_path = await validate_pdf_path(pdf_path)
# Setup output directory
if output_directory:
output_dir = validate_output_path(output_directory)
output_dir.mkdir(parents=True, exist_ok=True)
else:
output_dir = Path(tempfile.mkdtemp(prefix="pdf_vectors_"))
# Parse pages parameter
parsed_pages = parse_pages_parameter(pages)
# Validate mode
valid_modes = ["full_page", "drawings_only", "both"]
if mode not in valid_modes:
return {
"success": False,
"error": f"Invalid mode '{mode}'. Valid modes: {', '.join(valid_modes)}",
"extraction_time": round(time.time() - start_time, 2)
}
# Open PDF document
doc = fitz.open(str(input_pdf_path))
total_pages = len(doc)
# Determine pages to process
pages_to_process = parsed_pages if parsed_pages else list(range(total_pages))
pages_to_process = [p for p in pages_to_process if 0 <= p < total_pages]
if not pages_to_process:
doc.close()
return {
"success": False,
"error": "No valid pages specified",
"extraction_time": round(time.time() - start_time, 2)
}
svg_files = []
total_size = 0
base_name = input_pdf_path.stem
for page_num in pages_to_process:
try:
page = doc[page_num]
page_results = {}
# Full page SVG extraction
if mode in ["full_page", "both"]:
svg_content = page.get_svg_image(
text_as_path=not include_text
)
# Optionally simplify paths (basic implementation)
if simplify_paths:
svg_content = self._simplify_svg_paths(svg_content)
filename = f"{base_name}_page_{page_num + 1}.svg"
output_path = output_dir / filename
with open(output_path, 'w', encoding='utf-8') as f:
f.write(svg_content)
file_size = output_path.stat().st_size
total_size += file_size
page_results["full_page"] = {
"filename": filename,
"path": str(output_path),
"size_bytes": file_size,
"size_kb": round(file_size / 1024, 1)
}
# Individual drawings extraction
if mode in ["drawings_only", "both"]:
drawings = page.get_drawings()
drawing_count = len(drawings)
if drawing_count > 0:
# Convert drawings to SVG
drawings_svg = self._drawings_to_svg(
drawings,
page.rect.width,
page.rect.height
)
filename = f"{base_name}_page_{page_num + 1}_drawings.svg"
output_path = output_dir / filename
with open(output_path, 'w', encoding='utf-8') as f:
f.write(drawings_svg)
file_size = output_path.stat().st_size
total_size += file_size
page_results["drawings_only"] = {
"filename": filename,
"path": str(output_path),
"size_bytes": file_size,
"size_kb": round(file_size / 1024, 1),
"drawing_count": drawing_count
}
else:
page_results["drawings_only"] = {
"skipped": True,
"reason": "No vector drawings found on page"
}
# Get drawing statistics for the page
all_drawings = page.get_drawings()
svg_files.append({
"page": page_num + 1,
"has_text": bool(page.get_text().strip()),
"drawing_count": len(all_drawings),
**page_results
})
except Exception as e:
logger.warning(f"Failed to extract vectors from page {page_num + 1}: {e}")
svg_files.append({
"page": page_num + 1,
"error": sanitize_error_message(str(e))
})
doc.close()
# Count successful extractions
successful_pages = sum(1 for f in svg_files if "error" not in f)
return {
"success": True,
"extraction_summary": {
"pages_processed": len(pages_to_process),
"pages_successful": successful_pages,
"mode": mode,
"total_size_bytes": total_size,
"total_size_kb": round(total_size / 1024, 1),
"output_directory": str(output_dir)
},
"svg_files": svg_files,
"settings": {
"include_text": include_text,
"simplify_paths": simplify_paths,
"mode": mode
},
"file_info": {
"input_path": str(input_pdf_path),
"total_pages": total_pages,
"pages_processed": pages or "all"
},
"extraction_time": round(time.time() - start_time, 2),
"hints": {
"viewing": "Open SVG files in browser, Inkscape, or Illustrator for editing",
"full_page_vs_drawings": "full_page preserves layout; drawings_only extracts raw vector paths"
}
}
except Exception as e:
error_msg = sanitize_error_message(str(e))
logger.error(f"Vector graphics extraction failed: {error_msg}")
return {
"success": False,
"error": error_msg,
"extraction_time": round(time.time() - start_time, 2)
}
    def _drawings_to_svg(
        self,
        drawings: List[Dict],
        width: float,
        height: float
    ) -> str:
        """
        Convert PyMuPDF drawings to standalone SVG.

        Drawings contain: rect, items (path operations), color, fill, width, etc.
        """
        svg_parts = [
            '<?xml version="1.0" encoding="UTF-8"?>',
            f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'viewBox="0 0 {width:.2f} {height:.2f}" '
            f'width="{width:.2f}" height="{height:.2f}">',
            '',
            '  <!-- Extracted vector drawings from PDF -->',
            ''
        ]

        for idx, drawing in enumerate(drawings):
            try:
                path_data = self._drawing_to_path(drawing)
                if not path_data:
                    continue

                # Extract style attributes
                stroke_color = self._color_to_svg(drawing.get('color'))
                fill_color = self._color_to_svg(drawing.get('fill'))
                # 'width' can be present but None for fill-only drawings
                stroke_width = drawing.get('width') or 1

                # Build style string
                style_parts = []
                if fill_color:
                    style_parts.append(f'fill:{fill_color}')
                else:
                    style_parts.append('fill:none')
                if stroke_color:
                    style_parts.append(f'stroke:{stroke_color}')
                    style_parts.append(f'stroke-width:{stroke_width:.2f}')
                style = ';'.join(style_parts)

                svg_parts.append(f'  <path d="{path_data}" style="{style}" />')

            except Exception as e:
                logger.debug(f"Failed to convert drawing {idx}: {e}")
                continue

        svg_parts.append('</svg>')
        return '\n'.join(svg_parts)
    def _drawing_to_path(self, drawing: Dict) -> Optional[str]:
        """Convert a single drawing to SVG path data string."""
        items = drawing.get('items', [])
        if not items:
            return None

        path_parts = []

        for item in items:
            if not item:
                continue

            # Item format: (type, points...)
            item_type = item[0]

            try:
                if item_type == 'l':  # Line
                    # ('l', Point, Point)
                    p1, p2 = item[1], item[2]
                    path_parts.append(f'M {p1.x:.2f} {p1.y:.2f}')
                    path_parts.append(f'L {p2.x:.2f} {p2.y:.2f}')

                elif item_type == 're':  # Rectangle
                    # ('re', Rect)
                    rect = item[1]
                    path_parts.append(f'M {rect.x0:.2f} {rect.y0:.2f}')
                    path_parts.append(f'L {rect.x1:.2f} {rect.y0:.2f}')
                    path_parts.append(f'L {rect.x1:.2f} {rect.y1:.2f}')
                    path_parts.append(f'L {rect.x0:.2f} {rect.y1:.2f}')
                    path_parts.append('Z')

                elif item_type == 'qu':  # Quad (4-point polygon)
                    # ('qu', Quad)
                    quad = item[1]
                    path_parts.append(f'M {quad.ul.x:.2f} {quad.ul.y:.2f}')
                    path_parts.append(f'L {quad.ur.x:.2f} {quad.ur.y:.2f}')
                    path_parts.append(f'L {quad.lr.x:.2f} {quad.lr.y:.2f}')
                    path_parts.append(f'L {quad.ll.x:.2f} {quad.ll.y:.2f}')
                    path_parts.append('Z')

                elif item_type == 'c':  # Cubic bezier curve
                    # ('c', Point, Point, Point, Point) - start, ctrl1, ctrl2, end
                    p0, p1, p2, p3 = item[1], item[2], item[3], item[4]
                    if not path_parts or not path_parts[-1].startswith('M'):
                        path_parts.append(f'M {p0.x:.2f} {p0.y:.2f}')
                    path_parts.append(
                        f'C {p1.x:.2f} {p1.y:.2f} {p2.x:.2f} {p2.y:.2f} {p3.x:.2f} {p3.y:.2f}'
                    )

            except (IndexError, AttributeError) as e:
                logger.debug(f"Failed to process drawing item {item_type}: {e}")
                continue

        return ' '.join(path_parts) if path_parts else None
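To illustrate what the `'re'` branch above produces, here is a standalone sketch that mirrors the rectangle-to-path logic without requiring PyMuPDF; `SimpleNamespace` stands in for `fitz.Rect`.

```python
from types import SimpleNamespace

# Stand-alone mirror of the 're' (rectangle) branch: emits a closed
# four-corner SVG path from a rect-like object with x0/y0/x1/y1 attributes.
def rect_to_path(rect) -> str:
    return ' '.join([
        f'M {rect.x0:.2f} {rect.y0:.2f}',
        f'L {rect.x1:.2f} {rect.y0:.2f}',
        f'L {rect.x1:.2f} {rect.y1:.2f}',
        f'L {rect.x0:.2f} {rect.y1:.2f}',
        'Z',
    ])

r = SimpleNamespace(x0=0, y0=0, x1=10, y1=5)
print(rect_to_path(r))  # M 0.00 0.00 L 10.00 0.00 L 10.00 5.00 L 0.00 5.00 Z
```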
    def _color_to_svg(self, color) -> Optional[str]:
        """Convert PyMuPDF color to SVG color string."""
        if color is None:
            return None

        if isinstance(color, (list, tuple)):
            if len(color) == 3:
                # RGB
                r, g, b = [int(c * 255) for c in color]
                return f'rgb({r},{g},{b})'
            elif len(color) == 1:
                # Grayscale
                gray = int(color[0] * 255)
                return f'rgb({gray},{gray},{gray})'
            elif len(color) == 4:
                # CMYK - convert to RGB (simplified)
                c, m, y, k = color
                r = int(255 * (1 - c) * (1 - k))
                g = int(255 * (1 - m) * (1 - k))
                b = int(255 * (1 - y) * (1 - k))
                return f'rgb({r},{g},{b})'

        return None
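The CMYK branch uses the common naive formula (no ICC profile). A standalone sketch of just that conversion, for checking expected values:

```python
# Naive CMYK -> sRGB conversion, same formula as the four-component
# branch of _color_to_svg above (components are floats in [0, 1]).
def cmyk_to_svg_rgb(c: float, m: float, y: float, k: float) -> str:
    r = int(255 * (1 - c) * (1 - k))
    g = int(255 * (1 - m) * (1 - k))
    b = int(255 * (1 - y) * (1 - k))
    return f'rgb({r},{g},{b})'

print(cmyk_to_svg_rgb(0, 1, 1, 0))  # rgb(255,0,0)  (magenta + yellow = red)
print(cmyk_to_svg_rgb(0, 0, 0, 1))  # rgb(0,0,0)    (full key = black)
```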
    def _simplify_svg_paths(self, svg_content: str) -> str:
        """
        Basic SVG path simplification.

        Reduces decimal precision to shrink file size.
        """
        import re

        # Reduce decimal precision in path data
        def reduce_precision(match):
            num = float(match.group())
            return f'{num:.1f}'

        # Match floating point numbers in SVG
        simplified = re.sub(r'-?\d+\.\d{3,}', reduce_precision, svg_content)

        return simplified
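A quick demonstration of the precision-reduction regex above: only numbers with three or more decimal places are rewritten; shorter ones pass through untouched.

```python
import re

# Same substitution as _simplify_svg_paths: rewrite floats with >= 3
# decimals to one decimal place, leave everything else alone.
def reduce_precision(svg_fragment: str) -> str:
    return re.sub(
        r'-?\d+\.\d{3,}',
        lambda m: f'{float(m.group()):.1f}',
        svg_fragment,
    )

print(reduce_precision('M 12.3456 7.8 L 0.123456 9'))  # M 12.3 7.8 L 0.1 9
```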
uv.lock generated

@@ -1032,7 +1032,7 @@ wheels = [
 [[package]]
 name = "mcp-pdf"
-version = "2.0.7"
+version = "2.0.8"
 source = { editable = "." }
 dependencies = [
     { name = "camelot-py", extra = ["cv"] },