Compare commits: 19bdeddcdf...f32a014909 (2 commits: f32a014909, e4f77008bb)

README.md
<img src="https://img.shields.io/badge/MCP-PDF%20Tools-red?style=for-the-badge&logo=adobe-acrobat-reader" alt="MCP PDF">

**A FastMCP server for PDF processing**

*41 tools for text extraction, OCR, tables, forms, annotations, and more*

[](https://www.python.org/downloads/)
[](https://github.com/jlowin/fastmcp)
[](https://opensource.org/licenses/MIT)
[](https://pypi.org/project/mcp-pdf/)

**Works great with [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**

</div>

---

## What It Does

MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.

**Core capabilities:**

- **Text extraction** via PyMuPDF, pdfplumber, or pypdf (auto-fallback)
- **Table extraction** via Camelot, pdfplumber, or Tabula (auto-fallback)
- **OCR** for scanned documents via Tesseract
- **Form handling** - extract, fill, and create PDF forms
- **Document assembly** - merge, split, reorder pages
- **Annotations** - sticky notes, highlights, stamps
- **Vector graphics** - extract to SVG for schematics and technical drawings

---

## Quick Start
```bash
# Install from PyPI
uvx mcp-pdf

# Or add to Claude Code
claude mcp add pdf-tools uvx mcp-pdf
```

<details>
<summary><b>Development Installation</b></summary>

```bash
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript

# Verify
uv run python examples/verify_installation.py
```

</details>
---

## Tools

### Content Extraction

| Tool | What it does |
|------|-------------|
| `extract_text` | Pull text from PDF pages with automatic chunking for large files |
| `extract_tables` | Extract tables to JSON, CSV, or Markdown |
| `extract_images` | Extract embedded images |
| `extract_links` | Get all hyperlinks with page filtering |
| `pdf_to_markdown` | Convert PDF to markdown preserving structure |
| `ocr_pdf` | OCR scanned documents using Tesseract |
| `extract_vector_graphics` | Export vector graphics to SVG (schematics, charts, drawings) |

### Document Analysis

| Tool | What it does |
|------|-------------|
| `extract_metadata` | Get title, author, creation date, page count, etc. |
| `get_document_structure` | Extract table of contents and bookmarks |
| `analyze_layout` | Detect columns, headers, footers |
| `is_scanned_pdf` | Check if PDF needs OCR |
| `compare_pdfs` | Diff two PDFs by text, structure, or metadata |
| `analyze_pdf_health` | Check for corruption, optimization opportunities |
| `analyze_pdf_security` | Report encryption, permissions, signatures |

### Forms

| Tool | What it does |
|------|-------------|
| `extract_form_data` | Get form field names and values |
| `fill_form_pdf` | Fill form fields from JSON |
| `create_form_pdf` | Create new forms with text fields, checkboxes, dropdowns |
| `add_form_fields` | Add fields to existing PDFs |
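The form tools above pass field values as JSON keyed by field name; a minimal sketch of the payload shape (the field names here are invented for illustration, and the exact `fill_form_pdf` parameter names may differ):

```python
import json

# Hypothetical form data: map field names (as reported by
# extract_form_data) to the values to write. Field names here are
# invented for illustration, not from a real document.
form_values = {
    "applicant_name": "Jane Doe",
    "subscribe": True,   # checkbox
    "country": "US",     # dropdown
}
payload = json.dumps(form_values)

# Usage would then be along the lines of:
# await fill_form_pdf("application.pdf", form_data=payload)
```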
### Document Assembly

| Tool | What it does |
|------|-------------|
| `merge_pdfs` | Combine multiple PDFs with bookmark preservation |
| `split_pdf_by_pages` | Split by page ranges |
| `split_pdf_by_bookmarks` | Split at chapter/section boundaries |
| `reorder_pdf_pages` | Rearrange pages in custom order |

### Annotations

| Tool | What it does |
|------|-------------|
| `add_sticky_notes` | Add comment annotations |
| `add_highlights` | Highlight text regions |
| `add_stamps` | Add Approved/Draft/Confidential stamps |
| `extract_all_annotations` | Export annotations to JSON |
---

## How Fallbacks Work

The server tries multiple libraries for each operation:

**Text extraction:**

1. PyMuPDF (fastest)
2. pdfplumber (better for complex layouts)
3. pypdf (most compatible)

**Table extraction:**

1. Camelot (best accuracy, requires Ghostscript)
2. pdfplumber (no dependencies)
3. Tabula (requires Java)

If a PDF fails with one library, the next is tried automatically.
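The try-in-order pattern above can be sketched in a few lines of Python. This is an illustrative sketch, not the server's actual internals; the extractor functions are stubs standing in for the real PyMuPDF/pdfplumber/pypdf calls:

```python
# Minimal sketch of a fallback chain: try each extractor in order,
# recording failures, and return the first successful result.
def extract_with_fallbacks(pdf_path, extractors):
    errors = []
    for name, extract in extractors:
        try:
            return name, extract(pdf_path)
        except Exception as exc:  # a real chain would catch narrower errors
            errors.append((name, exc))
    raise RuntimeError(f"all extractors failed: {errors}")

# Stub extractors standing in for the real libraries
def pymupdf_extract(path):
    raise ValueError("unsupported encoding")  # simulate a failure

def pdfplumber_extract(path):
    return "text recovered by fallback"

chain = [("pymupdf", pymupdf_extract), ("pdfplumber", pdfplumber_extract)]
```

Because the chain returns which library succeeded, results can also report the method used, which helps when debugging a stubborn PDF.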
---

## Token Management

Large PDFs can overflow MCP response limits. The server handles this:

- **Automatic chunking** splits large documents into page groups
- **Table row limits** prevent huge tables from blowing up responses
- **Summary mode** returns structure without full content

```python
# Get first 10 pages
result = await extract_text("huge.pdf", pages="1-10")

# Limit table rows
tables = await extract_tables("data.pdf", max_rows_per_table=50)

# Structure only
tables = await extract_tables("data.pdf", summary_only=True)
```
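A page spec like `"1-10"` and the chunking step can be sketched generically. This is an illustrative helper under assumed parsing rules, not the server's actual parser:

```python
# Hedged sketch: parse a pages spec like "1-3,7" into page numbers,
# then split the list into fixed-size chunks for separate responses.
def parse_pages(spec):
    pages = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages

def chunk(pages, size):
    return [pages[i:i + size] for i in range(0, len(pages), size)]
```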
---

## URL Processing

PDFs can be fetched directly from HTTPS URLs:

```python
result = await extract_text("https://example.com/report.pdf")
```

Files are cached locally for subsequent operations.
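One common way to implement such a cache is to key downloaded files by a hash of the URL, so repeat calls hit the same file. A minimal sketch of that idea (assumed behaviour; mcp-pdf's real cache location and keying may differ):

```python
import hashlib
import os
import tempfile

# Hedged sketch of URL-keyed caching: derive a stable filename from
# the URL's SHA-256 so the same URL always maps to the same path.
def cache_path_for(url, cache_dir=None):
    cache_dir = cache_dir or os.path.join(tempfile.gettempdir(), "pdf-cache")
    key = hashlib.sha256(url.encode()).hexdigest()
    return os.path.join(cache_dir, key + ".pdf")
```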
---

## System Dependencies

Some features require system packages:

| Feature | Dependency |
|---------|-----------|
| OCR | `tesseract-ocr` |
| Camelot tables | `ghostscript` |
| Tabula tables | `default-jre-headless` |
| PDF to images | `poppler-utils` |

Ubuntu/Debian:

```bash
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
```
---

## Configuration

Optional environment variables:

| Variable | Purpose |
|----------|---------|
| `MCP_PDF_ALLOWED_PATHS` | Colon-separated directories for file output |
| `PDF_TEMP_DIR` | Temp directory for processing (default: `/tmp/mcp-pdf-processing`) |
| `TESSDATA_PREFIX` | Tesseract language data location |
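Reading a colon-separated allow-list like `MCP_PDF_ALLOWED_PATHS` typically looks like the sketch below. The variable name matches the table above, but the parsing and validation details are assumptions, not the server's actual logic:

```python
import os

# Hedged sketch: split the colon-separated allow-list, then check a
# candidate output path against it by absolute-path prefix.
def allowed_paths(env=os.environ):
    raw = env.get("MCP_PDF_ALLOWED_PATHS", "")
    return [p for p in raw.split(":") if p]

def is_allowed(path, allow_list):
    return any(os.path.abspath(path).startswith(os.path.abspath(d) + os.sep)
               for d in allow_list)
```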
---

## Development

```bash
# Run tests
uv run pytest

# With coverage
uv run pytest --cov=mcp_pdf

# Format
uv run black src/ tests/

# Lint
uv run ruff check src/ tests/
```

---

## License

MIT

</div>
||||||
|
|
||||||
| 🗓️ **Timeline** | 🎯 **Feature** | 📋 **Impact** |
|
|
||||||
|-----------------|---------------|--------------|
|
|
||||||
| **Q4 2024** | **Enhanced AI Analysis** | GPT-powered content understanding |
|
|
||||||
| **Q1 2025** | **Batch Processing** | Process 1000+ documents simultaneously |
|
|
||||||
| **Q2 2025** | **Cloud Integration** | Direct S3, GCS, Azure Blob support |
|
|
||||||
| **Q3 2025** | **Real-time Streaming** | Process documents as they're created |
|
|
||||||
| **Q4 2025** | **Multi-language OCR** | 50+ language support with AI translation |
|
|
||||||
| **2026** | **Blockchain Verification** | Cryptographic document integrity |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🎭 **Complete Tool Showcase**
|
|
||||||
|
|
||||||
<details>
|
|
||||||
<summary>📊 <b>Business Intelligence Tools</b> (click to expand)</summary>
|
|
||||||
|
|
||||||
### **Core Extraction**
|
|
||||||
- `extract_text` - Multi-method text extraction with layout preservation
|
|
||||||
- `extract_tables` - Intelligent table extraction (JSON, CSV, Markdown)
|
|
||||||
- `extract_images` - Image extraction with size filtering and format options
|
|
||||||
- `pdf_to_markdown` - Clean markdown conversion with structure preservation
|
|
||||||
|
|
||||||
### **AI-Powered Analysis**
|
|
||||||
- `classify_content` - AI document type classification and analysis
|
|
||||||
- `summarize_content` - Intelligent summarization with key insights
|
|
||||||
- `analyze_pdf_health` - Comprehensive quality assessment
|
|
||||||
- `analyze_pdf_security` - Security feature analysis and vulnerability detection
|
|
||||||
|
|
||||||
</details>
|
|
||||||
|
|
||||||
<details>
|
|
||||||
<summary>🔍 <b>Advanced Analysis Tools</b> (click to expand)</summary>
|
|
||||||
|
|
||||||
### **Document Intelligence**
|
|
||||||
- `compare_pdfs` - Advanced document comparison (text, structure, metadata)
|
|
||||||
- `is_scanned_pdf` - Smart detection of scanned vs. text-based documents
|
|
||||||
- `get_document_structure` - Document outline and structural analysis
|
|
||||||
- `extract_metadata` - Comprehensive metadata and statistics extraction
|
|
||||||
|
|
||||||
### **Visual Processing**
|
|
||||||
- `analyze_layout` - Page layout analysis with column and spacing detection
|
|
||||||
- `extract_charts` - Chart, diagram, and visual element extraction
|
|
||||||
- `detect_watermarks` - Watermark detection and analysis
|
|
||||||
|
|
||||||
</details>
<details>
<summary>🔨 <b>Document Manipulation Tools</b> (click to expand)</summary>

### **Content Operations**
- `extract_form_data` - Interactive PDF form data extraction
- `split_pdf` - Intelligent document splitting at specified pages
- `merge_pdfs` - Multi-document merging with page range tracking
- `rotate_pages` - Precise page rotation (90°/180°/270°)

### **Optimization & Repair**
- `convert_to_images` - PDF to image conversion with quality control
- `optimize_pdf` - Multi-level file size optimization
- `repair_pdf` - Automated corruption repair and recovery
- `ocr_pdf` - Advanced OCR with preprocessing for scanned documents

</details>
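Tools like `split_pdf` and `rotate_pages` take a 1-based, comma-separated page specification. A hedged sketch of how such a parameter can be parsed into zero-based indices — the helper name and range (`5-7`) support are illustrative assumptions, not necessarily the server's exact behavior:

```python
def parse_pages(pages_str):
    """Parse a 1-based page spec like '1,3,5-7' into zero-based indices.

    Returns None for an empty spec, meaning "all pages".
    """
    if not pages_str:
        return None
    indices = []
    for part in pages_str.split(','):
        part = part.strip()
        if '-' in part:
            # Inclusive 1-based range, e.g. '5-7' -> indices 4, 5, 6
            start, end = part.split('-')
            indices.extend(range(int(start) - 1, int(end)))
        else:
            indices.append(int(part) - 1)
    return indices

print(parse_pages("1,3,5-7"))  # [0, 2, 4, 5, 6]
```

Out-of-range indices would then be filtered against the document's actual page count before use, as the extraction code below does with `0 <= p < total_pages`.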
---

## 💝 **Enterprise Support & Community**

<div align="center">

### **🌟 Join the PDF Intelligence Revolution!**

[](https://github.com/rsp2k/mcp-pdf)
[](https://github.com/rsp2k/mcp-pdf/issues)
[](https://git.supported.systems/MCP/mcp-office-tools)

**💬 Enterprise Support Available** • **🐛 Bug Bounty Program** • **💡 Feature Requests Welcome**

</div>

### **🏢 Enterprise Services**
- **📞 Priority Support**: 24/7 enterprise support available
- **🎓 Training Programs**: Comprehensive team training
- **🔧 Custom Integration**: Tailored enterprise deployments
- **📊 Analytics Dashboard**: Usage analytics and insights
- **🛡️ Security Audits**: Comprehensive security assessments
---

<div align="center">

## 📜 **License & Ecosystem**

**MIT License** - Freedom to innovate everywhere

**🤝 Part of the MCP Document Processing Ecosystem**

*Powered by [FastMCP](https://github.com/jlowin/fastmcp) • [Model Context Protocol](https://modelcontextprotocol.io) • Enterprise Python*

### **🔗 Complete Document Processing Solution**

**PDF Intelligence** ➜ **[MCP PDF](https://github.com/rsp2k/mcp-pdf)** (You are here!)
**Office Intelligence** ➜ **[MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**
**Unified Power** ➜ **Both Tools Together**

---

### **⭐ Star both repositories for the complete solution! ⭐**

**📄 [Star MCP PDF](https://github.com/rsp2k/mcp-pdf)** • **📊 [Star MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**

*Building the future of intelligent document processing* 🚀

</div>
pyproject.toml:

```diff
@@ -1,6 +1,6 @@
 [project]
 name = "mcp-pdf"
-version = "2.0.7"
+version = "2.0.8"
 description = "Secure FastMCP server for comprehensive PDF processing - text extraction, OCR, table extraction, forms, annotations, and more"
 authors = [{name = "Ryan Malloy", email = "ryan@malloys.us"}]
 readme = "README.md"
```
```diff
@@ -382,4 +382,358 @@ class ImageProcessingMixin(MCPMixin):
```

Unchanged context:

```python
        """Simple heuristic to detect if line contains intentional markdown formatting"""
        # Very basic check - could be enhanced
        markdown_patterns = ['# ', '## ', '### ', '* ', '- ', '1. ', '**', '__']
        return any(pattern in line for pattern in markdown_patterns)
```

Added:

```python
    @mcp_tool(
        name="extract_vector_graphics",
        description="Extract vector graphics from PDF to SVG format. Ideal for schematics, charts, and technical drawings."
    )
    async def extract_vector_graphics(
        self,
        pdf_path: str,
        output_directory: Optional[str] = None,
        pages: Optional[str] = None,
        mode: str = "full_page",
        include_text: bool = True,
        simplify_paths: bool = False,
    ) -> Dict[str, Any]:
        """
        Extract vector graphics from PDF pages as SVG files.

        Perfect for extracting:
        - IC functional diagrams from datasheets
        - Frequency response charts and line graphs
        - Package outline drawings (dimensioned technical drawings)
        - Circuit schematics
        - PCB layout diagrams

        Args:
            pdf_path: Path to PDF file or HTTPS URL
            output_directory: Directory to save SVG files (default: temp directory)
            pages: Page numbers to extract (comma-separated, 1-based), None for all
            mode: Extraction mode:
                - "full_page": Complete page as SVG (default, best for general use)
                - "drawings_only": Extract individual vector paths as separate SVG
                - "both": Export both formats for flexibility
            include_text: Whether to include text in SVG output (default: True)
            simplify_paths: Reduce path complexity for smaller files (default: False)

        Returns:
            Dictionary containing extraction summary and SVG file paths
        """
        start_time = time.time()

        try:
            # Validate PDF path
            input_pdf_path = await validate_pdf_path(pdf_path)

            # Setup output directory
            if output_directory:
                output_dir = validate_output_path(output_directory)
                output_dir.mkdir(parents=True, exist_ok=True)
            else:
                output_dir = Path(tempfile.mkdtemp(prefix="pdf_vectors_"))

            # Parse pages parameter
            parsed_pages = parse_pages_parameter(pages)

            # Validate mode
            valid_modes = ["full_page", "drawings_only", "both"]
            if mode not in valid_modes:
                return {
                    "success": False,
                    "error": f"Invalid mode '{mode}'. Valid modes: {', '.join(valid_modes)}",
                    "extraction_time": round(time.time() - start_time, 2)
                }

            # Open PDF document
            doc = fitz.open(str(input_pdf_path))
            total_pages = len(doc)

            # Determine pages to process
            pages_to_process = parsed_pages if parsed_pages else list(range(total_pages))
            pages_to_process = [p for p in pages_to_process if 0 <= p < total_pages]

            if not pages_to_process:
                doc.close()
                return {
                    "success": False,
                    "error": "No valid pages specified",
                    "extraction_time": round(time.time() - start_time, 2)
                }

            svg_files = []
            total_size = 0
            base_name = input_pdf_path.stem

            for page_num in pages_to_process:
                try:
                    page = doc[page_num]
                    page_results = {}

                    # Full page SVG extraction
                    if mode in ["full_page", "both"]:
                        svg_content = page.get_svg_image(
                            text_as_path=not include_text
                        )

                        # Optionally simplify paths (basic implementation)
                        if simplify_paths:
                            svg_content = self._simplify_svg_paths(svg_content)

                        filename = f"{base_name}_page_{page_num + 1}.svg"
                        output_path = output_dir / filename

                        with open(output_path, 'w', encoding='utf-8') as f:
                            f.write(svg_content)

                        file_size = output_path.stat().st_size
                        total_size += file_size

                        page_results["full_page"] = {
                            "filename": filename,
                            "path": str(output_path),
                            "size_bytes": file_size,
                            "size_kb": round(file_size / 1024, 1)
                        }

                    # Individual drawings extraction
                    if mode in ["drawings_only", "both"]:
                        drawings = page.get_drawings()
                        drawing_count = len(drawings)

                        if drawing_count > 0:
                            # Convert drawings to SVG
                            drawings_svg = self._drawings_to_svg(
                                drawings,
                                page.rect.width,
                                page.rect.height
                            )

                            filename = f"{base_name}_page_{page_num + 1}_drawings.svg"
                            output_path = output_dir / filename

                            with open(output_path, 'w', encoding='utf-8') as f:
                                f.write(drawings_svg)

                            file_size = output_path.stat().st_size
                            total_size += file_size

                            page_results["drawings_only"] = {
                                "filename": filename,
                                "path": str(output_path),
                                "size_bytes": file_size,
                                "size_kb": round(file_size / 1024, 1),
                                "drawing_count": drawing_count
                            }
                        else:
                            page_results["drawings_only"] = {
                                "skipped": True,
                                "reason": "No vector drawings found on page"
                            }

                    # Get drawing statistics for the page
                    all_drawings = page.get_drawings()

                    svg_files.append({
                        "page": page_num + 1,
                        "has_text": bool(page.get_text().strip()),
                        "drawing_count": len(all_drawings),
                        **page_results
                    })

                except Exception as e:
                    logger.warning(f"Failed to extract vectors from page {page_num + 1}: {e}")
                    svg_files.append({
                        "page": page_num + 1,
                        "error": sanitize_error_message(str(e))
                    })

            doc.close()

            # Count successful extractions
            successful_pages = sum(1 for f in svg_files if "error" not in f)

            return {
                "success": True,
                "extraction_summary": {
                    "pages_processed": len(pages_to_process),
                    "pages_successful": successful_pages,
                    "mode": mode,
                    "total_size_bytes": total_size,
                    "total_size_kb": round(total_size / 1024, 1),
                    "output_directory": str(output_dir)
                },
                "svg_files": svg_files,
                "settings": {
                    "include_text": include_text,
                    "simplify_paths": simplify_paths,
                    "mode": mode
                },
                "file_info": {
                    "input_path": str(input_pdf_path),
                    "total_pages": total_pages,
                    "pages_processed": pages or "all"
                },
                "extraction_time": round(time.time() - start_time, 2),
                "hints": {
                    "viewing": "Open SVG files in browser, Inkscape, or Illustrator for editing",
                    "full_page_vs_drawings": "full_page preserves layout; drawings_only extracts raw vector paths"
                }
            }

        except Exception as e:
            error_msg = sanitize_error_message(str(e))
            logger.error(f"Vector graphics extraction failed: {error_msg}")
            return {
                "success": False,
                "error": error_msg,
                "extraction_time": round(time.time() - start_time, 2)
            }

    def _drawings_to_svg(
        self,
        drawings: List[Dict],
        width: float,
        height: float
    ) -> str:
        """
        Convert PyMuPDF drawings to standalone SVG.

        Drawings contain: rect, items (path operations), color, fill, width, etc.
        """
        svg_parts = [
            '<?xml version="1.0" encoding="UTF-8"?>',
            '<svg xmlns="http://www.w3.org/2000/svg" ',
            f'viewBox="0 0 {width:.2f} {height:.2f}" ',
            f'width="{width:.2f}" height="{height:.2f}">',
            '',
            '  <!-- Extracted vector drawings from PDF -->',
            ''
        ]

        for idx, drawing in enumerate(drawings):
            try:
                path_data = self._drawing_to_path(drawing)
                if not path_data:
                    continue

                # Extract style attributes
                stroke_color = self._color_to_svg(drawing.get('color'))
                fill_color = self._color_to_svg(drawing.get('fill'))
                stroke_width = drawing.get('width', 1)

                # Build style string
                style_parts = []
                if fill_color:
                    style_parts.append(f'fill:{fill_color}')
                else:
                    style_parts.append('fill:none')

                if stroke_color:
                    style_parts.append(f'stroke:{stroke_color}')
                    style_parts.append(f'stroke-width:{stroke_width:.2f}')

                style = ';'.join(style_parts)

                svg_parts.append(f'  <path d="{path_data}" style="{style}" />')

            except Exception as e:
                logger.debug(f"Failed to convert drawing {idx}: {e}")
                continue

        svg_parts.append('</svg>')
        return '\n'.join(svg_parts)

    def _drawing_to_path(self, drawing: Dict) -> Optional[str]:
        """Convert a single drawing to SVG path data string."""
        items = drawing.get('items', [])
        if not items:
            return None

        path_parts = []

        for item in items:
            if not item:
                continue

            # Item format: (type, points...)
            item_type = item[0]

            try:
                if item_type == 'l':  # Line
                    # ('l', Point, Point)
                    p1, p2 = item[1], item[2]
                    path_parts.append(f'M {p1.x:.2f} {p1.y:.2f}')
                    path_parts.append(f'L {p2.x:.2f} {p2.y:.2f}')

                elif item_type == 're':  # Rectangle
                    # ('re', Rect)
                    rect = item[1]
                    path_parts.append(f'M {rect.x0:.2f} {rect.y0:.2f}')
                    path_parts.append(f'L {rect.x1:.2f} {rect.y0:.2f}')
                    path_parts.append(f'L {rect.x1:.2f} {rect.y1:.2f}')
                    path_parts.append(f'L {rect.x0:.2f} {rect.y1:.2f}')
                    path_parts.append('Z')

                elif item_type == 'qu':  # Quad (4-point polygon)
                    # ('qu', Quad)
                    quad = item[1]
                    path_parts.append(f'M {quad.ul.x:.2f} {quad.ul.y:.2f}')
                    path_parts.append(f'L {quad.ur.x:.2f} {quad.ur.y:.2f}')
                    path_parts.append(f'L {quad.lr.x:.2f} {quad.lr.y:.2f}')
                    path_parts.append(f'L {quad.ll.x:.2f} {quad.ll.y:.2f}')
                    path_parts.append('Z')

                elif item_type == 'c':  # Cubic bezier curve
                    # ('c', Point, Point, Point, Point) - start, ctrl1, ctrl2, end
                    p0, p1, p2, p3 = item[1], item[2], item[3], item[4]
                    if not path_parts or not path_parts[-1].startswith('M'):
                        path_parts.append(f'M {p0.x:.2f} {p0.y:.2f}')
                    path_parts.append(f'C {p1.x:.2f} {p1.y:.2f} {p2.x:.2f} {p2.y:.2f} {p3.x:.2f} {p3.y:.2f}')

            except (IndexError, AttributeError) as e:
                logger.debug(f"Failed to process drawing item {item_type}: {e}")
                continue

        return ' '.join(path_parts) if path_parts else None

    def _color_to_svg(self, color) -> Optional[str]:
        """Convert PyMuPDF color to SVG color string."""
        if color is None:
            return None

        if isinstance(color, (list, tuple)):
            if len(color) == 3:
                r, g, b = [int(c * 255) for c in color]
                return f'rgb({r},{g},{b})'
            elif len(color) == 1:
                # Grayscale
                gray = int(color[0] * 255)
                return f'rgb({gray},{gray},{gray})'
            elif len(color) == 4:
                # CMYK - convert to RGB (simplified)
                c, m, y, k = color
                r = int(255 * (1 - c) * (1 - k))
                g = int(255 * (1 - m) * (1 - k))
                b = int(255 * (1 - y) * (1 - k))
                return f'rgb({r},{g},{b})'

        return None

    def _simplify_svg_paths(self, svg_content: str) -> str:
        """
        Basic SVG path simplification.
        Reduces decimal precision to shrink file size.
        """
        import re

        # Reduce decimal precision in path data
        def reduce_precision(match):
            num = float(match.group())
            return f'{num:.1f}'

        # Match floating point numbers in SVG
        simplified = re.sub(r'-?\d+\.\d{3,}', reduce_precision, svg_content)

        return simplified
```
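The path construction and CMYK handling above can be checked in isolation. A standalone sketch using plain numbers in place of PyMuPDF `Point`/`Rect` objects — these helper names are hypothetical mirrors of the `'re'` branch of `_drawing_to_path` and the CMYK branch of `_color_to_svg`, not part of the shipped code:

```python
def rect_to_path(x0, y0, x1, y1):
    """Mirror of the 're' branch: a rectangle as closed SVG path data."""
    return (f'M {x0:.2f} {y0:.2f} L {x1:.2f} {y0:.2f} '
            f'L {x1:.2f} {y1:.2f} L {x0:.2f} {y1:.2f} Z')


def cmyk_to_rgb(c, m, y, k):
    """Mirror of the simplified CMYK-to-RGB conversion in _color_to_svg."""
    r = int(255 * (1 - c) * (1 - k))
    g = int(255 * (1 - m) * (1 - k))
    b = int(255 * (1 - y) * (1 - k))
    return f'rgb({r},{g},{b})'


print(rect_to_path(0, 0, 10, 5))  # M 0.00 0.00 L 10.00 0.00 L 10.00 5.00 L 0.00 5.00 Z
print(cmyk_to_rgb(0, 0, 0, 0))    # rgb(255,255,255) - no ink = white
print(cmyk_to_rgb(0, 0, 0, 1))    # rgb(0,0,0) - full key = black
```

This conversion ignores ICC profiles, which is usually acceptable for the line art and schematics the tool targets.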
uv.lock (generated, 2 lines changed):

```diff
@@ -1032,7 +1032,7 @@ wheels = [
 [[package]]
 name = "mcp-pdf"
-version = "2.0.7"
+version = "2.0.8"
 source = { editable = "." }
 dependencies = [
     { name = "camelot-py", extra = ["cv"] },
```