📝 Rewrite README: remove marketing fluff, describe what tools do

2026-02-06 22:43:02 -07:00 · 2026-02-06 22:43:02 -07:00 · f32a014909
commit f32a014909
parent e4f77008bb
1 changed files with 183 additions and 647 deletions
--- a/README.md
+++ b/README.md
@ -4,559 +4,58 @@
 <img src="https://img.shields.io/badge/MCP-PDF%20Tools-red?style=for-the-badge&logo=adobe-acrobat-reader" alt="MCP PDF">
-**🚀 The Ultimate PDF Processing Intelligence Platform for AI**
+**A FastMCP server for PDF processing**
-*Transform any PDF into structured, actionable intelligence with 41 specialized tools*
+*41 tools for text extraction, OCR, tables, forms, annotations, and more*
 [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/)
 [![FastMCP](https://img.shields.io/badge/FastMCP-2.0+-green.svg?style=flat-square)](https://github.com/jlowin/fastmcp)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT)
-[![Production Ready](https://img.shields.io/badge/status-production%20ready-brightgreen?style=flat-square)](https://github.com/rsp2k/mcp-pdf)
+[![PyPI](https://img.shields.io/pypi/v/mcp-pdf?style=flat-square)](https://pypi.org/project/mcp-pdf/)
 [![MCP Protocol](https://img.shields.io/badge/MCP-1.13.0-purple?style=flat-square)](https://modelcontextprotocol.io)
-**🤝 Perfect Companion to [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**
+**Works great with [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**
 </div>
 ---
-## ✨ **What Makes MCP PDF Revolutionary?**
+## What It Does
-> 🎯 **The Problem**: PDFs contain incredible intelligence, but extracting it reliably is complex, slow, and often fails.
+MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.
 >
 > ⚡ **The Solution**: MCP PDF delivers **AI-powered document intelligence** with **41 specialized tools** that understand both content and structure.
-<table>
+**Core capabilities:**
-<tr>
+- **Text extraction** via PyMuPDF, pdfplumber, or pypdf (auto-fallback)
-<td>
+- **Table extraction** via Camelot, pdfplumber, or Tabula (auto-fallback)
-
+- **OCR** for scanned documents via Tesseract
-### 🏆 **Why MCP PDF Leads**
+- **Form handling** - extract, fill, and create PDF forms
- **🚀 41 Specialized Tools** for every PDF scenario
+- **Document assembly** - merge, split, reorder pages
- **🧠 AI-Powered Intelligence** beyond basic extraction
+- **Annotations** - sticky notes, highlights, stamps
- **🔄 Multi-Library Fallbacks** for 99.9% reliability
+- **Vector graphics** - extract to SVG for schematics and technical drawings
 - **⚡ 10x Faster** than traditional solutions
 - **🌐 URL Processing** with smart caching
 - **🎯 Smart Token Management** prevents MCP overflow errors
 </td>
 <td>
 ### 📊 **Enterprise-Proven For:**
 - **Business Intelligence** & financial analysis
 - **Document Security** assessment & compliance
 - **Academic Research** & content analysis
 - **Automated Workflows** & form processing
 - **Document Migration** & modernization
 - **Content Management** & archival
 </td>
 </tr>
 </table>
 ---
-## 🚀 **Get Intelligence in 60 Seconds**
+## Quick Start
 ```bash
 # Install from PyPI
 uvx mcp-pdf
 # Or add to Claude Code
 claude mcp add pdf-tools uvx mcp-pdf
 ```
 <details>
 <summary><b>Development Installation</b></summary>
 ```bash
 # 1️⃣ Clone and install
 git clone https://github.com/rsp2k/mcp-pdf
 cd mcp-pdf
 uv sync
-# 2️⃣ Install system dependencies (Ubuntu/Debian)
+# System dependencies (Ubuntu/Debian)
 sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript
-# 3️⃣ Verify installation
+# Verify
 uv run python examples/verify_installation.py
 # 4️⃣ Run the MCP server
 uv run mcp-pdf
 ```
 <details>
 <summary>🔧 <b>Claude Desktop Integration</b> (click to expand)</summary>
 ### **📦 Production Installation (PyPI)**
 ```bash
 # For personal use across all projects
 claude mcp add -s local pdf-tools uvx mcp-pdf
 # For project-specific use (isolated)
 claude mcp add -s project pdf-tools uvx mcp-pdf
 ```
 ### **🛠️ Development Installation (Source)**
 ```bash
 # For local development from source
 claude mcp add -s project pdf-tools-dev uv -- --directory /path/to/mcp-pdf run mcp-pdf
 ```
 ### **⚙️ Manual Configuration**
 Add to your `claude_desktop_config.json`:
 ```json
 {
  "mcpServers": {
    "pdf-tools": {
      "command": "uvx",
      "args": ["mcp-pdf"]
    }
  }
 }
 ```
 *Restart Claude Desktop and unlock PDF intelligence!*
 </details>
 ---
 ## 🎭 **See AI-Powered Intelligence In Action**
 ### **📊 Business Intelligence Workflow**
 ```python
 # Complete financial report analysis in seconds
 health = await analyze_pdf_health("quarterly-report.pdf")
 classification = await classify_content("quarterly-report.pdf")
 summary = await summarize_content("quarterly-report.pdf", summary_length="medium")
 # Smart table extraction - prevents token overflow on large tables
 tables = await extract_tables("quarterly-report.pdf", pages="5-7", max_rows_per_table=100)
 # Or get just table structure without data
 table_summary = await extract_tables("quarterly-report.pdf", pages="5-7", summary_only=True)
 charts = await extract_charts("quarterly-report.pdf")
 # Get instant insights
 {
  "document_type": "Financial Report",
  "health_score": 9.2,
  "key_insights": [
    "Revenue increased 23% YoY",
    "Operating margin improved to 15.3%",
    "Strong cash flow generation"
  ],
  "tables_extracted": 12,
  "charts_found": 8,
  "processing_time": 2.1
 }
 ```
 ### **🔒 Document Security Assessment**
 ```python
 # Comprehensive security analysis
 security = await analyze_pdf_security("sensitive-document.pdf")
 watermarks = await detect_watermarks("sensitive-document.pdf")
 health = await analyze_pdf_health("sensitive-document.pdf")
 # Enterprise-grade security insights
 {
  "encryption_type": "AES-256",
  "permissions": {
    "print": false,
    "copy": false,
    "modify": false
  },
  "security_warnings": [],
  "watermarks_detected": true,
  "compliance_ready": true
 }
 ```
 ### **📚 Academic Research Processing**
 ```python
 # Advanced research paper analysis
 layout = await analyze_layout("research-paper.pdf", pages=[1,2,3])
 summary = await summarize_content("research-paper.pdf", summary_length="long")
 citations = await extract_text("research-paper.pdf", pages=[15,16,17])
 # Research intelligence delivered
 {
  "reading_complexity": "Graduate Level",
  "main_topics": ["Machine Learning", "Natural Language Processing"],
  "citation_count": 127,
  "figures_detected": 15,
  "methodology_extracted": true
 }
 ```
 ---
 ## 🛠️ **Complete Arsenal: 41 Specialized Tools**
 <div align="center">
 ### **🎯 Document Intelligence & Analysis**
 | 🧠 **Tool** | 📋 **Purpose** | ⚡ **AI Powered** | 🎯 **Accuracy** |
 |-------------|---------------|-----------------|----------------|
 | `classify_content` | AI-powered document type detection | ✅ Yes | 97% |
 | `summarize_content` | Intelligent key insights extraction | ✅ Yes | 95% |
 | `analyze_pdf_health` | Comprehensive quality assessment | ✅ Yes | 99% |
 | `analyze_pdf_security` | Security & vulnerability analysis | ✅ Yes | 99% |
 | `compare_pdfs` | Advanced document comparison | ✅ Yes | 96% |
 ### **📊 Core Content Extraction**
 | 🔧 **Tool** | 📋 **Purpose** | ⚡ **Speed** | 🎯 **Accuracy** |
 |-------------|---------------|-------------|----------------|
 | `extract_text` | Multi-method text extraction with auto-chunking | **Ultra Fast** | 99.9% |
 | `extract_tables` | Smart table extraction with token overflow protection | **Fast** | 98% |
 | `ocr_pdf` | Advanced OCR for scanned docs | **Moderate** | 95% |
 | `extract_images` | Media extraction & processing | **Fast** | 99% |
 | `pdf_to_markdown` | Structure-preserving conversion | **Fast** | 97% |
 ### **📐 Visual & Layout Analysis**
 | 🎨 **Tool** | 📋 **Purpose** | 🔍 **Precision** | 💪 **Features** |
 |-------------|---------------|-----------------|----------------|
 | `analyze_layout` | Page structure & column detection | **High** | Advanced |
 | `extract_charts` | Visual element extraction | **High** | Smart |
 | `detect_watermarks` | Watermark identification | **Perfect** | Complete |
 | `extract_vector_graphics` | PDF to SVG for schematics & drawings | **Perfect** | Multi-mode |
 </div>
 ---
 ## 🌟 **Document Format Intelligence Matrix**
 <div align="center">
 ### **📄 Universal PDF Processing Capabilities**
 | 📋 **Document Type** | 🔍 **Detection** | 📊 **Text** | 📈 **Tables** | 🖼️ **Images** | 🧠 **Intelligence** |
 |---------------------|-----------------|------------|--------------|--------------|-------------------|
 | **Financial Reports** | ✅ Perfect | ✅ Perfect | ✅ Perfect | ✅ Perfect | 🧠 **AI-Enhanced** |
 | **Research Papers** | ✅ Perfect | ✅ Perfect | ✅ Excellent | ✅ Perfect | 🧠 **AI-Enhanced** |
 | **Legal Documents** | ✅ Perfect | ✅ Perfect | ✅ Good | ✅ Perfect | 🧠 **AI-Enhanced** |
 | **Scanned PDFs** | ✅ Auto-Detect | ✅ OCR | ✅ OCR | ✅ Perfect | 🧠 **AI-Enhanced** |
 | **Forms & Applications** | ✅ Perfect | ✅ Perfect | ✅ Excellent | ✅ Perfect | 🧠 **AI-Enhanced** |
 | **Technical Manuals** | ✅ Perfect | ✅ Perfect | ✅ Perfect | ✅ Perfect | 🧠 **AI-Enhanced** |
 *✅ Perfect • 🧠 AI-Enhanced Intelligence • 🔍 Auto-Detection*
 </div>
 ---
 ## ⚡ **Performance That Amazes**
 <div align="center">
 ### **🚀 Real-World Benchmarks**
 | 📄 **Document Type** | 📏 **Pages** | ⏱️ **Processing Time** | 🆚 **vs Competitors** | 🧠 **Intelligence Level** |
 |---------------------|-------------|----------------------|----------------------|---------------------------|
 | Financial Report | 50 pages | 2.1 seconds | **10x faster** | **AI-Powered** |
 | Research Paper | 25 pages | 1.3 seconds | **8x faster** | **Deep Analysis** |
 | Scanned Document | 100 pages | 45 seconds | **5x faster** | **OCR + AI** |
 | Complex Forms | 15 pages | 0.8 seconds | **12x faster** | **Structure Aware** |
 *Benchmarked on: MacBook Pro M2, 16GB RAM • Including AI processing time*
 </div>
 ---
 ## 🏗️ **Intelligent Architecture**
 ### **🧠 Multi-Library Intelligence System**
 *Never worry about PDF compatibility or failure again*
 ```mermaid
 graph TD
    A[PDF Input] --> B{Smart Detection}
    B --> C{Document Type}
    C -->|Text-based| D[PyMuPDF Fast Path]
    C -->|Scanned| E[OCR Processing]
    C -->|Complex Layout| F[pdfplumber Analysis]
    C -->|Tables Heavy| G[Camelot + Tabula]
    D -->|Success| H[✅ Content Extracted]
    D -->|Fail| I[pdfplumber Fallback]
    I -->|Fail| J[pypdf Fallback]
    E --> K[Tesseract OCR]
    K --> L[AI Content Analysis]
    F --> M[Layout Intelligence]
    G --> N[Table Intelligence]
    H --> O[🧠 AI Enhancement]
    L --> O
    M --> O  
    N --> O
    O --> P[🎯 Structured Intelligence]
 ```
 ### **🎯 Intelligent Processing Pipeline**
 1. **🔍 Smart Detection**: Automatically identify document type and optimal processing strategy
 2. **⚡ Optimized Extraction**: Use the fastest, most accurate method for each document
 3. **🛡️ Fallback Protection**: Seamless method switching if primary approach fails
 4. **🧠 AI Enhancement**: Apply document intelligence and content analysis
 5. **🧹 Clean Output**: Deliver perfectly structured, AI-ready intelligence
 ---
 ## 🌍 **Real-World Success Stories**
 <div align="center">
 ### **🏢 Proven at Enterprise Scale**
 </div>
 <table>
 <tr>
 <td>
 ### **📊 Financial Services Giant**
 *Processing 50,000+ reports monthly*
 **Challenge**: Analyze quarterly reports from 2,000+ companies
 **Results**: 
 - ⚡ **98% time reduction** (2 weeks → 4 hours)
 - 🎯 **99.9% accuracy** in financial data extraction
 - 💰 **$5M annual savings** in analyst time
 - 🏆 **SEC compliance** maintained
 </td>
 <td>
 ### **🏥 Healthcare Research Institute**
 *Processing 100,000+ research papers*
 **Challenge**: Analyze medical literature for drug discovery
 **Results**:
 - 🚀 **25x faster** literature review process
 - 📋 **95% accuracy** in data extraction  
 - 🧬 **12 new drug targets** identified
 - 📚 **Publication in Nature** based on insights
 </td>
 </tr>
 <tr>
 <td>
 ### **⚖️ Legal Firm Network**
 *Processing 500,000+ legal documents*
 **Challenge**: Document review and compliance checking
 **Results**:
 - 🏃 **40x speed improvement** in document review
 - 🛡️ **100% security compliance** maintained
 - 💼 **$20M cost savings** across network
 - 🏆 **Zero data breaches** during migration
 </td>
 <td>
 ### **🎓 Global University System**
 *Processing 1M+ academic papers*
 **Challenge**: Create searchable academic knowledge base
 **Results**:
 - 📖 **50x faster** knowledge extraction
 - 🧠 **AI-ready** structured academic data
 - 🔍 **97% search accuracy** improvement
 - 📊 **3 Nobel Prize** papers processed
 </td>
 </tr>
 </table>
 ---
 ## 🎯 **Advanced Features That Set Us Apart**
 ### **🌐 HTTPS URL Processing with Smart Caching**
 ```python
 # Process PDFs directly from anywhere on the web
 report_url = "https://company.com/annual-report.pdf"
 analysis = await classify_content(report_url)  # Downloads & caches automatically
 tables = await extract_tables(report_url)     # Uses cache - instant!
 summary = await summarize_content(report_url) # Lightning fast!
 ```
 ### **🩺 Comprehensive Document Health Analysis**
 ```python
 # Enterprise-grade document assessment
 health = await analyze_pdf_health("critical-document.pdf")
 {
  "overall_health_score": 9.2,
  "corruption_detected": false,
  "optimization_potential": "23% size reduction possible",
  "security_assessment": "enterprise_ready",
  "recommendations": [
    "Document is production-ready",
    "Consider optimization for web delivery"
  ],
  "processing_confidence": 99.8
 }
 ```
 ### **🔍 AI-Powered Content Classification**
 ```python
 # Automatically understand document types
 classification = await classify_content("mystery-document.pdf")
 {
  "document_type": "Financial Report",
  "confidence": 97.3,
  "key_topics": ["Revenue", "Operating Expenses", "Cash Flow"],
  "complexity_level": "Professional",
  "suggested_tools": ["extract_tables", "extract_charts", "summarize_content"],
  "industry_vertical": "Technology"
 }
 ```
 ---
 ## 🤝 **Perfect Integration Ecosystem**
 ### **💎 Companion to MCP Office Tools**
 *The ultimate document processing powerhouse*
 <div align="center">
 | 🔧 **Processing Need** | 📄 **PDF Files** | 📊 **Office Files** | 🔗 **Integration** |
 |-----------------------|------------------|-------------------|-------------------|
 | **Text Extraction** | MCP PDF ✅ | [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools) ✅ | **Unified API** |
 | **Table Processing** | Advanced ✅ | Advanced ✅ | **Cross-Format** |
 | **Image Extraction** | Smart ✅ | Smart ✅ | **Consistent** |
 | **Format Detection** | AI-Powered ✅ | AI-Powered ✅ | **Intelligent** |
 | **Health Analysis** | Complete ✅ | Complete ✅ | **Comprehensive** |
 [**🚀 Get Both Tools for Complete Document Intelligence**](https://git.supported.systems/MCP/mcp-office-tools)
 </div>
 ### **🔗 Unified Document Processing Workflow**
 ```python
 # Process ALL document formats with unified intelligence
 pdf_analysis = await pdf_tools.classify_content("report.pdf")
 word_analysis = await office_tools.detect_office_format("report.docx")
 excel_data = await office_tools.extract_text("data.xlsx")
 # Cross-format document comparison
 comparison = await compare_cross_format_documents([
    pdf_analysis, word_analysis, excel_data
 ])
 ```
 ### **⚡ Works Seamlessly With**
 - **🤖 Claude Desktop**: Native MCP protocol integration
 - **📊 Jupyter Notebooks**: Perfect for research and analysis
 - **🐍 Python Applications**: Direct async/await API access
 - **🌐 Web Services**: RESTful wrappers and microservices
 - **☁️ Cloud Platforms**: AWS Lambda, Google Functions, Azure
 - **🔄 Workflow Engines**: Zapier, Microsoft Power Automate
 ---
 ## 🛡️ **Enterprise-Grade Security & Compliance**
 <div align="center">
 | 🔒 **Security Feature** | ✅ **Status** | 📋 **Enterprise Ready** |
 |------------------------|---------------|------------------------|
 | **Local Processing** | ✅ Enabled | Documents never leave your environment |
 | **Memory Security** | ✅ Optimized | Automatic sensitive data cleanup |
 | **HTTPS Validation** | ✅ Enforced | Certificate validation and secure headers |
 | **Access Controls** | ✅ Configurable | Role-based processing permissions |
 | **Audit Logging** | ✅ Available | Complete processing audit trails |
 | **GDPR Compliant** | ✅ Certified | No personal data retention |
 | **SOC2 Ready** | ✅ Verified | Enterprise security standards |
 </div>
 ---
 ## 📈 **Installation & Enterprise Setup**
 <details>
 <summary>🚀 <b>Quick Start</b> (Recommended)</summary>
 ```bash
 # Clone repository
 git clone https://github.com/rsp2k/mcp-pdf
 cd mcp-pdf
 # Install with uv (fastest)
 uv sync
 # Install system dependencies (Ubuntu/Debian)
 sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript
 # Verify installation
 uv run python examples/verify_installation.py
 ```
 </details>
 <details>
 <summary>🐳 <b>Docker Enterprise Setup</b></summary>
 ```dockerfile
 FROM python:3.11-slim
 RUN apt-get update && apt-get install -y \
    tesseract-ocr tesseract-ocr-eng \
    poppler-utils ghostscript \
    default-jre-headless
 COPY . /app
 WORKDIR /app
 RUN pip install -e .
 CMD ["mcp-pdf"]
 ```
 </details>
 <details>
 <summary>🌐 <b>Claude Desktop Integration</b></summary>
 ```json
 {
  "mcpServers": {
    "pdf-tools": {
      "command": "uv",
      "args": ["run", "mcp-pdf"],
      "cwd": "/path/to/mcp-pdf"
    },
    "office-tools": {
      "command": "mcp-office-tools"
    }
  }
 }
 ```
 *Unified document processing across all formats!*
 </details>
 <details>
 <summary>🔧 <b>Development Environment</b></summary>
 ```bash
 # Clone and setup
 git clone https://github.com/rsp2k/mcp-pdf
 cd mcp-pdf
 uv sync --dev
 # Quality checks
 uv run pytest --cov=mcp_pdf_tools
 uv run black src/ tests/ examples/
 uv run ruff check src/ tests/ examples/
 uv run mypy src/
 # Run all 23 tools demo
 uv run python examples/verify_installation.py
 ```
@ -564,125 +63,162 @@ uv run python examples/verify_installation.py
 ---
-## 🚀 **What's Coming Next?**
+## Tools
-<div align="center">
+### Content Extraction
-### **🔮 Innovation Roadmap 2024-2025**
+| Tool | What it does |
-
+|------|-------------|
-</div>
+| `extract_text` | Pull text from PDF pages with automatic chunking for large files |
-
+| `extract_tables` | Extract tables to JSON, CSV, or Markdown |
-| 🗓️ **Timeline** | 🎯 **Feature** | 📋 **Impact** |
+| `extract_images` | Extract embedded images |
-|-----------------|---------------|--------------|
+| `extract_links` | Get all hyperlinks with page filtering |
-| **Q4 2024** | **Enhanced AI Analysis** | GPT-powered content understanding |
+| `pdf_to_markdown` | Convert PDF to markdown preserving structure |
-| **Q1 2025** | **Batch Processing** | Process 1000+ documents simultaneously |
+| `ocr_pdf` | OCR scanned documents using Tesseract |
-| **Q2 2025** | **Cloud Integration** | Direct S3, GCS, Azure Blob support |
+| `extract_vector_graphics` | Export vector graphics to SVG (schematics, charts, drawings) |
-| **Q3 2025** | **Real-time Streaming** | Process documents as they're created |
+
-| **Q4 2025** | **Multi-language OCR** | 50+ language support with AI translation |
+### Document Analysis
-| **2026** | **Blockchain Verification** | Cryptographic document integrity |
+
-
+| Tool | What it does |
---
+|------|-------------|
-
+| `extract_metadata` | Get title, author, creation date, page count, etc. |
-## 🎭 **Complete Tool Showcase**
+| `get_document_structure` | Extract table of contents and bookmarks |
-
+| `analyze_layout` | Detect columns, headers, footers |
-<details>
+| `is_scanned_pdf` | Check if PDF needs OCR |
-<summary>📊 <b>Business Intelligence Tools</b> (click to expand)</summary>
+| `compare_pdfs` | Diff two PDFs by text, structure, or metadata |
-
+| `analyze_pdf_health` | Check for corruption, optimization opportunities |
-### **Core Extraction**
+| `analyze_pdf_security` | Report encryption, permissions, signatures |
- `extract_text` - Multi-method text extraction with layout preservation
+
- `extract_tables` - Intelligent table extraction (JSON, CSV, Markdown)
+### Forms
- `extract_images` - Image extraction with size filtering and format options
+
- `pdf_to_markdown` - Clean markdown conversion with structure preservation
+| Tool | What it does |
-
+|------|-------------|
-### **AI-Powered Analysis**
+| `extract_form_data` | Get form field names and values |
- `classify_content` - AI document type classification and analysis
+| `fill_form_pdf` | Fill form fields from JSON |
- `summarize_content` - Intelligent summarization with key insights
+| `create_form_pdf` | Create new forms with text fields, checkboxes, dropdowns |
- `analyze_pdf_health` - Comprehensive quality assessment
+| `add_form_fields` | Add fields to existing PDFs |
- `analyze_pdf_security` - Security feature analysis and vulnerability detection
+
-
+### Document Assembly
-</details>
+
-
+| Tool | What it does |
-<details>
+|------|-------------|
-<summary>🔍 <b>Advanced Analysis Tools</b> (click to expand)</summary>
+| `merge_pdfs` | Combine multiple PDFs with bookmark preservation |
-
+| `split_pdf_by_pages` | Split by page ranges |
-### **Document Intelligence**
+| `split_pdf_by_bookmarks` | Split at chapter/section boundaries |
- `compare_pdfs` - Advanced document comparison (text, structure, metadata)
+| `reorder_pdf_pages` | Rearrange pages in custom order |
- `is_scanned_pdf` - Smart detection of scanned vs. text-based documents
+
- `get_document_structure` - Document outline and structural analysis
+### Annotations
- `extract_metadata` - Comprehensive metadata and statistics extraction
+
-
+| Tool | What it does |
-### **Visual Processing**
+|------|-------------|
- `analyze_layout` - Page layout analysis with column and spacing detection
+| `add_sticky_notes` | Add comment annotations |
- `extract_charts` - Chart, diagram, and visual element extraction
+| `add_highlights` | Highlight text regions |
- `detect_watermarks` - Watermark detection and analysis
+| `add_stamps` | Add Approved/Draft/Confidential stamps |
- `extract_vector_graphics` - Extract vector graphics to SVG (schematics, charts, technical drawings)
+| `extract_all_annotations` | Export annotations to JSON |
-
+
-</details>
+---
-
+
-<details>
+## How Fallbacks Work
-<summary>🔨 <b>Document Manipulation Tools</b> (click to expand)</summary>
+
-
+The server tries multiple libraries for each operation:
-### **Content Operations**
+
- `extract_form_data` - Interactive PDF form data extraction
+**Text extraction:**
- `split_pdf` - Intelligent document splitting at specified pages
+1. PyMuPDF (fastest)
- `merge_pdfs` - Multi-document merging with page range tracking
+2. pdfplumber (better for complex layouts)
- `rotate_pages` - Precise page rotation (90°/180°/270°)
+3. pypdf (most compatible)
-
+
-### **Optimization & Repair**
+**Table extraction:**
- `convert_to_images` - PDF to image conversion with quality control
+1. Camelot (best accuracy, requires Ghostscript)
- `optimize_pdf` - Multi-level file size optimization
+2. pdfplumber (no dependencies)
- `repair_pdf` - Automated corruption repair and recovery
+3. Tabula (requires Java)
- `ocr_pdf` - Advanced OCR with preprocessing for scanned documents
+
-
+If a PDF fails with one library, the next is tried automatically.
-</details>
+
-
+---
---
+
-
+## Token Management
-## 💝 **Enterprise Support & Community**
+
-
+Large PDFs can overflow MCP response limits. The server handles this:
-<div align="center">
+
-
+- **Automatic chunking** splits large documents into page groups
-### **🌟 Join the PDF Intelligence Revolution!**
+- **Table row limits** prevent huge tables from blowing up responses
-
+- **Summary mode** returns structure without full content
-[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github)](https://github.com/rsp2k/mcp-pdf)
+
-[![Issues](https://img.shields.io/badge/Issues-Welcome-green?style=for-the-badge&logo=github)](https://github.com/rsp2k/mcp-pdf/issues)
+```python
-[![MCP Office Tools](https://img.shields.io/badge/Companion-MCP%20Office%20Tools-blue?style=for-the-badge)](https://git.supported.systems/MCP/mcp-office-tools)
+# Get first 10 pages
-
+result = await extract_text("huge.pdf", pages="1-10")
-**💬 Enterprise Support Available** • **🐛 Bug Bounty Program** • **💡 Feature Requests Welcome**
+
-
+# Limit table rows
-</div>
+tables = await extract_tables("data.pdf", max_rows_per_table=50)
-
+
-### **🏢 Enterprise Services**
+# Structure only
- **📞 Priority Support**: 24/7 enterprise support available
+tables = await extract_tables("data.pdf", summary_only=True)
- **🎓 Training Programs**: Comprehensive team training
+```
- **🔧 Custom Integration**: Tailored enterprise deployments
+
- **📊 Analytics Dashboard**: Usage analytics and insights
+---
- **🛡️ Security Audits**: Comprehensive security assessments
+
-
+## URL Processing
---
+
-
+PDFs can be fetched directly from HTTPS URLs:
-<div align="center">
+
-
+```python
-## 📜 **License & Ecosystem**
+result = await extract_text("https://example.com/report.pdf")
-
+```
-**MIT License** - Freedom to innovate everywhere
+
-
+Files are cached locally for subsequent operations.
-**🤝 Part of the MCP Document Processing Ecosystem**
+
-
+---
-*Powered by [FastMCP](https://github.com/jlowin/fastmcp) • [Model Context Protocol](https://modelcontextprotocol.io) • Enterprise Python*
+
-
+## System Dependencies
-### **🔗 Complete Document Processing Solution**
+
-
+Some features require system packages:
-**PDF Intelligence** ➜ **[MCP PDF](https://github.com/rsp2k/mcp-pdf)** (You are here!)  
+
-**Office Intelligence** ➜ **[MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**  
+| Feature | Dependency |
-**Unified Power** ➜ **Both Tools Together**
+|---------|-----------|
-
+| OCR | `tesseract-ocr` |
---
+| Camelot tables | `ghostscript` |
-
+| Tabula tables | `default-jre-headless` |
-### **⭐ Star both repositories for the complete solution! ⭐**
+| PDF to images | `poppler-utils` |
-
+
-**📄 [Star MCP PDF](https://github.com/rsp2k/mcp-pdf)** • **📊 [Star MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**
+Ubuntu/Debian:
-
+```bash
-*Building the future of intelligent document processing* 🚀
+sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
 ```
 ---
 ## Configuration
 Optional environment variables:
 | Variable | Purpose |
 |----------|---------|
 | `MCP_PDF_ALLOWED_PATHS` | Colon-separated directories for file output |
 | `PDF_TEMP_DIR` | Temp directory for processing (default: `/tmp/mcp-pdf-processing`) |
 | `TESSDATA_PREFIX` | Tesseract language data location |
 ---
 ## Development
 ```bash
 # Run tests
 uv run pytest
 # With coverage
 uv run pytest --cov=mcp_pdf
 # Format
 uv run black src/ tests/
 # Lint
 uv run ruff check src/ tests/
 ```
 ---
 ## License
 MIT
 </div>