Compare commits

...

2 Commits

Author SHA1 Message Date
f32a014909 📝 Rewrite README: remove marketing fluff, describe what tools do
Some checks failed
Security Scan / security-scan (push) Has been cancelled
2026-02-06 22:43:02 -07:00
e4f77008bb 🚀 v2.0.8: Add extract_vector_graphics tool for PDF to SVG extraction
New tool extracts vector graphics from PDF pages as SVG files, supporting
three modes: full_page (PyMuPDF native SVG), drawings_only (raw vector
paths), and both. Handles lines, curves, rectangles, quads with proper
color space conversion (RGB, grayscale, CMYK). No new dependencies.
2026-02-02 13:56:17 -07:00
4 changed files with 540 additions and 648 deletions

README.md · 828 changed lines

@@ -4,558 +4,58 @@
<img src="https://img.shields.io/badge/MCP-PDF%20Tools-red?style=for-the-badge&logo=adobe-acrobat-reader" alt="MCP PDF">
**🚀 The Ultimate PDF Processing Intelligence Platform for AI**
**A FastMCP server for PDF processing**
*Transform any PDF into structured, actionable intelligence with 24 specialized tools*
*41 tools for text extraction, OCR, tables, forms, annotations, and more*
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/)
[![FastMCP](https://img.shields.io/badge/FastMCP-2.0+-green.svg?style=flat-square)](https://github.com/jlowin/fastmcp)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT)
[![Production Ready](https://img.shields.io/badge/status-production%20ready-brightgreen?style=flat-square)](https://github.com/rsp2k/mcp-pdf)
[![MCP Protocol](https://img.shields.io/badge/MCP-1.13.0-purple?style=flat-square)](https://modelcontextprotocol.io)
[![PyPI](https://img.shields.io/pypi/v/mcp-pdf?style=flat-square)](https://pypi.org/project/mcp-pdf/)
**🤝 Perfect Companion to [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**
**Works great with [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**
</div>
---
## ✨ **What Makes MCP PDF Revolutionary?**
## What It Does
> 🎯 **The Problem**: PDFs contain incredible intelligence, but extracting it reliably is complex, slow, and often fails.
>
> ⚡ **The Solution**: MCP PDF delivers **AI-powered document intelligence** with **40 specialized tools** that understand both content and structure.
MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.
<table>
<tr>
<td>
### 🏆 **Why MCP PDF Leads**
- **🚀 40 Specialized Tools** for every PDF scenario
- **🧠 AI-Powered Intelligence** beyond basic extraction
- **🔄 Multi-Library Fallbacks** for 99.9% reliability
- **⚡ 10x Faster** than traditional solutions
- **🌐 URL Processing** with smart caching
- **🎯 Smart Token Management** prevents MCP overflow errors
</td>
<td>
### 📊 **Enterprise-Proven For:**
- **Business Intelligence** & financial analysis
- **Document Security** assessment & compliance
- **Academic Research** & content analysis
- **Automated Workflows** & form processing
- **Document Migration** & modernization
- **Content Management** & archival
</td>
</tr>
</table>
**Core capabilities:**
- **Text extraction** via PyMuPDF, pdfplumber, or pypdf (auto-fallback)
- **Table extraction** via Camelot, pdfplumber, or Tabula (auto-fallback)
- **OCR** for scanned documents via Tesseract
- **Form handling** - extract, fill, and create PDF forms
- **Document assembly** - merge, split, reorder pages
- **Annotations** - sticky notes, highlights, stamps
- **Vector graphics** - extract to SVG for schematics and technical drawings
---
## 🚀 **Get Intelligence in 60 Seconds**
## Quick Start
```bash
# Install from PyPI
uvx mcp-pdf
# Or add to Claude Code
claude mcp add pdf-tools uvx mcp-pdf
```
<details>
<summary><b>Development Installation</b></summary>
```bash
# 1. Clone and install
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync
# 2. Install system dependencies (Ubuntu/Debian)
# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript
# 3. Verify installation
uv run python examples/verify_installation.py
# 4. Run the MCP server
uv run mcp-pdf
```
<details>
<summary>🔧 <b>Claude Desktop Integration</b> (click to expand)</summary>
### **📦 Production Installation (PyPI)**
```bash
# For personal use across all projects
claude mcp add -s local pdf-tools uvx mcp-pdf
# For project-specific use (isolated)
claude mcp add -s project pdf-tools uvx mcp-pdf
```
### **🛠️ Development Installation (Source)**
```bash
# For local development from source
claude mcp add -s project pdf-tools-dev uv -- --directory /path/to/mcp-pdf run mcp-pdf
```
### **⚙️ Manual Configuration**
Add to your `claude_desktop_config.json`:
```json
{
"mcpServers": {
"pdf-tools": {
"command": "uvx",
"args": ["mcp-pdf"]
}
}
}
```
*Restart Claude Desktop and unlock PDF intelligence!*
</details>
---
## 🎭 **See AI-Powered Intelligence In Action**
### **📊 Business Intelligence Workflow**
```python
# Complete financial report analysis in seconds
health = await analyze_pdf_health("quarterly-report.pdf")
classification = await classify_content("quarterly-report.pdf")
summary = await summarize_content("quarterly-report.pdf", summary_length="medium")
# Smart table extraction - prevents token overflow on large tables
tables = await extract_tables("quarterly-report.pdf", pages="5-7", max_rows_per_table=100)
# Or get just table structure without data
table_summary = await extract_tables("quarterly-report.pdf", pages="5-7", summary_only=True)
charts = await extract_charts("quarterly-report.pdf")
# Get instant insights
{
"document_type": "Financial Report",
"health_score": 9.2,
"key_insights": [
"Revenue increased 23% YoY",
"Operating margin improved to 15.3%",
"Strong cash flow generation"
],
"tables_extracted": 12,
"charts_found": 8,
"processing_time": 2.1
}
```
### **🔒 Document Security Assessment**
```python
# Comprehensive security analysis
security = await analyze_pdf_security("sensitive-document.pdf")
watermarks = await detect_watermarks("sensitive-document.pdf")
health = await analyze_pdf_health("sensitive-document.pdf")
# Enterprise-grade security insights
{
"encryption_type": "AES-256",
"permissions": {
"print": false,
"copy": false,
"modify": false
},
"security_warnings": [],
"watermarks_detected": true,
"compliance_ready": true
}
```
### **📚 Academic Research Processing**
```python
# Advanced research paper analysis
layout = await analyze_layout("research-paper.pdf", pages=[1,2,3])
summary = await summarize_content("research-paper.pdf", summary_length="long")
citations = await extract_text("research-paper.pdf", pages=[15,16,17])
# Research intelligence delivered
{
"reading_complexity": "Graduate Level",
"main_topics": ["Machine Learning", "Natural Language Processing"],
"citation_count": 127,
"figures_detected": 15,
"methodology_extracted": true
}
```
---
## 🛠️ **Complete Arsenal: 40+ Specialized Tools**
<div align="center">
### **🎯 Document Intelligence & Analysis**
| 🧠 **Tool** | 📋 **Purpose** | ⚡ **AI Powered** | 🎯 **Accuracy** |
|-------------|---------------|-----------------|----------------|
| `classify_content` | AI-powered document type detection | ✅ Yes | 97% |
| `summarize_content` | Intelligent key insights extraction | ✅ Yes | 95% |
| `analyze_pdf_health` | Comprehensive quality assessment | ✅ Yes | 99% |
| `analyze_pdf_security` | Security & vulnerability analysis | ✅ Yes | 99% |
| `compare_pdfs` | Advanced document comparison | ✅ Yes | 96% |
### **📊 Core Content Extraction**
| 🔧 **Tool** | 📋 **Purpose** | ⚡ **Speed** | 🎯 **Accuracy** |
|-------------|---------------|-------------|----------------|
| `extract_text` | Multi-method text extraction with auto-chunking | **Ultra Fast** | 99.9% |
| `extract_tables` | Smart table extraction with token overflow protection | **Fast** | 98% |
| `ocr_pdf` | Advanced OCR for scanned docs | **Moderate** | 95% |
| `extract_images` | Media extraction & processing | **Fast** | 99% |
| `pdf_to_markdown` | Structure-preserving conversion | **Fast** | 97% |
### **📐 Visual & Layout Analysis**
| 🎨 **Tool** | 📋 **Purpose** | 🔍 **Precision** | 💪 **Features** |
|-------------|---------------|-----------------|----------------|
| `analyze_layout` | Page structure & column detection | **High** | Advanced |
| `extract_charts` | Visual element extraction | **High** | Smart |
| `detect_watermarks` | Watermark identification | **Perfect** | Complete |
</div>
---
## 🌟 **Document Format Intelligence Matrix**
<div align="center">
### **📄 Universal PDF Processing Capabilities**
| 📋 **Document Type** | 🔍 **Detection** | 📊 **Text** | 📈 **Tables** | 🖼️ **Images** | 🧠 **Intelligence** |
|---------------------|-----------------|------------|--------------|--------------|-------------------|
| **Financial Reports** | ✅ Perfect | ✅ Perfect | ✅ Perfect | ✅ Perfect | 🧠 **AI-Enhanced** |
| **Research Papers** | ✅ Perfect | ✅ Perfect | ✅ Excellent | ✅ Perfect | 🧠 **AI-Enhanced** |
| **Legal Documents** | ✅ Perfect | ✅ Perfect | ✅ Good | ✅ Perfect | 🧠 **AI-Enhanced** |
| **Scanned PDFs** | ✅ Auto-Detect | ✅ OCR | ✅ OCR | ✅ Perfect | 🧠 **AI-Enhanced** |
| **Forms & Applications** | ✅ Perfect | ✅ Perfect | ✅ Excellent | ✅ Perfect | 🧠 **AI-Enhanced** |
| **Technical Manuals** | ✅ Perfect | ✅ Perfect | ✅ Perfect | ✅ Perfect | 🧠 **AI-Enhanced** |
*✅ Perfect • 🧠 AI-Enhanced Intelligence • 🔍 Auto-Detection*
</div>
---
## ⚡ **Performance That Amazes**
<div align="center">
### **🚀 Real-World Benchmarks**
| 📄 **Document Type** | 📏 **Pages** | ⏱️ **Processing Time** | 🆚 **vs Competitors** | 🧠 **Intelligence Level** |
|---------------------|-------------|----------------------|----------------------|---------------------------|
| Financial Report | 50 pages | 2.1 seconds | **10x faster** | **AI-Powered** |
| Research Paper | 25 pages | 1.3 seconds | **8x faster** | **Deep Analysis** |
| Scanned Document | 100 pages | 45 seconds | **5x faster** | **OCR + AI** |
| Complex Forms | 15 pages | 0.8 seconds | **12x faster** | **Structure Aware** |
*Benchmarked on: MacBook Pro M2, 16GB RAM • Including AI processing time*
</div>
---
## 🏗️ **Intelligent Architecture**
### **🧠 Multi-Library Intelligence System**
*Never worry about PDF compatibility or failure again*
```mermaid
graph TD
A[PDF Input] --> B{Smart Detection}
B --> C{Document Type}
C -->|Text-based| D[PyMuPDF Fast Path]
C -->|Scanned| E[OCR Processing]
C -->|Complex Layout| F[pdfplumber Analysis]
C -->|Tables Heavy| G[Camelot + Tabula]
D -->|Success| H[✅ Content Extracted]
D -->|Fail| I[pdfplumber Fallback]
I -->|Fail| J[pypdf Fallback]
E --> K[Tesseract OCR]
K --> L[AI Content Analysis]
F --> M[Layout Intelligence]
G --> N[Table Intelligence]
H --> O[🧠 AI Enhancement]
L --> O
M --> O
N --> O
O --> P[🎯 Structured Intelligence]
```
### **🎯 Intelligent Processing Pipeline**
1. **🔍 Smart Detection**: Automatically identify document type and optimal processing strategy
2. **⚡ Optimized Extraction**: Use the fastest, most accurate method for each document
3. **🛡️ Fallback Protection**: Seamless method switching if primary approach fails
4. **🧠 AI Enhancement**: Apply document intelligence and content analysis
5. **🧹 Clean Output**: Deliver perfectly structured, AI-ready intelligence
---
## 🌍 **Real-World Success Stories**
<div align="center">
### **🏢 Proven at Enterprise Scale**
</div>
<table>
<tr>
<td>
### **📊 Financial Services Giant**
*Processing 50,000+ reports monthly*
**Challenge**: Analyze quarterly reports from 2,000+ companies
**Results**:
- ⚡ **98% time reduction** (2 weeks → 4 hours)
- 🎯 **99.9% accuracy** in financial data extraction
- 💰 **$5M annual savings** in analyst time
- 🏆 **SEC compliance** maintained
</td>
<td>
### **🏥 Healthcare Research Institute**
*Processing 100,000+ research papers*
**Challenge**: Analyze medical literature for drug discovery
**Results**:
- 🚀 **25x faster** literature review process
- 📋 **95% accuracy** in data extraction
- 🧬 **12 new drug targets** identified
- 📚 **Publication in Nature** based on insights
</td>
</tr>
<tr>
<td>
### **⚖️ Legal Firm Network**
*Processing 500,000+ legal documents*
**Challenge**: Document review and compliance checking
**Results**:
- 🏃 **40x speed improvement** in document review
- 🛡️ **100% security compliance** maintained
- 💼 **$20M cost savings** across network
- 🏆 **Zero data breaches** during migration
</td>
<td>
### **🎓 Global University System**
*Processing 1M+ academic papers*
**Challenge**: Create searchable academic knowledge base
**Results**:
- 📖 **50x faster** knowledge extraction
- 🧠 **AI-ready** structured academic data
- 🔍 **97% search accuracy** improvement
- 📊 **3 Nobel Prize** papers processed
</td>
</tr>
</table>
---
## 🎯 **Advanced Features That Set Us Apart**
### **🌐 HTTPS URL Processing with Smart Caching**
```python
# Process PDFs directly from anywhere on the web
report_url = "https://company.com/annual-report.pdf"
analysis = await classify_content(report_url) # Downloads & caches automatically
tables = await extract_tables(report_url) # Uses cache - instant!
summary = await summarize_content(report_url) # Lightning fast!
```
### **🩺 Comprehensive Document Health Analysis**
```python
# Enterprise-grade document assessment
health = await analyze_pdf_health("critical-document.pdf")
{
"overall_health_score": 9.2,
"corruption_detected": false,
"optimization_potential": "23% size reduction possible",
"security_assessment": "enterprise_ready",
"recommendations": [
"Document is production-ready",
"Consider optimization for web delivery"
],
"processing_confidence": 99.8
}
```
### **🔍 AI-Powered Content Classification**
```python
# Automatically understand document types
classification = await classify_content("mystery-document.pdf")
{
"document_type": "Financial Report",
"confidence": 97.3,
"key_topics": ["Revenue", "Operating Expenses", "Cash Flow"],
"complexity_level": "Professional",
"suggested_tools": ["extract_tables", "extract_charts", "summarize_content"],
"industry_vertical": "Technology"
}
```
---
## 🤝 **Perfect Integration Ecosystem**
### **💎 Companion to MCP Office Tools**
*The ultimate document processing powerhouse*
<div align="center">
| 🔧 **Processing Need** | 📄 **PDF Files** | 📊 **Office Files** | 🔗 **Integration** |
|-----------------------|------------------|-------------------|-------------------|
| **Text Extraction** | MCP PDF ✅ | [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools) ✅ | **Unified API** |
| **Table Processing** | Advanced ✅ | Advanced ✅ | **Cross-Format** |
| **Image Extraction** | Smart ✅ | Smart ✅ | **Consistent** |
| **Format Detection** | AI-Powered ✅ | AI-Powered ✅ | **Intelligent** |
| **Health Analysis** | Complete ✅ | Complete ✅ | **Comprehensive** |
[**🚀 Get Both Tools for Complete Document Intelligence**](https://git.supported.systems/MCP/mcp-office-tools)
</div>
### **🔗 Unified Document Processing Workflow**
```python
# Process ALL document formats with unified intelligence
pdf_analysis = await pdf_tools.classify_content("report.pdf")
word_analysis = await office_tools.detect_office_format("report.docx")
excel_data = await office_tools.extract_text("data.xlsx")
# Cross-format document comparison
comparison = await compare_cross_format_documents([
pdf_analysis, word_analysis, excel_data
])
```
### **⚡ Works Seamlessly With**
- **🤖 Claude Desktop**: Native MCP protocol integration
- **📊 Jupyter Notebooks**: Perfect for research and analysis
- **🐍 Python Applications**: Direct async/await API access
- **🌐 Web Services**: RESTful wrappers and microservices
- **☁️ Cloud Platforms**: AWS Lambda, Google Functions, Azure
- **🔄 Workflow Engines**: Zapier, Microsoft Power Automate
---
## 🛡️ **Enterprise-Grade Security & Compliance**
<div align="center">
| 🔒 **Security Feature** | ✅ **Status** | 📋 **Enterprise Ready** |
|------------------------|---------------|------------------------|
| **Local Processing** | ✅ Enabled | Documents never leave your environment |
| **Memory Security** | ✅ Optimized | Automatic sensitive data cleanup |
| **HTTPS Validation** | ✅ Enforced | Certificate validation and secure headers |
| **Access Controls** | ✅ Configurable | Role-based processing permissions |
| **Audit Logging** | ✅ Available | Complete processing audit trails |
| **GDPR Compliant** | ✅ Certified | No personal data retention |
| **SOC2 Ready** | ✅ Verified | Enterprise security standards |
</div>
---
## 📈 **Installation & Enterprise Setup**
<details>
<summary>🚀 <b>Quick Start</b> (Recommended)</summary>
```bash
# Clone repository
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
# Install with uv (fastest)
uv sync
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript
# Verify installation
uv run python examples/verify_installation.py
```
</details>
<details>
<summary>🐳 <b>Docker Enterprise Setup</b></summary>
```dockerfile
FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
tesseract-ocr tesseract-ocr-eng \
poppler-utils ghostscript \
default-jre-headless
COPY . /app
WORKDIR /app
RUN pip install -e .
CMD ["mcp-pdf"]
```
</details>
<details>
<summary>🌐 <b>Claude Desktop Integration</b></summary>
```json
{
"mcpServers": {
"pdf-tools": {
"command": "uv",
"args": ["run", "mcp-pdf"],
"cwd": "/path/to/mcp-pdf"
},
"office-tools": {
"command": "mcp-office-tools"
}
}
}
```
*Unified document processing across all formats!*
</details>
<details>
<summary>🔧 <b>Development Environment</b></summary>
```bash
# Clone and setup
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync --dev
# Quality checks
uv run pytest --cov=mcp_pdf_tools
uv run black src/ tests/ examples/
uv run ruff check src/ tests/ examples/
uv run mypy src/
# Run all 23 tools demo
# Verify
uv run python examples/verify_installation.py
```
@@ -563,124 +63,162 @@ uv run python examples/verify_installation.py
---
## 🚀 **What's Coming Next?**
## Tools
<div align="center">
### Content Extraction
### **🔮 Innovation Roadmap 2024-2025**
| Tool | What it does |
|------|-------------|
| `extract_text` | Pull text from PDF pages with automatic chunking for large files |
| `extract_tables` | Extract tables to JSON, CSV, or Markdown |
| `extract_images` | Extract embedded images |
| `extract_links` | Get all hyperlinks with page filtering |
| `pdf_to_markdown` | Convert PDF to markdown preserving structure |
| `ocr_pdf` | OCR scanned documents using Tesseract |
| `extract_vector_graphics` | Export vector graphics to SVG (schematics, charts, drawings) |
### Document Analysis
| Tool | What it does |
|------|-------------|
| `extract_metadata` | Get title, author, creation date, page count, etc. |
| `get_document_structure` | Extract table of contents and bookmarks |
| `analyze_layout` | Detect columns, headers, footers |
| `is_scanned_pdf` | Check if PDF needs OCR |
| `compare_pdfs` | Diff two PDFs by text, structure, or metadata |
| `analyze_pdf_health` | Check for corruption, optimization opportunities |
| `analyze_pdf_security` | Report encryption, permissions, signatures |
### Forms
| Tool | What it does |
|------|-------------|
| `extract_form_data` | Get form field names and values |
| `fill_form_pdf` | Fill form fields from JSON |
| `create_form_pdf` | Create new forms with text fields, checkboxes, dropdowns |
| `add_form_fields` | Add fields to existing PDFs |
### Document Assembly
| Tool | What it does |
|------|-------------|
| `merge_pdfs` | Combine multiple PDFs with bookmark preservation |
| `split_pdf_by_pages` | Split by page ranges |
| `split_pdf_by_bookmarks` | Split at chapter/section boundaries |
| `reorder_pdf_pages` | Rearrange pages in custom order |
### Annotations
| Tool | What it does |
|------|-------------|
| `add_sticky_notes` | Add comment annotations |
| `add_highlights` | Highlight text regions |
| `add_stamps` | Add Approved/Draft/Confidential stamps |
| `extract_all_annotations` | Export annotations to JSON |
---
## How Fallbacks Work
The server tries multiple libraries for each operation:
**Text extraction:**
1. PyMuPDF (fastest)
2. pdfplumber (better for complex layouts)
3. pypdf (most compatible)
**Table extraction:**
1. Camelot (best accuracy, requires Ghostscript)
2. pdfplumber (no dependencies)
3. Tabula (requires Java)
If a PDF fails with one library, the next is tried automatically.
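
The pattern is simple to sketch. The snippet below is illustrative only (it is not the server's actual code, and the extractor callables are stand-ins): try each library in order and return the first result that succeeds.

```python
# Minimal sketch of a library fallback chain. Real extractors would
# wrap PyMuPDF, pdfplumber, and pypdf respectively.
def extract_with_fallbacks(pdf_path, extractors):
    errors = []
    for name, extract in extractors:
        try:
            return {"method": name, "text": extract(pdf_path)}
        except Exception as e:  # a real implementation would narrow this
            errors.append(f"{name}: {e}")
    raise RuntimeError("All extractors failed: " + "; ".join(errors))

def broken_extractor(path):
    # Stands in for a library that chokes on this particular PDF.
    raise ValueError("parse error")

result = extract_with_fallbacks("doc.pdf", [
    ("pymupdf", broken_extractor),
    ("pdfplumber", lambda path: "extracted text"),
])
print(result["method"])  # prints: pdfplumber
```

The caller never sees the first failure; it only matters which method ultimately produced the text, which is why results can usefully report the method used.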
---
## Token Management
Large PDFs can overflow MCP response limits. The server handles this:
- **Automatic chunking** splits large documents into page groups
- **Table row limits** cap rows per table so oversized tables don't exceed response limits
- **Summary mode** returns structure without full content
```python
# Get first 10 pages
result = await extract_text("huge.pdf", pages="1-10")
# Limit table rows
tables = await extract_tables("data.pdf", max_rows_per_table=50)
# Structure only
tables = await extract_tables("data.pdf", summary_only=True)
```
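
The chunking idea can be sketched as follows. This is illustrative only; the server's actual chunk sizing is token-aware rather than a fixed page count.

```python
# Illustrative sketch of page-group chunking: split a document's
# 1-based page numbers into contiguous groups of fixed size.
def chunk_pages(total_pages, pages_per_chunk=10):
    pages = list(range(1, total_pages + 1))
    return [pages[i:i + pages_per_chunk]
            for i in range(0, total_pages, pages_per_chunk)]

chunks = chunk_pages(25, pages_per_chunk=10)
# three chunks: pages 1-10, 11-20, and 21-25
```

Each chunk is then extracted and returned separately, so no single response carries the whole document.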
---
## URL Processing
PDFs can be fetched directly from HTTPS URLs:
```python
result = await extract_text("https://example.com/report.pdf")
```
Files are cached locally for subsequent operations.
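
A typical layout for such a cache keys the local file on a hash of the URL, so the same URL always resolves to the same path and repeat calls skip the download. A hypothetical sketch (not the server's actual cache code):

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical URL-keyed cache: hash the URL to a stable filename.
def cached_path_for(url: str, cache_dir: Path) -> Path:
    digest = hashlib.sha256(url.encode()).hexdigest()[:16]
    return cache_dir / f"{digest}.pdf"

cache = Path(tempfile.mkdtemp(prefix="pdf_cache_"))
p1 = cached_path_for("https://example.com/report.pdf", cache)
p2 = cached_path_for("https://example.com/report.pdf", cache)
assert p1 == p2  # stable key: download once, reuse afterwards
```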
---
## System Dependencies
Some features require system packages:
| Feature | Dependency |
|---------|-----------|
| OCR | `tesseract-ocr` |
| Camelot tables | `ghostscript` |
| Tabula tables | `default-jre-headless` |
| PDF to images | `poppler-utils` |
Ubuntu/Debian:
```bash
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
```
---
## Configuration
Optional environment variables:
| Variable | Purpose |
|----------|---------|
| `MCP_PDF_ALLOWED_PATHS` | Colon-separated directories for file output |
| `PDF_TEMP_DIR` | Temp directory for processing (default: `/tmp/mcp-pdf-processing`) |
| `TESSDATA_PREFIX` | Tesseract language data location |
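
For example, to restrict file output to two directories and redirect temp files before launching the server (the paths here are illustrative):

```bash
export MCP_PDF_ALLOWED_PATHS="/srv/pdf-out:/home/user/reports"
export PDF_TEMP_DIR="/var/tmp/mcp-pdf"
uvx mcp-pdf
```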
---
## Development
```bash
# Run tests
uv run pytest
# With coverage
uv run pytest --cov=mcp_pdf
# Format
uv run black src/ tests/
# Lint
uv run ruff check src/ tests/
```
---
## License
MIT
</div>
| 🗓️ **Timeline** | 🎯 **Feature** | 📋 **Impact** |
|-----------------|---------------|--------------|
| **Q4 2024** | **Enhanced AI Analysis** | GPT-powered content understanding |
| **Q1 2025** | **Batch Processing** | Process 1000+ documents simultaneously |
| **Q2 2025** | **Cloud Integration** | Direct S3, GCS, Azure Blob support |
| **Q3 2025** | **Real-time Streaming** | Process documents as they're created |
| **Q4 2025** | **Multi-language OCR** | 50+ language support with AI translation |
| **2026** | **Blockchain Verification** | Cryptographic document integrity |
---
## 🎭 **Complete Tool Showcase**
<details>
<summary>📊 <b>Business Intelligence Tools</b> (click to expand)</summary>
### **Core Extraction**
- `extract_text` - Multi-method text extraction with layout preservation
- `extract_tables` - Intelligent table extraction (JSON, CSV, Markdown)
- `extract_images` - Image extraction with size filtering and format options
- `pdf_to_markdown` - Clean markdown conversion with structure preservation
### **AI-Powered Analysis**
- `classify_content` - AI document type classification and analysis
- `summarize_content` - Intelligent summarization with key insights
- `analyze_pdf_health` - Comprehensive quality assessment
- `analyze_pdf_security` - Security feature analysis and vulnerability detection
</details>
<details>
<summary>🔍 <b>Advanced Analysis Tools</b> (click to expand)</summary>
### **Document Intelligence**
- `compare_pdfs` - Advanced document comparison (text, structure, metadata)
- `is_scanned_pdf` - Smart detection of scanned vs. text-based documents
- `get_document_structure` - Document outline and structural analysis
- `extract_metadata` - Comprehensive metadata and statistics extraction
### **Visual Processing**
- `analyze_layout` - Page layout analysis with column and spacing detection
- `extract_charts` - Chart, diagram, and visual element extraction
- `detect_watermarks` - Watermark detection and analysis
</details>
<details>
<summary>🔨 <b>Document Manipulation Tools</b> (click to expand)</summary>
### **Content Operations**
- `extract_form_data` - Interactive PDF form data extraction
- `split_pdf` - Intelligent document splitting at specified pages
- `merge_pdfs` - Multi-document merging with page range tracking
- `rotate_pages` - Precise page rotation (90°/180°/270°)
### **Optimization & Repair**
- `convert_to_images` - PDF to image conversion with quality control
- `optimize_pdf` - Multi-level file size optimization
- `repair_pdf` - Automated corruption repair and recovery
- `ocr_pdf` - Advanced OCR with preprocessing for scanned documents
</details>
---
## 💝 **Enterprise Support & Community**
<div align="center">
### **🌟 Join the PDF Intelligence Revolution!**
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github)](https://github.com/rsp2k/mcp-pdf)
[![Issues](https://img.shields.io/badge/Issues-Welcome-green?style=for-the-badge&logo=github)](https://github.com/rsp2k/mcp-pdf/issues)
[![MCP Office Tools](https://img.shields.io/badge/Companion-MCP%20Office%20Tools-blue?style=for-the-badge)](https://git.supported.systems/MCP/mcp-office-tools)
**💬 Enterprise Support Available** • **🐛 Bug Bounty Program** • **💡 Feature Requests Welcome**
</div>
### **🏢 Enterprise Services**
- **📞 Priority Support**: 24/7 enterprise support available
- **🎓 Training Programs**: Comprehensive team training
- **🔧 Custom Integration**: Tailored enterprise deployments
- **📊 Analytics Dashboard**: Usage analytics and insights
- **🛡️ Security Audits**: Comprehensive security assessments
---
<div align="center">
## 📜 **License & Ecosystem**
**MIT License** - Freedom to innovate everywhere
**🤝 Part of the MCP Document Processing Ecosystem**
*Powered by [FastMCP](https://github.com/jlowin/fastmcp) • [Model Context Protocol](https://modelcontextprotocol.io) • Enterprise Python*
### **🔗 Complete Document Processing Solution**
**PDF Intelligence** ➜ **[MCP PDF](https://github.com/rsp2k/mcp-pdf)** (You are here!)
**Office Intelligence** ➜ **[MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**
**Unified Power** ➜ **Both Tools Together**
---
### **⭐ Star both repositories for the complete solution! ⭐**
**📄 [Star MCP PDF](https://github.com/rsp2k/mcp-pdf)** • **📊 [Star MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)**
*Building the future of intelligent document processing* 🚀
</div>

pyproject.toml

@@ -1,6 +1,6 @@
[project]
name = "mcp-pdf"
version = "2.0.7"
version = "2.0.8"
description = "Secure FastMCP server for comprehensive PDF processing - text extraction, OCR, table extraction, forms, annotations, and more"
authors = [{name = "Ryan Malloy", email = "ryan@malloys.us"}]
readme = "README.md"

View File

@@ -382,4 +382,358 @@ class ImageProcessingMixin(MCPMixin):
"""Simple heuristic to detect if line contains intentional markdown formatting"""
# Very basic check - could be enhanced
markdown_patterns = ['# ', '## ', '### ', '* ', '- ', '1. ', '**', '__']
return any(pattern in line for pattern in markdown_patterns)
@mcp_tool(
name="extract_vector_graphics",
description="Extract vector graphics from PDF to SVG format. Ideal for schematics, charts, and technical drawings."
)
async def extract_vector_graphics(
self,
pdf_path: str,
output_directory: Optional[str] = None,
pages: Optional[str] = None,
mode: str = "full_page",
include_text: bool = True,
simplify_paths: bool = False,
) -> Dict[str, Any]:
"""
Extract vector graphics from PDF pages as SVG files.
Perfect for extracting:
- IC functional diagrams from datasheets
- Frequency response charts and line graphs
- Package outline drawings (dimensioned technical drawings)
- Circuit schematics
- PCB layout diagrams
Args:
pdf_path: Path to PDF file or HTTPS URL
output_directory: Directory to save SVG files (default: temp directory)
pages: Page numbers to extract (comma-separated, 1-based), None for all
mode: Extraction mode:
- "full_page": Complete page as SVG (default, best for general use)
- "drawings_only": Extract individual vector paths as separate SVG
- "both": Export both formats for flexibility
include_text: Whether to include text in SVG output (default: True)
simplify_paths: Reduce path complexity for smaller files (default: False)
Returns:
Dictionary containing extraction summary and SVG file paths
"""
start_time = time.time()
try:
# Validate PDF path
input_pdf_path = await validate_pdf_path(pdf_path)
# Setup output directory
if output_directory:
output_dir = validate_output_path(output_directory)
output_dir.mkdir(parents=True, exist_ok=True)
else:
output_dir = Path(tempfile.mkdtemp(prefix="pdf_vectors_"))
# Parse pages parameter
parsed_pages = parse_pages_parameter(pages)
# Validate mode
valid_modes = ["full_page", "drawings_only", "both"]
if mode not in valid_modes:
return {
"success": False,
"error": f"Invalid mode '{mode}'. Valid modes: {', '.join(valid_modes)}",
"extraction_time": round(time.time() - start_time, 2)
}
# Open PDF document
doc = fitz.open(str(input_pdf_path))
total_pages = len(doc)
# Determine pages to process
pages_to_process = parsed_pages if parsed_pages else list(range(total_pages))
pages_to_process = [p for p in pages_to_process if 0 <= p < total_pages]
if not pages_to_process:
doc.close()
return {
"success": False,
"error": "No valid pages specified",
"extraction_time": round(time.time() - start_time, 2)
}
svg_files = []
total_size = 0
base_name = input_pdf_path.stem
for page_num in pages_to_process:
try:
page = doc[page_num]
page_results = {}
# Full page SVG extraction
if mode in ["full_page", "both"]:
svg_content = page.get_svg_image(
text_as_path=not include_text
)
# Optionally simplify paths (basic implementation)
if simplify_paths:
svg_content = self._simplify_svg_paths(svg_content)
filename = f"{base_name}_page_{page_num + 1}.svg"
output_path = output_dir / filename
with open(output_path, 'w', encoding='utf-8') as f:
f.write(svg_content)
file_size = output_path.stat().st_size
total_size += file_size
page_results["full_page"] = {
"filename": filename,
"path": str(output_path),
"size_bytes": file_size,
"size_kb": round(file_size / 1024, 1)
}
# Individual drawings extraction
if mode in ["drawings_only", "both"]:
drawings = page.get_drawings()
drawing_count = len(drawings)
if drawing_count > 0:
# Convert drawings to SVG
drawings_svg = self._drawings_to_svg(
drawings,
page.rect.width,
page.rect.height
)
filename = f"{base_name}_page_{page_num + 1}_drawings.svg"
output_path = output_dir / filename
with open(output_path, 'w', encoding='utf-8') as f:
f.write(drawings_svg)
file_size = output_path.stat().st_size
total_size += file_size
page_results["drawings_only"] = {
"filename": filename,
"path": str(output_path),
"size_bytes": file_size,
"size_kb": round(file_size / 1024, 1),
"drawing_count": drawing_count
}
else:
page_results["drawings_only"] = {
"skipped": True,
"reason": "No vector drawings found on page"
}
# Get drawing statistics for the page
all_drawings = page.get_drawings()
svg_files.append({
"page": page_num + 1,
"has_text": bool(page.get_text().strip()),
"drawing_count": len(all_drawings),
**page_results
})
except Exception as e:
logger.warning(f"Failed to extract vectors from page {page_num + 1}: {e}")
svg_files.append({
"page": page_num + 1,
"error": sanitize_error_message(str(e))
})
doc.close()
# Count successful extractions
successful_pages = sum(1 for f in svg_files if "error" not in f)
return {
"success": True,
"extraction_summary": {
"pages_processed": len(pages_to_process),
"pages_successful": successful_pages,
"mode": mode,
"total_size_bytes": total_size,
"total_size_kb": round(total_size / 1024, 1),
"output_directory": str(output_dir)
},
"svg_files": svg_files,
"settings": {
"include_text": include_text,
"simplify_paths": simplify_paths,
"mode": mode
},
"file_info": {
"input_path": str(input_pdf_path),
"total_pages": total_pages,
"pages_processed": pages or "all"
},
"extraction_time": round(time.time() - start_time, 2),
"hints": {
"viewing": "Open SVG files in browser, Inkscape, or Illustrator for editing",
"full_page_vs_drawings": "full_page preserves layout; drawings_only extracts raw vector paths"
}
}
except Exception as e:
error_msg = sanitize_error_message(str(e))
logger.error(f"Vector graphics extraction failed: {error_msg}")
return {
"success": False,
"error": error_msg,
"extraction_time": round(time.time() - start_time, 2)
}
def _drawings_to_svg(
self,
drawings: List[Dict],
width: float,
height: float
) -> str:
"""
Convert PyMuPDF drawings to standalone SVG.
Drawings contain: rect, items (path operations), color, fill, width, etc.
"""
svg_parts = [
'<?xml version="1.0" encoding="UTF-8"?>',
'<svg xmlns="http://www.w3.org/2000/svg" ',
f'viewBox="0 0 {width:.2f} {height:.2f}" ',
f'width="{width:.2f}" height="{height:.2f}">',
'',
' <!-- Extracted vector drawings from PDF -->',
''
]
for idx, drawing in enumerate(drawings):
try:
path_data = self._drawing_to_path(drawing)
if not path_data:
continue
# Extract style attributes
stroke_color = self._color_to_svg(drawing.get('color'))
fill_color = self._color_to_svg(drawing.get('fill'))
# PyMuPDF reports width=None for fill-only drawings; fall back to a 1pt stroke
stroke_width = drawing.get('width') or 1
# Build style string
style_parts = []
if fill_color:
style_parts.append(f'fill:{fill_color}')
else:
style_parts.append('fill:none')
if stroke_color:
style_parts.append(f'stroke:{stroke_color}')
style_parts.append(f'stroke-width:{stroke_width:.2f}')
style = ';'.join(style_parts)
svg_parts.append(f' <path d="{path_data}" style="{style}" />')
except Exception as e:
logger.debug(f"Failed to convert drawing {idx}: {e}")
continue
svg_parts.append('</svg>')
return '\n'.join(svg_parts)
def _drawing_to_path(self, drawing: Dict) -> Optional[str]:
"""Convert a single drawing to SVG path data string."""
items = drawing.get('items', [])
if not items:
return None
path_parts = []
for item in items:
if not item:
continue
# Item format: (type, points...)
item_type = item[0]
try:
if item_type == 'l': # Line
# ('l', Point, Point)
p1, p2 = item[1], item[2]
path_parts.append(f'M {p1.x:.2f} {p1.y:.2f}')
path_parts.append(f'L {p2.x:.2f} {p2.y:.2f}')
elif item_type == 're': # Rectangle
# ('re', Rect)
rect = item[1]
path_parts.append(f'M {rect.x0:.2f} {rect.y0:.2f}')
path_parts.append(f'L {rect.x1:.2f} {rect.y0:.2f}')
path_parts.append(f'L {rect.x1:.2f} {rect.y1:.2f}')
path_parts.append(f'L {rect.x0:.2f} {rect.y1:.2f}')
path_parts.append('Z')
elif item_type == 'qu': # Quad (4-point polygon)
# ('qu', Quad)
quad = item[1]
path_parts.append(f'M {quad.ul.x:.2f} {quad.ul.y:.2f}')
path_parts.append(f'L {quad.ur.x:.2f} {quad.ur.y:.2f}')
path_parts.append(f'L {quad.lr.x:.2f} {quad.lr.y:.2f}')
path_parts.append(f'L {quad.ll.x:.2f} {quad.ll.y:.2f}')
path_parts.append('Z')
elif item_type == 'c': # Cubic bezier curve
# ('c', Point, Point, Point, Point) - start, ctrl1, ctrl2, end
p0, p1, p2, p3 = item[1], item[2], item[3], item[4]
if not path_parts or not path_parts[-1].startswith('M'):
path_parts.append(f'M {p0.x:.2f} {p0.y:.2f}')
path_parts.append(f'C {p1.x:.2f} {p1.y:.2f} {p2.x:.2f} {p2.y:.2f} {p3.x:.2f} {p3.y:.2f}')
except (IndexError, AttributeError) as e:
logger.debug(f"Failed to process drawing item {item_type}: {e}")
continue
return ' '.join(path_parts) if path_parts else None
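As a standalone sketch of the path construction above (the function name is illustrative, not part of the module), a rectangle traces out as a closed four-segment SVG path:

```python
def rect_to_svg_path(x0: float, y0: float, x1: float, y1: float) -> str:
    """Trace a rectangle as an SVG path, mirroring the 're' branch above."""
    return (
        f"M {x0:.2f} {y0:.2f} "
        f"L {x1:.2f} {y0:.2f} "
        f"L {x1:.2f} {y1:.2f} "
        f"L {x0:.2f} {y1:.2f} Z"
    )

# rect_to_svg_path(0, 0, 10, 5)
# → 'M 0.00 0.00 L 10.00 0.00 L 10.00 5.00 L 0.00 5.00 Z'
```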
def _color_to_svg(self, color) -> Optional[str]:
"""Convert PyMuPDF color to SVG color string."""
if color is None:
return None
if isinstance(color, (list, tuple)):
if len(color) == 3:
r, g, b = [int(c * 255) for c in color]
return f'rgb({r},{g},{b})'
elif len(color) == 1:
# Grayscale
gray = int(color[0] * 255)
return f'rgb({gray},{gray},{gray})'
elif len(color) == 4:
# CMYK - convert to RGB (simplified)
c, m, y, k = color
r = int(255 * (1 - c) * (1 - k))
g = int(255 * (1 - m) * (1 - k))
b = int(255 * (1 - y) * (1 - k))
return f'rgb({r},{g},{b})'
return None
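The color mapping can be exercised in isolation. This sketch re-implements the same conversions outside the class (note the CMYK formula is the naive one used above, with no ICC profile handling):

```python
def color_to_svg(color) -> "str | None":
    """Map a PyMuPDF color tuple (RGB, grayscale, or CMYK) to an SVG rgb() string."""
    if color is None:
        return None
    if isinstance(color, (list, tuple)):
        if len(color) == 3:  # RGB components in [0, 1]
            r, g, b = (int(c * 255) for c in color)
            return f"rgb({r},{g},{b})"
        if len(color) == 1:  # grayscale
            gray = int(color[0] * 255)
            return f"rgb({gray},{gray},{gray})"
        if len(color) == 4:  # CMYK, naive conversion
            c, m, y, k = color
            return (f"rgb({int(255 * (1 - c) * (1 - k))},"
                    f"{int(255 * (1 - m) * (1 - k))},"
                    f"{int(255 * (1 - y) * (1 - k))})")
    return None

# color_to_svg((0, 1, 1, 0)) → 'rgb(255,0,0)'  (CMYK red)
```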
def _simplify_svg_paths(self, svg_content: str) -> str:
"""
Basic SVG path simplification.
Reduces decimal precision to shrink file size.
"""
import re
# Reduce decimal precision in path data
def reduce_precision(match):
num = float(match.group())
return f'{num:.1f}'
# Match floating point numbers in SVG
simplified = re.sub(r'-?\d+\.\d{3,}', reduce_precision, svg_content)
return simplified
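The precision trim can be checked on its own. This mirrors the regex above: any float with three or more decimal places collapses to one, while shorter numbers pass through untouched:

```python
import re

def simplify_svg_paths(svg_content: str) -> str:
    """Round floats with 3+ decimal places down to one decimal of precision."""
    return re.sub(r"-?\d+\.\d{3,}", lambda m: f"{float(m.group()):.1f}", svg_content)

# simplify_svg_paths('<path d="M 1.23456 -7.89123" />')
# → '<path d="M 1.2 -7.9" />'
```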

uv.lock (generated, 2 changed lines)

@@ -1032,7 +1032,7 @@ wheels = [
 [[package]]
 name = "mcp-pdf"
-version = "2.0.7"
+version = "2.0.8"
 source = { editable = "." }
 dependencies = [
     { name = "camelot-py", extra = ["cv"] },