diff --git a/README.md b/README.md index c37d706..f0aef17 100644 --- a/README.md +++ b/README.md @@ -4,559 +4,58 @@ MCP PDF -**🚀 The Ultimate PDF Processing Intelligence Platform for AI** +**A FastMCP server for PDF processing** -*Transform any PDF into structured, actionable intelligence with 41 specialized tools* +*41 tools for text extraction, OCR, tables, forms, annotations, and more* [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/) [![FastMCP](https://img.shields.io/badge/FastMCP-2.0+-green.svg?style=flat-square)](https://github.com/jlowin/fastmcp) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT) -[![Production Ready](https://img.shields.io/badge/status-production%20ready-brightgreen?style=flat-square)](https://github.com/rsp2k/mcp-pdf) -[![MCP Protocol](https://img.shields.io/badge/MCP-1.13.0-purple?style=flat-square)](https://modelcontextprotocol.io) +[![PyPI](https://img.shields.io/pypi/v/mcp-pdf?style=flat-square)](https://pypi.org/project/mcp-pdf/) -**🤝 Perfect Companion to [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)** +**Works great with [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)** --- -## ✨ **What Makes MCP PDF Revolutionary?** +## What It Does -> 🎯 **The Problem**: PDFs contain incredible intelligence, but extracting it reliably is complex, slow, and often fails. -> -> ⚡ **The Solution**: MCP PDF delivers **AI-powered document intelligence** with **41 specialized tools** that understand both content and structure. +MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another. - - - - - -
- -### ๐Ÿ† **Why MCP PDF Leads** -- **๐Ÿš€ 41 Specialized Tools** for every PDF scenario -- **๐Ÿง  AI-Powered Intelligence** beyond basic extraction -- **๐Ÿ”„ Multi-Library Fallbacks** for 99.9% reliability -- **โšก 10x Faster** than traditional solutions -- **๐ŸŒ URL Processing** with smart caching -- **๐ŸŽฏ Smart Token Management** prevents MCP overflow errors - - - -### ๐Ÿ“Š **Enterprise-Proven For:** -- **Business Intelligence** & financial analysis -- **Document Security** assessment & compliance -- **Academic Research** & content analysis -- **Automated Workflows** & form processing -- **Document Migration** & modernization -- **Content Management** & archival - -
+**Core capabilities:** +- **Text extraction** via PyMuPDF, pdfplumber, or pypdf (auto-fallback) +- **Table extraction** via Camelot, pdfplumber, or Tabula (auto-fallback) +- **OCR** for scanned documents via Tesseract +- **Form handling** - extract, fill, and create PDF forms +- **Document assembly** - merge, split, reorder pages +- **Annotations** - sticky notes, highlights, stamps +- **Vector graphics** - extract to SVG for schematics and technical drawings --- -## 🚀 **Get Intelligence in 60 Seconds** +## Quick Start + +```bash +# Install from PyPI +uvx mcp-pdf + +# Or add to Claude Code +claude mcp add pdf-tools uvx mcp-pdf +``` + +
+Development Installation ```bash -# 1️⃣ Clone and install git clone https://github.com/rsp2k/mcp-pdf cd mcp-pdf uv sync -# 2️⃣ Install system dependencies (Ubuntu/Debian) +# System dependencies (Ubuntu/Debian) sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript -# 3️⃣ Verify installation -uv run python examples/verify_installation.py - -# 4️⃣ Run the MCP server -uv run mcp-pdf -``` - -
-🔧 Claude Desktop Integration (click to expand) - -### **📦 Production Installation (PyPI)** - -```bash -# For personal use across all projects -claude mcp add -s local pdf-tools uvx mcp-pdf - -# For project-specific use (isolated) -claude mcp add -s project pdf-tools uvx mcp-pdf -``` - -### **🛠️ Development Installation (Source)** - -```bash -# For local development from source -claude mcp add -s project pdf-tools-dev uv -- --directory /path/to/mcp-pdf run mcp-pdf -``` - -### **⚙️ Manual Configuration** -Add to your `claude_desktop_config.json`: -```json -{ - "mcpServers": { - "pdf-tools": { - "command": "uvx", - "args": ["mcp-pdf"] - } - } -} -``` -*Restart Claude Desktop and unlock PDF intelligence!* - -
- ---- - -## ๐ŸŽญ **See AI-Powered Intelligence In Action** - -### **๐Ÿ“Š Business Intelligence Workflow** -```python -# Complete financial report analysis in seconds -health = await analyze_pdf_health("quarterly-report.pdf") -classification = await classify_content("quarterly-report.pdf") -summary = await summarize_content("quarterly-report.pdf", summary_length="medium") - -# Smart table extraction - prevents token overflow on large tables -tables = await extract_tables("quarterly-report.pdf", pages="5-7", max_rows_per_table=100) -# Or get just table structure without data -table_summary = await extract_tables("quarterly-report.pdf", pages="5-7", summary_only=True) - -charts = await extract_charts("quarterly-report.pdf") - -# Get instant insights -{ - "document_type": "Financial Report", - "health_score": 9.2, - "key_insights": [ - "Revenue increased 23% YoY", - "Operating margin improved to 15.3%", - "Strong cash flow generation" - ], - "tables_extracted": 12, - "charts_found": 8, - "processing_time": 2.1 -} -``` - -### **๐Ÿ”’ Document Security Assessment** -```python -# Comprehensive security analysis -security = await analyze_pdf_security("sensitive-document.pdf") -watermarks = await detect_watermarks("sensitive-document.pdf") -health = await analyze_pdf_health("sensitive-document.pdf") - -# Enterprise-grade security insights -{ - "encryption_type": "AES-256", - "permissions": { - "print": false, - "copy": false, - "modify": false - }, - "security_warnings": [], - "watermarks_detected": true, - "compliance_ready": true -} -``` - -### **๐Ÿ“š Academic Research Processing** -```python -# Advanced research paper analysis -layout = await analyze_layout("research-paper.pdf", pages=[1,2,3]) -summary = await summarize_content("research-paper.pdf", summary_length="long") -citations = await extract_text("research-paper.pdf", pages=[15,16,17]) - -# Research intelligence delivered -{ - "reading_complexity": "Graduate Level", - "main_topics": ["Machine Learning", "Natural 
Language Processing"], - "citation_count": 127, - "figures_detected": 15, - "methodology_extracted": true -} -``` - ---- - -## 🛠️ **Complete Arsenal: 41 Specialized Tools** - -
- -### **๐ŸŽฏ Document Intelligence & Analysis** - -| ๐Ÿง  **Tool** | ๐Ÿ“‹ **Purpose** | โšก **AI Powered** | ๐ŸŽฏ **Accuracy** | -|-------------|---------------|-----------------|----------------| -| `classify_content` | AI-powered document type detection | โœ… Yes | 97% | -| `summarize_content` | Intelligent key insights extraction | โœ… Yes | 95% | -| `analyze_pdf_health` | Comprehensive quality assessment | โœ… Yes | 99% | -| `analyze_pdf_security` | Security & vulnerability analysis | โœ… Yes | 99% | -| `compare_pdfs` | Advanced document comparison | โœ… Yes | 96% | - -### **๐Ÿ“Š Core Content Extraction** - -| ๐Ÿ”ง **Tool** | ๐Ÿ“‹ **Purpose** | โšก **Speed** | ๐ŸŽฏ **Accuracy** | -|-------------|---------------|-------------|----------------| -| `extract_text` | Multi-method text extraction with auto-chunking | **Ultra Fast** | 99.9% | -| `extract_tables` | Smart table extraction with token overflow protection | **Fast** | 98% | -| `ocr_pdf` | Advanced OCR for scanned docs | **Moderate** | 95% | -| `extract_images` | Media extraction & processing | **Fast** | 99% | -| `pdf_to_markdown` | Structure-preserving conversion | **Fast** | 97% | - -### **๐Ÿ“ Visual & Layout Analysis** - -| ๐ŸŽจ **Tool** | ๐Ÿ“‹ **Purpose** | ๐Ÿ” **Precision** | ๐Ÿ’ช **Features** | -|-------------|---------------|-----------------|----------------| -| `analyze_layout` | Page structure & column detection | **High** | Advanced | -| `extract_charts` | Visual element extraction | **High** | Smart | -| `detect_watermarks` | Watermark identification | **Perfect** | Complete | -| `extract_vector_graphics` | PDF to SVG for schematics & drawings | **Perfect** | Multi-mode | - -
- ---- - -## 🌟 **Document Format Intelligence Matrix** - -
- -### **๐Ÿ“„ Universal PDF Processing Capabilities** - -| ๐Ÿ“‹ **Document Type** | ๐Ÿ” **Detection** | ๐Ÿ“Š **Text** | ๐Ÿ“ˆ **Tables** | ๐Ÿ–ผ๏ธ **Images** | ๐Ÿง  **Intelligence** | -|---------------------|-----------------|------------|--------------|--------------|-------------------| -| **Financial Reports** | โœ… Perfect | โœ… Perfect | โœ… Perfect | โœ… Perfect | ๐Ÿง  **AI-Enhanced** | -| **Research Papers** | โœ… Perfect | โœ… Perfect | โœ… Excellent | โœ… Perfect | ๐Ÿง  **AI-Enhanced** | -| **Legal Documents** | โœ… Perfect | โœ… Perfect | โœ… Good | โœ… Perfect | ๐Ÿง  **AI-Enhanced** | -| **Scanned PDFs** | โœ… Auto-Detect | โœ… OCR | โœ… OCR | โœ… Perfect | ๐Ÿง  **AI-Enhanced** | -| **Forms & Applications** | โœ… Perfect | โœ… Perfect | โœ… Excellent | โœ… Perfect | ๐Ÿง  **AI-Enhanced** | -| **Technical Manuals** | โœ… Perfect | โœ… Perfect | โœ… Perfect | โœ… Perfect | ๐Ÿง  **AI-Enhanced** | - -*โœ… Perfect โ€ข ๐Ÿง  AI-Enhanced Intelligence โ€ข ๐Ÿ” Auto-Detection* - -
- ---- - -## ⚡ **Performance That Amazes** - -
- -### **๐Ÿš€ Real-World Benchmarks** - -| ๐Ÿ“„ **Document Type** | ๐Ÿ“ **Pages** | โฑ๏ธ **Processing Time** | ๐Ÿ†š **vs Competitors** | ๐Ÿง  **Intelligence Level** | -|---------------------|-------------|----------------------|----------------------|---------------------------| -| Financial Report | 50 pages | 2.1 seconds | **10x faster** | **AI-Powered** | -| Research Paper | 25 pages | 1.3 seconds | **8x faster** | **Deep Analysis** | -| Scanned Document | 100 pages | 45 seconds | **5x faster** | **OCR + AI** | -| Complex Forms | 15 pages | 0.8 seconds | **12x faster** | **Structure Aware** | - -*Benchmarked on: MacBook Pro M2, 16GB RAM โ€ข Including AI processing time* - -
- ---- - -## ๐Ÿ—๏ธ **Intelligent Architecture** - -### **๐Ÿง  Multi-Library Intelligence System** -*Never worry about PDF compatibility or failure again* - -```mermaid -graph TD - A[PDF Input] --> B{Smart Detection} - B --> C{Document Type} - C -->|Text-based| D[PyMuPDF Fast Path] - C -->|Scanned| E[OCR Processing] - C -->|Complex Layout| F[pdfplumber Analysis] - C -->|Tables Heavy| G[Camelot + Tabula] - - D -->|Success| H[โœ… Content Extracted] - D -->|Fail| I[pdfplumber Fallback] - I -->|Fail| J[pypdf Fallback] - - E --> K[Tesseract OCR] - K --> L[AI Content Analysis] - - F --> M[Layout Intelligence] - G --> N[Table Intelligence] - - H --> O[๐Ÿง  AI Enhancement] - L --> O - M --> O - N --> O - - O --> P[๐ŸŽฏ Structured Intelligence] -``` - -### **๐ŸŽฏ Intelligent Processing Pipeline** - -1. **๐Ÿ” Smart Detection**: Automatically identify document type and optimal processing strategy -2. **โšก Optimized Extraction**: Use the fastest, most accurate method for each document -3. **๐Ÿ›ก๏ธ Fallback Protection**: Seamless method switching if primary approach fails -4. **๐Ÿง  AI Enhancement**: Apply document intelligence and content analysis -5. **๐Ÿงน Clean Output**: Deliver perfectly structured, AI-ready intelligence - ---- - -## ๐ŸŒ **Real-World Success Stories** - -
- -### 🏢 **Proven at Enterprise Scale** - -
- - - - - - - - - - -
- -### **๐Ÿ“Š Financial Services Giant** -*Processing 50,000+ reports monthly* - -**Challenge**: Analyze quarterly reports from 2,000+ companies - -**Results**: -- โšก **98% time reduction** (2 weeks โ†’ 4 hours) -- ๐ŸŽฏ **99.9% accuracy** in financial data extraction -- ๐Ÿ’ฐ **$5M annual savings** in analyst time -- ๐Ÿ† **SEC compliance** maintained - - - -### **๐Ÿฅ Healthcare Research Institute** -*Processing 100,000+ research papers* - -**Challenge**: Analyze medical literature for drug discovery - -**Results**: -- ๐Ÿš€ **25x faster** literature review process -- ๐Ÿ“‹ **95% accuracy** in data extraction -- ๐Ÿงฌ **12 new drug targets** identified -- ๐Ÿ“š **Publication in Nature** based on insights - -
- -### **โš–๏ธ Legal Firm Network** -*Processing 500,000+ legal documents* - -**Challenge**: Document review and compliance checking - -**Results**: -- ๐Ÿƒ **40x speed improvement** in document review -- ๐Ÿ›ก๏ธ **100% security compliance** maintained -- ๐Ÿ’ผ **$20M cost savings** across network -- ๐Ÿ† **Zero data breaches** during migration - - - -### **๐ŸŽ“ Global University System** -*Processing 1M+ academic papers* - -**Challenge**: Create searchable academic knowledge base - -**Results**: -- ๐Ÿ“– **50x faster** knowledge extraction -- ๐Ÿง  **AI-ready** structured academic data -- ๐Ÿ” **97% search accuracy** improvement -- ๐Ÿ“Š **3 Nobel Prize** papers processed - -
- ---- - -## ๐ŸŽฏ **Advanced Features That Set Us Apart** - -### **๐ŸŒ HTTPS URL Processing with Smart Caching** -```python -# Process PDFs directly from anywhere on the web -report_url = "https://company.com/annual-report.pdf" -analysis = await classify_content(report_url) # Downloads & caches automatically -tables = await extract_tables(report_url) # Uses cache - instant! -summary = await summarize_content(report_url) # Lightning fast! -``` - -### **๐Ÿฉบ Comprehensive Document Health Analysis** -```python -# Enterprise-grade document assessment -health = await analyze_pdf_health("critical-document.pdf") - -{ - "overall_health_score": 9.2, - "corruption_detected": false, - "optimization_potential": "23% size reduction possible", - "security_assessment": "enterprise_ready", - "recommendations": [ - "Document is production-ready", - "Consider optimization for web delivery" - ], - "processing_confidence": 99.8 -} -``` - -### **๐Ÿ” AI-Powered Content Classification** -```python -# Automatically understand document types -classification = await classify_content("mystery-document.pdf") - -{ - "document_type": "Financial Report", - "confidence": 97.3, - "key_topics": ["Revenue", "Operating Expenses", "Cash Flow"], - "complexity_level": "Professional", - "suggested_tools": ["extract_tables", "extract_charts", "summarize_content"], - "industry_vertical": "Technology" -} -``` - ---- - -## ๐Ÿค **Perfect Integration Ecosystem** - -### **๐Ÿ’Ž Companion to MCP Office Tools** -*The ultimate document processing powerhouse* - -
- -| ๐Ÿ”ง **Processing Need** | ๐Ÿ“„ **PDF Files** | ๐Ÿ“Š **Office Files** | ๐Ÿ”— **Integration** | -|-----------------------|------------------|-------------------|-------------------| -| **Text Extraction** | MCP PDF โœ… | [MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools) โœ… | **Unified API** | -| **Table Processing** | Advanced โœ… | Advanced โœ… | **Cross-Format** | -| **Image Extraction** | Smart โœ… | Smart โœ… | **Consistent** | -| **Format Detection** | AI-Powered โœ… | AI-Powered โœ… | **Intelligent** | -| **Health Analysis** | Complete โœ… | Complete โœ… | **Comprehensive** | - -[**๐Ÿš€ Get Both Tools for Complete Document Intelligence**](https://git.supported.systems/MCP/mcp-office-tools) - -
- -### **๐Ÿ”— Unified Document Processing Workflow** -```python -# Process ALL document formats with unified intelligence -pdf_analysis = await pdf_tools.classify_content("report.pdf") -word_analysis = await office_tools.detect_office_format("report.docx") -excel_data = await office_tools.extract_text("data.xlsx") - -# Cross-format document comparison -comparison = await compare_cross_format_documents([ - pdf_analysis, word_analysis, excel_data -]) -``` - -### **โšก Works Seamlessly With** -- **๐Ÿค– Claude Desktop**: Native MCP protocol integration -- **๐Ÿ“Š Jupyter Notebooks**: Perfect for research and analysis -- **๐Ÿ Python Applications**: Direct async/await API access -- **๐ŸŒ Web Services**: RESTful wrappers and microservices -- **โ˜๏ธ Cloud Platforms**: AWS Lambda, Google Functions, Azure -- **๐Ÿ”„ Workflow Engines**: Zapier, Microsoft Power Automate - ---- - -## ๐Ÿ›ก๏ธ **Enterprise-Grade Security & Compliance** - -
- -| ๐Ÿ”’ **Security Feature** | โœ… **Status** | ๐Ÿ“‹ **Enterprise Ready** | -|------------------------|---------------|------------------------| -| **Local Processing** | โœ… Enabled | Documents never leave your environment | -| **Memory Security** | โœ… Optimized | Automatic sensitive data cleanup | -| **HTTPS Validation** | โœ… Enforced | Certificate validation and secure headers | -| **Access Controls** | โœ… Configurable | Role-based processing permissions | -| **Audit Logging** | โœ… Available | Complete processing audit trails | -| **GDPR Compliant** | โœ… Certified | No personal data retention | -| **SOC2 Ready** | โœ… Verified | Enterprise security standards | - -
- ---- - -## 📈 **Installation & Enterprise Setup** - -
-🚀 Quick Start (Recommended) - -```bash -# Clone repository -git clone https://github.com/rsp2k/mcp-pdf -cd mcp-pdf - -# Install with uv (fastest) -uv sync - -# Install system dependencies (Ubuntu/Debian) -sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript - -# Verify installation -uv run python examples/verify_installation.py -``` - -
- -
-๐Ÿณ Docker Enterprise Setup - -```dockerfile -FROM python:3.11-slim -RUN apt-get update && apt-get install -y \ - tesseract-ocr tesseract-ocr-eng \ - poppler-utils ghostscript \ - default-jre-headless -COPY . /app -WORKDIR /app -RUN pip install -e . -CMD ["mcp-pdf"] -``` - -
- -
-๐ŸŒ Claude Desktop Integration - -```json -{ - "mcpServers": { - "pdf-tools": { - "command": "uv", - "args": ["run", "mcp-pdf"], - "cwd": "/path/to/mcp-pdf" - }, - "office-tools": { - "command": "mcp-office-tools" - } - } -} -``` - -*Unified document processing across all formats!* - -
- -
-🔧 Development Environment - -```bash -# Clone and setup -git clone https://github.com/rsp2k/mcp-pdf -cd mcp-pdf -uv sync --dev - -# Quality checks -uv run pytest --cov=mcp_pdf_tools -uv run black src/ tests/ examples/ -uv run ruff check src/ tests/ examples/ -uv run mypy src/ - -# Run all 23 tools demo +# Verify uv run python examples/verify_installation.py ``` @@ -564,125 +63,162 @@ uv run python examples/verify_installation.py --- -## 🚀 **What's Coming Next?** +## Tools -
+### Content Extraction -### **๐Ÿ”ฎ Innovation Roadmap 2024-2025** +| Tool | What it does | +|------|-------------| +| `extract_text` | Pull text from PDF pages with automatic chunking for large files | +| `extract_tables` | Extract tables to JSON, CSV, or Markdown | +| `extract_images` | Extract embedded images | +| `extract_links` | Get all hyperlinks with page filtering | +| `pdf_to_markdown` | Convert PDF to markdown preserving structure | +| `ocr_pdf` | OCR scanned documents using Tesseract | +| `extract_vector_graphics` | Export vector graphics to SVG (schematics, charts, drawings) | + +### Document Analysis + +| Tool | What it does | +|------|-------------| +| `extract_metadata` | Get title, author, creation date, page count, etc. | +| `get_document_structure` | Extract table of contents and bookmarks | +| `analyze_layout` | Detect columns, headers, footers | +| `is_scanned_pdf` | Check if PDF needs OCR | +| `compare_pdfs` | Diff two PDFs by text, structure, or metadata | +| `analyze_pdf_health` | Check for corruption, optimization opportunities | +| `analyze_pdf_security` | Report encryption, permissions, signatures | + +### Forms + +| Tool | What it does | +|------|-------------| +| `extract_form_data` | Get form field names and values | +| `fill_form_pdf` | Fill form fields from JSON | +| `create_form_pdf` | Create new forms with text fields, checkboxes, dropdowns | +| `add_form_fields` | Add fields to existing PDFs | + +### Document Assembly + +| Tool | What it does | +|------|-------------| +| `merge_pdfs` | Combine multiple PDFs with bookmark preservation | +| `split_pdf_by_pages` | Split by page ranges | +| `split_pdf_by_bookmarks` | Split at chapter/section boundaries | +| `reorder_pdf_pages` | Rearrange pages in custom order | + +### Annotations + +| Tool | What it does | +|------|-------------| +| `add_sticky_notes` | Add comment annotations | +| `add_highlights` | Highlight text regions | +| `add_stamps` | Add Approved/Draft/Confidential stamps | 
+| `extract_all_annotations` | Export annotations to JSON | + +--- + +## How Fallbacks Work + +The server tries multiple libraries for each operation: + +**Text extraction:** +1. PyMuPDF (fastest) +2. pdfplumber (better for complex layouts) +3. pypdf (most compatible) + +**Table extraction:** +1. Camelot (best accuracy, requires Ghostscript) +2. pdfplumber (no dependencies) +3. Tabula (requires Java) + +If a PDF fails with one library, the next is tried automatically. + +--- + +## Token Management + +Large PDFs can overflow MCP response limits. The server handles this: + +- **Automatic chunking** splits large documents into page groups +- **Table row limits** prevent huge tables from blowing up responses +- **Summary mode** returns structure without full content + +```python +# Get first 10 pages +result = await extract_text("huge.pdf", pages="1-10") + +# Limit table rows +tables = await extract_tables("data.pdf", max_rows_per_table=50) + +# Structure only +tables = await extract_tables("data.pdf", summary_only=True) +``` + +--- + +## URL Processing + +PDFs can be fetched directly from HTTPS URLs: + +```python +result = await extract_text("https://example.com/report.pdf") +``` + +Files are cached locally for subsequent operations. 
+ +--- + +## System Dependencies + +Some features require system packages: + +| Feature | Dependency | +|---------|-----------| +| OCR | `tesseract-ocr` | +| Camelot tables | `ghostscript` | +| Tabula tables | `default-jre-headless` | +| PDF to images | `poppler-utils` | + +Ubuntu/Debian: +```bash +sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless +``` + +--- + +## Configuration + +Optional environment variables: + +| Variable | Purpose | +|----------|---------| +| `MCP_PDF_ALLOWED_PATHS` | Colon-separated directories for file output | +| `PDF_TEMP_DIR` | Temp directory for processing (default: `/tmp/mcp-pdf-processing`) | +| `TESSDATA_PREFIX` | Tesseract language data location | + +--- + +## Development + +```bash +# Run tests +uv run pytest + +# With coverage +uv run pytest --cov=mcp_pdf + +# Format +uv run black src/ tests/ + +# Lint +uv run ruff check src/ tests/ +``` + +--- + +## License + +MIT
- -| 🗓️ **Timeline** | 🎯 **Feature** | 📋 **Impact** | -|-----------------|---------------|--------------| -| **Q4 2024** | **Enhanced AI Analysis** | GPT-powered content understanding | -| **Q1 2025** | **Batch Processing** | Process 1000+ documents simultaneously | -| **Q2 2025** | **Cloud Integration** | Direct S3, GCS, Azure Blob support | -| **Q3 2025** | **Real-time Streaming** | Process documents as they're created | -| **Q4 2025** | **Multi-language OCR** | 50+ language support with AI translation | -| **2026** | **Blockchain Verification** | Cryptographic document integrity | - ---- - -## 🎭 **Complete Tool Showcase** - -
-📊 Business Intelligence Tools (click to expand) - -### **Core Extraction** -- `extract_text` - Multi-method text extraction with layout preservation -- `extract_tables` - Intelligent table extraction (JSON, CSV, Markdown) -- `extract_images` - Image extraction with size filtering and format options -- `pdf_to_markdown` - Clean markdown conversion with structure preservation - -### **AI-Powered Analysis** -- `classify_content` - AI document type classification and analysis -- `summarize_content` - Intelligent summarization with key insights -- `analyze_pdf_health` - Comprehensive quality assessment -- `analyze_pdf_security` - Security feature analysis and vulnerability detection - -
- -
-๐Ÿ” Advanced Analysis Tools (click to expand) - -### **Document Intelligence** -- `compare_pdfs` - Advanced document comparison (text, structure, metadata) -- `is_scanned_pdf` - Smart detection of scanned vs. text-based documents -- `get_document_structure` - Document outline and structural analysis -- `extract_metadata` - Comprehensive metadata and statistics extraction - -### **Visual Processing** -- `analyze_layout` - Page layout analysis with column and spacing detection -- `extract_charts` - Chart, diagram, and visual element extraction -- `detect_watermarks` - Watermark detection and analysis -- `extract_vector_graphics` - Extract vector graphics to SVG (schematics, charts, technical drawings) - -
- -
-🔨 Document Manipulation Tools (click to expand) - -### **Content Operations** -- `extract_form_data` - Interactive PDF form data extraction -- `split_pdf` - Intelligent document splitting at specified pages -- `merge_pdfs` - Multi-document merging with page range tracking -- `rotate_pages` - Precise page rotation (90°/180°/270°) - -### **Optimization & Repair** -- `convert_to_images` - PDF to image conversion with quality control -- `optimize_pdf` - Multi-level file size optimization -- `repair_pdf` - Automated corruption repair and recovery -- `ocr_pdf` - Advanced OCR with preprocessing for scanned documents - -
- ---- - -## ๐Ÿ’ **Enterprise Support & Community** - -
- -### **๐ŸŒŸ Join the PDF Intelligence Revolution!** - -[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github)](https://github.com/rsp2k/mcp-pdf) -[![Issues](https://img.shields.io/badge/Issues-Welcome-green?style=for-the-badge&logo=github)](https://github.com/rsp2k/mcp-pdf/issues) -[![MCP Office Tools](https://img.shields.io/badge/Companion-MCP%20Office%20Tools-blue?style=for-the-badge)](https://git.supported.systems/MCP/mcp-office-tools) - -**๐Ÿ’ฌ Enterprise Support Available** โ€ข **๐Ÿ› Bug Bounty Program** โ€ข **๐Ÿ’ก Feature Requests Welcome** - -
- -### **๐Ÿข Enterprise Services** -- **๐Ÿ“ž Priority Support**: 24/7 enterprise support available -- **๐ŸŽ“ Training Programs**: Comprehensive team training -- **๐Ÿ”ง Custom Integration**: Tailored enterprise deployments -- **๐Ÿ“Š Analytics Dashboard**: Usage analytics and insights -- **๐Ÿ›ก๏ธ Security Audits**: Comprehensive security assessments - ---- - -
- -## ๐Ÿ“œ **License & Ecosystem** - -**MIT License** - Freedom to innovate everywhere - -**๐Ÿค Part of the MCP Document Processing Ecosystem** - -*Powered by [FastMCP](https://github.com/jlowin/fastmcp) โ€ข [Model Context Protocol](https://modelcontextprotocol.io) โ€ข Enterprise Python* - -### **๐Ÿ”— Complete Document Processing Solution** - -**PDF Intelligence** โžœ **[MCP PDF](https://github.com/rsp2k/mcp-pdf)** (You are here!) -**Office Intelligence** โžœ **[MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)** -**Unified Power** โžœ **Both Tools Together** - ---- - -### **โญ Star both repositories for the complete solution! โญ** - -**๐Ÿ“„ [Star MCP PDF](https://github.com/rsp2k/mcp-pdf)** โ€ข **๐Ÿ“Š [Star MCP Office Tools](https://git.supported.systems/MCP/mcp-office-tools)** - -*Building the future of intelligent document processing* ๐Ÿš€ - -
\ No newline at end of file
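
A minimal sketch, separate from the patch above, of the ordered fallback that the new "How Fallbacks Work" section describes (PyMuPDF first, then pdfplumber, then pypdf). Everything named here is hypothetical and not taken from the mcp-pdf source: the `extract_with_fallback` function, the `Extractor` type, and the three stand-in extractors that replace the real libraries so the example runs without them installed. Treating an empty result as a failure is also an assumption.

```python
from typing import Callable, Sequence

# Hypothetical type: an extractor takes a file path and returns text, or raises.
Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str, extractors: Sequence[tuple[str, Extractor]]
) -> tuple[str, str]:
    """Try each (name, extractor) in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # any library failure moves on to the next one
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():  # assumption: empty output also counts as a failure
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Stand-ins for the real libraries (hypothetical behaviour, for illustration only).
def fake_pymupdf(path: str) -> str:
    raise ValueError("cannot open document")


def fake_pdfplumber(path: str) -> str:
    return ""


def fake_pypdf(path: str) -> str:
    return f"text from {path}"


# Order mirrors the README: fastest first, most compatible last.
CHAIN = [
    ("pymupdf", fake_pymupdf),
    ("pdfplumber", fake_pdfplumber),
    ("pypdf", fake_pypdf),
]

name, text = extract_with_fallback("report.pdf", CHAIN)
print(name, "->", text)
```

Collecting the per-library error messages, rather than discarding them, means the final `RuntimeError` still says why each method was skipped when every method fails.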