
Ryan Malloy 78a8c40e71 Transform README into comprehensive project showcase

**Major Enhancement**: Combined blog post storytelling with technical documentation
to create an engaging, comprehensive project showcase.

**What's New:**
📖 **Compelling Narrative**: Tells the complete story from 8 tools → 23 tools
🎯 **Real-World Examples**: Business intelligence, academic research, security workflows
🧠 **Technical Deep-Dives**: Architecture decisions, intelligent fallbacks, UX design
⚡ **Performance Insights**: Async architecture, caching strategies, resource management
🔧 **Complete Documentation**: Installation, usage, troubleshooting, contributing

**Key Sections Added:**
- "What We Built" - Project overview and use cases
- "Key Innovations" - Document intelligence, layout processing, web integration
- "Real-World Usage Examples" - 4 comprehensive workflow examples
- "Performance & Architecture" - Technical implementation details
- "Architecture Deep-Dive" - Code examples and design decisions
- "Why MCP PDF Tools?" - Value proposition and differentiators

**Impact**:
- Much more engaging for new users and contributors
- Showcases the full scope of capabilities (23 tools!)
- Provides clear guidance for different use cases
- Demonstrates technical sophistication and quality
- Perfect for sharing, contributing, and adoption

Now developers can understand not just HOW to use the tools, but WHY this
project exists and what makes it special in the PDF processing landscape.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-08-12 08:40:59 -06:00

examples

Add HTTPS URL support and fix MCP parameter validation

2025-08-11 02:25:53 -06:00

src/mcp_pdf_tools

Fix page numbering: Switch to user-friendly 1-based indexing

2025-08-11 04:32:20 -06:00

tests

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

.env.example

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

.gitignore

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

.mcp.json

Fix page numbering: Switch to user-friendly 1-based indexing

2025-08-11 04:32:20 -06:00

claude_desktop_config.json

Fix page numbering: Switch to user-friendly 1-based indexing

2025-08-11 04:32:20 -06:00

CLAUDE_DESKTOP_SETUP.md

Fix page numbering: Switch to user-friendly 1-based indexing

2025-08-11 04:32:20 -06:00

CLAUDE.md

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

docker-compose.yml

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

Dockerfile

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

LICENSE

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

MANIFEST.in

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

mcp-config-example.json

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

mcp-pdf-tools-launcher.sh

Fix page numbering: Switch to user-friendly 1-based indexing

2025-08-11 04:32:20 -06:00

pyproject.toml

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

QUICKSTART.md

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

README.md

Transform README into comprehensive project showcase

2025-08-12 08:40:59 -06:00

run-mcp-server.sh

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

test_pages_parameter.py

Fix page numbering: Switch to user-friendly 1-based indexing

2025-08-11 04:32:20 -06:00

test_url_support.py

Add HTTPS URL support and fix MCP parameter validation

2025-08-11 02:25:53 -06:00

uv.lock

Initial commit: Complete MCP PDF Tools server implementation

2025-08-10 16:36:21 -06:00

README.md

MCP PDF Tools: A Complete PDF Processing Powerhouse

From basic text extraction to AI-powered document intelligence - 23 comprehensive tools for every PDF processing need

🚀 What We Built

MCP PDF Tools has grown from a simple 8-tool PDF processor into a 23-tool document intelligence platform. It covers tasks such as extracting tables from financial reports, analyzing document security, and building automated workflows.

🎯 Perfect for:

Business Intelligence: Financial report analysis, data extraction, document comparison
Academic Research: Paper analysis, citation extraction, content summarization
Document Security: Security assessment, watermark detection, integrity verification
Automated Workflows: Form processing, document splitting/merging, batch optimization

✨ Key Innovations

🧠 Document Intelligence

Go beyond simple extraction with AI-powered analysis:

Smart Classification: Automatically detect document types (academic, legal, financial, etc.)
Intelligent Summarization: Extract key insights and generate summaries
Content Analysis: Topic extraction, language detection, complexity assessment
Quality Assessment: Comprehensive health checks and optimization recommendations

📐 Advanced Layout Processing

Understand document structure, not just content:

Layout Analysis: Column detection, reading order, text block analysis
Visual Element Extraction: Charts, diagrams, and image processing
Watermark Detection: Identify and analyze document watermarks
Form Processing: Extract interactive form fields and values

🔧 Professional Document Operations

Handle complex document workflows:

Intelligent Splitting/Merging: Precise page-level control
Security Analysis: Encryption, permissions, vulnerability assessment
Document Repair: Recover corrupted or damaged PDFs
Smart Optimization: Multi-level compression with quality preservation

🌐 Modern Web Integration

Process PDFs from anywhere:

HTTPS URL Support: Direct processing from web URLs
Intelligent Caching: 1-hour smart caching to avoid repeated downloads
Content Validation: Automatic PDF format verification
User-Friendly: 1-based page numbering (page 1 = first page, not page 0!)

📊 Complete Tool Suite (23 Tools)

🔧 Core Processing Tools

Tool	Description
`extract_text`	Multi-method text extraction with layout preservation
`extract_tables`	Intelligent table extraction (JSON, CSV, Markdown)
`ocr_pdf`	Advanced OCR with preprocessing for scanned documents
`extract_images`	Image extraction with size filtering and format options
`pdf_to_markdown`	Clean markdown conversion with structure preservation

🧠 Document Analysis & Intelligence

Tool	Description
`classify_content`	AI-powered document type classification and analysis
`summarize_content`	Intelligent summarization with key insights extraction
`analyze_pdf_health`	Comprehensive quality assessment and optimization suggestions
`analyze_pdf_security`	Security feature analysis and vulnerability detection
`compare_pdfs`	Advanced document comparison (text, structure, metadata)
`is_scanned_pdf`	Smart detection of scanned vs. text-based documents
`get_document_structure`	Document outline and structural analysis
`extract_metadata`	Comprehensive metadata and statistics extraction

📐 Layout & Visual Analysis

Tool	Description
`analyze_layout`	Page layout analysis with column and spacing detection
`extract_charts`	Chart, diagram, and visual element extraction
`detect_watermarks`	Watermark detection and analysis

🔨 Content Manipulation

Tool	Description
`extract_form_data`	Interactive PDF form data extraction
`split_pdf`	Intelligent document splitting at specified pages
`merge_pdfs`	Multi-document merging with page range tracking
`rotate_pages`	Precise page rotation (90°/180°/270°)

⚡ Optimization & Utilities

Tool	Description
`convert_to_images`	PDF to image conversion with quality control
`optimize_pdf`	Multi-level file size optimization
`repair_pdf`	Automated corruption repair and recovery

🎯 Real-World Usage Examples

📊 Business Intelligence Workflow

# Comprehensive financial report analysis
health = await analyze_pdf_health("quarterly-report.pdf")
classification = await classify_content("quarterly-report.pdf") 
summary = await summarize_content("quarterly-report.pdf", summary_length="medium")
tables = await extract_tables("quarterly-report.pdf", pages="5,6,7")
charts = await extract_charts("quarterly-report.pdf")

print(f"Document type: {classification['document_type']}")
print(f"Health score: {health['overall_health_score']}")
print(f"Key insights: {summary['key_insights']}")

📚 Academic Research Processing

# Process research papers with full analysis
layout = await analyze_layout("research-paper.pdf", pages="1,2,3")
summary = await summarize_content("research-paper.pdf", summary_length="long")
references = await extract_text("research-paper.pdf", pages="15,16,17")
document_health = await analyze_pdf_health("research-paper.pdf")

print(f"Reading complexity: {layout['layout_statistics']['reading_complexity']}")
print(f"Main topics: {summary['key_topics']}")

🔒 Document Security Assessment

# Comprehensive security analysis
security = await analyze_pdf_security("sensitive-document.pdf")
watermarks = await detect_watermarks("sensitive-document.pdf")
health = await analyze_pdf_health("sensitive-document.pdf")

print(f"Encryption status: {security['encryption']['encryption_type']}")
print(f"Security warnings: {security['security_warnings']}")
print(f"Watermarks detected: {watermarks['has_watermarks']}")

📋 Automated Form Processing

# Extract and process form data
forms = await extract_form_data("application-form.pdf")
health = await analyze_pdf_health("application-form.pdf")

required_fields = [f for f in forms['form_fields'] if f['is_required']]
# Count only required fields that have a value, so the ratio cannot exceed 100%
filled_required = [f for f in required_fields if f['field_value']]

print(f"Form completion: {len(filled_required)}/{len(required_fields)} required fields")

🌐 URL Processing - Work with PDFs Anywhere

All tools support direct HTTPS URL processing:

# Process PDFs directly from the web
await extract_text("https://example.com/report.pdf")
await analyze_layout("https://company.com/whitepaper.pdf", pages="1,2,3")
await extract_tables("https://research.org/data.pdf", output_format="csv")

Advanced URL Features:

Intelligent Caching: 1-hour cache prevents repeated downloads
Content Validation: Verifies PDF format and integrity
Security Headers: Proper User-Agent and secure requests
Error Handling: Clear messages for network/content issues
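
The caching and validation behaviour listed above can be illustrated with a minimal sketch. It is not the project's implementation: `UrlCache`, `looks_like_pdf`, and the in-memory storage are hypothetical names and choices made for this example. Only the one-hour lifetime and the HTTPS-only rule come from the text above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 3600  # one hour, as described in the feature list


def looks_like_pdf(data: bytes) -> bool:
    """Cheap format check: a PDF file starts with the bytes %PDF-."""
    return data.startswith(b"%PDF-")


class UrlCache:
    """Hypothetical in-memory cache keyed by URL; entries expire after a TTL."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str) -> Optional[bytes]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, data = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[url]  # expired: the caller should download again
            return None
        return data

    def put(self, url: str, data: bytes) -> None:
        if not url.startswith("https://"):
            raise ValueError("only HTTPS URLs are accepted")
        if not looks_like_pdf(data):
            raise ValueError("content does not look like a PDF")
        self._entries[url] = (self._clock(), data)
```

Validating before storing means a failed or non-PDF download is never served from the cache on a later call.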

🛠 Installation & Setup

Quick Start

# Clone and install
git clone https://github.com/rpm/mcp-pdf-tools
cd mcp-pdf-tools
uv sync

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript

# Verify installation
uv run python examples/verify_installation.py

Claude Desktop Integration

Add to your Claude Desktop configuration (on macOS: ~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "pdf-tools": {
      "command": "uv",
      "args": ["run", "mcp-pdf-tools"],
      "cwd": "/path/to/mcp-pdf-tools"
    }
  }
}

Claude Code Integration

claude mcp add pdf-tools "uvx --from /path/to/mcp-pdf-tools mcp-pdf-tools"

📖 Usage Examples

Text Extraction with Layout Preservation

# Basic text extraction
result = await extract_text("document.pdf")

# Extract specific pages with layout preservation
result = await extract_text(
    pdf_path="document.pdf",
    pages=[1, 2, 3],  # First 3 pages (1-based numbering)
    preserve_layout=True,
    method="pdfplumber"
)

Advanced Table Extraction

# Extract all tables
result = await extract_tables("document.pdf")

# Extract tables from specific pages in markdown format
result = await extract_tables(
    pdf_path="document.pdf",
    pages=[2, 3],  # Pages 2 and 3 (1-based numbering)
    output_format="markdown"
)

Document Analysis & Intelligence

# Comprehensive document analysis
health = await analyze_pdf_health("document.pdf")
classification = await classify_content("document.pdf")
summary = await summarize_content(
    pdf_path="document.pdf",
    summary_length="medium",
    pages="1,2,3"  # Specific pages (1-based numbering)
)

Content Manipulation

# Split PDF into separate files  
result = await split_pdf(
    pdf_path="document.pdf",
    split_pages="5,10,15",  # Split after pages 5, 10, 15 (1-based)
    output_prefix="section"
)

# Merge multiple PDFs
result = await merge_pdfs(
    pdf_paths=["/path/to/doc1.pdf", "/path/to/doc2.pdf"],
    output_filename="merged_document.pdf"
)

# Rotate specific pages
result = await rotate_pages(
    pdf_path="document.pdf",
    page_rotations={"1": 90, "3": 180}  # Page 1: 90°, Page 3: 180° (1-based)
)
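
To make the "split after pages 5, 10, 15" semantics concrete, here is a small sketch of the page arithmetic. `split_ranges` is a hypothetical helper written for this example, and it assumes the split points are read as "cut after this page" exactly as the comment above describes. The real `split_pdf` tool may validate or name its outputs differently.

```python
from typing import List, Tuple


def split_ranges(split_pages: str, total_pages: int) -> List[Tuple[int, int]]:
    """Return inclusive 1-based (start, end) ranges, cutting after each listed page."""
    # Deduplicate and sort so "10,5,5" behaves the same as "5,10"
    cuts = sorted({int(part) for part in split_pages.split(",") if part.strip()})
    ranges: List[Tuple[int, int]] = []
    start = 1
    for cut in cuts:
        if not 1 <= cut < total_pages:
            raise ValueError(f"split point {cut} is outside 1..{total_pages - 1}")
        ranges.append((start, cut))
        start = cut + 1
    ranges.append((start, total_pages))  # whatever remains after the last cut
    return ranges
```

For a 20-page document, `split_ranges("5,10,15", 20)` yields four sections: pages 1-5, 6-10, 11-15, and 16-20.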

Visual Analysis

# Extract charts and diagrams
result = await extract_charts(
    pdf_path="/path/to/report.pdf",
    pages="2,3,4",  # Pages 2, 3, 4 (1-based numbering)
    min_size=150
)

# Detect watermarks
result = await detect_watermarks("document.pdf")

# Security analysis
result = await analyze_pdf_security("document.pdf")

Optimization & Repair

# Optimize PDF file size
result = await optimize_pdf(
    pdf_path="large-document.pdf",
    optimization_level="balanced",  # "light", "balanced", "aggressive"
    preserve_quality=True
)

# Repair corrupted PDF
result = await repair_pdf("corrupted-document.pdf")

⚡ Performance & Architecture

Multi-Library Intelligence

Rather than relying on a single approach, we use intelligent fallback systems:

Text Extraction: PyMuPDF → pdfplumber → pypdf (automatic selection)
Table Extraction: Camelot → pdfplumber → Tabula (tries until success)
Smart Detection: Automatically detects scanned PDFs and suggests OCR

Async-First Design

All operations are built with modern async/await patterns:

# All tools are fully async
results = await asyncio.gather(
    extract_text("doc1.pdf"),
    analyze_layout("doc2.pdf"),  
    extract_tables("doc3.pdf")
)
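
When several documents are processed concurrently, one failing call would otherwise raise out of `asyncio.gather` and the other results would be lost to the caller. A minimal sketch of collecting successes and failures together follows. `fake_tool` is a hypothetical stand-in for the real tools so the example runs on its own.

```python
import asyncio
from typing import Any, List


async def fake_tool(name: str, fail: bool = False) -> dict:
    """Hypothetical stand-in for a PDF tool; not part of the project."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    if fail:
        raise RuntimeError(f"{name} failed")
    return {"tool": name, "ok": True}


async def run_batch() -> List[Any]:
    # return_exceptions=True returns exceptions as values in the result list,
    # in the same order as the awaitables, instead of raising the first one.
    return await asyncio.gather(
        fake_tool("extract_text"),
        fake_tool("analyze_layout", fail=True),
        fake_tool("extract_tables"),
        return_exceptions=True,
    )


def partition(results: List[Any]) -> tuple:
    """Split gather() output into (successes, failures)."""
    successes = [r for r in results if not isinstance(r, Exception)]
    failures = [r for r in results if isinstance(r, Exception)]
    return successes, failures
```

Calling `partition(asyncio.run(run_batch()))` gives two successful results and one captured `RuntimeError`.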

Resource Management

Memory Efficient: Streaming processing for large documents
Smart Caching: Intelligent URL caching and resource cleanup
Performance Monitoring: All operations include timing metrics

🔧 Development

Setup Development Environment

# Install with development dependencies
uv sync --dev

# Run tests
uv run pytest

# Format and lint code
uv run black src/ tests/ examples/
uv run ruff check src/ tests/ examples/

# Type checking
uv run mypy src/

Quality Standards

✅ 100% Lint-Free: All code passes ruff checks
✅ Type Safety: Comprehensive type hints with mypy
✅ Error Handling: Consistent error patterns across all tools
✅ Documentation: Clear docstrings and usage examples
✅ Testing: Comprehensive test coverage

🧪 Testing

# Run all tests
uv run pytest

# Test with coverage
uv run pytest --cov=mcp_pdf_tools

# Test specific functionality
uv run pytest tests/test_server.py::test_extract_text

# Verify page numbering (1-based conversion)
uv run python test_pages_parameter.py

🚀 Advanced Features

Environment Variables

# Optional configuration
TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata  # Tesseract data location
PDF_TEMP_DIR=/tmp/pdf_processing                     # Temporary file directory  
DEBUG=true                                           # Enable debug logging
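
As an illustration of how these variables might be consumed, the sketch below reads them into a small settings object. The `Settings` class, `load_settings`, the fallback to the system temp directory, and the set of accepted truthy strings are all assumptions for this example. Only the three variable names come from the list above.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    tessdata_prefix: Optional[str]  # None lets Tesseract use its own default
    temp_dir: str
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from an environment mapping (os.environ by default)."""
    return Settings(
        tessdata_prefix=env.get("TESSDATA_PREFIX"),
        # Assumed fallback: the platform temp directory when PDF_TEMP_DIR is unset
        temp_dir=env.get("PDF_TEMP_DIR", tempfile.gettempdir()),
        debug=env.get("DEBUG", "").strip().lower() in {"1", "true", "yes", "on"},
    )
```

Accepting a mapping argument keeps the function easy to test without touching the real process environment.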

Docker Support

FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
    tesseract-ocr tesseract-ocr-eng \
    poppler-utils ghostscript \
    default-jre-headless
# ... rest of Dockerfile

🔍 Troubleshooting

OCR Issues

# Install language packs
sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu

# macOS
brew install tesseract-lang

Table Extraction Issues

# Install Java (required for Tabula)
sudo apt-get install default-jre-headless

# Install Ghostscript (required for Camelot)
sudo apt-get install ghostscript

Memory Issues with Large PDFs

Process specific page ranges: pages="1,2,3"
Use streaming capabilities: method="pdfplumber"
Consider splitting large documents first
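
Processing a large document in page ranges can be sketched as follows. `page_batches` is a hypothetical helper for this example. It produces comma-separated 1-based page strings in the same form as the `pages="1,2,3"` value shown above, and each string could then be passed to a tool call one batch at a time.

```python
from typing import Iterator


def page_batches(total_pages: int, batch_size: int) -> Iterator[str]:
    """Yield comma-separated 1-based page lists, batch_size pages at a time."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        yield ",".join(str(page) for page in range(start, end + 1))
```

Because it is a generator, only one batch string exists at a time, which suits documents with thousands of pages.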

🏗 Architecture Deep-Dive

Intelligent Method Selection

# Automatic fallback system
async def extract_text_with_fallback(pdf_path: str):
    try:
        return await extract_with_pymupdf(pdf_path)  # Fast, good for most PDFs
    except Exception:
        try:
            return await extract_with_pdfplumber(pdf_path)  # Layout-aware
        except Exception:
            return await extract_with_pypdf(pdf_path)  # Maximum compatibility
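
The same fallback idea can be written as a loop, which scales to any number of methods and keeps the reason each one failed. This is a sketch under assumptions: `first_successful` and the two stand-in extractors are hypothetical names for this example, not functions from the project.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple

Extractor = Callable[[str], Awaitable[str]]


async def first_successful(
    pdf_path: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each (name, extractor) in order; return the first name and result."""
    errors: List[str] = []
    for name, extract in extractors:
        try:
            return name, await extract(pdf_path)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Hypothetical stand-ins so the sketch runs without any PDF library installed
async def broken_extractor(pdf_path: str) -> str:
    raise ValueError("cannot parse")


async def working_extractor(pdf_path: str) -> str:
    return f"text from {pdf_path}"
```

Returning the method name alongside the text lets the caller log which library handled each document.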

User Experience Design

# Before: Confusing zero-based indexing
pages=[0, 1, 2]  # First 3 pages - not intuitive!

# After: Natural 1-based indexing
pages=[1, 2, 3]  # First 3 pages - makes perfect sense!

# Internal conversion happens automatically
def parse_pages_parameter(pages):
    # Convert 1-based user input to 0-based internal representation
    return [max(0, p - 1) for p in pages]
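
The examples in this README pass pages both as a list (`[1, 2, 3]`) and as a comma-separated string (`"1,2,3"`). A conversion that accepts both forms might look like the sketch below. `to_zero_based` is a hypothetical name, and rejecting page numbers below 1 (rather than clamping them to 0) is a design choice made for this example.

```python
from typing import List, Optional, Sequence, Union


def to_zero_based(pages: Union[str, Sequence[int], None]) -> Optional[List[int]]:
    """Convert 1-based user page numbers to 0-based indices; None means all pages."""
    if pages is None:
        return None
    if isinstance(pages, str):
        numbers = [int(part) for part in pages.split(",") if part.strip()]
    else:
        numbers = [int(page) for page in pages]
    for number in numbers:
        if number < 1:
            # Fail loudly instead of silently treating page 0 as page 1
            raise ValueError(f"page numbers start at 1, got {number}")
    return [number - 1 for number in numbers]
```

Raising on page 0 surfaces leftover zero-based callers early, whereas clamping would quietly return the first page.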

🤝 Contributing

We welcome contributions! Here's how to get involved:

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Add tests for new functionality
Ensure code quality: uv run ruff check && uv run pytest
Submit a pull request

Development Workflow

# Setup development environment
git clone https://github.com/your-username/mcp-pdf-tools
cd mcp-pdf-tools
uv sync --dev

# Make changes and test
uv run pytest
uv run ruff check src/

# Submit changes
git add .
git commit -m "Add amazing new feature"
git push origin feature/amazing-feature

📜 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

This project leverages several excellent libraries:

PyMuPDF: Fast PDF operations and rendering
pdfplumber: Layout-aware text extraction
Camelot: Advanced table extraction
Tabula-py: Java-based table extraction
Tesseract: Industry-standard OCR
FastMCP: Modern MCP server framework

🌟 Why MCP PDF Tools?

🚀 Comprehensive: 23 specialized tools covering every PDF processing need
🧠 Intelligent: AI-powered analysis and smart method selection
🌐 Modern: HTTPS URL support with intelligent caching
👥 User-Friendly: Intuitive 1-based page numbering and clear APIs
🔧 Production-Ready: Robust error handling and performance optimization
📈 Scalable: Async architecture with efficient resource management

Whether you're building document analysis pipelines, creating intelligent workflows, or need reliable PDF processing for your applications, MCP PDF Tools provides the comprehensive foundation you need.

Ready to get started? Clone the repo and run uv run python examples/verify_installation.py to see all 23 tools in action!

Built with ❤️ using modern Python, FastMCP, and the power of intelligent document processing. Questions? Open an issue or contribute - we'd love to hear about your use cases!