Ryan Malloy 478ab41b1f Merge remote repository with local MCP PDF Tools implementation
2025-08-10 17:00:49 -06:00

MCP PDF Tools

A FastMCP server for PDF processing. It provides tools for extracting text, tables, images, and metadata from PDFs, running OCR on scanned documents, and converting PDFs to Markdown.

Features

  • Text Extraction: Multiple methods (PyMuPDF, pdfplumber, pypdf) with automatic selection
  • Table Extraction: Support for both bordered and borderless tables using Camelot, Tabula, and pdfplumber
  • OCR: Process scanned PDFs with Tesseract OCR, including preprocessing for better results
  • Document Analysis: Extract structure, metadata, and check if PDFs are scanned
  • Image Extraction: Extract images with size filtering
  • Format Conversion: Convert PDFs to clean Markdown format
  • Smart Detection: Automatically detect the best method for each operation

Installation

# Clone the repository
git clone https://github.com/rpm/mcp-pdf-tools
cd mcp-pdf-tools

# Install with uv
uv sync

# Install Tesseract OCR (required for OCR functionality)
# On Ubuntu/Debian:
sudo apt-get install tesseract-ocr tesseract-ocr-eng

# On macOS:
brew install tesseract

# On Windows:
# Download installer from: https://github.com/UB-Mannheim/tesseract/wiki

Using pip

pip install mcp-pdf-tools

# Install system dependencies for OCR
# Same as above for Tesseract

Configuration

Claude Desktop Integration

Add to your Claude configuration (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

{
  "mcpServers": {
    "pdf-tools": {
      "command": "uv",
      "args": ["run", "mcp-pdf-tools"],
      "cwd": "/path/to/mcp-pdf-tools"
    }
  }
}

Or if installed via pip:

{
  "mcpServers": {
    "pdf-tools": {
      "command": "mcp-pdf-tools"
    }
  }
}
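
If claude_desktop_config.json already lists other servers, the entry above needs to be merged into the existing "mcpServers" object rather than pasted over the file. The following is a minimal sketch of that merge on an in-memory dict. The helper name add_pdf_tools_server is hypothetical and not part of this project, and nothing is written to disk.

```python
import json
from typing import Optional


def add_pdf_tools_server(config: dict, project_dir: Optional[str] = None) -> dict:
    """Return a copy of a Claude Desktop config with a "pdf-tools" entry added.

    Hypothetical helper for illustration. With project_dir set it produces
    the uv-based entry; without it, the pip-installed entry.
    """
    if project_dir is not None:
        entry = {
            "command": "uv",
            "args": ["run", "mcp-pdf-tools"],
            "cwd": project_dir,
        }
    else:
        entry = {"command": "mcp-pdf-tools"}

    # Copy the outer dict and the server map so the input is left untouched.
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["pdf-tools"] = entry
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other-server": {"command": "other"}}}
updated = add_pdf_tools_server(existing, project_dir="/path/to/mcp-pdf-tools")
print(json.dumps(updated, indent=2))
```

Existing servers are preserved, and calling the helper again simply overwrites the "pdf-tools" entry.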

Claude Code Integration

For development with Claude Code, add the MCP server from your local development directory:

claude mcp add pdf-tools "uvx --from /path/to/mcp-pdf-tools mcp-pdf-tools"

Environment Variables

Create a .env file in your project directory:

# Optional: Tesseract configuration
TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata

# Optional: Temporary file directory
PDF_TEMP_DIR=/tmp/pdf_processing

# Optional: Enable debug logging
DEBUG=true
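
How the server reads these variables is not shown in this README. As a rough sketch, assuming they arrive as ordinary process environment variables and that unset values fall back to defaults, a settings loader could look like the following. The names Settings and load_settings and the fallback values are illustrative assumptions, not the project's code.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    tessdata_prefix: Optional[str]
    temp_dir: str
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the three optional variables documented above (hypothetical loader)."""
    debug_raw = env.get("DEBUG", "false").strip().lower()
    return Settings(
        tessdata_prefix=env.get("TESSDATA_PREFIX"),
        # Assumed fallback: the system temporary directory.
        temp_dir=env.get("PDF_TEMP_DIR", tempfile.gettempdir()),
        debug=debug_raw in {"1", "true", "yes", "on"},
    )


settings = load_settings({"PDF_TEMP_DIR": "/tmp/pdf_processing", "DEBUG": "true"})
print(settings)
```

Passing the environment as a mapping keeps the loader easy to test. Note that Python does not read a .env file on its own, so the values must reach the process environment by some other means.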

Usage Examples

Text Extraction

# Basic text extraction
result = await extract_text(
    pdf_path="/path/to/document.pdf"
)

# Extract specific pages with layout preservation
result = await extract_text(
    pdf_path="/path/to/document.pdf",
    pages=[0, 1, 2],  # First 3 pages
    preserve_layout=True,
    method="pdfplumber"  # Or "auto", "pymupdf", "pypdf"
)
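
The pages argument in the example above is zero-based, so [0, 1, 2] selects the first three pages. Below is a small sketch of validating such a list against a document's page count. normalize_pages is a hypothetical helper, and the server's own validation may behave differently.

```python
from typing import List, Optional


def normalize_pages(pages: Optional[List[int]], page_count: int) -> List[int]:
    """Return sorted, de-duplicated zero-based page indices.

    None means "all pages". Out-of-range indices raise ValueError.
    """
    if pages is None:
        return list(range(page_count))
    bad = [p for p in pages if not 0 <= p < page_count]
    if bad:
        raise ValueError(f"page indices out of range for {page_count} pages: {bad}")
    return sorted(set(pages))


print(normalize_pages([2, 0, 1, 1], page_count=5))
print(normalize_pages(None, page_count=3))
```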

Table Extraction

# Extract all tables
result = await extract_tables(
    pdf_path="/path/to/document.pdf"
)

# Extract tables from specific pages in markdown format
result = await extract_tables(
    pdf_path="/path/to/document.pdf",
    pages=[2, 3],
    output_format="markdown"  # Or "json", "csv"
)
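
To make the three output formats concrete, here is a sketch that renders one small table as JSON, CSV, or Markdown using only the standard library. It assumes a table is a list of rows with the header first. The function format_table and that row layout are assumptions for illustration, and the shape of the server's actual result is not specified here.

```python
import csv
import io
import json
from typing import List

Table = List[List[str]]


def format_table(rows: Table, output_format: str = "json") -> str:
    """Render a table whose first row is the header (illustrative only)."""
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    if output_format == "json":
        # One object per data row, keyed by header cell.
        return json.dumps([dict(zip(header, r)) for r in body])
    if output_format == "csv":
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        return buf.getvalue()
    if output_format == "markdown":
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(r) + " |" for r in body]
        return "\n".join(lines)
    raise ValueError(f"unsupported output_format: {output_format!r}")


rows = [["Item", "Qty"], ["Widget", "3"], ["Gadget", "7"]]
print(format_table(rows, "markdown"))
```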

OCR for Scanned PDFs

# Basic OCR
result = await ocr_pdf(
    pdf_path="/path/to/scanned.pdf"
)

# OCR with multiple languages and preprocessing
result = await ocr_pdf(
    pdf_path="/path/to/scanned.pdf",
    languages=["eng", "fra", "deu"],
    preprocess=True,
    dpi=300
)
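
Tesseract's command line accepts several languages as a single "+"-joined value for its -l option. The sketch below shows how a languages list and a dpi value could map onto a Tesseract command. It only builds the argument list and does not run anything. Whether this server calls the CLI directly or goes through a Python wrapper is not stated here, so treat build_tesseract_command as a hypothetical helper.

```python
from typing import List, Sequence


def build_tesseract_command(
    image_path: str,
    languages: Sequence[str] = ("eng",),
    dpi: int = 300,
) -> List[str]:
    """Build a tesseract argument list that writes recognized text to stdout."""
    if not languages:
        raise ValueError("at least one language code is required")
    return [
        "tesseract",
        image_path,
        "stdout",                  # output base; "stdout" prints the text
        "-l", "+".join(languages),  # e.g. eng+fra+deu
        "--dpi", str(dpi),
    ]


cmd = build_tesseract_command("page-1.png", ["eng", "fra", "deu"], dpi=300)
print(" ".join(cmd))
```

Each language code must have its traineddata file installed, which is what the language packs in the Troubleshooting section provide.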

Document Analysis

# Check if PDF is scanned
result = await is_scanned_pdf(
    pdf_path="/path/to/document.pdf"
)

# Get document structure and metadata
result = await get_document_structure(
    pdf_path="/path/to/document.pdf"
)

# Extract comprehensive metadata
result = await extract_metadata(
    pdf_path="/path/to/document.pdf"
)
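
The README does not describe how is_scanned_pdf decides. One common heuristic, sketched here purely as an assumption, is to treat a document as scanned when most of its pages yield almost no extractable text. The function name looks_scanned and both thresholds are made up for illustration.

```python
from typing import Sequence


def looks_scanned(
    page_texts: Sequence[str],
    min_chars_per_page: int = 50,
    threshold: float = 0.8,
) -> bool:
    """Guess "scanned" when at least `threshold` of pages are nearly text-free.

    page_texts holds the text already extracted from each page.
    """
    if not page_texts:
        return False
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) >= threshold


print(looks_scanned(["", "  ", ""]))
print(looks_scanned(["A" * 500, "B" * 400, ""]))
```

A heuristic like this can misclassify image-heavy but text-based PDFs, so the thresholds would need tuning against real documents.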

Format Conversion

# Convert to Markdown
result = await pdf_to_markdown(
    pdf_path="/path/to/document.pdf",
    include_images=True,
    include_metadata=True
)

Image Extraction

# Extract images with size filtering
result = await extract_images(
    pdf_path="/path/to/document.pdf",
    min_width=200,
    min_height=200,
    output_format="png"  # Or "jpeg"
)
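
Size filtering of this kind is usually meant to skip icons, rules, and other small decorations. A minimal sketch of the idea follows, assuming each image is described by its page and pixel dimensions and that the limits are inclusive. ImageInfo and filter_by_size are hypothetical names, not the server's API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ImageInfo:
    page: int
    width: int
    height: int


def filter_by_size(
    images: Iterable[ImageInfo],
    min_width: int = 200,
    min_height: int = 200,
) -> List[ImageInfo]:
    """Keep images that meet both minimum dimensions (inclusive, by assumption)."""
    return [
        img for img in images
        if img.width >= min_width and img.height >= min_height
    ]


found = [
    ImageInfo(page=0, width=1200, height=800),  # figure
    ImageInfo(page=0, width=32, height=32),     # icon
    ImageInfo(page=1, width=600, height=150),   # wide banner, too short
]
kept = filter_by_size(found)
print(kept)
```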

Available Tools

Tool                     Description
extract_text             Extract text with multiple methods and layout preservation
extract_tables           Extract tables in various formats (JSON, CSV, Markdown)
ocr_pdf                  Perform OCR on scanned PDFs with preprocessing
is_scanned_pdf           Check if a PDF is scanned or text-based
get_document_structure   Extract document structure, outline, and basic metadata
extract_metadata         Extract comprehensive metadata and file statistics
pdf_to_markdown          Convert PDF to clean Markdown format
extract_images           Extract images with filtering options

Development

Setup Development Environment

# Clone and enter directory
git clone https://github.com/rpm/mcp-pdf-tools
cd mcp-pdf-tools

# Install with development dependencies
uv sync --dev

# Run tests
uv run pytest

# Format and lint code
uv run black src/ tests/
uv run ruff check src/ tests/

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=mcp_pdf_tools

# Run specific test
uv run pytest tests/test_server.py::test_extract_text

Building for PyPI

# Build the package
uv build

# Upload to PyPI (requires credentials)
uv publish

Troubleshooting

OCR Not Working

  1. Tesseract not installed: Make sure Tesseract is installed on your system
  2. Language data missing: Install additional language packs:
    # Ubuntu/Debian
    sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu
    
    # macOS
    brew install tesseract-lang
    

Table Extraction Issues

  1. Java not found: Tabula requires Java. Install Java 8 or higher.
  2. Camelot dependencies: Install system dependencies:
    # Ubuntu/Debian
    sudo apt-get install python3-tk ghostscript
    
    # macOS
    brew install ghostscript tcl-tk
    

Memory Issues with Large PDFs

For very large PDFs, consider:

  1. Processing specific page ranges instead of the entire document
  2. Increasing available memory for Python
  3. Extracting text one page at a time (for example with pdfplumber) instead of loading the whole document at once
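
The first suggestion above can be sketched as a generator that splits a document into fixed-size runs of zero-based page indices. page_chunks is a hypothetical helper, and the chunk size of 25 is an arbitrary starting point.

```python
from typing import Iterator, List


def page_chunks(page_count: int, chunk_size: int = 25) -> Iterator[List[int]]:
    """Yield consecutive lists of zero-based page indices, chunk_size at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, page_count, chunk_size):
        yield list(range(start, min(start + chunk_size, page_count)))


for chunk in page_chunks(page_count=60, chunk_size=25):
    print(f"pages {chunk[0]}-{chunk[-1]} ({len(chunk)} pages)")
```

Each chunk could then be passed as the pages argument of a call such as extract_text, so only one slice of the document is processed at a time.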

Architecture

The server uses intelligent fallback mechanisms:

  1. Text Extraction: Automatically detects if a PDF is scanned and suggests OCR
  2. Table Extraction: Tries multiple methods (Camelot → pdfplumber → Tabula) until tables are found
  3. Error Handling: Graceful degradation with informative error messages
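
The fallback order described for table extraction can be sketched as a loop over extractor functions that stops at the first one returning any tables. This is an illustration under assumptions, not the server's implementation. The fake_* functions stand in for the real Camelot, pdfplumber, and Tabula calls, and extract_with_fallback is a hypothetical name.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Table = List[List[str]]
Extractor = Callable[[str], List[Table]]


def extract_with_fallback(
    pdf_path: str,
    extractors: Sequence[Tuple[str, Extractor]],
) -> Tuple[Optional[str], List[Table], List[str]]:
    """Try each extractor in order; return (method used, tables, collected errors)."""
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            tables = extractor(pdf_path)
        except Exception as exc:  # degrade gracefully and remember why
            errors.append(f"{name}: {exc}")
            continue
        if tables:
            return name, tables, errors
    return None, [], errors


# Stand-in extractors so the sketch runs without any PDF libraries installed.
def fake_camelot(path: str) -> List[Table]:
    raise RuntimeError("ghostscript not found")


def fake_pdfplumber(path: str) -> List[Table]:
    return []  # ran fine but found no tables


def fake_tabula(path: str) -> List[Table]:
    return [[["a", "b"], ["1", "2"]]]


method, tables, errors = extract_with_fallback(
    "/path/to/document.pdf",
    [("camelot", fake_camelot), ("pdfplumber", fake_pdfplumber), ("tabula", fake_tabula)],
)
print(method, tables, errors)
```

Collecting the errors instead of raising on the first failure is what allows an informative message when every method fails.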

Performance Tips

  • For large PDFs, process in chunks using page ranges
  • Use method="pymupdf" for fastest text extraction
  • For complex tables, start with method="camelot"
  • Enable preprocessing for better OCR results on poor quality scans

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

MIT License - see LICENSE file for details

Acknowledgments

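This MCP server builds on several PDF processing libraries and tools, including those named in the Features section above: PyMuPDF, pdfplumber, pypdf, Camelot, Tabula, and Tesseract OCR.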
