Ryan Malloy c902e81e4d Initial commit: Complete MCP PDF Tools server implementation
Features:
- 8 comprehensive PDF processing tools with intelligent fallbacks
- Text extraction (PyMuPDF, pdfplumber, pypdf with auto-selection)
- Table extraction (Camelot → pdfplumber → Tabula fallback chain)
- OCR processing with Tesseract and preprocessing options
- Document analysis (structure, metadata, scanned detection)
- Image extraction with filtering capabilities
- PDF to markdown conversion with metadata
- Built on FastMCP framework with full MCP protocol support
- Comprehensive error handling and user-friendly messages
- Docker support and cross-platform compatibility
- Complete test suite and examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-10 16:36:21 -06:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

MCP PDF Tools is a FastMCP server for PDF processing: text extraction, table extraction, OCR, document analysis, image extraction, and conversion to markdown. Each tool selects an extraction method automatically and falls back to an alternative library when the first choice does not succeed.

Development Commands

Environment Setup

# Install with development dependencies
uv sync --dev

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript python3-tk default-jre-headless

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=mcp_pdf_tools

# Run specific test file
uv run pytest tests/test_server.py

# Run specific test
uv run pytest tests/test_server.py::TestTextExtraction::test_extract_text_success

Code Quality

# Format code
uv run black src/ tests/ examples/

# Lint code
uv run ruff check src/ tests/ examples/

# Type checking
uv run mypy src/

Running the Server

# Run MCP server directly
uv run mcp-pdf-tools

# Verify installation
uv run python examples/verify_installation.py

# Test with sample PDF
uv run python examples/test_pdf_tools.py /path/to/test.pdf

Building and Distribution

# Build package
uv build

# Upload to PyPI (requires credentials)
uv publish

Architecture

Core Components

  • src/mcp_pdf_tools/server.py: Main server implementation with all PDF processing tools
  • FastMCP Framework: Uses FastMCP for MCP protocol implementation
  • Multi-library approach: Integrates PyMuPDF, pdfplumber, pypdf, Camelot, Tabula, and Tesseract

Tool Categories

  1. Text Extraction: extract_text - Intelligent method selection (PyMuPDF, pdfplumber, pypdf)
  2. Table Extraction: extract_tables - Auto-fallback through Camelot → pdfplumber → Tabula
  3. OCR Processing: ocr_pdf - Tesseract with preprocessing options
  4. Document Analysis: is_scanned_pdf, get_document_structure, extract_metadata
  5. Format Conversion: pdf_to_markdown - Clean markdown with optional images
  6. Image Processing: extract_images - Size filtering and format conversion
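
As an illustration of the kind of check a tool such as is_scanned_pdf might perform, here is a minimal sketch of a text-density heuristic. It is not the project's implementation: the function name looks_scanned, the 50-character threshold, and the "more than half of the pages" rule are all assumptions chosen for the example.

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is scanned from the text extracted per page.

    Hypothetical heuristic: a page with fewer than `min_chars_per_page`
    non-whitespace-trimmed characters counts as "sparse", and the document
    is flagged when more than half of its pages are sparse.
    """
    if not page_texts:
        return False  # nothing to judge; do not claim the PDF is scanned
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return sparse_pages / len(page_texts) > 0.5


# Two near-empty pages out of three: flagged as likely scanned.
mostly_empty = looks_scanned(["", "  ", "x" * 200])
# Every page carries plenty of text: not flagged.
text_rich = looks_scanned(["x" * 200] * 3)
```

A result like this would typically be paired with a suggestion to run ocr_pdf rather than treated as a hard failure.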

Intelligent Fallbacks

The server implements smart fallback mechanisms:

  • Text extraction automatically detects scanned PDFs and suggests OCR
  • Table extraction tries multiple methods until tables are found
  • All operations include comprehensive error handling with helpful hints
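
The fallback behaviour described above can be sketched as a loop over extraction backends in priority order. This is a minimal sketch, not the server's code: extract_with_fallback and the three fake_* functions are hypothetical stand-ins so the example runs without Camelot, pdfplumber, or Tabula installed.

```python
from typing import Callable

# Hypothetical backend signature: takes a PDF path, returns a list of tables.
Extractor = Callable[[str], list]


def extract_with_fallback(
    pdf_path: str, methods: list[tuple[str, Extractor]]
) -> dict:
    """Try each (name, extractor) pair in order until one returns tables."""
    errors: dict[str, str] = {}
    for name, extractor in methods:
        try:
            tables = extractor(pdf_path)
        except Exception as exc:
            # A broken backend should not abort the whole chain.
            errors[name] = str(exc)
            continue
        if tables:
            return {"method_used": name, "tables": tables, "errors": errors}
    # Every backend either failed or found nothing.
    return {"method_used": None, "tables": [], "errors": errors}


def fake_camelot(path: str) -> list:
    raise RuntimeError("Ghostscript not found")  # simulated missing dependency


def fake_pdfplumber(path: str) -> list:
    return [[["name", "qty"], ["widget", "2"]]]  # one table with two rows


def fake_tabula(path: str) -> list:
    return []  # simulated "no tables found"


result = extract_with_fallback(
    "report.pdf",
    [("camelot", fake_camelot), ("pdfplumber", fake_pdfplumber), ("tabula", fake_tabula)],
)
```

Recording the errors from skipped backends alongside the successful result keeps the "helpful hints" available even when a later method succeeds.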

Dependencies Management

Critical system dependencies:

  • Tesseract OCR: Required for ocr_pdf functionality
  • Java: Required for Tabula table extraction
  • Ghostscript: Required for Camelot table extraction
  • Poppler: Required for PDF to image conversion
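
A quick way to see which of these dependencies are on the PATH is to look up their executables. The sketch below is illustrative and is not the project's examples/verify_installation.py; the executable names are assumptions (Ghostscript is usually gs on Linux and macOS, and pdftoppm is one of the Poppler utilities) and may differ on other platforms.

```python
import shutil
from typing import Callable, Optional

# Assumed executable names for each dependency; adjust per platform.
SYSTEM_DEPENDENCIES = {
    "Tesseract OCR": "tesseract",
    "Java": "java",
    "Ghostscript": "gs",
    "Poppler": "pdftoppm",
}


def check_system_dependencies(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> dict[str, bool]:
    """Map each dependency label to whether its executable was found.

    `which` is injectable so the check can be exercised without the
    tools actually being installed.
    """
    return {
        label: which(executable) is not None
        for label, executable in SYSTEM_DEPENDENCIES.items()
    }


if __name__ == "__main__":
    for label, found in check_system_dependencies().items():
        print(f"{label}: {'found' if found else 'MISSING'}")
```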

Configuration

Environment variables (optional):

  • TESSDATA_PREFIX: Tesseract language data location
  • PDF_TEMP_DIR: Temporary file processing directory
  • DEBUG: Enable debug logging
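
One way to read these variables in a single place is a small settings object. This is a sketch under stated assumptions: the Settings class, the fallback to the system temp directory, and the set of values treated as "true" for DEBUG are illustrative choices, not behaviour confirmed by the project.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    tessdata_prefix: Optional[str]  # None means "let Tesseract use its default"
    pdf_temp_dir: str
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, all of them optional."""
    return Settings(
        tessdata_prefix=env.get("TESSDATA_PREFIX"),
        # Assumed default: the platform's temp directory.
        pdf_temp_dir=env.get("PDF_TEMP_DIR", tempfile.gettempdir()),
        # Assumed parsing: common truthy spellings enable debug logging.
        debug=env.get("DEBUG", "").strip().lower() in {"1", "true", "yes", "on"},
    )


settings = load_settings({"DEBUG": "true", "PDF_TEMP_DIR": "/tmp/pdf-work"})
```

Passing the mapping explicitly, as in the last line, makes the parsing easy to test without touching the real environment.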

Development Notes

Testing Strategy

  • Comprehensive unit tests with mocked PDF libraries
  • Test fixtures for consistent PDF document simulation
  • Error handling tests for all major failure modes
  • Server initialization and tool registration validation
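
To illustrate what a unit test with a mocked PDF library can look like, the sketch below fakes a document object with unittest.mock instead of opening a real file. The function extract_text_from and the fixture helper make_fake_document are hypothetical; the fake pages imitate a PyMuPDF-style get_text() method, and the real tests in tests/test_server.py may be organised differently.

```python
from unittest.mock import MagicMock


def extract_text_from(document) -> str:
    """Hypothetical function under test: join the text of every page."""
    return "\n".join(page.get_text() for page in document)


def make_fake_document(page_texts: list[str]) -> MagicMock:
    """Fixture helper: a mock document whose pages return canned text."""
    pages = []
    for text in page_texts:
        page = MagicMock()
        page.get_text.return_value = text
        pages.append(page)
    document = MagicMock()
    # MagicMock supports configuring __iter__ with any iterable.
    document.__iter__.return_value = pages
    return document


def test_extract_text_joins_pages():
    document = make_fake_document(["first page", "second page"])
    assert extract_text_from(document) == "first page\nsecond page"
    for page in document:
        page.get_text.assert_called_once()


test_extract_text_joins_pages()
```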

Tool Implementation Pattern

All tools follow this pattern:

  1. Validate PDF path using validate_pdf_path()
  2. Try primary method with intelligent selection
  3. Implement fallbacks where applicable
  4. Return structured results with metadata
  5. Include timing information and method used
  6. Provide helpful error messages with troubleshooting hints
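
The six steps can be sketched as one generic wrapper. This is a simplified illustration rather than the code in server.py: validate_pdf_path here is a stand-in for the project's helper of the same name, and run_tool, the result keys, and the hint text are assumptions made for the example.

```python
import tempfile
import time
from pathlib import Path
from typing import Callable


def validate_pdf_path(pdf_path: str) -> Path:
    """Stand-in for the project's helper; the real one may check more."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a PDF file: {pdf_path}")
    return path


def run_tool(pdf_path: str, methods: dict[str, Callable[[Path], object]]) -> dict:
    """Run `methods` in insertion order and return a structured result."""
    start = time.perf_counter()
    try:
        path = validate_pdf_path(pdf_path)  # step 1
    except (FileNotFoundError, ValueError) as exc:
        return {"error": str(exc), "hint": "Check that the path points to an existing .pdf file."}  # step 6

    failures: dict[str, str] = {}
    for name, method in methods.items():  # steps 2 and 3: primary first, then fallbacks
        try:
            data = method(path)
        except Exception as exc:
            failures[name] = str(exc)
            continue
        return {  # steps 4 and 5: structured result with method and timing
            "data": data,
            "method_used": name,
            "failed_methods": failures,
            "processing_time": round(time.perf_counter() - start, 3),
        }
    return {"error": "All methods failed", "failed_methods": failures,
            "hint": "Verify that the required system dependencies are installed."}  # step 6


def failing_primary(path: Path) -> str:
    raise RuntimeError("primary backend unavailable")


def working_fallback(path: Path) -> str:
    return "extracted text"


# An empty temporary file is enough here because the fake methods never parse it.
with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
    result = run_tool(tmp.name, {"primary": failing_primary, "fallback": working_fallback})
missing = run_tool("does-not-exist.pdf", {})
```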

Docker Support

The project includes Docker support with all system dependencies pre-installed, which gives a consistent environment for development and deployment across platforms.

MCP Integration

Tools are registered using FastMCP decorators and follow MCP protocol standards for tool descriptions and parameter validation.
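
The general idea of decorator-based registration can be shown with the standard library alone. The sketch below is deliberately not FastMCP's API: ToolRegistry is a hypothetical stand-in that records a function's docstring as its description and derives parameter information from its signature, which is the kind of work a framework decorator does on the developer's behalf. The sample_pages parameter and the placeholder body are likewise invented for the example.

```python
import inspect
from typing import Callable


class ToolRegistry:
    """Hypothetical stand-in for a framework's tool registry."""

    def __init__(self) -> None:
        self.tools: dict[str, dict] = {}

    def tool(self, func: Callable) -> Callable:
        """Register `func` under its own name and return it unchanged."""
        signature = inspect.signature(func)
        self.tools[func.__name__] = {
            "description": inspect.getdoc(func) or "",
            "parameters": {
                name: {
                    "type": getattr(param.annotation, "__name__", str(param.annotation)),
                    "required": param.default is inspect.Parameter.empty,
                }
                for name, param in signature.parameters.items()
            },
            "handler": func,
        }
        return func


registry = ToolRegistry()


@registry.tool
def is_scanned_pdf(pdf_path: str, sample_pages: int = 3) -> bool:
    """Report whether a PDF appears to be scanned (placeholder body)."""
    return False


spec = registry.tools["is_scanned_pdf"]
```

Because the description and parameter schema come from the function itself, keeping docstrings and type hints accurate is what keeps the published tool description accurate.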