Ryan Malloy 95596e0236 ✨ Add comprehensive PDF form creation and validation tools

- Add complete PDF form lifecycle management
- Create new forms with text, checkbox, dropdown, signature fields
- Fill existing forms with JSON data and optional flattening
- Add fields to existing PDFs with flexible positioning
- Advanced field types: radio groups, textareas, date fields
- Comprehensive validation engine with regex patterns
- Email, phone, number, date format validation
- Required field checking and length constraints
- Visual validation cues with asterisks and format hints
- Multi-field error reporting with detailed feedback
- International character support and edge case handling
- Enterprise-ready for complex business forms

2025-09-03 02:33:01 -06:00


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

MCP PDF Tools is a FastMCP server for PDF processing: text extraction, table extraction, OCR, image extraction, form handling, and format conversion. Where several libraries can do the same job, the server selects a method automatically and falls back to another if the first one fails.

Development Commands

Environment Setup

# Install with development dependencies
uv sync --dev

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript python3-tk default-jre-headless

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=mcp_pdf_tools

# Run specific test file
uv run pytest tests/test_server.py

# Run specific test
uv run pytest tests/test_server.py::TestTextExtraction::test_extract_text_success

Code Quality

# Format code
uv run black src/ tests/ examples/

# Lint code
uv run ruff check src/ tests/ examples/

# Type checking
uv run mypy src/

Running the Server

# Run MCP server directly
uv run mcp-pdf-tools

# Verify installation
uv run python examples/verify_installation.py

# Test with sample PDF
uv run python examples/test_pdf_tools.py /path/to/test.pdf

Building and Distribution

# Build package
uv build

# Upload to PyPI (requires credentials)
uv publish

Architecture

Core Components

src/mcp_pdf_tools/server.py: Main server implementation with all PDF processing tools
FastMCP Framework: Uses FastMCP for MCP protocol implementation
Multi-library approach: Integrates PyMuPDF, pdfplumber, pypdf, Camelot, Tabula, and Tesseract

Tool Categories

Text Extraction: extract_text - Intelligent method selection (PyMuPDF, pdfplumber, pypdf)
Table Extraction: extract_tables - Auto-fallback through Camelot → pdfplumber → Tabula
OCR Processing: ocr_pdf - Tesseract with preprocessing options
Document Analysis: is_scanned_pdf, get_document_structure, extract_metadata
Format Conversion: pdf_to_markdown - Clean markdown with MCP resource URIs for images
Image Processing: extract_images - Extract images with custom output paths and clean summary output
PDF Forms: extract_form_data, create_form_pdf, fill_form_pdf, add_form_fields - Complete form lifecycle management

MCP Client-Friendly Design

Optimized for MCP Context Management:

Custom Output Paths: extract_images allows users to specify where images are saved
Clean Summary Output: Returns concise extraction summary instead of verbose image metadata
Resource URIs: pdf_to_markdown uses pdf-image://{image_id} protocol for seamless client integration
Prevents Context Overflow: Avoids verbose output that fills client message windows
User Control: Flexible output directory support with automatic directory creation
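
The design points above can be sketched roughly as follows. This is an illustration only, not the server's code: the function names, the record keys (filename, size_bytes), and the image ID format are all assumptions; only the pdf-image://{image_id} URI form comes from this document.

```python
import tempfile
from pathlib import Path


def ensure_output_dir(output_dir: str) -> Path:
    """Create the user-chosen output directory (and parents) if missing."""
    path = Path(output_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path


def summarize_images(images: list[dict], output_dir: Path, preview: int = 3) -> dict:
    """Return a compact summary instead of one metadata record per image."""
    return {
        "images_extracted": len(images),
        "output_directory": str(output_dir),
        "total_size_bytes": sum(img["size_bytes"] for img in images),
        "sample_files": [img["filename"] for img in images[:preview]],
    }


def image_resource_uri(image_id: str) -> str:
    """Build a reference in the pdf-image://{image_id} form."""
    return f"pdf-image://{image_id}"


# Usage with made-up records; the keys and image IDs are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_output_dir(str(Path(tmp) / "report" / "images"))
    dir_created = out_dir.is_dir()
    images = [
        {"filename": f"page1_img{i}.png", "size_bytes": 1000 * (i + 1)}
        for i in range(5)
    ]
    summary = summarize_images(images, out_dir)

print(summary["images_extracted"], summary["sample_files"])
print(image_resource_uri("page1_img0"))
```

The summary stays the same size no matter how many images are extracted, which is the property that keeps client context small.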

Intelligent Fallbacks

The server implements smart fallback mechanisms:

Text extraction automatically detects scanned PDFs and suggests OCR
Table extraction tries multiple methods until tables are found
All operations include comprehensive error handling with helpful hints
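
As a rough illustration of the first point, one common heuristic is to treat a document as scanned when most pages yield almost no extractable text. The sketch below assumes that heuristic; the threshold, the function names, and the result keys are hypothetical, and the server's actual detection logic may differ.

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Assumed heuristic: scanned if more than half the pages are nearly empty."""
    if not page_texts:
        return False
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return sparse_pages > len(page_texts) / 2


def build_text_result(page_texts: list[str]) -> dict:
    """Attach an OCR hint when the text layer looks missing."""
    result = {
        "text": "\n\n".join(page_texts),
        "page_count": len(page_texts),
        "is_scanned": looks_scanned(page_texts),
    }
    if result["is_scanned"]:
        result["hint"] = "Little or no text layer detected; consider running ocr_pdf."
    return result


# Made-up page texts standing in for real extraction output.
digital = build_text_result([
    "Quarterly report for 2024, section one.",
    "Revenue grew in all regions this year.",
])
scanned = build_text_result(["", " \n", "3"])
print(digital["is_scanned"], scanned["is_scanned"])
```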

Dependencies Management

Critical system dependencies:

Tesseract OCR: Required for ocr_pdf functionality
Java: Required for Tabula table extraction
Ghostscript: Required for Camelot table extraction
Poppler: Required for PDF to image conversion
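
A quick way to check whether these dependencies are on the PATH is sketched below. It is not the repository's examples/verify_installation.py; the executable names (tesseract, java, gs, pdftoppm) are assumptions based on common Linux packaging and may differ on other platforms.

```python
import shutil

# Assumed executable names for each dependency listed above.
SYSTEM_DEPENDENCIES = {
    "Tesseract OCR": "tesseract",
    "Java": "java",
    "Ghostscript": "gs",
    "Poppler": "pdftoppm",
}


def check_system_dependencies(which=shutil.which) -> dict:
    """Map each dependency label to True if its executable is found on PATH."""
    return {
        label: which(executable) is not None
        for label, executable in SYSTEM_DEPENDENCIES.items()
    }


status = check_system_dependencies()
for label, found in status.items():
    print(f"{label}: {'found' if found else 'MISSING'}")
```

The which parameter exists only so the lookup can be replaced in tests.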

Configuration

Environment variables (optional):

TESSDATA_PREFIX: Tesseract language data location
PDF_TEMP_DIR: Temporary file processing directory
DEBUG: Enable debug logging
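
A minimal sketch of reading these variables is shown below. The Settings class, the load_settings helper, the fallback to the system temp directory, and the set of values treated as "true" for DEBUG are all assumptions made for illustration; the server may interpret the variables differently.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    tessdata_prefix: Optional[str]
    temp_dir: str
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the optional variables, applying assumed defaults when unset."""
    debug_raw = env.get("DEBUG", "").strip().lower()
    return Settings(
        tessdata_prefix=env.get("TESSDATA_PREFIX") or None,
        temp_dir=env.get("PDF_TEMP_DIR") or tempfile.gettempdir(),
        debug=debug_raw in {"1", "true", "yes", "on"},
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
custom = load_settings({"DEBUG": "true", "PDF_TEMP_DIR": "/tmp/pdf-work"})
defaults = load_settings({})
print(custom)
```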

Development Notes

Testing Strategy

Comprehensive unit tests with mocked PDF libraries
Test fixtures for consistent PDF document simulation
Error handling tests for all major failure modes
Server initialization and tool registration validation

Tool Implementation Pattern

All tools follow this pattern:

Validate PDF path using validate_pdf_path()
Try primary method with intelligent selection
Implement fallbacks where applicable
Return structured results with metadata
Include timing information and method used
Provide helpful error messages with troubleshooting hints
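
The steps above can be sketched as a small wrapper. This is a hedged illustration, not the server's implementation: run_tool, the result keys, and the hint texts are hypothetical, and validate_pdf_path here is a stand-in whose checks may differ from the real helper.

```python
import tempfile
import time
from pathlib import Path


def validate_pdf_path(pdf_path: str) -> Path:
    """Stand-in for the server's validate_pdf_path(); the real checks may differ."""
    path = Path(pdf_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a PDF file: {pdf_path}")
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {pdf_path}")
    return path


def run_tool(pdf_path: str, methods: list) -> dict:
    """Validate, try methods in order, and report the method used and timing."""
    start = time.perf_counter()
    try:
        path = validate_pdf_path(pdf_path)
    except (ValueError, FileNotFoundError) as exc:
        return {"success": False, "error": str(exc),
                "hint": "Check that the path points to an existing .pdf file."}

    failed = {}
    for name, method in methods:  # primary method first, then fallbacks
        try:
            data = method(path)
        except Exception as exc:  # record the failure and try the next method
            failed[name] = str(exc)
            continue
        return {"success": True, "method_used": name, "data": data,
                "failed_methods": failed,
                "elapsed_seconds": round(time.perf_counter() - start, 4)}

    return {"success": False, "error": "All methods failed",
            "failed_methods": failed,
            "hint": "Check that the required system dependencies are installed.",
            "elapsed_seconds": round(time.perf_counter() - start, 4)}


# Stand-in methods for illustration; real ones would call a PDF library.
def primary(path):
    raise RuntimeError("primary backend unavailable")


def fallback(path):
    return f"text from {path.name}"


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.4\n")
    ok = run_tool(str(sample), [("primary", primary), ("fallback", fallback)])

bad = run_tool("notes.txt", [("primary", primary)])
print(ok["method_used"], bad["error"])
```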

PDF Form Tools

The server provides comprehensive PDF form capabilities:

Form Creation (create_form_pdf):

Create new interactive PDF forms from scratch
Support for text fields, checkboxes, dropdowns, and signature fields
Automatic field positioning with customizable layouts
Multiple page size options (A4, Letter, Legal)

Form Filling (fill_form_pdf):

Fill existing PDF forms with JSON data
Intelligent field type handling (text, checkbox, dropdown)
Optional form flattening (make fields non-editable)
Comprehensive error reporting for failed field fills
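
Before passing JSON data to fill_form_pdf, a caller might validate it and collect one error per field. The sketch below assumes a simple rule format (required, max_length, pattern); the rule schema, the field names, and the regular expressions are hypothetical and are not the server's validation engine.

```python
import json
import re

# Hypothetical per-field rules for illustration.
FIELD_RULES = {
    "full_name": {"required": True, "max_length": 50},
    "email": {"required": True, "pattern": r"[^@\s]+@[^@\s]+\.[^@\s]+"},
    "phone": {"required": False, "pattern": r"\+?[0-9][0-9\- ]{6,14}"},
}


def validate_form_data(data: dict, rules: dict) -> dict:
    """Return a {field: message} dict covering every field that fails a rule."""
    errors = {}
    for field, rule in rules.items():
        raw = data.get(field)
        value = "" if raw is None else str(raw).strip()
        if not value:
            if rule.get("required"):
                errors[field] = "required field is empty"
            continue
        max_length = rule.get("max_length")
        if max_length is not None and len(value) > max_length:
            errors[field] = f"longer than {max_length} characters"
            continue
        pattern = rule.get("pattern")
        if pattern and not re.fullmatch(pattern, value):
            errors[field] = "does not match the expected format"
    return errors


form_json = '{"full_name": "Zoë Müller", "email": "not-an-email", "phone": ""}'
errors = validate_form_data(json.loads(form_json), FIELD_RULES)
print(errors)
```

Collecting errors in a dict rather than raising on the first failure lets the caller report every problem field in one pass.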

Form Enhancement (add_form_fields):

Add interactive fields to existing PDFs
Preserve original document content and formatting
Support for multi-page field placement
Flexible field positioning and styling

Form Extraction (extract_form_data):

Extract all form fields and their current values
Identify field types and constraints
Validate forms and analyze their structure

Docker Support

The project includes Docker support with all system dependencies pre-installed, useful for consistent cross-platform development and deployment.

MCP Integration

Tools are registered using FastMCP decorators and follow MCP protocol standards for tool descriptions and parameter validation.
