mcp-pdf-tools/CLAUDE.md
Ryan Malloy ae80388ec4 🎯 Add custom output paths and clean summary for image extraction
Enhance extract_images with user-specified output directories and concise
summary responses to improve user control and reduce context window clutter.

Key Features:
• Custom Output Directory: Users can specify where images are saved
• Clean Summary Output: Concise extraction results instead of verbose metadata
• Automatic Directory Creation: Creates output directories as needed
• File-Level Details: Individual file info with human-readable sizes
• Extraction Summary: Quick overview with total size and file count

New Parameters:
+ output_directory: Optional custom path for saving extracted images
+ Defaults to cache directory if not specified
+ Creates directories automatically with proper permissions

Response Format:
- Removed: Verbose image metadata arrays that fill context windows
+ Added: Clean summary with extraction statistics
+ Added: File list with essential details (filename, path, size, dimensions)
+ Added: Human-readable extraction summary

Benefits:
 User control over image file locations
 Reduced context window pollution
 Essential information without verbosity
 Better integration with user workflows
 Maintains MCP resource compatibility for cached images

Example Response:
{
  "success": true,
  "images_extracted": 3,
  "total_size": "2.4 MB",
  "output_directory": "/path/to/custom/dir",
  "files": [{"filename": "page_1_image_0.png", "path": "/path/...", "size": "800 KB", "dimensions": "1920x1080"}]
}

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-20 13:50:09 -06:00

137 lines
4.5 KiB
Markdown

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
MCP PDF Tools is a FastMCP server that provides comprehensive PDF processing capabilities including text extraction, table extraction, OCR, image extraction, and format conversion. The server is built on the FastMCP framework and provides intelligent method selection with automatic fallbacks.
## Development Commands
### Environment Setup
```bash
# Install with development dependencies
uv sync --dev
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript python3-tk default-jre-headless
```
### Testing
```bash
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=mcp_pdf_tools
# Run specific test file
uv run pytest tests/test_server.py
# Run specific test
uv run pytest tests/test_server.py::TestTextExtraction::test_extract_text_success
```
### Code Quality
```bash
# Format code
uv run black src/ tests/ examples/
# Lint code
uv run ruff check src/ tests/ examples/
# Type checking
uv run mypy src/
```
### Running the Server
```bash
# Run MCP server directly
uv run mcp-pdf-tools
# Verify installation
uv run python examples/verify_installation.py
# Test with sample PDF
uv run python examples/test_pdf_tools.py /path/to/test.pdf
```
### Building and Distribution
```bash
# Build package
uv build
# Upload to PyPI (requires credentials)
uv publish
```
## Architecture
### Core Components
- **`src/mcp_pdf_tools/server.py`**: Main server implementation with all PDF processing tools
- **FastMCP Framework**: Uses FastMCP for MCP protocol implementation
- **Multi-library approach**: Integrates PyMuPDF, pdfplumber, pypdf, Camelot, Tabula, and Tesseract
### Tool Categories
1. **Text Extraction**: `extract_text` - Intelligent method selection (PyMuPDF, pdfplumber, pypdf)
2. **Table Extraction**: `extract_tables` - Auto-fallback through Camelot → pdfplumber → Tabula
3. **OCR Processing**: `ocr_pdf` - Tesseract with preprocessing options
4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata`
5. **Format Conversion**: `pdf_to_markdown` - Clean markdown with MCP resource URIs for images
6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output
### MCP Client-Friendly Design
**Optimized for MCP Context Management:**
- **Custom Output Paths**: `extract_images` allows users to specify where images are saved
- **Clean Summary Output**: Returns concise extraction summary instead of verbose image metadata
- **Resource URIs**: `pdf_to_markdown` uses `pdf-image://{image_id}` protocol for seamless client integration
- **Prevents Context Overflow**: Avoids verbose output that fills client message windows
- **User Control**: Flexible output directory support with automatic directory creation
### Intelligent Fallbacks
The server implements smart fallback mechanisms:
- Text extraction automatically detects scanned PDFs and suggests OCR
- Table extraction tries multiple methods until tables are found
- All operations include comprehensive error handling with helpful hints
### Dependencies Management
Critical system dependencies:
- **Tesseract OCR**: Required for `ocr_pdf` functionality
- **Java**: Required for Tabula table extraction
- **Ghostscript**: Required for Camelot table extraction
- **Poppler**: Required for PDF to image conversion
### Configuration
Environment variables (optional):
- `TESSDATA_PREFIX`: Tesseract language data location
- `PDF_TEMP_DIR`: Temporary file processing directory
- `DEBUG`: Enable debug logging
## Development Notes
### Testing Strategy
- Comprehensive unit tests with mocked PDF libraries
- Test fixtures for consistent PDF document simulation
- Error handling tests for all major failure modes
- Server initialization and tool registration validation
### Tool Implementation Pattern
All tools follow this pattern:
1. Validate PDF path using `validate_pdf_path()`
2. Try primary method with intelligent selection
3. Implement fallbacks where applicable
4. Return structured results with metadata
5. Include timing information and method used
6. Provide helpful error messages with troubleshooting hints
### Docker Support
The project includes Docker support with all system dependencies pre-installed, useful for consistent cross-platform development and deployment.
### MCP Integration
Tools are registered using FastMCP decorators and follow MCP protocol standards for tool descriptions and parameter validation.