# CLAUDE.md This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. ## Project Overview MCP Legacy Files is a comprehensive FastMCP server that provides revolutionary vintage document processing capabilities for 25+ legacy formats from the 1980s-2000s computing era. The server transforms inaccessible historical documents into AI-ready intelligence through multi-library fallback chains, intelligent format detection, and advanced AI enhancement pipelines. ## Development Commands ### Environment Setup ```bash # Install with development dependencies uv sync --dev # Install optional system dependencies (Ubuntu/Debian) sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript python3-tk default-jre-headless # For WordPerfect support (libwpd) sudo apt-get install libwpd-dev libwpd-tools # For Mac format support sudo apt-get install libgsf-1-dev libgsf-bin ``` ### Testing ```bash # Run core detection tests (no external dependencies required) uv run python examples/test_detection_only.py # Run comprehensive tests with all dependencies uv run pytest # Run with coverage uv run pytest --cov=mcp_legacy_files # Run specific processor tests uv run pytest tests/test_processors.py::TestDBaseProcessor uv run pytest tests/test_processors.py::TestWordPerfectProcessor # Test specific format detection uv run pytest tests/test_detection.py::TestLegacyFormatDetector::test_wordperfect_detection ``` ### Code Quality ```bash # Format code uv run black src/ tests/ examples/ # Lint code uv run ruff check src/ tests/ examples/ # Type checking uv run mypy src/ ``` ### Running the Server ```bash # Run MCP server directly uv run mcp-legacy-files # Use CLI interface uv run legacy-files-cli detect vintage_file.dbf uv run legacy-files-cli process customer_db.dbf uv run legacy-files-cli formats --list-all # Test with sample legacy files uv run python examples/test_legacy_processing.py /path/to/vintage/files/ ``` ### Building and Distribution ```bash # Build package uv build # Upload to PyPI (requires credentials) uv publish ``` ## Architecture ### Core Components - **`src/mcp_legacy_files/core/server.py`**: Main FastMCP server with 4 comprehensive tools for legacy document processing - **`src/mcp_legacy_files/core/detection.py`**: Advanced multi-layer format detection engine (99.9% accuracy) - **`src/mcp_legacy_files/core/processing.py`**: Processing orchestration and result management - **`src/mcp_legacy_files/processors/`**: Format-specific processors with multi-library fallback chains ### Format Processors 1. **dBASE Processor** (`processors/dbase.py`) - **PRODUCTION READY** ✅ - Multi-library chain: `dbfread` → `simpledbf` → `pandas` → custom parser - Supports dBASE III/IV/5, FoxPro, memo files (.dbt/.fpt) - Comprehensive corruption recovery and business intelligence 2. **WordPerfect Processor** (`processors/wordperfect.py`) - **IN DEVELOPMENT** 🔄 - Primary: `libwpd` system tools → `wpd2text` → `strings` fallback - Supports .wpd, .wp, .wp4, .wp5, .wp6 formats - Document structure preservation and legal document handling 3. **Lotus 1-2-3 Processor** (`processors/lotus123.py`) - **PLANNED** 📋 - Target libraries: `gnumeric` tools → custom binary parser - Supports .wk1, .wk3, .wk4, .wks formats - Formula reconstruction and financial model awareness 4. **AppleWorks Processor** (`processors/appleworks.py`) - **PLANNED** 📋 - Mac-aware processing with resource fork handling - Supports .cwk, .appleworks formats - Cross-platform variant detection ### Intelligent Detection Engine The multi-layer format detection system provides 99.9% accuracy through: - **Magic Byte Analysis**: 8 format families, 20+ variants - **Extension Mapping**: 27 legacy extensions with historical metadata - **Content Structure Heuristics**: Format-specific pattern recognition - **Vintage Authenticity Scoring**: Age-based file assessment ### AI Enhancement Pipeline - **Content Classification**: Document type detection (business/legal/technical) - **Quality Assessment**: Extraction completeness + text coherence scoring - **Historical Context**: Era-appropriate document analysis with business intelligence - **Processing Insights**: Method reliability + performance optimization ## Development Notes ### Implementation Priority Order **Phase 1 (COMPLETED)**: Foundation + dBASE - ✅ Core architecture with FastMCP server - ✅ Multi-layer format detection engine - ✅ Production-ready dBASE processor - ✅ AI enhancement framework - ✅ Testing infrastructure **Phase 2 (CURRENT)**: WordPerfect Implementation - 🔄 WordPerfect processor with libwpd integration - 📋 Document structure preservation - 📋 Legal document handling optimizations **Phase 3**: PC Era Expansion (Lotus 1-2-3, Quattro Pro, WordStar) **Phase 4**: Mac Heritage Collection (AppleWorks, HyperCard, MacWrite) **Phase 5**: Advanced AI Intelligence (ML reconstruction, cross-format analysis) ### Format Support Matrix | **Format Family** | **Status** | **Extensions** | **Business Impact** | |------------------|------------|----------------|-------------------| | **dBASE** | 🟢 Production | `.dbf`, `.db`, `.dbt` | CRITICAL | | **WordPerfect** | 🟡 In Development | `.wpd`, `.wp`, `.wp5`, `.wp6` | CRITICAL | | **Lotus 1-2-3** | ⚪ Planned | `.wk1`, `.wk3`, `.wk4`, `.wks` | HIGH | | **AppleWorks** | ⚪ Planned | `.cwk`, `.appleworks` | MEDIUM | | **HyperCard** | ⚪ Planned | `.hc`, `.stack` | HIGH | ### Testing Strategy - **Core Detection Tests**: No external dependencies, test format detection engine - **Processor Integration Tests**: Test with mocked format libraries - **End-to-End Tests**: Real vintage files with full dependency stack - **Performance Tests**: Large file handling and memory efficiency - **Regression Tests**: Historical accuracy preservation across updates ### Tool Implementation Pattern All format processors follow this architectural pattern: 1. **Format Detection**: Use detection engine for confidence scoring 2. **Multi-Library Fallback**: Try primary → secondary → emergency methods 3. **AI Enhancement**: Apply content classification and quality assessment 4. **Result Packaging**: Return structured ProcessingResult with metadata 5. **Error Recovery**: Comprehensive error handling with troubleshooting hints ### Dependency Management **Core Dependencies** (always required): - `fastmcp>=0.5.0` - FastMCP protocol server - `aiofiles>=23.2.0` - Async file operations - `structlog>=23.2.0` - Structured logging **Format-Specific Dependencies** (optional, graceful fallbacks): - `dbfread>=2.0.7` - dBASE processing (primary method) - `simpledbf>=0.2.6` - dBASE fallback processing - `pandas>=2.0.0` - Data processing and dBASE tertiary method **System Dependencies** (install via package manager): - `libwpd-tools` - WordPerfect document processing - `tesseract-ocr` - OCR for corrupted/scanned documents - `poppler-utils` - PDF conversion utilities - `ghostscript` - PostScript/PDF processing - `libgsf-bin` - Mac format support ### Configuration Environment variables for customization: ```bash # Processing configuration LEGACY_MAX_FILE_SIZE=500MB # Maximum file size to process LEGACY_CACHE_DIR=/tmp/legacy_cache # Cache directory for downloads LEGACY_PROCESSING_TIMEOUT=300 # Timeout in seconds # AI enhancement settings LEGACY_AI_ENHANCEMENT=true # Enable AI processing pipeline LEGACY_AI_MODEL=gpt-3.5-turbo # AI model for enhancement LEGACY_QUALITY_THRESHOLD=0.8 # Minimum quality score # Debug settings DEBUG=false # Enable debug logging LEGACY_PRESERVE_TEMP_FILES=false # Keep temporary files for debugging ``` ### MCP Integration Tools are registered using FastMCP decorators: ```python @app.tool() async def extract_legacy_document( file_path: str = Field(description="Path to legacy document or HTTPS URL"), preserve_formatting: bool = Field(default=True), method: str = Field(default="auto"), enable_ai_enhancement: bool = Field(default=True) ) -> Dict[str, Any]: ``` All tools follow MCP protocol standards for: - Parameter validation and type hints - Structured error responses with troubleshooting - Comprehensive metadata in results - Async processing with progress indicators ### Docker Support The project includes Docker support with pre-installed system dependencies: ```bash # Build Docker image docker build -t mcp-legacy-files . # Run with volume mounts docker run -v /path/to/legacy/files:/data mcp-legacy-files process /data/vintage.dbf # Run MCP server in container docker run -p 8000:8000 mcp-legacy-files server ``` ## Current Development Focus ### WordPerfect Implementation (Phase 2) Currently implementing comprehensive WordPerfect support: 1. **Library Integration**: Using system-level `libwpd-tools` with Python subprocess calls 2. **Format Detection**: Enhanced magic byte detection for WP 4.2, 5.0-5.1, 6.0+ 3. **Document Structure**: Preserving formatting, styles, and document metadata 4. **Fallback Chain**: `wpd2text` → `wpd2html` → `strings` extraction → binary analysis 5. **Legal Document Optimization**: Special handling for legal/government document patterns ### Integration Testing Priority testing scenarios: - **Real-world WPD files** from 1980s-2000s era - **Corrupted document recovery** with partial extraction - **Cross-platform compatibility** (DOS, Windows, Mac variants) - **Large document performance** (500+ page documents) - **Batch processing** of document archives ## Important Development Guidelines ### Code Quality Standards - **Error Handling**: All processors must handle corruption gracefully - **Performance**: < 5 seconds processing for typical files, smart caching - **Compatibility**: Support files from original hardware/OS contexts - **Documentation**: Historical context and business value in all format descriptions ### Historical Accuracy - Preserve original document metadata and timestamps - Maintain era-appropriate processing methods - Document format evolution and variant handling - Respect original creator intent and document purpose ### Business Focus - Prioritize formats with highest business/legal impact - Focus on document types with compliance/discovery value - Ensure enterprise-grade security and validation - Provide actionable business intelligence from vintage data ## Success Metrics - **Format Coverage**: 25+ legacy formats supported - **Processing Accuracy**: >95% successful extraction rate - **Performance**: <5 second average processing time - **Business Impact**: Legal discovery, digital preservation, AI training data - **User Adoption**: Integration with Claude Desktop, enterprise workflows