Ryan Malloy b681cb030b Initial commit: MCP Office Tools v0.1.0
- Comprehensive Microsoft Office document processing server
- Support for Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), CSV
- 6 universal tools: extract_text, extract_images, extract_metadata, detect_office_format, analyze_document_health, get_supported_formats
- Multi-library fallback system for robust processing
- URL support with intelligent caching
- Legacy Office format support (97-2003)
- FastMCP integration with async architecture
- Production ready with comprehensive documentation

🤖 Generated with Claude Code (claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 01:01:48 -06:00


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the MCP Office Tools codebase.

Project Overview

MCP Office Tools is a FastMCP server for processing Microsoft Office documents: text extraction, image extraction, metadata extraction, and format detection. It supports Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), and CSV files, selecting a processing method per format and falling back automatically when a method fails.

Development Commands

Environment Setup

# Install with development dependencies
uv sync --dev

# Install system dependencies if needed
# (Most dependencies are Python-only)

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=mcp_office_tools

# Run specific test file
uv run pytest tests/test_server.py

# Run specific test
uv run pytest tests/test_server.py::TestTextExtraction::test_extract_text_success

Code Quality

# Format code
uv run black src/ tests/ examples/

# Lint code
uv run ruff check src/ tests/ examples/

# Type checking
uv run mypy src/

Running the Server

# Run MCP server directly
uv run mcp-office-tools

# Run with Python module
uv run python -m mcp_office_tools.server

# Test with sample documents
uv run python examples/test_office_tools.py /path/to/test.docx

Building and Distribution

# Build package
uv build

# Upload to PyPI (requires credentials)
uv publish

Architecture

Core Components

  • src/mcp_office_tools/server.py: Main server implementation with all Office processing tools
  • src/mcp_office_tools/utils/: Utility modules for validation, caching, and file detection
  • FastMCP Framework: Uses FastMCP for MCP protocol implementation
  • Multi-library approach: Integrates python-docx, openpyxl, python-pptx, pandas, and legacy format handlers

Tool Categories

  1. Universal Tools: Work across all Office formats

    • extract_text - Intelligent text extraction
    • extract_images - Image extraction with filtering
    • extract_metadata - Document metadata extraction
    • detect_office_format - Format detection and analysis
    • analyze_document_health - Document integrity checking
    • get_supported_formats - Supported format listing

  2. Format-Specific Processing: Specialized handlers for Word, Excel, PowerPoint

  3. Legacy Format Support: OLE Compound Document processing for .doc, .xls, .ppt

  4. URL Processing: Direct URL document processing with caching
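One way to tell modern and legacy containers apart is to look at the leading bytes of the file. The sketch below is illustrative only and is not taken from server.py: sniff_office_container is a hypothetical helper name. The two signatures are the standard ones: OLE Compound Documents start with D0 CF 11 E0 A1 B1 1A E1, and OOXML files are ZIP archives that contain [Content_Types].xml.

```python
import io
import zipfile

# Well-known file signatures.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE Compound Document (.doc, .xls, .ppt)
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP container (.docx, .xlsx, .pptx)


def sniff_office_container(data: bytes) -> str:
    """Classify raw bytes as 'ole', 'ooxml', 'zip', or 'unknown'.

    Hypothetical helper for illustration; the real detection logic may differ.
    """
    if data.startswith(OLE_MAGIC):
        return "ole"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                # Every OOXML package carries this part at the archive root.
                if "[Content_Types].xml" in archive.namelist():
                    return "ooxml"
        except zipfile.BadZipFile:
            # ZIP signature present but the archive is truncated or corrupt.
            return "unknown"
        return "zip"
    return "unknown"
```

This only identifies the container. Distinguishing Word from Excel from PowerPoint needs a further look inside the archive or the OLE streams, and CSV has no signature at all, so it falls through to 'unknown' here.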

Intelligent Fallbacks

The server implements smart fallback mechanisms:

  • Text extraction uses multiple libraries in order of preference
  • Automatic format detection determines best processing method
  • Legacy format support with graceful degradation
  • Comprehensive error handling with helpful diagnostics
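The "in order of preference" behaviour can be sketched as a loop over candidate extractors. This is a minimal sketch under assumptions, not the server's actual code: extract_with_fallbacks, the result keys, and the choice to raise RuntimeError when every method fails are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Extractor = Callable[[str], str]


def extract_with_fallbacks(path: str, extractors: Sequence[Tuple[str, Extractor]]) -> dict:
    """Try each (name, extractor) pair in order and return the first success.

    Hypothetical helper: names and result keys are illustrative only.
    """
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # broad on purpose: any backend failure moves on
            errors.append(f"{name}: {exc}")
            continue
        return {"text": text, "method_used": name, "failed_methods": errors}
    # Nothing worked: surface every failure so the caller can diagnose it.
    raise RuntimeError("All extraction methods failed: " + "; ".join(errors))
```

A caller would pass the preferred library first, for example a python-docx based extractor followed by a mammoth based one for .docx files. Recording the failed attempts alongside the result keeps the diagnostics useful even when a fallback succeeds.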

Dependencies Management

Core dependencies:

  • python-docx: Modern Word document processing
  • openpyxl: Excel XLSX file processing
  • python-pptx: PowerPoint PPTX processing
  • pandas: CSV and data analysis
  • xlrd: Legacy Excel XLS support
  • olefile: Legacy OLE Compound Document support
  • Pillow: Image processing
  • aiohttp/aiofiles: Async file and URL handling

Optional dependencies:

  • msoffcrypto-tool: Encrypted file detection
  • mammoth: Enhanced Word to HTML/Markdown conversion

Configuration

Environment variables:

  • OFFICE_TEMP_DIR: Temporary file processing directory
  • DEBUG: Enable debug logging and detailed error reporting
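A minimal sketch of reading these two variables follows. The defaults are assumptions, not documented behaviour: this example falls back to the system temporary directory when OFFICE_TEMP_DIR is unset, and treats "1", "true", "yes", and "on" as enabling DEBUG. load_settings is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path

# Assumed set of truthy spellings; the server may parse DEBUG differently.
_TRUTHY = {"1", "true", "yes", "on"}


def load_settings(env=os.environ) -> dict:
    """Read OFFICE_TEMP_DIR and DEBUG from a mapping (defaults are assumptions)."""
    temp_dir = Path(env.get("OFFICE_TEMP_DIR") or tempfile.gettempdir())
    debug = env.get("DEBUG", "").strip().lower() in _TRUTHY
    return {"temp_dir": temp_dir, "debug": debug}
```

Passing the environment as a parameter makes the function easy to exercise in tests without mutating os.environ.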

Development Notes

Testing Strategy

  • Unit tests for each tool with mocked Office libraries
  • Test fixtures for consistent document simulation
  • Error handling tests for all major failure modes
  • Format detection and validation testing
  • URL processing and caching tests

Tool Implementation Pattern

All tools follow this pattern:

  1. Validate and resolve file path (including URL downloads)
  2. Detect format and validate document integrity
  3. Try primary method with intelligent selection based on format
  4. Implement fallbacks where applicable
  5. Return structured results with metadata
  6. Include timing information and method used
  7. Provide helpful error messages with troubleshooting hints
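The steps above can be sketched as a small async wrapper. This is an illustration under assumptions rather than the code in server.py: run_tool, the process callback, and the result keys are hypothetical, URL download is omitted, and step 2 is reduced to an extension check.

```python
import asyncio
import time
from pathlib import Path

SUPPORTED = {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt", ".csv"}


async def run_tool(file_path: str, process) -> dict:
    """Skeleton of the tool pattern (hypothetical names and result shape).

    `process` is an async callable (path, suffix) -> (payload, method_name)
    that stands in for the primary method plus its fallbacks.
    """
    started = time.perf_counter()
    path = Path(file_path)                      # step 1 (URL handling omitted)
    suffix = path.suffix.lower()                # step 2 (extension check only)
    if suffix not in SUPPORTED:
        return {
            "success": False,
            "error": f"Unsupported format '{suffix}'",
            "hint": "Call get_supported_formats to list accepted extensions.",
        }
    try:
        payload, method = await process(path, suffix)   # steps 3 and 4
    except Exception as exc:
        return {                                        # step 7
            "success": False,
            "error": str(exc),
            "hint": "Check that the file is not corrupt or encrypted.",
        }
    return {                                            # steps 5 and 6
        "success": True,
        "result": payload,
        "metadata": {
            "format": suffix,
            "method_used": method,
            "processing_time": round(time.perf_counter() - started, 3),
        },
    }
```

Returning errors as structured dictionaries is one design choice; raising an exception and letting the framework report it is an equally valid alternative.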

Format Support Matrix

  • Modern formats (.docx, .xlsx, .pptx): Full feature support
  • Legacy formats (.doc, .xls, .ppt): Basic extraction with graceful degradation
  • CSV files: Specialized pandas-based processing
  • Template files (.dotx, .xltx, .potx): Processed the same way as regular documents

URL and Caching Support

  • HTTPS URL processing with validation
  • Intelligent caching system (1-hour default)
  • Temporary file management with automatic cleanup
  • Security headers and content validation
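A time-based cache of this kind can be sketched with file modification times. The sketch assumes the "1-hour default" is an expiry window, which this document does not state explicitly. The helper names, the SHA-256 file naming, and the HTTPS-only check are assumptions, not the contents of utils/caching.py.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional
from urllib.parse import urlsplit

CACHE_TTL_SECONDS = 3600  # assumed meaning of the 1-hour default


def is_https_url(url: str) -> bool:
    """Accept only https:// URLs that name a host."""
    parts = urlsplit(url)
    return parts.scheme == "https" and bool(parts.netloc)


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / digest


def is_fresh(path: Path, ttl: float = CACHE_TTL_SECONDS, now: Optional[float] = None) -> bool:
    """True if the cached file exists and is younger than ttl seconds."""
    if not path.exists():
        return False
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) < ttl
```

A download routine would call is_https_url first, then reuse the file at cache_path_for(url, cache_dir) when is_fresh returns True and fetch it again otherwise.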

MCP Integration

Tools are registered using FastMCP decorators and follow MCP protocol standards for:

  • Tool descriptions and parameter validation
  • Structured result formatting
  • Error handling and reporting
  • Async operation patterns

Error Handling

  • Custom OfficeFileError exception for Office-specific errors
  • Comprehensive validation before processing
  • Helpful error messages with processing hints
  • Graceful degradation for unsupported features
  • Debug mode for detailed troubleshooting
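The sketch below shows one shape a custom exception with a processing hint could take. Only the class name OfficeFileError comes from this document. The hint attribute, the describe_error helper, and the report keys are assumptions, and the real class may carry different fields.

```python
import traceback


class OfficeFileError(Exception):
    """Office-specific error carrying an optional hint (hint field is assumed)."""

    def __init__(self, message: str, hint: str = "") -> None:
        super().__init__(message)
        self.hint = hint


def describe_error(exc: Exception, debug: bool = False) -> dict:
    """Turn an exception into a structured report (hypothetical helper)."""
    info = {"error": str(exc), "error_type": type(exc).__name__}
    hint = getattr(exc, "hint", "")
    if hint:
        info["hint"] = hint
    if debug:
        # Full traceback only in debug mode, to keep normal output short.
        info["traceback"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    return info
```

For example, a handler could raise OfficeFileError("File is password protected", hint="Remove the password and retry") and let describe_error build the response returned to the client.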

Project Structure

mcp-office-tools/
├── src/mcp_office_tools/
│   ├── __init__.py           # Package initialization
│   ├── server.py             # Main FastMCP server with tools
│   ├── utils/                # Utility modules
│   │   ├── __init__.py       # Utils package
│   │   ├── validation.py     # File validation and format detection
│   │   ├── file_detection.py # Advanced format analysis
│   │   └── caching.py        # URL caching system
│   ├── word/                 # Word-specific processors (future)
│   ├── excel/                # Excel-specific processors (future)
│   └── powerpoint/           # PowerPoint-specific processors (future)
├── tests/                    # Test suite
├── examples/                 # Usage examples
├── docs/                     # Documentation
├── pyproject.toml            # Project configuration
├── README.md                 # Project documentation
├── LICENSE                   # MIT license
└── CLAUDE.md                 # This file

Implementation Status

Phase 1: Foundation (Complete)

  • Project structure setup with FastMCP
  • Universal tools: extract_text, extract_images, extract_metadata
  • Format detection and validation
  • URL processing with caching
  • Basic Word, Excel, PowerPoint support

Phase 2: Enhancement (In Progress)

  • Advanced Word document tools (tables, comments, structure)
  • Excel-specific tools (formulas, charts, data analysis)
  • PowerPoint tools (slides, speaker notes, animations)
  • Legacy format optimization

Phase 3: Advanced Features (Planned)

  • Document manipulation tools (merge, split, convert)
  • Cross-format comparison and analysis
  • Batch processing capabilities
  • Enhanced metadata extraction

Testing Approach

The project uses pytest with:

  • Async test support via pytest-asyncio
  • Coverage reporting with pytest-cov
  • Mock Office documents for consistent testing
  • Parameterized tests for multiple format support
  • Integration tests with real Office files

Relationship to MCP PDF Tools

MCP Office Tools is designed as a companion to MCP PDF Tools:

  • Consistent API design patterns
  • Similar caching and URL handling
  • Parallel tool organization
  • Compatible error handling approaches
  • Complementary document processing capabilities