- Comprehensive Microsoft Office document processing server
- Support for Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), CSV
- 6 universal tools: extract_text, extract_images, extract_metadata, detect_office_format, analyze_document_health, get_supported_formats
- Multi-library fallback system for robust processing
- URL support with intelligent caching
- Legacy Office format support (97-2003)
- FastMCP integration with async architecture
- Production ready with comprehensive documentation
🤖 Generated with Claude Code (claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with the MCP Office Tools codebase.
Project Overview
MCP Office Tools is a FastMCP server for processing Microsoft Office documents: text extraction, image extraction, metadata extraction, and format detection. It supports Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), and CSV files, selecting an extraction method per format and falling back automatically when a method fails.
Development Commands
Environment Setup
# Install with development dependencies
uv sync --dev
# Install system dependencies if needed
# (Most dependencies are Python-only)
Testing
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=mcp_office_tools
# Run specific test file
uv run pytest tests/test_server.py
# Run specific test
uv run pytest tests/test_server.py::TestTextExtraction::test_extract_text_success
Code Quality
# Format code
uv run black src/ tests/ examples/
# Lint code
uv run ruff check src/ tests/ examples/
# Type checking
uv run mypy src/
Running the Server
# Run MCP server directly
uv run mcp-office-tools
# Run with Python module
uv run python -m mcp_office_tools.server
# Test with sample documents
uv run python examples/test_office_tools.py /path/to/test.docx
Building and Distribution
# Build package
uv build
# Upload to PyPI (requires credentials)
uv publish
Architecture
Core Components
- src/mcp_office_tools/server.py: Main server implementation with all Office processing tools
- src/mcp_office_tools/utils/: Utility modules for validation, caching, and file detection
- FastMCP Framework: Uses FastMCP for MCP protocol implementation
- Multi-library approach: Integrates python-docx, openpyxl, python-pptx, pandas, and legacy format handlers
Tool Categories
- Universal Tools: Work across all Office formats
  - extract_text: Intelligent text extraction
  - extract_images: Image extraction with filtering
  - extract_metadata: Document metadata extraction
  - detect_office_format: Format detection and analysis
  - analyze_document_health: Document integrity checking
- Format-Specific Processing: Specialized handlers for Word, Excel, PowerPoint
- Legacy Format Support: OLE Compound Document processing for .doc, .xls, .ppt
- URL Processing: Direct URL document processing with caching
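The split between modern and legacy formats can be illustrated with a signature check. This is a minimal sketch, not the server's actual detection code: the function name `sniff_office_container` and its return labels are hypothetical. It relies on two well-known facts: OLE Compound Documents begin with the bytes D0 CF 11 E0 A1 B1 1A E1, and modern Office files are ZIP archives whose parts live under word/, xl/, or ppt/.

```python
import io
import zipfile

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE Compound Document header
ZIP_MAGIC = b"PK\x03\x04"                      # local file header of a ZIP archive

# Top-level folder inside the ZIP -> format label (labels are illustrative).
OOXML_DIRS = {"word/": "docx", "xl/": "xlsx", "ppt/": "pptx"}


def sniff_office_container(data: bytes) -> str:
    """Classify raw bytes as 'ole-legacy', 'docx', 'xlsx', 'pptx', or 'unknown'."""
    if data.startswith(OLE_MAGIC):
        return "ole-legacy"  # .doc, .xls, and .ppt share this container
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            return "unknown"
        for prefix, label in OOXML_DIRS.items():
            if any(name.startswith(prefix) for name in names):
                return label
    return "unknown"
```

A signature check like this cannot tell .doc from .xls or .ppt, since all three share the OLE container; separating them would require inspecting the streams inside, for example with olefile.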
Intelligent Fallbacks
The server implements smart fallback mechanisms:
- Text extraction uses multiple libraries in order of preference
- Automatic format detection determines best processing method
- Legacy format support with graceful degradation
- Comprehensive error handling with helpful diagnostics
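The ordered-fallback idea above can be sketched as follows. This is an illustration under assumptions, not the server's code: `extract_with_fallbacks`, the `(name, extractor)` pairs, and the result keys are all hypothetical names chosen for the example.

```python
from typing import Callable, Sequence

Extractor = Callable[[str], str]


def extract_with_fallbacks(path: str, extractors: Sequence[tuple[str, Extractor]]) -> dict:
    """Try each (name, extractor) pair in order and return the first success."""
    errors: list[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any failure moves on to the next method
            errors.append(f"{name}: {exc}")
            continue
        return {"text": text, "method_used": name, "failed_methods": errors}
    # Every method failed: surface all collected diagnostics at once.
    raise RuntimeError("All extraction methods failed: " + "; ".join(errors))
```

Keeping the errors from earlier attempts in the result means a caller can see that a fallback was used and why, which fits the diagnostics goal listed above.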
Dependencies Management
Core dependencies:
- python-docx: Modern Word document processing
- openpyxl: Excel XLSX file processing
- python-pptx: PowerPoint PPTX processing
- pandas: CSV and data analysis
- xlrd: Legacy Excel XLS support
- olefile: Legacy OLE Compound Document support
- Pillow: Image processing
- aiohttp/aiofiles: Async file and URL handling
Optional dependencies:
- msoffcrypto-tool: Encrypted file detection
- mammoth: Enhanced Word to HTML/Markdown conversion
Configuration
Environment variables:
- OFFICE_TEMP_DIR: Temporary file processing directory
- DEBUG: Enable debug logging and detailed error reporting
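One way these two variables might be read at startup is sketched below. The helper name `load_settings`, the fallback to the system temp directory, and the set of strings treated as "on" are assumptions made for the example; the server may parse them differently.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read OFFICE_TEMP_DIR and DEBUG, falling back to assumed defaults."""
    # Assumption: an unset or empty OFFICE_TEMP_DIR means the system temp dir.
    temp_dir = Path(env.get("OFFICE_TEMP_DIR") or tempfile.gettempdir())
    # Assumption: these strings count as "on".
    debug = env.get("DEBUG", "").strip().lower() in {"1", "true", "yes", "on"}
    return {"temp_dir": temp_dir, "debug": debug}
```

Passing the environment as a parameter keeps the function easy to exercise in tests without touching the real process environment.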
Development Notes
Testing Strategy
- Unit tests for each tool with mocked Office libraries
- Test fixtures for consistent document simulation
- Error handling tests for all major failure modes
- Format detection and validation testing
- URL processing and caching tests
Tool Implementation Pattern
All tools follow this pattern:
- Validate and resolve file path (including URL downloads)
- Detect format and validate document integrity
- Try primary method with intelligent selection based on format
- Implement fallbacks where applicable
- Return structured results with metadata
- Include timing information and method used
- Provide helpful error messages with troubleshooting hints
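The "timing information and method used" part of this pattern could be factored into a small wrapper around each async tool. The sketch below is hypothetical: `with_timing`, `extract_text_demo`, and the result keys are illustrative names, and the tool body is a placeholder rather than real extraction logic.

```python
import asyncio
import functools
import time


def with_timing(tool):
    """Wrap an async tool so its dict result carries the elapsed time."""
    @functools.wraps(tool)
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = await tool(*args, **kwargs)
        result["processing_time"] = round(time.perf_counter() - start, 4)
        return result
    return wrapper


@with_timing
async def extract_text_demo(file_path: str) -> dict:
    # Placeholder body: a real tool would validate the path, detect the
    # format, and run the primary or fallback extraction method here.
    await asyncio.sleep(0)
    return {"text": "", "file": file_path, "method_used": "demo"}
```

Calling `asyncio.run(extract_text_demo("report.docx"))` returns the tool's dict with a `processing_time` key added in seconds.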
Format Support Matrix
- Modern formats (.docx, .xlsx, .pptx): Full feature support
- Legacy formats (.doc, .xls, .ppt): Basic extraction with graceful degradation
- CSV files: Specialized pandas-based processing
- Template files (.dotx, .xltx, .potx): Standard processing as documents
URL and Caching Support
- HTTPS URL processing with validation
- Intelligent caching system (1-hour default)
- Temporary file management with automatic cleanup
- Security headers and content validation
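A time-based cache with the 1-hour default mentioned above might look like the sketch below. The class name `TTLCache` and its methods are assumptions for illustration; the actual implementation lives in utils/caching.py and may be organized differently.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Map URL -> local file path, forgetting entries older than `ttl` seconds."""

    def __init__(self, ttl: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries: dict[str, tuple[float, str]] = {}

    def put(self, url: str, local_path: str) -> None:
        self._entries[url] = (self._clock(), local_path)

    def get(self, url: str) -> Optional[str]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, local_path = entry
        if self._clock() - stored_at >= self.ttl:
            del self._entries[url]  # expired: caller should download again
            return None
        return local_path
```

A monotonic clock is used by default so that system clock adjustments do not make entries expire early or live too long.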
MCP Integration
Tools are registered using FastMCP decorators and follow MCP protocol standards for:
- Tool descriptions and parameter validation
- Structured result formatting
- Error handling and reporting
- Async operation patterns
Error Handling
- Custom OfficeFileError exception for Office-specific errors
- Comprehensive validation before processing
- Helpful error messages with processing hints
- Graceful degradation for unsupported features
- Debug mode for detailed troubleshooting
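The source names the OfficeFileError exception but not its shape, so the sketch below is only one plausible design. The `hint` attribute and the `to_result` method are hypothetical additions showing how an error could carry a processing hint and expose extra detail in debug mode.

```python
from typing import Optional


class OfficeFileError(Exception):
    """Office-specific failure carrying an optional troubleshooting hint."""

    def __init__(self, message: str, hint: Optional[str] = None):
        super().__init__(message)
        self.hint = hint  # hypothetical field, not confirmed by the project

    def to_result(self, debug: bool = False) -> dict:
        """Convert the error into a structured result for the caller."""
        result = {"success": False, "error": str(self)}
        if self.hint:
            result["hint"] = self.hint
        # In debug mode, include the underlying exception if one was chained.
        if debug and self.__cause__ is not None:
            result["cause"] = repr(self.__cause__)
        return result
```

Raising with `raise OfficeFileError(...) from exc` preserves the original library error, so debug output can show it without exposing it in normal responses.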
Project Structure
mcp-office-tools/
├── src/mcp_office_tools/
│ ├── __init__.py # Package initialization
│ ├── server.py # Main FastMCP server with tools
│ ├── utils/ # Utility modules
│ │ ├── __init__.py # Utils package
│ │ ├── validation.py # File validation and format detection
│ │ ├── file_detection.py # Advanced format analysis
│ │ └── caching.py # URL caching system
│ ├── word/ # Word-specific processors (future)
│ ├── excel/ # Excel-specific processors (future)
│ └── powerpoint/ # PowerPoint-specific processors (future)
├── tests/ # Test suite
├── examples/ # Usage examples
├── docs/ # Documentation
├── pyproject.toml # Project configuration
├── README.md # Project documentation
├── LICENSE # MIT license
└── CLAUDE.md # This file
Implementation Status
Phase 1: Foundation ✅ COMPLETE
- Project structure setup with FastMCP
- Universal tools: extract_text, extract_images, extract_metadata
- Format detection and validation
- URL processing with caching
- Basic Word, Excel, PowerPoint support
Phase 2: Enhancement (In Progress)
- Advanced Word document tools (tables, comments, structure)
- Excel-specific tools (formulas, charts, data analysis)
- PowerPoint tools (slides, speaker notes, animations)
- Legacy format optimization
Phase 3: Advanced Features (Planned)
- Document manipulation tools (merge, split, convert)
- Cross-format comparison and analysis
- Batch processing capabilities
- Enhanced metadata extraction
Testing Approach
The project uses pytest with:
- Async test support via pytest-asyncio
- Coverage reporting with pytest-cov
- Mock Office documents for consistent testing
- Parameterized tests for multiple format support
- Integration tests with real Office files
Relationship to MCP PDF Tools
MCP Office Tools is designed as a companion to MCP PDF Tools:
- Consistent API design patterns
- Similar caching and URL handling
- Parallel tool organization
- Compatible error handling approaches
- Complementary document processing capabilities