Ryan Malloy b681cb030b Initial commit: MCP Office Tools v0.1.0
- Comprehensive Microsoft Office document processing server
- Support for Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), CSV
- 6 universal tools: extract_text, extract_images, extract_metadata, detect_office_format, analyze_document_health, get_supported_formats
- Multi-library fallback system for robust processing
- URL support with intelligent caching
- Legacy Office format support (97-2003)
- FastMCP integration with async architecture
- Production ready with comprehensive documentation

🤖 Generated with Claude Code (claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 01:01:48 -06:00


# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with the MCP Office Tools codebase.
## Project Overview
MCP Office Tools is a FastMCP server that provides comprehensive Microsoft Office document processing capabilities including text extraction, image extraction, metadata extraction, and format detection. The server supports Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), and CSV files with intelligent method selection and automatic fallbacks.
## Development Commands
### Environment Setup
```bash
# Install with development dependencies
uv sync --dev
# Install system dependencies if needed
# (Most dependencies are Python-only)
```
### Testing
```bash
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=mcp_office_tools
# Run specific test file
uv run pytest tests/test_server.py
# Run specific test
uv run pytest tests/test_server.py::TestTextExtraction::test_extract_text_success
```
### Code Quality
```bash
# Format code
uv run black src/ tests/ examples/
# Lint code
uv run ruff check src/ tests/ examples/
# Type checking
uv run mypy src/
```
### Running the Server
```bash
# Run MCP server directly
uv run mcp-office-tools
# Run with Python module
uv run python -m mcp_office_tools.server
# Test with sample documents
uv run python examples/test_office_tools.py /path/to/test.docx
```
### Building and Distribution
```bash
# Build package
uv build
# Upload to PyPI (requires credentials)
uv publish
```
## Architecture
### Core Components
- **`src/mcp_office_tools/server.py`**: Main server implementation with all Office processing tools
- **`src/mcp_office_tools/utils/`**: Utility modules for validation, caching, and file detection
- **FastMCP Framework**: Uses FastMCP for MCP protocol implementation
- **Multi-library approach**: Integrates python-docx, openpyxl, python-pptx, pandas, and legacy format handlers
### Tool Categories
1. **Universal Tools**: Work across all Office formats
   - `extract_text` - Intelligent text extraction
   - `extract_images` - Image extraction with filtering
   - `extract_metadata` - Document metadata extraction
   - `detect_office_format` - Format detection and analysis
   - `analyze_document_health` - Document integrity checking
   - `get_supported_formats` - Supported format listing
2. **Format-Specific Processing**: Specialized handlers for Word, Excel, PowerPoint
3. **Legacy Format Support**: OLE Compound Document processing for .doc, .xls, .ppt
4. **URL Processing**: Direct URL document processing with caching
### Intelligent Fallbacks
The server implements smart fallback mechanisms:
- Text extraction uses multiple libraries in order of preference
- Automatic format detection determines best processing method
- Legacy format support with graceful degradation
- Comprehensive error handling with helpful diagnostics
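The ordered-fallback idea above can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: `extract_with_fallbacks`, `ExtractionError`, and the two stand-in extractors are hypothetical names, and real extractors would wrap libraries such as python-docx or mammoth.

```python
from typing import Callable, Sequence


class ExtractionError(Exception):
    """Raised when every extraction method fails (hypothetical name)."""


def extract_with_fallbacks(
    path: str,
    methods: Sequence[tuple[str, Callable[[str], str]]],
) -> dict:
    """Try each (name, extractor) pair in order and return the first success."""
    failures: list[str] = []
    for name, extractor in methods:
        try:
            text = extractor(path)
        except Exception as exc:  # a real server would catch narrower errors
            failures.append(f"{name}: {exc}")
            continue
        return {
            "text": text,
            "method_used": name,
            "methods_tried": len(failures) + 1,
        }
    # Every method failed: report all diagnostics together.
    raise ExtractionError("all methods failed: " + "; ".join(failures))


# Stand-in extractors for illustration only.
def primary(path: str) -> str:
    raise ValueError("cannot parse")


def secondary(path: str) -> str:
    return f"text from {path}"


result = extract_with_fallbacks(
    "report.docx", [("primary", primary), ("secondary", secondary)]
)
```

Recording which method succeeded, and how many were tried, keeps the fallback visible to the caller instead of hiding it.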
### Dependencies Management
Core dependencies:
- **python-docx**: Modern Word document processing
- **openpyxl**: Excel XLSX file processing
- **python-pptx**: PowerPoint PPTX processing
- **pandas**: CSV and data analysis
- **xlrd**: Legacy Excel XLS support
- **olefile**: Legacy OLE Compound Document support
- **Pillow**: Image processing
- **aiohttp/aiofiles**: Async file and URL handling
Optional dependencies:
- **msoffcrypto-tool**: Encrypted file detection
- **mammoth**: Enhanced Word to HTML/Markdown conversion
### Configuration
Environment variables:
- `OFFICE_TEMP_DIR`: Temporary file processing directory
- `DEBUG`: Enable debug logging and detailed error reporting
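One way these two variables might be read is sketched below. The function name `load_config`, the fallback to the system temporary directory, and the set of values treated as "true" for `DEBUG` are all assumptions of this sketch; the server may parse them differently.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read OFFICE_TEMP_DIR and DEBUG from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    # Assumption: fall back to the system temp directory when unset or empty.
    temp_dir = Path(env.get("OFFICE_TEMP_DIR") or tempfile.gettempdir())
    # Assumption: these spellings enable debug mode; anything else disables it.
    debug = env.get("DEBUG", "").strip().lower() in {"1", "true", "yes", "on"}
    return {"temp_dir": temp_dir, "debug": debug}


config = load_config({"OFFICE_TEMP_DIR": "/tmp/office", "DEBUG": "true"})
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.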
## Development Notes
### Testing Strategy
- Unit tests for each tool with mocked Office libraries
- Test fixtures for consistent document simulation
- Error handling tests for all major failure modes
- Format detection and validation testing
- URL processing and caching tests
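As an illustration of a fixture for document simulation, the sketch below builds a tiny .docx-like ZIP archive in memory using only the standard library. The helper names `make_fake_docx` and `read_paragraphs` are hypothetical. The archive contains only `word/document.xml`, so it is a test stand-in and is not expected to open in Word or in python-docx.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from xml.sax.saxutils import escape

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_fake_docx(paragraphs: list[str]) -> bytes:
    """Return ZIP bytes holding a minimal word/document.xml part."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    document_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


def read_paragraphs(data: bytes) -> list[str]:
    """Collect the text of each w:p element from the fake document."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [
        "".join(node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t"))
        for paragraph in root.iter(f"{{{W_NS}}}p")
    ]


fake = make_fake_docx(["Hello", "A & B"])
```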
### Tool Implementation Pattern
All tools follow this pattern:
1. Validate and resolve file path (including URL downloads)
2. Detect format and validate document integrity
3. Try primary method with intelligent selection based on format
4. Implement fallbacks where applicable
5. Return structured results with metadata
6. Include timing information and method used
7. Provide helpful error messages with troubleshooting hints
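The seven steps above might look like the following in outline. This is a sketch under stated assumptions: `example_tool`, `_error`, the result keys, and the extension set are hypothetical, URL handling is omitted, and plain text decoding stands in for the real Office libraries. Returning an error dictionary rather than raising is also an assumption.

```python
import asyncio
import time
from pathlib import Path

# Hypothetical subset of supported extensions, for illustration only.
SUPPORTED_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".csv"}


def _error(message: str, hint: str, start: float) -> dict:
    # Step 7: error message plus a troubleshooting hint.
    return {
        "success": False,
        "error": message,
        "hint": hint,
        "processing_time": round(time.perf_counter() - start, 4),
    }


async def example_tool(file_path: str) -> dict:
    """Follow the pattern with byte decoding as a stand-in for extraction."""
    start = time.perf_counter()
    path = Path(file_path)

    # Steps 1-2: validate the path and detect the format (URL download omitted).
    if not path.is_file():
        return _error(f"File not found: {file_path}", "Check the path.", start)
    extension = path.suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return _error(f"Unsupported format: {extension}", "Convert the file.", start)

    # Steps 3-4: try the primary method, then fall back.
    raw = await asyncio.to_thread(path.read_bytes)
    try:
        text, method_used = raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        text, method_used = raw.decode("latin-1"), "latin-1"

    # Steps 5-6: structured result with the method used and timing.
    return {
        "success": True,
        "text": text,
        "format": extension,
        "method_used": method_used,
        "processing_time": round(time.perf_counter() - start, 4),
    }
```

A caller would run it with `asyncio.run(example_tool("/path/to/file.csv"))` or await it from another coroutine.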
### Format Support Matrix
- **Modern formats** (.docx, .xlsx, .pptx): Full feature support
- **Legacy formats** (.doc, .xls, .ppt): Basic extraction with graceful degradation
- **CSV files**: Specialized pandas-based processing
- **Template files** (.dotx, .xltx, .potx): Standard processing as documents
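The matrix above can be approximated by combining the file extension with the leading bytes of the file. Modern Office files are ZIP archives and legacy files are OLE Compound Documents, so each has a recognizable signature. The function name `classify_format` and the category labels are assumptions of this sketch, not the server's API.

```python
ZIP_MAGIC = b"PK\x03\x04"  # local file header of a ZIP archive (OOXML)
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE Compound Document signature

MODERN = {".docx", ".xlsx", ".pptx", ".dotx", ".xltx", ".potx"}
LEGACY = {".doc", ".xls", ".ppt"}


def classify_format(header: bytes, extension: str) -> str:
    """Return 'modern', 'legacy', 'csv', or 'unknown' for a file."""
    ext = extension.lower()
    if ext == ".csv":
        return "csv"  # plain text, no signature to check
    if ext in MODERN and header.startswith(ZIP_MAGIC):
        return "modern"
    if ext in LEGACY and header.startswith(OLE_MAGIC):
        return "legacy"
    # Extension and content disagree, or the type is not recognized.
    return "unknown"


category = classify_format(b"PK\x03\x04rest-of-file", ".DOCX")
```

Requiring the extension and the signature to agree catches renamed files, such as a legacy `.doc` saved with a `.docx` extension.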
### URL and Caching Support
- HTTPS URL processing with validation
- Intelligent caching system (1-hour default)
- Temporary file management with automatic cleanup
- Security headers and content validation
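A time-based cache with a one-hour default could be sketched as below. The class name `TTLCache`, the helper `is_https_url`, and the in-memory dictionary are assumptions; the actual caching module may persist entries on disk and apply further validation.

```python
import time
from typing import Callable, Optional
from urllib.parse import urlparse


def is_https_url(url: str) -> bool:
    """Accept only URLs with an https scheme and a host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.netloc)


class TTLCache:
    """Map URLs to downloaded file paths, expiring entries after a TTL."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,  # one-hour default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def put(self, url: str, local_path: str) -> None:
        self._entries[url] = (self._clock(), local_path)

    def get(self, url: str) -> Optional[str]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, local_path = entry
        if self._clock() - stored_at >= self._ttl:
            # Expired: drop the entry so the caller downloads again.
            del self._entries[url]
            return None
        return local_path
```

Injecting the clock lets tests advance time without sleeping.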
### MCP Integration
Tools are registered using FastMCP decorators and follow MCP protocol standards for:
- Tool descriptions and parameter validation
- Structured result formatting
- Error handling and reporting
- Async operation patterns
### Error Handling
- Custom `OfficeFileError` exception for Office-specific errors
- Comprehensive validation before processing
- Helpful error messages with processing hints
- Graceful degradation for unsupported features
- Debug mode for detailed troubleshooting
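A rough sketch of how such an exception might carry a hint into a structured error result follows. Only the name `OfficeFileError` comes from this document; the `hint` attribute, the `to_error_result` helper, and the result keys are assumptions made for illustration.

```python
from typing import Optional


class OfficeFileError(Exception):
    """Office-specific error; the `hint` attribute is an assumption here."""

    def __init__(self, message: str, hint: Optional[str] = None) -> None:
        super().__init__(message)
        self.hint = hint


def to_error_result(exc: Exception, debug: bool = False) -> dict:
    """Convert an exception into a structured result (hypothetical shape)."""
    result = {"success": False, "error": str(exc)}
    hint = getattr(exc, "hint", None)
    if hint:
        result["hint"] = hint
    if debug:
        # Debug mode adds detail that is normally omitted.
        result["error_type"] = type(exc).__name__
    return result


try:
    raise OfficeFileError(
        "File is password protected",
        hint="Remove the password and retry.",
    )
except OfficeFileError as exc:
    error_result = to_error_result(exc, debug=True)
```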
## Project Structure
```
mcp-office-tools/
├── src/mcp_office_tools/
│ ├── __init__.py # Package initialization
│ ├── server.py # Main FastMCP server with tools
│ ├── utils/ # Utility modules
│ │ ├── __init__.py # Utils package
│ │ ├── validation.py # File validation and format detection
│ │ ├── file_detection.py # Advanced format analysis
│ │ └── caching.py # URL caching system
│ ├── word/ # Word-specific processors (future)
│ ├── excel/ # Excel-specific processors (future)
│ └── powerpoint/ # PowerPoint-specific processors (future)
├── tests/ # Test suite
├── examples/ # Usage examples
├── docs/ # Documentation
├── pyproject.toml # Project configuration
├── README.md # Project documentation
├── LICENSE # MIT license
└── CLAUDE.md # This file
```
## Implementation Status
### Phase 1: Foundation ✅ COMPLETE
- Project structure setup with FastMCP
- Universal tools: extract_text, extract_images, extract_metadata
- Format detection and validation
- URL processing with caching
- Basic Word, Excel, PowerPoint support
### Phase 2: Enhancement (In Progress)
- Advanced Word document tools (tables, comments, structure)
- Excel-specific tools (formulas, charts, data analysis)
- PowerPoint tools (slides, speaker notes, animations)
- Legacy format optimization
### Phase 3: Advanced Features (Planned)
- Document manipulation tools (merge, split, convert)
- Cross-format comparison and analysis
- Batch processing capabilities
- Enhanced metadata extraction
## Testing Approach
The project uses pytest with:
- Async test support via pytest-asyncio
- Coverage reporting with pytest-cov
- Mock Office documents for consistent testing
- Parameterized tests for multiple format support
- Integration tests with real Office files
## Relationship to MCP PDF Tools
MCP Office Tools is designed as a companion to MCP PDF Tools:
- Consistent API design patterns
- Similar caching and URL handling
- Parallel tool organization
- Compatible error handling approaches
- Complementary document processing capabilities