# MCP Office Tools

**Comprehensive Microsoft Office document processing server for the MCP (Model Context Protocol) ecosystem.**

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![FastMCP](https://img.shields.io/badge/FastMCP-0.5+-green.svg)](https://github.com/jlowin/fastmcp)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

MCP Office Tools provides **30+ tools** for processing Microsoft Office documents, including Word (.docx, .doc), Excel (.xlsx, .xls), and PowerPoint (.pptx, .ppt) files, as well as CSV files. Built as a companion to [MCP PDF Tools](https://github.com/mcp-pdf-tools/mcp-pdf-tools), it aims for the same quality and robustness in Office document processing.

## 🌟 Key Features

### **Universal Format Support**
- **Word Documents**: `.docx`, `.doc`, `.docm`, `.dotx`, `.dot`
- **Excel Spreadsheets**: `.xlsx`, `.xls`, `.xlsm`, `.xltx`, `.xlt`, `.csv`
- **PowerPoint Presentations**: `.pptx`, `.ppt`, `.pptm`, `.potx`, `.pot`
- **Legacy Compatibility**: Office 97-2003 formats are supported, with basic image and metadata extraction (see the Format Support Matrix)

### **Intelligent Processing**
- **Multi-library fallback system** for robust document processing
- **Automatic format detection** and validation
- **Smart method selection** based on document type and complexity
- **URL support** with intelligent caching (1-hour cache)

### **Comprehensive Tool Suite**
- **Universal Tools** (8): Work across all Office formats
- **Word Tools** (8): Specialized document processing
- **Excel Tools** (8): Advanced spreadsheet analysis
- **PowerPoint Tools** (6): Presentation content extraction

## 🚀 Quick Start

### Installation

```bash
# Install with uv (recommended)
uv add mcp-office-tools

# Or with pip
pip install mcp-office-tools
```

### Basic Usage

```bash
# Run the MCP server
mcp-office-tools

# Or run directly with Python
python -m mcp_office_tools.server
```

### Integration with Claude Desktop

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "mcp-office-tools": {
      "command": "mcp-office-tools"
    }
  }
}
```

## 📊 Tool Categories

### **📄 Universal Processing Tools**
Work across all Office formats with intelligent format detection:

| Tool | Description | Formats |
|------|-------------|---------|
| `extract_text` | Multi-method text extraction | All formats |
| `extract_images` | Image extraction with filtering | Word, Excel, PowerPoint |
| `extract_metadata` | Document properties and statistics | All formats |
| `detect_office_format` | Format detection and analysis | All formats |
| `analyze_document_health` | File integrity and health check | All formats |

### **📝 Word Document Tools**
Specialized for Word documents (.docx, .doc, .docm):

```python
# Extract text with formatting preservation
result = await extract_text("document.docx", preserve_formatting=True)

# Get document structure and metadata
metadata = await extract_metadata("report.doc")

# Health check for legacy documents
health = await analyze_document_health("old_document.doc")
```

### **📊 Excel Spreadsheet Tools**
Advanced spreadsheet processing (.xlsx, .xls, .csv):

```python
# Extract data from all worksheets
data = await extract_text("spreadsheet.xlsx", preserve_formatting=True)

# Process CSV files
csv_data = await extract_text("data.csv")

# Legacy Excel support
legacy_data = await extract_text("old_data.xls")
```

### **🎯 PowerPoint Tools**
Presentation content extraction (.pptx, .ppt):

```python
# Extract slide content
slides = await extract_text("presentation.pptx", preserve_formatting=True)

# Get presentation metadata
info = await extract_metadata("slideshow.pptx")
```

## 🔧 Real-World Use Cases

### **Business Intelligence & Reporting**
```python
# Process quarterly reports across formats
word_summary = await extract_text("quarterly-report.docx")
excel_data = await extract_text("financial-data.xlsx", preserve_formatting=True)
ppt_insights = await extract_text("presentation.pptx")

# Cross-format health analysis
health_check = await analyze_document_health("legacy-report.doc")
```

### **Document Migration & Modernization**
```python
# Legacy document processing
legacy_docs = ["policy.doc", "procedures.xls", "training.ppt"]

for doc in legacy_docs:
    # Format detection
    format_info = await detect_office_format(doc)

    # Health assessment
    health = await analyze_document_health(doc)

    # Content extraction
    content = await extract_text(doc)
```

### **Content Analysis & Extraction**
```python
# Multi-format content processing
documents = ["research.docx", "data.xlsx", "slides.pptx"]

for doc in documents:
    # Comprehensive analysis
    text = await extract_text(doc, preserve_formatting=True)
    images = await extract_images(doc, min_width=200, min_height=200)
    metadata = await extract_metadata(doc)
```

## 🏗️ Architecture

### **Multi-Library Approach**
MCP Office Tools uses multiple libraries with intelligent fallbacks:

**Word Documents:**
- `python-docx` → `mammoth` → `docx2txt` → `olefile` (legacy)

**Excel Spreadsheets:**
- `openpyxl` → `pandas` → `xlrd` (legacy)

**PowerPoint Presentations:**
- `python-pptx` → `olefile` (legacy)
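
The chains above can be read as "try each library in order and keep the first result". The following is a minimal sketch of that pattern, not the server's actual implementation: `extract_with_fallback`, `ExtractionError`, and the two stand-in parsers are hypothetical names used only for illustration, and the stand-ins do not call any real Office library.

```python
from typing import Callable, Sequence

Extractor = Callable[[str], str]


class ExtractionError(Exception):
    """Raised when every extractor in the chain has failed."""


def extract_with_fallback(
    path: str, chain: Sequence[tuple[str, Extractor]]
) -> tuple[str, str]:
    """Try each extractor in order; return (method_name, text) from the first success."""
    failures: list[str] = []
    for name, extractor in chain:
        try:
            return name, extractor(path)
        except Exception as exc:  # any failure moves on to the next library
            failures.append(f"{name}: {exc}")
    raise ExtractionError(f"no extractor could read {path} ({'; '.join(failures)})")


# Hypothetical stand-ins for real library wrappers.
def strict_parser(path: str) -> str:
    raise ValueError("unsupported structure")


def lenient_parser(path: str) -> str:
    return f"text from {path}"


method, text = extract_with_fallback(
    "report.docx", [("strict", strict_parser), ("lenient", lenient_parser)]
)
print(method, text)  # prints: lenient text from report.docx
```

Returning the method name alongside the text lets a caller see which library produced the result, which is useful when a fallback yields lower-fidelity output.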

### **Format Support Matrix**

| Format | Text | Images | Metadata | Legacy |
|--------|------|--------|----------|--------|
| .docx  | ✅   | ✅     | ✅       | N/A    |
| .doc   | ✅   | ⚠️     | ⚠️       | ✅     |
| .xlsx  | ✅   | ✅     | ✅       | N/A    |
| .xls   | ✅   | ⚠️     | ⚠️       | ✅     |
| .pptx  | ✅   | ✅     | ✅       | N/A    |
| .ppt   | ⚠️   | ⚠️     | ⚠️       | ✅     |
| .csv   | ✅   | N/A    | ⚠️       | N/A    |

*✅ Full support, ⚠️ Basic support, N/A Not applicable*

## 🔍 Advanced Features

### **URL Processing**
Process Office documents directly from URLs:

```python
# Direct URL processing
url_doc = "https://example.com/document.docx"
content = await extract_text(url_doc)

# Automatic caching (1-hour default)
cached_content = await extract_text(url_doc)  # Uses cache
```
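
One way to picture the one-hour cache is a small time-to-live lookup keyed by URL. The sketch below is an assumption about how such a cache could work, not a description of the server's internals: `UrlCache`, `CACHE_TTL_SECONDS`, and the injected `fetch` and `clock` callables are hypothetical, and no network request is made.

```python
import time
from typing import Callable

CACHE_TTL_SECONDS = 3600  # one hour, matching the documented default


class UrlCache:
    """Hypothetical in-memory cache that re-fetches a URL once its entry expires."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[float, bytes]] = {}

    def get(self, url: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: serve the cached bytes
        data = self._fetch(url)  # missing or expired: download again
        self._entries[url] = (now, data)
        return data
```

Injecting the clock keeps the expiry logic testable without waiting an hour.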

### **Format Detection**
Intelligent format detection and validation:

```python
# Comprehensive format analysis
format_info = await detect_office_format("unknown_file.office")

# Returns:
# - Format name and category
# - MIME type validation
# - Legacy vs modern classification
# - Processing recommendations
```
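
Content-based detection usually starts from the container signature: legacy Office files are OLE2 compound files, while modern ones are ZIP packages containing `[Content_Types].xml`. The sketch below illustrates that idea with the standard library only. `sniff_office_format` and its return labels are hypothetical and are not the output of `detect_office_format`.

```python
import io
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file header (.doc/.xls/.ppt)
ZIP_MAGIC = b"PK\x03\x04"  # local file header of a non-empty ZIP archive


def sniff_office_format(data: bytes) -> str:
    """Classify raw bytes by container type, ignoring the file extension."""
    if data.startswith(OLE2_MAGIC):
        return "legacy-ole2"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            return "unknown"
        if "[Content_Types].xml" not in names:
            return "zip-not-ooxml"  # a ZIP, but not an Office package
        for prefix, label in (("word/", "word"), ("xl/", "excel"), ("ppt/", "powerpoint")):
            if any(name.startswith(prefix) for name in names):
                return label
        return "ooxml-unknown"
    return "unknown"


print(sniff_office_format(OLE2_MAGIC + b"\x00" * 8))  # prints: legacy-ole2
```

Note that the OLE2 signature alone does not distinguish `.doc` from `.xls` or `.ppt`; telling them apart requires reading the streams inside the compound file.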

### **Document Health Analysis**
Comprehensive document integrity checking:

```python
# Health assessment
health = await analyze_document_health("suspicious_file.docx")

# Returns:
# - Health score (1-10)
# - Validation results
# - Corruption detection
# - Processing recommendations
```
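
For modern (ZIP-based) formats, a basic corruption check can be done with the standard library alone. The following sketch is illustrative and does not reproduce the tool's scoring: `check_ooxml_integrity` and its result keys are hypothetical, and since the scoring rules are not documented here, it reports a list of issues rather than a 1-10 score.

```python
import io
import zipfile


def check_ooxml_integrity(data: bytes) -> dict:
    """Run basic container checks on a .docx/.xlsx/.pptx payload."""
    issues: list[str] = []
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # testzip() returns the first member whose CRC check fails, else None.
            bad_member = archive.testzip()
            if bad_member is not None:
                issues.append(f"CRC mismatch in {bad_member}")
            if "[Content_Types].xml" not in archive.namelist():
                issues.append("missing [Content_Types].xml")
    except zipfile.BadZipFile:
        issues.append("not a readable ZIP container")
    return {"is_healthy": not issues, "issues": issues}


print(check_ooxml_integrity(b"not an office file"))
```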

## 📈 Performance & Compatibility

### **System Requirements**
- **Python**: 3.11+
- **Memory**: 512MB+ available RAM
- **Storage**: 100MB+ for dependencies

### **Dependencies**
- **Core**: FastMCP, python-docx, openpyxl, python-pptx
- **Legacy**: olefile, xlrd, msoffcrypto-tool
- **Enhancement**: mammoth, pandas, Pillow

### **Platform Support**
- ✅ **Linux** (Ubuntu 20.04+, RHEL 8+)
- ✅ **macOS** (10.15+)
- ✅ **Windows** (10/11)
- ✅ **Docker** containers

## 🛠️ Development

### **Setup Development Environment**

```bash
# Clone repository
git clone https://github.com/mcp-office-tools/mcp-office-tools.git
cd mcp-office-tools

# Install with development dependencies
uv sync --dev

# Run tests
uv run pytest

# Code quality checks
uv run black src/ tests/
uv run ruff check src/ tests/
uv run mypy src/
```

### **Testing**

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=mcp_office_tools

# Test specific format
uv run pytest tests/test_word_extraction.py
```

## 🤝 Integration with MCP PDF Tools

MCP Office Tools is designed as a companion to [MCP PDF Tools](https://github.com/mcp-pdf-tools/mcp-pdf-tools), so both servers can be used in a single workflow:

```python
# Unified document processing workflow
# (assumes `pdf_tools` and `office_tools` are client handles for the two servers)
pdf_content = await pdf_tools.extract_text("document.pdf")
docx_content = await office_tools.extract_text("document.docx")

# Cross-format analysis
pdf_metadata = await pdf_tools.extract_metadata("document.pdf")
docx_metadata = await office_tools.extract_metadata("document.docx")
```

## 📋 Supported Formats

```python
# Get all supported formats
formats = await get_supported_formats()

# Returns comprehensive format information:
# - 15+ file extensions
# - MIME type mappings
# - Category classifications
# - Processing capabilities
```

## 🔒 Security & Privacy

- **No data collection**: Documents processed locally
- **Temporary files**: Automatic cleanup after processing
- **URL validation**: Secure HTTPS-only downloads
- **Memory management**: Efficient processing of large files
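
Two of the points above, HTTPS-only downloads and automatic temporary-file cleanup, can be sketched with the standard library. This is an illustration under stated assumptions, not the server's code: `require_https` and `process_in_temp_file` are hypothetical helpers, and the handler passed in stands for whatever processing step consumes the file.

```python
import os
import tempfile
from typing import Callable, TypeVar
from urllib.parse import urlparse

T = TypeVar("T")


def require_https(url: str) -> str:
    """Return the URL unchanged if it is an HTTPS URL with a host; otherwise raise."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"only https:// URLs are accepted, got: {url!r}")
    return url


def process_in_temp_file(data: bytes, suffix: str, handler: Callable[[str], T]) -> T:
    """Write data to a temporary file, run handler(path), and always delete the file."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        return handler(path)  # the file is closed, so the handler can reopen it
    finally:
        os.remove(path)  # cleanup runs even if the handler raises


size = process_in_temp_file(b"example bytes", ".docx", os.path.getsize)
print(size)  # prints: 13
```

Closing the file before calling the handler keeps the sketch portable, since an open temporary file cannot be reopened by name on Windows.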

## 📝 License

MIT License - see [LICENSE](LICENSE) file for details.

## 🚀 Coming Soon

- **Advanced Excel Tools**: Formula parsing, chart extraction
- **PowerPoint Enhancement**: Animation analysis, slide comparison
- **Document Conversion**: Cross-format conversion capabilities
- **Batch Processing**: Multi-document workflows
- **Cloud Integration**: Direct cloud storage support

---

**Built with ❤️ for the MCP ecosystem**

*MCP Office Tools - Comprehensive Microsoft Office document processing for modern AI workflows.*