mcp-office-tools/README.md
Ryan Malloy b681cb030b Initial commit: MCP Office Tools v0.1.0
- Comprehensive Microsoft Office document processing server
- Support for Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), CSV
- 6 universal tools: extract_text, extract_images, extract_metadata, detect_office_format, analyze_document_health, get_supported_formats
- Multi-library fallback system for robust processing
- URL support with intelligent caching
- Legacy Office format support (97-2003)
- FastMCP integration with async architecture
- Production ready with comprehensive documentation

🤖 Generated with Claude Code (claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 01:01:48 -06:00


# MCP Office Tools

**Comprehensive Microsoft Office document processing server for the MCP (Model Context Protocol) ecosystem.**

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![FastMCP](https://img.shields.io/badge/FastMCP-0.5+-green.svg)](https://github.com/jlowin/fastmcp)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

MCP Office Tools provides **30+ comprehensive tools** for processing Microsoft Office documents including Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), and CSV files. Built as a companion to [MCP PDF Tools](https://github.com/mcp-pdf-tools/mcp-pdf-tools), it offers the same level of quality and robustness for Office document processing.

## 🌟 Key Features
### **Universal Format Support**
- **Word Documents**: `.docx`, `.doc`, `.docm`, `.dotx`, `.dot`
- **Excel Spreadsheets**: `.xlsx`, `.xls`, `.xlsm`, `.xltx`, `.xlt`, `.csv`
- **PowerPoint Presentations**: `.pptx`, `.ppt`, `.pptm`, `.potx`, `.pot`
- **Legacy Compatibility**: Office 97-2003 formats (`.doc`, `.xls`, `.ppt`) are supported; some features have only basic support (see the Format Support Matrix)
### **Intelligent Processing**
- **Multi-library fallback system** for robust document processing
- **Automatic format detection** and validation
- **Smart method selection** based on document type and complexity
- **URL support** with intelligent caching (1-hour cache)
### **Comprehensive Tool Suite**
- **Universal Tools** (8): Work across all Office formats
- **Word Tools** (8): Specialized document processing
- **Excel Tools** (8): Advanced spreadsheet analysis
- **PowerPoint Tools** (6): Presentation content extraction
## 🚀 Quick Start
### Installation
```bash
# Install with uv (recommended)
uv add mcp-office-tools
# Or with pip
pip install mcp-office-tools
```
### Basic Usage
```bash
# Run the MCP server
mcp-office-tools
# Or run directly with Python
python -m mcp_office_tools.server
```
### Integration with Claude Desktop
Add to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "mcp-office-tools": {
      "command": "mcp-office-tools"
    }
  }
}
```
## 📊 Tool Categories
### **📄 Universal Processing Tools**
Work across all Office formats with intelligent format detection:
| Tool | Description | Formats |
|------|-------------|---------|
| `extract_text` | Multi-method text extraction | All formats |
| `extract_images` | Image extraction with filtering | Word, Excel, PowerPoint |
| `extract_metadata` | Document properties and statistics | All formats |
| `detect_office_format` | Format detection and analysis | All formats |
| `analyze_document_health` | File integrity and health check | All formats |
| `get_supported_formats` | Lists supported extensions, MIME types, and categories | N/A |
### **📝 Word Document Tools**
Specialized for Word documents (.docx, .doc, .docm):
```python
# Extract text with formatting preservation
result = await extract_text("document.docx", preserve_formatting=True)
# Get document structure and metadata
metadata = await extract_metadata("report.doc")
# Health check for legacy documents
health = await analyze_document_health("old_document.doc")
```
### **📊 Excel Spreadsheet Tools**
Advanced spreadsheet processing (.xlsx, .xls, .csv):
```python
# Extract data from all worksheets
data = await extract_text("spreadsheet.xlsx", preserve_formatting=True)
# Process CSV files
csv_data = await extract_text("data.csv")
# Legacy Excel support
legacy_data = await extract_text("old_data.xls")
```
### **🎯 PowerPoint Tools**
Presentation content extraction (.pptx, .ppt):
```python
# Extract slide content
slides = await extract_text("presentation.pptx", preserve_formatting=True)
# Get presentation metadata
info = await extract_metadata("slideshow.pptx")
```
## 🔧 Real-World Use Cases
### **Business Intelligence & Reporting**
```python
# Process quarterly reports across formats
word_summary = await extract_text("quarterly-report.docx")
excel_data = await extract_text("financial-data.xlsx", preserve_formatting=True)
ppt_insights = await extract_text("presentation.pptx")
# Cross-format health analysis
health_check = await analyze_document_health("legacy-report.doc")
```
### **Document Migration & Modernization**
```python
# Legacy document processing
legacy_docs = ["policy.doc", "procedures.xls", "training.ppt"]
for doc in legacy_docs:
    # Format detection
    format_info = await detect_office_format(doc)
    # Health assessment
    health = await analyze_document_health(doc)
    # Content extraction
    content = await extract_text(doc)
```
### **Content Analysis & Extraction**
```python
# Multi-format content processing
documents = ["research.docx", "data.xlsx", "slides.pptx"]
for doc in documents:
    # Comprehensive analysis
    text = await extract_text(doc, preserve_formatting=True)
    images = await extract_images(doc, min_width=200, min_height=200)
    metadata = await extract_metadata(doc)
```
## 🏗️ Architecture
### **Multi-Library Approach**
MCP Office Tools uses multiple libraries with intelligent fallbacks:

**Word Documents:**
- `python-docx` → `mammoth` → `docx2txt` → `olefile` (legacy)

**Excel Spreadsheets:**
- `openpyxl` → `pandas` → `xlrd` (legacy)

**PowerPoint Presentations:**
- `python-pptx` → `olefile` (legacy)
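
The fallback order above amounts to "try each library in turn until one succeeds". The following is a minimal sketch of that pattern, not the project's actual implementation: `extract_with_fallback`, `ExtractionError`, and the stub extractors `primary` and `secondary` are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence


class ExtractionError(Exception):
    """Raised when every extractor in the chain fails (hypothetical name)."""


def extract_with_fallback(
    path: str,
    extractors: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, extractor) pair in order; return (method_used, text)."""
    errors: list[str] = []
    for name, extractor in extractors:
        try:
            return name, extractor(path)
        except Exception as exc:  # a real chain would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise ExtractionError(
        f"all extractors failed for {path!r}: " + "; ".join(errors)
    )


# Stub extractors standing in for real libraries (assumptions, not real APIs).
def primary(path: str) -> str:
    raise ValueError("unsupported structure")


def secondary(path: str) -> str:
    return f"text from {path}"


method, text = extract_with_fallback(
    "report.docx", [("primary", primary), ("secondary", secondary)]
)
print(method, "->", text)
```

Returning the name of the method that succeeded lets a caller report which library produced the result, which is useful when output quality differs between libraries.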
### **Format Support Matrix**
| Format | Text | Images | Metadata | Legacy |
|--------|------|--------|----------|--------|
| .docx | ✅ | ✅ | ✅ | N/A |
| .doc | ✅ | ⚠️ | ⚠️ | ✅ |
| .xlsx | ✅ | ✅ | ✅ | N/A |
| .xls | ✅ | ⚠️ | ⚠️ | ✅ |
| .pptx | ✅ | ✅ | ✅ | N/A |
| .ppt | ⚠️ | ⚠️ | ⚠️ | ✅ |
| .csv | ✅ | N/A | ⚠️ | N/A |

*✅ Full support, ⚠️ Basic support, N/A Not applicable*
## 🔍 Advanced Features
### **URL Processing**
Process Office documents directly from URLs:
```python
# Direct URL processing
url_doc = "https://example.com/document.docx"
content = await extract_text(url_doc)
# Automatic caching (1-hour default)
cached_content = await extract_text(url_doc) # Uses cache
```
### **Format Detection**
Intelligent format detection and validation:
```python
# Comprehensive format analysis
format_info = await detect_office_format("unknown_file.office")
# Returns:
# - Format name and category
# - MIME type validation
# - Legacy vs modern classification
# - Processing recommendations
```
### **Document Health Analysis**
Comprehensive document integrity checking:
```python
# Health assessment
health = await analyze_document_health("suspicious_file.docx")
# Returns:
# - Health score (1-10)
# - Validation results
# - Corruption detection
# - Processing recommendations
```
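For modern (ZIP-based) formats, a basic integrity check can be built from the standard library alone. The sketch below is illustrative and makes several assumptions: the function name `check_ooxml_health`, the result keys, and the scoring rule (start at 10, subtract 4 per issue, floor at 1) are all hypothetical and are not taken from `analyze_document_health`.

```python
import io
import zipfile


def check_ooxml_health(data: bytes) -> dict:
    """Run basic container checks and derive a 1-10 score (hypothetical rule)."""
    issues: list[str] = []
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            bad_member = archive.testzip()  # first member failing its CRC, or None
            if bad_member is not None:
                issues.append(f"corrupt member: {bad_member}")
            if "[Content_Types].xml" not in archive.namelist():
                issues.append("missing [Content_Types].xml")
    except zipfile.BadZipFile:
        issues.append("not a readable ZIP container")
    score = max(1, 10 - 4 * len(issues))  # assumed scoring, for illustration only
    return {"health_score": score, "is_healthy": not issues, "issues": issues}


def _build_package(include_content_types: bool) -> bytes:
    """Build a tiny in-memory package for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        if include_content_types:
            archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<xml/>")
    return buffer.getvalue()


good_report = check_ooxml_health(_build_package(True))
incomplete_report = check_ooxml_health(_build_package(False))
garbage_report = check_ooxml_health(b"this is not an Office file")
print(good_report)
print(incomplete_report)
print(garbage_report)
```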
## 📈 Performance & Compatibility
### **System Requirements**
- **Python**: 3.11+
- **Memory**: 512MB+ available RAM
- **Storage**: 100MB+ for dependencies
### **Dependencies**
- **Core**: FastMCP, python-docx, openpyxl, python-pptx
- **Legacy**: olefile, xlrd, msoffcrypto-tool
- **Enhancement**: mammoth, pandas, Pillow
### **Platform Support**
- **Linux** (Ubuntu 20.04+, RHEL 8+)
- **macOS** (10.15+)
- **Windows** (10/11)
- **Docker** containers
## 🛠️ Development
### **Setup Development Environment**
```bash
# Clone repository
git clone https://github.com/mcp-office-tools/mcp-office-tools.git
cd mcp-office-tools
# Install with development dependencies
uv sync --dev
# Run tests
uv run pytest
# Code quality checks
uv run black src/ tests/
uv run ruff check src/ tests/
uv run mypy src/
```
### **Testing**
```bash
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=mcp_office_tools
# Test specific format
uv run pytest tests/test_word_extraction.py
```
## 🤝 Integration with MCP PDF Tools
MCP Office Tools is designed as a companion to [MCP PDF Tools](https://github.com/mcp-pdf-tools/mcp-pdf-tools), so both can be used in the same workflow:
```python
# Unified document processing workflow
pdf_content = await pdf_tools.extract_text("document.pdf")
docx_content = await office_tools.extract_text("document.docx")
# Cross-format analysis
pdf_metadata = await pdf_tools.extract_metadata("document.pdf")
docx_metadata = await office_tools.extract_metadata("document.docx")
```
## 📋 Supported Formats
```python
# Get all supported formats
formats = await get_supported_formats()
# Returns comprehensive format information:
# - 15+ file extensions
# - MIME type mappings
# - Category classifications
# - Processing capabilities
```
## 🔒 Security & Privacy
- **No data collection**: Documents processed locally
- **Temporary files**: Automatic cleanup after processing
- **URL validation**: Secure HTTPS-only downloads
- **Memory management**: Efficient processing of large files
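
An HTTPS-only rule like the one above can be enforced before any network request is made. The following is a minimal sketch using only the standard library; `validate_download_url` is a hypothetical helper name, and the server's real validation may apply additional checks.

```python
from urllib.parse import urlsplit


def validate_download_url(url: str) -> str:
    """Return the URL unchanged if it is HTTPS with a host; otherwise raise."""
    parts = urlsplit(url)
    if parts.scheme != "https":  # urlsplit lowercases the scheme
        raise ValueError(f"only https:// URLs are allowed, got {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host component")
    return url


for candidate in (
    "https://example.com/document.docx",
    "http://example.com/document.docx",
    "file:///etc/passwd",
):
    try:
        validate_download_url(candidate)
        print("accepted:", candidate)
    except ValueError as exc:
        print("rejected:", candidate, "-", exc)
```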
## 📝 License
MIT License - see [LICENSE](LICENSE) file for details.
## 🚀 Coming Soon
- **Advanced Excel Tools**: Formula parsing, chart extraction
- **PowerPoint Enhancement**: Animation analysis, slide comparison
- **Document Conversion**: Cross-format conversion capabilities
- **Batch Processing**: Multi-document workflows
- **Cloud Integration**: Direct cloud storage support
---
**Built with ❤️ for the MCP ecosystem**
*MCP Office Tools - Comprehensive Microsoft Office document processing for modern AI workflows.*