# MCP Office Tools

**Comprehensive Microsoft Office document processing server for the MCP (Model Context Protocol) ecosystem.**

[Python 3.11+](https://www.python.org/downloads/)
[FastMCP](https://github.com/jlowin/fastmcp)
[License: MIT](https://opensource.org/licenses/MIT)

MCP Office Tools provides **30+ comprehensive tools** for processing Microsoft Office documents including Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), and CSV files. Built as a companion to [MCP PDF Tools](https://github.com/mcp-pdf-tools/mcp-pdf-tools), it offers the same level of quality and robustness for Office document processing.

## 🌟 Key Features

### **Universal Format Support**
- **Word Documents**: `.docx`, `.doc`, `.docm`, `.dotx`, `.dot`
- **Excel Spreadsheets**: `.xlsx`, `.xls`, `.xlsm`, `.xltx`, `.xlt`, `.csv`
- **PowerPoint Presentations**: `.pptx`, `.ppt`, `.pptm`, `.potx`, `.pot`
- **Legacy Compatibility**: Full support for Office 97-2003 formats

### **Intelligent Processing**
- **Multi-library fallback system** for robust document processing
- **Automatic format detection** and validation
- **Smart method selection** based on document type and complexity
- **URL support** with intelligent caching (1-hour cache)

### **Comprehensive Tool Suite**
- **Universal Tools** (8): Work across all Office formats
- **Word Tools** (8): Specialized document processing
- **Excel Tools** (8): Advanced spreadsheet analysis
- **PowerPoint Tools** (6): Presentation content extraction

## 🚀 Quick Start

### Installation

```bash
# Install with uv (recommended)
uv add mcp-office-tools

# Or with pip
pip install mcp-office-tools
```

### Basic Usage

```bash
# Run the MCP server
mcp-office-tools

# Or run directly with Python
python -m mcp_office_tools.server
```

### Integration with Claude Desktop

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "mcp-office-tools": {
      "command": "mcp-office-tools"
    }
  }
}
```

## 📊 Tool Categories

### **📄 Universal Processing Tools**
Work across all Office formats with intelligent format detection:

| Tool | Description | Formats |
|------|-------------|---------|
| `extract_text` | Multi-method text extraction | All formats |
| `extract_images` | Image extraction with filtering | Word, Excel, PowerPoint |
| `extract_metadata` | Document properties and statistics | All formats |
| `detect_office_format` | Format detection and analysis | All formats |
| `analyze_document_health` | File integrity and health check | All formats |
| `get_supported_formats` | List of supported formats and capabilities | N/A |

### **📝 Word Document Tools**
Specialized for Word documents (.docx, .doc, .docm):

```python
# Extract text with formatting preservation
result = await extract_text("document.docx", preserve_formatting=True)

# Get document structure and metadata
metadata = await extract_metadata("report.doc")

# Health check for legacy documents
health = await analyze_document_health("old_document.doc")
```

### **📊 Excel Spreadsheet Tools**
Advanced spreadsheet processing (.xlsx, .xls, .csv):

```python
# Extract data from all worksheets
data = await extract_text("spreadsheet.xlsx", preserve_formatting=True)

# Process CSV files
csv_data = await extract_text("data.csv")

# Legacy Excel support
legacy_data = await extract_text("old_data.xls")
```

### **🎯 PowerPoint Tools**
Presentation content extraction (.pptx, .ppt):

```python
# Extract slide content
slides = await extract_text("presentation.pptx", preserve_formatting=True)

# Get presentation metadata
info = await extract_metadata("slideshow.pptx")
```

## 🔧 Real-World Use Cases

### **Business Intelligence & Reporting**
```python
# Process quarterly reports across formats
word_summary = await extract_text("quarterly-report.docx")
excel_data = await extract_text("financial-data.xlsx", preserve_formatting=True)
ppt_insights = await extract_text("presentation.pptx")

# Cross-format health analysis
health_check = await analyze_document_health("legacy-report.doc")
```

### **Document Migration & Modernization**
```python
# Legacy document processing
legacy_docs = ["policy.doc", "procedures.xls", "training.ppt"]

for doc in legacy_docs:
    # Format detection
    format_info = await detect_office_format(doc)

    # Health assessment
    health = await analyze_document_health(doc)

    # Content extraction
    content = await extract_text(doc)
```

### **Content Analysis & Extraction**
```python
# Multi-format content processing
documents = ["research.docx", "data.xlsx", "slides.pptx"]

for doc in documents:
    # Comprehensive analysis
    text = await extract_text(doc, preserve_formatting=True)
    images = await extract_images(doc, min_width=200, min_height=200)
    metadata = await extract_metadata(doc)
```

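
The snippets above use `await` at the top level, which only works inside an already running event loop. As a minimal sketch of how such calls could be driven from a plain script, the example below wraps them in a coroutine and runs several files concurrently with `asyncio.gather`. The `extract_text` defined here is a stand-in, and its return shape is an assumption, not the server's actual response format.

```python
import asyncio


async def extract_text(path: str, preserve_formatting: bool = False) -> dict[str, object]:
    """Stand-in for the real tool; the returned dict shape is an assumption."""
    await asyncio.sleep(0)  # yield control, as real file or network I/O would
    return {"path": path, "text": f"<contents of {path}>", "formatted": preserve_formatting}


async def process_all(paths: list[str]) -> list[dict[str, object]]:
    # gather() runs the calls concurrently and returns results in input order.
    return await asyncio.gather(*(extract_text(p, preserve_formatting=True) for p in paths))


results = asyncio.run(process_all(["research.docx", "data.xlsx", "slides.pptx"]))
print([r["path"] for r in results])
```
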
## 🏗️ Architecture

### **Multi-Library Approach**
MCP Office Tools uses multiple libraries with intelligent fallbacks:

**Word Documents:**
- `python-docx` → `mammoth` → `docx2txt` → `olefile` (legacy)

**Excel Spreadsheets:**
- `openpyxl` → `pandas` → `xlrd` (legacy)

**PowerPoint Presentations:**
- `python-pptx` → `olefile` (legacy)

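
The fallback chains above can be pictured as an ordered list of extractors, each tried in turn until one succeeds. The following is only a sketch of that pattern, not the project's actual code: `extract_with_fallback`, `ExtractionError`, and the two stand-in extractors are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence

# An extractor takes a file path and returns text, raising on failure.
Extractor = Callable[[str], str]


class ExtractionError(Exception):
    """Raised when every extractor in the chain has failed."""


def extract_with_fallback(
    path: str, extractors: Sequence[tuple[str, Extractor]]
) -> tuple[str, str]:
    """Try each (name, extractor) in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, extractor in extractors:
        try:
            return name, extractor(path)
        except Exception as exc:  # any failure moves on to the next library
            errors.append(f"{name}: {exc}")
    raise ExtractionError(f"all extractors failed for {path}: " + "; ".join(errors))


# Hypothetical stand-ins for real library calls (python-docx, mammoth, ...).
def primary(path: str) -> str:
    raise ValueError("unsupported structure")


def secondary(path: str) -> str:
    return f"text from {path}"


method, text = extract_with_fallback(
    "report.docx", [("primary", primary), ("secondary", secondary)]
)
print(method, "->", text)
```

Returning the name of the method that succeeded lets a caller report which library produced the result, which is useful when outputs differ in fidelity.
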
### **Format Support Matrix**

| Format | Text | Images | Metadata | Legacy |
|--------|------|--------|----------|--------|
| .docx | ✅ | ✅ | ✅ | N/A |
| .doc | ✅ | ⚠️ | ⚠️ | ✅ |
| .xlsx | ✅ | ✅ | ✅ | N/A |
| .xls | ✅ | ⚠️ | ⚠️ | ✅ |
| .pptx | ✅ | ✅ | ✅ | N/A |
| .ppt | ⚠️ | ⚠️ | ⚠️ | ✅ |
| .csv | ✅ | N/A | ⚠️ | N/A |

*✅ Full support, ⚠️ Basic support, N/A Not applicable*

## 🔍 Advanced Features

### **URL Processing**
Process Office documents directly from URLs:

```python
# Direct URL processing
url_doc = "https://example.com/document.docx"
content = await extract_text(url_doc)

# Automatic caching (1-hour default)
cached_content = await extract_text(url_doc)  # Uses cache
```

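
To make the one-hour caching behaviour concrete, here is a minimal sketch of a time-to-live cache keyed by URL. It is an illustration under assumptions, not the server's implementation: the `TTLCache` class, the injected `fetch` function, and the fake clock are all hypothetical, and the demo never touches the network.

```python
import time
from typing import Callable


class TTLCache:
    """Cache fetched bytes per URL and expire entries after ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, bytes]] = {}

    def get(self, url: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no download
        data = self._fetch(url)
        self._entries[url] = (now, data)
        return data


# Demo with a fake downloader and a fake clock so no network is needed.
calls: list[str] = []
fake_now = [0.0]


def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b"document bytes"


cache = TTLCache(fake_fetch, ttl_seconds=3600.0, clock=lambda: fake_now[0])
cache.get("https://example.com/document.docx")  # downloads
cache.get("https://example.com/document.docx")  # served from cache
fake_now[0] = 3601.0                            # just over one hour later
cache.get("https://example.com/document.docx")  # expired, downloads again
print(len(calls))
```

Injecting the clock keeps the expiry logic testable without waiting for real time to pass.
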
### **Format Detection**
Intelligent format detection and validation:

```python
# Comprehensive format analysis
format_info = await detect_office_format("unknown_file.office")

# Returns:
# - Format name and category
# - MIME type validation
# - Legacy vs modern classification
# - Processing recommendations
```

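
One common way to tell legacy from modern Office files without trusting the extension is to inspect the container signature. Legacy binary files use the OLE2 compound file format, while modern OOXML files are ZIP archives with a `word/`, `xl/`, or `ppt/` folder. The sketch below illustrates that idea only. `sniff_office_format` and its return keys are hypothetical and do not describe what `detect_office_format` actually returns.

```python
import io
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file (legacy .doc/.xls/.ppt)
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP container (.docx/.xlsx/.pptx)

# Top-level folder inside an OOXML package -> document category.
OOXML_FOLDERS = {"word/": "word", "xl/": "excel", "ppt/": "powerpoint"}


def sniff_office_format(data: bytes) -> dict[str, str]:
    """Classify raw bytes by container signature (hypothetical return shape)."""
    if data.startswith(OLE2_MAGIC):
        # Telling Word, Excel, and PowerPoint apart would require reading OLE streams.
        return {"container": "ole2", "category": "unknown", "generation": "legacy"}
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            return {"container": "zip", "category": "unknown", "generation": "unknown"}
        for folder, category in OOXML_FOLDERS.items():
            if any(name.startswith(folder) for name in names):
                return {"container": "zip", "category": category, "generation": "modern"}
        return {"container": "zip", "category": "unknown", "generation": "modern"}
    return {"container": "unknown", "category": "unknown", "generation": "unknown"}


# Build a tiny in-memory package that looks like a .docx to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")

print(sniff_office_format(buffer.getvalue()))
print(sniff_office_format(OLE2_MAGIC + b"\x00" * 504))
```
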
### **Document Health Analysis**
Comprehensive document integrity checking:

```python
# Health assessment
health = await analyze_document_health("suspicious_file.docx")

# Returns:
# - Health score (1-10)
# - Validation results
# - Corruption detection
# - Processing recommendations
```

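
As an illustration of what an integrity check on a modern (ZIP-based) Office file might involve, the sketch below verifies the archive's checksums and looks for the `[Content_Types].xml` part that OOXML packages carry. The scoring rule (start at 10 and deduct per problem) and the name `check_ooxml_health` are assumptions made for this example, not the tool's actual algorithm. Legacy binary files would need different checks.

```python
import io
import zipfile


def check_ooxml_health(data: bytes) -> dict[str, object]:
    """Hypothetical scoring: start at 10 and deduct for each problem found."""
    issues: list[str] = []
    score = 10
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            bad_member = archive.testzip()  # first member with a bad CRC, or None
            if bad_member is not None:
                issues.append(f"corrupt member: {bad_member}")
                score -= 5
            if "[Content_Types].xml" not in archive.namelist():
                issues.append("missing [Content_Types].xml")
                score -= 3
    except zipfile.BadZipFile:
        issues.append("not a readable ZIP container")
        score = 1
    return {"score": max(score, 1), "issues": issues}


def build_package(names: list[str]) -> bytes:
    """Create a small in-memory ZIP with the given member names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "<xml/>")
    return buffer.getvalue()


print(check_ooxml_health(build_package(["[Content_Types].xml", "word/document.xml"])))
print(check_ooxml_health(build_package(["word/document.xml"])))
print(check_ooxml_health(b"not an office file"))
```
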
## 📈 Performance & Compatibility

### **System Requirements**
- **Python**: 3.11+
- **Memory**: 512MB+ available RAM
- **Storage**: 100MB+ for dependencies

### **Dependencies**
- **Core**: FastMCP, python-docx, openpyxl, python-pptx
- **Legacy**: olefile, xlrd, msoffcrypto-tool
- **Enhancement**: mammoth, pandas, Pillow

### **Platform Support**
- ✅ **Linux** (Ubuntu 20.04+, RHEL 8+)
- ✅ **macOS** (10.15+)
- ✅ **Windows** (10/11)
- ✅ **Docker** containers

## 🛠️ Development

### **Setup Development Environment**

```bash
# Clone repository
git clone https://github.com/mcp-office-tools/mcp-office-tools.git
cd mcp-office-tools

# Install with development dependencies
uv sync --dev

# Run tests
uv run pytest

# Code quality checks
uv run black src/ tests/
uv run ruff check src/ tests/
uv run mypy src/
```

### **Testing**

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=mcp_office_tools

# Test specific format
uv run pytest tests/test_word_extraction.py
```

## 🤝 Integration with MCP PDF Tools

MCP Office Tools is designed as a companion to [MCP PDF Tools](https://github.com/mcp-pdf-tools/mcp-pdf-tools):

```python
# pdf_tools and office_tools are placeholders for handles to the two servers

# Unified document processing workflow
pdf_content = await pdf_tools.extract_text("document.pdf")
docx_content = await office_tools.extract_text("document.docx")

# Cross-format analysis
pdf_metadata = await pdf_tools.extract_metadata("document.pdf")
docx_metadata = await office_tools.extract_metadata("document.docx")
```

## 📋 Supported Formats

```python
# Get all supported formats
formats = await get_supported_formats()

# Returns comprehensive format information:
# - 15+ file extensions
# - MIME type mappings
# - Category classifications
# - Processing capabilities
```

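
For a rough picture of the category classification, the extensions listed under "Universal Format Support" can be arranged into a lookup table. This is a sketch built only from the extensions named in this README. The `classify` helper and its return keys are hypothetical and are not the actual output of `get_supported_formats`.

```python
from pathlib import Path

# Extensions as listed under "Universal Format Support" in this README.
EXTENSIONS_BY_CATEGORY = {
    "word": [".docx", ".doc", ".docm", ".dotx", ".dot"],
    "excel": [".xlsx", ".xls", ".xlsm", ".xltx", ".xlt", ".csv"],
    "powerpoint": [".pptx", ".ppt", ".pptm", ".potx", ".pot"],
}

# Binary formats from Office 97-2003.
LEGACY_EXTENSIONS = {".doc", ".dot", ".xls", ".xlt", ".ppt", ".pot"}

# Inverted index: extension -> category.
CATEGORY_BY_EXTENSION = {
    ext: category
    for category, extensions in EXTENSIONS_BY_CATEGORY.items()
    for ext in extensions
}


def classify(path: str) -> dict[str, object]:
    """Classify a file by extension only (case-insensitive)."""
    ext = Path(path).suffix.lower()
    category = CATEGORY_BY_EXTENSION.get(ext)
    return {
        "extension": ext,
        "category": category,
        "supported": category is not None,
        "legacy": ext in LEGACY_EXTENSIONS,
    }


print(classify("Old_Data.XLS"))
print(classify("notes.txt"))
```
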
## 🔒 Security & Privacy

- **No data collection**: Documents processed locally
- **Temporary files**: Automatic cleanup after processing
- **URL validation**: Secure HTTPS-only downloads
- **Memory management**: Efficient processing of large files

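
The HTTPS-only rule above could be enforced with a small check before any download starts. The following is a minimal sketch using the standard library. `validate_download_url` is a hypothetical helper, and a real implementation may apply further checks that are not shown here.

```python
from urllib.parse import urlsplit


def validate_download_url(url: str) -> str:
    """Return the URL unchanged if it is an https URL with a host, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "https":  # urlsplit already lowercases the scheme
        raise ValueError(f"only https URLs are allowed, got scheme {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    return url


print(validate_download_url("https://example.com/document.docx"))

for rejected in ("http://example.com/document.docx", "file:///etc/passwd"):
    try:
        validate_download_url(rejected)
    except ValueError as exc:
        print("rejected:", exc)
```
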
## 📝 License

MIT License - see [LICENSE](LICENSE) file for details.

## 🚀 Coming Soon

- **Advanced Excel Tools**: Formula parsing, chart extraction
- **PowerPoint Enhancement**: Animation analysis, slide comparison
- **Document Conversion**: Cross-format conversion capabilities
- **Batch Processing**: Multi-document workflows
- **Cloud Integration**: Direct cloud storage support

---

**Built with ❤️ for the MCP ecosystem**

*MCP Office Tools - Comprehensive Microsoft Office document processing for modern AI workflows.*