mcp-office-tools/README.md
Ryan Malloy 036160d029
Update README with accurate tool documentation
- Document all 12 actual MCP tools (6 universal, 3 Word, 3 Excel)
- Add comprehensive format support matrix with feature breakdown
- Include practical usage examples with real output structures
- Add test dashboard section
- Simplify installation with uvx/Claude Code instructions
- Remove marketing fluff; focus on technical accuracy
2026-01-11 00:45:00 -07:00


<div align="center">

# 📊 MCP Office Tools

**Comprehensive Microsoft Office document processing for AI agents**

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/)
[![FastMCP](https://img.shields.io/badge/FastMCP-0.5+-green.svg?style=flat-square)](https://gofastmcp.com)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT)
[![MCP Protocol](https://img.shields.io/badge/MCP-Protocol-purple?style=flat-square)](https://modelcontextprotocol.io)

*Extract text, tables, images, formulas, and metadata from Word, Excel, PowerPoint, and CSV files*

[Installation](#-installation) • [Tools](#-available-tools) • [Examples](#-usage-examples) • [Testing](#-testing)

</div>

---
## ✨ Features
- **Universal extraction** - Text, images, and metadata from any Office format
- **Format-specific tools** - Deep analysis for Word and Excel (PowerPoint-specific tools are a work in progress)
- **Intelligent pagination** - Large documents automatically chunked for AI context limits
- **Multi-library fallbacks** - Tries a fallback library where one is available (see [Processing Libraries](#processing-libraries)) instead of failing silently
- **URL support** - Process documents directly from HTTP/HTTPS URLs with caching
- **Legacy format support** - Reads `.doc`, `.xls`, and `.ppt` files from Office 97-2003 (full text extraction; partial support for most other features)
---
## 🚀 Installation
```bash
# Quick install with uvx (recommended)
uvx mcp-office-tools
# Or install with uv/pip
uv add mcp-office-tools
pip install mcp-office-tools
```
### Claude Desktop Configuration
Add to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "office-tools": {
      "command": "uvx",
      "args": ["mcp-office-tools"]
    }
  }
}
```
### Claude Code Configuration
```bash
claude mcp add office-tools -- uvx mcp-office-tools
```
---
## 🛠 Available Tools
### Universal Tools
*Work with all Office formats: Word, Excel, PowerPoint, CSV*
| Tool | Description |
|------|-------------|
| `extract_text` | Extract text with optional formatting preservation |
| `extract_images` | Extract embedded images with size filtering |
| `extract_metadata` | Get document properties (author, dates, statistics) |
| `detect_office_format` | Identify format, version, encryption status |
| `analyze_document_health` | Check integrity, corruption, password protection |
| `get_supported_formats` | List all supported file extensions |
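
As a rough illustration of the kind of check `detect_office_format` relies on, the sketch below classifies a file by its leading bytes. It is not taken from this project's source: `sniff_container` and `sniff_file` are hypothetical names. The two signatures themselves are well known: modern Office files are ZIP containers, and Office 97-2003 files are OLE2 compound files.

```python
# Well-known file signatures (magic bytes).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file: .doc, .xls, .ppt
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP container: .docx, .xlsx, .pptx


def sniff_container(header: bytes) -> str:
    """Classify leading bytes as 'ole', 'zip', or 'unknown' (hypothetical helper)."""
    if header.startswith(OLE_MAGIC):
        return "ole"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify its container type."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```

A container check like this only separates legacy from modern files. Telling `.docx` from `.xlsx` still requires looking inside the ZIP archive, and an empty ZIP archive uses a different signature.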
### Word Tools
| Tool | Description |
|------|-------------|
| `convert_to_markdown` | Convert to Markdown with automatic pagination for large docs |
| `extract_word_tables` | Extract tables as structured JSON, CSV, or Markdown |
| `analyze_word_structure` | Analyze headings, sections, styles, and document hierarchy |
### Excel Tools
| Tool | Description |
|------|-------------|
| `analyze_excel_data` | Statistical analysis: data types, missing values, outliers |
| `extract_excel_formulas` | Extract formulas with values and dependency analysis |
| `create_excel_chart_data` | Generate Chart.js/Plotly-ready data from spreadsheets |
---
## 📋 Format Support
| Format | Extension | Text | Images | Metadata | Tables | Formulas |
|--------|-----------|:----:|:------:|:--------:|:------:|:--------:|
| **Word (Modern)** | `.docx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word (Legacy)** | `.doc` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **Word Template** | `.dotx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word Macro** | `.docm` | ✅ | ✅ | ✅ | ✅ | - |
| **Excel (Modern)** | `.xlsx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel (Legacy)** | `.xls` | ✅ | ⚠️ | ⚠️ | ✅ | ⚠️ |
| **Excel Template** | `.xltx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel Macro** | `.xlsm` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **PowerPoint (Modern)** | `.pptx` | ✅ | ✅ | ✅ | ✅ | - |
| **PowerPoint (Legacy)** | `.ppt` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **PowerPoint Template** | `.potx` | ✅ | ✅ | ✅ | ✅ | - |
| **CSV** | `.csv` | ✅ | - | ⚠️ | ✅ | - |

✅ Full support • ⚠️ Basic/partial support • - Not applicable

---
## 💡 Usage Examples
### Extract Text from Any Document
```python
# Simple extraction
result = await extract_text("report.docx")
print(result["text"])
# With formatting preserved
result = await extract_text(
    file_path="report.docx",
    preserve_formatting=True,
    include_metadata=True
)
```
### Convert Word to Markdown (with Pagination)
```python
# For large documents, results are automatically paginated
result = await convert_to_markdown("big-manual.docx")
# Continue with cursor for next page
if result.get("pagination", {}).get("has_more"):
    next_page = await convert_to_markdown(
        "big-manual.docx",
        cursor_id=result["pagination"]["cursor_id"]
    )
# Or use page ranges to get specific sections
result = await convert_to_markdown(
    "big-manual.docx",
    page_range="1-10"
)
# Or extract by chapter name
result = await convert_to_markdown(
    "big-manual.docx",
    chapter_name="Introduction"
)
```
### Analyze Excel Data Quality
```python
result = await analyze_excel_data(
    file_path="sales-data.xlsx",
    include_statistics=True,
    check_data_quality=True
)
# Returns per-column analysis
# {
#   "analysis": {
#     "Sheet1": {
#       "dimensions": {"rows": 1000, "columns": 12},
#       "column_info": {
#         "Revenue": {
#           "data_type": "float64",
#           "null_percentage": 2.3,
#           "statistics": {"mean": 45000, "median": 42000, ...},
#           "quality_issues": ["5 potential outliers"]
#         }
#       },
#       "data_quality": {
#         "completeness_percentage": 97.8,
#         "duplicate_rows": 12
#       }
#     }
#   }
# }
```
### Extract Excel Formulas
```python
result = await extract_excel_formulas(
    file_path="financial-model.xlsx",
    analyze_dependencies=True
)
# Returns formula details with dependency mapping
# {
#   "formulas": {
#     "Sheet1": [
#       {
#         "cell": "D2",
#         "formula": "=B2*C2",
#         "value": 1500.00,
#         "dependencies": ["B2", "C2"]
#       }
#     ]
#   }
# }
```
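The `dependencies` list can be thought of as the cell references found in the formula text. The sketch below is a simplified, hypothetical approach using a regular expression. It is not the tool's parser: it does not expand ranges such as `A1:A10` or handle sheet-qualified references, and it will misread function names that look like cells (for example `LOG10`).

```python
import re

# Matches A1-style references such as B2, $C$10, or AA7 (optionally absolute).
CELL_REF = re.compile(r"\$?([A-Z]{1,3})\$?([0-9]+)")


def formula_dependencies(formula: str) -> list[str]:
    """Return distinct cell references in a formula, in order of first appearance."""
    seen: list[str] = []
    for column, row in CELL_REF.findall(formula):
        ref = f"{column}{row}"  # normalise $A$1 to A1
        if ref not in seen:
            seen.append(ref)
    return seen


deps = formula_dependencies("=B2*C2")
```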
### Generate Chart Data
```python
result = await create_excel_chart_data(
    file_path="quarterly-revenue.xlsx",
    chart_type="line",
    output_format="chartjs"
)
# Returns ready-to-use Chart.js configuration
# {
#   "chartjs": {
#     "type": "line",
#     "data": {
#       "labels": ["Q1", "Q2", "Q3", "Q4"],
#       "datasets": [{"label": "Revenue", "data": [100, 120, 115, 140]}]
#     }
#   }
# }
```
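The output shape above maps directly onto tabular data: one column supplies the labels and each remaining column becomes a dataset. A minimal sketch of that mapping follows; `to_chartjs` and its parameters are hypothetical and only mirror the structure shown in the example output.

```python
from typing import Any


def to_chartjs(
    rows: list[dict[str, Any]],
    label_key: str,
    value_keys: list[str],
    chart_type: str = "line",
) -> dict[str, Any]:
    """Turn row dictionaries into a Chart.js-style config: labels plus one dataset per series."""
    return {
        "type": chart_type,
        "data": {
            "labels": [row[label_key] for row in rows],
            "datasets": [
                {"label": key, "data": [row[key] for row in rows]}
                for key in value_keys
            ],
        },
    }


rows = [
    {"Quarter": "Q1", "Revenue": 100},
    {"Quarter": "Q2", "Revenue": 120},
    {"Quarter": "Q3", "Revenue": 115},
    {"Quarter": "Q4", "Revenue": 140},
]
config = to_chartjs(rows, label_key="Quarter", value_keys=["Revenue"])
```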
### Extract Word Tables
```python
result = await extract_word_tables(
    file_path="contract.docx",
    output_format="markdown"
)
# Returns tables with optional format conversion
# {
#   "tables": [
#     {
#       "table_index": 0,
#       "dimensions": {"rows": 5, "columns": 3},
#       "converted_output": "| Name | Role | Department |\n|---|---|---|\n..."
#     }
#   ]
# }
```
### Process Documents from URLs
```python
# Documents are downloaded and cached automatically
result = await extract_text("https://example.com/report.docx")
# Cache expires after 1 hour by default
```
---
## 🧪 Testing
The project includes a comprehensive test suite with an interactive HTML dashboard:
```bash
# Run all tests with dashboard generation
make test
# Run just pytest
make test-pytest
# View the test dashboard
make view-dashboard
```
The test dashboard shows:
- Pass/fail statistics with MS Office-themed styling
- Detailed inputs and outputs for each test
- Expandable error tracebacks for failures
- Category breakdown (Word, Excel, PowerPoint)
---
## 🏗 Architecture
```
mcp-office-tools/
├── src/mcp_office_tools/
│   ├── server.py              # FastMCP server entry point
│   ├── mixins/
│   │   ├── universal.py       # Format-agnostic tools
│   │   ├── word.py            # Word-specific tools
│   │   ├── excel.py           # Excel-specific tools
│   │   └── powerpoint.py      # PowerPoint tools (WIP)
│   ├── utils/
│   │   ├── validation.py      # File validation
│   │   ├── file_detection.py  # Format detection
│   │   ├── caching.py         # URL caching
│   │   └── decorators.py      # Error handling, defaults
│   └── pagination.py          # Large document pagination
├── tests/                     # pytest test suite
└── reports/                   # Test dashboard output
```
### Processing Libraries
| Format | Primary Library | Fallback |
|--------|----------------|----------|
| `.docx` | python-docx | mammoth |
| `.xlsx` | openpyxl | pandas |
| `.pptx` | python-pptx | - |
| `.doc`/`.xls`/`.ppt` | olefile | - |
| `.csv` | pandas | built-in csv |
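
The primary/fallback pairing in the table can be pictured as an ordered chain of extractors where the first success wins and earlier failures are recorded. This is a generic sketch of that pattern, not the project's code: `extract_with_fallbacks` and the two demo extractors are hypothetical stand-ins for library-backed functions.

```python
from typing import Any, Callable, Sequence

Extractor = Callable[[str], str]


def extract_with_fallbacks(
    path: str, extractors: Sequence[tuple[str, Extractor]]
) -> dict[str, Any]:
    """Try each extractor in order; return the first success plus earlier errors."""
    errors: dict[str, str] = {}
    for name, extractor in extractors:
        try:
            return {"text": extractor(path), "method": name, "errors": errors}
        except Exception as exc:  # deliberately broad: any failure triggers the fallback
            errors[name] = f"{type(exc).__name__}: {exc}"
    # Nothing worked: fail loudly with every recorded error.
    raise RuntimeError(f"All extraction methods failed for {path}: {errors}")


def primary(path: str) -> str:
    """Stand-in for a primary-library extractor that fails on this file."""
    raise ValueError("corrupt styles part")


def fallback(path: str) -> str:
    """Stand-in for a fallback-library extractor that succeeds."""
    return "plain text"


outcome = extract_with_fallbacks("report.docx", [("primary", primary), ("fallback", fallback)])
```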
---
## 🔧 Development
```bash
# Clone and install
git clone https://github.com/yourusername/mcp-office-tools.git
cd mcp-office-tools
uv sync --dev
# Run tests
uv run pytest
# Format and lint
uv run black src/ tests/
uv run ruff check src/ tests/
# Type check
uv run mypy src/
```
---
## 📦 Dependencies
**Core:**
- `fastmcp` - MCP server framework
- `python-docx` - Word document processing
- `openpyxl` - Excel spreadsheet processing
- `python-pptx` - PowerPoint processing
- `pandas` - Data analysis and CSV handling
- `mammoth` - Word to HTML/Markdown conversion
- `olefile` - Legacy OLE format support
- `xlrd` - Legacy Excel support
- `pillow` - Image processing
- `aiohttp` / `aiofiles` - Async HTTP and file I/O

**Optional:**
- `python-magic` - Enhanced MIME type detection
- `msoffcrypto-tool` - Encrypted file detection
---
## 🤝 Related Projects
- **[MCP PDF Tools](https://github.com/yourusername/mcp-pdf-tools)** - Companion server for PDF processing
- **[FastMCP](https://gofastmcp.com)** - The framework powering this server
---
## 📜 License
MIT License - see [LICENSE](LICENSE) for details.

---

<div align="center">

**Built with [FastMCP](https://gofastmcp.com) and the [Model Context Protocol](https://modelcontextprotocol.io)**

</div>