
<div align="center">

# 📊 MCP Office Tools

**MCP server for extracting text, tables, images, and data from Microsoft Office files**

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/)
[![FastMCP](https://img.shields.io/badge/FastMCP-0.5+-green.svg?style=flat-square)](https://gofastmcp.com)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT)
[![MCP Protocol](https://img.shields.io/badge/MCP-Protocol-purple?style=flat-square)](https://modelcontextprotocol.io)

*Word, Excel, PowerPoint, CSV — all the formats your AI agent needs to read but can't*

[Installation](#-installation) • [Tools](#-available-tools) • [Examples](#-usage-examples) • [Testing](#-testing)

</div>

---

## ✨ Features

- **Universal extraction** — Pull text, images, and metadata from any Office format
- **Format-specific tools** — Deep analysis for Word (tables, structure) and Excel (formulas, charts); PowerPoint-specific tools are still in progress
- **Automatic pagination** — Large documents get chunked so they don't blow up your context window
- **Fallback processing** — When one library chokes on a weird file, we try another. No silent failures.
- **URL support** — Pass a URL instead of a file path; we'll download and cache it
- **Legacy formats** — Yes, even those .doc and .xls files from 2003 still work

---

## 🚀 Installation

```bash
# Quick install with uvx (recommended)
uvx mcp-office-tools

# Or install with uv/pip
uv add mcp-office-tools
pip install mcp-office-tools
```

### Claude Desktop Configuration

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "office-tools": {
      "command": "uvx",
      "args": ["mcp-office-tools"]
    }
  }
}
```

### Claude Code Configuration

```bash
claude mcp add office-tools -- uvx mcp-office-tools
```

---

## 🛠 Available Tools

### Universal Tools
*Work with all Office formats: Word, Excel, PowerPoint, CSV*

| Tool | Description |
|------|-------------|
| `extract_text` | Extract text with optional formatting preservation |
| `extract_images` | Extract embedded images with size filtering |
| `extract_metadata` | Get document properties (author, dates, statistics) |
| `detect_office_format` | Identify format, version, encryption status |
| `analyze_document_health` | Check integrity, corruption, password protection |
| `get_supported_formats` | List all supported file extensions |
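
To give a feel for what `detect_office_format` has to tell apart: modern Office files are ZIP containers, while the legacy 97-2003 formats are OLE compound files, so the first few bytes already separate the two families. The sketch below is an illustration only, not the server's implementation; `sniff_container` and `sniff_file` are hypothetical helper names.

```python
from pathlib import Path

# Well-known container signatures (general file-format facts, not project code).
ZIP_MAGIC = b"PK\x03\x04"                        # .docx/.xlsx/.pptx are ZIP archives
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # .doc/.xls/.ppt are OLE compound files


def sniff_container(header: bytes) -> str:
    """Classify a file by its leading bytes (hypothetical helper)."""
    if header.startswith(ZIP_MAGIC):
        return "ooxml-zip"
    if header.startswith(OLE_MAGIC):
        return "ole-legacy"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of the file and classify them."""
    with Path(path).open("rb") as fh:
        return sniff_container(fh.read(8))
```

A ZIP signature only says "this might be OOXML"; telling `.docx` from `.xlsx` (or from an ordinary ZIP) still requires looking inside the archive, which is where the real tool presumably does more work.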

### Word Tools

| Tool | Description |
|------|-------------|
| `convert_to_markdown` | Convert to Markdown with automatic pagination for large docs |
| `extract_word_tables` | Extract tables as structured JSON, CSV, or Markdown |
| `analyze_word_structure` | Analyze headings, sections, styles, and document hierarchy |

### Excel Tools

| Tool | Description |
|------|-------------|
| `analyze_excel_data` | Statistical analysis: data types, missing values, outliers |
| `extract_excel_formulas` | Extract formulas with values and dependency analysis |
| `create_excel_chart_data` | Generate Chart.js/Plotly-ready data from spreadsheets |

---

## 📋 Format Support

Here's what works and what's "good enough" — legacy formats from Office 97-2003 have more limited extraction, but they still work:

| Format | Extension | Text | Images | Metadata | Tables | Formulas |
|--------|-----------|:----:|:------:|:--------:|:------:|:--------:|
| **Word (Modern)** | `.docx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word (Legacy)** | `.doc` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **Word Template** | `.dotx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word Macro** | `.docm` | ✅ | ✅ | ✅ | ✅ | - |
| **Excel (Modern)** | `.xlsx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel (Legacy)** | `.xls` | ✅ | ⚠️ | ⚠️ | ✅ | ⚠️ |
| **Excel Template** | `.xltx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel Macro** | `.xlsm` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **PowerPoint (Modern)** | `.pptx` | ✅ | ✅ | ✅ | ✅ | - |
| **PowerPoint (Legacy)** | `.ppt` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **PowerPoint Template** | `.potx` | ✅ | ✅ | ✅ | ✅ | - |
| **CSV** | `.csv` | ✅ | - | ⚠️ | ✅ | - |

✅ Full support • ⚠️ Basic/partial support • - Not applicable

---

## 💡 Usage Examples

### Extract Text from Any Document

```python
# Simple extraction
result = await extract_text("report.docx")
print(result["text"])

# With formatting preserved
result = await extract_text(
    file_path="report.docx",
    preserve_formatting=True,
    include_metadata=True
)
```

### Convert Word to Markdown (with Pagination)

Large documents get paginated automatically. Three ways to handle it:

```python
# Option 1: Follow the cursor for each chunk
result = await convert_to_markdown("big-manual.docx")
if result.get("pagination", {}).get("has_more"):
    next_page = await convert_to_markdown(
        "big-manual.docx",
        cursor_id=result["pagination"]["cursor_id"]
    )

# Option 2: Grab specific pages
result = await convert_to_markdown("big-manual.docx", page_range="1-10")

# Option 3: Extract by chapter heading
result = await convert_to_markdown("big-manual.docx", chapter_name="Introduction")
```

### Analyze Excel Data Quality

```python
result = await analyze_excel_data(
    file_path="sales-data.xlsx",
    include_statistics=True,
    check_data_quality=True
)

# Returns per-column analysis
# {
#   "analysis": {
#     "Sheet1": {
#       "dimensions": {"rows": 1000, "columns": 12},
#       "column_info": {
#         "Revenue": {
#           "data_type": "float64",
#           "null_percentage": 2.3,
#           "statistics": {"mean": 45000, "median": 42000, ...},
#           "quality_issues": ["5 potential outliers"]
#         }
#       },
#       "data_quality": {
#         "completeness_percentage": 97.8,
#         "duplicate_rows": 12
#       }
#     }
#   }
# }
```
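
Once you have that structure, pulling out just the problem columns is a short walk over the dict. A minimal sketch, assuming the result shape shown in the comment above; `columns_with_issues` is a hypothetical helper, not part of the server.

```python
from typing import Any


def columns_with_issues(result: dict[str, Any]) -> dict[str, list[str]]:
    """Map "Sheet!Column" to its quality issues, skipping clean columns."""
    flagged: dict[str, list[str]] = {}
    for sheet, info in result.get("analysis", {}).items():
        for column, details in info.get("column_info", {}).items():
            issues = details.get("quality_issues", [])
            if issues:
                flagged[f"{sheet}!{column}"] = list(issues)
    return flagged


# Hand-written sample in the same shape as the example output above.
analysis_sample = {
    "analysis": {
        "Sheet1": {
            "column_info": {
                "Revenue": {"null_percentage": 2.3, "quality_issues": ["5 potential outliers"]},
                "Region": {"null_percentage": 0.0, "quality_issues": []},
            }
        }
    }
}
flagged = columns_with_issues(analysis_sample)
```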

### Extract Excel Formulas

```python
result = await extract_excel_formulas(
    file_path="financial-model.xlsx",
    analyze_dependencies=True
)

# Returns formula details with dependency mapping
# {
#   "formulas": {
#     "Sheet1": [
#       {
#         "cell": "D2",
#         "formula": "=B2*C2",
#         "value": 1500.00,
#         "dependencies": ["B2", "C2"]
#       }
#     ]
#   }
# }
```
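
The dependency lists are handy for impact analysis: invert them and you can see which formulas are affected when an input cell changes. A minimal sketch, assuming the result shape shown above and that every dependency refers to a cell on the same sheet (cross-sheet references would need extra handling); `dependents_by_cell` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any


def dependents_by_cell(result: dict[str, Any]) -> dict[str, list[str]]:
    """Invert the dependency lists: for each input cell, which formula cells read it."""
    dependents: dict[str, list[str]] = defaultdict(list)
    for sheet, formulas in result.get("formulas", {}).items():
        for entry in formulas:
            for source in entry.get("dependencies", []):
                dependents[f"{sheet}!{source}"].append(f"{sheet}!{entry['cell']}")
    return dict(dependents)


# Hand-written sample; the E2 row is added here for illustration.
formula_sample = {
    "formulas": {
        "Sheet1": [
            {"cell": "D2", "formula": "=B2*C2", "value": 1500.0, "dependencies": ["B2", "C2"]},
            {"cell": "E2", "formula": "=D2*0.2", "value": 300.0, "dependencies": ["D2"]},
        ]
    }
}
impact = dependents_by_cell(formula_sample)
```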

### Generate Chart Data

```python
result = await create_excel_chart_data(
    file_path="quarterly-revenue.xlsx",
    chart_type="line",
    output_format="chartjs"
)

# Returns ready-to-use Chart.js configuration
# {
#   "chartjs": {
#     "type": "line",
#     "data": {
#       "labels": ["Q1", "Q2", "Q3", "Q4"],
#       "datasets": [{"label": "Revenue", "data": [100, 120, 115, 140]}]
#     }
#   }
# }
```

### Extract Word Tables

```python
result = await extract_word_tables(
    file_path="contract.docx",
    output_format="markdown"
)

# Returns tables with optional format conversion
# {
#   "tables": [
#     {
#       "table_index": 0,
#       "dimensions": {"rows": 5, "columns": 3},
#       "converted_output": "| Name | Role | Department |\n|---|---|---|\n..."
#     }
#   ]
# }
```

### Process Documents from URLs

```python
# Documents are downloaded and cached automatically
result = await extract_text("https://example.com/report.docx")

# Cache expires after 1 hour by default
```

---

## 🧪 Testing

We built a visual test dashboard because staring at pytest output gets old. Run `make test` and you get an HTML report with pass/fail stats, detailed I/O for each test, and expandable tracebacks when things break.

```bash
# Run tests and generate the dashboard
make test

# Just pytest, no dashboard
make test-pytest

# Open existing dashboard
make view-dashboard
```

The dashboard has an MS Office-inspired theme (Word blue, Excel green, PowerPoint orange) and groups tests by category so you can see what's working at a glance.

---

## 🏗 Architecture

The mixin pattern keeps things modular — universal tools work on everything, format-specific tools go deeper. When the primary library can't handle something (corrupted files, weird formatting), we fall back to alternatives.

```
mcp-office-tools/
├── src/mcp_office_tools/
│   ├── server.py              # FastMCP server entry point
│   ├── mixins/
│   │   ├── universal.py       # Format-agnostic tools
│   │   ├── word.py            # Word-specific tools
│   │   ├── excel.py           # Excel-specific tools
│   │   └── powerpoint.py      # PowerPoint tools (WIP)
│   ├── utils/
│   │   ├── validation.py      # File validation
│   │   ├── file_detection.py  # Format detection
│   │   ├── caching.py         # URL caching
│   │   └── decorators.py      # Error handling, defaults
│   └── pagination.py          # Large document pagination
├── tests/                     # pytest test suite
└── reports/                   # Test dashboard output
```
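
To make the mixin idea concrete, here is a stripped-down sketch in plain Python. It does not use FastMCP and is not the project's code: the `tool` decorator, the method bodies, and `OfficeServer` are all hypothetical stand-ins that only show how each mixin can contribute its own tools to one composed server class.

```python
from typing import Callable


def tool(func: Callable) -> Callable:
    """Mark a method as an exposed tool (hypothetical decorator)."""
    func._is_tool = True
    return func


class UniversalMixin:
    @tool
    def extract_text(self, file_path: str) -> dict:
        return {"tool": "extract_text", "file": file_path}


class WordMixin:
    @tool
    def extract_word_tables(self, file_path: str) -> dict:
        return {"tool": "extract_word_tables", "file": file_path}


class ExcelMixin:
    @tool
    def extract_excel_formulas(self, file_path: str) -> dict:
        return {"tool": "extract_excel_formulas", "file": file_path}


class OfficeServer(UniversalMixin, WordMixin, ExcelMixin):
    """Composes the mixins; each one contributes its own tools."""

    def list_tools(self) -> list[str]:
        # Collect every attribute that the decorator flagged as a tool.
        return sorted(
            name
            for name in dir(self)
            if getattr(getattr(self, name), "_is_tool", False)
        )


server = OfficeServer()
tools = server.list_tools()
```

Adding a new format then means writing one more mixin and listing it in the server's base classes, without touching the existing ones.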

### Processing Libraries

| Format | Primary Library | Fallback |
|--------|----------------|----------|
| `.docx` | python-docx | mammoth |
| `.xlsx` | openpyxl | pandas |
| `.pptx` | python-pptx | - |
| `.doc`/`.xls`/`.ppt` | olefile | - |
| `.csv` | pandas | built-in csv |
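
The primary/fallback pairing above can be pictured as an ordered list of extractors tried one after another. This is a minimal sketch of that idea, not the server's code: `extract_with_fallback`, `primary`, and `secondary` are hypothetical names, and the two stand-in functions take the place of real library calls (for example python-docx first, mammoth second).

```python
from typing import Callable

Extractor = Callable[[str], str]


def extract_with_fallback(file_path: str, extractors: list[tuple[str, Extractor]]) -> dict:
    """Try each extractor in order; report which one worked and what failed before it."""
    errors: list[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(file_path)
        except Exception as exc:  # broad on purpose: any library error triggers the fallback
            errors.append(f"{name}: {exc}")
            continue
        return {"text": text, "method": name, "fallback_errors": errors}
    # Nothing worked: fail loudly instead of returning an empty result.
    raise RuntimeError(f"all extractors failed for {file_path}: {errors}")


# Stand-ins for real library calls.
def primary(path: str) -> str:
    raise ValueError("unsupported structure")


def secondary(path: str) -> str:
    return f"text from {path}"


outcome = extract_with_fallback(
    "odd.docx", [("primary", primary), ("secondary", secondary)]
)
```

Recording the earlier errors alongside the successful method is what keeps a fallback from turning into a silent failure.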

---

## 🔧 Development

```bash
# Clone and install
git clone https://github.com/yourusername/mcp-office-tools.git
cd mcp-office-tools
uv sync --dev

# Run tests
uv run pytest

# Format and lint
uv run black src/ tests/
uv run ruff check src/ tests/

# Type check
uv run mypy src/
```

---

## 📦 Dependencies

**Core:**
- `fastmcp` - MCP server framework
- `python-docx` - Word document processing
- `openpyxl` - Excel spreadsheet processing
- `python-pptx` - PowerPoint processing
- `pandas` - Data analysis and CSV handling
- `mammoth` - Word to HTML/Markdown conversion
- `olefile` - Legacy OLE format support
- `xlrd` - Legacy Excel support
- `pillow` - Image processing
- `aiohttp` / `aiofiles` - Async HTTP and file I/O

**Optional:**
- `python-magic` - Enhanced MIME type detection
- `msoffcrypto-tool` - Encrypted file detection

---

## 🤝 Related Projects

- **[MCP PDF Tools](https://github.com/yourusername/mcp-pdf-tools)** - Companion server for PDF processing
- **[FastMCP](https://gofastmcp.com)** - The framework powering this server

---

## 📝 Behind the Scenes

This README was rewritten during a human-AI collaboration session. The process raised questions about discernment, voice, and what makes documentation actually land:

- **[AI Isn't New. Your Discernment Is What Matters.](https://ryanmalloy.com/blog/ai-discernment)** — Ryan's take on 40 years of writing code and why discernment matters more than the tools

---

## 📜 License

MIT License - see [LICENSE](LICENSE) for details.

---

<div align="center">

**Built with [FastMCP](https://gofastmcp.com) and the [Model Context Protocol](https://modelcontextprotocol.io)**

</div>