# MCP Office Tools
**MCP server for extracting text, tables, images, and data from Microsoft Office files**
*Word, Excel, PowerPoint, CSV: all the formats your AI agent needs to read but can't*
[Installation](#installation) • [Tools](#available-tools) • [Examples](#usage-examples) • [Testing](#testing)
---
## Features
- **Universal extraction** - Pull text, images, and metadata from any Office format
- **Format-specific tools** - Deep analysis for Word (tables, structure), Excel (formulas, charts), PowerPoint
- **Automatic pagination** - Large documents get chunked so they don't blow up your context window
- **Fallback processing** - When one library chokes on a weird file, we try another. No silent failures.
- **URL support** - Pass a URL instead of a file path; we'll download and cache it
- **Legacy formats** - Yes, even those .doc and .xls files from 2003 still work
---
## Installation
```bash
# Quick install with uvx (recommended)
uvx mcp-office-tools
# Or install with uv/pip
uv add mcp-office-tools
pip install mcp-office-tools
```
### Claude Desktop Configuration
Add to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "office-tools": {
      "command": "uvx",
      "args": ["mcp-office-tools"]
    }
  }
}
```
### Claude Code Configuration
```bash
claude mcp add office-tools "uvx mcp-office-tools"
```
---
## Available Tools
### Universal Tools
*Work with all Office formats: Word, Excel, PowerPoint, CSV*
| Tool | Description |
|------|-------------|
| `extract_text` | Extract text with optional formatting preservation |
| `extract_images` | Extract embedded images with size filtering |
| `extract_metadata` | Get document properties (author, dates, statistics) |
| `detect_office_format` | Identify format, version, encryption status |
| `analyze_document_health` | Check integrity, corruption, password protection |
| `get_supported_formats` | List all supported file extensions |
| `index_document` | Scan document and create resource URIs for on-demand fetching |
### Word Tools
| Tool | Description |
|------|-------------|
| `convert_to_markdown` | Convert to Markdown with automatic pagination for large docs |
| `extract_word_tables` | Extract tables as structured JSON, CSV, or Markdown |
| `analyze_word_structure` | Analyze headings, sections, styles, and document hierarchy |
| `get_document_outline` | Get structured outline with chapter detection and word counts |
| `check_style_consistency` | Find formatting issues, missing chapters, style problems |
| `search_document` | Search text with context and chapter location |
| `extract_entities` | Extract people, places, organizations using pattern recognition |
| `get_chapter_summaries` | Generate chapter previews with opening sentences |
| `save_reading_progress` | Bookmark your reading position for later |
| `get_reading_progress` | Resume reading from saved position |
### Excel Tools
| Tool | Description |
|------|-------------|
| `analyze_excel_data` | Statistical analysis: data types, missing values, outliers |
| `extract_excel_formulas` | Extract formulas with values and dependency analysis |
| `create_excel_chart_data` | Generate Chart.js/Plotly-ready data from spreadsheets |
---
## Format Support
Here's what works and what's "good enough". Legacy formats from Office 97-2003 have more limited extraction, but they still work:
| Format | Extension | Text | Images | Metadata | Tables | Formulas |
|--------|-----------|:----:|:------:|:--------:|:------:|:--------:|
| **Word (Modern)** | `.docx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word (Legacy)** | `.doc` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **Word Template** | `.dotx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word Macro** | `.docm` | ✅ | ✅ | ✅ | ✅ | - |
| **Excel (Modern)** | `.xlsx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel (Legacy)** | `.xls` | ✅ | ⚠️ | ⚠️ | ✅ | ⚠️ |
| **Excel Template** | `.xltx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel Macro** | `.xlsm` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **PowerPoint (Modern)** | `.pptx` | ✅ | ✅ | ✅ | ✅ | - |
| **PowerPoint (Legacy)** | `.ppt` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **PowerPoint Template** | `.potx` | ✅ | ✅ | ✅ | ✅ | - |
| **CSV** | `.csv` | ✅ | - | ⚠️ | ✅ | - |

✅ Full support • ⚠️ Basic/partial support • - Not applicable
---
## MCP Resources
Instead of returning entire documents in tool responses, you can index a document once and fetch content on-demand via URI-based resources. This keeps context windows manageable when working with large files.
### How It Works
1. **Index the document** - `index_document` scans the file and returns URIs
2. **Fetch what you need** - Request specific chapters, sheets, slides, or images by URI
3. **Format on demand** - Append `.txt` or `.html` to get different output formats
### Resource URI Patterns
| URI Pattern | Description | Example |
|-------------|-------------|---------|
| `chapter://{doc_id}/{n}` | Single chapter/section | `chapter://abc123/3` |
| `chapters://{doc_id}/{range}` | Multiple chapters | `chapters://abc123/1-5` |
| `section://{doc_id}/{n}` | Section by heading style | `section://abc123/2` |
| `paragraph://{doc_id}/{ch}/{p}` | Specific paragraph | `paragraph://abc123/3/7` |
| `sheet://{doc_id}/{name}` | Excel sheet as markdown table | `sheet://abc123/Revenue` |
| `slide://{doc_id}/{n}` | PowerPoint slide | `slide://abc123/5` |
| `slides://{doc_id}/{range}` | Multiple slides | `slides://abc123/1,3,5` |
| `image://{doc_id}/{n}` | Embedded image | `image://abc123/0` |
### Format Suffixes
Append a format suffix to convert on the fly:
| Suffix | Output |
|--------|--------|
| `.md` (default) | Markdown |
| `.txt` | Plain text (no formatting) |
| `.html` | Basic HTML |
Examples:
- `chapter://abc123/3` → Markdown (default)
- `chapter://abc123/3.txt` → Plain text
- `chapter://abc123/3.html` → HTML
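
To make the suffix rule concrete, here is a minimal sketch of how a client might split a resource URI into its parts. `ResourceRef` and `parse_resource_uri` are hypothetical names used for illustration only; they are not part of this server's API, and the server's own parsing may differ.

```python
from typing import NamedTuple

# Suffixes from the table above; Markdown is assumed when no suffix is given.
KNOWN_FORMATS = ("md", "txt", "html")


class ResourceRef(NamedTuple):
    kind: str      # URI scheme, e.g. "chapter" or "sheet"
    doc_id: str    # ID returned by index_document
    selector: str  # chapter number, range, sheet name, ...
    fmt: str       # "md", "txt", or "html"


def parse_resource_uri(uri: str) -> ResourceRef:
    """Split 'kind://doc_id/selector[.fmt]' into its parts (hypothetical helper)."""
    kind, sep, rest = uri.partition("://")
    doc_id, slash, selector = rest.partition("/")
    if not (sep and slash and kind and doc_id and selector):
        raise ValueError(f"not a resource URI: {uri!r}")
    fmt = "md"
    stem, dot, suffix = selector.rpartition(".")
    # Only strip the suffix when it is a known format, so a sheet
    # name such as "Q1.2024" is left untouched.
    if dot and stem and suffix in KNOWN_FORMATS:
        selector, fmt = stem, suffix
    return ResourceRef(kind, doc_id, selector, fmt)


print(parse_resource_uri("chapter://abc123/3.txt"))
```

Treating unknown suffixes as part of the selector is a design choice in this sketch, not documented server behavior.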
### Range Syntax
Fetch multiple items at once:
- `1-5`: Items 1 through 5
- `1,3,5`: Specific items
- `1-3,7,9-10`: Mixed ranges
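
The range syntax can be expanded in a few lines. This is a sketch under stated assumptions, not the server's implementation: `expand_range` is a hypothetical helper, bounds are treated as inclusive (matching "Items 1 through 5"), and order and duplicates are kept as written.

```python
def expand_range(spec: str) -> list[int]:
    """Expand a spec such as '1-3,7,9-10' into a list of item numbers."""
    items: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            items.extend(range(start, end + 1))  # inclusive upper bound
        else:
            items.append(int(part))
    return items


print(expand_range("1-3,7,9-10"))  # [1, 2, 3, 7, 9, 10]
```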
### Section Detection
The indexer detects document structure automatically:
1. **Heading 1 styles** (primary) - Business docs, manuals, technical documents
2. **"Chapter X" text patterns** (fallback) - Books, manuscripts, narratives
Use `text_patterns_only=True` to skip heading style detection for documents with messy formatting.
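
The detection order can be sketched on plain data. Everything below is illustrative: `detect_sections` is a hypothetical function, the paragraphs are simple `(style_name, text)` tuples, and the "Chapter X" regex is an assumption, since the indexer's real pattern is not shown here.

```python
import re

# Assumed pattern for the "Chapter X" fallback (digits or Roman numerals).
CHAPTER_RE = re.compile(r"^chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def detect_sections(paragraphs, text_patterns_only=False):
    """Return (index, title) pairs marking where sections start.

    `paragraphs` is a list of (style_name, text) tuples.
    """
    if not text_patterns_only:
        # Primary: paragraphs styled as "Heading 1".
        headings = [
            (i, text)
            for i, (style, text) in enumerate(paragraphs)
            if style == "Heading 1" and text.strip()
        ]
        if headings:
            return headings
    # Fallback: lines that look like "Chapter 3" or "Chapter IV".
    return [
        (i, text)
        for i, (_, text) in enumerate(paragraphs)
        if CHAPTER_RE.match(text.strip())
    ]


doc = [
    ("Title", "My Book"),
    ("Normal", "Chapter 1: Start"),
    ("Heading 1", "Overview"),
    ("Normal", "Some body text"),
    ("Normal", "Chapter II"),
]
print(detect_sections(doc))
print(detect_sections(doc, text_patterns_only=True))
```

With python-docx, the tuples could be built from each paragraph's `style.name` and `text` attributes.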
---
## MCP Prompts
Pre-built workflows that chain multiple tools together. Use these as starting points:
| Prompt | Level | Description |
|--------|-------|-------------|
| `explore-document` | Basic | Start with any new document - get structure and identify issues |
| `find-character` | Basic | Track all mentions of a person/character with context |
| `chapter-preview` | Basic | Quick overview of each chapter without full read |
| `resume-reading` | Intermediate | Check saved position and continue reading |
| `document-analysis` | Intermediate | Comprehensive multi-tool analysis |
| `character-journey` | Advanced | Track character arc through entire narrative |
| `document-comparison` | Advanced | Compare entities and themes between chapters |
| `full-reading-session` | Advanced | Guided reading with bookmarking |
| `manuscript-review` | Advanced | Complete editorial workflow for editors |
---
## Usage Examples
### Extract Text from Any Document
```python
# Simple extraction
result = await extract_text("report.docx")
print(result["text"])

# With formatting preserved
result = await extract_text(
    file_path="report.docx",
    preserve_formatting=True,
    include_metadata=True
)
```
### Convert Word to Markdown (with Pagination)
Large documents get paginated automatically. Three ways to handle it:
```python
# Option 1: Follow the cursor for each chunk
result = await convert_to_markdown("big-manual.docx")
if result.get("pagination", {}).get("has_more"):
    next_page = await convert_to_markdown(
        "big-manual.docx",
        cursor_id=result["pagination"]["cursor_id"]
    )

# Option 2: Grab specific pages
result = await convert_to_markdown("big-manual.docx", page_range="1-10")

# Option 3: Extract by chapter heading
result = await convert_to_markdown("big-manual.docx", chapter_name="Introduction")
```
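
Option 1 fetches only one extra chunk. To read a whole document, the cursor can be followed in a loop. The sketch below assumes every paginated response carries `pagination.has_more` and `pagination.cursor_id` as in the example above. `collect_all_pages` and `fake_convert` are hypothetical, and the `"markdown"` key is an illustrative assumption about the response shape.

```python
import asyncio


async def collect_all_pages(fetch_page, file_path):
    """Call `fetch_page` repeatedly, following cursor_id until has_more is false."""
    result = await fetch_page(file_path)
    pages = [result]
    while result.get("pagination", {}).get("has_more"):
        cursor_id = result["pagination"]["cursor_id"]
        result = await fetch_page(file_path, cursor_id=cursor_id)
        pages.append(result)
    return pages


async def fake_convert(file_path, cursor_id=None):
    """Stand-in for convert_to_markdown that serves three fixed chunks."""
    index = 0 if cursor_id is None else int(cursor_id)
    pagination = {"has_more": index < 2}
    if pagination["has_more"]:
        pagination["cursor_id"] = str(index + 1)
    return {"markdown": f"chunk {index}", "pagination": pagination}


pages = asyncio.run(collect_all_pages(fake_convert, "big-manual.docx"))
print(len(pages))  # 3
```

Passing the fetch function in as an argument keeps the loop testable without a running server.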
### Analyze Excel Data Quality
```python
result = await analyze_excel_data(
    file_path="sales-data.xlsx",
    include_statistics=True,
    check_data_quality=True
)

# Returns per-column analysis
# {
#   "analysis": {
#     "Sheet1": {
#       "dimensions": {"rows": 1000, "columns": 12},
#       "column_info": {
#         "Revenue": {
#           "data_type": "float64",
#           "null_percentage": 2.3,
#           "statistics": {"mean": 45000, "median": 42000, ...},
#           "quality_issues": ["5 potential outliers"]
#         }
#       },
#       "data_quality": {
#         "completeness_percentage": 97.8,
#         "duplicate_rows": 12
#       }
#     }
#   }
# }
```
### Extract Excel Formulas
```python
result = await extract_excel_formulas(
    file_path="financial-model.xlsx",
    analyze_dependencies=True
)

# Returns formula details with dependency mapping
# {
#   "formulas": {
#     "Sheet1": [
#       {
#         "cell": "D2",
#         "formula": "=B2*C2",
#         "value": 1500.00,
#         "dependencies": ["B2", "C2"]
#       }
#     ]
#   }
# }
```
### Generate Chart Data
```python
result = await create_excel_chart_data(
    file_path="quarterly-revenue.xlsx",
    chart_type="line",
    output_format="chartjs"
)

# Returns ready-to-use Chart.js configuration
# {
#   "chartjs": {
#     "type": "line",
#     "data": {
#       "labels": ["Q1", "Q2", "Q3", "Q4"],
#       "datasets": [{"label": "Revenue", "data": [100, 120, 115, 140]}]
#     }
#   }
# }
```
### Extract Word Tables
```python
result = await extract_word_tables(
    file_path="contract.docx",
    output_format="markdown"
)

# Returns tables with optional format conversion
# {
#   "tables": [
#     {
#       "table_index": 0,
#       "dimensions": {"rows": 5, "columns": 3},
#       "converted_output": "| Name | Role | Department |\n|---|---|---|\n..."
#     }
#   ]
# }
```
### Process Documents from URLs
```python
# Documents are downloaded and cached automatically
result = await extract_text("https://example.com/report.docx")
# Cache expires after 1 hour by default
```
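
The cache internals are not shown in this README. As a rough illustration of time-based expiry with a one-hour default, here is a minimal in-memory sketch. `TTLCache` is a hypothetical class, and the real server may well cache to disk or key entries differently.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries = {}   # key -> (stored_at, value)

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # drop the stale entry
            return None
        return value


cache = TTLCache()
cache.put("https://example.com/report.docx", b"downloaded bytes")
print(cache.get("https://example.com/report.docx") is not None)  # True
```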
### Index Document for On-Demand Resource Fetching
```python
# Index the document - returns URIs for all content
result = await index_document("novel.docx")

# Returns:
# {
#   "doc_id": "56036b0f171a",
#   "resources": {
#     "chapter": [
#       {"id": "1", "title": "Chapter 1: The Beginning", "uri": "chapter://56036b0f171a/1"},
#       {"id": "2", "title": "Chapter 2: Rising Action", "uri": "chapter://56036b0f171a/2"},
#       ...
#     ],
#     "image": [
#       {"id": "0", "uri": "image://56036b0f171a/0"},
#       ...
#     ]
#   }
# }

# Now fetch specific content via MCP resources:
# - chapter://56036b0f171a/1 → Chapter 1 as markdown
# - chapter://56036b0f171a/1.txt → Chapter 1 as plain text
# - chapters://56036b0f171a/1-3 → Chapters 1-3 combined
# - image://56036b0f171a/0 → First embedded image

# Works with Excel and PowerPoint too:
await index_document("data.xlsx")
# → sheet://abc123/Revenue, sheet://abc123/Expenses, ...
await index_document("presentation.pptx")
# → slide://def456/1, slide://def456/2, ...
```
---
## Testing
We built a visual test dashboard because staring at pytest output gets old. Run `make test` and you get an HTML report with pass/fail stats, detailed I/O for each test, and expandable tracebacks when things break.
```bash
# Run tests and generate the dashboard
make test
# Just pytest, no dashboard
make test-pytest
# Open existing dashboard
make view-dashboard
```
The dashboard has an MS Office-inspired theme (Word blue, Excel green, PowerPoint orange) and groups tests by category so you can see what's working at a glance.
---
## Architecture
The mixin pattern keeps things modular: universal tools work on everything, and format-specific tools go deeper. When the primary library can't handle something (corrupted files, weird formatting), we fall back to alternatives.
```
mcp-office-tools/
├── src/mcp_office_tools/
│   ├── server.py              # FastMCP server + resource templates
│   ├── resources.py           # Resource store for on-demand content
│   ├── mixins/
│   │   ├── universal.py       # Format-agnostic tools (incl. index_document)
│   │   ├── word.py            # Word-specific tools
│   │   ├── excel.py           # Excel-specific tools
│   │   └── powerpoint.py      # PowerPoint tools (WIP)
│   ├── utils/
│   │   ├── validation.py      # File validation
│   │   ├── file_detection.py  # Format detection
│   │   ├── caching.py         # URL caching
│   │   └── decorators.py      # Error handling, defaults
│   └── pagination.py          # Large document pagination
├── tests/                     # pytest test suite
└── reports/                   # Test dashboard output
```
### Processing Libraries
| Format | Primary Library | Fallback |
|--------|----------------|----------|
| `.docx` | python-docx | mammoth |
| `.xlsx` | openpyxl | pandas |
| `.pptx` | python-pptx | - |
| `.doc`/`.xls`/`.ppt` | olefile | - |
| `.csv` | pandas | built-in csv |
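
The primary/fallback pairing in the table can be pictured as an ordered chain. This is a sketch only: `extract_with_fallback` is a hypothetical helper, the two extractors are stand-ins rather than real python-docx or mammoth calls, and the server's actual error handling lives in its own code.

```python
def extract_with_fallback(path, extractors):
    """Try each (name, extractor) pair in order and return the first success.

    Raises RuntimeError listing every failure if all extractors fail,
    so a bad file never fails silently.
    """
    errors = []
    for name, extractor in extractors:
        try:
            return {"method": name, "text": extractor(path)}
        except Exception as exc:  # a real server would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all extractors failed for {path}: " + "; ".join(errors))


def primary_stub(path):
    """Stand-in for the primary library; pretends the file is unreadable."""
    raise ValueError("unexpected document structure")


def fallback_stub(path):
    """Stand-in for the fallback library; pretends extraction succeeded."""
    return "text recovered by the fallback"


result = extract_with_fallback(
    "weird.docx",
    [("python-docx", primary_stub), ("mammoth", fallback_stub)],
)
print(result["method"])  # mammoth
```

Recording which method succeeded makes it easier to spot files that only work through the fallback path.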
---
## Development
```bash
# Clone and install
git clone https://github.com/yourusername/mcp-office-tools.git
cd mcp-office-tools
uv sync --dev
# Run tests
uv run pytest
# Format and lint
uv run black src/ tests/
uv run ruff check src/ tests/
# Type check
uv run mypy src/
```
---
## Dependencies
**Core:**
- `fastmcp` - MCP server framework
- `python-docx` - Word document processing
- `openpyxl` - Excel spreadsheet processing
- `python-pptx` - PowerPoint processing
- `pandas` - Data analysis and CSV handling
- `mammoth` - Word to HTML/Markdown conversion
- `olefile` - Legacy OLE format support
- `xlrd` - Legacy Excel support
- `pillow` - Image processing
- `aiohttp` / `aiofiles` - Async HTTP and file I/O
**Optional:**
- `python-magic` - Enhanced MIME type detection
- `msoffcrypto-tool` - Encrypted file detection
---
## Related Projects
- **[MCP PDF Tools](https://github.com/yourusername/mcp-pdf-tools)** - Companion server for PDF processing
- **[FastMCP](https://gofastmcp.com)** - The framework powering this server
## Behind the Scenes
This README was rewritten during a human-AI collaboration session. The process raised questions about discernment, voice, and what makes documentation actually land:
- **[AI Isn't New. Your Discernment Is What Matters.](https://ryanmalloy.com/blog/ai-discernment)** - Ryan's take on 40 years of writing code and why discernment matters more than the tools
---
## License
MIT License - see [LICENSE](LICENSE) for details.
---
**Built with [FastMCP](https://gofastmcp.com) and the [Model Context Protocol](https://modelcontextprotocol.io)**