<div align="center">

# 📊 MCP Office Tools

**MCP server for extracting text, tables, images, and data from Microsoft Office files**

[Python](https://www.python.org/downloads/) • [FastMCP](https://gofastmcp.com) • [MIT License](https://opensource.org/licenses/MIT) • [Model Context Protocol](https://modelcontextprotocol.io)

*Word, Excel, PowerPoint, CSV: all the formats your AI agent needs to read but can't*

[Installation](#-installation) • [Tools](#-available-tools) • [Examples](#-usage-examples) • [Testing](#-testing)

</div>

---

## ✨ Features

- **Universal extraction**: Pull text, images, and metadata from any supported Office format
- **Format-specific tools**: Deep analysis for Word (tables, structure) and Excel (formulas, charts); PowerPoint-specific tools are still in progress
- **Automatic pagination**: Large documents get chunked so they don't blow up your context window
- **Fallback processing**: When one library chokes on a weird file, we try another. No silent failures.
- **URL support**: Pass a URL instead of a file path; we'll download and cache it
- **Legacy formats**: Yes, even those .doc and .xls files from 2003 still work

---

## 🚀 Installation

```bash
# Quick install with uvx (recommended)
uvx mcp-office-tools

# Or install with uv/pip
uv add mcp-office-tools
pip install mcp-office-tools
```

### Claude Desktop Configuration

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "office-tools": {
      "command": "uvx",
      "args": ["mcp-office-tools"]
    }
  }
}
```

### Claude Code Configuration

```bash
claude mcp add office-tools "uvx mcp-office-tools"
```

---

## 🛠 Available Tools

### Universal Tools

*Work with all Office formats: Word, Excel, PowerPoint, CSV*

| Tool | Description |
|------|-------------|
| `extract_text` | Extract text with optional formatting preservation |
| `extract_images` | Extract embedded images with size filtering |
| `extract_metadata` | Get document properties (author, dates, statistics) |
| `detect_office_format` | Identify format, version, encryption status |
| `analyze_document_health` | Check integrity, corruption, password protection |
| `get_supported_formats` | List all supported file extensions |

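
As a rough illustration of what format detection involves (not necessarily how `detect_office_format` works internally): modern Office files are ZIP containers and legacy ones are OLE compound files, so the first few bytes already narrow things down. The helper name `sniff_container` is hypothetical and only exists for this sketch.

```python
# Hypothetical helper: classify a file by its leading "magic" bytes.
ZIP_MAGIC = b"PK\x03\x04"  # .docx/.xlsx/.pptx are ZIP containers
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # .doc/.xls/.ppt (OLE compound file)


def sniff_container(header: bytes) -> str:
    """Return 'zip', 'ole', or 'unknown' for the first bytes of a file."""
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(OLE_MAGIC):
        return "ole"
    return "unknown"
```

A ZIP signature alone can't tell a `.docx` from an `.xlsx` (or from any other ZIP archive), so a real detector would still need to look inside the container.
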
### Word Tools

| Tool | Description |
|------|-------------|
| `convert_to_markdown` | Convert to Markdown with automatic pagination for large docs |
| `extract_word_tables` | Extract tables as structured JSON, CSV, or Markdown |
| `analyze_word_structure` | Analyze headings, sections, styles, and document hierarchy |

### Excel Tools

| Tool | Description |
|------|-------------|
| `analyze_excel_data` | Statistical analysis: data types, missing values, outliers |
| `extract_excel_formulas` | Extract formulas with values and dependency analysis |
| `create_excel_chart_data` | Generate Chart.js/Plotly-ready data from spreadsheets |

---

## 📋 Format Support

Here's what works and what's "good enough". Legacy formats from Office 97-2003 have more limited extraction, but they still work:

| Format | Extension | Text | Images | Metadata | Tables | Formulas |
|--------|-----------|:----:|:------:|:--------:|:------:|:--------:|
| **Word (Modern)** | `.docx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word (Legacy)** | `.doc` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **Word Template** | `.dotx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word Macro** | `.docm` | ✅ | ✅ | ✅ | ✅ | - |
| **Excel (Modern)** | `.xlsx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel (Legacy)** | `.xls` | ✅ | ⚠️ | ⚠️ | ✅ | ⚠️ |
| **Excel Template** | `.xltx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel Macro** | `.xlsm` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **PowerPoint (Modern)** | `.pptx` | ✅ | ✅ | ✅ | ✅ | - |
| **PowerPoint (Legacy)** | `.ppt` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **PowerPoint Template** | `.potx` | ✅ | ✅ | ✅ | ✅ | - |
| **CSV** | `.csv` | ✅ | - | ⚠️ | ✅ | - |

✅ Full support • ⚠️ Basic/partial support • - Not applicable

---

## 💡 Usage Examples

### Extract Text from Any Document

```python
# Simple extraction
result = await extract_text("report.docx")
print(result["text"])

# With formatting preserved
result = await extract_text(
    file_path="report.docx",
    preserve_formatting=True,
    include_metadata=True
)
```

### Convert Word to Markdown (with Pagination)

Large documents get paginated automatically. Three ways to handle it:

```python
# Option 1: Follow the cursor for each chunk
result = await convert_to_markdown("big-manual.docx")
if result.get("pagination", {}).get("has_more"):
    next_page = await convert_to_markdown(
        "big-manual.docx",
        cursor_id=result["pagination"]["cursor_id"]
    )

# Option 2: Grab specific pages
result = await convert_to_markdown("big-manual.docx", page_range="1-10")

# Option 3: Extract by chapter heading
result = await convert_to_markdown("big-manual.docx", chapter_name="Introduction")
```

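
Option 1 only fetches one extra chunk. To walk the whole document you would keep following the cursor until `has_more` is false. The sketch below shows that loop against a stand-in `convert_to_markdown` so it runs on its own; the stub, the `"markdown"` result key, and the helper name `collect_all_chunks` are assumptions for illustration, not the server's documented response shape.

```python
import asyncio

# Hypothetical stand-in for the real tool so the loop can run on its own.
_FAKE_PAGES = ["# Part 1", "# Part 2", "# Part 3"]


async def convert_to_markdown(file_path, cursor_id=None):
    index = int(cursor_id) if cursor_id is not None else 0
    has_more = index + 1 < len(_FAKE_PAGES)
    return {
        "markdown": _FAKE_PAGES[index],  # assumed key name
        "pagination": {"has_more": has_more, "cursor_id": str(index + 1)},
    }


async def collect_all_chunks(file_path):
    """Follow cursor_id until has_more is false; return every chunk in order."""
    chunks = []
    result = await convert_to_markdown(file_path)
    chunks.append(result["markdown"])
    while result.get("pagination", {}).get("has_more"):
        result = await convert_to_markdown(
            file_path, cursor_id=result["pagination"]["cursor_id"]
        )
        chunks.append(result["markdown"])
    return chunks
```

In an agent setting you would usually stop early once you have what you need, since pulling every chunk defeats the point of pagination.
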
### Analyze Excel Data Quality

```python
result = await analyze_excel_data(
    file_path="sales-data.xlsx",
    include_statistics=True,
    check_data_quality=True
)

# Returns per-column analysis
# {
#   "analysis": {
#     "Sheet1": {
#       "dimensions": {"rows": 1000, "columns": 12},
#       "column_info": {
#         "Revenue": {
#           "data_type": "float64",
#           "null_percentage": 2.3,
#           "statistics": {"mean": 45000, "median": 42000, ...},
#           "quality_issues": ["5 potential outliers"]
#         }
#       },
#       "data_quality": {
#         "completeness_percentage": 97.8,
#         "duplicate_rows": 12
#       }
#     }
#   }
# }
```

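
The README doesn't spell out how `completeness_percentage` and `duplicate_rows` are computed. One plausible reading, sketched here in plain Python with a hypothetical `quality_summary` helper, is "share of non-empty cells" and "rows that repeat an earlier row exactly". Treat it as an illustration of the idea, not the tool's actual definition.

```python
def quality_summary(rows):
    """rows: list of dicts sharing the same keys; None marks a missing cell."""
    total_cells = sum(len(row) for row in rows)
    missing = sum(1 for row in rows for value in row.values() if value is None)

    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent row fingerprint
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    if total_cells:
        completeness = 100.0 * (total_cells - missing) / total_cells
    else:
        completeness = 100.0  # an empty sheet has nothing missing
    return {
        "completeness_percentage": round(completeness, 1),
        "duplicate_rows": duplicates,
    }
```
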
### Extract Excel Formulas

```python
result = await extract_excel_formulas(
    file_path="financial-model.xlsx",
    analyze_dependencies=True
)

# Returns formula details with dependency mapping
# {
#   "formulas": {
#     "Sheet1": [
#       {
#         "cell": "D2",
#         "formula": "=B2*C2",
#         "value": 1500.00,
#         "dependencies": ["B2", "C2"]
#       }
#     ]
#   }
# }
```

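
To make the `dependencies` field concrete, here is a minimal sketch of pulling direct cell references out of a formula string with a regular expression. The function name `direct_dependencies` is hypothetical, and the pattern is deliberately naive: it ignores ranges, sheet-qualified references, and named ranges, and it would misread a function name such as `LOG10` as a cell. The server's own analysis may work quite differently.

```python
import re

# Matches A1-style references such as B2 or $AA$10 (no sheet names, no ranges).
CELL_REF = re.compile(r"\$?([A-Z]{1,3})\$?([0-9]+)")


def direct_dependencies(formula: str) -> list[str]:
    """Return the distinct cell references in a formula, in order of appearance."""
    refs = []
    for column, row in CELL_REF.findall(formula):
        ref = f"{column}{row}"
        if ref not in refs:
            refs.append(ref)
    return refs
```
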
### Generate Chart Data

```python
result = await create_excel_chart_data(
    file_path="quarterly-revenue.xlsx",
    chart_type="line",
    output_format="chartjs"
)

# Returns ready-to-use Chart.js configuration
# {
#   "chartjs": {
#     "type": "line",
#     "data": {
#       "labels": ["Q1", "Q2", "Q3", "Q4"],
#       "datasets": [{"label": "Revenue", "data": [100, 120, 115, 140]}]
#     }
#   }
# }
```

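
The Chart.js shape above is simple enough to build by hand, which helps when you want to reshape the result. A minimal sketch, assuming the spreadsheet rows are already available as a list of dicts; `to_chartjs` and its parameters are hypothetical names, not part of this server.

```python
def to_chartjs(rows, label_key, value_key, chart_type="line"):
    """Build a Chart.js-style config from rows (one dict per spreadsheet row)."""
    return {
        "type": chart_type,
        "data": {
            "labels": [row[label_key] for row in rows],
            "datasets": [
                {"label": value_key, "data": [row[value_key] for row in rows]}
            ],
        },
    }
```

Each additional series would be another entry in `datasets` sharing the same `labels`.
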
### Extract Word Tables

```python
result = await extract_word_tables(
    file_path="contract.docx",
    output_format="markdown"
)

# Returns tables with optional format conversion
# {
#   "tables": [
#     {
#       "table_index": 0,
#       "dimensions": {"rows": 5, "columns": 3},
#       "converted_output": "| Name | Role | Department |\n|---|---|---|\n..."
#     }
#   ]
# }
```

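
For a sense of what the Markdown conversion in `converted_output` amounts to, the sketch below turns a list of rows into a pipe table in the same layout as the sample above. `rows_to_markdown` is a hypothetical helper written for this example; the server's converter may handle merged cells, alignment, and escaping differently.

```python
def rows_to_markdown(rows):
    """rows: list of lists of strings; the first row is treated as the header."""
    if not rows:
        return ""

    def line(cells):
        # Escape literal pipes so they don't split a cell in two.
        return "| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |"

    header, *body = rows
    separator = "|" + "|".join("---" for _ in header) + "|"
    return "\n".join([line(header), separator, *(line(row) for row in body)])
```
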
### Process Documents from URLs

```python
# Documents are downloaded and cached automatically
result = await extract_text("https://example.com/report.docx")

# Cache expires after 1 hour by default
```

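
A time-based cache like the one described can be as small as "hash the URL into a file name, then compare the file's age against a TTL". The sketch below shows that idea only; `cache_path_for`, `is_fresh`, and `CACHE_TTL_SECONDS` are hypothetical names, and the real cache in `utils/caching.py` may key and expire entries differently. A caller would re-download whenever `is_fresh` returns false.

```python
import hashlib
import time
from pathlib import Path

CACHE_TTL_SECONDS = 3600  # mirrors the 1-hour default mentioned above


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Derive a stable, filesystem-safe file name from the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / digest


def is_fresh(path: Path, now=None, ttl=CACHE_TTL_SECONDS) -> bool:
    """True if the cached file exists and is younger than ttl seconds."""
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) < ttl
```
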

---

## 🧪 Testing

We built a visual test dashboard because staring at pytest output gets old. Run `make test` and you get an HTML report with pass/fail stats, detailed I/O for each test, and expandable tracebacks when things break.

```bash
# Run tests and generate the dashboard
make test

# Just pytest, no dashboard
make test-pytest

# Open existing dashboard
make view-dashboard
```

The dashboard has an MS Office-inspired theme (Word blue, Excel green, PowerPoint orange) and groups tests by category so you can see what's working at a glance.

---

## 🏗 Architecture

The mixin pattern keeps things modular: universal tools work on everything, format-specific tools go deeper. When the primary library can't handle something (corrupted files, weird formatting), we fall back to alternatives.

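
If the mixin pattern is unfamiliar, the gist is that each group of tools lives in its own small class and the server class simply inherits from all of them. The sketch below uses made-up class names and placeholder return values; the real mixins register their tools with FastMCP, which is omitted here.

```python
class UniversalMixin:
    """Tools that apply to every supported format."""

    def get_supported_formats(self):
        return [".docx", ".xlsx", ".pptx", ".csv"]  # abbreviated placeholder list


class WordMixin:
    """Word-only tools."""

    def analyze_word_structure(self, file_path):
        return {"file": file_path, "headings": []}  # placeholder result


class OfficeServer(UniversalMixin, WordMixin):
    """Composes the mixins into one object exposing every tool."""


def tool_names(obj):
    """List the public callables an object exposes."""
    return sorted(
        name
        for name in dir(obj)
        if not name.startswith("_") and callable(getattr(obj, name))
    )
```

Adding a new format then means writing one more mixin and adding it to the server's base classes, without touching the existing ones.
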
```
mcp-office-tools/
├── src/mcp_office_tools/
│   ├── server.py              # FastMCP server entry point
│   ├── mixins/
│   │   ├── universal.py       # Format-agnostic tools
│   │   ├── word.py            # Word-specific tools
│   │   ├── excel.py           # Excel-specific tools
│   │   └── powerpoint.py      # PowerPoint tools (WIP)
│   ├── utils/
│   │   ├── validation.py      # File validation
│   │   ├── file_detection.py  # Format detection
│   │   ├── caching.py         # URL caching
│   │   └── decorators.py      # Error handling, defaults
│   └── pagination.py          # Large document pagination
├── tests/                     # pytest test suite
└── reports/                   # Test dashboard output
```

### Processing Libraries

| Format | Primary Library | Fallback |
|--------|----------------|----------|
| `.docx` | python-docx | mammoth |
| `.xlsx` | openpyxl | pandas |
| `.pptx` | python-pptx | - |
| `.doc`/`.xls`/`.ppt` | olefile | - |
| `.csv` | pandas | built-in csv |

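
The primary/fallback pairing in the table can be pictured as an ordered list of extractors tried one after another. This is a generic sketch of that pattern with a hypothetical `extract_with_fallback` helper, not the project's actual code. It records every failure and raises if nothing succeeds, in keeping with the "no silent failures" goal.

```python
def extract_with_fallback(file_path, extractors):
    """Try each (name, callable) pair in order; raise if every one fails."""
    errors = []
    for name, extractor in extractors:
        try:
            return {"method": name, "text": extractor(file_path)}
        except Exception as exc:  # broad on purpose: any library error triggers the fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All extractors failed: " + "; ".join(errors))
```
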

---

## 🔧 Development

```bash
# Clone and install
git clone https://github.com/yourusername/mcp-office-tools.git
cd mcp-office-tools
uv sync --dev

# Run tests
uv run pytest

# Format and lint
uv run black src/ tests/
uv run ruff check src/ tests/

# Type check
uv run mypy src/
```

---

## 📦 Dependencies

**Core:**

- `fastmcp` - MCP server framework
- `python-docx` - Word document processing
- `openpyxl` - Excel spreadsheet processing
- `python-pptx` - PowerPoint processing
- `pandas` - Data analysis and CSV handling
- `mammoth` - Word to HTML/Markdown conversion
- `olefile` - Legacy OLE format support
- `xlrd` - Legacy Excel support
- `pillow` - Image processing
- `aiohttp` / `aiofiles` - Async HTTP and file I/O

**Optional:**

- `python-magic` - Enhanced MIME type detection
- `msoffcrypto-tool` - Encrypted file detection

---

## 🤝 Related Projects

- **[MCP PDF Tools](https://github.com/yourusername/mcp-pdf-tools)** - Companion server for PDF processing
- **[FastMCP](https://gofastmcp.com)** - The framework powering this server

## 📝 Behind the Scenes

This README was rewritten during a human-AI collaboration session. The process raised questions about discernment, voice, and what makes documentation actually land:

- **[AI Isn't New. Your Discernment Is What Matters.](https://ryanmalloy.com/blog/ai-discernment)**: Ryan's take on 40 years of writing code and why discernment matters more than the tools

---

## 📜 License

MIT License - see [LICENSE](LICENSE) for details.

---

<div align="center">

**Built with [FastMCP](https://gofastmcp.com) and the [Model Context Protocol](https://modelcontextprotocol.io)**

</div>