📊 MCP Office Tools
MCP server for extracting text, tables, images, and data from Microsoft Office files
Word, Excel, PowerPoint, CSV — all the formats your AI agent needs to read but can't
Installation • Tools • Examples • Testing
✨ Features
- Universal extraction — Pull text, images, and metadata from any Office format
- Format-specific tools — Deep analysis for Word (tables, structure), Excel (formulas, charts), PowerPoint
- Automatic pagination — Large documents get chunked so they don't blow up your context window
- Fallback processing — When one library chokes on a weird file, we try another. No silent failures.
- URL support — Pass a URL instead of a file path; we'll download and cache it
- Legacy formats — Yes, even those .doc and .xls files from 2003 still work
🚀 Installation
# Quick install with uvx (recommended)
uvx mcp-office-tools
# Or install with uv/pip
uv add mcp-office-tools
pip install mcp-office-tools
Claude Desktop Configuration
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "office-tools": {
      "command": "uvx",
      "args": ["mcp-office-tools"]
    }
  }
}
Claude Code Configuration
claude mcp add office-tools "uvx mcp-office-tools"
🛠 Available Tools
Universal Tools
Work with all Office formats: Word, Excel, PowerPoint, CSV
| Tool | Description |
|---|---|
| extract_text | Extract text with optional formatting preservation |
| extract_images | Extract embedded images with size filtering |
| extract_metadata | Get document properties (author, dates, statistics) |
| detect_office_format | Identify format, version, encryption status |
| analyze_document_health | Check integrity, corruption, password protection |
| get_supported_formats | List all supported file extensions |
| index_document | Scan document and create resource URIs for on-demand fetching |
Word Tools
| Tool | Description |
|---|---|
| convert_to_markdown | Convert to Markdown with automatic pagination for large docs |
| extract_word_tables | Extract tables as structured JSON, CSV, or Markdown |
| analyze_word_structure | Analyze headings, sections, styles, and document hierarchy |
| get_document_outline | Get structured outline with chapter detection and word counts |
| check_style_consistency | Find formatting issues, missing chapters, style problems |
| search_document | Search text with context and chapter location |
| extract_entities | Extract people, places, organizations using pattern recognition |
| get_chapter_summaries | Generate chapter previews with opening sentences |
| save_reading_progress | Bookmark your reading position for later |
| get_reading_progress | Resume reading from saved position |
Excel Tools
| Tool | Description |
|---|---|
| analyze_excel_data | Statistical analysis: data types, missing values, outliers |
| extract_excel_formulas | Extract formulas with values and dependency analysis |
| create_excel_chart_data | Generate Chart.js/Plotly-ready data from spreadsheets |
📋 Format Support
Here's what works and what's "good enough" — legacy formats from Office 97-2003 have more limited extraction, but they still work:
| Format | Extension | Text | Images | Metadata | Tables | Formulas |
|---|---|---|---|---|---|---|
| Word (Modern) | .docx | ✅ | ✅ | ✅ | ✅ | - |
| Word (Legacy) | .doc | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| Word Template | .dotx | ✅ | ✅ | ✅ | ✅ | - |
| Word Macro | .docm | ✅ | ✅ | ✅ | ✅ | - |
| Excel (Modern) | .xlsx | ✅ | ✅ | ✅ | ✅ | ✅ |
| Excel (Legacy) | .xls | ✅ | ⚠️ | ⚠️ | ✅ | ⚠️ |
| Excel Template | .xltx | ✅ | ✅ | ✅ | ✅ | ✅ |
| Excel Macro | .xlsm | ✅ | ✅ | ✅ | ✅ | ✅ |
| PowerPoint (Modern) | .pptx | ✅ | ✅ | ✅ | ✅ | - |
| PowerPoint (Legacy) | .ppt | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| PowerPoint Template | .potx | ✅ | ✅ | ✅ | ✅ | - |
| CSV | .csv | ✅ | - | ⚠️ | ✅ | - |
✅ Full support • ⚠️ Basic/partial support • - Not applicable
🔗 MCP Resources
Instead of returning an entire document in every tool response, you can index a document once and then fetch content on demand through URI-based resources. This keeps context windows manageable when working with large files.
How It Works
- Index the document: index_document scans the file and returns URIs
- Fetch what you need: Request specific chapters, sheets, slides, or images by URI
- Format on demand: Append .txt or .html to get different output formats
Resource URI Patterns
| URI Pattern | Description | Example |
|---|---|---|
| chapter://{doc_id}/{n} | Single chapter/section | chapter://abc123/3 |
| chapters://{doc_id}/{range} | Multiple chapters | chapters://abc123/1-5 |
| section://{doc_id}/{n} | Section by heading style | section://abc123/2 |
| paragraph://{doc_id}/{ch}/{p} | Specific paragraph | paragraph://abc123/3/7 |
| sheet://{doc_id}/{name} | Excel sheet as markdown table | sheet://abc123/Revenue |
| slide://{doc_id}/{n} | PowerPoint slide | slide://abc123/5 |
| slides://{doc_id}/{range} | Multiple slides | slides://abc123/1,3,5 |
| image://{doc_id}/{n} | Embedded image | image://abc123/0 |
Format Suffixes
Append a format suffix to convert on the fly:
| Suffix | Output |
|---|---|
| .md (default) | Markdown |
| .txt | Plain text (no formatting) |
| .html | Basic HTML |
Examples:
- chapter://abc123/3 → Markdown (default)
- chapter://abc123/3.txt → Plain text
- chapter://abc123/3.html → HTML
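To make the suffix rule concrete, here is a minimal sketch of how a client might split one of these URIs into its parts. The helper name `split_resource_uri` and the `ResourceRef` type are hypothetical and are not part of this server's API; the sketch only assumes the three suffixes listed in the table and treats any other dot as part of the item name.

```python
from typing import NamedTuple

# Suffixes from the Format Suffixes table; anything else stays in the item.
KNOWN_FORMATS = {"md", "txt", "html"}


class ResourceRef(NamedTuple):
    scheme: str   # e.g. "chapter"
    doc_id: str   # e.g. "abc123"
    item: str     # e.g. "3" or "1-5"
    fmt: str      # "md", "txt", or "html"


def split_resource_uri(uri: str) -> ResourceRef:
    """Hypothetical helper: split a resource URI, defaulting the format to Markdown."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"not a resource URI: {uri!r}")
    doc_id, sep, item = rest.partition("/")
    if not sep or not item:
        raise ValueError(f"missing item in URI: {uri!r}")
    base, dot, suffix = item.rpartition(".")
    if dot and base and suffix.lower() in KNOWN_FORMATS:
        return ResourceRef(scheme, doc_id, base, suffix.lower())
    # No recognised suffix: the whole tail is the item, format is the default.
    return ResourceRef(scheme, doc_id, item, "md")
```

Only recognised suffixes are stripped, so a sheet name that happens to contain a dot is left untouched in this sketch.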
Range Syntax
Fetch multiple items at once:
- 1-5 → Items 1 through 5
- 1,3,5 → Specific items
- 1-3,7,9-10 → Mixed ranges
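The range syntax is easy to expand by hand, which helps when checking what a chapters:// or slides:// URI will return. The function below is a sketch of one way to do it; `parse_range` is a hypothetical name, and the server's own parser may differ in how it treats duplicates or malformed input.

```python
def parse_range(spec: str) -> list[int]:
    """Hypothetical helper: expand '1-3,7,9-10' into [1, 2, 3, 7, 9, 10]."""
    items: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty item in range: {spec!r}")
        start_text, dash, end_text = part.partition("-")
        start = int(start_text)
        # A part without a dash is a single item, so start and end coincide.
        end = int(end_text) if dash else start
        if end < start:
            raise ValueError(f"descending range: {part!r}")
        for n in range(start, end + 1):
            if n not in items:  # keep the first occurrence, preserve order
                items.append(n)
    return items
```

Dropping duplicates and rejecting descending ranges are choices made for this sketch, not documented server behavior.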
Section Detection
The indexer detects document structure automatically:
- Heading 1 styles (primary) — Business docs, manuals, technical documents
- "Chapter X" text patterns (fallback) — Books, manuscripts, narratives
Use text_patterns_only=True to skip heading style detection for documents with messy formatting.
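The two strategies can be pictured as a simple "styles first, text patterns second" check. The sketch below is an illustration only, not the indexer's actual code: it works on a plain list of (style_name, text) tuples instead of a real document object, and both the `detect_sections` name and the regular expression are assumptions made for the example.

```python
import re

# Matches lines such as "Chapter 3", "Chapter 1: The Beginning", or "CHAPTER IV".
CHAPTER_RE = re.compile(r"^\s*chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def detect_sections(paragraphs, text_patterns_only=False):
    """Return (index, title) pairs for paragraphs that start a section.

    `paragraphs` is a list of (style_name, text) tuples, a simplified
    stand-in for whatever the real indexer reads from the document.
    """
    if not text_patterns_only:
        by_style = [
            (i, text.strip())
            for i, (style, text) in enumerate(paragraphs)
            if style == "Heading 1" and text.strip()
        ]
        if by_style:
            return by_style
    # Fallback: paragraphs whose text starts with a "Chapter X" pattern.
    return [
        (i, text.strip())
        for i, (_style, text) in enumerate(paragraphs)
        if CHAPTER_RE.match(text)
    ]
```

With text_patterns_only=True the style pass is skipped entirely, which mirrors the option described above for documents whose heading styles cannot be trusted.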
🎯 MCP Prompts
Pre-built workflows that chain multiple tools together. Use these as starting points:
| Prompt | Level | Description |
|---|---|---|
explore-document |
Basic | Start with any new document - get structure and identify issues |
find-character |
Basic | Track all mentions of a person/character with context |
chapter-preview |
Basic | Quick overview of each chapter without full read |
resume-reading |
Intermediate | Check saved position and continue reading |
document-analysis |
Intermediate | Comprehensive multi-tool analysis |
character-journey |
Advanced | Track character arc through entire narrative |
document-comparison |
Advanced | Compare entities and themes between chapters |
full-reading-session |
Advanced | Guided reading with bookmarking |
manuscript-review |
Advanced | Complete editorial workflow for editors |
💡 Usage Examples
Extract Text from Any Document
# Simple extraction
result = await extract_text("report.docx")
print(result["text"])
# With formatting preserved
result = await extract_text(
    file_path="report.docx",
    preserve_formatting=True,
    include_metadata=True
)
Convert Word to Markdown (with Pagination)
Large documents get paginated automatically. Three ways to handle it:
# Option 1: Follow the cursor for each chunk
result = await convert_to_markdown("big-manual.docx")
if result.get("pagination", {}).get("has_more"):
    next_page = await convert_to_markdown(
        "big-manual.docx",
        cursor_id=result["pagination"]["cursor_id"]
    )
# Option 2: Grab specific pages
result = await convert_to_markdown("big-manual.docx", page_range="1-10")
# Option 3: Extract by chapter heading
result = await convert_to_markdown("big-manual.docx", chapter_name="Introduction")
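Option 1 fetches only one extra page. To walk every page, the same cursor check can go in a loop. The sketch below assumes the tool keeps returning the pagination, has_more, and cursor_id fields shown above on every page; `collect_pages` and `fake_convert` are hypothetical names, and the stand-in exists only so the example runs without the server.

```python
import asyncio


async def collect_pages(convert, path, max_pages=100):
    """Call `convert` repeatedly, following cursor_id until has_more is false.

    `convert` is any async callable shaped like convert_to_markdown in the
    example above. Returns the raw result dict of every page, in order.
    """
    pages = []
    result = await convert(path)
    pages.append(result)
    while result.get("pagination", {}).get("has_more"):
        if len(pages) >= max_pages:  # guard against a cursor that never ends
            raise RuntimeError(f"stopped after {max_pages} pages")
        cursor_id = result["pagination"]["cursor_id"]
        result = await convert(path, cursor_id=cursor_id)
        pages.append(result)
    return pages


# Stand-in for the real tool so the sketch runs on its own: three fake pages.
async def fake_convert(path, cursor_id=None):
    page = 0 if cursor_id is None else int(cursor_id)
    return {
        "page": page,
        "pagination": {"has_more": page < 2, "cursor_id": str(page + 1)},
    }


pages = asyncio.run(collect_pages(fake_convert, "big-manual.docx"))
print(len(pages))
```

The function returns whole result dicts and leaves it to the caller to pull out the page content, since the key that holds the Markdown is not shown in this README.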
Analyze Excel Data Quality
result = await analyze_excel_data(
    file_path="sales-data.xlsx",
    include_statistics=True,
    check_data_quality=True
)
# Returns per-column analysis
# {
#   "analysis": {
#     "Sheet1": {
#       "dimensions": {"rows": 1000, "columns": 12},
#       "column_info": {
#         "Revenue": {
#           "data_type": "float64",
#           "null_percentage": 2.3,
#           "statistics": {"mean": 45000, "median": 42000, ...},
#           "quality_issues": ["5 potential outliers"]
#         }
#       },
#       "data_quality": {
#         "completeness_percentage": 97.8,
#         "duplicate_rows": 12
#       }
#     }
#   }
# }
Extract Excel Formulas
result = await extract_excel_formulas(
    file_path="financial-model.xlsx",
    analyze_dependencies=True
)
# Returns formula details with dependency mapping
# {
#   "formulas": {
#     "Sheet1": [
#       {
#         "cell": "D2",
#         "formula": "=B2*C2",
#         "value": 1500.00,
#         "dependencies": ["B2", "C2"]
#       }
#     ]
#   }
# }
Generate Chart Data
result = await create_excel_chart_data(
    file_path="quarterly-revenue.xlsx",
    chart_type="line",
    output_format="chartjs"
)
# Returns ready-to-use Chart.js configuration
# {
#   "chartjs": {
#     "type": "line",
#     "data": {
#       "labels": ["Q1", "Q2", "Q3", "Q4"],
#       "datasets": [{"label": "Revenue", "data": [100, 120, 115, 140]}]
#     }
#   }
# }
Extract Word Tables
result = await extract_word_tables(
    file_path="contract.docx",
    output_format="markdown"
)
# Returns tables with optional format conversion
# {
#   "tables": [
#     {
#       "table_index": 0,
#       "dimensions": {"rows": 5, "columns": 3},
#       "converted_output": "| Name | Role | Department |\n|---|---|---|\n..."
#     }
#   ]
# }
Process Documents from URLs
# Documents are downloaded and cached automatically
result = await extract_text("https://example.com/report.docx")
# Cache expires after 1 hour by default
Index Document for On-Demand Resource Fetching
# Index the document - returns URIs for all content
result = await index_document("novel.docx")
# Returns:
# {
#   "doc_id": "56036b0f171a",
#   "resources": {
#     "chapter": [
#       {"id": "1", "title": "Chapter 1: The Beginning", "uri": "chapter://56036b0f171a/1"},
#       {"id": "2", "title": "Chapter 2: Rising Action", "uri": "chapter://56036b0f171a/2"},
#       ...
#     ],
#     "image": [
#       {"id": "0", "uri": "image://56036b0f171a/0"},
#       ...
#     ]
#   }
# }
# Now fetch specific content via MCP resources:
# - chapter://56036b0f171a/1 → Chapter 1 as markdown
# - chapter://56036b0f171a/1.txt → Chapter 1 as plain text
# - chapters://56036b0f171a/1-3 → Chapters 1-3 combined
# - image://56036b0f171a/0 → First embedded image
# Works with Excel and PowerPoint too:
await index_document("data.xlsx")
# → sheet://abc123/Revenue, sheet://abc123/Expenses, ...
await index_document("presentation.pptx")
# → slide://def456/1, slide://def456/2, ...
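Once a document is indexed, a client usually wants the list of URIs for one resource type, optionally with a format suffix attached. The sketch below shows one way to pull that out of the result. It assumes the result has the resources layout shown in the index_document example; `resource_uris` is a hypothetical helper and the sample dict is a shortened copy of that example, not live output.

```python
def resource_uris(index_result, kind, fmt=None):
    """Hypothetical helper: list the URIs of one resource kind, with an optional suffix."""
    entries = index_result.get("resources", {}).get(kind, [])
    suffix = f".{fmt}" if fmt else ""
    return [entry["uri"] + suffix for entry in entries]


# Sample shaped like the index_document output shown in this README.
sample = {
    "doc_id": "56036b0f171a",
    "resources": {
        "chapter": [
            {"id": "1", "title": "Chapter 1: The Beginning", "uri": "chapter://56036b0f171a/1"},
            {"id": "2", "title": "Chapter 2: Rising Action", "uri": "chapter://56036b0f171a/2"},
        ],
        "image": [{"id": "0", "uri": "image://56036b0f171a/0"}],
    },
}

print(resource_uris(sample, "chapter", fmt="txt"))
```

A format suffix only makes sense for text resources such as chapters, so image URIs are best requested without one.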
🧪 Testing
We built a visual test dashboard because staring at pytest output gets old. Run make test and you get an HTML report with pass/fail stats, detailed I/O for each test, and expandable tracebacks when things break.
# Run tests and generate the dashboard
make test
# Just pytest, no dashboard
make test-pytest
# Open existing dashboard
make view-dashboard
The dashboard has an MS Office-inspired theme (Word blue, Excel green, PowerPoint orange) and groups tests by category so you can see what's working at a glance.
🏗 Architecture
The mixin pattern keeps things modular — universal tools work on everything, format-specific tools go deeper. When the primary library can't handle something (corrupted files, weird formatting), we fall back to alternatives.
mcp-office-tools/
├── src/mcp_office_tools/
│ ├── server.py # FastMCP server + resource templates
│ ├── resources.py # Resource store for on-demand content
│ ├── mixins/
│ │ ├── universal.py # Format-agnostic tools (incl. index_document)
│ │ ├── word.py # Word-specific tools
│ │ ├── excel.py # Excel-specific tools
│ │ └── powerpoint.py # PowerPoint tools (WIP)
│ ├── utils/
│ │ ├── validation.py # File validation
│ │ ├── file_detection.py # Format detection
│ │ ├── caching.py # URL caching
│ │ └── decorators.py # Error handling, defaults
│ └── pagination.py # Large document pagination
├── tests/ # pytest test suite
└── reports/ # Test dashboard output
Processing Libraries
| Format | Primary Library | Fallback |
|---|---|---|
| .docx | python-docx | mammoth |
| .xlsx | openpyxl | pandas |
| .pptx | python-pptx | - |
| .doc/.xls/.ppt | olefile | - |
| .csv | pandas | built-in csv |
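The primary/fallback pairing in the table follows a common pattern: try each extractor in order, remember why each one failed, and raise only when none succeeds. The sketch below illustrates that pattern with stand-in functions. It is not the server's actual code; `extract_with_fallback`, `ExtractionError`, and the two stub extractors are all hypothetical.

```python
class ExtractionError(Exception):
    """Raised when every extractor in the chain has failed."""


def extract_with_fallback(path, extractors):
    """Try each (name, callable) pair in order and return the first success.

    Returns (name, result). If all extractors fail, raises ExtractionError
    listing every failure, so nothing fails silently.
    """
    failures = []
    for name, extract in extractors:
        try:
            return name, extract(path)
        except Exception as exc:  # a sketch; real code would catch narrower errors
            failures.append(f"{name}: {exc}")
    raise ExtractionError(f"all extractors failed for {path}: " + "; ".join(failures))


# Stand-ins for a primary and a fallback library.
def primary(path):
    raise ValueError("unexpected XML structure")


def fallback(path):
    return f"text of {path}"


print(extract_with_fallback("odd.docx", [("primary", primary), ("fallback", fallback)]))
```

Returning the name of the extractor that succeeded lets the caller report which library actually produced the result.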
🔧 Development
# Clone and install
git clone https://github.com/yourusername/mcp-office-tools.git
cd mcp-office-tools
uv sync --dev
# Run tests
uv run pytest
# Format and lint
uv run black src/ tests/
uv run ruff check src/ tests/
# Type check
uv run mypy src/
📦 Dependencies
Core:
- fastmcp - MCP server framework
- python-docx - Word document processing
- openpyxl - Excel spreadsheet processing
- python-pptx - PowerPoint processing
- pandas - Data analysis and CSV handling
- mammoth - Word to HTML/Markdown conversion
- olefile - Legacy OLE format support
- xlrd - Legacy Excel support
- pillow - Image processing
- aiohttp/aiofiles - Async HTTP and file I/O
Optional:
- python-magic - Enhanced MIME type detection
- msoffcrypto-tool - Encrypted file detection
🤝 Related Projects
- MCP PDF Tools - Companion server for PDF processing
- FastMCP - The framework powering this server
📝 Behind the Scenes
This README was rewritten during a human-AI collaboration session. The process raised questions about discernment, voice, and what makes documentation actually land:
- AI Isn't New. Your Discernment Is What Matters. — Ryan's take on 40 years of writing code and why discernment matters more than the tools
📜 License
MIT License - see LICENSE for details.
Built with FastMCP and the Model Context Protocol