Ryan Malloy 6fb76d8760
Add MCP resources documentation and fix section format suffix
- Document MCP resource system in README with URI patterns, format
  suffixes, range syntax, and section detection strategies
- Add index_document to Universal Tools table
- Update architecture section to include resources.py
- Fix section:// resource to support .md/.txt/.html format suffixes
  (matching chapter:// behavior)
2026-01-11 10:23:47 -07:00

📊 MCP Office Tools

MCP server for extracting text, tables, images, and data from Microsoft Office files

Python 3.11+ • FastMCP • License: MIT • MCP Protocol

Word, Excel, PowerPoint, CSV — all the formats your AI agent needs to read but can't

Installation • Tools • Examples • Testing


Features

  • Universal extraction — Pull text, images, and metadata from any Office format
  • Format-specific tools — Deep analysis for Word (tables, structure), Excel (formulas, charts), PowerPoint
  • Automatic pagination — Large documents get chunked so they don't blow up your context window
  • Fallback processing — When one library chokes on a weird file, we try another. No silent failures.
  • URL support — Pass a URL instead of a file path; we'll download and cache it
  • Legacy formats — Yes, even those .doc and .xls files from 2003 still work

🚀 Installation

# Quick install with uvx (recommended)
uvx mcp-office-tools

# Or install with uv/pip
uv add mcp-office-tools
pip install mcp-office-tools

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "office-tools": {
      "command": "uvx",
      "args": ["mcp-office-tools"]
    }
  }
}

Claude Code Configuration

claude mcp add office-tools -- uvx mcp-office-tools

🛠 Available Tools

Universal Tools

Work with all Office formats: Word, Excel, PowerPoint, CSV

| Tool | Description |
|---|---|
| extract_text | Extract text with optional formatting preservation |
| extract_images | Extract embedded images with size filtering |
| extract_metadata | Get document properties (author, dates, statistics) |
| detect_office_format | Identify format, version, encryption status |
| analyze_document_health | Check integrity, corruption, password protection |
| get_supported_formats | List all supported file extensions |
| index_document | Scan document and create resource URIs for on-demand fetching |

Word Tools

| Tool | Description |
|---|---|
| convert_to_markdown | Convert to Markdown with automatic pagination for large docs |
| extract_word_tables | Extract tables as structured JSON, CSV, or Markdown |
| analyze_word_structure | Analyze headings, sections, styles, and document hierarchy |
| get_document_outline | Get structured outline with chapter detection and word counts |
| check_style_consistency | Find formatting issues, missing chapters, style problems |
| search_document | Search text with context and chapter location |
| extract_entities | Extract people, places, organizations using pattern recognition |
| get_chapter_summaries | Generate chapter previews with opening sentences |
| save_reading_progress | Bookmark your reading position for later |
| get_reading_progress | Resume reading from saved position |

Excel Tools

| Tool | Description |
|---|---|
| analyze_excel_data | Statistical analysis: data types, missing values, outliers |
| extract_excel_formulas | Extract formulas with values and dependency analysis |
| create_excel_chart_data | Generate Chart.js/Plotly-ready data from spreadsheets |

📋 Format Support

Here's what works and what's "good enough" — legacy formats from Office 97-2003 have more limited extraction, but they still work:

Format Extension Text Images Metadata Tables Formulas
Word (Modern) .docx -
Word (Legacy) .doc ⚠️ ⚠️ ⚠️ -
Word Template .dotx -
Word Macro .docm -
Excel (Modern) .xlsx
Excel (Legacy) .xls ⚠️ ⚠️ ⚠️
Excel Template .xltx
Excel Macro .xlsm
PowerPoint (Modern) .pptx -
PowerPoint (Legacy) .ppt ⚠️ ⚠️ ⚠️ -
PowerPoint Template .potx -
CSV .csv - ⚠️ -

Full support • ⚠️ Basic/partial support • - Not applicable


🔗 MCP Resources

Instead of returning entire documents in tool responses, you can index a document once and fetch content on demand via URI-based resources. This keeps context windows manageable when working with large files.

How It Works

  1. Index the document — index_document scans the file and returns URIs
  2. Fetch what you need — Request specific chapters, sheets, slides, or images by URI
  3. Format on demand — Append .txt or .html to get different output formats

Resource URI Patterns

| URI Pattern | Description | Example |
|---|---|---|
| chapter://{doc_id}/{n} | Single chapter/section | chapter://abc123/3 |
| chapters://{doc_id}/{range} | Multiple chapters | chapters://abc123/1-5 |
| section://{doc_id}/{n} | Section by heading style | section://abc123/2 |
| paragraph://{doc_id}/{ch}/{p} | Specific paragraph | paragraph://abc123/3/7 |
| sheet://{doc_id}/{name} | Excel sheet as markdown table | sheet://abc123/Revenue |
| slide://{doc_id}/{n} | PowerPoint slide | slide://abc123/5 |
| slides://{doc_id}/{range} | Multiple slides | slides://abc123/1,3,5 |
| image://{doc_id}/{n} | Embedded image | image://abc123/0 |

Format Suffixes

Append a format suffix to convert on the fly:

| Suffix | Output |
|---|---|
| .md (default) | Markdown |
| .txt | Plain text (no formatting) |
| .html | Basic HTML |

Examples:

  • chapter://abc123/3 → Markdown (default)
  • chapter://abc123/3.txt → Plain text
  • chapter://abc123/3.html → HTML
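
To make the URI and suffix rules concrete, here is a minimal sketch of how such a URI could be split into its parts. It is not the server's actual parser: the names parse_resource_uri, ParsedURI, and KNOWN_FORMATS are hypothetical, and it assumes that a missing suffix means Markdown.

```python
from typing import NamedTuple

KNOWN_FORMATS = ("md", "txt", "html")  # suffixes listed in the table above


class ParsedURI(NamedTuple):
    scheme: str  # e.g. "chapter"
    doc_id: str  # e.g. "abc123"
    path: str    # e.g. "3", "1-5", or "3/7"
    fmt: str     # "md", "txt", or "html"


def parse_resource_uri(uri: str) -> ParsedURI:
    """Split 'scheme://doc_id/path[.fmt]' into its components (hypothetical helper)."""
    scheme, sep, rest = uri.partition("://")
    if not sep or "/" not in rest:
        raise ValueError(f"not a resource URI: {uri!r}")
    doc_id, _, path = rest.partition("/")
    fmt = "md"  # assumed default when no suffix is given
    stem, dot, suffix = path.rpartition(".")
    # Only strip the suffix when it is a known format, so a sheet
    # named "Q1.data" keeps its full name.
    if dot and suffix in KNOWN_FORMATS:
        path, fmt = stem, suffix
    return ParsedURI(scheme, doc_id, path, fmt)
```

With this sketch, parse_resource_uri("chapter://abc123/3.txt") returns ParsedURI(scheme='chapter', doc_id='abc123', path='3', fmt='txt'), while the same URI without a suffix yields fmt='md'.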

Range Syntax

Fetch multiple items at once:

  • 1-5 → Items 1 through 5
  • 1,3,5 → Specific items
  • 1-3,7,9-10 → Mixed ranges
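
A range spec like the ones above can be expanded with a few lines of Python. This is an illustrative sketch only; parse_range is a hypothetical name, and the choices to deduplicate, sort, and reject descending ranges are assumptions rather than documented behavior.

```python
def parse_range(spec: str) -> list[int]:
    """Expand a spec such as '1-3,7,9-10' into a sorted list of unique ints."""
    items: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            items.update(range(start, end + 1))  # inclusive on both ends
        else:
            items.add(int(part))
    return sorted(items)
```

For example, parse_range("1-3,7,9-10") gives [1, 2, 3, 7, 9, 10].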

Section Detection

The indexer detects document structure automatically:

  1. Heading 1 styles (primary) — Business docs, manuals, technical documents
  2. "Chapter X" text patterns (fallback) — Books, manuscripts, narratives

Use text_patterns_only=True to skip heading style detection for documents with messy formatting.
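
The two-stage detection described above might look roughly like this. It is a simplified sketch, not the indexer's real code: detect_sections and CHAPTER_RE are hypothetical names, paragraphs are modeled as plain (style_name, text) pairs instead of python-docx objects, and the regex is deliberately loose.

```python
import re

# Matches lines such as "Chapter 1" or "CHAPTER IV" (assumed pattern).
CHAPTER_RE = re.compile(r"^\s*chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def detect_sections(paragraphs, text_patterns_only=False):
    """Return the indices of paragraphs that start a section.

    `paragraphs` is a list of (style_name, text) pairs.
    """
    if not text_patterns_only:
        # Primary strategy: trust "Heading 1" styles when any exist.
        by_style = [i for i, (style, _) in enumerate(paragraphs)
                    if style == "Heading 1"]
        if by_style:
            return by_style
    # Fallback strategy: look for "Chapter X" at the start of the text.
    return [i for i, (_, text) in enumerate(paragraphs)
            if CHAPTER_RE.match(text)]
```

Passing text_patterns_only=True skips the style check entirely, which mirrors the option mentioned above for documents whose heading styles are unreliable.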


🎯 MCP Prompts

Pre-built workflows that chain multiple tools together. Use these as starting points:

Prompt Level Description
explore-document Basic Start with any new document - get structure and identify issues
find-character Basic Track all mentions of a person/character with context
chapter-preview Basic Quick overview of each chapter without full read
resume-reading Intermediate Check saved position and continue reading
document-analysis Intermediate Comprehensive multi-tool analysis
character-journey Advanced Track character arc through entire narrative
document-comparison Advanced Compare entities and themes between chapters
full-reading-session Advanced Guided reading with bookmarking
manuscript-review Advanced Complete editorial workflow for editors

💡 Usage Examples

Extract Text from Any Document

# Simple extraction
result = await extract_text("report.docx")
print(result["text"])

# With formatting preserved
result = await extract_text(
    file_path="report.docx",
    preserve_formatting=True,
    include_metadata=True
)

Convert Word to Markdown (with Pagination)

Large documents get paginated automatically. Three ways to handle it:

# Option 1: Follow the cursor for each chunk
result = await convert_to_markdown("big-manual.docx")
if result.get("pagination", {}).get("has_more"):
    next_page = await convert_to_markdown(
        "big-manual.docx",
        cursor_id=result["pagination"]["cursor_id"]
    )

# Option 2: Grab specific pages
result = await convert_to_markdown("big-manual.docx", page_range="1-10")

# Option 3: Extract by chapter heading
result = await convert_to_markdown("big-manual.docx", chapter_name="Introduction")
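
Option 1 fetches only one extra chunk. To walk every chunk, the cursor can be followed in a loop, as in the sketch below. The helper collect_pages and its max_pages safety limit are hypothetical, and fake_convert is a stand-in that only imitates the pagination fields shown above so the example runs without the real server.

```python
import asyncio


async def collect_pages(convert, path, max_pages=100):
    """Call `convert` repeatedly, following cursor_id until has_more is false."""
    result = await convert(path)
    pages = [result]
    while result.get("pagination", {}).get("has_more") and len(pages) < max_pages:
        cursor = result["pagination"]["cursor_id"]
        result = await convert(path, cursor_id=cursor)
        pages.append(result)
    return pages


# Stand-in for convert_to_markdown, for demonstration only: serves three fake pages.
async def fake_convert(path, cursor_id=None):
    page = int(cursor_id or 0)
    return {
        "page": page,
        "pagination": {"has_more": page < 2, "cursor_id": str(page + 1)},
    }


pages = asyncio.run(collect_pages(fake_convert, "big-manual.docx"))
print(len(pages))  # 3
```

The max_pages cap is a defensive choice so that a misbehaving cursor cannot loop forever.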

Analyze Excel Data Quality

result = await analyze_excel_data(
    file_path="sales-data.xlsx",
    include_statistics=True,
    check_data_quality=True
)

# Returns per-column analysis
# {
#   "analysis": {
#     "Sheet1": {
#       "dimensions": {"rows": 1000, "columns": 12},
#       "column_info": {
#         "Revenue": {
#           "data_type": "float64",
#           "null_percentage": 2.3,
#           "statistics": {"mean": 45000, "median": 42000, ...},
#           "quality_issues": ["5 potential outliers"]
#         }
#       },
#       "data_quality": {
#         "completeness_percentage": 97.8,
#         "duplicate_rows": 12
#       }
#     }
#   }
# }

Extract Excel Formulas

result = await extract_excel_formulas(
    file_path="financial-model.xlsx",
    analyze_dependencies=True
)

# Returns formula details with dependency mapping
# {
#   "formulas": {
#     "Sheet1": [
#       {
#         "cell": "D2",
#         "formula": "=B2*C2",
#         "value": 1500.00,
#         "dependencies": ["B2", "C2"]
#       }
#     ]
#   }
# }

Generate Chart Data

result = await create_excel_chart_data(
    file_path="quarterly-revenue.xlsx",
    chart_type="line",
    output_format="chartjs"
)

# Returns ready-to-use Chart.js configuration
# {
#   "chartjs": {
#     "type": "line",
#     "data": {
#       "labels": ["Q1", "Q2", "Q3", "Q4"],
#       "datasets": [{"label": "Revenue", "data": [100, 120, 115, 140]}]
#     }
#   }
# }

Extract Word Tables

result = await extract_word_tables(
    file_path="contract.docx",
    output_format="markdown"
)

# Returns tables with optional format conversion
# {
#   "tables": [
#     {
#       "table_index": 0,
#       "dimensions": {"rows": 5, "columns": 3},
#       "converted_output": "| Name | Role | Department |\n|---|---|---|\n..."
#     }
#   ]
# }

Process Documents from URLs

# Documents are downloaded and cached automatically
result = await extract_text("https://example.com/report.docx")

# Cache expires after 1 hour by default
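
For intuition, a time-based cache with a one-hour expiry can be sketched as follows. This is not the project's caching.py: TTLCache is a hypothetical in-memory class, whereas the real cache presumably stores downloaded files on disk.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl=3600.0, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # url -> (stored_at, value)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl:
            del self._entries[url]  # expired: drop it and report a miss
            return None
        return value

    def put(self, url, value):
        self._entries[url] = (self._clock(), value)
```

A caller would check get(url) first and only download, then put(url, data), on a miss.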

Index Document for On-Demand Resource Fetching

# Index the document - returns URIs for all content
result = await index_document("novel.docx")

# Returns:
# {
#   "doc_id": "56036b0f171a",
#   "resources": {
#     "chapter": [
#       {"id": "1", "title": "Chapter 1: The Beginning", "uri": "chapter://56036b0f171a/1"},
#       {"id": "2", "title": "Chapter 2: Rising Action", "uri": "chapter://56036b0f171a/2"},
#       ...
#     ],
#     "image": [
#       {"id": "0", "uri": "image://56036b0f171a/0"},
#       ...
#     ]
#   }
# }

# Now fetch specific content via MCP resources:
# - chapter://56036b0f171a/1      → Chapter 1 as markdown
# - chapter://56036b0f171a/1.txt  → Chapter 1 as plain text
# - chapters://56036b0f171a/1-3   → Chapters 1-3 combined
# - image://56036b0f171a/0        → First embedded image

# Works with Excel and PowerPoint too:
await index_document("data.xlsx")
# → sheet://abc123/Revenue, sheet://abc123/Expenses, ...

await index_document("presentation.pptx")
# → slide://def456/1, slide://def456/2, ...

🧪 Testing

We built a visual test dashboard because staring at pytest output gets old. Run make test and you get an HTML report with pass/fail stats, detailed I/O for each test, and expandable tracebacks when things break.

# Run tests and generate the dashboard
make test

# Just pytest, no dashboard
make test-pytest

# Open existing dashboard
make view-dashboard

The dashboard has an MS Office-inspired theme (Word blue, Excel green, PowerPoint orange) and groups tests by category so you can see what's working at a glance.


🏗 Architecture

The mixin pattern keeps things modular — universal tools work on everything, format-specific tools go deeper. When the primary library can't handle something (corrupted files, weird formatting), we fall back to alternatives.

mcp-office-tools/
├── src/mcp_office_tools/
│   ├── server.py              # FastMCP server + resource templates
│   ├── resources.py           # Resource store for on-demand content
│   ├── mixins/
│   │   ├── universal.py       # Format-agnostic tools (incl. index_document)
│   │   ├── word.py            # Word-specific tools
│   │   ├── excel.py           # Excel-specific tools
│   │   └── powerpoint.py      # PowerPoint tools (WIP)
│   ├── utils/
│   │   ├── validation.py      # File validation
│   │   ├── file_detection.py  # Format detection
│   │   ├── caching.py         # URL caching
│   │   └── decorators.py      # Error handling, defaults
│   └── pagination.py          # Large document pagination
├── tests/                     # pytest test suite
└── reports/                   # Test dashboard output

Processing Libraries

| Format | Primary Library | Fallback |
|---|---|---|
| .docx | python-docx | mammoth |
| .xlsx | openpyxl | pandas |
| .pptx | python-pptx | - |
| .doc/.xls/.ppt | olefile | - |
| .csv | pandas | built-in csv |
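
The primary/fallback pairing in the table above could be driven by a small helper like the one below. This is a sketch of the general technique under stated assumptions, not the project's implementation: extract_with_fallback is a hypothetical name, and the extractors are passed in as plain callables so no Office library is required to run it.

```python
def extract_with_fallback(path, extractors):
    """Try each (name, func) extractor in order and return the first success.

    Raises RuntimeError listing every failure if none succeed, so errors
    are reported instead of being swallowed.
    """
    errors = []
    for name, func in extractors:
        try:
            return {"method": name, "text": func(path)}
        except Exception as exc:  # broad on purpose: any library error triggers the fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all extractors failed for {path!r}: " + "; ".join(errors))
```

In practice the list might hold one wrapper around python-docx and another around mammoth; recording the method that succeeded makes it easy to tell when a fallback was used.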

🔧 Development

# Clone and install
git clone https://github.com/yourusername/mcp-office-tools.git
cd mcp-office-tools
uv sync --dev

# Run tests
uv run pytest

# Format and lint
uv run black src/ tests/
uv run ruff check src/ tests/

# Type check
uv run mypy src/

📦 Dependencies

Core:

  • fastmcp - MCP server framework
  • python-docx - Word document processing
  • openpyxl - Excel spreadsheet processing
  • python-pptx - PowerPoint processing
  • pandas - Data analysis and CSV handling
  • mammoth - Word to HTML/Markdown conversion
  • olefile - Legacy OLE format support
  • xlrd - Legacy Excel support
  • pillow - Image processing
  • aiohttp / aiofiles - Async HTTP and file I/O

Optional:

  • python-magic - Enhanced MIME type detection
  • msoffcrypto-tool - Encrypted file detection

📝 Behind the Scenes

This README was rewritten during a human-AI collaboration session. The process raised questions about discernment, voice, and what makes documentation actually land.


📜 License

MIT License - see LICENSE for details.


Built with FastMCP and the Model Context Protocol
