# 📊 MCP Office Tools

**MCP server for extracting text, tables, images, and data from Microsoft Office files**

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/)
[![FastMCP](https://img.shields.io/badge/FastMCP-0.5+-green.svg?style=flat-square)](https://gofastmcp.com)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT)
[![MCP Protocol](https://img.shields.io/badge/MCP-Protocol-purple?style=flat-square)](https://modelcontextprotocol.io)

*Word, Excel, PowerPoint, CSV: all the formats your AI agent needs to read but can't*

[Installation](#-installation) • [Tools](#-available-tools) • [Examples](#-usage-examples) • [Testing](#-testing)


---

## ✨ Features

- **Universal extraction** - Pull text, images, and metadata from any Office format
- **Format-specific tools** - Deep analysis for Word (tables, structure), Excel (formulas, charts), PowerPoint
- **Automatic pagination** - Large documents get chunked so they don't blow up your context window
- **Fallback processing** - When one library chokes on a weird file, we try another. No silent failures.
- **URL support** - Pass a URL instead of a file path; we'll download and cache it
- **Legacy formats** - Yes, even those .doc and .xls files from 2003 still work

---

## 🚀 Installation

```bash
# Quick install with uvx (recommended)
uvx mcp-office-tools

# Or install with uv/pip
uv add mcp-office-tools
pip install mcp-office-tools
```

### Claude Desktop Configuration

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "office-tools": {
      "command": "uvx",
      "args": ["mcp-office-tools"]
    }
  }
}
```

### Claude Code Configuration

```bash
claude mcp add office-tools "uvx mcp-office-tools"
```

---

## 🛠 Available Tools

### Universal Tools

*Work with all Office formats: Word, Excel, PowerPoint, CSV*

| Tool | Description |
|------|-------------|
| `extract_text` | Extract text with optional formatting preservation |
| `extract_images` | Extract embedded images with size filtering |
| `extract_metadata` | Get document properties (author, dates, statistics) |
| `detect_office_format` | Identify format, version, encryption status |
| `analyze_document_health` | Check integrity, corruption, password protection |
| `get_supported_formats` | List all supported file extensions |
| `index_document` | Scan document and create resource URIs for on-demand fetching |

### Word Tools

| Tool | Description |
|------|-------------|
| `convert_to_markdown` | Convert to Markdown with automatic pagination for large docs |
| `extract_word_tables` | Extract tables as structured JSON, CSV, or Markdown |
| `analyze_word_structure` | Analyze headings, sections, styles, and document hierarchy |
| `get_document_outline` | Get structured outline with chapter detection and word counts |
| `check_style_consistency` | Find formatting issues, missing chapters, style problems |
| `search_document` | Search text with context and chapter location |
| `extract_entities` | Extract people, places, organizations using pattern recognition |
| `get_chapter_summaries` | Generate chapter previews with opening sentences |
| `save_reading_progress` | Bookmark your reading position for later |
| `get_reading_progress` | Resume reading from saved position |

### Excel Tools

| Tool | Description |
|------|-------------|
| `analyze_excel_data` | Statistical analysis: data types, missing values, outliers |
| `extract_excel_formulas` | Extract formulas with values and dependency analysis |
| `create_excel_chart_data` | Generate Chart.js/Plotly-ready data from spreadsheets |

---

## 📋 Format Support

Here's what works and what's "good enough". Legacy formats from Office 97-2003 have more limited extraction, but they still work:

| Format | Extension | Text | Images | Metadata | Tables | Formulas |
|--------|-----------|:----:|:------:|:--------:|:------:|:--------:|
| **Word (Modern)** | `.docx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word (Legacy)** | `.doc` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **Word Template** | `.dotx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word Macro** | `.docm` | ✅ | ✅ | ✅ | ✅ | - |
| **Excel (Modern)** | `.xlsx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel (Legacy)** | `.xls` | ✅ | ⚠️ | ⚠️ | ✅ | ⚠️ |
| **Excel Template** | `.xltx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel Macro** | `.xlsm` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **PowerPoint (Modern)** | `.pptx` | ✅ | ✅ | ✅ | ✅ | - |
| **PowerPoint (Legacy)** | `.ppt` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **PowerPoint Template** | `.potx` | ✅ | ✅ | ✅ | ✅ | - |
| **CSV** | `.csv` | ✅ | - | ⚠️ | ✅ | - |

✅ Full support • ⚠️ Basic/partial support • - Not applicable

---

## 🔗 MCP Resources

Instead of returning entire documents in tool responses, you can index a document once and fetch content on-demand via URI-based resources. This keeps context windows manageable when working with large files.

### How It Works

1. **Index the document** - `index_document` scans the file and returns URIs
2. **Fetch what you need** - Request specific chapters, sheets, slides, or images by URI
3. **Format on demand** - Append `.txt` or `.html` to get different output formats

### Resource URI Patterns

| URI Pattern | Description | Example |
|-------------|-------------|---------|
| `chapter://{doc_id}/{n}` | Single chapter/section | `chapter://abc123/3` |
| `chapters://{doc_id}/{range}` | Multiple chapters | `chapters://abc123/1-5` |
| `section://{doc_id}/{n}` | Section by heading style | `section://abc123/2` |
| `paragraph://{doc_id}/{ch}/{p}` | Specific paragraph | `paragraph://abc123/3/7` |
| `sheet://{doc_id}/{name}` | Excel sheet as markdown table | `sheet://abc123/Revenue` |
| `slide://{doc_id}/{n}` | PowerPoint slide | `slide://abc123/5` |
| `slides://{doc_id}/{range}` | Multiple slides | `slides://abc123/1,3,5` |
| `image://{doc_id}/{n}` | Embedded image | `image://abc123/0` |

### Format Suffixes

Append a format suffix to convert on the fly:

| Suffix | Output |
|--------|--------|
| `.md` (default) | Markdown |
| `.txt` | Plain text (no formatting) |
| `.html` | Basic HTML |

Examples:

- `chapter://abc123/3` → Markdown (default)
- `chapter://abc123/3.txt` → Plain text
- `chapter://abc123/3.html` → HTML

### Range Syntax

Fetch multiple items at once:

- `1-5` → Items 1 through 5
- `1,3,5` → Specific items
- `1-3,7,9-10` → Mixed ranges

### Section Detection

The indexer detects document structure automatically:

1. **Heading 1 styles** (primary) - Business docs, manuals, technical documents
2. **"Chapter X" text patterns** (fallback) - Books, manuscripts, narratives

Use `text_patterns_only=True` to skip heading style detection for documents with messy formatting.

---

## 🎯 MCP Prompts

Pre-built workflows that chain multiple tools together. Use these as starting points:

| Prompt | Level | Description |
|--------|-------|-------------|
| `explore-document` | Basic | Start with any new document - get structure and identify issues |
| `find-character` | Basic | Track all mentions of a person/character with context |
| `chapter-preview` | Basic | Quick overview of each chapter without full read |
| `resume-reading` | Intermediate | Check saved position and continue reading |
| `document-analysis` | Intermediate | Comprehensive multi-tool analysis |
| `character-journey` | Advanced | Track character arc through entire narrative |
| `document-comparison` | Advanced | Compare entities and themes between chapters |
| `full-reading-session` | Advanced | Guided reading with bookmarking |
| `manuscript-review` | Advanced | Complete editorial workflow for editors |

---

## 💡 Usage Examples

### Extract Text from Any Document

```python
# Simple extraction
result = await extract_text("report.docx")
print(result["text"])

# With formatting preserved
result = await extract_text(
    file_path="report.docx",
    preserve_formatting=True,
    include_metadata=True
)
```

### Convert Word to Markdown (with Pagination)

Large documents get paginated automatically. Three ways to handle it:

```python
# Option 1: Follow the cursor for each chunk
result = await convert_to_markdown("big-manual.docx")
if result.get("pagination", {}).get("has_more"):
    next_page = await convert_to_markdown(
        "big-manual.docx",
        cursor_id=result["pagination"]["cursor_id"]
    )

# Option 2: Grab specific pages
result = await convert_to_markdown("big-manual.docx", page_range="1-10")

# Option 3: Extract by chapter heading
result = await convert_to_markdown("big-manual.docx", chapter_name="Introduction")
```

### Analyze Excel Data Quality

```python
result = await analyze_excel_data(
    file_path="sales-data.xlsx",
    include_statistics=True,
    check_data_quality=True
)

# Returns per-column analysis
# {
#   "analysis": {
#     "Sheet1": {
#       "dimensions": {"rows": 1000, "columns": 12},
#       "column_info": {
#         "Revenue": {
#           "data_type": "float64",
#           "null_percentage": 2.3,
#           "statistics": {"mean": 45000, "median": 42000, ...},
#           "quality_issues": ["5 potential outliers"]
#         }
#       },
#       "data_quality": {
#         "completeness_percentage": 97.8,
#         "duplicate_rows": 12
#       }
#     }
#   }
# }
```

### Extract Excel Formulas

```python
result = await extract_excel_formulas(
    file_path="financial-model.xlsx",
    analyze_dependencies=True
)

# Returns formula details with dependency mapping
# {
#   "formulas": {
#     "Sheet1": [
#       {
#         "cell": "D2",
#         "formula": "=B2*C2",
#         "value": 1500.00,
#         "dependencies": ["B2", "C2"]
#       }
#     ]
#   }
# }
```

### Generate Chart Data

```python
result = await create_excel_chart_data(
    file_path="quarterly-revenue.xlsx",
    chart_type="line",
    output_format="chartjs"
)

# Returns ready-to-use Chart.js configuration
# {
#   "chartjs": {
#     "type": "line",
#     "data": {
#       "labels": ["Q1", "Q2", "Q3", "Q4"],
#       "datasets": [{"label": "Revenue", "data": [100, 120, 115, 140]}]
#     }
#   }
# }
```

### Extract Word Tables

```python
result = await extract_word_tables(
    file_path="contract.docx",
    output_format="markdown"
)

# Returns tables with optional format conversion
# {
#   "tables": [
#     {
#       "table_index": 0,
#       "dimensions": {"rows": 5, "columns": 3},
#       "converted_output": "| Name | Role | Department |\n|---|---|---|\n..."
#     }
#   ]
# }
```

### Process Documents from URLs

```python
# Documents are downloaded and cached automatically
result = await extract_text("https://example.com/report.docx")

# Cache expires after 1 hour by default
```

### Index Document for On-Demand Resource Fetching

```python
# Index the document - returns URIs for all content
result = await index_document("novel.docx")

# Returns:
# {
#   "doc_id": "56036b0f171a",
#   "resources": {
#     "chapter": [
#       {"id": "1", "title": "Chapter 1: The Beginning", "uri": "chapter://56036b0f171a/1"},
#       {"id": "2", "title": "Chapter 2: Rising Action", "uri": "chapter://56036b0f171a/2"},
#       ...
#     ],
#     "image": [
#       {"id": "0", "uri": "image://56036b0f171a/0"},
#       ...
#     ]
#   }
# }

# Now fetch specific content via MCP resources:
# - chapter://56036b0f171a/1 → Chapter 1 as markdown
# - chapter://56036b0f171a/1.txt → Chapter 1 as plain text
# - chapters://56036b0f171a/1-3 → Chapters 1-3 combined
# - image://56036b0f171a/0 → First embedded image

# Works with Excel and PowerPoint too:
await index_document("data.xlsx")
# → sheet://abc123/Revenue, sheet://abc123/Expenses, ...
await index_document("presentation.pptx")
# → slide://def456/1, slide://def456/2, ...
```

---

## 🧪 Testing

We built a visual test dashboard because staring at pytest output gets old. Run `make test` and you get an HTML report with pass/fail stats, detailed I/O for each test, and expandable tracebacks when things break.

```bash
# Run tests and generate the dashboard
make test

# Just pytest, no dashboard
make test-pytest

# Open existing dashboard
make view-dashboard
```

The dashboard has an MS Office-inspired theme (Word blue, Excel green, PowerPoint orange) and groups tests by category so you can see what's working at a glance.

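
When writing tests against the range-based resources (`chapters://`, `slides://`), it can help to have an independent reference for what a range string such as `1-3,7,9-10` should expand to. The sketch below is one way to do that. `expand_range` is a hypothetical helper, not part of this package, and how the server itself treats duplicates, ordering, or malformed input is not documented here, so the choices made below (de-duplicate, sort, raise `ValueError` on bad input) are assumptions.

```python
def expand_range(spec: str) -> list[int]:
    """Expand a range string like "1-3,7,9-10" into a sorted list of ints.

    Hypothetical reference helper: it mirrors the range syntax described in
    this README, not the server's actual parser.
    """
    items: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)  # ValueError if not numeric
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            items.update(range(start, end + 1))  # ranges are inclusive
        else:
            items.add(int(part))
    return sorted(items)


print(expand_range("1-5"))
print(expand_range("1,3,5"))
print(expand_range("1-3,7,9-10"))
```

A test could then compare the number of items returned for a `chapters://` or `slides://` request against `len(expand_range(...))` for the same range string, assuming every requested item exists in the document.
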

---

## 🏗 Architecture

The mixin pattern keeps things modular: universal tools work on everything, format-specific tools go deeper. When the primary library can't handle something (corrupted files, weird formatting), we fall back to alternatives.

```
mcp-office-tools/
├── src/mcp_office_tools/
│   ├── server.py              # FastMCP server + resource templates
│   ├── resources.py           # Resource store for on-demand content
│   ├── mixins/
│   │   ├── universal.py       # Format-agnostic tools (incl. index_document)
│   │   ├── word.py            # Word-specific tools
│   │   ├── excel.py           # Excel-specific tools
│   │   └── powerpoint.py      # PowerPoint tools (WIP)
│   ├── utils/
│   │   ├── validation.py      # File validation
│   │   ├── file_detection.py  # Format detection
│   │   ├── caching.py         # URL caching
│   │   └── decorators.py      # Error handling, defaults
│   └── pagination.py          # Large document pagination
├── tests/                     # pytest test suite
└── reports/                   # Test dashboard output
```

### Processing Libraries

| Format | Primary Library | Fallback |
|--------|----------------|----------|
| `.docx` | python-docx | mammoth |
| `.xlsx` | openpyxl | pandas |
| `.pptx` | python-pptx | - |
| `.doc`/`.xls`/`.ppt` | olefile | - |
| `.csv` | pandas | built-in csv |

---

## 🔧 Development

```bash
# Clone and install
git clone https://github.com/yourusername/mcp-office-tools.git
cd mcp-office-tools
uv sync --dev

# Run tests
uv run pytest

# Format and lint
uv run black src/ tests/
uv run ruff check src/ tests/

# Type check
uv run mypy src/
```

---

## 📦 Dependencies

**Core:**

- `fastmcp` - MCP server framework
- `python-docx` - Word document processing
- `openpyxl` - Excel spreadsheet processing
- `python-pptx` - PowerPoint processing
- `pandas` - Data analysis and CSV handling
- `mammoth` - Word to HTML/Markdown conversion
- `olefile` - Legacy OLE format support
- `xlrd` - Legacy Excel support
- `pillow` - Image processing
- `aiohttp` / `aiofiles` - Async HTTP and file I/O

**Optional:**

- `python-magic` - Enhanced MIME type detection
- `msoffcrypto-tool` - Encrypted file detection

---

## 🤝 Related Projects

- **[MCP PDF Tools](https://github.com/yourusername/mcp-pdf-tools)** - Companion server for PDF processing
- **[FastMCP](https://gofastmcp.com)** - The framework powering this server

## 📝 Behind the Scenes

This README was rewritten during a human-AI collaboration session. The process raised questions about discernment, voice, and what makes documentation actually land:

- **[AI Isn't New. Your Discernment Is What Matters.](https://ryanmalloy.com/blog/ai-discernment)** - Ryan's take on 40 years of writing code and why discernment matters more than the tools

---

## 📜 License

MIT License - see [LICENSE](LICENSE) for details.

---

**Built with [FastMCP](https://gofastmcp.com) and the [Model Context Protocol](https://modelcontextprotocol.io)**