# 📊 MCP Office Tools

**MCP server for extracting text, tables, images, and data from Microsoft Office files**

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/)
[![FastMCP](https://img.shields.io/badge/FastMCP-0.5+-green.svg?style=flat-square)](https://gofastmcp.com)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT)
[![MCP Protocol](https://img.shields.io/badge/MCP-Protocol-purple?style=flat-square)](https://modelcontextprotocol.io)

*Word, Excel, PowerPoint, CSV: all the formats your AI agent needs to read but can't*

[Installation](#-installation) • [Tools](#-available-tools) • [Examples](#-usage-examples) • [Testing](#-testing)


---

## ✨ Features

- **Universal extraction** - Pull text, images, and metadata from any Office format
- **Format-specific tools** - Deep analysis for Word (tables, structure), Excel (formulas, charts), PowerPoint
- **Automatic pagination** - Large documents get chunked so they don't blow up your context window
- **Fallback processing** - When one library chokes on a weird file, we try another. No silent failures.
- **URL support** - Pass a URL instead of a file path; we'll download and cache it
- **Legacy formats** - Yes, even those .doc and .xls files from 2003 still work

---

## 🚀 Installation

```bash
# Quick install with uvx (recommended)
uvx mcp-office-tools

# Or install with uv/pip
uv add mcp-office-tools
pip install mcp-office-tools
```

### Claude Desktop Configuration

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "office-tools": {
      "command": "uvx",
      "args": ["mcp-office-tools"]
    }
  }
}
```

### Claude Code Configuration

```bash
claude mcp add office-tools "uvx mcp-office-tools"
```

---

## 🛠 Available Tools

### Universal Tools

*Work with all Office formats: Word, Excel, PowerPoint, CSV*

| Tool | Description |
|------|-------------|
| `extract_text` | Extract text with optional formatting preservation |
| `extract_images` | Extract embedded images with size filtering |
| `extract_metadata` | Get document properties (author, dates, statistics) |
| `detect_office_format` | Identify format, version, encryption status |
| `analyze_document_health` | Check integrity, corruption, password protection |
| `get_supported_formats` | List all supported file extensions |

### Word Tools

| Tool | Description |
|------|-------------|
| `convert_to_markdown` | Convert to Markdown with automatic pagination for large docs |
| `extract_word_tables` | Extract tables as structured JSON, CSV, or Markdown |
| `analyze_word_structure` | Analyze headings, sections, styles, and document hierarchy |

### Excel Tools

| Tool | Description |
|------|-------------|
| `analyze_excel_data` | Statistical analysis: data types, missing values, outliers |
| `extract_excel_formulas` | Extract formulas with values and dependency analysis |
| `create_excel_chart_data` | Generate Chart.js/Plotly-ready data from spreadsheets |

---

## 📋 Format Support

Here's what works and what's "good enough". Legacy formats from Office 97-2003 have more limited extraction, but they still work:

| Format | Extension | Text | Images | Metadata | Tables | Formulas |
|--------|-----------|:----:|:------:|:--------:|:------:|:--------:|
| **Word (Modern)** | `.docx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word (Legacy)** | `.doc` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **Word Template** | `.dotx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word Macro** | `.docm` | ✅ | ✅ | ✅ | ✅ | - |
| **Excel (Modern)** | `.xlsx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel (Legacy)** | `.xls` | ✅ | ⚠️ | ⚠️ | ✅ | ⚠️ |
| **Excel Template** | `.xltx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel Macro** | `.xlsm` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **PowerPoint (Modern)** | `.pptx` | ✅ | ✅ | ✅ | ✅ | - |
| **PowerPoint (Legacy)** | `.ppt` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **PowerPoint Template** | `.potx` | ✅ | ✅ | ✅ | ✅ | - |
| **CSV** | `.csv` | ✅ | - | ⚠️ | ✅ | - |

✅ Full support • ⚠️ Basic/partial support • - Not applicable

---

## 💡 Usage Examples

### Extract Text from Any Document

```python
# Simple extraction
result = await extract_text("report.docx")
print(result["text"])

# With formatting preserved
result = await extract_text(
    file_path="report.docx",
    preserve_formatting=True,
    include_metadata=True
)
```

### Convert Word to Markdown (with Pagination)

Large documents get paginated automatically.
Three ways to handle it:

```python
# Option 1: Follow the cursor for each chunk
result = await convert_to_markdown("big-manual.docx")
if result.get("pagination", {}).get("has_more"):
    next_page = await convert_to_markdown(
        "big-manual.docx",
        cursor_id=result["pagination"]["cursor_id"]
    )

# Option 2: Grab specific pages
result = await convert_to_markdown("big-manual.docx", page_range="1-10")

# Option 3: Extract by chapter heading
result = await convert_to_markdown("big-manual.docx", chapter_name="Introduction")
```

### Analyze Excel Data Quality

```python
result = await analyze_excel_data(
    file_path="sales-data.xlsx",
    include_statistics=True,
    check_data_quality=True
)

# Returns per-column analysis
# {
#   "analysis": {
#     "Sheet1": {
#       "dimensions": {"rows": 1000, "columns": 12},
#       "column_info": {
#         "Revenue": {
#           "data_type": "float64",
#           "null_percentage": 2.3,
#           "statistics": {"mean": 45000, "median": 42000, ...},
#           "quality_issues": ["5 potential outliers"]
#         }
#       },
#       "data_quality": {
#         "completeness_percentage": 97.8,
#         "duplicate_rows": 12
#       }
#     }
#   }
# }
```

### Extract Excel Formulas

```python
result = await extract_excel_formulas(
    file_path="financial-model.xlsx",
    analyze_dependencies=True
)

# Returns formula details with dependency mapping
# {
#   "formulas": {
#     "Sheet1": [
#       {
#         "cell": "D2",
#         "formula": "=B2*C2",
#         "value": 1500.00,
#         "dependencies": ["B2", "C2"]
#       }
#     ]
#   }
# }
```

### Generate Chart Data

```python
result = await create_excel_chart_data(
    file_path="quarterly-revenue.xlsx",
    chart_type="line",
    output_format="chartjs"
)

# Returns ready-to-use Chart.js configuration
# {
#   "chartjs": {
#     "type": "line",
#     "data": {
#       "labels": ["Q1", "Q2", "Q3", "Q4"],
#       "datasets": [{"label": "Revenue", "data": [100, 120, 115, 140]}]
#     }
#   }
# }
```

### Extract Word Tables

```python
result = await extract_word_tables(
    file_path="contract.docx",
    output_format="markdown"
)

# Returns tables with optional format conversion
# {
#   "tables": [
#     {
#       "table_index": 0,
#       "dimensions": {"rows": 5, "columns": 3},
#       "converted_output": "| Name | Role | Department |\n|---|---|---|\n..."
#     }
#   ]
# }
```

### Process Documents from URLs

```python
# Documents are downloaded and cached automatically
result = await extract_text("https://example.com/report.docx")

# Cache expires after 1 hour by default
```

---

## 🧪 Testing

We built a visual test dashboard because staring at pytest output gets old. Run `make test` and you get an HTML report with pass/fail stats, detailed I/O for each test, and expandable tracebacks when things break.

```bash
# Run tests and generate the dashboard
make test

# Just pytest, no dashboard
make test-pytest

# Open existing dashboard
make view-dashboard
```

The dashboard has an MS Office-inspired theme (Word blue, Excel green, PowerPoint orange) and groups tests by category so you can see what's working at a glance.

---

## 🏗 Architecture

The mixin pattern keeps things modular: universal tools work on everything, format-specific tools go deeper. When the primary library can't handle something (corrupted files, weird formatting), we fall back to alternatives.

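To make the fallback idea concrete, here is a minimal sketch of a primary-then-fallback extraction chain. It is not the server's actual code: `extract_with_fallback`, `ExtractionError`, and the two stand-in extractors are hypothetical names made up for illustration, and the real mixins may structure this differently.

```python
from typing import Callable, Sequence


class ExtractionError(Exception):
    """Raised when every extractor in the chain has failed (hypothetical name)."""


def extract_with_fallback(
    path: str,
    extractors: Sequence[tuple[str, Callable[[str], str]]],
) -> dict:
    """Try each (name, extractor) pair in order and return the first success.

    Failures are recorded rather than swallowed, so the caller can see
    which libraries were tried and why they were skipped.
    """
    errors: dict[str, str] = {}
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any library error triggers the next fallback
            errors[name] = f"{type(exc).__name__}: {exc}"
            continue
        return {"text": text, "method": name, "errors": errors}
    # Nothing worked: fail loudly instead of returning empty text
    raise ExtractionError(f"all extractors failed for {path!r}: {errors}")


# Stand-in extractors (assumptions); real ones would wrap e.g. python-docx and mammoth
def flaky_primary(path: str) -> str:
    raise ValueError("corrupted document")


def simple_fallback(path: str) -> str:
    return f"text from {path}"


result = extract_with_fallback(
    "report.docx",
    [("primary", flaky_primary), ("fallback", simple_fallback)],
)
print(result["method"])  # fallback
```

Keeping the per-library errors in the result is one way to honor the "no silent failures" promise: the caller gets usable text and can still tell that the primary path broke.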
```
mcp-office-tools/
├── src/mcp_office_tools/
│   ├── server.py              # FastMCP server entry point
│   ├── mixins/
│   │   ├── universal.py       # Format-agnostic tools
│   │   ├── word.py            # Word-specific tools
│   │   ├── excel.py           # Excel-specific tools
│   │   └── powerpoint.py      # PowerPoint tools (WIP)
│   ├── utils/
│   │   ├── validation.py      # File validation
│   │   ├── file_detection.py  # Format detection
│   │   ├── caching.py         # URL caching
│   │   └── decorators.py      # Error handling, defaults
│   └── pagination.py          # Large document pagination
├── tests/                     # pytest test suite
└── reports/                   # Test dashboard output
```

### Processing Libraries

| Format | Primary Library | Fallback |
|--------|----------------|----------|
| `.docx` | python-docx | mammoth |
| `.xlsx` | openpyxl | pandas |
| `.pptx` | python-pptx | - |
| `.doc`/`.xls`/`.ppt` | olefile | - |
| `.csv` | pandas | built-in csv |

---

## 🔧 Development

```bash
# Clone and install
git clone https://github.com/yourusername/mcp-office-tools.git
cd mcp-office-tools
uv sync --dev

# Run tests
uv run pytest

# Format and lint
uv run black src/ tests/
uv run ruff check src/ tests/

# Type check
uv run mypy src/
```

---

## 📦 Dependencies

**Core:**
- `fastmcp` - MCP server framework
- `python-docx` - Word document processing
- `openpyxl` - Excel spreadsheet processing
- `python-pptx` - PowerPoint processing
- `pandas` - Data analysis and CSV handling
- `mammoth` - Word to HTML/Markdown conversion
- `olefile` - Legacy OLE format support
- `xlrd` - Legacy Excel support
- `pillow` - Image processing
- `aiohttp` / `aiofiles` - Async HTTP and file I/O

**Optional:**
- `python-magic` - Enhanced MIME type detection
- `msoffcrypto-tool` - Encrypted file detection

---

## 🤝 Related Projects

- **[MCP PDF Tools](https://github.com/yourusername/mcp-pdf-tools)** - Companion server for PDF processing
- **[FastMCP](https://gofastmcp.com)** - The framework powering this server

---

## 📜 License

MIT License - see [LICENSE](LICENSE) for details.

---

**Built with [FastMCP](https://gofastmcp.com) and the [Model Context Protocol](https://modelcontextprotocol.io)**