From f159efab2c49d5575ba70012a5641106e8eb8a35 Mon Sep 17 00:00:00 2001
From: Ryan Malloy
Date: Sun, 11 Jan 2026 00:49:34 -0700
Subject: [PATCH] Improve README tone and clarity

- Replace generic opener with direct description
- Make feature bullets more conversational (less "feature list" mode)
- Add context before format support table
- Clarify pagination example with "three ways" structure
- Lead testing section with the dashboard hook
- Add architecture design rationale
- Remove "comprehensive" and "intelligent" buzzwords
---
 README.md | 56 +++++++++++++++++++++++++-------------------------------
 1 file changed, 25 insertions(+), 31 deletions(-)

diff --git a/README.md b/README.md
index 2d199cc..d6dafd8 100644
--- a/README.md
+++ b/README.md
@@ -2,14 +2,14 @@
 
 # 📊 MCP Office Tools
 
-**Comprehensive Microsoft Office document processing for AI agents**
+**MCP server for extracting text, tables, images, and data from Microsoft Office files**
 
 [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/)
 [![FastMCP](https://img.shields.io/badge/FastMCP-0.5+-green.svg?style=flat-square)](https://gofastmcp.com)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT)
 [![MCP Protocol](https://img.shields.io/badge/MCP-Protocol-purple?style=flat-square)](https://modelcontextprotocol.io)
 
-*Extract text, tables, images, formulas, and metadata from Word, Excel, PowerPoint, and CSV files*
+*Word, Excel, PowerPoint, CSV — all the formats your AI agent needs to read but can't*
 
 [Installation](#-installation) • [Tools](#-available-tools) • [Examples](#-usage-examples) • [Testing](#-testing)
 
@@ -19,12 +19,12 @@
 
 ## ✨ Features
 
-- **Universal extraction** - Text, images, and metadata from any Office format
-- **Format-specific tools** - Deep analysis for Word, Excel, and PowerPoint
-- **Intelligent pagination** - Large documents automatically chunked for AI context limits
-- **Multi-library fallbacks** - Never fails silently; tries multiple extraction methods
-- **URL support** - Process documents directly from HTTP/HTTPS URLs with caching
-- **Legacy format support** - Handles .doc, .xls, .ppt from Office 97-2003
+- **Universal extraction** — Pull text, images, and metadata from any Office format
+- **Format-specific tools** — Deep analysis for Word (tables, structure), Excel (formulas, charts), PowerPoint
+- **Automatic pagination** — Large documents get chunked so they don't blow up your context window
+- **Fallback processing** — When one library chokes on a weird file, we try another. No silent failures.
+- **URL support** — Pass a URL instead of a file path; we'll download and cache it
+- **Legacy formats** — Yes, even those .doc and .xls files from 2003 still work
 
 ---
 
@@ -96,6 +96,8 @@ claude mcp add office-tools "uvx mcp-office-tools"
 
 ## 📋 Format Support
 
+Here's what works and what's "good enough" — legacy formats from Office 97-2003 have more limited extraction, but they still work:
+
 | Format | Extension | Text | Images | Metadata | Tables | Formulas |
 |--------|-----------|:----:|:------:|:--------:|:------:|:--------:|
 | **Word (Modern)** | `.docx` | ✅ | ✅ | ✅ | ✅ | - |
@@ -134,28 +136,22 @@ result = await extract_text(
 
 ### Convert Word to Markdown (with Pagination)
 
-```python
-# For large documents, results are automatically paginated
-result = await convert_to_markdown("big-manual.docx")
+Large documents get paginated automatically. Three ways to handle it:
 
-# Continue with cursor for next page
+```python
+# Option 1: Follow the cursor for each chunk
+result = await convert_to_markdown("big-manual.docx")
 if result.get("pagination", {}).get("has_more"):
     next_page = await convert_to_markdown(
         "big-manual.docx",
         cursor_id=result["pagination"]["cursor_id"]
     )
 
-# Or use page ranges to get specific sections
-result = await convert_to_markdown(
-    "big-manual.docx",
-    page_range="1-10"
-)
+# Option 2: Grab specific pages
+result = await convert_to_markdown("big-manual.docx", page_range="1-10")
 
-# Or extract by chapter name
-result = await convert_to_markdown(
-    "big-manual.docx",
-    chapter_name="Introduction"
-)
+# Option 3: Extract by chapter heading
+result = await convert_to_markdown("big-manual.docx", chapter_name="Introduction")
 ```
 
 ### Analyze Excel Data Quality
@@ -266,29 +262,27 @@ result = await extract_text("https://example.com/report.docx")
 
 ## 🧪 Testing
 
-The project includes a comprehensive test suite with an interactive HTML dashboard:
+We built a visual test dashboard because staring at pytest output gets old. Run `make test` and you get an HTML report with pass/fail stats, detailed I/O for each test, and expandable tracebacks when things break.
 
 ```bash
-# Run all tests with dashboard generation
+# Run tests and generate the dashboard
 make test
 
-# Run just pytest
+# Just pytest, no dashboard
 make test-pytest
 
-# View the test dashboard
+# Open existing dashboard
 make view-dashboard
 ```
 
-The test dashboard shows:
-- Pass/fail statistics with MS Office-themed styling
-- Detailed inputs and outputs for each test
-- Expandable error tracebacks for failures
-- Category breakdown (Word, Excel, PowerPoint)
+The dashboard has an MS Office-inspired theme (Word blue, Excel green, PowerPoint orange) and groups tests by category so you can see what's working at a glance.
 
 ---
 
 ## 🏗 Architecture
 
+The mixin pattern keeps things modular — universal tools work on everything, format-specific tools go deeper. When the primary library can't handle something (corrupted files, weird formatting), we fall back to alternatives.
+
 ```
 mcp-office-tools/
 ├── src/mcp_office_tools/