Ryan Malloy 210aa99e0b Fix page range extraction for large documents and MCP connection

Bug fixes:
- Remove 100-paragraph cap that prevented extracting content past ~page 4
  Now calculates limit based on number of pages requested (300 paras/page)
- Add fallback page estimation when docs lack explicit page breaks
  Uses ~25 paragraphs per page for navigation in non-paginated docs
- Fix _get_available_headings to scan full document (was only first 100 elements)
  Headings like Chapter 10 at element 1524 were invisible
- Fix MCP connection by disabling FastMCP banner (show_banner=False)
  ASCII art banner was corrupting stdout JSON-RPC protocol

Changes:
- Default image_mode changed from 'base64' to 'files' to avoid huge responses
- Add proper .mcp.json config with command/args format
- Add test document to .gitignore for privacy

2026-01-11 04:27:56 -07:00
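
The banner fix in the commit notes is easier to picture with a toy reader. MCP's stdio transport carries newline-delimited JSON-RPC messages on stdout, so any extra line printed there lands in the middle of the protocol stream. The sketch below is only an illustration: `read_messages` is a hypothetical strict client, and the banner string is a stand-in, not FastMCP's actual output.

```python
import json

def read_messages(stdout_lines):
    """Parse newline-delimited JSON-RPC messages the way a strict client might."""
    messages = []
    for line in stdout_lines:
        if not line.strip():
            continue
        messages.append(json.loads(line))  # raises on any non-JSON line
    return messages

clean = ['{"jsonrpc": "2.0", "id": 1, "result": {}}']
with_banner = ["==== FastMCP ====", *clean]  # stand-in for an ASCII-art banner

print(read_messages(clean))
try:
    read_messages(with_banner)
except json.JSONDecodeError as exc:
    print(f"banner broke the stream: {exc}")
```

Anything a server wants to show a human should go to stderr, or be turned off as the commit does with show_banner=False.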

.github/workflows

Add MS Office-themed test dashboard with interactive reporting

2026-01-11 00:28:12 -07:00

examples

Initial commit: MCP Office Tools v0.1.0

2025-08-18 01:01:48 -06:00

reports

Add behind-the-scenes link to discernment blog post

2026-01-11 02:02:34 -07:00

src/mcp_office_tools

Fix page range extraction for large documents and MCP connection

2026-01-11 04:27:56 -07:00

tests

Fix page range extraction for large documents and MCP connection

2026-01-11 04:27:56 -07:00

.gitignore

Fix page range extraction for large documents and MCP connection

2026-01-11 04:27:56 -07:00

.mcp.json

Fix page range extraction for large documents and MCP connection

2026-01-11 04:27:56 -07:00

ADVANCED_TOOLS_PLAN.md

Add MS Office-themed test dashboard with interactive reporting

2026-01-11 00:28:12 -07:00

CLAUDE.md

Initial commit: MCP Office Tools v0.1.0

2025-08-18 01:01:48 -06:00

IMPLEMENTATION_STATUS.md

Initial commit: MCP Office Tools v0.1.0

2025-08-18 01:01:48 -06:00

LICENSE

Initial commit: MCP Office Tools v0.1.0

2025-08-18 01:01:48 -06:00

Makefile

Add MS Office-themed test dashboard with interactive reporting

2026-01-11 00:28:12 -07:00

pyproject.toml

Initial commit: MCP Office Tools v0.1.0

2025-08-18 01:01:48 -06:00

QUICKSTART_DASHBOARD.md

Add MS Office-themed test dashboard with interactive reporting

2026-01-11 00:28:12 -07:00

README.md

Add behind-the-scenes link to discernment blog post

2026-01-11 02:02:34 -07:00

run_dashboard_tests.py

Add MS Office-themed test dashboard with interactive reporting

2026-01-11 00:28:12 -07:00

test_mcp_tools.py

Add MS Office-themed test dashboard with interactive reporting

2026-01-11 00:28:12 -07:00

test_pagination.py

Implement cursor-based pagination system for large document processing

2025-09-26 19:06:05 -06:00

TESTING_STRATEGY.md

Fix FastMCP stdio server import

2025-09-26 15:49:00 -06:00

torture_test.py

Add decorators for field defaults and error handling, fix Excel performance

2026-01-10 23:51:30 -07:00

uv.lock

Add decorators for field defaults and error handling, fix Excel performance

2026-01-10 23:51:30 -07:00

view_dashboard.sh

Add MS Office-themed test dashboard with interactive reporting

2026-01-11 00:28:12 -07:00

README.md

📊 MCP Office Tools

MCP server for extracting text, tables, images, and data from Microsoft Office files

Word, Excel, PowerPoint, CSV — all the formats your AI agent needs to read but can't

Installation • Tools • Examples • Testing

✨ Features

Universal extraction — Pull text, images, and metadata from any Office format
Format-specific tools — Deep analysis for Word (tables, structure), Excel (formulas, charts), PowerPoint
Automatic pagination — Large documents get chunked so they don't blow up your context window
Fallback processing — When one library chokes on a weird file, we try another. No silent failures.
URL support — Pass a URL instead of a file path; we'll download and cache it
Legacy formats — Yes, even those .doc and .xls files from 2003 still work

🚀 Installation

# Quick install with uvx (recommended)
uvx mcp-office-tools

# Or install with uv/pip
uv add mcp-office-tools
pip install mcp-office-tools

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "office-tools": {
      "command": "uvx",
      "args": ["mcp-office-tools"]
    }
  }
}
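
If the config file already lists other servers, pasting the snippet over it would wipe them out. Here is a minimal sketch of merging the entry instead. The helper name `add_office_tools` is made up for illustration, and the config path is whatever your installation uses, so the demo runs against a temporary file.

```python
import json
import tempfile
from pathlib import Path

def add_office_tools(config_path: Path) -> dict:
    """Add the office-tools entry while keeping any servers already configured."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["office-tools"] = {"command": "uvx", "args": ["mcp-office-tools"]}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config

# Demo against a throwaway file; point this at your real config path instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "claude_desktop_config.json"
    demo_path.write_text('{"mcpServers": {"other": {"command": "x", "args": []}}}')
    merged = add_office_tools(demo_path)
    print(sorted(merged["mcpServers"]))
```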

Claude Code Configuration

claude mcp add office-tools -- uvx mcp-office-tools

🛠 Available Tools

Universal Tools

Work with all Office formats: Word, Excel, PowerPoint, CSV

Tool	Description
`extract_text`	Extract text with optional formatting preservation
`extract_images`	Extract embedded images with size filtering
`extract_metadata`	Get document properties (author, dates, statistics)
`detect_office_format`	Identify format, version, encryption status
`analyze_document_health`	Check integrity, corruption, password protection
`get_supported_formats`	List all supported file extensions

Word Tools

Tool	Description
`convert_to_markdown`	Convert to Markdown with automatic pagination for large docs
`extract_word_tables`	Extract tables as structured JSON, CSV, or Markdown
`analyze_word_structure`	Analyze headings, sections, styles, and document hierarchy

Excel Tools

Tool	Description
`analyze_excel_data`	Statistical analysis: data types, missing values, outliers
`extract_excel_formulas`	Extract formulas with values and dependency analysis
`create_excel_chart_data`	Generate Chart.js/Plotly-ready data from spreadsheets

📋 Format Support

Here's what works and what's "good enough" — legacy formats from Office 97-2003 have more limited extraction, but they still work:

Format	Extension	Text	Images	Metadata	Tables	Formulas
Word (Modern)	`.docx`	✅	✅	✅	✅	-
Word (Legacy)	`.doc`	✅	⚠️	⚠️	⚠️	-
Word Template	`.dotx`	✅	✅	✅	✅	-
Word Macro	`.docm`	✅	✅	✅	✅	-
Excel (Modern)	`.xlsx`	✅	✅	✅	✅	✅
Excel (Legacy)	`.xls`	✅	⚠️	⚠️	✅	⚠️
Excel Template	`.xltx`	✅	✅	✅	✅	✅
Excel Macro	`.xlsm`	✅	✅	✅	✅	✅
PowerPoint (Modern)	`.pptx`	✅	✅	✅	✅	-
PowerPoint (Legacy)	`.ppt`	✅	⚠️	⚠️	⚠️	-
PowerPoint Template	`.potx`	✅	✅	✅	✅	-
CSV	`.csv`	✅	-	⚠️	✅	-

✅ Full support • ⚠️ Basic/partial support • - Not applicable
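
The modern/legacy split in the table lines up with two different container formats: modern Office files are ZIP archives, while the 97-2003 formats are OLE2 compound files. A rough sketch of telling them apart from the first bytes follows. This is not how `detect_office_format` is implemented (the README does not say); the function name and return labels are invented for the example.

```python
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (.docx, .xlsx, .pptx)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file header (.doc, .xls, .ppt)

def sniff_container(header: bytes) -> str:
    """Classify a file by its leading bytes; the labels are illustrative only."""
    if header.startswith(ZIP_MAGIC):
        return "zip-container"
    if header.startswith(OLE_MAGIC):
        return "ole-container"
    return "unknown"

print(sniff_container(b"PK\x03\x04\x14\x00"))
print(sniff_container(OLE_MAGIC + b"\x00" * 8))
print(sniff_container(b"Name,Role\n"))
```

A ZIP signature alone does not prove the file is an Office document, so a real detector would still need to look inside the archive.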

💡 Usage Examples

Extract Text from Any Document

# Simple extraction
result = await extract_text("report.docx")
print(result["text"])

# With formatting preserved
result = await extract_text(
    file_path="report.docx",
    preserve_formatting=True,
    include_metadata=True
)

Convert Word to Markdown (with Pagination)

Large documents get paginated automatically. Three ways to handle it:

# Option 1: Follow the cursor until there are no more chunks
result = await convert_to_markdown("big-manual.docx")
chunks = [result]
while result.get("pagination", {}).get("has_more"):
    result = await convert_to_markdown(
        "big-manual.docx",
        cursor_id=result["pagination"]["cursor_id"]
    )
    chunks.append(result)

# Option 2: Grab specific pages
result = await convert_to_markdown("big-manual.docx", page_range="1-10")

# Option 3: Extract by chapter heading
result = await convert_to_markdown("big-manual.docx", chapter_name="Introduction")
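
For documents without explicit page breaks, a page range has to be mapped onto paragraphs somehow. The commit notes for the page-range fix quote roughly 25 paragraphs per page as the fallback estimate, and the sketch below applies that figure. The helper names are hypothetical, and accepting a single page such as "3" is an assumption; only the "1-10" form appears in the examples above.

```python
PARAGRAPHS_PER_PAGE = 25  # rough figure quoted in the commit notes

def parse_page_range(page_range: str) -> tuple[int, int]:
    """Turn "3" or "1-10" into an inclusive (first, last) pair of page numbers."""
    first, _, last = page_range.partition("-")
    start = int(first)
    end = int(last) if last else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {page_range!r}")
    return start, end

def estimate_paragraph_window(page_range: str) -> range:
    """Estimate which 0-based paragraph indexes cover the requested pages."""
    start, end = parse_page_range(page_range)
    return range((start - 1) * PARAGRAPHS_PER_PAGE, end * PARAGRAPHS_PER_PAGE)

window = estimate_paragraph_window("1-10")
print(window.start, window.stop)
```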

Analyze Excel Data Quality

result = await analyze_excel_data(
    file_path="sales-data.xlsx",
    include_statistics=True,
    check_data_quality=True
)

# Returns per-column analysis
# {
#   "analysis": {
#     "Sheet1": {
#       "dimensions": {"rows": 1000, "columns": 12},
#       "column_info": {
#         "Revenue": {
#           "data_type": "float64",
#           "null_percentage": 2.3,
#           "statistics": {"mean": 45000, "median": 42000, ...},
#           "quality_issues": ["5 potential outliers"]
#         }
#       },
#       "data_quality": {
#         "completeness_percentage": 97.8,
#         "duplicate_rows": 12
#       }
#     }
#   }
# }
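
To make the per-column fields above less abstract, here is a small standard-library sketch that computes a null percentage, mean, median, and outliers for one column. It is not the tool's implementation. The 1.5 x IQR outlier rule and the `column_quality` helper are assumptions picked for illustration, and the sample numbers are invented.

```python
import statistics

def column_quality(values):
    """Summarise one column; None marks a missing cell."""
    present = [v for v in values if v is not None]
    null_pct = round(100 * (len(values) - len(present)) / len(values), 1)
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = 1.5 * (q3 - q1)  # assumed outlier rule: 1.5 x interquartile range
    outliers = [v for v in present if v < q1 - spread or v > q3 + spread]
    return {
        "null_percentage": null_pct,
        "statistics": {
            "mean": statistics.mean(present),
            "median": statistics.median(present),
        },
        "outliers": outliers,
    }

revenue = [40, 42, 41, 43, None, 44, 42, 400]  # made-up sample column
report = column_quality(revenue)
print(report["null_percentage"], report["outliers"])
```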

Extract Excel Formulas

result = await extract_excel_formulas(
    file_path="financial-model.xlsx",
    analyze_dependencies=True
)

# Returns formula details with dependency mapping
# {
#   "formulas": {
#     "Sheet1": [
#       {
#         "cell": "D2",
#         "formula": "=B2*C2",
#         "value": 1500.00,
#         "dependencies": ["B2", "C2"]
#       }
#     ]
#   }
# }
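
The dependencies list in that output is essentially the set of cell references found in the formula text. A regex-based sketch of the idea follows. It is an assumption about the approach rather than the tool's actual parser, and it ignores sheet-qualified references, treats a range like A1:B3 as its two endpoints, and does not skip string literals.

```python
import re

# A1-style references such as B2, $C$10 or AA7. The lookarounds keep function
# names like LOG10( from being mistaken for cells.
CELL_REF = re.compile(r"(?<![A-Z0-9_])\$?([A-Z]{1,3})\$?([0-9]+)(?![0-9A-Z_(])")

def formula_dependencies(formula: str) -> list[str]:
    """Return referenced cells in first-seen order, without duplicates."""
    seen = []
    for col, row in CELL_REF.findall(formula.upper()):
        ref = f"{col}{row}"
        if ref not in seen:
            seen.append(ref)
    return seen

print(formula_dependencies("=B2*C2"))
print(formula_dependencies("=SUM($A$1, b2) + LOG10(A1)"))
```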

Generate Chart Data

result = await create_excel_chart_data(
    file_path="quarterly-revenue.xlsx",
    chart_type="line",
    output_format="chartjs"
)

# Returns ready-to-use Chart.js configuration
# {
#   "chartjs": {
#     "type": "line",
#     "data": {
#       "labels": ["Q1", "Q2", "Q3", "Q4"],
#       "datasets": [{"label": "Revenue", "data": [100, 120, 115, 140]}]
#     }
#   }
# }
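
The shape of that output is simple enough to build by hand, which helps when you want to see what the tool is doing with your columns. A minimal sketch, with a hypothetical `to_chartjs` helper and rows that mirror the sample values above:

```python
def to_chartjs(rows, label_key, value_key, chart_type="line"):
    """Shape a list of row dicts into a minimal Chart.js-style config."""
    return {
        "type": chart_type,
        "data": {
            "labels": [row[label_key] for row in rows],
            "datasets": [
                {"label": value_key, "data": [row[value_key] for row in rows]}
            ],
        },
    }

rows = [
    {"Quarter": "Q1", "Revenue": 100},
    {"Quarter": "Q2", "Revenue": 120},
    {"Quarter": "Q3", "Revenue": 115},
    {"Quarter": "Q4", "Revenue": 140},
]
config = to_chartjs(rows, "Quarter", "Revenue")
print(config["data"]["labels"])
```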

Extract Word Tables

result = await extract_word_tables(
    file_path="contract.docx",
    output_format="markdown"
)

# Returns tables with optional format conversion
# {
#   "tables": [
#     {
#       "table_index": 0,
#       "dimensions": {"rows": 5, "columns": 3},
#       "converted_output": "| Name | Role | Department |\n|---|---|---|\n..."
#     }
#   ]
# }
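
The converted_output string is a plain Markdown pipe table. As a sketch of how rows turn into that string (the `rows_to_markdown` helper and the sample row are invented, and cells containing a pipe character would need escaping in real use):

```python
def rows_to_markdown(rows):
    """Render rows of strings (first row is the header) as a Markdown pipe table."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

table = [
    ["Name", "Role", "Department"],
    ["Ada", "Engineer", "R&D"],  # made-up row for illustration
]
print(rows_to_markdown(table))
```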

Process Documents from URLs

# Documents are downloaded and cached automatically
result = await extract_text("https://example.com/report.docx")

# Cache expires after 1 hour by default
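
One way to picture a one-hour cache is a file per URL whose age is checked before reuse. The sketch below assumes that design. It is not taken from caching.py, and the helper names and the SHA-256 file naming are illustrative choices.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 3600  # the one-hour default mentioned above

def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / digest

def is_fresh(path: Path, now: Optional[float] = None) -> bool:
    """True if the cached file exists and is younger than the TTL."""
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return now - path.stat().st_mtime < CACHE_TTL_SECONDS

with tempfile.TemporaryDirectory() as tmp:
    cached = cache_path_for("https://example.com/report.docx", Path(tmp))
    print(is_fresh(cached))           # nothing downloaded yet
    cached.write_bytes(b"placeholder")
    print(is_fresh(cached))           # just written
    print(is_fresh(cached, now=time.time() + 2 * CACHE_TTL_SECONDS))
```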

🧪 Testing

We built a visual test dashboard because staring at pytest output gets old. Run make test and you get an HTML report with pass/fail stats, detailed I/O for each test, and expandable tracebacks when things break.

# Run tests and generate the dashboard
make test

# Just pytest, no dashboard
make test-pytest

# Open existing dashboard
make view-dashboard

The dashboard has an MS Office-inspired theme (Word blue, Excel green, PowerPoint orange) and groups tests by category so you can see what's working at a glance.

🏗 Architecture

The mixin pattern keeps things modular — universal tools work on everything, format-specific tools go deeper. When the primary library can't handle something (corrupted files, weird formatting), we fall back to alternatives.

mcp-office-tools/
├── src/mcp_office_tools/
│   ├── server.py              # FastMCP server entry point
│   ├── mixins/
│   │   ├── universal.py       # Format-agnostic tools
│   │   ├── word.py            # Word-specific tools
│   │   ├── excel.py           # Excel-specific tools
│   │   └── powerpoint.py      # PowerPoint tools (WIP)
│   ├── utils/
│   │   ├── validation.py      # File validation
│   │   ├── file_detection.py  # Format detection
│   │   ├── caching.py         # URL caching
│   │   └── decorators.py      # Error handling, defaults
│   └── pagination.py          # Large document pagination
├── tests/                     # pytest test suite
└── reports/                   # Test dashboard output
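
For readers who have not met the mixin pattern, here is a stripped-down sketch of the idea behind the layout above. Each mixin contributes its own tools and one server class inherits from all of them. The class names, the `tool_` prefix, and `list_tools` are invented for the example. How the real mixins register with FastMCP is not shown here.

```python
class UniversalMixin:
    """Tools that apply to every supported format."""
    def tool_extract_text(self, path: str) -> str:
        return f"text from {path}"

class WordMixin:
    """Tools that only make sense for Word documents."""
    def tool_extract_word_tables(self, path: str) -> str:
        return f"tables from {path}"

class ExcelMixin:
    """Tools that only make sense for spreadsheets."""
    def tool_extract_excel_formulas(self, path: str) -> str:
        return f"formulas from {path}"

class OfficeServer(UniversalMixin, WordMixin, ExcelMixin):
    """Composes the mixins; a naming prefix stands in for real tool registration."""
    def list_tools(self):
        prefix = "tool_"
        return sorted(n[len(prefix):] for n in dir(self) if n.startswith(prefix))

server = OfficeServer()
print(server.list_tools())
```

Adding a format then means adding one mixin file and one base class, without touching the others.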

Processing Libraries

Format	Primary Library	Fallback
`.docx`	python-docx	mammoth
`.xlsx`	openpyxl	pandas
`.pptx`	python-pptx	-
`.doc`/`.xls`/`.ppt`	olefile	-
`.csv`	pandas	built-in csv
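
The primary/fallback pairing in the table boils down to trying extractors in order and reporting every failure if none succeeds. A minimal sketch of that control flow, assuming a hypothetical `extract_with_fallback` helper and stand-in extractor functions rather than the real python-docx or mammoth calls:

```python
def extract_with_fallback(path, extractors):
    """Try each (name, callable) in order; raise with every error if none works."""
    errors = []
    for name, extract in extractors:
        try:
            return {"method": name, "text": extract(path)}
        except Exception as exc:  # record the failure and move to the next library
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all extractors failed for {path}: " + "; ".join(errors))

def primary(path):
    # Stand-in for the primary library choking on a damaged file.
    raise ValueError("corrupt document part")

def fallback(path):
    # Stand-in for the fallback library succeeding.
    return f"recovered text from {path}"

result = extract_with_fallback("odd.docx", [("primary", primary), ("fallback", fallback)])
print(result["method"])
```

Raising with the collected errors, instead of returning an empty result, is what keeps a failure from going unnoticed.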

🔧 Development

# Clone and install
git clone https://github.com/yourusername/mcp-office-tools.git
cd mcp-office-tools
uv sync --dev

# Run tests
uv run pytest

# Format and lint
uv run black src/ tests/
uv run ruff check src/ tests/

# Type check
uv run mypy src/

📦 Dependencies

Core:

fastmcp - MCP server framework
python-docx - Word document processing
openpyxl - Excel spreadsheet processing
python-pptx - PowerPoint processing
pandas - Data analysis and CSV handling
mammoth - Word to HTML/Markdown conversion
olefile - Legacy OLE format support
xlrd - Legacy Excel support
pillow - Image processing
aiohttp / aiofiles - Async HTTP and file I/O

Optional:

python-magic - Enhanced MIME type detection
msoffcrypto-tool - Encrypted file detection

🤝 Related Projects

MCP PDF Tools - Companion server for PDF processing
FastMCP - The framework powering this server

📝 Behind the Scenes

This README was rewritten during a human-AI collaboration session. The process raised questions about discernment, voice, and what makes documentation actually land:

AI Isn't New. Your Discernment Is What Matters. — Ryan's take on 40 years of writing code and why discernment matters more than the tools

📜 License

MIT License - see LICENSE for details.

Built with FastMCP and the Model Context Protocol

Languages

Python 89.5%

HTML 8.3%

Makefile 1.5%

Dockerfile 0.5%

Shell 0.2%
