<div align="center">

# 📎 mcwaddams

**MCP server for Microsoft Office document processing**

[Python](https://www.python.org/downloads/)
[FastMCP](https://gofastmcp.com)
[License: MIT](https://opensource.org/licenses/MIT)
[MCP](https://modelcontextprotocol.io)

*"I was told there would be document extraction."*

[Installation](#-installation) • [Tools](#-available-tools) • [Examples](#-usage-examples) • [Testing](#-testing)

</div>

---

## The Backstory

Milton Waddams was relocated to the basement. They took his stapler. But down there, surrounded by boxes of `.doc` files from 1997 and `.xls` spreadsheets that predate Unicode, he became something else entirely: a document processing expert.

This MCP server channels that energy. It handles the legacy formats nobody else wants to touch. It extracts text from files that should have been migrated to Google Docs a decade ago. It reads the TPS reports.

---

## ✨ Features

- **Universal extraction** — Pull text, images, and metadata from any Office format
- **Format-specific tools** — Deep analysis for Word (tables, structure), Excel (formulas, charts), PowerPoint
- **Automatic pagination** — Large documents get chunked so they don't blow up your context window
- **Fallback processing** — When one library chokes on a weird file, we try another
- **URL support** — Pass a URL instead of a file path; we'll download and cache it
- **Legacy formats** — Yes, even those `.doc` and `.xls` files from the basement

---

## 🚀 Installation

```bash
# Quick install with uvx (recommended)
uvx mcwaddams

# Or install with uv/pip
uv add mcwaddams
pip install mcwaddams
```

### Claude Desktop Configuration

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "mcwaddams": {
      "command": "uvx",
      "args": ["mcwaddams"]
    }
  }
}
```

### Claude Code Configuration

```bash
claude mcp add mcwaddams "uvx mcwaddams"
```

---

## 🛠 Available Tools

### Universal Tools
*Work with all Office formats: Word, Excel, PowerPoint, CSV*

| Tool | Description |
|------|-------------|
| `extract_text` | Extract text with optional formatting preservation |
| `extract_images` | Extract embedded images with size filtering |
| `extract_metadata` | Get document properties (author, dates, statistics) |
| `detect_office_format` | Identify format, version, encryption status |
| `analyze_document_health` | Check integrity, corruption, password protection |
| `get_supported_formats` | List all supported file extensions |
| `index_document` | Scan document and create resource URIs for on-demand fetching |

### Word Tools

| Tool | Description |
|------|-------------|
| `convert_to_markdown` | Convert to Markdown with automatic pagination for large docs |
| `extract_word_tables` | Extract tables as structured JSON, CSV, or Markdown |
| `analyze_word_structure` | Analyze headings, sections, styles, and document hierarchy |
| `get_document_outline` | Get structured outline with chapter detection and word counts |
| `check_style_consistency` | Find formatting issues, missing chapters, style problems |
| `search_document` | Search text with context and chapter location |
| `extract_entities` | Extract people, places, organizations using pattern recognition |
| `get_chapter_summaries` | Generate chapter previews with opening sentences |
| `save_reading_progress` | Bookmark your reading position for later |
| `get_reading_progress` | Resume reading from saved position |

### Excel Tools

| Tool | Description |
|------|-------------|
| `analyze_excel_data` | Statistical analysis: data types, missing values, outliers |
| `extract_excel_formulas` | Extract formulas with values and dependency analysis |
| `create_excel_chart_data` | Generate Chart.js/Plotly-ready data from spreadsheets |

---

## 📋 Format Support

Here's what works and what's "good enough" — legacy formats from Office 97-2003 have more limited extraction, but they still work:

| Format | Extension | Text | Images | Metadata | Tables | Formulas |
|--------|-----------|:----:|:------:|:--------:|:------:|:--------:|
| **Word (Modern)** | `.docx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word (Legacy)** | `.doc` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **Word Template** | `.dotx` | ✅ | ✅ | ✅ | ✅ | - |
| **Word Macro** | `.docm` | ✅ | ✅ | ✅ | ✅ | - |
| **Excel (Modern)** | `.xlsx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel (Legacy)** | `.xls` | ✅ | ⚠️ | ⚠️ | ✅ | ⚠️ |
| **Excel Template** | `.xltx` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Excel Macro** | `.xlsm` | ✅ | ✅ | ✅ | ✅ | ✅ |
| **PowerPoint (Modern)** | `.pptx` | ✅ | ✅ | ✅ | ✅ | - |
| **PowerPoint (Legacy)** | `.ppt` | ✅ | ⚠️ | ⚠️ | ⚠️ | - |
| **PowerPoint Template** | `.potx` | ✅ | ✅ | ✅ | ✅ | - |
| **CSV** | `.csv` | ✅ | - | ⚠️ | ✅ | - |

✅ Full support • ⚠️ Basic/partial support • - Not applicable

---

## 🔗 MCP Resources

Instead of returning entire documents in tool responses, you can index a document once and fetch content on-demand via URI-based resources. This keeps context windows manageable when working with large files.

### How It Works

1. **Index the document** — `index_document` scans the file and returns URIs
2. **Fetch what you need** — Request specific chapters, sheets, slides, or images by URI
3. **Format on demand** — Append `.txt` or `.html` to get different output formats

### Resource URI Patterns

| URI Pattern | Description | Example |
|-------------|-------------|---------|
| `chapter://{doc_id}/{n}` | Single chapter/section | `chapter://abc123/3` |
| `chapters://{doc_id}/{range}` | Multiple chapters | `chapters://abc123/1-5` |
| `section://{doc_id}/{n}` | Section by heading style | `section://abc123/2` |
| `paragraph://{doc_id}/{ch}/{p}` | Specific paragraph | `paragraph://abc123/3/7` |
| `sheet://{doc_id}/{name}` | Excel sheet as markdown table | `sheet://abc123/Revenue` |
| `slide://{doc_id}/{n}` | PowerPoint slide | `slide://abc123/5` |
| `slides://{doc_id}/{range}` | Multiple slides | `slides://abc123/1,3,5` |
| `image://{doc_id}/{n}` | Embedded image | `image://abc123/0` |

### Format Suffixes

Append a format suffix to convert on the fly:

| Suffix | Output |
|--------|--------|
| `.md` (default) | Markdown |
| `.txt` | Plain text (no formatting) |
| `.html` | Basic HTML |

Examples:
- `chapter://abc123/3` → Markdown (default)
- `chapter://abc123/3.txt` → Plain text
- `chapter://abc123/3.html` → HTML

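To make the URI layout concrete, here is a minimal client-side sketch that splits a resource URI into its parts. `ResourceRef` and `parse_resource_uri` are hypothetical names for illustration, not part of mcwaddams, and the server's own parsing may differ.

```python
from typing import NamedTuple

# Suffixes documented in the Format Suffixes table.
KNOWN_FORMATS = {"md", "txt", "html"}


class ResourceRef(NamedTuple):
    scheme: str  # e.g. "chapter", "sheet", "slides"
    doc_id: str  # identifier returned by index_document
    item: str    # chapter number, sheet name, range, ...
    fmt: str     # "md" (default), "txt", or "html"


def parse_resource_uri(uri: str) -> ResourceRef:
    """Split a resource URI into scheme, doc id, item, and output format."""
    scheme, sep, rest = uri.partition("://")
    if not sep or "/" not in rest:
        raise ValueError(f"not a resource URI: {uri!r}")
    doc_id, item = rest.split("/", 1)
    fmt = "md"
    # Only treat the text after the last dot as a format when it is one of
    # the documented suffixes, so a sheet named "Q1.2024" stays intact.
    stem, dot, suffix = item.rpartition(".")
    if dot and suffix in KNOWN_FORMATS:
        item, fmt = stem, suffix
    return ResourceRef(scheme, doc_id, item, fmt)


print(parse_resource_uri("chapter://abc123/3.txt"))
```

Checking the suffix against the documented set is a design choice in this sketch; it keeps names that happen to contain dots from being misread as format requests.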
### Range Syntax

Fetch multiple items at once:
- `1-5` → Items 1 through 5
- `1,3,5` → Specific items
- `1-3,7,9-10` → Mixed ranges

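A range spec like the ones listed here can be expanded with a few lines of Python. This is a sketch of the syntax only: `parse_range` is a hypothetical helper, and sorting, de-duplicating, and rejecting descending ranges are assumptions rather than documented server behaviour.

```python
def parse_range(spec: str) -> list[int]:
    """Expand a spec such as "1-3,7,9-10" into a sorted list of unique ints."""
    items: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty item in range spec: {spec!r}")
        if "-" in part:
            # "a-b" covers both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            items.update(range(start, end + 1))
        else:
            items.add(int(part))
    return sorted(items)


print(parse_range("1-3,7,9-10"))
```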
### Section Detection

The indexer detects document structure automatically:

1. **Heading 1 styles** (primary) — Business docs, manuals, technical documents
2. **"Chapter X" text patterns** (fallback) — Books, manuscripts, narratives

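The two-stage detection could be sketched roughly as follows. This is an illustration under stated assumptions, not the indexer's actual code: paragraphs are modelled as plain `(style_name, text)` tuples, `detect_sections` and `CHAPTER_PATTERN` are hypothetical names, and the regular expression is one guess at what a "Chapter X" pattern might match. The `text_patterns_only` flag mirrors the option described in this section.

```python
import re

# Assumed pattern: "Chapter" followed by digits or a Roman numeral.
CHAPTER_PATTERN = re.compile(r"^\s*chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def detect_sections(paragraphs, text_patterns_only=False):
    """Return (index, title) pairs for paragraphs that start a section.

    `paragraphs` is a list of (style_name, text) tuples.
    """
    if not text_patterns_only:
        # Primary strategy: trust "Heading 1" styles when any exist.
        by_style = [
            (i, text.strip())
            for i, (style, text) in enumerate(paragraphs)
            if style == "Heading 1" and text.strip()
        ]
        if by_style:
            return by_style
    # Fallback: look for "Chapter X" at the start of a paragraph.
    return [
        (i, text.strip())
        for i, (_, text) in enumerate(paragraphs)
        if CHAPTER_PATTERN.match(text)
    ]


novel = [
    ("Normal", "Chapter 1"),
    ("Normal", "It was a dark night."),
    ("Normal", "Chapter II: The Return"),
]
print(detect_sections(novel))
```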
Use `text_patterns_only=True` to skip heading style detection for documents with messy formatting.

---

## 🎯 MCP Prompts

Pre-built workflows that chain multiple tools together:

| Prompt | Level | Description |
|--------|-------|-------------|
| `explore-document` | Basic | Start with any new document - get structure and identify issues |
| `find-character` | Basic | Track all mentions of a person/character with context |
| `chapter-preview` | Basic | Quick overview of each chapter without full read |
| `resume-reading` | Intermediate | Check saved position and continue reading |
| `document-analysis` | Intermediate | Comprehensive multi-tool analysis |
| `character-journey` | Advanced | Track character arc through entire narrative |
| `document-comparison` | Advanced | Compare entities and themes between chapters |
| `full-reading-session` | Advanced | Guided reading with bookmarking |
| `manuscript-review` | Advanced | Complete editorial workflow for editors |

---

## 💡 Usage Examples

### Extract Text from Any Document

```python
# Simple extraction
result = await extract_text("report.docx")
print(result["text"])

# With formatting preserved
result = await extract_text(
    file_path="report.docx",
    preserve_formatting=True,
    include_metadata=True
)
```

### Convert Word to Markdown (with Pagination)

Large documents get paginated automatically. Three ways to handle it:

```python
# Option 1: Follow the cursor for each chunk
result = await convert_to_markdown("big-manual.docx")
if result.get("pagination", {}).get("has_more"):
    next_page = await convert_to_markdown(
        "big-manual.docx",
        cursor_id=result["pagination"]["cursor_id"]
    )

# Option 2: Grab specific pages
result = await convert_to_markdown("big-manual.docx", page_range="1-10")

# Option 3: Extract by chapter heading
result = await convert_to_markdown("big-manual.docx", chapter_name="Introduction")
```

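Option 1 above fetches a single follow-up page. To walk every page, the same `has_more` / `cursor_id` fields can drive a loop. The sketch below is an assumption-laden illustration: `collect_pages` is a hypothetical helper, and `fake_convert_to_markdown` is a stand-in with made-up data (including the `"markdown"` key) so the example runs without a server.

```python
import asyncio


async def collect_pages(fetch_page, file_path, max_pages=100):
    """Call `fetch_page` until the response reports no further pages."""
    pages = []
    cursor_id = None
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        kwargs = {"cursor_id": cursor_id} if cursor_id else {}
        result = await fetch_page(file_path, **kwargs)
        pages.append(result)
        pagination = result.get("pagination", {})
        if not pagination.get("has_more"):
            break
        cursor_id = pagination["cursor_id"]
    return pages


# Stand-in for the real tool (hypothetical data and response keys).
async def fake_convert_to_markdown(file_path, cursor_id=None):
    chunks = {None: ("# Part 1", "c1"), "c1": ("# Part 2", "c2"), "c2": ("# Part 3", None)}
    body, next_cursor = chunks[cursor_id]
    response = {"markdown": body, "pagination": {"has_more": next_cursor is not None}}
    if next_cursor:
        response["pagination"]["cursor_id"] = next_cursor
    return response


pages = asyncio.run(collect_pages(fake_convert_to_markdown, "big-manual.docx"))
print(len(pages))
```

The `max_pages` cap is a defensive choice for this sketch; drop or raise it if your documents legitimately span more chunks.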
### Analyze Excel Data Quality

```python
result = await analyze_excel_data(
    file_path="sales-data.xlsx",
    include_statistics=True,
    check_data_quality=True
)

# Returns per-column analysis with quality issues
```

### Index Document for On-Demand Resource Fetching

```python
# Index the document - returns URIs for all content
result = await index_document("novel.docx")

# Returns:
# {
#   "doc_id": "56036b0f171a",
#   "resources": {
#     "chapter": [
#       {"id": "1", "title": "Chapter 1", "uri": "chapter://56036b0f171a/1"},
#       ...
#     ],
#     "image": [
#       {"id": "0", "uri": "image://56036b0f171a/0"},
#       ...
#     ]
#   }
# }

# Fetch specific content via MCP resources:
# - chapter://56036b0f171a/1      → Chapter 1 as markdown
# - chapter://56036b0f171a/1.txt  → Chapter 1 as plain text
# - chapters://56036b0f171a/1-3   → Chapters 1-3 combined
```

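Once a document is indexed, a client will typically turn the response into a list of URIs to fetch. A minimal sketch, assuming the response has exactly the shape shown in the comment above; `resource_uris` is a hypothetical helper and the sample values are illustrative:

```python
def resource_uris(index_result, kind, fmt="md"):
    """List the URIs of one resource kind, optionally with a format suffix."""
    # Markdown is the default format, so it needs no suffix.
    suffix = "" if fmt == "md" else f".{fmt}"
    return [entry["uri"] + suffix for entry in index_result["resources"].get(kind, [])]


# Sample shaped like the index_document response (illustrative values).
index_result = {
    "doc_id": "56036b0f171a",
    "resources": {
        "chapter": [
            {"id": "1", "title": "Chapter 1", "uri": "chapter://56036b0f171a/1"},
            {"id": "2", "title": "Chapter 2", "uri": "chapter://56036b0f171a/2"},
        ],
        "image": [{"id": "0", "uri": "image://56036b0f171a/0"}],
    },
}

print(resource_uris(index_result, "chapter", fmt="txt"))
# ['chapter://56036b0f171a/1.txt', 'chapter://56036b0f171a/2.txt']
```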
---

## 🧪 Testing

```bash
# Run tests and generate the dashboard
make test

# Just pytest
make test-pytest

# Open dashboard
make view-dashboard
```

---

## 🏗 Architecture

The mixin pattern keeps things modular — universal tools work on everything, format-specific tools go deeper.

```
mcwaddams/
├── src/mcwaddams/
│   ├── server.py          # FastMCP server + resource templates
│   ├── resources.py       # Resource store for on-demand content
│   ├── mixins/
│   │   ├── universal.py   # Format-agnostic tools
│   │   ├── word.py        # Word-specific tools
│   │   ├── excel.py       # Excel-specific tools
│   │   └── powerpoint.py  # PowerPoint tools
│   ├── utils/             # Validation, caching, detection
│   └── pagination.py      # Large document pagination
├── tests/
└── reports/
```

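For readers unfamiliar with the mixin pattern, here is a toy sketch of the idea in plain Python. It does not use FastMCP and is not how the real mixins are written; the class names and method bodies are hypothetical placeholders that only borrow tool names from the tables in this README.

```python
class UniversalMixin:
    """Stand-in for format-agnostic tools (hypothetical)."""

    def detect_office_format(self, file_path: str) -> str:
        return f"universal: would detect the format of {file_path}"


class WordMixin:
    """Stand-in for Word-specific tools (hypothetical)."""

    def extract_word_tables(self, file_path: str) -> str:
        return f"word: would extract tables from {file_path}"


class ExcelMixin:
    """Stand-in for Excel-specific tools (hypothetical)."""

    def extract_excel_formulas(self, file_path: str) -> str:
        return f"excel: would extract formulas from {file_path}"


class OfficeServer(UniversalMixin, WordMixin, ExcelMixin):
    """Composes the mixins and reports which public tools it inherited."""

    def tool_names(self):
        return sorted(
            name
            for name in dir(self)
            if not name.startswith("_")
            and name != "tool_names"
            and callable(getattr(self, name))
        )


server = OfficeServer()
print(server.tool_names())
```

Each mixin can be developed and tested on its own, and the composed class picks up every tool without the server module needing to know the details.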
### Processing Libraries

| Format | Primary Library | Fallback |
|--------|----------------|----------|
| `.docx` | python-docx | mammoth |
| `.xlsx` | openpyxl | pandas |
| `.pptx` | python-pptx | - |
| `.doc`/`.xls`/`.ppt` | olefile | - |
| `.csv` | pandas | built-in csv |

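The primary/fallback pairing in this table amounts to trying extractors in order until one succeeds. A minimal sketch of that idea, with stub extractors in place of the real libraries; `extract_with_fallback` and the stubs are hypothetical and do not reflect how mcwaddams wires its libraries together:

```python
def extract_with_fallback(file_path, extractors):
    """Try each (name, extractor) pair in order; return the first success."""
    errors = []
    for name, extractor in extractors:
        try:
            return {"text": extractor(file_path), "method": name}
        except Exception as exc:  # a real implementation would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all extractors failed for {file_path}: " + "; ".join(errors))


# Stubs standing in for a primary and a fallback library (hypothetical behaviour).
def primary_stub(path):
    raise ValueError("unsupported feature in file")


def fallback_stub(path):
    return f"text of {path}"


result = extract_with_fallback(
    "weird.docx", [("primary", primary_stub), ("fallback", fallback_stub)]
)
print(result["method"])
```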
---

## 🔧 Development

```bash
git clone https://github.com/ryanmalloy/mcwaddams.git
cd mcwaddams
uv sync --dev

uv run pytest
uv run black src/ tests/
uv run ruff check src/ tests/
```

---

## 👤 Author

**Ryan Malloy** — [ryanmalloy.com](https://ryanmalloy.com)

This package emerged from a human-AI collaboration session. The process raised questions about discernment, voice, and what makes tools actually useful:

- **[AI Isn't New. Your Discernment Is What Matters.](https://ryanmalloy.com/blog/ai-discernment)** — 40 years of writing code and why discernment matters more than the tools

---

## 📜 License

MIT License - see [LICENSE](LICENSE) for details.

---

<div align="center">

*Named for Milton Waddams, who was relocated to the basement with the legacy documents.*

*"I could set the building on fire..."*

**Built with [FastMCP](https://gofastmcp.com) and the [Model Context Protocol](https://modelcontextprotocol.io)**

</div>