# DOCX Processing Fixes

This document captures critical bugs discovered and fixed while processing complex Word documents (specifically a 200+ page manuscript with 10 chapters).

## Summary

| # | Bug | Impact | Root Cause |
|---|-----|--------|------------|
| 1 | FastMCP banner corruption | MCP connection fails | ASCII art breaks JSON-RPC |
| 2 | Page range cap | Wrong content extracted | Used max page# instead of count |
| 3 | Heading scan limit | Chapters not found | Only scanned first 100 elements |
| 4 | Short-text fallback logic | Chapters not found | `elif` prevented fallback |
| 5 | **xpath API mismatch** | **Complete silent failure** | **python-docx != lxml API** |
| 6 | Image mode default | Response too large | Base64 bloats output |

---

## 1. FastMCP Banner Corruption

**File:** `src/mcwaddams/server.py`

**Symptom:** MCP connection fails with `Invalid JSON: EOF while parsing`

**Cause:** FastMCP's default startup banner prints ASCII art to stdout, corrupting the JSON-RPC protocol on stdio transport.

**Fix:**
```python
def main():
    # CRITICAL: show_banner=False is required for stdio transport!
    # FastMCP's banner prints ASCII art to stdout which breaks JSON-RPC protocol
    app.run(show_banner=False)
```
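
To illustrate the cause, here is a minimal sketch using only the standard library. It assumes the client reads JSON messages from the server's stdout; the banner string and the `parse_stdio_message` helper are hypothetical stand-ins, and the exact error text depends on the client's JSON parser.

```python
import json


def parse_stdio_message(raw: str) -> dict:
    """Parse one JSON-RPC message the way a stdio client might (hypothetical helper)."""
    return json.loads(raw)


request = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "ping"})
banner = "=== hypothetical ASCII banner ===\n"  # stands in for any non-JSON stdout output

# A clean stream parses as expected.
clean_ok = parse_stdio_message(request)["method"] == "ping"

# Anything printed to stdout ahead of the message makes the payload invalid JSON.
try:
    parse_stdio_message(banner + request)
    polluted_ok = True
except json.JSONDecodeError:
    polluted_ok = False
```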

---

## 2. Page Range Cap Bug

**File:** `src/mcwaddams/utils/word_processing.py`

**Symptom:** Requesting pages 1-5 returns truncated content, but pages 195-200 returns everything.

**Cause:** The paragraph limit was calculated using the *maximum page number* instead of the *count of pages requested*.

**Before:**
```python
max_paragraphs = max(page_numbers) * 50  # pages 1-5 = 250 max, pages 195-200 = 10,000 max!
```

**After:**
```python
num_pages_requested = len(page_numbers)  # pages 1-5 = 5, pages 195-200 = 6
max_paragraphs = num_pages_requested * 300  # Generous limit per page
max_chars = num_pages_requested * 50000
```
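
The arithmetic behind the two formulas can be checked with a small standalone sketch. The function and constant names here are hypothetical; only the multipliers (50 and 300) come from the snippets above.

```python
PARAGRAPHS_PER_PAGE_OLD = 50   # multiplier from the "Before" snippet
PARAGRAPHS_PER_PAGE_NEW = 300  # multiplier from the "After" snippet


def old_limit(page_numbers):
    """Buggy cap: scales with the highest page number requested."""
    return max(page_numbers) * PARAGRAPHS_PER_PAGE_OLD


def new_limit(page_numbers):
    """Fixed cap: scales with how many pages were requested."""
    return len(page_numbers) * PARAGRAPHS_PER_PAGE_NEW


first_pages = list(range(1, 6))     # pages 1-5
last_pages = list(range(195, 201))  # pages 195-200

print(old_limit(first_pages), old_limit(last_pages))  # 250 10000
print(new_limit(first_pages), new_limit(last_pages))  # 1500 1800
```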

---

## 3. Heading Scan Limit Bug

**File:** `src/mcwaddams/utils/word_processing.py`

**Symptom:** `_get_available_headings()` returns an empty list for documents whose chapters start beyond the first few pages.

**Cause:** The function only scanned the first 100 body elements, but Chapter 10 was at element 1524.

**Before:**
```python
for element in doc.element.body[:100]:  # Only first 100 elements!
    # find headings...
```

**After:**
```python
for element in doc.element.body:  # Scan ALL elements
    if len(headings) >= 30:
        break  # Limit output, not search
    # find headings...
```
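
The difference between capping the search window and capping the result count can be reproduced without python-docx. In this sketch the body is modeled as a plain list of `(style, text)` tuples, and `collect_headings` is a hypothetical simplification, not the project's `_get_available_headings()`.

```python
def collect_headings(elements, max_results=30):
    """Scan every element, stopping only once enough headings are collected."""
    headings = []
    for index, (style, text) in enumerate(elements):
        if len(headings) >= max_results:
            break  # cap the output, not the search window
        if style.lower().startswith("heading"):
            headings.append((index, text))
    return headings


# Hypothetical document: 1600 body paragraphs with a chapter heading every
# 150 elements, placed so that Chapter 10 lands at element 1524.
elements = [("Normal", f"paragraph {i}") for i in range(1600)]
for chapter, index in enumerate(range(174, 1600, 150), start=1):
    elements[index] = ("Heading 1", f"Chapter {chapter}")

found = collect_headings(elements)           # full scan finds all 10 chapters
windowed = collect_headings(elements[:100])  # old behavior: finds nothing
```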

---

## 4. Short-Text Fallback Logic Bug

**File:** `src/mcwaddams/utils/word_processing.py`

**Symptom:** Chapter search fails even when chapter text exists and is under 100 characters.

**Cause:** The `elif` for short-text detection was attached to `if style_elem`, meaning it only ran when NO style existed. Paragraphs with any non-heading style (Normal, BodyText, etc.) skipped the fallback entirely.

**Before:**
```python
if style_elem:
    if 'heading' in style_val.lower():
        chapter_start_idx = elem_idx
        break
elif len(text_content.strip()) < 100:  # Only runs if style_elem is empty!
    chapter_start_idx = elem_idx
    break
```

**After:**
```python
is_heading_style = False
if style_elem:
    style_val = style_elem[0].get(...)
    is_heading_style = 'heading' in style_val.lower()

# Independent check - runs regardless of whether style exists
if is_heading_style or len(text_content.strip()) < 100:
    chapter_start_idx = elem_idx
    break
```
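
The two control-flow shapes can be compared side by side in a reduced form. This sketch assumes the style has already been extracted to a string (or `None`); both predicate functions are hypothetical reductions of the loop bodies above.

```python
def is_chapter_start_before(style_val, text):
    """Original shape: the length fallback is an elif of the style check."""
    if style_val:
        if "heading" in style_val.lower():
            return True
    elif len(text.strip()) < 100:  # reached only when there is no style at all
        return True
    return False


def is_chapter_start_after(style_val, text):
    """Fixed shape: the length fallback is evaluated independently."""
    is_heading_style = bool(style_val) and "heading" in style_val.lower()
    return is_heading_style or len(text.strip()) < 100


cases = [
    ("Heading1", "Chapter 3"),  # heading style: both versions match
    ("Normal", "Chapter 3"),    # short text with a body style: only the fix matches
    (None, "Chapter 3"),        # short text with no style: both versions match
    ("Normal", "x" * 250),      # long body text: neither version matches
]
for style_val, text in cases:
    print(style_val, is_chapter_start_before(style_val, text),
          is_chapter_start_after(style_val, text))
```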

---

## 5. Critical xpath API Mismatch (ROOT CAUSE)

**File:** `src/mcwaddams/utils/word_processing.py`

**Symptom:** Chapter search always returns "not found" even for chapters that clearly exist.

**Cause:** python-docx's element classes (`CT_Document`, `CT_Body`, `CT_P`) are built on lxml elements but override `xpath()` with a **different method signature**. Standard lxml accepts `xpath(expr, namespaces={...})`, but python-docx's version takes only the expression string and **rejects the `namespaces` keyword argument**.

All 8 xpath calls were wrapped in try/except blocks, so each raised a `TypeError` that was swallowed. They **silently failed** - the chapter search never actually executed.

**Before (silently fails):**
```python
# These all throw: "BaseOxmlElement.xpath() got an unexpected keyword argument 'namespaces'"
text_elems = para.xpath('.//w:t', namespaces={'w': 'http://...'})
style_elem = para.xpath('.//w:pStyle', namespaces={'w': 'http://...'})
```

**After (works correctly):**
```python
from docx.oxml.ns import qn

# Use findall() with qn() helper for text elements
text_elems = para.findall('.//' + qn('w:t'))
text_content = ''.join(t.text or '' for t in text_elems)

# Use find() chain for nested elements (pStyle is inside pPr)
pPr = para.find(qn('w:pPr'))
if pPr is not None:
    pStyle = pPr.find(qn('w:pStyle'))
    if pStyle is not None:
        style_val = pStyle.get(qn('w:val'), '')
```

**Key Insight:** The `qn()` function from `docx.oxml.ns` converts prefixed names like `'w:t'` to their fully qualified form `'{http://...}t'`, which works with python-docx's element methods.

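
The qualified-name form can be tried without python-docx installed. This sketch uses the standard library's `xml.etree.ElementTree`, which accepts the same `{namespace}tag` notation in `find()` and `findall()`. `qn_sketch` is a hypothetical stand-in for `qn()` that handles only the `w` prefix, the sample paragraph XML is invented for illustration, and the namespace URI is the standard WordprocessingML main namespace.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def qn_sketch(tag: str) -> str:
    """Stand-in for docx.oxml.ns.qn, limited to the 'w' prefix."""
    prefix, local = tag.split(":")
    if prefix != "w":
        raise ValueError(f"unsupported prefix: {prefix}")
    return f"{{{W_NS}}}{local}"


# Invented sample paragraph: a heading whose text is split across two runs.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W_NS}">
  <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:t>Chapter </w:t></w:r>
  <w:r><w:t>10</w:t></w:r>
</w:p>
""".strip()

para = ET.fromstring(PARAGRAPH_XML)

# Same access pattern as the "After" snippet, with qn_sketch in place of qn.
text_elems = para.findall(".//" + qn_sketch("w:t"))
text_content = "".join(t.text or "" for t in text_elems)

style_val = ""
pPr = para.find(qn_sketch("w:pPr"))
if pPr is not None:
    pStyle = pPr.find(qn_sketch("w:pStyle"))
    if pStyle is not None:
        style_val = pStyle.get(qn_sketch("w:val"), "")

print(text_content, style_val)  # Chapter 10 Heading1
```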

---

## 6. Image Mode Default

**File:** `src/mcwaddams/mixins/word.py`

**Symptom:** Responses exceed token limits when documents contain images.

**Cause:** Default `image_mode="base64"` embeds full image data inline, bloating responses.

**Fix:**
```python
image_mode: str = Field(
    default="files",  # Changed from "base64"
    description="Image handling mode: 'files' (saves to disk), 'base64' (embeds inline), 'references' (metadata only)"
)
```

---

## Lessons Learned

1. **Silent failures are dangerous.** Wrapping xpath calls in try/except hid the API mismatch for months. Consider logging exceptions even when swallowing them.

2. **Test with real documents.** Unit tests with mocked data passed, but real documents exposed the xpath API issue immediately.

3. **python-docx is not lxml.** Despite being built on lxml, python-docx's element classes have different method signatures. Always use `qn()` and `findall()`/`find()` instead of `xpath()` with namespace dicts.

4. **Check your loop bounds.** Scanning "first 100 elements" seemed reasonable but failed for long documents. Limit the *output*, not the *search*.

5. **Understand your conditionals.** The `if/elif` logic bug is subtle - the fallback was syntactically correct but semantically wrong for the use case.
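
As one way to act on lesson 1, the sketch below keeps the "swallow and continue" behavior but records the exception first. The logger name, `safe_query`, and `broken_query` are hypothetical; `broken_query` only imitates the failure mode by passing an unexpected keyword argument to a builtin.

```python
import logging

logger = logging.getLogger("docx_fixes_sketch")  # hypothetical logger name


def safe_query(element, query):
    """Run query(element); on failure, log the exception and return []."""
    try:
        return query(element)
    except Exception:
        # logger.exception records the message plus the full traceback
        logger.exception("query failed on %r", element)
        return []


def broken_query(element):
    # Imitates the bug: the callee rejects an unexpected keyword argument
    return len(element, namespaces={})


result_ok = safe_query("abc", len)            # normal path returns the value
result_bad = safe_query("abc", broken_query)  # TypeError is logged, [] returned
```

With this shape the fallback value is unchanged, but a signature mismatch shows up in the logs the first time it happens.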