# DOCX Processing Fixes

This document captures critical bugs discovered and fixed while processing complex Word documents (specifically a 200+ page manuscript with 10 chapters).

## Summary
| # | Bug | Impact | Root Cause |
|---|---|---|---|
| 1 | FastMCP banner corruption | MCP connection fails | ASCII art breaks JSON-RPC |
| 2 | Page range cap | Wrong content extracted | Used max page# instead of count |
| 3 | Heading scan limit | Chapters not found | Only scanned first 100 elements |
| 4 | Short-text fallback logic | Chapters not found | elif prevented fallback |
| 5 | xpath API mismatch | Complete silent failure | python-docx != lxml API |
| 6 | Image mode default | Response too large | Base64 bloats output |
## 1. FastMCP Banner Corruption

**File:** `src/mcwaddams/server.py`

**Symptom:** MCP connection fails with `Invalid JSON: EOF while parsing`.

**Cause:** FastMCP's default startup banner prints ASCII art to stdout, corrupting the JSON-RPC protocol on the stdio transport.

**Fix:**

```python
def main():
    # CRITICAL: show_banner=False is required for stdio transport!
    # FastMCP's banner prints ASCII art to stdout, which breaks the JSON-RPC protocol.
    app.run(show_banner=False)
```
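To see why a banner is fatal here, consider a minimal, library-agnostic sketch: a stdio JSON-RPC client parses stdout line by line, and the first non-JSON line aborts the session. This is illustrative Python, not FastMCP internals, and the banner string is a stand-in.

```python
import json

# A stdio JSON-RPC reader expects every stdout line to be valid JSON.
for line in ['ASCII banner art (stand-in)',
             '{"jsonrpc": "2.0", "id": 1, "result": {}}']:
    try:
        json.loads(line)
        print('ok:', line)
    except json.JSONDecodeError as exc:
        print('client would fail here:', exc)
```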
## 2. Page Range Cap Bug

**File:** `src/mcwaddams/utils/word_processing.py`

**Symptom:** Requesting pages 1-5 returns truncated content, but pages 195-200 returns everything.

**Cause:** The paragraph limit was calculated from the maximum page number instead of the count of pages requested.

**Before:**

```python
max_paragraphs = max(page_numbers) * 50  # pages 1-5 = 250 max, pages 195-200 = 10,000 max!
```

**After:**

```python
num_pages_requested = len(page_numbers)  # pages 1-5 = 5, pages 195-200 = 6
max_paragraphs = num_pages_requested * 300  # generous limit per page
max_chars = num_pages_requested * 50000
```
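As a sanity check, here is the corrected arithmetic in isolation. The helper name `extraction_budget` is hypothetical, used only to illustrate the before/after difference:

```python
# Hypothetical helper: budgets scale with how many pages were requested,
# not with how far into the document those pages sit.
def extraction_budget(page_numbers: list[int]) -> tuple[int, int]:
    num_pages_requested = len(page_numbers)
    return num_pages_requested * 300, num_pages_requested * 50000

print(extraction_budget(list(range(1, 6))))      # pages 1-5     -> (1500, 250000)
print(extraction_budget(list(range(195, 201))))  # pages 195-200 -> (1800, 300000)
```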
## 3. Heading Scan Limit Bug

**File:** `src/mcwaddams/utils/word_processing.py`

**Symptom:** `_get_available_headings()` returns an empty list for documents whose chapters start beyond the first few pages.

**Cause:** The function only scanned the first 100 body elements, but Chapter 10 was at element 1524.

**Before:**

```python
for element in doc.element.body[:100]:  # only the first 100 elements!
    # find headings...
```

**After:**

```python
for element in doc.element.body:  # scan ALL elements
    if len(headings) >= 30:
        break  # limit the output, not the search
    # find headings...
```
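Combining this fix with the `qn()`/`find()` pattern from fix #5 below, a self-contained sketch of the corrected scan might look like this. The function name and signature are illustrative, not the project's actual code:

```python
from docx import Document
from docx.oxml.ns import qn

def get_available_headings(path: str, max_results: int = 30) -> list[str]:
    """Walk the WHOLE body; cap only how many results we return."""
    doc = Document(path)
    headings: list[str] = []
    for element in doc.element.body:
        if len(headings) >= max_results:
            break  # limit the output, not the search
        pPr = element.find(qn('w:pPr'))  # None for tables, sectPr, etc.
        if pPr is None:
            continue
        pStyle = pPr.find(qn('w:pStyle'))
        if pStyle is not None and 'heading' in pStyle.get(qn('w:val'), '').lower():
            text = ''.join(t.text or '' for t in element.findall('.//' + qn('w:t')))
            headings.append(text)
    return headings
```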
## 4. Short-Text Fallback Logic Bug

**File:** `src/mcwaddams/utils/word_processing.py`

**Symptom:** Chapter search fails even when the chapter text exists and is under 100 characters.

**Cause:** The `elif` for short-text detection was attached to `if style_elem:`, meaning it only ran when NO style existed. Paragraphs with any non-heading style (`Normal`, `BodyText`, etc.) skipped the fallback entirely.
**Before:**

```python
if style_elem:
    if 'heading' in style_val.lower():
        chapter_start_idx = elem_idx
        break
elif len(text_content.strip()) < 100:  # only runs if style_elem is empty!
    chapter_start_idx = elem_idx
    break
```

**After:**

```python
is_heading_style = False
if style_elem:
    style_val = style_elem[0].get(...)
    is_heading_style = 'heading' in style_val.lower()

# Independent check: runs regardless of whether a style exists
if is_heading_style or len(text_content.strip()) < 100:
    chapter_start_idx = elem_idx
    break
```
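The difference is easiest to see stripped of all docx machinery. In this pure-Python repro (hypothetical helper names), the old logic rejects a short chapter title carrying a `Normal` style, while the new logic accepts it:

```python
def old_logic(style: str, text: str) -> bool:
    if style:                          # any style swallows the fallback
        return 'heading' in style.lower()
    elif len(text.strip()) < 100:      # reached only when style is empty
        return True
    return False

def new_logic(style: str, text: str) -> bool:
    is_heading_style = bool(style) and 'heading' in style.lower()
    return is_heading_style or len(text.strip()) < 100

print(old_logic('Normal', 'Chapter 10'))  # False: the bug
print(new_logic('Normal', 'Chapter 10'))  # True: the fix
```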
## 5. Critical xpath API Mismatch (ROOT CAUSE)

**File:** `src/mcwaddams/utils/word_processing.py`

**Symptom:** Chapter search always returns "not found", even for chapters that clearly exist.

**Cause:** python-docx wraps lxml elements in custom classes (`CT_Document`, `CT_Body`, `CT_P`) that override `xpath()` with a different method signature. Standard lxml accepts `xpath(expr, namespaces={...})`, but python-docx's version rejects the `namespaces` keyword argument.

All 8 xpath calls were wrapped in try/except blocks, so they failed silently: the chapter search never actually executed.
**Before (silently fails):**

```python
# These all throw: "BaseOxmlElement.xpath() got an unexpected keyword argument 'namespaces'"
text_elems = para.xpath('.//w:t', namespaces={'w': 'http://...'})
style_elem = para.xpath('.//w:pStyle', namespaces={'w': 'http://...'})
```

**After (works correctly):**

```python
from docx.oxml.ns import qn

# Use findall() with the qn() helper for text elements
text_elems = para.findall('.//' + qn('w:t'))
text_content = ''.join(t.text or '' for t in text_elems)

# Use a find() chain for nested elements (pStyle sits inside pPr)
pPr = para.find(qn('w:pPr'))
if pPr is not None:
    pStyle = pPr.find(qn('w:pStyle'))
    if pStyle is not None:
        style_val = pStyle.get(qn('w:val'), '')
```
**Key insight:** The `qn()` function from `docx.oxml.ns` converts prefixed names like `'w:t'` to their fully qualified form (`'{http://...}t'`), which works with python-docx's element methods.
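A quick REPL check makes the behavior concrete:

```python
from docx.oxml.ns import qn

# qn() expands a prefixed name into Clark notation, which plain
# find()/findall() accept without a separate namespaces dict.
print(qn('w:t'))
# {http://schemas.openxmlformats.org/wordprocessingml/2006/main}t
```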
## 6. Image Mode Default

**File:** `src/mcwaddams/mixins/word.py`

**Symptom:** Responses exceed token limits when documents contain images.

**Cause:** The default `image_mode="base64"` embeds full image data inline, bloating responses.

**Fix:**

```python
image_mode: str = Field(
    default="files",  # changed from "base64"
    description="Image handling mode: 'files' (saves to disk), 'base64' (embeds inline), 'references' (metadata only)"
)
```
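For a sense of scale: base64 inflates binary data by roughly a third, so a single inline image can dominate a response. A quick back-of-envelope in Python (illustrative numbers only):

```python
import base64
import os

raw = os.urandom(1_000_000)   # stand-in for a ~1 MB embedded image
encoded = base64.b64encode(raw)
print(len(encoded))           # 1333336 characters, before any JSON escaping
```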
## Lessons Learned

- **Silent failures are dangerous.** Wrapping xpath calls in try/except hid the API mismatch for months. Consider logging exceptions even when swallowing them.
- **Test with real documents.** Unit tests with mocked data passed, but real documents exposed the xpath API issue immediately.
- **python-docx is not lxml.** Despite being built on lxml, python-docx's element classes have different method signatures. Always use `qn()` and `findall()`/`find()` instead of `xpath()` with namespace dicts.
- **Check your loop bounds.** Scanning the "first 100 elements" seemed reasonable but failed for long documents. Limit the output, not the search.
- **Understand your conditionals.** The `if`/`elif` logic bug is subtle: the fallback was syntactically correct but semantically wrong for the use case.