mcp-office-tools/docs/DOCX_PROCESSING_FIXES.md

Commit 34e636e782 (Ryan Malloy, 2026-01-11): Add documentation for DOCX processing fixes. Documents 6 critical bugs discovered while processing a 200+ page manuscript, including the root-cause xpath API mismatch between python-docx and lxml that caused silent failures in chapter search.

# DOCX Processing Fixes

This document captures critical bugs discovered and fixed while processing complex Word documents (specifically a 200+ page manuscript with 10 chapters).

## Summary

| # | Bug | Impact | Root Cause |
|---|-----|--------|------------|
| 1 | FastMCP banner corruption | MCP connection fails | ASCII art breaks JSON-RPC |
| 2 | Page range cap | Wrong content extracted | Used max page number instead of count |
| 3 | Heading scan limit | Chapters not found | Only scanned first 100 elements |
| 4 | Short-text fallback logic | Chapters not found | `elif` prevented fallback |
| 5 | xpath API mismatch | Complete silent failure | python-docx != lxml API |
| 6 | Image mode default | Response too large | Base64 bloats output |

## 1. FastMCP Banner Corruption

**File:** `src/mcp_office_tools/server.py`

**Symptom:** MCP connection fails with `Invalid JSON: EOF while parsing`

**Cause:** FastMCP's default startup banner prints ASCII art to stdout, corrupting the JSON-RPC protocol on stdio transport.

**Fix:**

```python
def main():
    # CRITICAL: show_banner=False is required for stdio transport!
    # FastMCP's banner prints ASCII art to stdout, which breaks the JSON-RPC protocol
    app.run(show_banner=False)
```
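
Why a stray banner is fatal can be seen without FastMCP at all: a stdio client parses stdout as a stream of JSON messages, so any non-JSON preamble breaks the very first parse. A minimal sketch (the banner string here is hypothetical):

```python
import json

clean = '{"jsonrpc": "2.0", "id": 1, "result": {}}'
corrupted = "FastMCP  v2.0\n" + clean  # hypothetical banner text prepended to the stream

json.loads(clean)  # parses fine
try:
    json.loads(corrupted)
except json.JSONDecodeError as exc:
    # this is the client-side failure the symptom describes
    print(f"client-side parse failure: {exc}")
```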

## 2. Page Range Cap Bug

**File:** `src/mcp_office_tools/utils/word_processing.py`

**Symptom:** Requesting pages 1-5 returns truncated content, but requesting pages 195-200 returns everything.

**Cause:** The paragraph limit was calculated using the maximum page number instead of the count of pages requested.

**Before:**

```python
max_paragraphs = max(page_numbers) * 50  # pages 1-5 = 250 max, pages 195-200 = 10,000 max!
```

**After:**

```python
num_pages_requested = len(page_numbers)  # pages 1-5 = 5, pages 195-200 = 6
max_paragraphs = num_pages_requested * 300  # generous per-page limit
max_chars = num_pages_requested * 50000
```
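
The arithmetic is easy to verify with the numbers from the symptom above (the function names below are illustrative, not the actual code):

```python
def limit_buggy(page_numbers):
    return max(page_numbers) * 50   # scales with the highest page number

def limit_fixed(page_numbers):
    return len(page_numbers) * 300  # scales with how many pages were requested

early = list(range(1, 6))      # pages 1-5
late = list(range(195, 201))   # pages 195-200

print(limit_buggy(early), limit_buggy(late))  # 250 vs 10000: same-sized requests, wildly different caps
print(limit_fixed(early), limit_fixed(late))  # 1500 vs 1800: the cap now tracks request size
```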

## 3. Heading Scan Limit Bug

**File:** `src/mcp_office_tools/utils/word_processing.py`

**Symptom:** `_get_available_headings()` returns an empty list for documents with chapters beyond the first few pages.

**Cause:** The function only scanned the first 100 body elements, but Chapter 10 was at element 1524.

**Before:**

```python
for element in doc.element.body[:100]:  # Only first 100 elements!
    # find headings...
```

**After:**

```python
for element in doc.element.body:  # Scan ALL elements
    if len(headings) >= 30:
        break  # Limit output, not search
    # find headings...
```
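
The "limit the output, not the search" pattern generalizes beyond headings. A self-contained sketch (names and data are illustrative):

```python
def collect_headings(elements, is_heading, max_results=30):
    """Scan every element, but cap how many results are returned."""
    headings = []
    for element in elements:
        if len(headings) >= max_results:
            break  # the output is bounded; the scan is not
        if is_heading(element):
            headings.append(element)
    return headings

# A heading at position 1524 is still found, because the scan covers all elements
body = ["body paragraph"] * 1524 + ["Chapter 10"]
print(collect_headings(body, lambda e: e.startswith("Chapter")))  # ['Chapter 10']
```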

## 4. Short-Text Fallback Logic Bug

**File:** `src/mcp_office_tools/utils/word_processing.py`

**Symptom:** Chapter search fails even when the chapter text exists and is under 100 characters.

**Cause:** The `elif` for short-text detection was attached to `if style_elem`, meaning it only ran when NO style existed. Paragraphs with any non-heading style (`Normal`, `BodyText`, etc.) skipped the fallback entirely.

**Before:**

```python
if style_elem:
    if 'heading' in style_val.lower():
        chapter_start_idx = elem_idx
        break
elif len(text_content.strip()) < 100:  # Only runs if style_elem is empty!
    chapter_start_idx = elem_idx
    break
```

**After:**

```python
is_heading_style = False
if style_elem:
    style_val = style_elem[0].get(...)
    is_heading_style = 'heading' in style_val.lower()

# Independent check - runs regardless of whether style exists
if is_heading_style or len(text_content.strip()) < 100:
    chapter_start_idx = elem_idx
    break
```
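
The control-flow difference can be isolated in a few lines. This is a simplified stand-in for the real paragraph loop, not the actual code:

```python
def starts_chapter_buggy(style, text):
    if style:
        if "heading" in style.lower():
            return True
        # falls through: any non-heading style skips the short-text check entirely
    elif len(text.strip()) < 100:
        return True
    return False

def starts_chapter_fixed(style, text):
    is_heading_style = bool(style) and "heading" in style.lower()
    # independent check: short text qualifies regardless of style
    return is_heading_style or len(text.strip()) < 100

# A short chapter title styled 'Normal' is missed by the buggy version
print(starts_chapter_buggy("Normal", "Chapter 10"))  # False: the fallback never runs
print(starts_chapter_fixed("Normal", "Chapter 10"))  # True
```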

## 5. Critical xpath API Mismatch (ROOT CAUSE)

**File:** `src/mcp_office_tools/utils/word_processing.py`

**Symptom:** Chapter search always returns "not found", even for chapters that clearly exist.

**Cause:** python-docx wraps lxml elements with custom classes (`CT_Document`, `CT_Body`, `CT_P`) that override `xpath()` with a different method signature. Standard lxml accepts `xpath(expr, namespaces={...})`, but python-docx's version rejects the `namespaces` keyword argument.

All 8 xpath calls were wrapped in try/except blocks, so they silently failed: the chapter search never actually executed.

**Before (silently fails):**

```python
# These all throw: "BaseOxmlElement.xpath() got an unexpected keyword argument 'namespaces'"
text_elems = para.xpath('.//w:t', namespaces={'w': 'http://...'})
style_elem = para.xpath('.//w:pStyle', namespaces={'w': 'http://...'})
```

**After (works correctly):**

```python
from docx.oxml.ns import qn

# Use findall() with the qn() helper for text elements
text_elems = para.findall('.//' + qn('w:t'))
text_content = ''.join(t.text or '' for t in text_elems)

# Use a find() chain for nested elements (pStyle is inside pPr)
pPr = para.find(qn('w:pPr'))
if pPr is not None:
    pStyle = pPr.find(qn('w:pStyle'))
    if pStyle is not None:
        style_val = pStyle.get(qn('w:val'), '')
```

**Key Insight:** The `qn()` function from `docx.oxml.ns` converts prefixed names like `'w:t'` to their fully qualified form `'{http://...}t'`, which works with python-docx's element methods.
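
The same `find()`/`findall()` pattern can be exercised without python-docx installed, since Clark-notation names also work with the standard library's ElementTree. In this sketch, `qn_sketch()` is a hypothetical stand-in for `docx.oxml.ns.qn`, and the sample XML is a hand-built fragment, not real document output:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def qn_sketch(tag):
    """Mimic docx.oxml.ns.qn: convert 'w:t' to Clark notation '{namespace}t'."""
    _, local = tag.split(":")
    return "{%s}%s" % (W_NS, local)

xml = (
    '<w:p xmlns:w="%s">'
    '<w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    '<w:r><w:t>Chapter 1</w:t></w:r>'
    "</w:p>" % W_NS
)
para = ET.fromstring(xml)

# findall() with a qualified name finds the text runs
text = "".join(t.text or "" for t in para.findall(".//" + qn_sketch("w:t")))

# find() chain reaches the nested pStyle; attributes are also namespace-qualified
pPr = para.find(qn_sketch("w:pPr"))
style = pPr.find(qn_sketch("w:pStyle")).get(qn_sketch("w:val"))

print(text, style)  # Chapter 1 Heading1
```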


## 6. Image Mode Default

**File:** `src/mcp_office_tools/mixins/word.py`

**Symptom:** Responses exceed token limits when documents contain images.

**Cause:** The default `image_mode="base64"` embeds full image data inline, bloating responses.

**Fix:**

```python
image_mode: str = Field(
    default="files",  # Changed from "base64"
    description="Image handling mode: 'files' (saves to disk), 'base64' (embeds inline), 'references' (metadata only)"
)
```
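
The bloat is a property of base64 itself: encoding inflates binary data by roughly 4/3 before it even reaches the JSON layer, so a single modest image can dominate a response. A quick check with a synthetic payload (the `reference` dict is an illustrative sketch of the metadata-only alternative, not the actual schema):

```python
import base64

image_bytes = b"\x89PNG" + bytes(100_000)  # synthetic ~100 KB image payload
inline = base64.b64encode(image_bytes).decode("ascii")

# metadata-only alternative: a few dozen bytes instead of ~133 KB of text
reference = {"filename": "image1.png", "size_bytes": len(image_bytes)}

print(len(image_bytes), len(inline))  # 100004 133340: ~33% larger as inline text
```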

## Lessons Learned

1. **Silent failures are dangerous.** Wrapping xpath calls in try/except hid the API mismatch for months. Consider logging exceptions even when swallowing them.

2. **Test with real documents.** Unit tests with mocked data passed, but real documents exposed the xpath API issue immediately.

3. **python-docx is not lxml.** Despite being built on lxml, python-docx's element classes have different method signatures. Always use `qn()` and `findall()`/`find()` instead of `xpath()` with namespace dicts.

4. **Check your loop bounds.** Scanning the "first 100 elements" seemed reasonable but failed for long documents. Limit the output, not the search.

5. **Understand your conditionals.** The if/elif logic bug is subtle: the fallback was syntactically correct but semantically wrong for the use case.
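
Lesson 1 in practice: if an exception must be swallowed, record it first. A minimal sketch in which the failing call is simulated (the real failure was python-docx's `xpath()` rejecting the `namespaces` keyword):

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

def lookup():
    # simulate the failure mode from bug #5
    raise TypeError("BaseOxmlElement.xpath() got an unexpected keyword argument 'namespaces'")

try:
    result = lookup()
except Exception:
    # logger.exception() records the full traceback, so the fallback is never silent
    logger.exception("element lookup failed, using fallback")
    result = None
```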