From 34e636e7826f24c8f5efe666b8df66fa0da8d711 Mon Sep 17 00:00:00 2001
From: Ryan Malloy
Date: Sun, 11 Jan 2026 06:47:39 -0700
Subject: [PATCH] Add documentation for DOCX processing fixes

Documents 6 critical bugs discovered while processing a 200+ page
manuscript, including the root cause xpath API mismatch between
python-docx and lxml that caused silent failures in chapter search.
---
 docs/DOCX_PROCESSING_FIXES.md | 181 ++++++++++++++++++++++++++++++++++
 1 file changed, 181 insertions(+)
 create mode 100644 docs/DOCX_PROCESSING_FIXES.md

diff --git a/docs/DOCX_PROCESSING_FIXES.md b/docs/DOCX_PROCESSING_FIXES.md
new file mode 100644
index 0000000..ae2dc2f
--- /dev/null
+++ b/docs/DOCX_PROCESSING_FIXES.md
@@ -0,0 +1,181 @@
+# DOCX Processing Fixes
+
+This document captures critical bugs discovered and fixed while processing complex Word documents (specifically a 200+ page manuscript with 10 chapters).
+
+## Summary
+
+| # | Bug | Impact | Root Cause |
+|---|-----|--------|------------|
+| 1 | FastMCP banner corruption | MCP connection fails | ASCII art breaks JSON-RPC |
+| 2 | Page range cap | Wrong content extracted | Used max page# instead of count |
+| 3 | Heading scan limit | Chapters not found | Only scanned first 100 elements |
+| 4 | Short-text fallback logic | Chapters not found | `elif` prevented fallback |
+| 5 | **xpath API mismatch** | **Complete silent failure** | **python-docx != lxml API** |
+| 6 | Image mode default | Response too large | Base64 bloats output |
+
+---
+
+## 1. FastMCP Banner Corruption
+
+**File:** `src/mcp_office_tools/server.py`
+
+**Symptom:** MCP connection fails with `Invalid JSON: EOF while parsing`
+
+**Cause:** FastMCP's default startup banner prints ASCII art to stdout, corrupting the JSON-RPC protocol on stdio transport.
+
+**Fix:**
+```python
+def main():
+    # CRITICAL: show_banner=False is required for stdio transport!
+    # FastMCP's banner prints ASCII art to stdout which breaks JSON-RPC protocol
+    app.run(show_banner=False)
+```
+
+---
+
+## 2. Page Range Cap Bug
+
+**File:** `src/mcp_office_tools/utils/word_processing.py`
+
+**Symptom:** Requesting pages 1-5 returns truncated content, but pages 195-200 returns everything.
+
+**Cause:** The paragraph limit was calculated using the *maximum page number* instead of the *count of pages requested*.
+
+**Before:**
+```python
+max_paragraphs = max(page_numbers) * 50  # pages 1-5 = 250 max, pages 195-200 = 10,000 max!
+```
+
+**After:**
+```python
+num_pages_requested = len(page_numbers)  # pages 1-5 = 5, pages 195-200 = 6
+max_paragraphs = num_pages_requested * 300  # Generous limit per page
+max_chars = num_pages_requested * 50000
+```
+
+---
+
+## 3. Heading Scan Limit Bug
+
+**File:** `src/mcp_office_tools/utils/word_processing.py`
+
+**Symptom:** `_get_available_headings()` returns empty list for documents with chapters beyond the first few pages.
+
+**Cause:** The function only scanned the first 100 body elements, but Chapter 10 was at element 1524.
+
+**Before:**
+```python
+for element in doc.element.body[:100]:  # Only first 100 elements!
+    # find headings...
+```
+
+**After:**
+```python
+for element in doc.element.body:  # Scan ALL elements
+    if len(headings) >= 30:
+        break  # Limit output, not search
+    # find headings...
+```
+
+---
+
+## 4. Short-Text Fallback Logic Bug
+
+**File:** `src/mcp_office_tools/utils/word_processing.py`
+
+**Symptom:** Chapter search fails even when chapter text exists and is under 100 characters.
+
+**Cause:** The `elif` for short-text detection was attached to `if style_elem`, meaning it only ran when NO style existed. Paragraphs with any non-heading style (Normal, BodyText, etc.) skipped the fallback entirely.
+
+**Before:**
+```python
+if style_elem:
+    if 'heading' in style_val.lower():
+        chapter_start_idx = elem_idx
+        break
+elif len(text_content.strip()) < 100:  # Only runs if style_elem is empty!
+    chapter_start_idx = elem_idx
+    break
+```
+
+**After:**
+```python
+is_heading_style = False
+if style_elem:
+    style_val = style_elem[0].get(...)
+    is_heading_style = 'heading' in style_val.lower()
+
+# Independent check - runs regardless of whether style exists
+if is_heading_style or len(text_content.strip()) < 100:
+    chapter_start_idx = elem_idx
+    break
+```
+
+---
+
+## 5. Critical xpath API Mismatch (ROOT CAUSE)
+
+**File:** `src/mcp_office_tools/utils/word_processing.py`
+
+**Symptom:** Chapter search always returns "not found" even for chapters that clearly exist.
+
+**Cause:** python-docx wraps lxml elements with custom classes (`CT_Document`, `CT_Body`, `CT_P`) that override `xpath()` with a **different method signature**. Standard lxml accepts `xpath(expr, namespaces={...})`, but python-docx's version **rejects the `namespaces` keyword argument**.
+
+All 8 xpath calls were wrapped in try/except blocks, so they **silently failed** - the chapter search never actually executed.
+
+**Before (silently fails):**
+```python
+# These all throw: "BaseOxmlElement.xpath() got an unexpected keyword argument 'namespaces'"
+text_elems = para.xpath('.//w:t', namespaces={'w': 'http://...'})
+style_elem = para.xpath('.//w:pStyle', namespaces={'w': 'http://...'})
+```
+
+**After (works correctly):**
+```python
+from docx.oxml.ns import qn
+
+# Use findall() with qn() helper for text elements
+text_elems = para.findall('.//' + qn('w:t'))
+text_content = ''.join(t.text or '' for t in text_elems)
+
+# Use find() chain for nested elements (pStyle is inside pPr)
+pPr = para.find(qn('w:pPr'))
+if pPr is not None:
+    pStyle = pPr.find(qn('w:pStyle'))
+    if pStyle is not None:
+        style_val = pStyle.get(qn('w:val'), '')
+```
+
+**Key Insight:** The `qn()` function from `docx.oxml.ns` converts prefixed names like `'w:t'` to their fully qualified form `'{http://...}t'`, which works with python-docx's element methods.
+
+---
+
+## 6. Image Mode Default
+
+**File:** `src/mcp_office_tools/mixins/word.py`
+
+**Symptom:** Responses exceed token limits when documents contain images.
+
+**Cause:** Default `image_mode="base64"` embeds full image data inline, bloating responses.
+
+**Fix:**
+```python
+image_mode: str = Field(
+    default="files",  # Changed from "base64"
+    description="Image handling mode: 'files' (saves to disk), 'base64' (embeds inline), 'references' (metadata only)"
+)
+```
+
+---
+
+## Lessons Learned
+
+1. **Silent failures are dangerous.** Wrapping xpath calls in try/except hid the API mismatch for months. Consider logging exceptions even when swallowing them.
+
+2. **Test with real documents.** Unit tests with mocked data passed, but real documents exposed the xpath API issue immediately.
+
+3. **python-docx is not lxml.** Despite being built on lxml, python-docx's element classes have different method signatures. Always use `qn()` and `findall()`/`find()` instead of `xpath()` with namespace dicts.
+
+4. **Check your loop bounds.** Scanning "first 100 elements" seemed reasonable but failed for long documents. Limit the *output*, not the *search*.
+
+5. **Understand your conditionals.** The `if/elif` logic bug is subtle - the fallback was syntactically correct but semantically wrong for the use case.
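
The first lesson (log exceptions even when swallowing them) can be sketched with the standard library alone. This is a minimal illustration, not code from the patch: `safe_lookup`, `strict_xpath`, and the logger name are hypothetical, and `strict_xpath` only imitates a method that rejects a `namespaces` keyword the way the document describes for python-docx's `xpath()`. The fallback value is still returned, but the exception and its traceback reach the log.

```python
import logging

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("docx_processing_sketch")


def safe_lookup(func, default, *args, **kwargs):
    """Call func; on any exception, log the traceback and return default.

    Hypothetical helper: it keeps the "swallow and fall back" behaviour,
    but the failure is recorded instead of disappearing.
    """
    try:
        return func(*args, **kwargs)
    except Exception:
        # Logger.exception logs at ERROR level and attaches the traceback.
        name = getattr(func, "__name__", repr(func))
        logger.exception("lookup failed in %s; using default", name)
        return default


def strict_xpath(expr):
    """Stand-in for an element method that accepts no `namespaces` keyword."""
    return [expr]


# The unexpected keyword raises TypeError, which is logged before the
# default (an empty list) is returned.
result = safe_lookup(strict_xpath, [], ".//w:t", namespaces={"w": "urn:example"})
```

If no logging handler is configured, Python's last-resort handler writes the record to stderr, so a mismatch like the one in section 5 would show up on the first run rather than staying hidden behind the fallback.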