Documents 6 critical bugs discovered while processing a 200+ page manuscript, including the root cause xpath API mismatch between python-docx and lxml that caused silent failures in chapter search.

# DOCX Processing Fixes

This document captures critical bugs discovered and fixed while processing complex Word documents (specifically a 200+ page manuscript with 10 chapters).

## Summary

| # | Bug | Impact | Root Cause |
|---|-----|--------|------------|
| 1 | FastMCP banner corruption | MCP connection fails | ASCII art breaks JSON-RPC |
| 2 | Page range cap | Wrong content extracted | Used max page# instead of count |
| 3 | Heading scan limit | Chapters not found | Only scanned first 100 elements |
| 4 | Short-text fallback logic | Chapters not found | `elif` prevented fallback |
| 5 | **xpath API mismatch** | **Complete silent failure** | **python-docx != lxml API** |
| 6 | Image mode default | Response too large | Base64 bloats output |

---

## 1. FastMCP Banner Corruption

**File:** `src/mcp_office_tools/server.py`

**Symptom:** MCP connection fails with `Invalid JSON: EOF while parsing`

**Cause:** FastMCP's default startup banner prints ASCII art to stdout, corrupting the JSON-RPC protocol on stdio transport.

**Fix:**
```python
def main():
    # CRITICAL: show_banner=False is required for stdio transport!
    # FastMCP's banner prints ASCII art to stdout which breaks JSON-RPC protocol
    app.run(show_banner=False)
```
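
To illustrate why a single stray line on stdout is fatal, here is a minimal sketch that simulates a newline-delimited JSON-RPC stream of the kind the stdio transport carries. The `read_first_message` helper and the banner text are illustrative assumptions, not FastMCP or client code; the point is only that a client parsing the first line as JSON fails as soon as anything else is printed ahead of it.

```python
import json


def read_first_message(stdout_text: str) -> dict:
    """Parse the first line of a stdio stream as one JSON-RPC message."""
    first_line = stdout_text.splitlines()[0]
    return json.loads(first_line)


response = '{"jsonrpc": "2.0", "id": 1, "result": {}}\n'

# Clean stdout: the first line is valid JSON.
assert read_first_message(response)["id"] == 1

# Polluted stdout: the banner (hypothetical text) is parsed instead of the message.
polluted = "=== FastMCP (hypothetical banner text) ===\n" + response
try:
    read_first_message(polluted)
except json.JSONDecodeError as exc:
    print(f"client-side parse failure: {exc.msg}")
```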

---

## 2. Page Range Cap Bug

**File:** `src/mcp_office_tools/utils/word_processing.py`

**Symptom:** Requesting pages 1-5 returns truncated content, but requesting pages 195-200 returns everything.

**Cause:** The paragraph limit was calculated using the *maximum page number* instead of the *count of pages requested*.

**Before:**
```python
max_paragraphs = max(page_numbers) * 50  # pages 1-5 = 250 max, pages 195-200 = 10,000 max!
```

**After:**
```python
num_pages_requested = len(page_numbers)  # pages 1-5 = 5, pages 195-200 = 6
max_paragraphs = num_pages_requested * 300  # Generous limit per page
max_chars = num_pages_requested * 50000
```
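
The arithmetic behind the two versions can be checked with a small sketch. The function names `old_paragraph_cap` and `extraction_limits` are hypothetical wrappers around the expressions shown above, and `page_numbers` is assumed to be a plain list of the requested page numbers.

```python
from typing import Sequence, Tuple

PARAGRAPHS_PER_PAGE = 300   # per-page allowance used in the fix
CHARS_PER_PAGE = 50_000


def old_paragraph_cap(page_numbers: Sequence[int]) -> int:
    """Buggy version: the cap scales with the highest page number."""
    return max(page_numbers) * 50


def extraction_limits(page_numbers: Sequence[int]) -> Tuple[int, int]:
    """Fixed version: both caps scale with how many pages were requested."""
    num_pages_requested = len(page_numbers)
    return (num_pages_requested * PARAGRAPHS_PER_PAGE,
            num_pages_requested * CHARS_PER_PAGE)


first_pages = list(range(1, 6))      # pages 1-5
last_pages = list(range(195, 201))   # pages 195-200

print(old_paragraph_cap(first_pages), old_paragraph_cap(last_pages))  # 250 10000
print(extraction_limits(first_pages))  # (1500, 250000)
print(extraction_limits(last_pages))   # (1800, 300000)
```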

---

## 3. Heading Scan Limit Bug

**File:** `src/mcp_office_tools/utils/word_processing.py`

**Symptom:** `_get_available_headings()` returns an empty list for documents with chapters beyond the first few pages.

**Cause:** The function only scanned the first 100 body elements, but Chapter 10 was at element 1524.

**Before:**
```python
for element in doc.element.body[:100]:  # Only first 100 elements!
    # find headings...
```

**After:**
```python
for element in doc.element.body:  # Scan ALL elements
    if len(headings) >= 30:
        break  # Limit output, not search
    # find headings...
```
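
The "limit the output, not the search" pattern can be sketched independently of python-docx. Everything here is a stand-in: `collect_headings`, the `is_heading` predicate, and the list of strings playing the role of body elements are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def collect_headings(elements: Iterable[T],
                     is_heading: Callable[[T], bool],
                     max_results: int = 30) -> List[T]:
    """Walk every element, but stop once max_results headings are collected."""
    headings: List[T] = []
    for element in elements:
        if len(headings) >= max_results:
            break  # cap the output size, not the search range
        if is_heading(element):
            headings.append(element)
    return headings


# Stand-in body: 2,000 elements with a single "heading" at index 1524.
body = [f"para-{i}" for i in range(2000)]
body[1524] = "Chapter 10"


def starts_with_chapter(text: str) -> bool:
    return text.startswith("Chapter")


print(collect_headings(body[:100], starts_with_chapter))  # []
print(collect_headings(body, starts_with_chapter))        # ['Chapter 10']
```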

---

## 4. Short-Text Fallback Logic Bug

**File:** `src/mcp_office_tools/utils/word_processing.py`

**Symptom:** Chapter search fails even when chapter text exists and is under 100 characters.

**Cause:** The `elif` for short-text detection was attached to `if style_elem`, meaning it only ran when NO style existed. Paragraphs with any non-heading style (Normal, BodyText, etc.) skipped the fallback entirely.

**Before:**
```python
if style_elem:
    if 'heading' in style_val.lower():
        chapter_start_idx = elem_idx
        break
elif len(text_content.strip()) < 100:  # Only runs if style_elem is empty!
    chapter_start_idx = elem_idx
    break
```

**After:**
```python
is_heading_style = False
if style_elem:
    style_val = style_elem[0].get(...)
    is_heading_style = 'heading' in style_val.lower()

# Independent check - runs regardless of whether style exists
if is_heading_style or len(text_content.strip()) < 100:
    chapter_start_idx = elem_idx
    break
```
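
To make the difference concrete, the two branches can be reduced to a pair of predicates and compared on the same inputs. This is a sketch under assumptions: `old_is_chapter_start` and `new_is_chapter_start` are hypothetical names, and `style_val` is passed in directly (`None` when the paragraph has no style) rather than read from an element.

```python
from typing import Optional


def old_is_chapter_start(style_val: Optional[str], text: str) -> bool:
    """Original shape: the elif is skipped whenever any style is present."""
    if style_val:
        if "heading" in style_val.lower():
            return True
    elif len(text.strip()) < 100:
        return True
    return False


def new_is_chapter_start(style_val: Optional[str], text: str) -> bool:
    """Fixed shape: the short-text check runs regardless of the style."""
    is_heading_style = bool(style_val) and "heading" in style_val.lower()
    return is_heading_style or len(text.strip()) < 100


cases = [
    ("Heading1", "Chapter 3"),   # heading style
    ("Normal", "Chapter 3"),     # non-heading style, short text
    (None, "Chapter 3"),         # no style, short text
    ("Normal", "x" * 150),       # non-heading style, long text
]
for style_val, text in cases:
    print(style_val, old_is_chapter_start(style_val, text),
          new_is_chapter_start(style_val, text))
```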

---

## 5. Critical xpath API Mismatch (ROOT CAUSE)

**File:** `src/mcp_office_tools/utils/word_processing.py`

**Symptom:** Chapter search always returns "not found" even for chapters that clearly exist.

**Cause:** python-docx wraps lxml elements with custom classes (`CT_Document`, `CT_Body`, `CT_P`) that override `xpath()` with a **different method signature**. Standard lxml accepts `xpath(expr, namespaces={...})`, but python-docx's version **rejects the `namespaces` keyword argument**.

All 8 xpath calls were wrapped in try/except blocks, so they **silently failed** - the chapter search never actually executed.

**Before (silently fails):**
```python
# These all throw: "BaseOxmlElement.xpath() got an unexpected keyword argument 'namespaces'"
text_elems = para.xpath('.//w:t', namespaces={'w': 'http://...'})
style_elem = para.xpath('.//w:pStyle', namespaces={'w': 'http://...'})
```

**After (works correctly):**
```python
from docx.oxml.ns import qn

# Use findall() with qn() helper for text elements
text_elems = para.findall('.//' + qn('w:t'))
text_content = ''.join(t.text or '' for t in text_elems)

# Use find() chain for nested elements (pStyle is inside pPr)
pPr = para.find(qn('w:pPr'))
if pPr is not None:
    pStyle = pPr.find(qn('w:pStyle'))
    if pStyle is not None:
        style_val = pStyle.get(qn('w:val'), '')
```
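
The same lookup pattern can be tried without python-docx installed. The sketch below uses the standard library's `xml.etree.ElementTree` on a hand-written paragraph; `qn_w` is a hypothetical stand-in for `docx.oxml.ns.qn` that handles only the `w:` prefix, and `W_NS` is the standard WordprocessingML namespace. It is meant to show the qualified-name technique, not the project's actual code path.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace that the "w:" prefix refers to.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def qn_w(tag: str) -> str:
    """Hypothetical stand-in for docx.oxml.ns.qn, limited to the w: prefix."""
    prefix, local_name = tag.split(":", 1)
    if prefix != "w":
        raise ValueError(f"unsupported prefix: {prefix}")
    return f"{{{W_NS}}}{local_name}"


PARAGRAPH_XML = f"""
<w:p xmlns:w="{W_NS}">
  <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:t>Chapter </w:t></w:r>
  <w:r><w:t>10</w:t></w:r>
</w:p>
""".strip()

para = ET.fromstring(PARAGRAPH_XML)

# Descendant search for text runs, using the fully qualified tag name.
text_content = "".join(t.text or "" for t in para.findall(".//" + qn_w("w:t")))

# Direct-child chain for the style: pStyle sits inside pPr.
style_val = ""
pPr = para.find(qn_w("w:pPr"))
if pPr is not None:
    pStyle = pPr.find(qn_w("w:pStyle"))
    if pStyle is not None:
        style_val = pStyle.get(qn_w("w:val"), "")

print(text_content, style_val)  # Chapter 10 Heading1
```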

**Key Insight:** The `qn()` function from `docx.oxml.ns` converts prefixed names like `'w:t'` to their fully qualified form `'{http://...}t'`, which works with python-docx's element methods.

---

## 6. Image Mode Default

**File:** `src/mcp_office_tools/mixins/word.py`

**Symptom:** Responses exceed token limits when documents contain images.

**Cause:** Default `image_mode="base64"` embeds full image data inline, bloating responses.

**Fix:**
```python
image_mode: str = Field(
    default="files",  # Changed from "base64"
    description="Image handling mode: 'files' (saves to disk), 'base64' (embeds inline), 'references' (metadata only)"
)
```
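
For a sense of scale, base64 turns every 3 bytes of input into 4 output characters, so inline images cost roughly a third more than their raw size before any JSON wrapping. A quick check, using a zero-filled buffer as a stand-in for real image data (the 300,000-byte size is an arbitrary assumption):

```python
import base64

raw_image = bytes(300_000)             # stand-in for a 300 kB image
encoded = base64.b64encode(raw_image)

print(len(raw_image), len(encoded))    # 300000 400000
print(len(encoded) / len(raw_image))   # about 1.33
```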

---

## Lessons Learned

1. **Silent failures are dangerous.** Wrapping xpath calls in try/except hid the API mismatch for months. Consider logging exceptions even when swallowing them.

2. **Test with real documents.** Unit tests with mocked data passed, but real documents exposed the xpath API issue immediately.

3. **python-docx is not lxml.** Despite being built on lxml, python-docx's element classes have different method signatures. Always use `qn()` and `findall()`/`find()` instead of `xpath()` with namespace dicts.

4. **Check your loop bounds.** Scanning "first 100 elements" seemed reasonable but failed for long documents. Limit the *output*, not the *search*.

5. **Understand your conditionals.** The `if/elif` logic bug is subtle - the fallback was syntactically correct but semantically wrong for the use case.
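
As one way to act on the first lesson, a swallowed exception can still leave a trace. The sketch below is an assumption about how that might look, not the project's code: `call_or_default`, the logger name, and `broken_xpath` (which mimics the `TypeError` from bug 5) are all hypothetical.

```python
import logging

logger = logging.getLogger("word_processing")  # hypothetical logger name


def call_or_default(func, default, *args, **kwargs):
    """Run func; on any exception, log it with its traceback and return default."""
    try:
        return func(*args, **kwargs)
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("%s failed; falling back to default",
                         getattr(func, "__name__", repr(func)))
        return default


def broken_xpath(expr, **kwargs):
    """Stand-in for a call that rejects an unexpected keyword argument."""
    raise TypeError("xpath() got an unexpected keyword argument 'namespaces'")


# The caller still gets the fallback value, but the failure is now visible in the logs.
text_elems = call_or_default(broken_xpath, [], ".//w:t", namespaces={})
print(text_elems)  # []
```

With logging in place, a failure that repeats on every paragraph shows up on the first run instead of staying hidden behind an empty result.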