5 Commits

Author SHA1 Message Date
89ad0c849d Improve section detection with heading styles + fallback
- Primary: Detect sections via Heading 1 styles (structured)
- Fallback: Detect chapters via "Chapter X" text patterns
- Add text_patterns_only flag to skip heading styles (for messy docs)

This handles both well-structured business documents (manuals, PRDs)
and narrative content (books with explicit chapter headings).
2026-01-11 09:40:38 -07:00
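
Note on 89ad0c849d: the primary/fallback order described in this commit could be sketched roughly as below. This is an illustration, not the committed code. It assumes paragraphs arrive as (style_name, text) pairs; the function name `detect_sections`, the return shape, and the exact regex are hypothetical. Only the "Heading 1" style, the "Chapter X" pattern, and the `text_patterns_only` flag come from the commit message.

```python
import re
from typing import List, Tuple

# Assumed pattern: the commit only says chapters are matched via "Chapter X" text.
CHAPTER_RE = re.compile(r"^\s*chapter\s+\d+\b", re.IGNORECASE)

Section = Tuple[int, str, str]  # (paragraph index, title, detection method)


def detect_sections(
    paragraphs: List[Tuple[str, str]],
    text_patterns_only: bool = False,
) -> List[Section]:
    """Find section starts in a list of (style_name, text) paragraphs."""
    if not text_patterns_only:
        # Primary: paragraphs styled as "Heading 1" with non-empty text.
        by_style = [
            (i, text.strip(), "heading_style")
            for i, (style, text) in enumerate(paragraphs)
            if style == "Heading 1" and text.strip()
        ]
        if by_style:
            return by_style
    # Fallback (or forced by the flag): lines that start with "Chapter <number>".
    return [
        (i, text.strip(), "text_pattern")
        for i, (_, text) in enumerate(paragraphs)
        if CHAPTER_RE.match(text)
    ]
```

With `text_patterns_only=True` the heading styles are ignored entirely, which matches the "messy docs" case where styles are applied inconsistently.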
d569034fa3 Add MCP resource system for embedded document content
Implements URI-based access to document content with:

- ResourceStore for caching extracted images, chapters, sheets, slides
- Content-based document IDs (SHA256 hash) for stable URIs across sessions
- 11 resource templates with flexible URI patterns:
  - Binary: image://, chart://, media://, embed://
  - Text: chapter://, section://, sheet://, slide://
  - Ranges: chapters://doc/1-5, slides://doc/1,3,5
  - Hierarchical: paragraph://doc/3/5

- Format suffixes for output control:
  - chapter://doc/3.md (default markdown)
  - chapter://doc/3.txt (plain text)
  - chapter://doc/3.html (basic HTML)

- index_document tool scans and populates resources:
  - Word: chapters as markdown, embedded images
  - Excel: sheets as markdown tables
  - PowerPoint: slides as markdown

Tool responses return URIs instead of blobs - clients fetch only what they need.
2026-01-11 09:04:29 -07:00
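
Note on d569034fa3: a minimal sketch of how the URI shapes listed in this commit might be parsed, and how a content-based document ID could be derived. None of these names are taken from the repository: `document_id`, `ResourceRef`, `parse_resource_uri`, the 12-character ID length, and applying the markdown default to every scheme are assumptions made for illustration. The URI examples, the SHA256 hashing, and the .md/.txt/.html suffixes come from the commit message.

```python
import hashlib
import re
from typing import List, NamedTuple

URI_RE = re.compile(r"^(?P<scheme>[a-z]+)://(?P<doc>[^/]+)/(?P<rest>.+)$")
FORMATS = {"md", "txt", "html"}


class ResourceRef(NamedTuple):
    scheme: str
    doc_id: str
    parents: List[int]  # leading path segments, e.g. [3] in paragraph://doc/3/5
    items: List[int]    # last segment, expanded from "3", "1-5" or "1,3,5"
    fmt: str


def document_id(content: bytes, length: int = 12) -> str:
    """Same bytes always give the same ID, so URIs stay stable across sessions."""
    return hashlib.sha256(content).hexdigest()[:length]  # truncation is an assumption


def expand_selector(selector: str) -> List[int]:
    """Expand '1-5' to [1, 2, 3, 4, 5] and '1,3,5' to [1, 3, 5]."""
    numbers: List[int] = []
    for part in selector.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            numbers.extend(range(int(start), int(end) + 1))
        else:
            numbers.append(int(part))
    return numbers


def parse_resource_uri(uri: str) -> ResourceRef:
    match = URI_RE.match(uri)
    if match is None:
        raise ValueError(f"not a resource URI: {uri!r}")
    rest, fmt = match["rest"], "md"  # markdown is the stated default for chapters
    stem, dot, suffix = rest.rpartition(".")
    if dot and suffix in FORMATS:
        rest, fmt = stem, suffix
    *parents, selector = rest.split("/")
    return ResourceRef(
        match["scheme"], match["doc"],
        [int(p) for p in parents], expand_selector(selector), fmt,
    )
```

For example, `parse_resource_uri("chapter://doc/3.txt")` yields items `[3]` with format `txt`, and `parse_resource_uri("paragraph://doc/3/5")` yields parents `[3]` and items `[5]`.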
af6aadf559 Refactor: Extract processing logic into utility modules
Complete architecture cleanup - eliminated duplicate server files:
- Deleted server_monolithic.py (2249 lines)
- Deleted server_legacy.py (2209 lines)

New utility modules created:
- utils/word_processing.py - Word extraction/conversion (preserves page range fixes)
- utils/excel_processing.py - Excel extraction
- utils/powerpoint_processing.py - PowerPoint extraction
- utils/processing.py - Universal helpers (parse_page_range, health checks, etc.)

Updated mixins to import from utils instead of server_monolithic.
Entry point remains server.py (48 lines) using mixin architecture.

All 53 tests pass. Coverage improved from 11% to 22% by removing duplicate code.
2026-01-11 05:08:18 -07:00
0748eec48d Fix FastMCP stdio server import
- Use app.run_stdio_async() instead of deprecated stdio_server import
- Aligns with FastMCP 2.11.3 API
- Server now starts correctly with uv run mcp-office-tools
- Maintains all MCPMixin functionality and tool registration
2025-09-26 15:49:00 -06:00
9d6a9fc24c Refactor server architecture using MCPMixin pattern
- Split monolithic 2209-line server.py into organized mixin classes
- UniversalMixin: Format-agnostic tools (extract_text, extract_images, etc.)
- WordMixin: Word-specific tools (convert_to_markdown with chapter_name support)
- ExcelMixin: Placeholder for future Excel-specific tools
- PowerPointMixin: Placeholder for future PowerPoint-specific tools

Benefits:
• Improved maintainability and separation of concerns
• Better testability with isolated mixins
• Easier team collaboration on different file types
• Reduced cognitive load per module
• Preserved all 7 existing tools with full functionality

Architecture now supports clean expansion for format-specific tools
while maintaining backward compatibility through legacy server backup.
2025-09-26 13:08:53 -06:00
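
Note on 9d6a9fc24c: the mixin split described in this commit can be illustrated with plain Python classes. The sketch below does not use the FastMCP MCPMixin API; the `tool` decorator, the `ToolMixin` base, `collect_tools`, and the `OfficeServer` class are hypothetical stand-ins, and the method bodies are placeholders. Only the mixin names and the tool names `extract_text` and `convert_to_markdown` (with `chapter_name`) come from the commit message.

```python
from typing import Callable, Dict, Optional


def tool(func: Callable) -> Callable:
    """Mark a method as a tool (hypothetical stand-in for real registration)."""
    func._is_tool = True
    return func


class ToolMixin:
    def collect_tools(self) -> Dict[str, Callable]:
        """Gather every @tool method contributed by any mixin in the MRO."""
        tools: Dict[str, Callable] = {}
        for name in dir(self):
            attr = getattr(self, name)
            if callable(attr) and getattr(attr, "_is_tool", False):
                tools[name] = attr
        return tools


class UniversalMixin(ToolMixin):
    """Format-agnostic tools."""

    @tool
    def extract_text(self, path: str) -> str:
        return f"text from {path}"  # placeholder body


class WordMixin(ToolMixin):
    """Word-specific tools."""

    @tool
    def convert_to_markdown(self, path: str, chapter_name: Optional[str] = None) -> str:
        suffix = f" ({chapter_name})" if chapter_name else ""
        return f"# {path}{suffix}"  # placeholder body


class ExcelMixin(ToolMixin):
    """Placeholder for future Excel-specific tools."""


class OfficeServer(UniversalMixin, WordMixin, ExcelMixin):
    """Composes the mixins; each file type stays in its own class."""


if __name__ == "__main__":
    print(sorted(OfficeServer().collect_tools()))
```

Each mixin can be instantiated and tested on its own, which is the testability benefit the commit message lists.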