35 Commits

Author SHA1 Message Date
322ed78427 Add Docker deployment with streamable-http transport for hosted MCP
Some checks failed
Test Dashboard / test-and-dashboard (push) Has been cancelled
- Add Dockerfile with multi-stage build using uv
- Add docker-compose.yml with caddy-docker-proxy labels for /mcp endpoint
- Add .env.example for deployment configuration
- Update Makefile with docker-* targets
- Update server.py to support MCP_TRANSPORT env var:
  - 'stdio' (default): Local CLI usage with Claude Code
  - 'streamable-http': Hosted HTTP mode behind reverse proxy

Hosted server will be available at:
  https://mcwaddams.supported.systems/mcp
2026-01-11 14:27:50 -07:00
3d469e5696 Add author section with ryanmalloy.com link
Some checks are pending
Test Dashboard / test-and-dashboard (push) Waiting to run
2026-01-11 11:54:17 -07:00
31948d6ffc Rename package to mcwaddams
Some checks are pending
Test Dashboard / test-and-dashboard (push) Waiting to run
Named for Milton Waddams, who was relocated to the basement with
boxes of legacy documents. He handles the .doc and .xls files from
1997 that nobody else wants to touch.

- Rename package from mcp-office-tools to mcwaddams
- Update author to Ryan Malloy
- Update all imports and references
- Add Office Space themed README narrative
- All 53 tests passing
2026-01-11 11:35:35 -07:00
6fb76d8760 Add MCP resources documentation and fix section format suffix
Some checks are pending
Test Dashboard / test-and-dashboard (push) Waiting to run
- Document MCP resource system in README with URI patterns, format
  suffixes, range syntax, and section detection strategies
- Add index_document to Universal Tools table
- Update architecture section to include resources.py
- Fix section:// resource to support .md/.txt/.html format suffixes
  (matching chapter:// behavior)
2026-01-11 10:23:47 -07:00
89ad0c849d Improve section detection with heading styles + fallback
Some checks are pending
Test Dashboard / test-and-dashboard (push) Waiting to run
- Primary: Detect sections via Heading 1 styles (structured)
- Fallback: Detect chapters via "Chapter X" text patterns
- Add text_patterns_only flag to skip heading styles (for messy docs)

This handles both well-structured business documents (manuals, PRDs)
and narrative content (books with explicit chapter headings).
2026-01-11 09:40:38 -07:00
d569034fa3 Add MCP resource system for embedded document content
Some checks are pending
Test Dashboard / test-and-dashboard (push) Waiting to run
Implements URI-based access to document content with:

- ResourceStore for caching extracted images, chapters, sheets, slides
- Content-based document IDs (SHA256 hash) for stable URIs across sessions
- 11 resource templates with flexible URI patterns:
  - Binary: image://, chart://, media://, embed://
  - Text: chapter://, section://, sheet://, slide://
  - Ranges: chapters://doc/1-5, slides://doc/1,3,5
  - Hierarchical: paragraph://doc/3/5

- Format suffixes for output control:
  - chapter://doc/3.md (default markdown)
  - chapter://doc/3.txt (plain text)
  - chapter://doc/3.html (basic HTML)

- index_document tool scans and populates resources:
  - Word: chapters as markdown, embedded images
  - Excel: sheets as markdown tables
  - PowerPoint: slides as markdown

Tool responses return URIs instead of blobs - clients fetch only what they need.
2026-01-11 09:04:29 -07:00
11defb4eae Update README and gitignore for new document tools
Some checks are pending
Test Dashboard / test-and-dashboard (push) Waiting to run
- Add 7 new Word tools to README (outline, search, entities, etc.)
- Add 9 MCP prompts section with workflow descriptions
- Gitignore reading progress bookmark files (.*.reading_progress.json)
- Gitignore local .mcp.json and test documents
2026-01-11 07:41:49 -07:00
4b38f6455c Add document navigation tools and MCP prompts
New tools for Word document analysis:
- extract_entities: Pattern-based extraction of people, places, organizations
- get_chapter_summaries: Chapter previews with opening sentences and word counts
- save_reading_progress: Bookmark reading position to JSON file
- get_reading_progress: Resume reading from saved position

New MCP prompts (basic to advanced workflows):
- explore-document: Get started with a new document
- find-character: Track character mentions
- chapter-preview: Quick chapter overviews
- resume-reading: Continue where you left off
- document-analysis: Comprehensive multi-tool analysis
- character-journey: Track character arc through narrative
- document-comparison: Compare entities between chapters
- full-reading-session: Guided reading with bookmarking
- manuscript-review: Complete editorial workflow

Updated test counts for 19 total tools (6 universal + 10 word + 3 excel)
2026-01-11 07:23:15 -07:00
1abce7f26d Add document navigation tools: outline, style check, search
New tools for easier document navigation:
- get_document_outline: Structured view of headings with chapter detection
- check_style_consistency: Find formatting issues and missing chapters
- search_document: Search with context and chapter location

All tools tested with 200+ page manuscript. Detects issues like
Chapter 3 being styled as "normal" instead of "Heading 1".
2026-01-11 07:15:43 -07:00
34e636e782 Add documentation for DOCX processing fixes
Documents 6 critical bugs discovered while processing a 200+ page
manuscript, including the root cause xpath API mismatch between
python-docx and lxml that caused silent failures in chapter search.
2026-01-11 06:47:39 -07:00
2f39c4ec5b Fix critical xpath API bug breaking chapter/heading detection
python-docx elements don't support xpath() with namespaces kwarg.
The calls silently failed in try/except blocks, causing chapter search
and heading detection to never find matches.

Fixed by replacing xpath(..., namespaces={...}) with:
- findall('.//' + qn('w:t')) for text elements
- find(qn('w:pPr')) + find(qn('w:pStyle')) for style detection
- get(qn('w:val')) for attribute values

Also fixed logic bug where elif prevented short-text fallback from
running when a non-heading style existed on the paragraph.
2026-01-11 05:20:05 -07:00
af6aadf559 Refactor: Extract processing logic into utility modules
Complete architecture cleanup - eliminated duplicate server files:
- Deleted server_monolithic.py (2249 lines)
- Deleted server_legacy.py (2209 lines)

New utility modules created:
- utils/word_processing.py - Word extraction/conversion (preserves page range fixes)
- utils/excel_processing.py - Excel extraction
- utils/powerpoint_processing.py - PowerPoint extraction
- utils/processing.py - Universal helpers (parse_page_range, health checks, etc.)

Updated mixins to import from utils instead of server_monolithic.
Entry point remains server.py (48 lines) using mixin architecture.

All 53 tests pass. Coverage improved from 11% to 22% by removing duplicate code.
2026-01-11 05:08:18 -07:00
8249afb763 Fix banner issue in server.py entry point
The pyproject.toml script entry point (mcp-office-tools) uses server.py,
not server_monolithic.py. Applied same show_banner=False fix and
simplified to use app.run() instead of asyncio.run(app.run_stdio_async()).
2026-01-11 04:32:46 -07:00
210aa99e0b Fix page range extraction for large documents and MCP connection
Bug fixes:
- Remove 100-paragraph cap that prevented extracting content past ~page 4
  Now calculates limit based on number of pages requested (300 paras/page)
- Add fallback page estimation when docs lack explicit page breaks
  Uses ~25 paragraphs per page for navigation in non-paginated docs
- Fix _get_available_headings to scan full document (was only first 100 elements)
  Headings like Chapter 10 at element 1524 were invisible
- Fix MCP connection by disabling FastMCP banner (show_banner=False)
  ASCII art banner was corrupting stdout JSON-RPC protocol

Changes:
- Default image_mode changed from 'base64' to 'files' to avoid huge responses
- Add proper .mcp.json config with command/args format
- Add test document to .gitignore for privacy
2026-01-11 04:27:56 -07:00
35869b6099 Add behind-the-scenes link to discernment blog post
Some checks are pending
Test Dashboard / test-and-dashboard (push) Waiting to run
Links README to Ryan's AI discernment article, which discusses
the documentation rewrite process and connects to the model's
perspective in the collaborations archive.
2026-01-11 02:02:34 -07:00
f159efab2c Improve README tone and clarity
Some checks are pending
Test Dashboard / test-and-dashboard (push) Waiting to run
- Replace generic opener with direct description
- Make feature bullets more conversational (less "feature list" mode)
- Add context before format support table
- Clarify pagination example with "three ways" structure
- Lead testing section with the dashboard hook
- Add architecture design rationale
- Remove "comprehensive" and "intelligent" buzzwords
2026-01-11 00:49:34 -07:00
036160d029 Update README with accurate tool documentation
Some checks are pending
Test Dashboard / test-and-dashboard (push) Waiting to run
- Document all 12 actual MCP tools (6 universal, 3 Word, 3 Excel)
- Add comprehensive format support matrix with feature breakdown
- Include practical usage examples with real output structures
- Add test dashboard section
- Simplify installation with uvx/Claude Code instructions
- Remove marketing fluff; focus on technical accuracy
2026-01-11 00:45:00 -07:00
c935cec7b6 Add MS Office-themed test dashboard with interactive reporting
- Self-contained HTML dashboard with MS Office 365 design
- pytest plugin captures inputs, outputs, and errors per test
- Unified orchestrator runs pytest + torture tests together
- Test files persisted in reports/test_files/ with relative links
- GitHub Actions workflow with PR comments and job summaries
- Makefile with convenient commands (test, view-dashboard, etc.)
- Works offline with embedded JSON data (no CORS issues)
2026-01-11 00:28:12 -07:00
76c7a0b2d0 Add decorators for field defaults and error handling, fix Excel performance
- Create @resolve_field_defaults decorator to handle Pydantic FieldInfo
  objects when tools are called directly (outside MCP framework)
- Create @handle_office_errors decorator for consistent error wrapping
- Apply decorators to Excel and Word mixins, removing ~100 lines of
  boilerplate code
- Fix Excel formula extraction performance: load workbooks once before
  loop instead of per-cell (100x faster with calculated values)
- Update test suite to use correct mock patch paths (patch where names
  are looked up, not where defined)
- Add torture_test.py for real document validation
2026-01-10 23:51:30 -07:00
1ad2abb617 Implement cursor-based pagination system for large document processing
- Add comprehensive pagination infrastructure based on MCP Playwright patterns
- Integrate automatic pagination into convert_to_markdown tool for documents >25k tokens
- Support cursor-based navigation with session isolation and security
- Prevent MCP token limit errors for massive documents (200+ pages)
- Maintain document structure and context across paginated sections
- Add configurable page sizes, return_all bypass, and intelligent token estimation
- Enable seamless navigation through extremely dense documents that exceed limits by 100x
2025-09-26 19:06:05 -06:00
0748eec48d Fix FastMCP stdio server import
- Use app.run_stdio_async() instead of deprecated stdio_server import
- Aligns with FastMCP 2.11.3 API
- Server now starts correctly with uv run mcp-office-tools
- Maintains all MCPMixin functionality and tool registration
2025-09-26 15:49:00 -06:00
22f657b32b Fix server entry point for pyproject.toml script
- Add main() function back to server.py for CLI script entry point
- Maintains FastMCP MCPMixin pattern while fixing uvx execution
- Server now starts properly with 'uvx --from . mcp-office-tools'
- Preserves all 7 tools with official mixin registration
2025-09-26 14:15:25 -06:00
9d6a9fc24c Refactor server architecture using mcpmixin pattern
- Split monolithic 2209-line server.py into organized mixin classes
- UniversalMixin: Format-agnostic tools (extract_text, extract_images, etc.)
- WordMixin: Word-specific tools (convert_to_markdown with chapter_name support)
- ExcelMixin: Placeholder for future Excel-specific tools
- PowerPointMixin: Placeholder for future PowerPoint-specific tools

Benefits:
• Improved maintainability and separation of concerns
• Better testability with isolated mixins
• Easier team collaboration on different file types
• Reduced cognitive load per module
• Preserved all 7 existing tools with full functionality

Architecture now supports clean expansion for format-specific tools
while maintaining backward compatibility through legacy server backup.
2025-09-26 13:08:53 -06:00
778ef3a2d4 Add chapter-based extraction for documents without bookmarks
- Add chapter_name parameter to convert_to_markdown tool
- Implement _find_chapter_content_range() for heading-based navigation
- Add _get_available_headings() to help users find chapter names
- Include chapter extraction metadata in results
- Enhanced ultra-fast summary with available headings
- Provides alternative to bookmark extraction when bookmarks unavailable
2025-08-22 08:14:23 -06:00
6484036b69 📖 Add bookmark-based chapter extraction for precise content targeting
- Add bookmark_name parameter for extracting specific chapters/sections
- Implement bookmark boundary detection using Word XML structure
- Extract content between bookmark start/end markers with smart extension
- More reliable than page ranges - bookmarks are anchored to exact locations
- Support chapter extraction like bookmark_name='Chapter1_Start'
- Include bookmark metadata in response with element ranges
- Perfect for extracting individual chapters from large documents
2025-08-22 08:02:50 -06:00
b2033fc239 🔥 Fix critical issue: page_range was processing entire document
- Replace unreliable Word page detection with element-based limiting
- Cap extraction at 25 paragraphs per 'page' requested (max 100 total)
- Cap extraction at 8k chars per 'page' requested (max 40k total)
- Add early termination when limits reached
- Add processing_limits metadata to show actual extraction stats
- Prevent 1.28M token responses by stopping at reasonable content limits
- Single page (page_range='1') now limited to ~25 paragraphs/8k chars
2025-08-22 08:00:02 -06:00
431022e113 🚀 Add ultra-fast summary mode to prevent massive 1M+ token responses
- Bypass all complex processing in summary_only mode
- Extract only first 50 paragraphs, max 10 headings, 5 content paragraphs
- Add bookmark detection for chapter navigation hints
- Limit summary content to 2000 chars max
- Prevent 1,282,370 token responses with surgical precision
- Show bookmark names as chapter start indicators
2025-08-22 07:56:19 -06:00
3dffce6904 Add aggressive content limiting to prevent MCP 25k token errors
- Implement smart content truncation at ~80k chars (~20k tokens)
- Preserve document structure when truncating (stop at natural breaks)
- Add clear truncation notices with guidance for smaller ranges
- Update chunking suggestions to use safer 8-page chunks
- Enhance recommendations to suggest 3-8 page ranges
- Prevent 29,869 > 25,000 token errors while maintaining usability
2025-08-21 02:50:04 -06:00
9c2f299d49 📋 Add comprehensive Table of Contents extraction with smart chunking
- Extract headings with page numbers during document processing
- Generate optimized page ranges for each section/chapter
- Provide intelligent chunking suggestions (15-page optimal chunks)
- Classify section types (chapter, section, subsection, etc.)
- Calculate actual section lengths based on heading positions
- Include suggested_chunking with ready-to-use page ranges
- Perfect for extracting 200+ page documents section by section
2025-08-21 02:47:01 -06:00
d94bd39da6 🧠 Add intelligent processing recommendations for optimal workflow
- Analyze document size and complexity before processing
- Provide clear workflow recommendations in response metadata
- Strongly recommend summary_only + page_range for large documents (>10 pages)
- Add warning system for suboptimal usage patterns
- Update parameter descriptions with best practice guidance
- Help users avoid 25k token response limits proactively
2025-08-19 13:16:48 -06:00
a485e05759 Implement true page-range filtering for efficient processing
- Add page break detection using Word XML structure
- Process only specified pages instead of full document + truncation
- Route page-range requests to python-docx for granular control
- Skip mammoth for page-specific processing (mammoth processes full doc)
- Add page metadata to results when filtering is used
- Significantly reduce memory usage and response size for large documents
2025-08-19 13:12:19 -06:00
f884c99bbd 🎯 Add page-range chunking and summary mode for large documents
- Replace character-based chunking with page-range support (e.g., '1-5', '1,3,5-10')
- Add summary_only mode to prevent large response errors (>25k tokens)
- Implement response size limiting with 5000 char truncation in summary mode
- Support selective page processing for better memory efficiency
- Maintain backward compatibility with existing parameters
2025-08-18 23:32:00 -06:00
b3caed78d3 Add comprehensive Markdown conversion with image support
- Add convert_to_markdown tool for .docx/.doc files
- Support multiple image handling modes (base64, files, references)
- Implement large document chunking for performance
- Preserve document structure (headings, lists, tables)
- Smart fallback methods (mammoth → python-docx → custom)
- Handle both modern and legacy Word formats
2025-08-18 23:23:59 -06:00
1b359c4c7c Transform README into a stunning showcase
- Add eye-catching visual design with emojis and badges
- Create compelling hero section with value proposition
- Include real-world benchmarks and performance metrics
- Add enterprise success stories and use cases
- Implement collapsible sections for better organization
- Include Mermaid architecture diagram
- Add comprehensive feature matrix with visual indicators
- Create roadmap and community sections
- Enhance installation and setup instructions
- Make it GitHub-ready with proper formatting

🚀 Now ready to wow potential users and contributors!
2025-08-18 01:05:03 -06:00
b681cb030b Initial commit: MCP Office Tools v0.1.0
- Comprehensive Microsoft Office document processing server
- Support for Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), CSV
- 6 universal tools: extract_text, extract_images, extract_metadata, detect_office_format, analyze_document_health, get_supported_formats
- Multi-library fallback system for robust processing
- URL support with intelligent caching
- Legacy Office format support (97-2003)
- FastMCP integration with async architecture
- Production ready with comprehensive documentation

🤖 Generated with Claude Code (claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 01:01:48 -06:00