mcp-office-tools

Author	SHA1	Message	Date
Ryan Malloy	89ad0c849d	Improve section detection with heading styles + fallback - Primary: Detect sections via Heading 1 styles (structured) - Fallback: Detect chapters via "Chapter X" text patterns - Add text_patterns_only flag to skip heading styles (for messy docs) This handles both well-structured business documents (manuals, PRDs) and narrative content (books with explicit chapter headings).	2026-01-11 09:40:38 -07:00
Ryan Malloy	d569034fa3	Add MCP resource system for embedded document content Implements URI-based access to document content with: - ResourceStore for caching extracted images, chapters, sheets, slides - Content-based document IDs (SHA256 hash) for stable URIs across sessions - 11 resource templates with flexible URI patterns: - Binary: image://, chart://, media://, embed:// - Text: chapter://, section://, sheet://, slide:// - Ranges: chapters://doc/1-5, slides://doc/1,3,5 - Hierarchical: paragraph://doc/3/5 - Format suffixes for output control: - chapter://doc/3.md (default markdown) - chapter://doc/3.txt (plain text) - chapter://doc/3.html (basic HTML) - index_document tool scans and populates resources: - Word: chapters as markdown, embedded images - Excel: sheets as markdown tables - PowerPoint: slides as markdown Tool responses return URIs instead of blobs - clients fetch only what they need.	2026-01-11 09:04:29 -07:00
Ryan Malloy	11defb4eae	Update README and gitignore for new document tools - Add 7 new Word tools to README (outline, search, entities, etc.) - Add 9 MCP prompts section with workflow descriptions - Gitignore reading progress bookmark files (.*.reading_progress.json) - Gitignore local .mcp.json and test documents	2026-01-11 07:41:49 -07:00
Ryan Malloy	4b38f6455c	Add document navigation tools and MCP prompts New tools for Word document analysis: - extract_entities: Pattern-based extraction of people, places, organizations - get_chapter_summaries: Chapter previews with opening sentences and word counts - save_reading_progress: Bookmark reading position to JSON file - get_reading_progress: Resume reading from saved position New MCP prompts (basic to advanced workflows): - explore-document: Get started with a new document - find-character: Track character mentions - chapter-preview: Quick chapter overviews - resume-reading: Continue where you left off - document-analysis: Comprehensive multi-tool analysis - character-journey: Track character arc through narrative - document-comparison: Compare entities between chapters - full-reading-session: Guided reading with bookmarking - manuscript-review: Complete editorial workflow Updated test counts for 19 total tools (6 universal + 10 word + 3 excel)	2026-01-11 07:23:15 -07:00
Ryan Malloy	1abce7f26d	Add document navigation tools: outline, style check, search New tools for easier document navigation: - get_document_outline: Structured view of headings with chapter detection - check_style_consistency: Find formatting issues and missing chapters - search_document: Search with context and chapter location All tools tested with 200+ page manuscript. Detects issues like Chapter 3 being styled as "normal" instead of "Heading 1".	2026-01-11 07:15:43 -07:00
Ryan Malloy	34e636e782	Add documentation for DOCX processing fixes Documents 6 critical bugs discovered while processing a 200+ page manuscript, including the root cause xpath API mismatch between python-docx and lxml that caused silent failures in chapter search.	2026-01-11 06:47:39 -07:00
Ryan Malloy	2f39c4ec5b	Fix critical xpath API bug breaking chapter/heading detection python-docx elements don't support xpath() with namespaces kwarg. The calls silently failed in try/except blocks, causing chapter search and heading detection to never find matches. Fixed by replacing xpath(..., namespaces={...}) with: - findall('.//' + qn('w:t')) for text elements - find(qn('w:pPr')) + find(qn('w:pStyle')) for style detection - get(qn('w:val')) for attribute values Also fixed logic bug where elif prevented short-text fallback from running when a non-heading style existed on the paragraph.	2026-01-11 05:20:05 -07:00
Ryan Malloy	af6aadf559	Refactor: Extract processing logic into utility modules Complete architecture cleanup - eliminated duplicate server files: - Deleted server_monolithic.py (2249 lines) - Deleted server_legacy.py (2209 lines) New utility modules created: - utils/word_processing.py - Word extraction/conversion (preserves page range fixes) - utils/excel_processing.py - Excel extraction - utils/powerpoint_processing.py - PowerPoint extraction - utils/processing.py - Universal helpers (parse_page_range, health checks, etc.) Updated mixins to import from utils instead of server_monolithic. Entry point remains server.py (48 lines) using mixin architecture. All 53 tests pass. Coverage improved from 11% to 22% by removing duplicate code.	2026-01-11 05:08:18 -07:00
Ryan Malloy	8249afb763	Fix banner issue in server.py entry point The pyproject.toml script entry point (mcp-office-tools) uses server.py, not server_monolithic.py. Applied same show_banner=False fix and simplified to use app.run() instead of asyncio.run(app.run_stdio_async()).	2026-01-11 04:32:46 -07:00
Ryan Malloy	210aa99e0b	Fix page range extraction for large documents and MCP connection Bug fixes: - Remove 100-paragraph cap that prevented extracting content past ~page 4 Now calculates limit based on number of pages requested (300 paras/page) - Add fallback page estimation when docs lack explicit page breaks Uses ~25 paragraphs per page for navigation in non-paginated docs - Fix _get_available_headings to scan full document (was only first 100 elements) Headings like Chapter 10 at element 1524 were invisible - Fix MCP connection by disabling FastMCP banner (show_banner=False) ASCII art banner was corrupting stdout JSON-RPC protocol Changes: - Default image_mode changed from 'base64' to 'files' to avoid huge responses - Add proper .mcp.json config with command/args format - Add test document to .gitignore for privacy	2026-01-11 04:27:56 -07:00
Ryan Malloy	35869b6099	Add behind-the-scenes link to discernment blog post Links README to Ryan's AI discernment article, which discusses the documentation rewrite process and connects to the model's perspective in the collaborations archive.	2026-01-11 02:02:34 -07:00
Ryan Malloy	f159efab2c	Improve README tone and clarity - Replace generic opener with direct description - Make feature bullets more conversational (less "feature list" mode) - Add context before format support table - Clarify pagination example with "three ways" structure - Lead testing section with the dashboard hook - Add architecture design rationale - Remove "comprehensive" and "intelligent" buzzwords	2026-01-11 00:49:34 -07:00
Ryan Malloy	036160d029	Update README with accurate tool documentation - Document all 12 actual MCP tools (6 universal, 3 Word, 3 Excel) - Add comprehensive format support matrix with feature breakdown - Include practical usage examples with real output structures - Add test dashboard section - Simplify installation with uvx/Claude Code instructions - Remove marketing fluff; focus on technical accuracy	2026-01-11 00:45:00 -07:00
Ryan Malloy	c935cec7b6	Add MS Office-themed test dashboard with interactive reporting - Self-contained HTML dashboard with MS Office 365 design - pytest plugin captures inputs, outputs, and errors per test - Unified orchestrator runs pytest + torture tests together - Test files persisted in reports/test_files/ with relative links - GitHub Actions workflow with PR comments and job summaries - Makefile with convenient commands (test, view-dashboard, etc.) - Works offline with embedded JSON data (no CORS issues)	2026-01-11 00:28:12 -07:00
Ryan Malloy	76c7a0b2d0	Add decorators for field defaults and error handling, fix Excel performance - Create @resolve_field_defaults decorator to handle Pydantic FieldInfo objects when tools are called directly (outside MCP framework) - Create @handle_office_errors decorator for consistent error wrapping - Apply decorators to Excel and Word mixins, removing ~100 lines of boilerplate code - Fix Excel formula extraction performance: load workbooks once before loop instead of per-cell (100x faster with calculated values) - Update test suite to use correct mock patch paths (patch where names are looked up, not where defined) - Add torture_test.py for real document validation	2026-01-10 23:51:30 -07:00
Ryan Malloy	1ad2abb617	Implement cursor-based pagination system for large document processing - Add comprehensive pagination infrastructure based on MCP Playwright patterns - Integrate automatic pagination into convert_to_markdown tool for documents >25k tokens - Support cursor-based navigation with session isolation and security - Prevent MCP token limit errors for massive documents (200+ pages) - Maintain document structure and context across paginated sections - Add configurable page sizes, return_all bypass, and intelligent token estimation - Enable seamless navigation through extremely dense documents that exceed limits by 100x	2025-09-26 19:06:05 -06:00
Ryan Malloy	0748eec48d	Fix FastMCP stdio server import - Use app.run_stdio_async() instead of deprecated stdio_server import - Aligns with FastMCP 2.11.3 API - Server now starts correctly with uv run mcp-office-tools - Maintains all MCPMixin functionality and tool registration	2025-09-26 15:49:00 -06:00
Ryan Malloy	22f657b32b	Fix server entry point for pyproject.toml script - Add main() function back to server.py for CLI script entry point - Maintains FastMCP MCPMixin pattern while fixing uvx execution - Server now starts properly with 'uvx --from . mcp-office-tools' - Preserves all 7 tools with official mixin registration	2025-09-26 14:15:25 -06:00
Ryan Malloy	9d6a9fc24c	Refactor server architecture using mcpmixin pattern - Split monolithic 2209-line server.py into organized mixin classes - UniversalMixin: Format-agnostic tools (extract_text, extract_images, etc.) - WordMixin: Word-specific tools (convert_to_markdown with chapter_name support) - ExcelMixin: Placeholder for future Excel-specific tools - PowerPointMixin: Placeholder for future PowerPoint-specific tools Benefits: • Improved maintainability and separation of concerns • Better testability with isolated mixins • Easier team collaboration on different file types • Reduced cognitive load per module • Preserved all 7 existing tools with full functionality Architecture now supports clean expansion for format-specific tools while maintaining backward compatibility through legacy server backup.	2025-09-26 13:08:53 -06:00
Ryan Malloy	778ef3a2d4	Add chapter-based extraction for documents without bookmarks - Add chapter_name parameter to convert_to_markdown tool - Implement _find_chapter_content_range() for heading-based navigation - Add _get_available_headings() to help users find chapter names - Include chapter extraction metadata in results - Enhanced ultra-fast summary with available headings - Provides alternative to bookmark extraction when bookmarks unavailable	2025-08-22 08:14:23 -06:00
Ryan Malloy	6484036b69	📖 Add bookmark-based chapter extraction for precise content targeting - Add bookmark_name parameter for extracting specific chapters/sections - Implement bookmark boundary detection using Word XML structure - Extract content between bookmark start/end markers with smart extension - More reliable than page ranges - bookmarks are anchored to exact locations - Support chapter extraction like bookmark_name='Chapter1_Start' - Include bookmark metadata in response with element ranges - Perfect for extracting individual chapters from large documents	2025-08-22 08:02:50 -06:00
Ryan Malloy	b2033fc239	🔥 Fix critical issue: page_range was processing entire document - Replace unreliable Word page detection with element-based limiting - Cap extraction at 25 paragraphs per 'page' requested (max 100 total) - Cap extraction at 8k chars per 'page' requested (max 40k total) - Add early termination when limits reached - Add processing_limits metadata to show actual extraction stats - Prevent 1.28M token responses by stopping at reasonable content limits - Single page (page_range='1') now limited to ~25 paragraphs/8k chars	2025-08-22 08:00:02 -06:00
Ryan Malloy	431022e113	🚀 Add ultra-fast summary mode to prevent massive 1M+ token responses - Bypass all complex processing in summary_only mode - Extract only first 50 paragraphs, max 10 headings, 5 content paragraphs - Add bookmark detection for chapter navigation hints - Limit summary content to 2000 chars max - Prevent 1,282,370 token responses with surgical precision - Show bookmark names as chapter start indicators	2025-08-22 07:56:19 -06:00
Ryan Malloy	3dffce6904	⚡ Add aggressive content limiting to prevent MCP 25k token errors - Implement smart content truncation at ~80k chars (~20k tokens) - Preserve document structure when truncating (stop at natural breaks) - Add clear truncation notices with guidance for smaller ranges - Update chunking suggestions to use safer 8-page chunks - Enhance recommendations to suggest 3-8 page ranges - Prevent 29,869 > 25,000 token errors while maintaining usability	2025-08-21 02:50:04 -06:00
Ryan Malloy	9c2f299d49	📋 Add comprehensive Table of Contents extraction with smart chunking - Extract headings with page numbers during document processing - Generate optimized page ranges for each section/chapter - Provide intelligent chunking suggestions (15-page optimal chunks) - Classify section types (chapter, section, subsection, etc.) - Calculate actual section lengths based on heading positions - Include suggested_chunking with ready-to-use page ranges - Perfect for extracting 200+ page documents section by section	2025-08-21 02:47:01 -06:00
Ryan Malloy	d94bd39da6	🧠 Add intelligent processing recommendations for optimal workflow - Analyze document size and complexity before processing - Provide clear workflow recommendations in response metadata - Strongly recommend summary_only + page_range for large documents (>10 pages) - Add warning system for suboptimal usage patterns - Update parameter descriptions with best practice guidance - Help users avoid 25k token response limits proactively	2025-08-19 13:16:48 -06:00
Ryan Malloy	a485e05759	⚡ Implement true page-range filtering for efficient processing - Add page break detection using Word XML structure - Process only specified pages instead of full document + truncation - Route page-range requests to python-docx for granular control - Skip mammoth for page-specific processing (mammoth processes full doc) - Add page metadata to results when filtering is used - Significantly reduce memory usage and response size for large documents	2025-08-19 13:12:19 -06:00
Ryan Malloy	f884c99bbd	🎯 Add page-range chunking and summary mode for large documents - Replace character-based chunking with page-range support (e.g., '1-5', '1,3,5-10') - Add summary_only mode to prevent large response errors (>25k tokens) - Implement response size limiting with 5000 char truncation in summary mode - Support selective page processing for better memory efficiency - Maintain backward compatibility with existing parameters	2025-08-18 23:32:00 -06:00
Ryan Malloy	b3caed78d3	✨ Add comprehensive Markdown conversion with image support - Add convert_to_markdown tool for .docx/.doc files - Support multiple image handling modes (base64, files, references) - Implement large document chunking for performance - Preserve document structure (headings, lists, tables) - Smart fallback methods (mammoth → python-docx → custom) - Handle both modern and legacy Word formats	2025-08-18 23:23:59 -06:00
Ryan Malloy	1b359c4c7c	✨ Transform README into a stunning showcase - Add eye-catching visual design with emojis and badges - Create compelling hero section with value proposition - Include real-world benchmarks and performance metrics - Add enterprise success stories and use cases - Implement collapsible sections for better organization - Include Mermaid architecture diagram - Add comprehensive feature matrix with visual indicators - Create roadmap and community sections - Enhance installation and setup instructions - Make it GitHub-ready with proper formatting 🚀 Now ready to wow potential users and contributors!	2025-08-18 01:05:03 -06:00
Ryan Malloy	b681cb030b	Initial commit: MCP Office Tools v0.1.0 - Comprehensive Microsoft Office document processing server - Support for Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), CSV - 6 universal tools: extract_text, extract_images, extract_metadata, detect_office_format, analyze_document_health, get_supported_formats - Multi-library fallback system for robust processing - URL support with intelligent caching - Legacy Office format support (97-2003) - FastMCP integration with async architecture - Production ready with comprehensive documentation 🤖 Generated with Claude Code (claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 01:01:48 -06:00

31 Commits
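
Several commits above (f884c99bbd, a485e05759, 210aa99e0b) revolve around page-range strings such as '1-5' and '1,3,5-10'. The following is a minimal sketch of how such a string might be expanded into page numbers. It is not the repository's code: the function name parse_page_range_sketch and its handling of bad input are assumptions made for illustration, and the actual parse_page_range helper mentioned in af6aadf559 may behave differently.

```python
def parse_page_range_sketch(spec: str) -> list[int]:
    """Expand a page-range string such as '1,3,5-10' into sorted, unique pages.

    Hypothetical helper for illustration only; not taken from the repository.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as '1,,3'
        if "-" in part:
            # 'start-end' is treated as an inclusive range
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_range_sketch("1,3,5-10"))
```

Returning a sorted, de-duplicated list keeps overlapping specs such as '1-3,2-4' from requesting the same page twice; whether the real tool de-duplicates is not stated in the log.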