9 Commits

Author SHA1 Message Date
b2033fc239 🔥 Fix critical issue: page_range was processing entire document
- Replace unreliable Word page detection with element-based limiting
- Cap extraction at 25 paragraphs per 'page' requested (max 100 total)
- Cap extraction at 8k chars per 'page' requested (max 40k total)
- Add early termination when limits reached
- Add processing_limits metadata to show actual extraction stats
- Prevent 1.28M token responses by stopping at reasonable content limits
- Single page (page_range='1') now limited to ~25 paragraphs/8k chars
2025-08-22 08:00:02 -06:00
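A minimal sketch of the element-based limiting described in b2033fc239. It is not the repository's code: `extraction_limits` and `take_until_limit` are hypothetical names, and only the numeric caps (25 paragraphs and 8k chars per requested page, 100 paragraphs and 40k chars overall) come from the commit message.

```python
def extraction_limits(pages_requested: int) -> tuple:
    """Return (max_paragraphs, max_chars) for a request covering N 'pages'."""
    pages = max(1, pages_requested)
    max_paragraphs = min(25 * pages, 100)   # 25 per page, capped at 100
    max_chars = min(8_000 * pages, 40_000)  # 8k per page, capped at 40k
    return max_paragraphs, max_chars


def take_until_limit(paragraphs, pages_requested: int):
    """Collect paragraph texts, stopping as soon as either cap would be exceeded."""
    max_paragraphs, max_chars = extraction_limits(pages_requested)
    kept, chars, limit_reached = [], 0, False
    for text in paragraphs:
        if len(kept) >= max_paragraphs or chars + len(text) > max_chars:
            limit_reached = True  # early termination: skip the rest of the document
            break
        kept.append(text)
        chars += len(text)
    # Hypothetical stand-in for the processing_limits metadata
    stats = {"paragraphs": len(kept), "chars": chars, "limit_reached": limit_reached}
    return kept, stats
```

Under these assumptions, `page_range='1'` yields at most 25 paragraphs or 8,000 characters, whichever cap is hit first.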
431022e113 🚀 Add ultra-fast summary mode to prevent massive 1M+ token responses
- Bypass all complex processing in summary_only mode
- Extract only first 50 paragraphs, max 10 headings, 5 content paragraphs
- Add bookmark detection for chapter navigation hints
- Limit summary content to 2000 chars max
- Prevent oversized responses (one reached 1,282,370 tokens)
- Show bookmark names as chapter start indicators
2025-08-22 07:56:19 -06:00
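The summary-mode limits in 431022e113 could be sketched as below. This is an illustration under stated assumptions: `quick_summary` is a hypothetical name, paragraphs are assumed to arrive as `(style_name, text)` pairs, and headings are assumed to be recognizable by a style name starting with "Heading". Bookmark detection is left out.

```python
def quick_summary(paragraphs):
    """Build a small summary from the first 50 paragraphs only.

    paragraphs: list of (style_name, text) tuples (assumed input shape).
    """
    headings, content = [], []
    for style, text in paragraphs[:50]:      # never look past 50 paragraphs
        text = text.strip()
        if not text:
            continue
        if style.startswith("Heading"):
            if len(headings) < 10:           # at most 10 headings
                headings.append(text)
        elif len(content) < 5:               # at most 5 content paragraphs
            content.append(text)
    summary = "\n".join(content)[:2000]      # hard cap on summary length
    return {"headings": headings, "summary": summary}
```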
3dffce6904 Add aggressive content limiting to prevent MCP 25k token errors
- Implement smart content truncation at ~80k chars (~20k tokens)
- Preserve document structure when truncating (stop at natural breaks)
- Add clear truncation notices with guidance for smaller ranges
- Update chunking suggestions to use safer 8-page chunks
- Enhance recommendations to suggest 3-8 page ranges
- Prevent errors where a response (e.g., 29,869 tokens) exceeds the 25,000-token limit, while maintaining usability
2025-08-21 02:50:04 -06:00
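One way to truncate at a natural break, as 3dffce6904 describes, is sketched below. `truncate_at_break` and the notice wording are hypothetical. The 80,000-character default follows the commit message, which equates it with roughly 20k tokens (about 4 characters per token). The notice is appended after the cut, so the result can exceed `max_chars` by the length of the notice.

```python
def truncate_at_break(text: str, max_chars: int = 80_000):
    """Cut text at the last paragraph or line break before max_chars.

    Returns (possibly_truncated_text, was_truncated).
    """
    if len(text) <= max_chars:
        return text, False
    window = text[:max_chars]
    cut = window.rfind("\n\n")       # prefer a paragraph boundary
    if cut <= 0:
        cut = window.rfind("\n")     # fall back to a line boundary
    if cut <= 0:
        cut = max_chars              # no break found: hard cut
    notice = "\n\n[Content truncated. Request a smaller page range to see the rest.]"
    return window[:cut].rstrip() + notice, True
```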
9c2f299d49 📋 Add comprehensive Table of Contents extraction with smart chunking
- Extract headings with page numbers during document processing
- Generate optimized page ranges for each section/chapter
- Provide intelligent chunking suggestions (15-page optimal chunks)
- Classify section types (chapter, section, subsection, etc.)
- Calculate actual section lengths based on heading positions
- Include suggested_chunking with ready-to-use page ranges
- Perfect for extracting 200+ page documents section by section
2025-08-21 02:47:01 -06:00
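The chunking suggestions in 9c2f299d49 amount to splitting a section's page span into fixed-size ranges. A minimal sketch, assuming a section is known by its first and last page; `suggest_chunks` is a hypothetical name, and the 15-page default is the figure from this commit (3dffce6904 later lowers the suggestion to 8 pages).

```python
def suggest_chunks(start_page: int, end_page: int, chunk_size: int = 15):
    """Split an inclusive page span into page_range strings of at most chunk_size pages."""
    if chunk_size < 1 or start_page < 1 or end_page < start_page:
        raise ValueError("invalid page span or chunk size")
    chunks = []
    page = start_page
    while page <= end_page:
        last = min(page + chunk_size - 1, end_page)
        # A single-page chunk is written as "16" rather than "16-16"
        chunks.append(f"{page}-{last}" if last > page else str(page))
        page = last + 1
    return chunks
```

For example, `suggest_chunks(1, 40)` gives `['1-15', '16-30', '31-40']`.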
d94bd39da6 🧠 Add intelligent processing recommendations for optimal workflow
- Analyze document size and complexity before processing
- Provide clear workflow recommendations in response metadata
- Strongly recommend summary_only + page_range for large documents (>10 pages)
- Add warning system for suboptimal usage patterns
- Update parameter descriptions with best practice guidance
- Help users avoid 25k token response limits proactively
2025-08-19 13:16:48 -06:00
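The warning rule from d94bd39da6 might look like the sketch below. `recommend` and its message text are hypothetical, and the page estimate is assumed to be computed elsewhere. Only the 10-page threshold and the `summary_only` and `page_range` parameter names come from the commit messages.

```python
from typing import List, Optional


def recommend(estimated_pages: int,
              summary_only: bool = False,
              page_range: Optional[str] = None) -> List[str]:
    """Return usage warnings for a request; an empty list means no concerns."""
    warnings = []
    # Large document requested in full, with neither safeguard enabled
    if estimated_pages > 10 and not summary_only and not page_range:
        warnings.append(
            f"Document has about {estimated_pages} pages: use summary_only=True "
            "first, then request specific sections with page_range."
        )
    return warnings
```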
a485e05759 Implement true page-range filtering for efficient processing
- Add page break detection using Word XML structure
- Process only specified pages instead of full document + truncation
- Route page-range requests to python-docx for granular control
- Skip mammoth for page-specific processing (mammoth processes full doc)
- Add page metadata to results when filtering is used
- Significantly reduce memory usage and response size for large documents
2025-08-19 13:12:19 -06:00
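Page-break detection from Word XML, as in a485e05759, can be illustrated with the standard library alone. This sketch is not the repository's python-docx code: `paragraphs_by_page` is a hypothetical name, and it takes the raw `word/document.xml` content as a string. It counts only explicit breaks (`<w:br w:type="page"/>`) in top-level body paragraphs. Soft breaks recorded as `w:lastRenderedPageBreak`, and content inside tables, are ignored here.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs_by_page(document_xml: str) -> dict:
    """Group paragraph text by page number, using explicit page breaks only."""
    root = ET.fromstring(document_xml)
    body = root.find("w:body", NS)
    pages, page = {}, 1
    if body is None:
        return pages
    for para in body.findall("w:p", NS):          # top-level paragraphs only
        buffer = []
        for node in para.iter():                  # document order within the paragraph
            if node.tag == f"{{{W}}}t":
                buffer.append(node.text or "")
            elif node.tag == f"{{{W}}}br" and node.get(f"{{{W}}}type") == "page":
                if buffer:                        # text before the break stays on this page
                    pages.setdefault(page, []).append("".join(buffer))
                    buffer = []
                page += 1
        if buffer:
            pages.setdefault(page, []).append("".join(buffer))
    return pages
```

With a mapping like this, a page-range request can select only the wanted keys instead of converting the whole document and truncating afterwards.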
f884c99bbd 🎯 Add page-range chunking and summary mode for large documents
- Replace character-based chunking with page-range support (e.g., '1-5', '1,3,5-10')
- Add summary_only mode to prevent large response errors (>25k tokens)
- Implement response size limiting with 5000 char truncation in summary mode
- Support selective page processing for better memory efficiency
- Maintain backward compatibility with existing parameters
2025-08-18 23:32:00 -06:00
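The `page_range` syntax introduced in f884c99bbd (e.g., '1-5', '1,3,5-10') can be parsed along these lines. `parse_page_range` is a hypothetical name, and the validation rules (1-based pages, ascending ranges) are assumptions, not behavior confirmed by the commit.

```python
def parse_page_range(spec: str):
    """Expand a spec such as '1,3,5-10' into a sorted list of unique page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue                      # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)   # int() raises ValueError on bad input
            if start < 1 or end < start:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            number = int(part)
            if number < 1:
                raise ValueError(f"invalid page: {part!r}")
            pages.add(number)
    return sorted(pages)
```

For example, `parse_page_range('1,3,5-10')` returns `[1, 3, 5, 6, 7, 8, 9, 10]`.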
b3caed78d3 Add comprehensive Markdown conversion with image support
- Add convert_to_markdown tool for .docx/.doc files
- Support multiple image handling modes (base64, files, references)
- Implement large document chunking for performance
- Preserve document structure (headings, lists, tables)
- Smart fallback methods (mammoth → python-docx → custom)
- Handle both modern and legacy Word formats
2025-08-18 23:23:59 -06:00
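The fallback order named in b3caed78d3 (mammoth → python-docx → custom) follows a common pattern: try each converter in turn and return the first success. The sketch below shows that pattern only. `convert_with_fallback` is a hypothetical name, and the converters are passed in as plain callables, so no real mammoth or python-docx call is assumed.

```python
from typing import Callable, Sequence, Tuple


def convert_with_fallback(
    path: str,
    converters: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try converters in order; return (method_name, markdown) from the first that works."""
    errors = []
    for name, convert in converters:
        try:
            return name, convert(path)
        except Exception as exc:          # any failure moves on to the next method
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All converters failed: " + "; ".join(errors))
```

Returning the method name alongside the result lets the response metadata record which library actually produced the output.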
b681cb030b Initial commit: MCP Office Tools v0.1.0
- Comprehensive Microsoft Office document processing server
- Support for Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt), CSV
- 6 universal tools: extract_text, extract_images, extract_metadata, detect_office_format, analyze_document_health, get_supported_formats
- Multi-library fallback system for robust processing
- URL support with intelligent caching
- Legacy Office format support (97-2003)
- FastMCP integration with async architecture
- Production ready with comprehensive documentation

🤖 Generated with Claude Code (claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 01:01:48 -06:00