mcp-pdf-tools

MCP/mcp-pdf-tools

Fork 0

Commit Graph

Author	SHA1	Message	Date
Ryan Malloy	dfbf3d1870	🔧 v2.0.7: Fix table extraction token overflow with smart limiting PROBLEM: Table extraction from large PDFs was exceeding MCP's 25,000 token limit, causing "response too large" errors. A 5-page PDF with large tables generated 59,005 tokens, more than double the allowed limit. SOLUTION: Added flexible table data limiting with two new parameters: - max_rows_per_table: Limit rows returned per table (prevents overflow) - summary_only: Return only metadata without table data IMPLEMENTATION: 1. Added new parameters to extract_tables() method signature 2. Created _process_table_data() helper for consistent limiting logic 3. Updated all 3 extraction methods (Camelot, pdfplumber, Tabula) 4. Enhanced table metadata with truncation tracking: - total_rows: Full row count from PDF - rows_returned: Actual rows in response (after limiting) - rows_truncated: Number of rows omitted (if limited) USAGE EXAMPLES: # Summary mode - metadata only (smallest response) extract_tables(pdf_path, pages="1-5", summary_only=True) # Limited data - first 100 rows per table extract_tables(pdf_path, pages="1-5", max_rows_per_table=100) # Full data (default behavior, may overflow on large tables) extract_tables(pdf_path, pages="1-5") BENEFITS: - Prevents MCP token overflow errors - Maintains backward compatibility (new params are optional) - Clear guidance through metadata (shows when truncation occurred) - Flexible - users choose between summary/limited/full modes FILES MODIFIED: - src/mcp_pdf/mixins_official/table_extraction.py (all changes) - src/mcp_pdf/server.py (version bump to 2.0.7) - pyproject.toml (version bump to 2.0.7) VERSION: 2.0.7 PUBLISHED: https://pypi.org/project/mcp-pdf/2.0.7/	2025-11-03 18:26:34 -07:00
Ryan Malloy	3327137536	🚀 v2.0.5: Fix page range parsing across all PDF tools Major architectural improvements and bug fixes in the v2.0.x series: ## v2.0.5 - Page Range Parsing (Current Release) - Fix page range parsing bug affecting 6 mixins (e.g., "93-95" or "11-30") - Create shared parse_pages_parameter() utility function - Support mixed formats: "1,3-5,7,10-15" - Update: pdf_utilities, content_analysis, image_processing, misc_tools, table_extraction, text_extraction ## v2.0.4 - Chunk Hint Fix - Fix next_chunk_hint to show correct page ranges - Dynamic calculation based on actual pages being extracted - Example: "30-50" now correctly shows "40-49" for next chunk ## v2.0.3 - Initial Range Support - Add page range support to text extraction ("11-30") - Fix _parse_pages_parameter to handle ranges with Python's range() - Convert 1-based user input to 0-based internal indexing ## v2.0.2 - Lazy Import Fix - Fix ModuleNotFoundError for reportlab on startup - Implement lazy imports for optional dependencies - Graceful degradation with helpful error messages ## v2.0.1 - Dependency Restructuring - Move reportlab to optional [forms] extra - Document installation: uvx --with mcp-pdf[forms] mcp-pdf ## v2.0.0 - Official FastMCP Pattern Migration - Migrate to official fastmcp.contrib.mcp_mixin pattern - Create 12 specialized mixins with 42 tools total - Architecture: mixins_official/ using MCPMixin base class - Backwards compatibility: server_legacy.py preserved Technical Improvements: - Centralized utility functions (DRY principle) - Consistent behavior across all PDF tools - Better error messages with actionable instructions - Library-specific adapters for table extraction Files Changed: - New: src/mcp_pdf/mixins_official/utils.py (shared utilities) - Updated: 6 mixins with improved page parsing - Version: pyproject.toml, server.py → 2.0.5 PyPI: https://pypi.org/project/mcp-pdf/2.0.5/	2025-11-03 17:12:37 -07:00

Author

SHA1

Message

Date

Ryan Malloy

dfbf3d1870

🔧 v2.0.7: Fix table extraction token overflow with smart limiting

PROBLEM:
Table extraction from large PDFs was exceeding MCP's 25,000 token limit,
causing "response too large" errors. A 5-page PDF with large tables
generated 59,005 tokens, more than double the allowed limit.

SOLUTION:
Added flexible table data limiting with two new parameters:
- max_rows_per_table: Limit rows returned per table (prevents overflow)
- summary_only: Return only metadata without table data

IMPLEMENTATION:
1. Added new parameters to extract_tables() method signature
2. Created _process_table_data() helper for consistent limiting logic
3. Updated all 3 extraction methods (Camelot, pdfplumber, Tabula)
4. Enhanced table metadata with truncation tracking:
   - total_rows: Full row count from PDF
   - rows_returned: Actual rows in response (after limiting)
   - rows_truncated: Number of rows omitted (if limited)

USAGE EXAMPLES:
# Summary mode - metadata only (smallest response)
extract_tables(pdf_path, pages="1-5", summary_only=True)

# Limited data - first 100 rows per table
extract_tables(pdf_path, pages="1-5", max_rows_per_table=100)

# Full data (default behavior, may overflow on large tables)
extract_tables(pdf_path, pages="1-5")

BENEFITS:
- Prevents MCP token overflow errors
- Maintains backward compatibility (new params are optional)
- Clear guidance through metadata (shows when truncation occurred)
- Flexible - users choose between summary/limited/full modes

FILES MODIFIED:
- src/mcp_pdf/mixins_official/table_extraction.py (all changes)
- src/mcp_pdf/server.py (version bump to 2.0.7)
- pyproject.toml (version bump to 2.0.7)

VERSION: 2.0.7
PUBLISHED: https://pypi.org/project/mcp-pdf/2.0.7/

2025-11-03 18:26:34 -07:00

Ryan Malloy

3327137536

🚀 v2.0.5: Fix page range parsing across all PDF tools

Major architectural improvements and bug fixes in the v2.0.x series:

## v2.0.5 - Page Range Parsing (Current Release)
- Fix page range parsing bug affecting 6 mixins (e.g., "93-95" or "11-30")
- Create shared parse_pages_parameter() utility function
- Support mixed formats: "1,3-5,7,10-15"
- Update: pdf_utilities, content_analysis, image_processing, misc_tools, table_extraction, text_extraction

## v2.0.4 - Chunk Hint Fix
- Fix next_chunk_hint to show correct page ranges
- Dynamic calculation based on actual pages being extracted
- Example: "30-50" now correctly shows "40-49" for next chunk

## v2.0.3 - Initial Range Support
- Add page range support to text extraction ("11-30")
- Fix _parse_pages_parameter to handle ranges with Python's range()
- Convert 1-based user input to 0-based internal indexing

## v2.0.2 - Lazy Import Fix
- Fix ModuleNotFoundError for reportlab on startup
- Implement lazy imports for optional dependencies
- Graceful degradation with helpful error messages

## v2.0.1 - Dependency Restructuring
- Move reportlab to optional [forms] extra
- Document installation: uvx --with mcp-pdf[forms] mcp-pdf

## v2.0.0 - Official FastMCP Pattern Migration
- Migrate to official fastmcp.contrib.mcp_mixin pattern
- Create 12 specialized mixins with 42 tools total
- Architecture: mixins_official/ using MCPMixin base class
- Backwards compatibility: server_legacy.py preserved

Technical Improvements:
- Centralized utility functions (DRY principle)
- Consistent behavior across all PDF tools
- Better error messages with actionable instructions
- Library-specific adapters for table extraction

Files Changed:
- New: src/mcp_pdf/mixins_official/utils.py (shared utilities)
- Updated: 6 mixins with improved page parsing
- Version: pyproject.toml, server.py → 2.0.5

PyPI: https://pypi.org/project/mcp-pdf/2.0.5/

2025-11-03 17:12:37 -07:00

2 Commits