mcp-pdf-tools

Author	SHA1	Message	Date
Ryan Malloy	dfbf3d1870	🔧 v2.0.7: Fix table extraction token overflow with smart limiting PROBLEM: Table extraction from large PDFs was exceeding MCP's 25,000 token limit, causing "response too large" errors. A 5-page PDF with large tables generated 59,005 tokens, more than double the allowed limit. SOLUTION: Added flexible table data limiting with two new parameters: - max_rows_per_table: Limit rows returned per table (prevents overflow) - summary_only: Return only metadata without table data IMPLEMENTATION: 1. Added new parameters to extract_tables() method signature 2. Created _process_table_data() helper for consistent limiting logic 3. Updated all 3 extraction methods (Camelot, pdfplumber, Tabula) 4. Enhanced table metadata with truncation tracking: - total_rows: Full row count from PDF - rows_returned: Actual rows in response (after limiting) - rows_truncated: Number of rows omitted (if limited) USAGE EXAMPLES: # Summary mode - metadata only (smallest response) extract_tables(pdf_path, pages="1-5", summary_only=True) # Limited data - first 100 rows per table extract_tables(pdf_path, pages="1-5", max_rows_per_table=100) # Full data (default behavior, may overflow on large tables) extract_tables(pdf_path, pages="1-5") BENEFITS: - Prevents MCP token overflow errors - Maintains backward compatibility (new params are optional) - Clear guidance through metadata (shows when truncation occurred) - Flexible - users choose between summary/limited/full modes FILES MODIFIED: - src/mcp_pdf/mixins_official/table_extraction.py (all changes) - src/mcp_pdf/server.py (version bump to 2.0.7) - pyproject.toml (version bump to 2.0.7) VERSION: 2.0.7 PUBLISHED: https://pypi.org/project/mcp-pdf/2.0.7/	2025-11-03 18:26:34 -07:00
Ryan Malloy	fa65fa6e0c	🔧 v2.0.6: Fix async/await bug in validate_output_path calls Remove incorrect 'await' keywords from validate_output_path() calls across all mixins. validate_output_path() is a synchronous function, not async. Fixed in 15 locations across 6 mixins: - advanced_forms.py (4 calls) - annotations.py (3 calls) - document_assembly.py (2 calls) - form_management.py (2 calls) - image_processing.py (1 call) - misc_tools.py (4 calls) Error: 'object PosixPath can't be used in 'await' expression' Root cause: Incorrectly awaiting synchronous Path validation function Fix: Removed await keyword from all validate_output_path() calls PyPI: https://pypi.org/project/mcp-pdf/2.0.6/	2025-11-03 18:03:34 -07:00
Ryan Malloy	3327137536	🚀 v2.0.5: Fix page range parsing across all PDF tools Major architectural improvements and bug fixes in the v2.0.x series: ## v2.0.5 - Page Range Parsing (Current Release) - Fix page range parsing bug affecting 6 mixins (e.g., "93-95" or "11-30") - Create shared parse_pages_parameter() utility function - Support mixed formats: "1,3-5,7,10-15" - Update: pdf_utilities, content_analysis, image_processing, misc_tools, table_extraction, text_extraction ## v2.0.4 - Chunk Hint Fix - Fix next_chunk_hint to show correct page ranges - Dynamic calculation based on actual pages being extracted - Example: "30-50" now correctly shows "40-49" for next chunk ## v2.0.3 - Initial Range Support - Add page range support to text extraction ("11-30") - Fix _parse_pages_parameter to handle ranges with Python's range() - Convert 1-based user input to 0-based internal indexing ## v2.0.2 - Lazy Import Fix - Fix ModuleNotFoundError for reportlab on startup - Implement lazy imports for optional dependencies - Graceful degradation with helpful error messages ## v2.0.1 - Dependency Restructuring - Move reportlab to optional [forms] extra - Document installation: uvx --with mcp-pdf[forms] mcp-pdf ## v2.0.0 - Official FastMCP Pattern Migration - Migrate to official fastmcp.contrib.mcp_mixin pattern - Create 12 specialized mixins with 42 tools total - Architecture: mixins_official/ using MCPMixin base class - Backwards compatibility: server_legacy.py preserved Technical Improvements: - Centralized utility functions (DRY principle) - Consistent behavior across all PDF tools - Better error messages with actionable instructions - Library-specific adapters for table extraction Files Changed: - New: src/mcp_pdf/mixins_official/utils.py (shared utilities) - Updated: 6 mixins with improved page parsing - Version: pyproject.toml, server.py → 2.0.5 PyPI: https://pypi.org/project/mcp-pdf/2.0.5/	2025-11-03 17:12:37 -07:00
Ryan Malloy	8cbf542df1	🔧 Fix output path security with MCP_PDF_ALLOWED_PATHS environment variable BREAKING ISSUE FIXED: - Users reported "Output path not allowed: images" error - extract_images tool was rejecting relative paths due to overly restrictive security NEW SECURITY MODEL: - MCP_PDF_ALLOWED_PATHS environment variable controls allowed output directories - If unset: Allows any directory with "security theater" warnings - If set: Restricts outputs to specified colon-separated paths - Cross-platform compatible (: on Unix, ; on Windows) SECURITY PHILOSOPHY ENHANCED: - "TRUST NO ONE" - honest about application-level security limitations - Clear warnings that this is "security theater" - Emphasis on OS-level permissions and process isolation - Educational guidance on real security practices TECHNICAL CHANGES: - validate_output_path() rewritten with environment variable control - Path validation uses relative_to() for proper containment checking - Enhanced warning messages with security education - Updated documentation with honest security assessment DOCUMENTATION UPDATES: - Added MCP_PDF_ALLOWED_PATHS to configuration section - New "REAL Security" section with OS-level recommendations - Clear explanation of security theater vs actual protection Version: 1.1.1 (patch version for critical bugfix)	2025-09-23 23:40:05 -06:00
Ryan Malloy	856dd41996	✨ Add comprehensive link extraction tool (24th PDF tool) New Features: - extract_links: Extract all PDF hyperlinks with advanced filtering - Page-specific filtering (e.g., "1,3,5" or "1-5,8,10-12") - Link type categorization: external URLs, internal pages, emails, documents - Coordinate tracking for precise link positioning - FastMCP integration with proper tool registration - Version banner display following CLAUDE.md guidelines Technical Improvements: - Enhanced startup banner with package version display - Updated documentation to reflect 24 specialized tools - Proper FastMCP @mcp.tool() decorator usage - Comprehensive error handling and security validation Documentation Updates: - README.md: Updated tool count and installation guides - CLAUDE.md: Added link extraction to implemented features - LOCAL_DEVELOPMENT.md: Enhanced with scoped installation commands Version: 1.1.0 (minor version bump for new feature)	2025-09-23 20:41:16 -06:00
Ryan Malloy	ebf6bb8a43	🚀 Release v1.0.1: Bug fixes and local development tools - Fix variable scope bug in extract_text function - Add local development setup with claude-mcp-manager - Update author information - Add comprehensive local development documentation 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-07 00:58:51 -06:00
Ryan Malloy	8d01c44d4f	🚀 Rename to mcp-pdf and prepare for PyPI publication Package Rebranding: - Renamed package from mcp-pdf-tools to mcp-pdf (cleaner name) - Updated version to 1.0.0 (production ready with security hardening) - Updated all import paths and references throughout codebase PyPI Preparation: - Enhanced package description and metadata - Added proper project URLs and homepage - Updated CLI command from mcp-pdf-tools to mcp-pdf - Built distribution packages (wheel + source) Testing & Validation: - All 20 security tests pass with new package structure - Local installation and import tests successful - CLI command working correctly - Package ready for PyPI publication The secure, production-ready PDF processing platform is now ready for public distribution and installation via pip. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-06 15:42:59 -06:00
Ryan Malloy	75f8548668	🔒 Comprehensive security hardening and vulnerability fixes Some checks failed Security Scan / security-scan (push) Has been cancelled Details Implemented extensive security improvements to prevent attacks and ensure production readiness: Critical Security Fixes: - Fixed path traversal vulnerability in get_pdf_image function - Added file size limits (100MB PDFs, 50MB images) to prevent DoS - Implemented secure output path validation with directory restrictions - Added page count limits (1000 pages max) for resource protection - Secured JSON parameter parsing with 10KB size limits Access Control & Validation: - URL allowlisting with SSRF protection (blocks localhost, internal IPs) - IPv6 security handling for comprehensive host blocking - Input validation framework with length limits and sanitization - Secure file permissions (0o700 dirs, 0o600 files) Error Handling & Privacy: - Sanitized error messages to prevent information disclosure - Automatic removal of sensitive patterns (paths, emails, SSNs) - Generic error responses for failed operations Infrastructure & Monitoring: - Added security scanning tools (safety, pip-audit) - GitHub Actions workflow for continuous vulnerability monitoring - Daily automated security assessments - Fixed pypdf vulnerability (5.9.0 → 6.0.0) Testing & Validation: - 20 comprehensive security tests (all passing) - Integration tests confirming functionality preservation - Zero known vulnerabilities in dependencies - Validated all security functions work correctly All security measures tested and verified. Project now production-ready with enterprise-grade security posture. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-06 15:35:31 -06:00
Ryan Malloy	ab1d9ed13e	✨ Add comprehensive PDF annotations and markup tools Implement complete collaboration toolkit with: - add_sticky_notes: Comment annotations with color support - add_highlights: Text highlighting with 8 color options - add_stamps: Approval stamps (APPROVED, DRAFT, CONFIDENTIAL, etc.) - extract_all_annotations: Export to JSON/CSV formats Also includes document assembly features: - merge_pdfs_advanced: Combine PDFs with bookmark preservation - split_pdf_by_pages: Extract specific page ranges - split_pdf_by_bookmarks: Auto-split by chapters/sections - reorder_pdf_pages: Rearrange page sequences All tools tested and working with proper error handling.	2025-09-04 17:18:06 -06:00
Ryan Malloy	95596e0236	✨ Add comprehensive PDF form creation and validation tools - Add complete PDF form lifecycle management - Create new forms with text, checkbox, dropdown, signature fields - Fill existing forms with JSON data and optional flattening - Add fields to existing PDFs with flexible positioning - Advanced field types: radio groups, textareas, date fields - Comprehensive validation engine with regex patterns - Email, phone, number, date format validation - Required field checking and length constraints - Visual validation cues with asterisks and format hints - Multi-field error reporting with detailed feedback - International character support and edge case handling - Enterprise-ready for complex business forms	2025-09-03 02:33:01 -06:00
Ryan Malloy	ae80388ec4	🎯 Add custom output paths and clean summary for image extraction Enhance extract_images with user-specified output directories and concise summary responses to improve user control and reduce context window clutter. Key Features: • Custom Output Directory: Users can specify where images are saved • Clean Summary Output: Concise extraction results instead of verbose metadata • Automatic Directory Creation: Creates output directories as needed • File-Level Details: Individual file info with human-readable sizes • Extraction Summary: Quick overview with total size and file count New Parameters: + output_directory: Optional custom path for saving extracted images + Defaults to cache directory if not specified + Creates directories automatically with proper permissions Response Format: - Removed: Verbose image metadata arrays that fill context windows + Added: Clean summary with extraction statistics + Added: File list with essential details (filename, path, size, dimensions) + Added: Human-readable extraction summary Benefits: ✅ User control over image file locations ✅ Reduced context window pollution ✅ Essential information without verbosity ✅ Better integration with user workflows ✅ Maintains MCP resource compatibility for cached images Example Response: { "success": true, "images_extracted": 3, "total_size": "2.4 MB", "output_directory": "/path/to/custom/dir", "files": [{"filename": "page_1_image_0.png", "path": "/path/...", "size": "800 KB", "dimensions": "1920x1080"}] } 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 13:50:09 -06:00
Ryan Malloy	e087a3b7a0	✨ Add MCP resource URIs for extracted PDF images Implement proper MCP resource protocol for image access, eliminating the need for clients to handle local file paths and enabling seamless image integration. Key Features: • MCP Resource Endpoint: pdf-image://{image_id} for direct image access • extract_images(): Returns resource_uri field with MCP resource links • pdf_to_markdown(): Embeds resource URIs in markdown image references • Automatic MIME type detection (image/png, image/jpeg) • Seamless client integration without file path handling Benefits: ✅ Direct image access via MCP resource protocol ✅ No local file path dependencies for MCP clients ✅ Proper MIME type handling for image display ✅ Clean markdown with working image links ✅ Standards-compliant MCP resource implementation Response Format Enhancement: + "resource_uri": "pdf-image://page_1_image_0" + Works in markdown: \![Image](pdf-image://page_1_image_0) + MIME Type: image/png or image/jpeg + Direct client access without file system dependencies This resolves the limitation where extracted images were only available as local file paths, making them truly accessible to MCP clients through the standardized resource protocol. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 11:42:46 -06:00
Ryan Malloy	374339a15d	🔧 Fix verbose base64 output in image extraction functions Resolve MCP client context overflow by saving images to files instead of returning base64-encoded data that fills client message windows. Key Changes: • extract_images(): Save images to CACHE_DIR with file paths in response • pdf_to_markdown(): Save embedded images to files with path references • Add format_file_size() utility for human-readable file sizes • Update function descriptions to clarify file-based output Benefits: ✅ Prevents context message window overflow in MCP clients ✅ Returns clean, concise metadata with file paths ✅ Maintains full image access through saved files ✅ Improves user experience with readable file sizes ✅ Reduces memory usage and response payload sizes Response Format Changes: - Remove: "data": "<base64_string>" (verbose) + Add: "file_path": "/tmp/mcp-pdf-processing/image.png" + Add: "filename": "page_1_image_0.png" + Add: "size_bytes": 12345 + Add: "size_human": "12.1 KB" This resolves the issue where image extraction caused excessive verbose output that overwhelmed MCP client interfaces. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 11:34:42 -06:00
Ryan Malloy	f601d44d99	Fix page numbering: Switch to user-friendly 1-based indexing Problem: Zero-based page numbers were confusing for users who naturally think of pages starting from 1. Solution: - Updated `parse_pages_parameter()` to convert 1-based user input to 0-based internal representation - All user-facing documentation now uses 1-based page numbering (page 1 = first page) - Internal processing continues to use 0-based indexing for PyMuPDF compatibility - Output page numbers are consistently displayed as 1-based for users Changes: - Enhanced documentation strings to clarify "1-based" page numbering - Updated README examples with 1-based page numbers and clarifying comments - Fixed split_pdf function to handle 1-based input correctly - Updated test cases to verify 1-based -> 0-based conversion - Added feature highlight: "User-Friendly: All page numbers use 1-based indexing" Impact: Much more intuitive for users - no more confusion about which page is "page 0"\! 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-11 04:32:20 -06:00
Ryan Malloy	f0365a0d75	Implement comprehensive PDF processing suite with 15 additional advanced tools Major expansion from 8 to 23 total tools covering: Document Analysis & Intelligence: - analyze_pdf_health: Comprehensive quality and health analysis - analyze_pdf_security: Security features and vulnerability assessment - classify_content: AI-powered document type classification - summarize_content: Intelligent content summarization with key insights - compare_pdfs: Advanced document comparison (text, structure, metadata) Layout & Visual Analysis: - analyze_layout: Page layout analysis with column detection - extract_charts: Chart, diagram, and visual element extraction - detect_watermarks: Watermark detection and analysis Content Manipulation: - extract_form_data: Interactive PDF form data extraction - split_pdf: Split PDFs at specified pages - merge_pdfs: Merge multiple PDFs into one - rotate_pages: Rotate pages by 90°/180°/270° Optimization & Utilities: - convert_to_images: Convert PDF pages to image files - optimize_pdf: File size optimization with quality levels - repair_pdf: Corrupted PDF repair and recovery Technical Enhancements: - All tools support HTTPS URLs with intelligent caching - Fixed MCP parameter validation for pages parameter - Comprehensive error handling and validation - Updated documentation with usage examples 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-11 04:27:04 -06:00
Ryan Malloy	58d43851b9	Add HTTPS URL support and fix MCP parameter validation Features: - HTTPS URL support: Process PDFs directly from URLs with intelligent caching - Smart caching: 1-hour cache to avoid repeated downloads - Content validation: Verify downloads are actually PDF files - Security: Proper User-Agent headers, HTTPS preferred over HTTP - MCP parameter fixes: Handle pages parameter as string "[2,3]" format - Backward compatibility: Still supports local file paths and list parameters Technical changes: - Added download_pdf_from_url() with caching and validation - Updated validate_pdf_path() to handle URLs and local paths - Added parse_pages_parameter() for flexible parameter parsing - Updated all 8 tools to accept string pages parameters - Enhanced error handling for network and validation issues All tools now support: - Local paths: "/path/to/file.pdf" - HTTPS URLs: "https://example.com/document.pdf" - Flexible pages: "[2,3]", "1,2,3", or [1,2,3] 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-11 02:25:53 -06:00
Ryan Malloy	c902e81e4d	Initial commit: Complete MCP PDF Tools server implementation Features: - 8 comprehensive PDF processing tools with intelligent fallbacks - Text extraction (PyMuPDF, pdfplumber, pypdf with auto-selection) - Table extraction (Camelot → pdfplumber → Tabula fallback chain) - OCR processing with Tesseract and preprocessing options - Document analysis (structure, metadata, scanned detection) - Image extraction with filtering capabilities - PDF to markdown conversion with metadata - Built on FastMCP framework with full MCP protocol support - Comprehensive error handling and user-friendly messages - Docker support and cross-platform compatibility - Complete test suite and examples 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-10 16:36:21 -06:00

17 Commits