mcp-pdf-tools

Author	SHA1	Message	Date
Ryan Malloy	75f8548668	🔒 Comprehensive security hardening and vulnerability fixes Implemented extensive security improvements to prevent attacks and ensure production readiness: Critical Security Fixes: - Fixed path traversal vulnerability in get_pdf_image function - Added file size limits (100MB PDFs, 50MB images) to prevent DoS - Implemented secure output path validation with directory restrictions - Added page count limits (1000 pages max) for resource protection - Secured JSON parameter parsing with 10KB size limits Access Control & Validation: - URL allowlisting with SSRF protection (blocks localhost, internal IPs) - IPv6 security handling for comprehensive host blocking - Input validation framework with length limits and sanitization - Secure file permissions (0o700 dirs, 0o600 files) Error Handling & Privacy: - Sanitized error messages to prevent information disclosure - Automatic removal of sensitive patterns (paths, emails, SSNs) - Generic error responses for failed operations Infrastructure & Monitoring: - Added security scanning tools (safety, pip-audit) - GitHub Actions workflow for continuous vulnerability monitoring - Daily automated security assessments - Fixed pypdf vulnerability (5.9.0 → 6.0.0) Testing & Validation: - 20 comprehensive security tests (all passing) - Integration tests confirming functionality preservation - Zero known vulnerabilities in dependencies - Validated all security functions work correctly All security measures tested and verified. Project now production-ready with enterprise-grade security posture. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-06 15:35:31 -06:00
Ryan Malloy	ab1d9ed13e	✨ Add comprehensive PDF annotations and markup tools Implement complete collaboration toolkit with: - add_sticky_notes: Comment annotations with color support - add_highlights: Text highlighting with 8 color options - add_stamps: Approval stamps (APPROVED, DRAFT, CONFIDENTIAL, etc.) - extract_all_annotations: Export to JSON/CSV formats Also includes document assembly features: - merge_pdfs_advanced: Combine PDFs with bookmark preservation - split_pdf_by_pages: Extract specific page ranges - split_pdf_by_bookmarks: Auto-split by chapters/sections - reorder_pdf_pages: Rearrange page sequences All tools tested and working with proper error handling.	2025-09-04 17:18:06 -06:00
Ryan Malloy	95596e0236	✨ Add comprehensive PDF form creation and validation tools - Add complete PDF form lifecycle management - Create new forms with text, checkbox, dropdown, signature fields - Fill existing forms with JSON data and optional flattening - Add fields to existing PDFs with flexible positioning - Advanced field types: radio groups, textareas, date fields - Comprehensive validation engine with regex patterns - Email, phone, number, date format validation - Required field checking and length constraints - Visual validation cues with asterisks and format hints - Multi-field error reporting with detailed feedback - International character support and edge case handling - Enterprise-ready for complex business forms	2025-09-03 02:33:01 -06:00
Ryan Malloy	ae80388ec4	🎯 Add custom output paths and clean summary for image extraction Enhance extract_images with user-specified output directories and concise summary responses to improve user control and reduce context window clutter. Key Features: • Custom Output Directory: Users can specify where images are saved • Clean Summary Output: Concise extraction results instead of verbose metadata • Automatic Directory Creation: Creates output directories as needed • File-Level Details: Individual file info with human-readable sizes • Extraction Summary: Quick overview with total size and file count New Parameters: + output_directory: Optional custom path for saving extracted images + Defaults to cache directory if not specified + Creates directories automatically with proper permissions Response Format: - Removed: Verbose image metadata arrays that fill context windows + Added: Clean summary with extraction statistics + Added: File list with essential details (filename, path, size, dimensions) + Added: Human-readable extraction summary Benefits: ✅ User control over image file locations ✅ Reduced context window pollution ✅ Essential information without verbosity ✅ Better integration with user workflows ✅ Maintains MCP resource compatibility for cached images Example Response: { "success": true, "images_extracted": 3, "total_size": "2.4 MB", "output_directory": "/path/to/custom/dir", "files": [{"filename": "page_1_image_0.png", "path": "/path/...", "size": "800 KB", "dimensions": "1920x1080"}] } 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 13:50:09 -06:00
Ryan Malloy	e087a3b7a0	✨ Add MCP resource URIs for extracted PDF images Implement proper MCP resource protocol for image access, eliminating the need for clients to handle local file paths and enabling seamless image integration. Key Features: • MCP Resource Endpoint: pdf-image://{image_id} for direct image access • extract_images(): Returns resource_uri field with MCP resource links • pdf_to_markdown(): Embeds resource URIs in markdown image references • Automatic MIME type detection (image/png, image/jpeg) • Seamless client integration without file path handling Benefits: ✅ Direct image access via MCP resource protocol ✅ No local file path dependencies for MCP clients ✅ Proper MIME type handling for image display ✅ Clean markdown with working image links ✅ Standards-compliant MCP resource implementation Response Format Enhancement: + "resource_uri": "pdf-image://page_1_image_0" + Works in markdown: ![Image](pdf-image://page_1_image_0) + MIME Type: image/png or image/jpeg + Direct client access without file system dependencies This resolves the limitation where extracted images were only available as local file paths, making them truly accessible to MCP clients through the standardized resource protocol. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 11:42:46 -06:00
Ryan Malloy	374339a15d	🔧 Fix verbose base64 output in image extraction functions Resolve MCP client context overflow by saving images to files instead of returning base64-encoded data that fills client message windows. Key Changes: • extract_images(): Save images to CACHE_DIR with file paths in response • pdf_to_markdown(): Save embedded images to files with path references • Add format_file_size() utility for human-readable file sizes • Update function descriptions to clarify file-based output Benefits: ✅ Prevents context message window overflow in MCP clients ✅ Returns clean, concise metadata with file paths ✅ Maintains full image access through saved files ✅ Improves user experience with readable file sizes ✅ Reduces memory usage and response payload sizes Response Format Changes: - Remove: "data": "<base64_string>" (verbose) + Add: "file_path": "/tmp/mcp-pdf-processing/image.png" + Add: "filename": "page_1_image_0.png" + Add: "size_bytes": 12345 + Add: "size_human": "12.1 KB" This resolves the issue where image extraction caused excessive verbose output that overwhelmed MCP client interfaces. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 11:34:42 -06:00
Ryan Malloy	10ef5028eb	📖 Add Claude Code integration command to documentation Feature prominent Claude Code integration instructions: - Add recommended one-line command for Claude Code users - Update installation section with uvx commands - Include git.supported.systems repository URLs - Highlight seamless AI-powered document processing integration Command for Claude Code users: claude mcp add -s local -- legacy-files uvx --from git+https://git.supported.systems/MCP/mcp-legacy-files.git mcp-legacy-files This enables direct access to all 9 vintage format processors within Claude Code for seamless AI-enhanced document processing workflows. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 23:11:28 -06:00
Ryan Malloy	78a8c40e71	Transform README into comprehensive project showcase Major Enhancement: Combined blog post storytelling with technical documentation to create an engaging, comprehensive project showcase. What's New: 📖 Compelling Narrative: Tells the complete story from 8 tools → 23 tools 🎯 Real-World Examples: Business intelligence, academic research, security workflows 🧠 Technical Deep-Dives: Architecture decisions, intelligent fallbacks, UX design ⚡ Performance Insights: Async architecture, caching strategies, resource management 🔧 Complete Documentation: Installation, usage, troubleshooting, contributing Key Sections Added: - "What We Built" - Project overview and use cases - "Key Innovations" - Document intelligence, layout processing, web integration - "Real-World Usage Examples" - 4 comprehensive workflow examples - "Performance & Architecture" - Technical implementation details - "Architecture Deep-Dive" - Code examples and design decisions - "Why MCP PDF Tools?" - Value proposition and differentiators Impact: - Much more engaging for new users and contributors - Showcases the full scope of capabilities (23 tools!) - Provides clear guidance for different use cases - Demonstrates technical sophistication and quality - Perfect for sharing, contributing, and adoption Now developers can understand not just HOW to use the tools, but WHY this project exists and what makes it special in the PDF processing landscape. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-12 08:40:59 -06:00
Ryan Malloy	f601d44d99	Fix page numbering: Switch to user-friendly 1-based indexing Problem: Zero-based page numbers were confusing for users who naturally think of pages starting from 1. Solution: - Updated `parse_pages_parameter()` to convert 1-based user input to 0-based internal representation - All user-facing documentation now uses 1-based page numbering (page 1 = first page) - Internal processing continues to use 0-based indexing for PyMuPDF compatibility - Output page numbers are consistently displayed as 1-based for users Changes: - Enhanced documentation strings to clarify "1-based" page numbering - Updated README examples with 1-based page numbers and clarifying comments - Fixed split_pdf function to handle 1-based input correctly - Updated test cases to verify 1-based -> 0-based conversion - Added feature highlight: "User-Friendly: All page numbers use 1-based indexing" Impact: Much more intuitive for users - no more confusion about which page is "page 0"! 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-11 04:32:20 -06:00
Ryan Malloy	f0365a0d75	Implement comprehensive PDF processing suite with 15 additional advanced tools Major expansion from 8 to 23 total tools covering: Document Analysis & Intelligence: - analyze_pdf_health: Comprehensive quality and health analysis - analyze_pdf_security: Security features and vulnerability assessment - classify_content: AI-powered document type classification - summarize_content: Intelligent content summarization with key insights - compare_pdfs: Advanced document comparison (text, structure, metadata) Layout & Visual Analysis: - analyze_layout: Page layout analysis with column detection - extract_charts: Chart, diagram, and visual element extraction - detect_watermarks: Watermark detection and analysis Content Manipulation: - extract_form_data: Interactive PDF form data extraction - split_pdf: Split PDFs at specified pages - merge_pdfs: Merge multiple PDFs into one - rotate_pages: Rotate pages by 90°/180°/270° Optimization & Utilities: - convert_to_images: Convert PDF pages to image files - optimize_pdf: File size optimization with quality levels - repair_pdf: Corrupted PDF repair and recovery Technical Enhancements: - All tools support HTTPS URLs with intelligent caching - Fixed MCP parameter validation for pages parameter - Comprehensive error handling and validation - Updated documentation with usage examples 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-11 04:27:04 -06:00
Ryan Malloy	58d43851b9	Add HTTPS URL support and fix MCP parameter validation Features: - HTTPS URL support: Process PDFs directly from URLs with intelligent caching - Smart caching: 1-hour cache to avoid repeated downloads - Content validation: Verify downloads are actually PDF files - Security: Proper User-Agent headers, HTTPS preferred over HTTP - MCP parameter fixes: Handle pages parameter as string "[2,3]" format - Backward compatibility: Still supports local file paths and list parameters Technical changes: - Added download_pdf_from_url() with caching and validation - Updated validate_pdf_path() to handle URLs and local paths - Added parse_pages_parameter() for flexible parameter parsing - Updated all 8 tools to accept string pages parameters - Enhanced error handling for network and validation issues All tools now support: - Local paths: "/path/to/file.pdf" - HTTPS URLs: "https://example.com/document.pdf" - Flexible pages: "[2,3]", "1,2,3", or [1,2,3] 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-11 02:25:53 -06:00
Ryan Malloy	478ab41b1f	Merge remote repository with local MCP PDF Tools implementation Resolved README.md conflict by preserving comprehensive documentation while maintaining repository structure from git.supported.systems 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-10 17:00:49 -06:00
Ryan Malloy	dfc6fe1149	Initial commit	2025-08-10 22:59:46 +00:00
Ryan Malloy	c902e81e4d	Initial commit: Complete MCP PDF Tools server implementation Features: - 8 comprehensive PDF processing tools with intelligent fallbacks - Text extraction (PyMuPDF, pdfplumber, pypdf with auto-selection) - Table extraction (Camelot → pdfplumber → Tabula fallback chain) - OCR processing with Tesseract and preprocessing options - Document analysis (structure, metadata, scanned detection) - Image extraction with filtering capabilities - PDF to markdown conversion with metadata - Built on FastMCP framework with full MCP protocol support - Comprehensive error handling and user-friendly messages - Docker support and cross-platform compatibility - Complete test suite and examples 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-10 16:36:21 -06:00
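
Commits 58d43851b9 and f601d44d99 describe a `parse_pages_parameter()` helper that accepts pages as "[2,3]", "1,2,3", or [1,2,3] and converts 1-based user input to 0-based internal indices. The repository's actual code is not shown in this log, so the following is only a minimal sketch of that behavior. The signature, the `None` handling, and the error message are assumptions made for illustration.

```python
import json
from typing import List, Optional, Union


def parse_pages_parameter(pages: Union[str, list, None]) -> Optional[List[int]]:
    """Illustrative sketch: turn 1-based page input into 0-based indices.

    Accepts a JSON-style string ("[2,3]"), a comma-separated string
    ("1,2,3"), or a list ([1, 2, 3]). Returning None for missing input
    is an assumption, not confirmed by the commit log.
    """
    if pages is None:
        return None

    if isinstance(pages, str):
        text = pages.strip()
        if not text:
            return None
        if text.startswith("["):
            # JSON array form, e.g. "[2,3]"
            values = json.loads(text)
        else:
            # Comma-separated form, e.g. "1,2,3"
            values = [part.strip() for part in text.split(",") if part.strip()]
    else:
        values = list(pages)

    result = []
    for value in values:
        page = int(value)
        if page < 1:
            raise ValueError(f"Page numbers start at 1, got {page}")
        result.append(page - 1)  # 1-based user input -> 0-based index
    return result
```

With this sketch, all three input forms listed in the commit message reduce to the same kind of list, so page 1 always maps to index 0.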

14 Commits
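
Commit 75f8548668 mentions SSRF protection that blocks localhost and internal IPs, including IPv6 handling. The log does not show how this is implemented, so the sketch below is a hypothetical illustration using only the Python standard library. The function name `is_url_allowed` is invented here, and the check covers only literal IP addresses and the `localhost` name; a fuller check would presumably also resolve hostnames before deciding.

```python
import ipaddress
from urllib.parse import urlparse


def is_url_allowed(url: str) -> bool:
    """Hypothetical SSRF pre-check: reject local and internal targets."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False

    host = parsed.hostname  # lowercased, IPv6 brackets stripped
    if not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False

    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP address. This sketch lets named hosts through;
        # it does not perform DNS resolution.
        return True

    # Treat IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) as the IPv4 address.
    if isinstance(address, ipaddress.IPv6Address) and address.ipv4_mapped:
        address = address.ipv4_mapped

    return not (
        address.is_private
        or address.is_loopback
        or address.is_link_local
        or address.is_reserved
        or address.is_multicast
        or address.is_unspecified
    )
```

Checking the parsed hostname rather than the raw URL string avoids being fooled by credentials or ports embedded in the URL.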