Compare commits

...

3 Commits

f601d44d99 Fix page numbering: Switch to user-friendly 1-based indexing
**Problem**: Zero-based page numbers were confusing for users who naturally
think of pages starting from 1.

**Solution**:
- Updated `parse_pages_parameter()` to convert 1-based user input to 0-based internal representation (see the sketch below)
- All user-facing documentation now uses 1-based page numbering (page 1 = first page)
- Internal processing continues to use 0-based indexing for PyMuPDF compatibility
- Output page numbers are consistently displayed as 1-based for users

**Changes**:
- Enhanced documentation strings to clarify "1-based" page numbering
- Updated README examples with 1-based page numbers and clarifying comments
- Fixed split_pdf function to handle 1-based input correctly
- Updated test cases to verify 1-based -> 0-based conversion
- Added feature highlight: "User-Friendly: All page numbers use 1-based indexing"

**Impact**: Much more intuitive for users; no more confusion about which page is "page 0"!
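
For illustration, a minimal sketch of the conversion this commit describes. The helper name `parse_pages_parameter()` comes from the commit itself; the body below is an assumption, not the repository's exact code:

```python
from typing import List, Optional, Union

def parse_pages_parameter(pages: Union[None, str, List[int]]) -> Optional[List[int]]:
    """Convert 1-based user page numbers to 0-based internal indexes."""
    if pages is None:
        return None  # no page filter requested
    if isinstance(pages, str):
        # Accept "[2,3]", "2,3", or "5": strip brackets, then split on commas
        pages = [int(p) for p in pages.strip().strip("[]").split(",") if p.strip()]
    return [p - 1 for p in pages]  # page 1 (user) -> index 0 (PyMuPDF)
```

Under these assumptions, `parse_pages_parameter("[2,3]")` returns `[1, 2]`, matching the expectations in `test_pages_parameter.py` below.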

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-11 04:32:20 -06:00
f0365a0d75 Implement comprehensive PDF processing suite with 15 additional advanced tools
Major expansion from 8 to 23 total tools covering:

**Document Analysis & Intelligence:**
- analyze_pdf_health: Comprehensive quality and health analysis
- analyze_pdf_security: Security features and vulnerability assessment
- classify_content: AI-powered document type classification
- summarize_content: Intelligent content summarization with key insights
- compare_pdfs: Advanced document comparison (text, structure, metadata)

**Layout & Visual Analysis:**
- analyze_layout: Page layout analysis with column detection
- extract_charts: Chart, diagram, and visual element extraction
- detect_watermarks: Watermark detection and analysis

**Content Manipulation:**
- extract_form_data: Interactive PDF form data extraction
- split_pdf: Split PDFs at specified pages
- merge_pdfs: Merge multiple PDFs into one
- rotate_pages: Rotate pages by 90°/180°/270°

**Optimization & Utilities:**
- convert_to_images: Convert PDF pages to image files
- optimize_pdf: File size optimization with quality levels
- repair_pdf: Corrupted PDF repair and recovery

**Technical Enhancements:**
- All tools support HTTPS URLs with intelligent caching
- Fixed MCP parameter validation for pages parameter
- Comprehensive error handling and validation
- Updated documentation with usage examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-11 04:27:04 -06:00
58d43851b9 Add HTTPS URL support and fix MCP parameter validation
Features:
- HTTPS URL support: Process PDFs directly from URLs with intelligent caching
- Smart caching: 1-hour cache to avoid repeated downloads
- Content validation: Verify downloads are actually PDF files
- Security: Proper User-Agent headers, HTTPS preferred over HTTP
- MCP parameter fixes: Handle pages parameter as string "[2,3]" format
- Backward compatibility: Still supports local file paths and list parameters

Technical changes:
- Added download_pdf_from_url() with caching and validation (see the sketch below)
- Updated validate_pdf_path() to handle URLs and local paths
- Added parse_pages_parameter() for flexible parameter parsing
- Updated all 8 tools to accept string pages parameters
- Enhanced error handling for network and validation issues

All tools now support:
- Local paths: "/path/to/file.pdf"
- HTTPS URLs: "https://example.com/document.pdf"
- Flexible pages: "[2,3]", "1,2,3", or [1,2,3]
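
A minimal sketch of the download-and-cache flow described above. The function name `download_pdf_from_url()` comes from the commit; the `httpx` client, cache directory, and TTL handling below are assumptions rather than the repository's exact implementation:

```python
import hashlib
import time
from pathlib import Path

import httpx  # assumption: any async HTTP client would do

CACHE_DIR = Path("/tmp/mcp-pdf-processing/url-cache")  # assumed location
CACHE_TTL_SECONDS = 3600  # 1-hour cache, per the feature list above

async def download_pdf_from_url(url: str) -> Path:
    """Download a PDF to a local cache, reusing copies newer than the TTL."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".pdf")
    if cached.exists() and time.time() - cached.stat().st_mtime < CACHE_TTL_SECONDS:
        return cached  # cache hit: skip the network entirely
    async with httpx.AsyncClient(follow_redirects=True) as client:
        response = await client.get(url, headers={"User-Agent": "mcp-pdf-tools/1.0"})
        response.raise_for_status()
    if not response.content.startswith(b"%PDF"):
        raise ValueError(f"Content at {url} is not a PDF")  # magic-byte check
    cached.write_bytes(response.content)
    return cached
```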

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-11 02:25:53 -06:00
9 changed files with 2715 additions and 43 deletions

.mcp.json (new file, 11 lines)

@@ -0,0 +1,11 @@
{
  "mcpServers": {
    "pdf-tools": {
      "command": "uv",
      "args": ["run", "mcp-pdf-tools"],
      "env": {
        "PDF_TEMP_DIR": "/tmp/mcp-pdf-processing"
      }
    }
  }
}

CLAUDE_DESKTOP_SETUP.md (new file, 88 lines)

@@ -0,0 +1,88 @@
# Claude Desktop MCP Configuration
This document explains how the MCP PDF Tools server has been configured for Claude Desktop.
## Configuration Location
The MCP configuration has been added to:
```
/home/rpm/.config/Claude/claude_desktop_config.json
```
## PDF Tools Server Configuration
The following configuration has been added to your Claude Desktop:
```json
{
  "mcpServers": {
    "pdf-tools": {
      "command": "uv",
      "args": [
        "--directory",
        "/home/rpm/claude/mcp-pdf-tools",
        "run",
        "mcp-pdf-tools"
      ],
      "env": {
        "PDF_TEMP_DIR": "/tmp/mcp-pdf-processing"
      }
    }
  }
}
```
## What This Enables
With this configuration, all your Claude sessions will have access to:
- **extract_text**: Extract text from PDFs with multiple method support
- **extract_tables**: Extract tables from PDFs with intelligent fallbacks
- **extract_images**: Extract and filter images from PDFs
- **extract_metadata**: Get comprehensive PDF metadata and file information
- **get_document_structure**: Analyze PDF structure, outline, and fonts
- **is_scanned_pdf**: Detect if PDFs are scanned/image-based
- **ocr_pdf**: Perform OCR on scanned PDFs with preprocessing
- **pdf_to_markdown**: Convert PDFs to clean markdown format
## Environment Variables
- `PDF_TEMP_DIR`: Set to `/tmp/mcp-pdf-processing` for temporary file processing
## Backup
A backup of your original configuration has been saved to:
```
/home/rpm/.config/Claude/claude_desktop_config.json.backup
```
## Testing
The server has been tested and is working correctly. You can verify it's available in new Claude sessions by checking for the `mcp__pdf-tools__*` functions.
## Troubleshooting
If you encounter issues:
1. **Server not starting**: Check that all dependencies are installed:
```bash
cd /home/rpm/claude/mcp-pdf-tools
uv sync --dev
```
2. **System dependencies missing**: Install required packages:
```bash
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript python3-tk default-jre-headless
```
3. **Permission issues**: Ensure temp directory exists:
```bash
mkdir -p /tmp/mcp-pdf-processing
chmod 755 /tmp/mcp-pdf-processing
```
4. **Test server manually**:
```bash
cd /home/rpm/claude/mcp-pdf-tools
uv run mcp-pdf-tools --help
```

README.md (164 lines changed)

@@ -10,7 +10,31 @@ A comprehensive FastMCP server for PDF processing operations. This server provid
- **Document Analysis**: Extract structure, metadata, and check if PDFs are scanned
- **Image Extraction**: Extract images with size filtering
- **Format Conversion**: Convert PDFs to clean Markdown format
- **URL Support**: Process PDFs directly from HTTPS URLs with intelligent caching
- **Smart Detection**: Automatically detect the best method for each operation
- **User-Friendly**: All page numbers use 1-based indexing (page 1 = first page)
## URL Support
All tools support processing PDFs directly from HTTPS URLs:
```bash
# Extract text from URL
mcp_pdf_tools extract_text "https://example.com/document.pdf"
# Extract tables from URL
mcp_pdf_tools extract_tables "https://example.com/report.pdf"
# Convert URL PDF to markdown
mcp_pdf_tools pdf_to_markdown "https://example.com/paper.pdf"
```
**Features:**
- **Intelligent caching**: Downloaded PDFs are cached for 1 hour to avoid repeated downloads
- **Content validation**: Verifies content is actually a PDF file (checks magic bytes and content-type)
- **Security**: HTTPS URLs recommended (HTTP URLs show security warnings)
- **Proper headers**: Sends appropriate User-Agent for better server compatibility
- **Error handling**: Clear error messages for network issues or invalid content
## Installation
@@ -110,7 +134,7 @@ result = await extract_text(
# Extract specific pages with layout preservation
result = await extract_text(
    pdf_path="/path/to/document.pdf",
    pages=[1, 2, 3],  # First 3 pages (1-based numbering)
    preserve_layout=True,
    method="pdfplumber"  # Or "auto", "pymupdf", "pypdf"
)

@@ -127,7 +151,7 @@ result = await extract_tables(
# Extract tables from specific pages in markdown format
result = await extract_tables(
    pdf_path="/path/to/document.pdf",
    pages=[2, 3],  # Pages 2 and 3 (1-based numbering)
    output_format="markdown"  # Or "json", "csv"
)
```

@@ -191,18 +215,150 @@ result = await extract_images(
)
```
### Advanced Analysis

```python
# Analyze document health and quality
result = await analyze_pdf_health(
    pdf_path="/path/to/document.pdf"
)

# Classify content type and structure
result = await classify_content(
    pdf_path="/path/to/document.pdf"
)

# Generate content summary
result = await summarize_content(
    pdf_path="/path/to/document.pdf",
    summary_length="medium",  # "short", "medium", "long"
    pages="1,2,3"  # Specific pages (1-based numbering)
)

# Analyze page layout
result = await analyze_layout(
    pdf_path="/path/to/document.pdf",
    pages="1,2,3",  # Specific pages (1-based numbering)
    include_coordinates=True
)
```

### Content Manipulation

```python
# Extract form data
result = await extract_form_data(
    pdf_path="/path/to/form.pdf"
)

# Split PDF into separate files
result = await split_pdf(
    pdf_path="/path/to/document.pdf",
    split_pages="5,10,15",  # Split after pages 5, 10, 15 (1-based)
    output_prefix="section"
)

# Merge multiple PDFs
result = await merge_pdfs(
    pdf_paths=["/path/to/doc1.pdf", "/path/to/doc2.pdf"],
    output_filename="merged_document.pdf"
)

# Rotate specific pages
result = await rotate_pages(
    pdf_path="/path/to/document.pdf",
    page_rotations={"1": 90, "3": 180}  # Page 1: 90°, Page 3: 180° (1-based)
)
```

### Optimization and Repair

```python
# Optimize PDF file size
result = await optimize_pdf(
    pdf_path="/path/to/large.pdf",
    optimization_level="balanced",  # "light", "balanced", "aggressive"
    preserve_quality=True
)

# Repair corrupted PDF
result = await repair_pdf(
    pdf_path="/path/to/corrupted.pdf"
)

# Compare two PDFs
result = await compare_pdfs(
    pdf_path1="/path/to/original.pdf",
    pdf_path2="/path/to/modified.pdf",
    comparison_type="all"  # "text", "structure", "metadata", "all"
)
```

### Visual Analysis

```python
# Extract charts and diagrams
result = await extract_charts(
    pdf_path="/path/to/report.pdf",
    pages="2,3,4",  # Pages 2, 3, 4 (1-based numbering)
    min_size=150  # Minimum size for chart detection
)

# Detect watermarks
result = await detect_watermarks(
    pdf_path="/path/to/document.pdf"
)

# Security analysis
result = await analyze_pdf_security(
    pdf_path="/path/to/document.pdf"
)
```
## Available Tools

### Core Processing Tools

| Tool | Description |
|------|-------------|
| `extract_text` | Extract text with multiple methods and layout preservation |
| `extract_tables` | Extract tables in various formats (JSON, CSV, Markdown) |
| `ocr_pdf` | Perform OCR on scanned PDFs with preprocessing |
| `extract_images` | Extract images with filtering options |
| `pdf_to_markdown` | Convert PDF to clean Markdown format |

### Document Analysis Tools

| Tool | Description |
|------|-------------|
| `is_scanned_pdf` | Check if a PDF is scanned or text-based |
| `get_document_structure` | Extract document structure, outline, and basic metadata |
| `extract_metadata` | Extract comprehensive metadata and file statistics |
| `analyze_pdf_health` | Comprehensive PDF health and quality analysis |
| `analyze_pdf_security` | Analyze PDF security features and potential issues |
| `classify_content` | Classify and analyze PDF content type and structure |
| `summarize_content` | Generate summary and key insights from PDF content |

### Layout and Visual Analysis Tools

| Tool | Description |
|------|-------------|
| `analyze_layout` | Analyze PDF page layout including text blocks, columns, and spacing |
| `extract_charts` | Extract and analyze charts, diagrams, and visual elements |
| `detect_watermarks` | Detect and analyze watermarks in PDF |

### Content Manipulation Tools

| Tool | Description |
|------|-------------|
| `extract_form_data` | Extract form fields and their values from PDF forms |
| `split_pdf` | Split PDF into multiple files at specified pages |
| `merge_pdfs` | Merge multiple PDFs into a single file |
| `rotate_pages` | Rotate specific pages by 90, 180, or 270 degrees |

### Utility and Optimization Tools

| Tool | Description |
|------|-------------|
| `compare_pdfs` | Compare two PDFs for differences in text, structure, and metadata |
| `convert_to_images` | Convert PDF pages to image files |
| `optimize_pdf` | Optimize PDF file size and performance |
| `repair_pdf` | Attempt to repair corrupted or damaged PDF files |

## Development


@@ -0,0 +1,16 @@
{
  "mcpServers": {
    "pdf-tools": {
      "command": "uv",
      "args": [
        "--directory",
        "/home/rpm/claude/mcp-pdf-tools",
        "run",
        "mcp-pdf-tools"
      ],
      "env": {
        "PDF_TEMP_DIR": "/tmp/mcp-pdf-processing"
      }
    }
  }
}

examples/url_examples.py (new file, 104 lines)

@@ -0,0 +1,104 @@
#!/usr/bin/env python3
"""
Examples of using MCP PDF Tools with URLs
"""
import asyncio
import sys
import os

# Add src to path for development
sys.path.insert(0, '../src')

from mcp_pdf_tools.server import (
    extract_text, extract_metadata, pdf_to_markdown,
    extract_tables, is_scanned_pdf
)


async def example_text_extraction():
    """Example: Extract text from a PDF URL"""
    print("🔗 Extracting text from URL...")
    # Using a sample PDF from the web
    url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    try:
        result = await extract_text(url)
        print(f"✅ Text extraction successful!")
        print(f"   Method used: {result['method_used']}")
        print(f"   Pages: {result['metadata']['pages']}")
        print(f"   Extracted text length: {len(result['text'])} characters")
        print(f"   First 100 characters: {result['text'][:100]}...")
    except Exception as e:
        print(f"❌ Failed: {e}")


async def example_metadata_extraction():
    """Example: Extract metadata from a PDF URL"""
    print("\n📋 Extracting metadata from URL...")
    url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    try:
        result = await extract_metadata(url)
        print(f"✅ Metadata extraction successful!")
        print(f"   File size: {result['file_info']['size_mb']:.2f} MB")
        print(f"   Pages: {result['statistics']['page_count']}")
        print(f"   Title: {result['metadata'].get('title', 'No title')}")
        print(f"   Creation date: {result['metadata'].get('creation_date', 'Unknown')}")
    except Exception as e:
        print(f"❌ Failed: {e}")


async def example_scanned_detection():
    """Example: Check if PDF is scanned"""
    print("\n🔍 Checking if PDF is scanned...")
    url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    try:
        result = await is_scanned_pdf(url)
        print(f"✅ Scanned detection successful!")
        print(f"   Is scanned: {result['is_scanned']}")
        print(f"   Recommendation: {result['recommendation']}")
        print(f"   Pages checked: {result['sample_pages_checked']}")
    except Exception as e:
        print(f"❌ Failed: {e}")


async def example_markdown_conversion():
    """Example: Convert PDF URL to markdown"""
    print("\n📝 Converting PDF to markdown...")
    url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    try:
        result = await pdf_to_markdown(url)
        print(f"✅ Markdown conversion successful!")
        print(f"   Pages converted: {result['pages_converted']}")
        print(f"   Markdown length: {len(result['markdown'])} characters")
        print(f"   First 200 characters:")
        print(f"   {result['markdown'][:200]}...")
    except Exception as e:
        print(f"❌ Failed: {e}")


async def main():
    """Run all URL examples"""
    print("🌐 MCP PDF Tools - URL Examples")
    print("=" * 50)
    await example_text_extraction()
    await example_metadata_extraction()
    await example_scanned_detection()
    await example_markdown_conversion()
    print("\n✨ URL examples completed!")
    print("\n💡 Tips:")
    print("   • URLs are cached for 1 hour to avoid repeated downloads")
    print("   • Use HTTPS URLs for security")
    print("   • The server validates content is actually a PDF file")
    print("   • All tools support the same URL format")


if __name__ == "__main__":
    asyncio.run(main())

mcp-pdf-tools-launcher.sh (new executable file, 3 lines)

@@ -0,0 +1,3 @@
#!/bin/bash
cd /home/rpm/claude/mcp-pdf-tools
exec uv run mcp-pdf-tools "$@"

File diff suppressed because it is too large

test_pages_parameter.py (new file, 52 lines)

@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""
Test the updated pages parameter parsing
"""
import asyncio
import sys
import os

# Add src to path
sys.path.insert(0, 'src')

from mcp_pdf_tools.server import parse_pages_parameter


def test_page_parsing():
    """Test page parameter parsing (1-based user input -> 0-based internal)"""
    print("Testing page parameter parsing (1-based user input -> 0-based internal)...")
    # Test different input formats - all converted from 1-based user input to 0-based internal
    test_cases = [
        (None, None),
        ("1,2,3", [0, 1, 2]),    # 1-based input -> 0-based internal
        ("[2, 3]", [1, 2]),      # This is the problematic case from the user
        ("5", [4]),              # Page 5 becomes index 4
        ([1, 2, 3], [0, 1, 2]),  # List input also converted
        ("2,3,4", [1, 2, 3]),    # Pages 2,3,4 -> indexes 1,2,3
        ("[1,2,3]", [0, 1, 2])   # Another format
    ]
    all_passed = True
    for input_val, expected in test_cases:
        try:
            result = parse_pages_parameter(input_val)
            if result == expected:
                print(f"'{input_val}' -> {result}")
            else:
                print(f"'{input_val}' -> {result}, expected {expected}")
                all_passed = False
        except Exception as e:
            print(f"'{input_val}' -> Error: {e}")
            all_passed = False
    return all_passed


if __name__ == "__main__":
    success = test_page_parsing()
    if success:
        print("\n🎉 All page parameter parsing tests passed!")
    else:
        print("\n🚨 Some tests failed!")
    sys.exit(0 if success else 1)

test_url_support.py (new file, 71 lines)

@@ -0,0 +1,71 @@
#!/usr/bin/env python3
"""
Test URL support for MCP PDF Tools
"""
import asyncio
import sys
import os

# Add src to path
sys.path.insert(0, 'src')

from mcp_pdf_tools.server import validate_pdf_path, download_pdf_from_url


async def test_url_validation():
    """Test URL validation and download"""
    print("Testing URL validation and download...")
    # Test with a known PDF URL (using a publicly available sample)
    test_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    try:
        print(f"Testing URL: {test_url}")
        path = await validate_pdf_path(test_url)
        print(f"✅ Successfully downloaded and validated PDF: {path}")
        print(f"   File size: {path.stat().st_size} bytes")
        return True
    except Exception as e:
        print(f"❌ URL test failed: {e}")
        return False


async def test_local_path():
    """Test that local paths still work"""
    print("\nTesting local path validation...")
    # Test with our existing test PDF
    test_path = "/tmp/test_text.pdf"
    if not os.path.exists(test_path):
        print(f"⚠️ Test file {test_path} not found, skipping local test")
        return True
    try:
        path = await validate_pdf_path(test_path)
        print(f"✅ Local path validation works: {path}")
        return True
    except Exception as e:
        print(f"❌ Local path test failed: {e}")
        return False


async def main():
    print("🧪 Testing MCP PDF Tools URL Support\n")
    url_success = await test_url_validation()
    local_success = await test_local_path()
    print(f"\n📊 Test Results:")
    print(f"   URL support: {'✅ PASS' if url_success else '❌ FAIL'}")
    print(f"   Local paths: {'✅ PASS' if local_success else '❌ FAIL'}")
    if url_success and local_success:
        print("\n🎉 All tests passed! URL support is working.")
        return 0
    else:
        print("\n🚨 Some tests failed.")
        return 1


if __name__ == "__main__":
    sys.exit(asyncio.run(main()))