Implement comprehensive PDF processing suite with 15 additional advanced tools

Major expansion from 8 to 23 total tools covering:

**Document Analysis & Intelligence:**
- analyze_pdf_health: Comprehensive quality and health analysis
- analyze_pdf_security: Security features and vulnerability assessment
- classify_content: AI-powered document type classification
- summarize_content: Intelligent content summarization with key insights
- compare_pdfs: Advanced document comparison (text, structure, metadata)

**Layout & Visual Analysis:**
- analyze_layout: Page layout analysis with column detection
- extract_charts: Chart, diagram, and visual element extraction
- detect_watermarks: Watermark detection and analysis

**Content Manipulation:**
- extract_form_data: Interactive PDF form data extraction
- split_pdf: Split PDFs at specified pages
- merge_pdfs: Merge multiple PDFs into one
- rotate_pages: Rotate pages by 90°/180°/270°

**Optimization & Utilities:**
- convert_to_images: Convert PDF pages to image files
- optimize_pdf: File size optimization with quality levels
- repair_pdf: Corrupted PDF repair and recovery

**Technical Enhancements:**
- All tools support HTTPS URLs with intelligent caching
- Fixed MCP parameter validation for pages parameter
- Comprehensive error handling and validation
- Updated documentation with usage examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Ryan Malloy 2025-08-11 04:27:04 -06:00
parent 58d43851b9
commit f0365a0d75
2 changed files with 2216 additions and 13 deletions

README.md

@@ -214,18 +214,150 @@ result = await extract_images(
)
```
### Advanced Analysis
```python
# Analyze document health and quality
result = await analyze_pdf_health(
    pdf_path="/path/to/document.pdf"
)

# Classify content type and structure
result = await classify_content(
    pdf_path="/path/to/document.pdf"
)

# Generate content summary
result = await summarize_content(
    pdf_path="/path/to/document.pdf",
    summary_length="medium",  # "short", "medium", "long"
    pages="1,2,3"  # Specific pages
)

# Analyze page layout
result = await analyze_layout(
    pdf_path="/path/to/document.pdf",
    pages="1,2,3",
    include_coordinates=True
)
```
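
The `pages` argument in the examples above is a comma-separated string. As a rough sketch of how such a string could be turned into page numbers, consider the helper below. `parse_pages` is a hypothetical name used for illustration; it is not one of the tools listed here, and the validation rules (ASCII digits only, stray commas tolerated) are assumptions.

```python
def parse_pages(pages: str) -> list[int]:
    """Parse a comma-separated page string such as "1,2,3" into integers.

    Hypothetical helper: the real tools may accept other formats.
    """
    numbers = []
    for part in pages.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas and surrounding whitespace
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"invalid page number: {part!r}")
        numbers.append(int(part))
    return numbers


print(parse_pages("1,2,3"))
```

Rejecting anything that is not a plain run of digits keeps malformed input such as `"1;2"` from being silently misread.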
### Content Manipulation
```python
# Extract form data
result = await extract_form_data(
    pdf_path="/path/to/form.pdf"
)

# Split PDF into separate files
result = await split_pdf(
    pdf_path="/path/to/document.pdf",
    split_pages="5,10,15",  # Split after pages 5, 10, 15
    output_prefix="section"
)

# Merge multiple PDFs
result = await merge_pdfs(
    pdf_paths=["/path/to/doc1.pdf", "/path/to/doc2.pdf"],
    output_filename="merged_document.pdf"
)

# Rotate specific pages
result = await rotate_pages(
    pdf_path="/path/to/document.pdf",
    page_rotations={"1": 90, "3": 180}  # Page 1: 90°, Page 3: 180°
)
```
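
To make the "split after pages 5, 10, 15" comment concrete, the sketch below computes which page ranges each output file would cover. It assumes 1-based, inclusive page numbers and a known total page count; `split_ranges` is a hypothetical helper for illustration, not part of `split_pdf` itself.

```python
def split_ranges(split_pages: str, total_pages: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) page ranges produced by splitting
    after each page listed in `split_pages`.

    Assumes 1-based page numbers; this is an illustrative sketch only.
    """
    # Deduplicate and sort the cut points so input order does not matter.
    cut_points = sorted({int(p) for p in split_pages.split(",") if p.strip()})
    ranges = []
    start = 1
    for cut in cut_points:
        # A cut at or beyond the last page would leave an empty section.
        if not 1 <= cut < total_pages:
            raise ValueError(f"split point {cut} is outside 1..{total_pages - 1}")
        ranges.append((start, cut))
        start = cut + 1
    ranges.append((start, total_pages))  # remainder after the last cut
    return ranges


print(split_ranges("5,10,15", total_pages=20))
```

Under these assumptions, three split points in a 20-page document yield four sections.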
### Optimization and Repair
```python
# Optimize PDF file size
result = await optimize_pdf(
    pdf_path="/path/to/large.pdf",
    optimization_level="balanced",  # "light", "balanced", "aggressive"
    preserve_quality=True
)

# Repair corrupted PDF
result = await repair_pdf(
    pdf_path="/path/to/corrupted.pdf"
)

# Compare two PDFs
result = await compare_pdfs(
    pdf_path1="/path/to/original.pdf",
    pdf_path2="/path/to/modified.pdf",
    comparison_type="all"  # "text", "structure", "metadata", "all"
)
```
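
As an illustration only, the text portion of a comparison could be approximated with the standard library's `difflib` once the text of both documents has been extracted. `compare_text` is a hypothetical helper and the shape of its result is an assumption; it does not reflect how `compare_pdfs` is actually implemented.

```python
import difflib


def compare_text(original: str, modified: str) -> dict:
    """Summarize line-level differences between two extracted texts.

    Hypothetical sketch; not the implementation behind compare_pdfs.
    """
    original_lines = original.splitlines()
    modified_lines = modified.splitlines()
    # Similarity ratio in [0, 1] computed over whole lines.
    matcher = difflib.SequenceMatcher(None, original_lines, modified_lines)
    diff = list(
        difflib.unified_diff(
            original_lines,
            modified_lines,
            fromfile="original",
            tofile="modified",
            lineterm="",
        )
    )
    return {
        "identical": original_lines == modified_lines,
        "similarity": matcher.ratio(),
        "diff": diff,
    }


report = compare_text("Title\nFirst draft", "Title\nFinal version")
print(report["similarity"])
print("\n".join(report["diff"]))
```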
### Visual Analysis
```python
# Extract charts and diagrams
result = await extract_charts(
    pdf_path="/path/to/report.pdf",
    pages="2,3,4",
    min_size=150  # Minimum size for chart detection
)

# Detect watermarks
result = await detect_watermarks(
    pdf_path="/path/to/document.pdf"
)

# Security analysis
result = await analyze_pdf_security(
    pdf_path="/path/to/document.pdf"
)
```
## Available Tools
### Core Processing Tools
| Tool | Description |
|------|-------------|
| `extract_text` | Extract text with multiple methods and layout preservation |
| `extract_tables` | Extract tables in various formats (JSON, CSV, Markdown) |
| `ocr_pdf` | Perform OCR on scanned PDFs with preprocessing |
| `extract_images` | Extract images with filtering options |
| `pdf_to_markdown` | Convert PDF to clean Markdown format |
### Document Analysis Tools
| Tool | Description |
|------|-------------|
| `is_scanned_pdf` | Check if a PDF is scanned or text-based |
| `get_document_structure` | Extract document structure, outline, and basic metadata |
| `extract_metadata` | Extract comprehensive metadata and file statistics |
| `analyze_pdf_health` | Comprehensive PDF health and quality analysis |
| `analyze_pdf_security` | Analyze PDF security features and potential issues |
| `classify_content` | Classify and analyze PDF content type and structure |
| `summarize_content` | Generate summary and key insights from PDF content |
### Layout and Visual Analysis Tools
| Tool | Description |
|------|-------------|
| `analyze_layout` | Analyze PDF page layout including text blocks, columns, and spacing |
| `extract_charts` | Extract and analyze charts, diagrams, and visual elements |
| `detect_watermarks` | Detect and analyze watermarks in PDF |
### Content Manipulation Tools
| Tool | Description |
|------|-------------|
| `extract_form_data` | Extract form fields and their values from PDF forms |
| `split_pdf` | Split PDF into multiple files at specified pages |
| `merge_pdfs` | Merge multiple PDFs into a single file |
| `rotate_pages` | Rotate specific pages by 90, 180, or 270 degrees |
### Utility and Optimization Tools
| Tool | Description |
|------|-------------|
| `compare_pdfs` | Compare two PDFs for differences in text, structure, and metadata |
| `convert_to_images` | Convert PDF pages to image files |
| `optimize_pdf` | Optimize PDF file size and performance |
| `repair_pdf` | Attempt to repair corrupted or damaged PDF files |
## Development

File diff suppressed because it is too large