Implement comprehensive PDF processing suite with 15 additional advanced tools

Major expansion from 8 to 23 total tools covering:

**Document Analysis & Intelligence:**
- analyze_pdf_health: Comprehensive quality and health analysis
- analyze_pdf_security: Security features and vulnerability assessment
- classify_content: AI-powered document type classification
- summarize_content: Intelligent content summarization with key insights
- compare_pdfs: Advanced document comparison (text, structure, metadata)

**Layout & Visual Analysis:**
- analyze_layout: Page layout analysis with column detection
- extract_charts: Chart, diagram, and visual element extraction
- detect_watermarks: Watermark detection and analysis

**Content Manipulation:**
- extract_form_data: Interactive PDF form data extraction
- split_pdf: Split PDFs at specified pages
- merge_pdfs: Merge multiple PDFs into one
- rotate_pages: Rotate pages by 90°/180°/270°

**Optimization & Utilities:**
- convert_to_images: Convert PDF pages to image files
- optimize_pdf: File size optimization with quality levels
- repair_pdf: Corrupted PDF repair and recovery

**Technical Enhancements:**
- All tools support HTTPS URLs with intelligent caching
- Fixed MCP parameter validation for pages parameter
- Comprehensive error handling and validation
- Updated documentation with usage examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Parent: 58d43851b9
Commit: f0365a0d75

README.md (136 lines changed)

@@ -214,18 +214,150 @@ result = await extract_images(
```python
result = await extract_images(
    # ... arguments lie outside this hunk ...
)
```

### Advanced Analysis

```python
# Analyze document health and quality
result = await analyze_pdf_health(
    pdf_path="/path/to/document.pdf"
)

# Classify content type and structure
result = await classify_content(
    pdf_path="/path/to/document.pdf"
)

# Generate content summary
result = await summarize_content(
    pdf_path="/path/to/document.pdf",
    summary_length="medium",  # "short", "medium", "long"
    pages="1,2,3"  # Specific pages
)

# Analyze page layout
result = await analyze_layout(
    pdf_path="/path/to/document.pdf",
    pages="1,2,3",
    include_coordinates=True
)
```
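
The `pages` argument above is passed as a comma-separated string. As a rough illustration of what such a string represents, the sketch below turns it into a list of integers. `parse_pages` is a hypothetical helper written for this example; the project's own parsing and validation are not shown in this diff and may differ (for instance in whether page numbers start at 0 or 1).

```python
def parse_pages(pages: str) -> list[int]:
    """Turn a string such as "1,2,3" into a list of page numbers.

    Hypothetical helper for illustration only; not the project's parser.
    """
    numbers = []
    for part in pages.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        try:
            number = int(part)
        except ValueError:
            raise ValueError(f"invalid page number: {part!r}") from None
        if number < 0:
            raise ValueError(f"page number must not be negative: {number}")
        numbers.append(number)
    return numbers


print(parse_pages("1,2,3"))  # [1, 2, 3]
```

Rejecting malformed input early, as this sketch does, keeps a bad `pages` value from surfacing later as a confusing page-lookup error.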

### Content Manipulation

```python
# Extract form data
result = await extract_form_data(
    pdf_path="/path/to/form.pdf"
)

# Split PDF into separate files
result = await split_pdf(
    pdf_path="/path/to/document.pdf",
    split_pages="5,10,15",  # Split after pages 5, 10, 15
    output_prefix="section"
)

# Merge multiple PDFs
result = await merge_pdfs(
    pdf_paths=["/path/to/doc1.pdf", "/path/to/doc2.pdf"],
    output_filename="merged_document.pdf"
)

# Rotate specific pages
result = await rotate_pages(
    pdf_path="/path/to/document.pdf",
    page_rotations={"1": 90, "3": 180}  # Page 1: 90°, Page 3: 180°
)
```
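
To make the "split after pages 5, 10, 15" comment concrete, the sketch below computes which page range each output file would cover. `split_ranges` is a hypothetical helper, and it assumes 1-based page numbers with each split page ending its section; the actual behavior of `split_pdf` may differ.

```python
def split_ranges(split_pages: str, total_pages: int) -> list[tuple[int, int]]:
    """Return inclusive (first, last) page ranges, one per output file.

    Hypothetical helper; assumes 1-based pages and "split after page N".
    """
    points = sorted({int(p) for p in split_pages.split(",") if p.strip()})
    ranges = []
    start = 1
    for point in points:
        if not 1 <= point < total_pages:
            raise ValueError(
                f"split point {point} is outside 1..{total_pages - 1}"
            )
        ranges.append((start, point))
        start = point + 1
    ranges.append((start, total_pages))  # remainder after the last split
    return ranges


print(split_ranges("5,10,15", total_pages=20))
# [(1, 5), (6, 10), (11, 15), (16, 20)]
```

Under these assumptions, three split points on a 20-page document yield four files.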

### Optimization and Repair

```python
# Optimize PDF file size
result = await optimize_pdf(
    pdf_path="/path/to/large.pdf",
    optimization_level="balanced",  # "light", "balanced", "aggressive"
    preserve_quality=True
)

# Repair corrupted PDF
result = await repair_pdf(
    pdf_path="/path/to/corrupted.pdf"
)

# Compare two PDFs
result = await compare_pdfs(
    pdf_path1="/path/to/original.pdf",
    pdf_path2="/path/to/modified.pdf",
    comparison_type="all"  # "text", "structure", "metadata", "all"
)
```
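
For intuition about what a `"text"` comparison could report, here is a minimal sketch that compares two already-extracted text strings with Python's standard `difflib` module. `compare_text` and its result keys are assumptions made for this example; this is not the implementation behind `compare_pdfs`, whose output format is not shown in this diff.

```python
import difflib


def compare_text(original: str, modified: str) -> dict:
    """Compare two extracted text strings.

    Hypothetical helper; the result keys are invented for illustration.
    """
    matcher = difflib.SequenceMatcher(None, original, modified)
    diff = list(
        difflib.unified_diff(
            original.splitlines(),
            modified.splitlines(),
            fromfile="original",
            tofile="modified",
            lineterm="",
        )
    )
    return {
        "identical": original == modified,
        "similarity": round(matcher.ratio(), 3),  # 1.0 means identical
        "diff": diff,  # empty list when there are no differences
    }


report = compare_text("Total: 10\nStatus: draft", "Total: 12\nStatus: draft")
print(report["identical"])
print("\n".join(report["diff"]))
```

A similarity ratio gives a quick overall signal, while the unified diff shows exactly which lines changed.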

### Visual Analysis

```python
# Extract charts and diagrams
result = await extract_charts(
    pdf_path="/path/to/report.pdf",
    pages="2,3,4",
    min_size=150  # Minimum size for chart detection
)

# Detect watermarks
result = await detect_watermarks(
    pdf_path="/path/to/document.pdf"
)

# Security analysis
result = await analyze_pdf_security(
    pdf_path="/path/to/document.pdf"
)
```
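
The usage snippets use `await`, so they need to run inside a coroutine. The sketch below shows one way to drive two calls from a plain script with `asyncio`. The two tool functions here are stand-ins defined only so the example runs on its own: their return values are placeholders, and how the real tools are imported or invoked is not covered by this diff. Running them concurrently with `asyncio.gather` also assumes the calls are independent of each other.

```python
import asyncio


async def detect_watermarks(pdf_path: str) -> dict:
    """Stand-in for the real tool; returns a placeholder result."""
    await asyncio.sleep(0)  # simulate asynchronous work
    return {"tool": "detect_watermarks", "pdf_path": pdf_path}


async def analyze_pdf_security(pdf_path: str) -> dict:
    """Stand-in for the real tool; returns a placeholder result."""
    await asyncio.sleep(0)
    return {"tool": "analyze_pdf_security", "pdf_path": pdf_path}


async def main() -> list:
    path = "/path/to/document.pdf"
    # Both analyses read the same file and do not depend on each other,
    # so they are awaited together here.
    return await asyncio.gather(
        detect_watermarks(pdf_path=path),
        analyze_pdf_security(pdf_path=path),
    )


if __name__ == "__main__":
    for result in asyncio.run(main()):
        print(result)
```

`asyncio.gather` returns results in the order the coroutines were passed, so the first entry corresponds to `detect_watermarks`.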

## Available Tools

### Core Processing Tools

| Tool | Description |
|------|-------------|
| `extract_text` | Extract text with multiple methods and layout preservation |
| `extract_tables` | Extract tables in various formats (JSON, CSV, Markdown) |
| `ocr_pdf` | Perform OCR on scanned PDFs with preprocessing |
| `extract_images` | Extract images with filtering options |
| `pdf_to_markdown` | Convert PDF to clean Markdown format |

### Document Analysis Tools

| Tool | Description |
|------|-------------|
| `is_scanned_pdf` | Check if a PDF is scanned or text-based |
| `get_document_structure` | Extract document structure, outline, and basic metadata |
| `extract_metadata` | Extract comprehensive metadata and file statistics |
| `analyze_pdf_health` | Comprehensive PDF health and quality analysis |
| `analyze_pdf_security` | Analyze PDF security features and potential issues |
| `classify_content` | Classify and analyze PDF content type and structure |
| `summarize_content` | Generate summary and key insights from PDF content |

### Layout and Visual Analysis Tools

| Tool | Description |
|------|-------------|
| `analyze_layout` | Analyze PDF page layout including text blocks, columns, and spacing |
| `extract_charts` | Extract and analyze charts, diagrams, and visual elements |
| `detect_watermarks` | Detect and analyze watermarks in PDF |

### Content Manipulation Tools

| Tool | Description |
|------|-------------|
| `extract_form_data` | Extract form fields and their values from PDF forms |
| `split_pdf` | Split PDF into multiple files at specified pages |
| `merge_pdfs` | Merge multiple PDFs into a single file |
| `rotate_pages` | Rotate specific pages by 90, 180, or 270 degrees |

### Utility and Optimization Tools

| Tool | Description |
|------|-------------|
| `compare_pdfs` | Compare two PDFs for differences in text, structure, and metadata |
| `convert_to_images` | Convert PDF pages to image files |
| `optimize_pdf` | Optimize PDF file size and performance |
| `repair_pdf` | Attempt to repair corrupted or damaged PDF files |

## Development