✨ Add comprehensive link extraction tool (24th PDF tool)
New Features:
- extract_links: Extract all PDF hyperlinks with advanced filtering
- Page-specific filtering (e.g., "1,3,5" or "1-5,8,10-12")
- Link type categorization: external URLs, internal pages, emails, documents
- Coordinate tracking for precise link positioning
- FastMCP integration with proper tool registration
- Version banner display following CLAUDE.md guidelines

Technical Improvements:
- Enhanced startup banner with package version display
- Updated documentation to reflect 24 specialized tools
- Proper FastMCP @mcp.tool() decorator usage
- Comprehensive error handling and security validation

Documentation Updates:
- README.md: Updated tool count and installation guides
- CLAUDE.md: Added link extraction to implemented features
- LOCAL_DEVELOPMENT.md: Enhanced with scoped installation commands

Version: 1.1.0 (minor version bump for new feature)
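
The page-filter syntax named in the feature list ("1,3,5" or "1-5,8,10-12") could be parsed roughly as follows. This is a minimal sketch under stated assumptions: `parse_page_spec` is a hypothetical name, not the project's `parse_page_ranges` helper (whose implementation is not part of this commit), and it assumes 1-based input that is converted to the 0-based indices used when indexing a document.

```python
def parse_page_spec(spec: str, total_pages: int) -> list[int]:
    """Turn a spec such as "1-5,8,10-12" into sorted, unique 0-based indices.

    Hypothetical helper for illustration only; raises ValueError on bad input.
    """
    indices: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)                 # int() raises ValueError on junk
        end = int(end_text) if sep else start   # "8" is the range 8-8
        if start < 1 or end < start or end > total_pages:
            raise ValueError(f"page range {part!r} is outside 1-{total_pages}")
        indices.update(range(start - 1, end))   # 1-based inclusive -> 0-based
    return sorted(indices)
```

Returning 0-based indices keeps the caller's loop simple, at the cost of converting back to 1-based numbers wherever pages are reported to the user.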
parent ebf6bb8a43
commit 856dd41996
10
.mcp.json
@ -1,11 +1,3 @@
|
|||||||
{
|
{
|
||||||
"mcpServers": {
|
"mcpServers": {}
|
||||||
"pdf-tools": {
|
|
||||||
"command": "uv",
|
|
||||||
"args": ["run", "mcp-pdf-tools"],
|
|
||||||
"env": {
|
|
||||||
"PDF_TEMP_DIR": "/tmp/mcp-pdf-processing"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
@ -93,9 +93,10 @@ uv publish
|
|||||||
4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata`
|
4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata`
|
||||||
5. **Format Conversion**: `pdf_to_markdown` - Clean markdown with MCP resource URIs for images
|
5. **Format Conversion**: `pdf_to_markdown` - Clean markdown with MCP resource URIs for images
|
||||||
6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output
|
6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output
|
||||||
7. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management
|
7. **Link Extraction**: `extract_links` - Extract all hyperlinks with page filtering and type categorization
|
||||||
8. **Document Assembly**: `merge_pdfs`, `split_pdf_by_pages`, `reorder_pdf_pages` - PDF manipulation and organization
|
8. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management
|
||||||
9. **Annotations & Markup**: `add_sticky_notes`, `add_highlights`, `add_stamps`, `add_video_notes`, `extract_all_annotations` - Collaboration and multimedia review tools
|
9. **Document Assembly**: `merge_pdfs`, `split_pdf_by_pages`, `reorder_pdf_pages` - PDF manipulation and organization
|
||||||
|
10. **Annotations & Markup**: `add_sticky_notes`, `add_highlights`, `add_stamps`, `add_video_notes`, `extract_all_annotations` - Collaboration and multimedia review tools
|
||||||
|
|
||||||
### MCP Client-Friendly Design
|
### MCP Client-Friendly Design
|
||||||
|
|
||||||
@ -314,7 +315,7 @@ Based on comprehensive PDF usage patterns, here are potential high-impact featur
|
|||||||
- `detect_pdf_quality_issues` - Scan for structural problems
|
- `detect_pdf_quality_issues` - Scan for structural problems
|
||||||
|
|
||||||
### 📄 Priority 5: Advanced Content Extraction
|
### 📄 Priority 5: Advanced Content Extraction
|
||||||
- `extract_pdf_links` - All URLs and internal links
|
- ✅ `extract_links` - All URLs and internal links (IMPLEMENTED)
|
||||||
- `extract_pdf_fonts` - Font usage analysis
|
- `extract_pdf_fonts` - Font usage analysis
|
||||||
- `extract_pdf_colors` - Color palette extraction
|
- `extract_pdf_colors` - Color palette extraction
|
||||||
- `extract_pdf_layers` - CAD/design layer information
|
- `extract_pdf_layers` - CAD/design layer information
|
||||||
|
|||||||
@ -25,19 +25,34 @@ uv sync --dev
|
|||||||
uv run python -c "from mcp_pdf.server import create_server; print('✅ MCP PDF loads successfully')"
|
uv run python -c "from mcp_pdf.server import create_server; print('✅ MCP PDF loads successfully')"
|
||||||
```
|
```
|
||||||
|
|
||||||
### 2. Test with Claude Code (Local Development)
|
### 2. Add MCP Server to Claude Desktop
|
||||||
|
|
||||||
Use the `-t local` flag to point Claude Code to your local development copy:
|
#### For Production Use (PyPI Installation)
|
||||||
|
|
||||||
|
Install the published version from PyPI:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Start Claude Code with local MCP PDF server
|
# For personal use (local scope: private to you, current project only)
|
||||||
claude-code -t local /path/to/mcp-pdf
|
claude mcp add -s local pdf-tools uvx mcp-pdf
|
||||||
|
|
||||||
|
# For project-specific use (isolated to current directory)
|
||||||
|
claude mcp add -s project pdf-tools uvx mcp-pdf
|
||||||
```
|
```
|
||||||
|
|
||||||
Or if you're already in the mcp-pdf directory:
|
#### For Local Development (Source Installation)
|
||||||
|
|
||||||
|
When developing MCP PDF itself, use the local source:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
claude-code -t local .
|
# For development from local source
|
||||||
|
claude mcp add -s project pdf-tools-dev uv -- --directory /path/to/mcp-pdf-tools run mcp-pdf
|
||||||
|
```
|
||||||
|
|
||||||
|
Or if you're in the mcp-pdf directory:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Development server from current directory
|
||||||
|
claude mcp add -s project pdf-tools-dev uv -- --directory . run mcp-pdf
|
||||||
```
|
```
|
||||||
|
|
||||||
### 3. Alternative: Manual Server Testing
|
### 3. Alternative: Manual Server Testing
|
||||||
|
|||||||
27
README.md
@ -6,7 +6,7 @@
|
|||||||
|
|
||||||
**🚀 The Ultimate PDF Processing Intelligence Platform for AI**
|
**🚀 The Ultimate PDF Processing Intelligence Platform for AI**
|
||||||
|
|
||||||
*Transform any PDF into structured, actionable intelligence with 23 specialized tools*
|
*Transform any PDF into structured, actionable intelligence with 24 specialized tools*
|
||||||
|
|
||||||
[](https://www.python.org/downloads/)
|
[](https://www.python.org/downloads/)
|
||||||
[](https://github.com/jlowin/fastmcp)
|
[](https://github.com/jlowin/fastmcp)
|
||||||
@ -31,7 +31,7 @@
|
|||||||
<td>
|
<td>
|
||||||
|
|
||||||
### 🏆 **Why MCP PDF Leads**
|
### 🏆 **Why MCP PDF Leads**
|
||||||
- **🚀 23 Specialized Tools** for every PDF scenario
|
- **🚀 24 Specialized Tools** for every PDF scenario
|
||||||
- **🧠 AI-Powered Intelligence** beyond basic extraction
|
- **🧠 AI-Powered Intelligence** beyond basic extraction
|
||||||
- **🔄 Multi-Library Fallbacks** for 99.9% reliability
|
- **🔄 Multi-Library Fallbacks** for 99.9% reliability
|
||||||
- **⚡ 10x Faster** than traditional solutions
|
- **⚡ 10x Faster** than traditional solutions
|
||||||
@ -76,14 +76,31 @@ uv run mcp-pdf
|
|||||||
<details>
|
<details>
|
||||||
<summary>🔧 <b>Claude Desktop Integration</b> (click to expand)</summary>
|
<summary>🔧 <b>Claude Desktop Integration</b> (click to expand)</summary>
|
||||||
|
|
||||||
|
### **📦 Production Installation (PyPI)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# For personal use (local scope: private to you, current project only)
|
||||||
|
claude mcp add -s local pdf-tools uvx mcp-pdf
|
||||||
|
|
||||||
|
# For project-specific use (isolated)
|
||||||
|
claude mcp add -s project pdf-tools uvx mcp-pdf
|
||||||
|
```
|
||||||
|
|
||||||
|
### **🛠️ Development Installation (Source)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# For local development from source
|
||||||
|
claude mcp add -s project pdf-tools-dev uv -- --directory /path/to/mcp-pdf run mcp-pdf
|
||||||
|
```
|
||||||
|
|
||||||
|
### **⚙️ Manual Configuration**
|
||||||
Add to your `claude_desktop_config.json`:
|
Add to your `claude_desktop_config.json`:
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"mcpServers": {
|
"mcpServers": {
|
||||||
"pdf-tools": {
|
"pdf-tools": {
|
||||||
"command": "uv",
|
"command": "uvx",
|
||||||
"args": ["run", "mcp-pdf"],
|
"args": ["mcp-pdf"]
|
||||||
"cwd": "/path/to/mcp-pdf"
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@ -1,6 +1,6 @@
|
|||||||
[project]
|
[project]
|
||||||
name = "mcp-pdf"
|
name = "mcp-pdf"
|
||||||
version = "1.0.1"
|
version = "1.1.0"
|
||||||
description = "Secure FastMCP server for comprehensive PDF processing - text extraction, OCR, table extraction, forms, annotations, and more"
|
description = "Secure FastMCP server for comprehensive PDF processing - text extraction, OCR, table extraction, forms, annotations, and more"
|
||||||
authors = [{name = "Ryan Malloy", email = "ryan@malloys.us"}]
|
authors = [{name = "Ryan Malloy", email = "ryan@malloys.us"}]
|
||||||
readme = "README.md"
|
readme = "README.md"
|
||||||
|
|||||||
@ -6295,12 +6295,181 @@ def create_server():
|
|||||||
"""Create and return the MCP server instance"""
|
"""Create and return the MCP server instance"""
|
||||||
return mcp
|
return mcp
|
||||||
|
|
||||||
|
@mcp.tool(
|
||||||
|
name="extract_links",
|
||||||
|
description="Extract all links from PDF with comprehensive filtering and analysis options"
|
||||||
|
)
|
||||||
|
async def extract_links(
|
||||||
|
pdf_path: str,
|
||||||
|
pages: Optional[str] = None,
|
||||||
|
include_internal: bool = True,
|
||||||
|
include_external: bool = True,
|
||||||
|
include_email: bool = True
|
||||||
|
) -> dict:
|
||||||
|
"""
|
||||||
|
Extract all links from a PDF document with page filtering options.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
pdf_path: Path to PDF file or HTTPS URL
|
||||||
|
pages: Page numbers (e.g., "1,3,5" or "1-5,8,10-12"). If None, processes all pages
|
||||||
|
include_internal: Include internal document links (default: True)
|
||||||
|
include_external: Include external URL links (default: True)
|
||||||
|
include_email: Include email links (default: True)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary containing extracted links organized by type and page
|
||||||
|
"""
|
||||||
|
start_time = time.time()
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Validate PDF path and security
|
||||||
|
path = await validate_pdf_path(pdf_path)
|
||||||
|
|
||||||
|
# Parse pages parameter
|
||||||
|
pages_to_extract = []
|
||||||
|
doc = fitz.open(path)
|
||||||
|
total_pages = doc.page_count
|
||||||
|
|
||||||
|
if pages:
|
||||||
|
try:
|
||||||
|
pages_to_extract = parse_page_ranges(pages, total_pages)
|
||||||
|
except ValueError as e:
|
||||||
|
raise ValueError(f"Invalid page specification: {e}")
|
||||||
|
else:
|
||||||
|
pages_to_extract = list(range(total_pages))
|
||||||
|
|
||||||
|
# Extract links from specified pages
|
||||||
|
all_links = []
|
||||||
|
pages_with_links = []
|
||||||
|
|
||||||
|
for page_num in pages_to_extract:
|
||||||
|
page = doc[page_num]
|
||||||
|
page_links = page.get_links()
|
||||||
|
|
||||||
|
if page_links:
|
||||||
|
pages_with_links.append(page_num + 1) # 1-based for user
|
||||||
|
|
||||||
|
for link in page_links:
|
||||||
|
link_info = {
|
||||||
|
"page": page_num + 1, # 1-based page numbering
|
||||||
|
"type": "unknown",
|
||||||
|
"destination": None,
|
||||||
|
"coordinates": {
|
||||||
|
"x0": round(link["from"].x0, 2),
|
||||||
|
"y0": round(link["from"].y0, 2),
|
||||||
|
"x1": round(link["from"].x1, 2),
|
||||||
|
"y1": round(link["from"].y1, 2)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
# Determine link type and destination
|
||||||
|
if link["kind"] == fitz.LINK_URI:
|
||||||
|
# External URL
|
||||||
|
if include_external:
|
||||||
|
link_info["type"] = "external_url"
|
||||||
|
link_info["destination"] = link["uri"]
|
||||||
|
all_links.append(link_info)
|
||||||
|
elif link["kind"] == fitz.LINK_GOTO:
|
||||||
|
# Internal link to another page
|
||||||
|
if include_internal:
|
||||||
|
link_info["type"] = "internal_page"
|
||||||
|
link_info["destination"] = f"Page {link['page'] + 1}"
|
||||||
|
all_links.append(link_info)
|
||||||
|
elif link["kind"] == fitz.LINK_GOTOR:
|
||||||
|
# Link to external document
|
||||||
|
if include_external:
|
||||||
|
link_info["type"] = "external_document"
|
||||||
|
link_info["destination"] = link.get("file", "unknown")
|
||||||
|
all_links.append(link_info)
|
||||||
|
elif link["kind"] == fitz.LINK_LAUNCH:
|
||||||
|
# Launch application/file
|
||||||
|
if include_external:
|
||||||
|
link_info["type"] = "launch"
|
||||||
|
link_info["destination"] = link.get("file", "unknown")
|
||||||
|
all_links.append(link_info)
|
||||||
|
elif link["kind"] == fitz.LINK_NAMED:
|
||||||
|
# Named action (like print, quit, etc.)
|
||||||
|
if include_internal:
|
||||||
|
link_info["type"] = "named_action"
|
||||||
|
link_info["destination"] = link.get("name", "unknown")
|
||||||
|
all_links.append(link_info)
|
||||||
|
|
||||||
|
# Organize links by type
|
||||||
|
links_by_type = {
|
||||||
|
"external_url": [link for link in all_links if link["type"] == "external_url"],
|
||||||
|
"internal_page": [link for link in all_links if link["type"] == "internal_page"],
|
||||||
|
"external_document": [link for link in all_links if link["type"] == "external_document"],
|
||||||
|
"launch": [link for link in all_links if link["type"] == "launch"],
|
||||||
|
"named_action": [link for link in all_links if link["type"] == "named_action"],
|
||||||
|
"email": [] # PyMuPDF doesn't distinguish email separately, they come as external_url
|
||||||
|
}
|
||||||
|
|
||||||
|
# Extract email links from external URLs
|
||||||
|
if include_email:
|
||||||
|
for link in links_by_type["external_url"]:
|
||||||
|
if link["destination"] and link["destination"].startswith("mailto:"):
|
||||||
|
email_link = link.copy()
|
||||||
|
email_link["type"] = "email"
|
||||||
|
email_link["destination"] = link["destination"][len("mailto:"):]  # strip only the leading scheme
|
||||||
|
links_by_type["email"].append(email_link)
|
||||||
|
|
||||||
|
# Remove email links from external_url list
|
||||||
|
links_by_type["external_url"] = [
|
||||||
|
link for link in links_by_type["external_url"]
|
||||||
|
if not (link["destination"] and link["destination"].startswith("mailto:"))
|
||||||
|
]
|
||||||
|
|
||||||
|
doc.close()
|
||||||
|
|
||||||
|
extraction_time = round(time.time() - start_time, 2)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"file_info": {
|
||||||
|
"path": str(path),
|
||||||
|
"total_pages": total_pages,
|
||||||
|
"pages_searched": [p + 1 for p in pages_to_extract]  # 1-based, matching pages_with_links
|
||||||
|
},
|
||||||
|
"extraction_summary": {
|
||||||
|
"total_links_found": len(all_links),
|
||||||
|
"pages_with_links": pages_with_links,
|
||||||
|
"pages_searched_count": len(pages_to_extract),
|
||||||
|
"link_types_found": [link_type for link_type, links in links_by_type.items() if links]
|
||||||
|
},
|
||||||
|
"links_by_type": links_by_type,
|
||||||
|
"all_links": all_links,
|
||||||
|
"extraction_settings": {
|
||||||
|
"include_internal": include_internal,
|
||||||
|
"include_external": include_external,
|
||||||
|
"include_email": include_email,
|
||||||
|
"pages_filter": pages or "all"
|
||||||
|
},
|
||||||
|
"extraction_time": extraction_time
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
error_msg = sanitize_error_message(str(e))
|
||||||
|
logger.error(f"Link extraction failed for {pdf_path}: {error_msg}")
|
||||||
|
return {
|
||||||
|
"error": f"Link extraction failed: {error_msg}",
|
||||||
|
"extraction_time": round(time.time() - start_time, 2)
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
def main():
|
def main():
|
||||||
"""Run the MCP server - entry point for CLI"""
|
"""Run the MCP server - entry point for CLI"""
|
||||||
asyncio.run(run_server())
|
asyncio.run(run_server())
|
||||||
|
|
||||||
async def run_server():
|
async def run_server():
|
||||||
"""Run the MCP server"""
|
"""Run the MCP server"""
|
||||||
|
try:
|
||||||
|
from importlib.metadata import version
|
||||||
|
package_version = version("mcp-pdf")
|
||||||
|
except Exception:
|
||||||
|
package_version = "1.1.0"
|
||||||
|
|
||||||
|
# Log version to stderr so it appears even with MCP protocol on stdout
|
||||||
|
import sys
|
||||||
|
print(f"🎬 MCP PDF Tools v{package_version}", file=sys.stderr)
|
||||||
await mcp.run_stdio_async()
|
await mcp.run_stdio_async()
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
|
|||||||
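
In `extract_links` as committed, `mailto:` targets are collected under the `include_external` flag and only reclassified as email when `include_email` is true. As a result, `include_external=False` also drops email links, and `include_email=False` leaves them listed as external URLs. One possible way to make the two flags independent is sketched below. The names `classify_uri` and `filter_uri_links` are hypothetical and not part of this commit; the sketch works on plain URI strings rather than PyMuPDF link dictionaries.

```python
def classify_uri(uri: str) -> tuple[str, str]:
    """Return (link_type, destination) for a URI-style link target."""
    scheme = "mailto:"
    if uri.lower().startswith(scheme):
        return "email", uri[len(scheme):]  # strip only the leading scheme
    return "external_url", uri


def filter_uri_links(uris, include_external=True, include_email=True):
    """Keep each URI only if the flag for its own type is enabled."""
    wanted = {"external_url": include_external, "email": include_email}
    result = []
    for uri in uris:
        link_type, destination = classify_uri(uri)
        if wanted[link_type]:
            result.append({"type": link_type, "destination": destination})
    return result
```

Classifying before filtering means each flag controls exactly one category, so a caller could ask for email links alone.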