
# Crawailer API Reference

## Core Functions

### `get(url, **options) -> WebContent`

Extract content from a single URL with optional JavaScript execution.

**Parameters:**
- `url` (str): The URL to fetch
- `wait_for` (str, optional): CSS selector to wait for before extraction
- `timeout` (int, default=30): Request timeout in seconds
- `clean` (bool, default=True): Whether to clean and optimize content
- `extract_links` (bool, default=True): Whether to extract links
- `extract_metadata` (bool, default=True): Whether to extract metadata
- `script` (str, optional): JavaScript to execute (alias for `script_before`)
- `script_before` (str, optional): JavaScript to execute before content extraction
- `script_after` (str, optional): JavaScript to execute after content extraction

**Returns:** `WebContent` object with extracted content and metadata

**Example:**

```python
# Basic usage
content = await get("https://example.com")

# With JavaScript execution
content = await get(
    "https://dynamic-site.com",
    script="document.querySelector('.price').textContent",
    wait_for=".price-loaded"
)

# Before/after pattern
content = await get(
    "https://spa.com",
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)
```

### `get_many(urls, **options) -> List[WebContent]`

Extract content from multiple URLs efficiently with concurrent processing.

**Parameters:**
- `urls` (List[str]): List of URLs to fetch
- `max_concurrent` (int, default=5): Maximum concurrent requests
- `timeout` (int, default=30): Request timeout per URL
- `clean` (bool, default=True): Whether to clean content
- `progress` (bool, default=False): Whether to show a progress bar
- `script` (str | List[str], optional): JavaScript for all URLs or per-URL scripts

**Returns:** `List[WebContent]` (failed URLs return `None`)

**Example:**

```python
# Batch processing
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await get_many(urls, max_concurrent=3)

# Same script for all URLs
results = await get_many(
    urls,
    script="document.querySelector('.title').textContent"
)

# Different scripts per URL
scripts = [
    "document.title",
    "document.querySelector('.price').textContent",
    "document.querySelectorAll('.item').length"
]
results = await get_many(urls, script=scripts)
```
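
Because failed URLs yield `None`, filter the results before use. A minimal sketch, assuming results come back in the same order as the input URLs:

```python
results = await get_many(urls)

# Pair each URL with its result; keep only successful fetches
successful = {url: r for url, r in zip(urls, results) if r is not None}
failed = [url for url, r in zip(urls, results) if r is None]

print(f"Fetched {len(successful)} of {len(urls)} pages ({len(failed)} failed)")
```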

### `discover(query, **options) -> List[WebContent]`

Intelligently discover and rank content related to a query.

**Parameters:**
- `query` (str): Search query or topic description
- `max_pages` (int, default=10): Maximum results to return
- `quality_threshold` (float, default=0.7): Minimum quality score
- `recency_bias` (bool, default=True): Prefer recent content
- `source_types` (List[str], optional): Filter by source types
- `script` (str, optional): JavaScript for search results pages
- `content_script` (str, optional): JavaScript for discovered content pages

**Returns:** `List[WebContent]` ranked by relevance and quality

**Example:**

```python
# Basic discovery
results = await discover("machine learning tutorials")

# With JavaScript interaction
results = await discover(
    "AI research papers",
    script="document.querySelector('.show-more')?.click()",
    content_script="document.querySelector('.abstract').textContent",
    max_pages=5
)
```
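
The ranking parameters can be combined to trade recall for quality. A sketch using only the documented options (the threshold value is illustrative):

```python
results = await discover(
    "database internals",
    quality_threshold=0.85,  # keep only high-scoring pages
    recency_bias=False,      # do not privilege recent content
    max_pages=20
)
```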

### `cleanup()`

Clean up global browser resources.

**Example:**

```python
# Clean up at end of script
await cleanup()
```
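
In long-running scripts it is safer to guarantee cleanup even when extraction fails. A minimal sketch, assuming `get` and `cleanup` are importable from the package top level as in the other examples:

```python
import asyncio

from crawailer import get, cleanup

async def main():
    try:
        content = await get("https://example.com")
        print(content.title)
    finally:
        # Release the shared browser even if get() raised
        await cleanup()

asyncio.run(main())
```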

## Data Classes

### `WebContent`

Structured representation of extracted web content.

**Core Properties:**
- `url` (str): Source URL
- `title` (str): Extracted page title
- `markdown` (str): LLM-optimized markdown content
- `text` (str): Clean human-readable text
- `html` (str): Original HTML content

**Metadata Properties:**
- `author` (str | None): Content author
- `published` (datetime | None): Publication date
- `reading_time` (str): Estimated reading time
- `word_count` (int): Word count
- `language` (str): Content language
- `quality_score` (float): Content quality (0-10)

**Semantic Properties:**
- `content_type` (str): Detected content type (article, product, etc.)
- `topics` (List[str]): Extracted topics
- `entities` (Dict[str, List[str]]): Named entities

**Relationship Properties:**
- `links` (List[Dict]): Extracted links with metadata
- `images` (List[Dict]): Image information

**Technical Properties:**
- `status_code` (int): HTTP status code
- `load_time` (float): Page load time
- `content_hash` (str): Content hash for deduplication
- `extracted_at` (datetime): Extraction timestamp

**JavaScript Properties:**
- `script_result` (Any | None): JavaScript execution result
- `script_error` (str | None): JavaScript execution error

**Computed Properties:**
- `summary` (str): Brief content summary
- `readable_summary` (str): Human-friendly summary with metadata
- `has_script_result` (bool): Whether a JavaScript result is available
- `has_script_error` (bool): Whether a JavaScript error occurred

**Methods:**
- `save(path, format="auto")`: Save content to file

**Example:**

```python
content = await get("https://example.com", script="document.title")

# Access content
print(content.title)
print(content.markdown[:100])
print(content.text[:100])

# Access metadata
print(f"Author: {content.author}")
print(f"Reading time: {content.reading_time}")
print(f"Quality: {content.quality_score}/10")

# Access JavaScript results
if content.has_script_result:
    print(f"Script result: {content.script_result}")

if content.has_script_error:
    print(f"Script error: {content.script_error}")

# Save content
content.save("article.md")    # Saves as markdown
content.save("article.json")  # Saves as JSON with all metadata
```
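
Because `content_hash` is stable for identical content, it can drive simple deduplication across a crawl. A sketch built only on documented properties:

```python
seen_hashes = set()
unique_pages = []

for page in await get_many(urls):
    if page is None:
        continue  # failed fetch
    if page.content_hash in seen_hashes:
        continue  # duplicate of a page we already kept
    seen_hashes.add(page.content_hash)
    unique_pages.append(page)
```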

### `BrowserConfig`

Configuration for browser behavior.

**Properties:**
- `headless` (bool, default=True): Run browser in headless mode
- `timeout` (int, default=30000): Request timeout in milliseconds
- `user_agent` (str | None): Custom user agent
- `viewport` (Dict[str, int], default={"width": 1920, "height": 1080}): Viewport size
- `extra_args` (List[str], default=[]): Additional browser arguments

**Example:**

```python
from crawailer import BrowserConfig, Browser

config = BrowserConfig(
    headless=False,  # Show browser window
    timeout=60000,   # 60 second timeout
    user_agent="Custom Bot 1.0",
    viewport={"width": 1280, "height": 720}
)

browser = Browser(config)
```
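
The `extra_args` list appears to pass additional launch flags straight to the browser. A sketch with common Chromium flags (the specific flags are illustrative, not required):

```python
config = BrowserConfig(
    headless=True,
    extra_args=["--disable-gpu", "--no-sandbox"]  # illustrative Chromium flags
)
```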

## Browser Class

Lower-level browser control for advanced use cases.

### `Browser(config=None)`

**Methods:**

#### `async start()`

Initialize the browser instance.

#### `async close()`

Clean up browser resources.

#### `async fetch_page(url, **options) -> Dict[str, Any]`

Fetch a single page with full control.

**Parameters:**
- `url` (str): URL to fetch
- `wait_for` (str, optional): CSS selector to wait for
- `timeout` (int, default=30): Timeout in seconds
- `stealth` (bool, default=False): Enable stealth mode
- `script_before` (str, optional): JavaScript before content extraction
- `script_after` (str, optional): JavaScript after content extraction

**Returns:** Dictionary with page data

#### `async fetch_many(urls, **options) -> List[Dict[str, Any]]`

Fetch multiple pages concurrently.
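
A minimal sketch of batch fetching; the option names are assumed to mirror `fetch_page()`, and the dictionary keys shown are illustrative:

```python
browser = Browser()

async with browser:
    pages = await browser.fetch_many(
        ["https://example.com", "https://example.org"],
        timeout=30
    )
    for page in pages:
        # Each entry has the same shape as a fetch_page() result
        print(page.get("url"))
```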

#### `async take_screenshot(url, **options) -> bytes`

Take a screenshot of a page.

**Parameters:**
- `url` (str): URL to screenshot
- `selector` (str, optional): CSS selector to screenshot
- `full_page` (bool, default=False): Capture full scrollable page
- `timeout` (int, default=30): Timeout in seconds

**Returns:** Screenshot as PNG bytes

#### `async execute_script(url, script, **options) -> Any`

Execute JavaScript on a page and return the result.

**Example:**

```python
from crawailer import Browser, BrowserConfig

config = BrowserConfig(headless=False)
browser = Browser(config)

async with browser:
    # Fetch page data
    page_data = await browser.fetch_page(
        "https://example.com",
        script_before="window.scrollTo(0, document.body.scrollHeight)",
        script_after="document.querySelectorAll('.item').length"
    )

    # Take screenshot
    screenshot = await browser.take_screenshot("https://example.com")
    with open("screenshot.png", "wb") as f:
        f.write(screenshot)

    # Execute JavaScript
    result = await browser.execute_script(
        "https://example.com",
        "document.title + ' - ' + document.querySelectorAll('a').length + ' links'"
    )
    print(result)
```
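
To capture a single element or the full scrollable page, use the documented `selector` and `full_page` options (the URL and selector here are illustrative):

```python
browser = Browser()

async with browser:
    # Screenshot only the element matched by the selector
    element_png = await browser.take_screenshot(
        "https://example.com/pricing",
        selector=".pricing-table"
    )

    # Screenshot the entire scrollable page
    full_png = await browser.take_screenshot(
        "https://example.com",
        full_page=True
    )
```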

## Content Extraction

### `ContentExtractor`

Transforms raw HTML into structured `WebContent`.

**Parameters:**
- `clean` (bool, default=True): Clean and normalize text
- `extract_links` (bool, default=True): Extract link information
- `extract_metadata` (bool, default=True): Extract metadata
- `extract_images` (bool, default=False): Extract image information

**Methods:**

#### `async extract(page_data) -> WebContent`

Extract structured content from page data.

**Example:**

```python
from crawailer.content import ContentExtractor
from crawailer.browser import Browser

browser = Browser()
extractor = ContentExtractor(
    clean=True,
    extract_links=True,
    extract_metadata=True,
    extract_images=True
)

async with browser:
    page_data = await browser.fetch_page("https://example.com")
    content = await extractor.extract(page_data)
    print(content.title)
```

## Error Handling

### Custom Exceptions

```python
from crawailer.exceptions import (
    CrawlerError,           # Base exception
    TimeoutError,           # Request timeout
    CloudflareProtected,    # Cloudflare protection detected
    PaywallDetected,        # Paywall detected
    RateLimitError,         # Rate limit exceeded
    ContentExtractionError  # Content extraction failed
)

try:
    content = await get("https://protected-site.com")
except CloudflareProtected:
    # Try with stealth mode
    content = await get("https://protected-site.com", stealth=True)
except PaywallDetected as e:
    print(f"Paywall detected. Archive URL: {e.archive_url}")
except TimeoutError:
    # Increase timeout
    content = await get("https://slow-site.com", timeout=60)
```
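
For transient failures such as rate limiting, a retry loop with exponential backoff works well. A sketch using the documented `RateLimitError` (retry count and delays are illustrative):

```python
import asyncio

from crawailer.exceptions import RateLimitError

async def get_with_backoff(url, retries=3):
    delay = 1.0
    for attempt in range(retries):
        try:
            return await get(url)
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff
```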

## JavaScript Execution

### Script Patterns

#### Simple Execution

```python
# Extract single value
content = await get(url, script="document.title")
print(content.script_result)  # Page title
```

#### Complex Operations

```python
# Multi-step JavaScript
complex_script = """
// Scroll to load content
window.scrollTo(0, document.body.scrollHeight);
await new Promise(resolve => setTimeout(resolve, 2000));

// Extract data
const items = Array.from(document.querySelectorAll('.item')).map(item => ({
    title: item.querySelector('.title')?.textContent,
    price: item.querySelector('.price')?.textContent
}));

return items;
"""

content = await get(url, script=complex_script)
items = content.script_result  # List of extracted items
```

#### Before/After Pattern

```python
content = await get(
    url,
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)

if isinstance(content.script_result, dict):
    print(f"Action result: {content.script_result['script_before']}")
    print(f"Items count: {content.script_result['script_after']}")
```

#### Error Handling

```python
content = await get(url, script="document.querySelector('.missing').click()")

if content.has_script_error:
    print(f"JavaScript error: {content.script_error}")
    # Use fallback content
    print(f"Fallback: {content.text[:100]}")
else:
    print(f"Result: {content.script_result}")
```

### Framework Detection

#### React Applications

```python
react_script = """
if (window.React) {
    return {
        framework: 'React',
        version: React.version,
        hasRouter: !!window.ReactRouter,
        componentCount: document.querySelectorAll('[data-reactroot] *').length
    };
}
return null;
"""

content = await get("https://react-app.com", script=react_script)
```

#### Vue Applications

```python
vue_script = """
if (window.Vue) {
    return {
        framework: 'Vue',
        version: Vue.version,
        hasRouter: !!window.VueRouter,
        hasVuex: !!window.Vuex
    };
}
return null;
"""

content = await get("https://vue-app.com", script=vue_script)
```

## Performance Optimization

### Batch Processing

```python
import asyncio

# Process large URL lists efficiently
urls = [f"https://site.com/page/{i}" for i in range(100)]

# Process in batches
batch_size = 10
all_results = []

for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]
    results = await get_many(batch, max_concurrent=5)
    all_results.extend(results)

    # Rate limiting between batches
    await asyncio.sleep(1)
```

### Memory Management

```python
# For long-running processes
import gc

for batch in url_batches:
    results = await get_many(batch)
    process_results(results)

    # Clear references and force garbage collection
    del results
    gc.collect()
```

### Timeout Configuration

```python
# Adjust timeouts based on site characteristics
fast_sites = await get_many(urls, timeout=10)
slow_sites = await get_many(urls, timeout=60)
```

## MCP Integration

### Server Setup

```python
from crawailer.mcp import create_mcp_server

# Create MCP server with default tools
server = create_mcp_server()

# Custom MCP tool
@server.tool("extract_product_data")
async def extract_product_data(url: str) -> dict:
    content = await get(
        url,
        script="""
        ({
            name: document.querySelector('.product-name')?.textContent,
            price: document.querySelector('.price')?.textContent,
            rating: document.querySelector('.rating')?.textContent
        })
        """
    )

    return {
        'title': content.title,
        'product_data': content.script_result,
        'metadata': {
            'word_count': content.word_count,
            'quality_score': content.quality_score
        }
    }
```

## CLI Interface

### Basic Commands

```bash
# Extract content from URL
crawailer get https://example.com

# Batch processing
crawailer get-many urls.txt --output results.json

# Discovery
crawailer discover "AI research" --max-pages 10

# Setup (install browsers)
crawailer setup
```

### JavaScript Execution

```bash
# Execute JavaScript
crawailer get https://spa.com --script "document.title" --wait-for ".loaded"

# Save with script results
crawailer get https://dynamic.com --script "window.data" --output content.json
```

## Advanced Usage

### Custom Content Extractors

```python
from crawailer.content import ContentExtractor

class CustomExtractor(ContentExtractor):
    async def extract(self, page_data):
        content = await super().extract(page_data)

        # Add custom processing
        if 'product' in content.content_type:
            content.custom_data = self.extract_product_details(content.html)

        return content

    def extract_product_details(self, html):
        # Custom extraction logic
        pass

# Use custom extractor
from crawailer.api import _get_browser

browser = await _get_browser()
extractor = CustomExtractor()

page_data = await browser.fetch_page(url)
content = await extractor.extract(page_data)
```

### Session Management

```python
from crawailer.browser import Browser

# Persistent browser session
browser = Browser()
await browser.start()

try:
    # Login
    await browser.fetch_page(
        "https://site.com/login",
        script_after="""
        document.querySelector('#username').value = 'user';
        document.querySelector('#password').value = 'pass';
        document.querySelector('#login').click();
        """
    )

    # Access protected content
    protected_content = await browser.fetch_page("https://site.com/dashboard")
finally:
    await browser.close()
```

This API reference provides comprehensive documentation for all Crawailer functionality, with particular emphasis on the JavaScript execution capabilities that set it apart from traditional web scrapers.