
Crawailer API Reference
Core Functions
get(url, **options) -> WebContent
Extract content from a single URL with optional JavaScript execution.
Parameters:
- url (str): The URL to fetch
- wait_for (str, optional): CSS selector to wait for before extraction
- timeout (int, default=30): Request timeout in seconds
- clean (bool, default=True): Whether to clean and optimize content
- extract_links (bool, default=True): Whether to extract links
- extract_metadata (bool, default=True): Whether to extract metadata
- script (str, optional): JavaScript to execute (alias for script_before)
- script_before (str, optional): JavaScript to execute before content extraction
- script_after (str, optional): JavaScript to execute after content extraction
Returns: WebContent object with extracted content and metadata
Example:
# Basic usage
content = await get("https://example.com")

# With JavaScript execution
content = await get(
    "https://dynamic-site.com",
    script="document.querySelector('.price').textContent",
    wait_for=".price-loaded"
)

# Before/after pattern
content = await get(
    "https://spa.com",
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)
get_many(urls, **options) -> List[WebContent]
Extract content from multiple URLs efficiently with concurrent processing.
Parameters:
- urls (List[str]): List of URLs to fetch
- max_concurrent (int, default=5): Maximum concurrent requests
- timeout (int, default=30): Request timeout per URL
- clean (bool, default=True): Whether to clean content
- progress (bool, default=False): Whether to show progress bar
- script (str | List[str], optional): JavaScript for all URLs or per-URL scripts
Returns: List[WebContent] (failed URLs return None; see the filtering sketch after the example below)
Example:
# Batch processing
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await get_many(urls, max_concurrent=3)

# Same script for all URLs
results = await get_many(
    urls,
    script="document.querySelector('.title').textContent"
)

# Different scripts per URL
scripts = [
    "document.title",
    "document.querySelector('.price').textContent",
    "document.querySelectorAll('.item').length"
]
results = await get_many(urls, script=scripts)
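Because failed URLs come back as None, results usually need filtering before use. A minimal sketch:

results = await get_many(urls, max_concurrent=3)

# Drop failed entries; the remaining items are WebContent objects
successful = [content for content in results if content is not None]
for content in successful:
    print(content.title)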
discover(query, **options) -> List[WebContent]
Intelligently discover and rank content related to a query.
Parameters:
- query (str): Search query or topic description
- max_pages (int, default=10): Maximum results to return
- quality_threshold (float, default=0.7): Minimum quality score
- recency_bias (bool, default=True): Prefer recent content
- source_types (List[str], optional): Filter by source types
- script (str, optional): JavaScript for search results pages
- content_script (str, optional): JavaScript for discovered content pages
Returns: List[WebContent] ranked by relevance and quality
Example:
# Basic discovery
results = await discover("machine learning tutorials")

# With JavaScript interaction
results = await discover(
    "AI research papers",
    script="document.querySelector('.show-more')?.click()",
    content_script="document.querySelector('.abstract').textContent",
    max_pages=5
)
cleanup()
Clean up global browser resources.
Example:
# Clean up at end of script
await cleanup()
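Since the high-level functions share global browser resources, a common pattern is to wrap work in try/finally so cleanup() runs even when extraction fails. A minimal sketch, assuming get and cleanup are importable from the package top level:

import asyncio
from crawailer import get, cleanup

async def main():
    try:
        content = await get("https://example.com")
        print(content.title)
    finally:
        # Release the global browser resources regardless of success
        await cleanup()

asyncio.run(main())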
Data Classes
WebContent
Structured representation of extracted web content.
Core Properties:
- url (str): Source URL
- title (str): Extracted page title
- markdown (str): LLM-optimized markdown content
- text (str): Clean human-readable text
- html (str): Original HTML content
Metadata Properties:
- author (str | None): Content author
- published (datetime | None): Publication date
- reading_time (str): Estimated reading time
- word_count (int): Word count
- language (str): Content language
- quality_score (float): Content quality (0-10)
Semantic Properties:
- content_type (str): Detected content type (article, product, etc.)
- topics (List[str]): Extracted topics
- entities (Dict[str, List[str]]): Named entities
Relationship Properties:
- links (List[Dict]): Extracted links with metadata
- images (List[Dict]): Image information
Technical Properties:
- status_code (int): HTTP status code
- load_time (float): Page load time
- content_hash (str): Content hash for deduplication
- extracted_at (datetime): Extraction timestamp
JavaScript Properties:
- script_result (Any | None): JavaScript execution result
- script_error (str | None): JavaScript execution error
Computed Properties:
- summary (str): Brief content summary
- readable_summary (str): Human-friendly summary with metadata
- has_script_result (bool): Whether a JavaScript result is available
- has_script_error (bool): Whether a JavaScript error occurred
Methods:
save(path, format="auto"): Save content to file
Example:
content = await get("https://example.com", script="document.title")

# Access content
print(content.title)
print(content.markdown[:100])
print(content.text[:100])

# Access metadata
print(f"Author: {content.author}")
print(f"Reading time: {content.reading_time}")
print(f"Quality: {content.quality_score}/10")

# Access JavaScript results
if content.has_script_result:
    print(f"Script result: {content.script_result}")
if content.has_script_error:
    print(f"Script error: {content.script_error}")

# Save content
content.save("article.md")    # Saves as markdown
content.save("article.json")  # Saves as JSON with all metadata
BrowserConfig
Configuration for browser behavior.
Properties:
- headless (bool, default=True): Run browser in headless mode
- timeout (int, default=30000): Request timeout in milliseconds
- user_agent (str | None): Custom user agent
- viewport (Dict[str, int], default={"width": 1920, "height": 1080}): Viewport size
- extra_args (List[str], default=[]): Additional browser arguments (see the sketch after the example below)
Example:
from crawailer import BrowserConfig, Browser

config = BrowserConfig(
    headless=False,  # Show browser window
    timeout=60000,   # 60 second timeout
    user_agent="Custom Bot 1.0",
    viewport={"width": 1280, "height": 720}
)
browser = Browser(config)
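The extra_args list is passed through as additional launch arguments for the underlying browser. A minimal sketch, assuming a Chromium-based engine where flags such as --disable-gpu are valid; adjust the flags for your environment:

config = BrowserConfig(
    headless=True,
    extra_args=["--disable-gpu", "--no-sandbox"]  # example flags, not required by Crawailer
)
browser = Browser(config)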
Browser Class
Lower-level browser control for advanced use cases.
Browser(config=None)
Methods:
async start()
Initialize the browser instance.
async close()
Clean up browser resources.
async fetch_page(url, **options) -> Dict[str, Any]
Fetch a single page with full control.
Parameters:
- url (str): URL to fetch
- wait_for (str, optional): CSS selector to wait for
- timeout (int, default=30): Timeout in seconds
- stealth (bool, default=False): Enable stealth mode
- script_before (str, optional): JavaScript before content extraction
- script_after (str, optional): JavaScript after content extraction
Returns: Dictionary with page data
async fetch_many(urls, **options) -> List[Dict[str, Any]]
Fetch multiple pages concurrently.
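A minimal sketch, assuming fetch_many accepts the same per-page options as fetch_page and returns one dictionary per URL:

async with browser:
    pages = await browser.fetch_many(
        ["https://example.com/a", "https://example.com/b"],
        timeout=30
    )
    print(f"Fetched {len(pages)} pages")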
async take_screenshot(url, **options) -> bytes
Take a screenshot of a page.
Parameters:
- url (str): URL to screenshot
- selector (str, optional): CSS selector to screenshot
- full_page (bool, default=False): Capture full scrollable page
- timeout (int, default=30): Timeout in seconds
Returns: Screenshot as PNG bytes
async execute_script(url, script, **options) -> Any
Execute JavaScript on a page and return result.
Example:
from crawailer import Browser, BrowserConfig

config = BrowserConfig(headless=False)
browser = Browser(config)

async with browser:
    # Fetch page data
    page_data = await browser.fetch_page(
        "https://example.com",
        script_before="window.scrollTo(0, document.body.scrollHeight)",
        script_after="document.querySelectorAll('.item').length"
    )

    # Take screenshot
    screenshot = await browser.take_screenshot("https://example.com")
    with open("screenshot.png", "wb") as f:
        f.write(screenshot)

    # Execute JavaScript
    result = await browser.execute_script(
        "https://example.com",
        "document.title + ' - ' + document.querySelectorAll('a').length + ' links'"
    )
    print(result)
Content Extraction
ContentExtractor
Transforms raw HTML into structured WebContent.
Parameters:
- clean (bool, default=True): Clean and normalize text
- extract_links (bool, default=True): Extract link information
- extract_metadata (bool, default=True): Extract metadata
- extract_images (bool, default=False): Extract image information
Methods:
async extract(page_data) -> WebContent
Extract structured content from page data.
Example:
from crawailer.content import ContentExtractor
from crawailer.browser import Browser

browser = Browser()
extractor = ContentExtractor(
    clean=True,
    extract_links=True,
    extract_metadata=True,
    extract_images=True
)

async with browser:
    page_data = await browser.fetch_page("https://example.com")
    content = await extractor.extract(page_data)
    print(content.title)
Error Handling
Custom Exceptions
from crawailer.exceptions import (
    CrawlerError,            # Base exception
    TimeoutError,            # Request timeout
    CloudflareProtected,     # Cloudflare protection detected
    PaywallDetected,         # Paywall detected
    RateLimitError,          # Rate limit exceeded
    ContentExtractionError,  # Content extraction failed
)

try:
    content = await get("https://protected-site.com")
except CloudflareProtected:
    # Try with stealth mode
    content = await get("https://protected-site.com", stealth=True)
except PaywallDetected as e:
    print(f"Paywall detected. Archive URL: {e.archive_url}")
except TimeoutError:
    # Increase timeout
    content = await get("https://slow-site.com", timeout=60)
JavaScript Execution
Script Patterns
Simple Execution
# Extract single value
content = await get(url, script="document.title")
print(content.script_result) # Page title
Complex Operations
# Multi-step JavaScript
complex_script = """
// Scroll to load content
window.scrollTo(0, document.body.scrollHeight);
await new Promise(resolve => setTimeout(resolve, 2000));

// Extract data
const items = Array.from(document.querySelectorAll('.item')).map(item => ({
    title: item.querySelector('.title')?.textContent,
    price: item.querySelector('.price')?.textContent
}));
return items;
"""

content = await get(url, script=complex_script)
items = content.script_result  # List of extracted items
Before/After Pattern
content = await get(
    url,
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)

if isinstance(content.script_result, dict):
    print(f"Action result: {content.script_result['script_before']}")
    print(f"Items count: {content.script_result['script_after']}")
Error Handling
content = await get(url, script="document.querySelector('.missing').click()")

if content.has_script_error:
    print(f"JavaScript error: {content.script_error}")
    # Use fallback content
    print(f"Fallback: {content.text[:100]}")
else:
    print(f"Result: {content.script_result}")
Framework Detection
React Applications
react_script = """
if (window.React) {
    return {
        framework: 'React',
        version: React.version,
        hasRouter: !!window.ReactRouter,
        componentCount: document.querySelectorAll('[data-reactroot] *').length
    };
}
return null;
"""

content = await get("https://react-app.com", script=react_script)
Vue Applications
vue_script = """
if (window.Vue) {
    return {
        framework: 'Vue',
        version: Vue.version,
        hasRouter: !!window.VueRouter,
        hasVuex: !!window.Vuex
    };
}
return null;
"""

content = await get("https://vue-app.com", script=vue_script)
Performance Optimization
Batch Processing
import asyncio

# Process large URL lists efficiently
urls = [f"https://site.com/page/{i}" for i in range(100)]

# Process in batches
batch_size = 10
all_results = []

for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]
    results = await get_many(batch, max_concurrent=5)
    all_results.extend(results)

    # Rate limiting
    await asyncio.sleep(1)
Memory Management
# For long-running processes
import gc

for batch in url_batches:
    results = await get_many(batch)
    process_results(results)

    # Clear references and force garbage collection
    del results
    gc.collect()
Timeout Configuration
# Adjust timeouts based on site characteristics
fast_sites = await get_many(urls, timeout=10)
slow_sites = await get_many(urls, timeout=60)
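When the right timeout is not known up front, one option is to retry slow URLs once with a more generous limit. A minimal sketch using the TimeoutError from the error-handling section (the helper name is illustrative):

from crawailer.exceptions import TimeoutError as CrawlTimeout

async def get_with_retry(url, fast=10, slow=60):
    try:
        return await get(url, timeout=fast)
    except CrawlTimeout:
        # Retry once with a longer timeout for slow sites
        return await get(url, timeout=slow)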
MCP Integration
Server Setup
from crawailer.mcp import create_mcp_server

# Create MCP server with default tools
server = create_mcp_server()

# Custom MCP tool
@server.tool("extract_product_data")
async def extract_product_data(url: str) -> dict:
    content = await get(
        url,
        script="""
        ({
            name: document.querySelector('.product-name')?.textContent,
            price: document.querySelector('.price')?.textContent,
            rating: document.querySelector('.rating')?.textContent
        })
        """
    )
    return {
        'title': content.title,
        'product_data': content.script_result,
        'metadata': {
            'word_count': content.word_count,
            'quality_score': content.quality_score
        }
    }
CLI Interface
Basic Commands
# Extract content from URL
crawailer get https://example.com
# Batch processing
crawailer get-many urls.txt --output results.json
# Discovery
crawailer discover "AI research" --max-pages 10
# Setup (install browsers)
crawailer setup
JavaScript Execution
# Execute JavaScript
crawailer get https://spa.com --script "document.title" --wait-for ".loaded"
# Save with script results
crawailer get https://dynamic.com --script "window.data" --output content.json
Advanced Usage
Custom Content Extractors
from crawailer.content import ContentExtractor

class CustomExtractor(ContentExtractor):
    async def extract(self, page_data):
        content = await super().extract(page_data)

        # Add custom processing
        if 'product' in content.content_type:
            content.custom_data = self.extract_product_details(content.html)

        return content

    def extract_product_details(self, html):
        # Custom extraction logic
        pass

# Use custom extractor
from crawailer.api import _get_browser

browser = await _get_browser()
extractor = CustomExtractor()
page_data = await browser.fetch_page(url)
content = await extractor.extract(page_data)
Session Management
from crawailer.browser import Browser

# Persistent browser session
browser = Browser()
await browser.start()

try:
    # Login
    await browser.fetch_page(
        "https://site.com/login",
        script_after="""
        document.querySelector('#username').value = 'user';
        document.querySelector('#password').value = 'pass';
        document.querySelector('#login').click();
        """
    )

    # Access protected content
    protected_content = await browser.fetch_page("https://site.com/dashboard")
finally:
    await browser.close()
This API reference provides comprehensive documentation for all Crawailer functionality, with particular emphasis on the JavaScript execution capabilities that set it apart from traditional web scrapers.