Crawailer API Reference

Core Functions

get(url, **options) -> WebContent

Extract content from a single URL with optional JavaScript execution.

Parameters:

  • url (str): The URL to fetch
  • wait_for (str, optional): CSS selector to wait for before extraction
  • timeout (int, default=30): Request timeout in seconds
  • clean (bool, default=True): Whether to clean and optimize content
  • extract_links (bool, default=True): Whether to extract links
  • extract_metadata (bool, default=True): Whether to extract metadata
  • script (str, optional): JavaScript to execute (alias for script_before)
  • script_before (str, optional): JavaScript to execute before content extraction
  • script_after (str, optional): JavaScript to execute after content extraction

Returns: WebContent object with extracted content and metadata

Example:

# Basic usage
content = await get("https://example.com")

# With JavaScript execution
content = await get(
    "https://dynamic-site.com",
    script="document.querySelector('.price').textContent",
    wait_for=".price-loaded"
)

# Before/after pattern
content = await get(
    "https://spa.com",
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)

get_many(urls, **options) -> List[WebContent]

Extract content from multiple URLs efficiently with concurrent processing.

Parameters:

  • urls (List[str]): List of URLs to fetch
  • max_concurrent (int, default=5): Maximum concurrent requests
  • timeout (int, default=30): Request timeout per URL
  • clean (bool, default=True): Whether to clean content
  • progress (bool, default=False): Whether to show progress bar
  • script (str | List[str], optional): JavaScript for all URLs or per-URL scripts

Returns: List[WebContent] (entries for failed URLs are None; see the failure-handling sketch after the examples below)

Example:

# Batch processing
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await get_many(urls, max_concurrent=3)

# Same script for all URLs
results = await get_many(
    urls,
    script="document.querySelector('.title').textContent"
)

# Different scripts per URL
scripts = [
    "document.title",
    "document.querySelector('.price').textContent", 
    "document.querySelectorAll('.item').length"
]
results = await get_many(urls, script=scripts)
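
Because entries for failed URLs come back as None, filter the results before processing them. A minimal sketch; pairing URLs with results by position is an assumption:

# Skip failures (positional pairing of urls and results is an assumption)
for url, result in zip(urls, results):
    if result is None:
        print(f"Failed to fetch: {url}")
    else:
        print(f"{url}: {result.title}")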

discover(query, **options) -> List[WebContent]

Intelligently discover and rank content related to a query.

Parameters:

  • query (str): Search query or topic description
  • max_pages (int, default=10): Maximum results to return
  • quality_threshold (float, default=0.7): Minimum quality score
  • recency_bias (bool, default=True): Prefer recent content
  • source_types (List[str], optional): Filter by source types
  • script (str, optional): JavaScript for search results pages
  • content_script (str, optional): JavaScript for discovered content pages

Returns: List[WebContent] ranked by relevance and quality

Example:

# Basic discovery
results = await discover("machine learning tutorials")

# With JavaScript interaction
results = await discover(
    "AI research papers",
    script="document.querySelector('.show-more')?.click()",
    content_script="document.querySelector('.abstract').textContent",
    max_pages=5
)

cleanup()

Clean up global browser resources.

Example:

# Clean up at end of script
await cleanup()

Data Classes

WebContent

Structured representation of extracted web content.

Core Properties:

  • url (str): Source URL
  • title (str): Extracted page title
  • markdown (str): LLM-optimized markdown content
  • text (str): Clean human-readable text
  • html (str): Original HTML content

Metadata Properties:

  • author (str | None): Content author
  • published (datetime | None): Publication date
  • reading_time (str): Estimated reading time
  • word_count (int): Word count
  • language (str): Content language
  • quality_score (float): Content quality (0-10)

Semantic Properties:

  • content_type (str): Detected content type (article, product, etc.)
  • topics (List[str]): Extracted topics
  • entities (Dict[str, List[str]]): Named entities

Relationship Properties:

  • links (List[Dict]): Extracted links with metadata
  • images (List[Dict]): Image information

Technical Properties:

  • status_code (int): HTTP status code
  • load_time (float): Page load time
  • content_hash (str): Content hash for deduplication
  • extracted_at (datetime): Extraction timestamp

JavaScript Properties:

  • script_result (Any | None): JavaScript execution result
  • script_error (str | None): JavaScript execution error

Computed Properties:

  • summary (str): Brief content summary
  • readable_summary (str): Human-friendly summary with metadata
  • has_script_result (bool): Whether JavaScript result is available
  • has_script_error (bool): Whether JavaScript error occurred

Methods:

  • save(path, format="auto"): Save content to file

Example:

content = await get("https://example.com", script="document.title")

# Access content
print(content.title)
print(content.markdown[:100])
print(content.text[:100])

# Access metadata
print(f"Author: {content.author}")
print(f"Reading time: {content.reading_time}")
print(f"Quality: {content.quality_score}/10")

# Access JavaScript results
if content.has_script_result:
    print(f"Script result: {content.script_result}")

if content.has_script_error:
    print(f"Script error: {content.script_error}")

# Save content
content.save("article.md")  # Saves as markdown
content.save("article.json")  # Saves as JSON with all metadata

BrowserConfig

Configuration for browser behavior.

Properties:

  • headless (bool, default=True): Run browser in headless mode
  • timeout (int, default=30000): Request timeout in milliseconds
  • user_agent (str | None): Custom user agent
  • viewport (Dict[str, int], default={"width": 1920, "height": 1080}): Viewport size
  • extra_args (List[str], default=[]): Additional browser arguments

Example:

from crawailer import BrowserConfig, Browser

config = BrowserConfig(
    headless=False,  # Show browser window
    timeout=60000,   # 60 second timeout
    user_agent="Custom Bot 1.0",
    viewport={"width": 1280, "height": 720}
)

browser = Browser(config)
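
Custom launch flags can be supplied via extra_args. A minimal sketch, assuming the arguments are forwarded to the browser launch (the flags themselves are illustrative):

config = BrowserConfig(
    headless=True,
    extra_args=["--disable-gpu", "--no-sandbox"]  # illustrative flags
)
browser = Browser(config)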

Browser Class

Lower-level browser control for advanced use cases.

Browser(config=None)

Methods:

async start()

Initialize the browser instance.

async close()

Clean up browser resources.

async fetch_page(url, **options) -> Dict[str, Any]

Fetch a single page with full control.

Parameters:

  • url (str): URL to fetch
  • wait_for (str, optional): CSS selector to wait for
  • timeout (int, default=30): Timeout in seconds
  • stealth (bool, default=False): Enable stealth mode
  • script_before (str, optional): JavaScript before content extraction
  • script_after (str, optional): JavaScript after content extraction

Returns: Dictionary with page data

async fetch_many(urls, **options) -> List[Dict[str, Any]]

Fetch multiple pages concurrently.
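
A minimal sketch, assuming fetch_many accepts a list of URLs plus the same per-page options as fetch_page and returns one dictionary per URL:

from crawailer import Browser

browser = Browser()
async with browser:
    pages = await browser.fetch_many(
        ["https://example.com/a", "https://example.com/b"],
        wait_for="body"  # per-page options mirroring fetch_page is an assumption
    )
    for page_data in pages:
        print(page_data.get("url"))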

async take_screenshot(url, **options) -> bytes

Take a screenshot of a page.

Parameters:

  • url (str): URL to screenshot
  • selector (str, optional): CSS selector to screenshot
  • full_page (bool, default=False): Capture full scrollable page
  • timeout (int, default=30): Timeout in seconds

Returns: Screenshot as PNG bytes

async execute_script(url, script, **options) -> Any

Execute JavaScript on a page and return result.

Example:

from crawailer import Browser, BrowserConfig

config = BrowserConfig(headless=False)
browser = Browser(config)

async with browser:
    # Fetch page data
    page_data = await browser.fetch_page(
        "https://example.com",
        script_before="window.scrollTo(0, document.body.scrollHeight)",
        script_after="document.querySelectorAll('.item').length"
    )
    
    # Take screenshot
    screenshot = await browser.take_screenshot("https://example.com")
    with open("screenshot.png", "wb") as f:
        f.write(screenshot)
    
    # Execute JavaScript
    result = await browser.execute_script(
        "https://example.com",
        "document.title + ' - ' + document.querySelectorAll('a').length + ' links'"
    )
    print(result)

Content Extraction

ContentExtractor

Transforms raw HTML into structured WebContent.

Parameters:

  • clean (bool, default=True): Clean and normalize text
  • extract_links (bool, default=True): Extract link information
  • extract_metadata (bool, default=True): Extract metadata
  • extract_images (bool, default=False): Extract image information

Methods:

async extract(page_data) -> WebContent

Extract structured content from page data.

Example:

from crawailer.content import ContentExtractor
from crawailer.browser import Browser

browser = Browser()
extractor = ContentExtractor(
    clean=True,
    extract_links=True,
    extract_metadata=True,
    extract_images=True
)

async with browser:
    page_data = await browser.fetch_page("https://example.com")
    content = await extractor.extract(page_data)
    print(content.title)

Error Handling

Custom Exceptions

from crawailer.exceptions import (
    CrawlerError,           # Base exception
    TimeoutError,           # Request timeout
    CloudflareProtected,    # Cloudflare protection detected
    PaywallDetected,        # Paywall detected
    RateLimitError,         # Rate limit exceeded
    ContentExtractionError  # Content extraction failed
)

try:
    content = await get("https://protected-site.com")
except CloudflareProtected:
    # Try with stealth mode
    content = await get("https://protected-site.com", stealth=True)
except PaywallDetected as e:
    print(f"Paywall detected. Archive URL: {e.archive_url}")
except TimeoutError:
    # Increase timeout
    content = await get("https://slow-site.com", timeout=60)

JavaScript Execution

Script Patterns

Simple Execution

# Extract single value
content = await get(url, script="document.title")
print(content.script_result)  # Page title

Complex Operations

# Multi-step JavaScript
complex_script = """
// Scroll to load content
window.scrollTo(0, document.body.scrollHeight);
await new Promise(resolve => setTimeout(resolve, 2000));

// Extract data
const items = Array.from(document.querySelectorAll('.item')).map(item => ({
    title: item.querySelector('.title')?.textContent,
    price: item.querySelector('.price')?.textContent
}));

return items;
"""

content = await get(url, script=complex_script)
items = content.script_result  # List of extracted items

Before/After Pattern

content = await get(
    url,
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)

if isinstance(content.script_result, dict):
    print(f"Action result: {content.script_result['script_before']}")
    print(f"Items count: {content.script_result['script_after']}")

Error Handling

content = await get(url, script="document.querySelector('.missing').click()")

if content.has_script_error:
    print(f"JavaScript error: {content.script_error}")
    # Use fallback content
    print(f"Fallback: {content.text[:100]}")
else:
    print(f"Result: {content.script_result}")

Framework Detection

React Applications

react_script = """
if (window.React) {
    return {
        framework: 'React',
        version: React.version,
        hasRouter: !!window.ReactRouter,
        componentCount: document.querySelectorAll('[data-reactroot] *').length
    };
}
return null;
"""

content = await get("https://react-app.com", script=react_script)

Vue Applications

vue_script = """
if (window.Vue) {
    return {
        framework: 'Vue',
        version: Vue.version,
        hasRouter: !!window.VueRouter,
        hasVuex: !!window.Vuex
    };
}
return null;
"""

content = await get("https://vue-app.com", script=vue_script)

Performance Optimization

Batch Processing

import asyncio

# Process large URL lists efficiently
urls = [f"https://site.com/page/{i}" for i in range(100)]

# Process in batches
batch_size = 10
all_results = []

for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]
    results = await get_many(batch, max_concurrent=5)
    all_results.extend(results)
    
    # Rate limiting
    await asyncio.sleep(1)

Memory Management

# For long-running processes
import gc

for batch in url_batches:
    results = await get_many(batch)
    process_results(results)
    
    # Clear references and force garbage collection
    del results
    gc.collect()

Timeout Configuration

# Adjust timeouts based on site characteristics
fast_sites = await get_many(urls, timeout=10)
slow_sites = await get_many(urls, timeout=60)

MCP Integration

Server Setup

from crawailer.mcp import create_mcp_server

# Create MCP server with default tools
server = create_mcp_server()

# Custom MCP tool
@server.tool("extract_product_data")
async def extract_product_data(url: str) -> dict:
    content = await get(
        url,
        script="""
        ({
            name: document.querySelector('.product-name')?.textContent,
            price: document.querySelector('.price')?.textContent,
            rating: document.querySelector('.rating')?.textContent
        })
        """
    )
    
    return {
        'title': content.title,
        'product_data': content.script_result,
        'metadata': {
            'word_count': content.word_count,
            'quality_score': content.quality_score
        }
    }

CLI Interface

Basic Commands

# Extract content from URL
crawailer get https://example.com

# Batch processing
crawailer get-many urls.txt --output results.json

# Discovery
crawailer discover "AI research" --max-pages 10

# Setup (install browsers)
crawailer setup

JavaScript Execution

# Execute JavaScript
crawailer get https://spa.com --script "document.title" --wait-for ".loaded"

# Save with script results
crawailer get https://dynamic.com --script "window.data" --output content.json

Advanced Usage

Custom Content Extractors

from crawailer.content import ContentExtractor

class CustomExtractor(ContentExtractor):
    async def extract(self, page_data):
        content = await super().extract(page_data)
        
        # Add custom processing
        if 'product' in content.content_type:
            content.custom_data = self.extract_product_details(content.html)
        
        return content
    
    def extract_product_details(self, html):
        # Custom extraction logic
        pass

# Use custom extractor
from crawailer.api import _get_browser

browser = await _get_browser()
extractor = CustomExtractor()

page_data = await browser.fetch_page(url)
content = await extractor.extract(page_data)

Session Management

from crawailer.browser import Browser

# Persistent browser session
browser = Browser()
await browser.start()

try:
    # Login
    await browser.fetch_page(
        "https://site.com/login",
        script_after="""
        document.querySelector('#username').value = 'user';
        document.querySelector('#password').value = 'pass';
        document.querySelector('#login').click();
        """
    )
    
    # Access protected content
    protected_content = await browser.fetch_page("https://site.com/dashboard")
    
finally:
    await browser.close()

This API reference provides comprehensive documentation for all Crawailer functionality, with particular emphasis on the JavaScript execution capabilities that set it apart from traditional web scrapers.