Crawailer API Reference

Core Functions

get(url, **options) -> WebContent

Extract content from a single URL with optional JavaScript execution.

Parameters:

  • url (str): The URL to fetch
  • wait_for (str, optional): CSS selector to wait for before extraction
  • timeout (int, default=30): Request timeout in seconds
  • clean (bool, default=True): Whether to clean and optimize content
  • extract_links (bool, default=True): Whether to extract links
  • extract_metadata (bool, default=True): Whether to extract metadata
  • script (str, optional): JavaScript to execute (alias for script_before)
  • script_before (str, optional): JavaScript to execute before content extraction
  • script_after (str, optional): JavaScript to execute after content extraction

Returns: WebContent object with extracted content and metadata

Example:

# Basic usage
content = await get("https://example.com")

# With JavaScript execution
content = await get(
    "https://dynamic-site.com",
    script="document.querySelector('.price').textContent",
    wait_for=".price-loaded"
)

# Before/after pattern
content = await get(
    "https://spa.com",
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)

get_many(urls, **options) -> List[WebContent]

Extract content from multiple URLs efficiently with concurrent processing.

Parameters:

  • urls (List[str]): List of URLs to fetch
  • max_concurrent (int, default=5): Maximum concurrent requests
  • timeout (int, default=30): Request timeout per URL
  • clean (bool, default=True): Whether to clean content
  • progress (bool, default=False): Whether to show progress bar
  • script (str | List[str], optional): JavaScript for all URLs or per-URL scripts

Returns: List[WebContent] (entries for failed URLs are None; see the failure-handling sketch after the examples below)

Example:

# Batch processing
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await get_many(urls, max_concurrent=3)

# Same script for all URLs
results = await get_many(
    urls,
    script="document.querySelector('.title').textContent"
)

# Different scripts per URL
scripts = [
    "document.title",
    "document.querySelector('.price').textContent", 
    "document.querySelectorAll('.item').length"
]
results = await get_many(urls, script=scripts)
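
Because entries for failed URLs come back as None, filter the results before processing them. A minimal sketch; pairing URLs with results by position is an assumption:

# Skip failures (positional pairing of urls and results is an assumption)
for url, result in zip(urls, results):
    if result is None:
        print(f"Failed to fetch: {url}")
    else:
        print(f"{url}: {result.title}")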

discover(query, **options) -> List[WebContent]

Intelligently discover and rank content related to a query.

Parameters:

  • query (str): Search query or topic description
  • max_pages (int, default=10): Maximum results to return
  • quality_threshold (float, default=0.7): Minimum quality score
  • recency_bias (bool, default=True): Prefer recent content
  • source_types (List[str], optional): Filter by source types
  • script (str, optional): JavaScript for search results pages
  • content_script (str, optional): JavaScript for discovered content pages

Returns: List[WebContent] ranked by relevance and quality

Example:

# Basic discovery
results = await discover("machine learning tutorials")

# With JavaScript interaction
results = await discover(
    "AI research papers",
    script="document.querySelector('.show-more')?.click()",
    content_script="document.querySelector('.abstract').textContent",
    max_pages=5
)

cleanup()

Clean up global browser resources.

Example:

# Clean up at end of script
await cleanup()

Data Classes

WebContent

Structured representation of extracted web content.

Core Properties:

  • url (str): Source URL
  • title (str): Extracted page title
  • markdown (str): LLM-optimized markdown content
  • text (str): Clean human-readable text
  • html (str): Original HTML content

Metadata Properties:

  • author (str | None): Content author
  • published (datetime | None): Publication date
  • reading_time (str): Estimated reading time
  • word_count (int): Word count
  • language (str): Content language
  • quality_score (float): Content quality (0-10)

Semantic Properties:

  • content_type (str): Detected content type (article, product, etc.)
  • topics (List[str]): Extracted topics
  • entities (Dict[str, List[str]]): Named entities

Relationship Properties:

  • links (List[Dict]): Extracted links with metadata
  • images (List[Dict]): Image information

Technical Properties:

  • status_code (int): HTTP status code
  • load_time (float): Page load time
  • content_hash (str): Content hash for deduplication
  • extracted_at (datetime): Extraction timestamp

JavaScript Properties:

  • script_result (Any | None): JavaScript execution result
  • script_error (str | None): JavaScript execution error

Computed Properties:

  • summary (str): Brief content summary
  • readable_summary (str): Human-friendly summary with metadata
  • has_script_result (bool): Whether JavaScript result is available
  • has_script_error (bool): Whether JavaScript error occurred

Methods:

  • save(path, format="auto"): Save content to file

Example:

content = await get("https://example.com", script="document.title")

# Access content
print(content.title)
print(content.markdown[:100])
print(content.text[:100])

# Access metadata
print(f"Author: {content.author}")
print(f"Reading time: {content.reading_time}")
print(f"Quality: {content.quality_score}/10")

# Access JavaScript results
if content.has_script_result:
    print(f"Script result: {content.script_result}")

if content.has_script_error:
    print(f"Script error: {content.script_error}")

# Save content
content.save("article.md")  # Saves as markdown
content.save("article.json")  # Saves as JSON with all metadata

BrowserConfig

Configuration for browser behavior.

Properties:

  • headless (bool, default=True): Run browser in headless mode
  • timeout (int, default=30000): Request timeout in milliseconds
  • user_agent (str | None): Custom user agent
  • viewport (Dict[str, int], default={"width": 1920, "height": 1080}): Viewport size
  • extra_args (List[str], default=[]): Additional browser arguments

Example:

from crawailer import BrowserConfig, Browser

config = BrowserConfig(
    headless=False,  # Show browser window
    timeout=60000,   # 60 second timeout
    user_agent="Custom Bot 1.0",
    viewport={"width": 1280, "height": 720}
)

browser = Browser(config)
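
Custom launch flags can be supplied via extra_args. A minimal sketch, assuming the arguments are forwarded to the browser launch (the flags themselves are illustrative):

config = BrowserConfig(
    headless=True,
    extra_args=["--disable-gpu", "--no-sandbox"]  # illustrative flags
)
browser = Browser(config)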

Browser Class

Lower-level browser control for advanced use cases.

Browser(config=None)

Methods:

async start()

Initialize the browser instance.

async close()

Clean up browser resources.

async fetch_page(url, **options) -> Dict[str, Any]

Fetch a single page with full control.

Parameters:

  • url (str): URL to fetch
  • wait_for (str, optional): CSS selector to wait for
  • timeout (int, default=30): Timeout in seconds
  • stealth (bool, default=False): Enable stealth mode
  • script_before (str, optional): JavaScript before content extraction
  • script_after (str, optional): JavaScript after content extraction

Returns: Dictionary with page data

async fetch_many(urls, **options) -> List[Dict[str, Any]]

Fetch multiple pages concurrently.
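
A minimal sketch, assuming fetch_many accepts a list of URLs plus the same per-page options as fetch_page and returns one dictionary per URL:

from crawailer import Browser

browser = Browser()
async with browser:
    pages = await browser.fetch_many(
        ["https://example.com/a", "https://example.com/b"],
        wait_for="body"  # per-page options mirroring fetch_page is an assumption
    )
    for page_data in pages:
        print(page_data.get("url"))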

async take_screenshot(url, **options) -> bytes

Take a screenshot of a page.

Parameters:

  • url (str): URL to screenshot
  • selector (str, optional): CSS selector to screenshot
  • full_page (bool, default=False): Capture full scrollable page
  • timeout (int, default=30): Timeout in seconds

Returns: Screenshot as PNG bytes

async execute_script(url, script, **options) -> Any

Execute JavaScript on a page and return result.

Example:

from crawailer import Browser, BrowserConfig

config = BrowserConfig(headless=False)
browser = Browser(config)

async with browser:
    # Fetch page data
    page_data = await browser.fetch_page(
        "https://example.com",
        script_before="window.scrollTo(0, document.body.scrollHeight)",
        script_after="document.querySelectorAll('.item').length"
    )
    
    # Take screenshot
    screenshot = await browser.take_screenshot("https://example.com")
    with open("screenshot.png", "wb") as f:
        f.write(screenshot)
    
    # Execute JavaScript
    result = await browser.execute_script(
        "https://example.com",
        "document.title + ' - ' + document.querySelectorAll('a').length + ' links'"
    )
    print(result)

Content Extraction

ContentExtractor

Transforms raw HTML into structured WebContent.

Parameters:

  • clean (bool, default=True): Clean and normalize text
  • extract_links (bool, default=True): Extract link information
  • extract_metadata (bool, default=True): Extract metadata
  • extract_images (bool, default=False): Extract image information

Methods:

async extract(page_data) -> WebContent

Extract structured content from page data.

Example:

from crawailer.content import ContentExtractor
from crawailer.browser import Browser

browser = Browser()
extractor = ContentExtractor(
    clean=True,
    extract_links=True,
    extract_metadata=True,
    extract_images=True
)

async with browser:
    page_data = await browser.fetch_page("https://example.com")
    content = await extractor.extract(page_data)
    print(content.title)

Error Handling

Custom Exceptions

from crawailer.exceptions import (
    CrawlerError,           # Base exception
    TimeoutError,           # Request timeout
    CloudflareProtected,    # Cloudflare protection detected
    PaywallDetected,        # Paywall detected
    RateLimitError,         # Rate limit exceeded
    ContentExtractionError  # Content extraction failed
)

try:
    content = await get("https://protected-site.com")
except CloudflareProtected:
    # Try with stealth mode
    content = await get("https://protected-site.com", stealth=True)
except PaywallDetected as e:
    print(f"Paywall detected. Archive URL: {e.archive_url}")
except TimeoutError:
    # Increase timeout
    content = await get("https://slow-site.com", timeout=60)

JavaScript Execution

Script Patterns

Simple Execution

# Extract single value
content = await get(url, script="document.title")
print(content.script_result)  # Page title

Complex Operations

# Multi-step JavaScript
complex_script = """
// Scroll to load content
window.scrollTo(0, document.body.scrollHeight);
await new Promise(resolve => setTimeout(resolve, 2000));

// Extract data
const items = Array.from(document.querySelectorAll('.item')).map(item => ({
    title: item.querySelector('.title')?.textContent,
    price: item.querySelector('.price')?.textContent
}));

return items;
"""

content = await get(url, script=complex_script)
items = content.script_result  # List of extracted items

Before/After Pattern

content = await get(
    url,
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)

if isinstance(content.script_result, dict):
    print(f"Action result: {content.script_result['script_before']}")
    print(f"Items count: {content.script_result['script_after']}")

Error Handling

content = await get(url, script="document.querySelector('.missing').click()")

if content.has_script_error:
    print(f"JavaScript error: {content.script_error}")
    # Use fallback content
    print(f"Fallback: {content.text[:100]}")
else:
    print(f"Result: {content.script_result}")

Framework Detection

React Applications

react_script = """
if (window.React) {
    return {
        framework: 'React',
        version: React.version,
        hasRouter: !!window.ReactRouter,
        componentCount: document.querySelectorAll('[data-reactroot] *').length
    };
}
return null;
"""

content = await get("https://react-app.com", script=react_script)

Vue Applications

vue_script = """
if (window.Vue) {
    return {
        framework: 'Vue',
        version: Vue.version,
        hasRouter: !!window.VueRouter,
        hasVuex: !!window.Vuex
    };
}
return null;
"""

content = await get("https://vue-app.com", script=vue_script)

Performance Optimization

Batch Processing

import asyncio

# Process large URL lists efficiently
urls = [f"https://site.com/page/{i}" for i in range(100)]

# Process in batches
batch_size = 10
all_results = []

for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]
    results = await get_many(batch, max_concurrent=5)
    all_results.extend(results)
    
    # Rate limiting
    await asyncio.sleep(1)

Memory Management

# For long-running processes
import gc

for batch in url_batches:
    results = await get_many(batch)
    process_results(results)
    
    # Clear references and force garbage collection
    del results
    gc.collect()

Timeout Configuration

# Adjust timeouts based on site characteristics
fast_sites = await get_many(urls, timeout=10)
slow_sites = await get_many(urls, timeout=60)

MCP Integration

Server Setup

from crawailer.mcp import create_mcp_server

# Create MCP server with default tools
server = create_mcp_server()

# Custom MCP tool
@server.tool("extract_product_data")
async def extract_product_data(url: str) -> dict:
    content = await get(
        url,
        script="""
        ({
            name: document.querySelector('.product-name')?.textContent,
            price: document.querySelector('.price')?.textContent,
            rating: document.querySelector('.rating')?.textContent
        })
        """
    )
    
    return {
        'title': content.title,
        'product_data': content.script_result,
        'metadata': {
            'word_count': content.word_count,
            'quality_score': content.quality_score
        }
    }

CLI Interface

Basic Commands

# Extract content from URL
crawailer get https://example.com

# Batch processing
crawailer get-many urls.txt --output results.json

# Discovery
crawailer discover "AI research" --max-pages 10

# Setup (install browsers)
crawailer setup

JavaScript Execution

# Execute JavaScript
crawailer get https://spa.com --script "document.title" --wait-for ".loaded"

# Save with script results
crawailer get https://dynamic.com --script "window.data" --output content.json

Advanced Usage

Custom Content Extractors

from crawailer.content import ContentExtractor

class CustomExtractor(ContentExtractor):
    async def extract(self, page_data):
        content = await super().extract(page_data)
        
        # Add custom processing
        if 'product' in content.content_type:
            content.custom_data = self.extract_product_details(content.html)
        
        return content
    
    def extract_product_details(self, html):
        # Custom extraction logic
        pass

# Use custom extractor
from crawailer.api import _get_browser

browser = await _get_browser()
extractor = CustomExtractor()

page_data = await browser.fetch_page(url)
content = await extractor.extract(page_data)

Session Management

from crawailer.browser import Browser

# Persistent browser session
browser = Browser()
await browser.start()

try:
    # Login
    await browser.fetch_page(
        "https://site.com/login",
        script_after="""
        document.querySelector('#username').value = 'user';
        document.querySelector('#password').value = 'pass';
        document.querySelector('#login').click();
        """
    )
    
    # Access protected content
    protected_content = await browser.fetch_page("https://site.com/dashboard")
    
finally:
    await browser.close()

This API reference provides comprehensive documentation for all Crawailer functionality, with particular emphasis on the JavaScript execution capabilities that set it apart from traditional web scrapers.