
# Crawailer API Reference

## Core Functions

### `get(url, **options) -> WebContent`

Extract content from a single URL with optional JavaScript execution.

**Parameters:**
- `url` (str): The URL to fetch
- `wait_for` (str, optional): CSS selector to wait for before extraction
- `timeout` (int, default=30): Request timeout in seconds
- `clean` (bool, default=True): Whether to clean and optimize content
- `extract_links` (bool, default=True): Whether to extract links
- `extract_metadata` (bool, default=True): Whether to extract metadata
- `script` (str, optional): JavaScript to execute (alias for `script_before`)
- `script_before` (str, optional): JavaScript to execute before content extraction
- `script_after` (str, optional): JavaScript to execute after content extraction

**Returns:** `WebContent` object with extracted content and metadata

**Example:**

```python
# Basic usage
content = await get("https://example.com")

# With JavaScript execution
content = await get(
    "https://dynamic-site.com",
    script="document.querySelector('.price').textContent",
    wait_for=".price-loaded"
)

# Before/after pattern
content = await get(
    "https://spa.com",
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)
```

### `get_many(urls, **options) -> List[WebContent]`

Extract content from multiple URLs efficiently with concurrent processing.

**Parameters:**
- `urls` (List[str]): List of URLs to fetch
- `max_concurrent` (int, default=5): Maximum concurrent requests
- `timeout` (int, default=30): Request timeout per URL
- `clean` (bool, default=True): Whether to clean content
- `progress` (bool, default=False): Whether to show a progress bar
- `script` (str | List[str], optional): JavaScript for all URLs or per-URL scripts

**Returns:** `List[WebContent]` (failed URLs return `None`)

**Example:**

```python
# Batch processing
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await get_many(urls, max_concurrent=3)

# Same script for all URLs
results = await get_many(
    urls,
    script="document.querySelector('.title').textContent"
)

# Different scripts per URL
scripts = [
    "document.title",
    "document.querySelector('.price').textContent",
    "document.querySelectorAll('.item').length"
]
results = await get_many(urls, script=scripts)
```
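
Because failed URLs yield `None`, filter the results before use. A minimal sketch, assuming results come back in the same order as the input URLs:

```python
results = await get_many(urls)

# Pair each URL with its result; keep only successful fetches
successful = {url: r for url, r in zip(urls, results) if r is not None}
failed = [url for url, r in zip(urls, results) if r is None]

print(f"Fetched {len(successful)} of {len(urls)} pages ({len(failed)} failed)")
```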

### `discover(query, **options) -> List[WebContent]`

Intelligently discover and rank content related to a query.

**Parameters:**
- `query` (str): Search query or topic description
- `max_pages` (int, default=10): Maximum results to return
- `quality_threshold` (float, default=0.7): Minimum quality score
- `recency_bias` (bool, default=True): Prefer recent content
- `source_types` (List[str], optional): Filter by source types
- `script` (str, optional): JavaScript for search results pages
- `content_script` (str, optional): JavaScript for discovered content pages

**Returns:** `List[WebContent]` ranked by relevance and quality

**Example:**

```python
# Basic discovery
results = await discover("machine learning tutorials")

# With JavaScript interaction
results = await discover(
    "AI research papers",
    script="document.querySelector('.show-more')?.click()",
    content_script="document.querySelector('.abstract').textContent",
    max_pages=5
)
```
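
The ranking parameters can be combined to trade recall for quality. A sketch using only the documented options (the threshold value is illustrative):

```python
results = await discover(
    "database internals",
    quality_threshold=0.85,  # keep only high-scoring pages
    recency_bias=False,      # do not privilege recent content
    max_pages=20
)
```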

### `cleanup()`

Clean up global browser resources.

**Example:**

```python
# Clean up at end of script
await cleanup()
```
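
In long-running scripts it is safer to guarantee cleanup even when extraction fails. A minimal sketch, assuming `get` and `cleanup` are importable from the package top level as in the other examples:

```python
import asyncio

from crawailer import get, cleanup

async def main():
    try:
        content = await get("https://example.com")
        print(content.title)
    finally:
        # Release the shared browser even if get() raised
        await cleanup()

asyncio.run(main())
```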

## Data Classes

### `WebContent`

Structured representation of extracted web content.

**Core Properties:**
- `url` (str): Source URL
- `title` (str): Extracted page title
- `markdown` (str): LLM-optimized markdown content
- `text` (str): Clean human-readable text
- `html` (str): Original HTML content

**Metadata Properties:**
- `author` (str | None): Content author
- `published` (datetime | None): Publication date
- `reading_time` (str): Estimated reading time
- `word_count` (int): Word count
- `language` (str): Content language
- `quality_score` (float): Content quality (0-10)

**Semantic Properties:**
- `content_type` (str): Detected content type (article, product, etc.)
- `topics` (List[str]): Extracted topics
- `entities` (Dict[str, List[str]]): Named entities

**Relationship Properties:**
- `links` (List[Dict]): Extracted links with metadata
- `images` (List[Dict]): Image information

**Technical Properties:**
- `status_code` (int): HTTP status code
- `load_time` (float): Page load time
- `content_hash` (str): Content hash for deduplication
- `extracted_at` (datetime): Extraction timestamp

**JavaScript Properties:**
- `script_result` (Any | None): JavaScript execution result
- `script_error` (str | None): JavaScript execution error

**Computed Properties:**
- `summary` (str): Brief content summary
- `readable_summary` (str): Human-friendly summary with metadata
- `has_script_result` (bool): Whether a JavaScript result is available
- `has_script_error` (bool): Whether a JavaScript error occurred

**Methods:**
- `save(path, format="auto")`: Save content to file

**Example:**

```python
content = await get("https://example.com", script="document.title")

# Access content
print(content.title)
print(content.markdown[:100])
print(content.text[:100])

# Access metadata
print(f"Author: {content.author}")
print(f"Reading time: {content.reading_time}")
print(f"Quality: {content.quality_score}/10")

# Access JavaScript results
if content.has_script_result:
    print(f"Script result: {content.script_result}")

if content.has_script_error:
    print(f"Script error: {content.script_error}")

# Save content
content.save("article.md")    # Saves as markdown
content.save("article.json")  # Saves as JSON with all metadata
```
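
Because `content_hash` is stable for identical content, it can drive simple deduplication across a crawl. A sketch built only on documented properties:

```python
seen_hashes = set()
unique_pages = []

for page in await get_many(urls):
    if page is None:
        continue  # failed fetch
    if page.content_hash in seen_hashes:
        continue  # duplicate of a page we already kept
    seen_hashes.add(page.content_hash)
    unique_pages.append(page)
```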

### `BrowserConfig`

Configuration for browser behavior.

**Properties:**
- `headless` (bool, default=True): Run browser in headless mode
- `timeout` (int, default=30000): Request timeout in milliseconds
- `user_agent` (str | None): Custom user agent
- `viewport` (Dict[str, int], default={"width": 1920, "height": 1080}): Viewport size
- `extra_args` (List[str], default=[]): Additional browser arguments

**Example:**

```python
from crawailer import BrowserConfig, Browser

config = BrowserConfig(
    headless=False,  # Show browser window
    timeout=60000,   # 60 second timeout
    user_agent="Custom Bot 1.0",
    viewport={"width": 1280, "height": 720}
)

browser = Browser(config)
```
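
The `extra_args` list appears to pass additional launch flags straight to the browser. A sketch with common Chromium flags (the specific flags are illustrative, not required):

```python
config = BrowserConfig(
    headless=True,
    extra_args=["--disable-gpu", "--no-sandbox"]  # illustrative Chromium flags
)
```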

## Browser Class

Lower-level browser control for advanced use cases.

### `Browser(config=None)`

**Methods:**

#### `async start()`

Initialize the browser instance.

#### `async close()`

Clean up browser resources.

#### `async fetch_page(url, **options) -> Dict[str, Any]`

Fetch a single page with full control.

**Parameters:**
- `url` (str): URL to fetch
- `wait_for` (str, optional): CSS selector to wait for
- `timeout` (int, default=30): Timeout in seconds
- `stealth` (bool, default=False): Enable stealth mode
- `script_before` (str, optional): JavaScript before content extraction
- `script_after` (str, optional): JavaScript after content extraction

**Returns:** Dictionary with page data

#### `async fetch_many(urls, **options) -> List[Dict[str, Any]]`

Fetch multiple pages concurrently.
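
A minimal sketch of batch fetching; the option names are assumed to mirror `fetch_page()`, and the dictionary keys shown are illustrative:

```python
browser = Browser()

async with browser:
    pages = await browser.fetch_many(
        ["https://example.com", "https://example.org"],
        timeout=30
    )
    for page in pages:
        # Each entry has the same shape as a fetch_page() result
        print(page.get("url"))
```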

#### `async take_screenshot(url, **options) -> bytes`

Take a screenshot of a page.

**Parameters:**
- `url` (str): URL to screenshot
- `selector` (str, optional): CSS selector to screenshot
- `full_page` (bool, default=False): Capture full scrollable page
- `timeout` (int, default=30): Timeout in seconds

**Returns:** Screenshot as PNG bytes

#### `async execute_script(url, script, **options) -> Any`

Execute JavaScript on a page and return the result.

**Example:**

```python
from crawailer import Browser, BrowserConfig

config = BrowserConfig(headless=False)
browser = Browser(config)

async with browser:
    # Fetch page data
    page_data = await browser.fetch_page(
        "https://example.com",
        script_before="window.scrollTo(0, document.body.scrollHeight)",
        script_after="document.querySelectorAll('.item').length"
    )

    # Take screenshot
    screenshot = await browser.take_screenshot("https://example.com")
    with open("screenshot.png", "wb") as f:
        f.write(screenshot)

    # Execute JavaScript
    result = await browser.execute_script(
        "https://example.com",
        "document.title + ' - ' + document.querySelectorAll('a').length + ' links'"
    )
    print(result)
```
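
To capture a single element or the full scrollable page, use the documented `selector` and `full_page` options (the URL and selector here are illustrative):

```python
browser = Browser()

async with browser:
    # Screenshot only the element matched by the selector
    element_png = await browser.take_screenshot(
        "https://example.com/pricing",
        selector=".pricing-table"
    )

    # Screenshot the entire scrollable page
    full_png = await browser.take_screenshot(
        "https://example.com",
        full_page=True
    )
```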

## Content Extraction

### `ContentExtractor`

Transforms raw HTML into structured `WebContent`.

**Parameters:**
- `clean` (bool, default=True): Clean and normalize text
- `extract_links` (bool, default=True): Extract link information
- `extract_metadata` (bool, default=True): Extract metadata
- `extract_images` (bool, default=False): Extract image information

**Methods:**

#### `async extract(page_data) -> WebContent`

Extract structured content from page data.

**Example:**

```python
from crawailer.content import ContentExtractor
from crawailer.browser import Browser

browser = Browser()
extractor = ContentExtractor(
    clean=True,
    extract_links=True,
    extract_metadata=True,
    extract_images=True
)

async with browser:
    page_data = await browser.fetch_page("https://example.com")
    content = await extractor.extract(page_data)
    print(content.title)
```

## Error Handling

### Custom Exceptions

```python
from crawailer.exceptions import (
    CrawlerError,           # Base exception
    TimeoutError,           # Request timeout
    CloudflareProtected,    # Cloudflare protection detected
    PaywallDetected,        # Paywall detected
    RateLimitError,         # Rate limit exceeded
    ContentExtractionError  # Content extraction failed
)

try:
    content = await get("https://protected-site.com")
except CloudflareProtected:
    # Try with stealth mode
    content = await get("https://protected-site.com", stealth=True)
except PaywallDetected as e:
    print(f"Paywall detected. Archive URL: {e.archive_url}")
except TimeoutError:
    # Increase timeout
    content = await get("https://slow-site.com", timeout=60)
```
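
For transient failures such as rate limiting, a retry loop with exponential backoff works well. A sketch using the documented `RateLimitError` (retry count and delays are illustrative):

```python
import asyncio

from crawailer.exceptions import RateLimitError

async def get_with_backoff(url, retries=3):
    delay = 1.0
    for attempt in range(retries):
        try:
            return await get(url)
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff
```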

## JavaScript Execution

### Script Patterns

#### Simple Execution

```python
# Extract single value
content = await get(url, script="document.title")
print(content.script_result)  # Page title
```

#### Complex Operations

```python
# Multi-step JavaScript
complex_script = """
// Scroll to load content
window.scrollTo(0, document.body.scrollHeight);
await new Promise(resolve => setTimeout(resolve, 2000));

// Extract data
const items = Array.from(document.querySelectorAll('.item')).map(item => ({
    title: item.querySelector('.title')?.textContent,
    price: item.querySelector('.price')?.textContent
}));

return items;
"""

content = await get(url, script=complex_script)
items = content.script_result  # List of extracted items
```

#### Before/After Pattern

```python
content = await get(
    url,
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)

if isinstance(content.script_result, dict):
    print(f"Action result: {content.script_result['script_before']}")
    print(f"Items count: {content.script_result['script_after']}")
```

#### Error Handling

```python
content = await get(url, script="document.querySelector('.missing').click()")

if content.has_script_error:
    print(f"JavaScript error: {content.script_error}")
    # Use fallback content
    print(f"Fallback: {content.text[:100]}")
else:
    print(f"Result: {content.script_result}")
```

### Framework Detection

#### React Applications

```python
react_script = """
if (window.React) {
    return {
        framework: 'React',
        version: React.version,
        hasRouter: !!window.ReactRouter,
        componentCount: document.querySelectorAll('[data-reactroot] *').length
    };
}
return null;
"""

content = await get("https://react-app.com", script=react_script)
```

#### Vue Applications

```python
vue_script = """
if (window.Vue) {
    return {
        framework: 'Vue',
        version: Vue.version,
        hasRouter: !!window.VueRouter,
        hasVuex: !!window.Vuex
    };
}
return null;
"""

content = await get("https://vue-app.com", script=vue_script)
```

## Performance Optimization

### Batch Processing

```python
import asyncio

# Process large URL lists efficiently
urls = [f"https://site.com/page/{i}" for i in range(100)]

# Process in batches
batch_size = 10
all_results = []

for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]
    results = await get_many(batch, max_concurrent=5)
    all_results.extend(results)

    # Rate limiting between batches
    await asyncio.sleep(1)
```

### Memory Management

```python
# For long-running processes
import gc

for batch in url_batches:
    results = await get_many(batch)
    process_results(results)

    # Clear references and force garbage collection
    del results
    gc.collect()
```

### Timeout Configuration

```python
# Adjust timeouts based on site characteristics
fast_sites = await get_many(urls, timeout=10)
slow_sites = await get_many(urls, timeout=60)
```

## MCP Integration

### Server Setup

```python
from crawailer.mcp import create_mcp_server

# Create MCP server with default tools
server = create_mcp_server()

# Custom MCP tool
@server.tool("extract_product_data")
async def extract_product_data(url: str) -> dict:
    content = await get(
        url,
        script="""
        ({
            name: document.querySelector('.product-name')?.textContent,
            price: document.querySelector('.price')?.textContent,
            rating: document.querySelector('.rating')?.textContent
        })
        """
    )

    return {
        'title': content.title,
        'product_data': content.script_result,
        'metadata': {
            'word_count': content.word_count,
            'quality_score': content.quality_score
        }
    }
```

## CLI Interface

### Basic Commands

```bash
# Extract content from URL
crawailer get https://example.com

# Batch processing
crawailer get-many urls.txt --output results.json

# Discovery
crawailer discover "AI research" --max-pages 10

# Setup (install browsers)
crawailer setup
```

### JavaScript Execution

```bash
# Execute JavaScript
crawailer get https://spa.com --script "document.title" --wait-for ".loaded"

# Save with script results
crawailer get https://dynamic.com --script "window.data" --output content.json
```

## Advanced Usage

### Custom Content Extractors

```python
from crawailer.content import ContentExtractor

class CustomExtractor(ContentExtractor):
    async def extract(self, page_data):
        content = await super().extract(page_data)

        # Add custom processing
        if 'product' in content.content_type:
            content.custom_data = self.extract_product_details(content.html)

        return content

    def extract_product_details(self, html):
        # Custom extraction logic
        pass

# Use custom extractor
from crawailer.api import _get_browser

browser = await _get_browser()
extractor = CustomExtractor()

page_data = await browser.fetch_page(url)
content = await extractor.extract(page_data)
```

### Session Management

```python
from crawailer.browser import Browser

# Persistent browser session
browser = Browser()
await browser.start()

try:
    # Login
    await browser.fetch_page(
        "https://site.com/login",
        script_after="""
        document.querySelector('#username').value = 'user';
        document.querySelector('#password').value = 'pass';
        document.querySelector('#login').click();
        """
    )

    # Access protected content
    protected_content = await browser.fetch_page("https://site.com/dashboard")
finally:
    await browser.close()
```

This API reference provides comprehensive documentation for all Crawailer functionality, with particular emphasis on the JavaScript execution capabilities that set it apart from traditional web scrapers.