crawailer/docs/JAVASCRIPT_API.md

# Crawailer JavaScript API Documentation

## Overview

Crawailer provides comprehensive JavaScript execution capabilities that enable dynamic content extraction from modern web applications. Unlike traditional HTTP scrapers, Crawailer uses a real browser (Playwright) to execute JavaScript and extract content from single-page applications (SPAs), dynamic sites, and JavaScript-heavy pages.

## Key Features

- **Full JavaScript Execution**: Execute arbitrary JavaScript code using `page.evaluate()`
- **Before/After Script Patterns**: Run scripts before and after content extraction
- **SPA Support**: Handle React, Vue, Angular, and other modern frameworks
- **Dynamic Content**: Extract content that's loaded via AJAX or user interactions
- **Error Handling**: Comprehensive error capture and graceful degradation
- **Performance Monitoring**: Extract timing and memory metrics
- **User Interaction**: Simulate clicks, form submissions, and complex workflows

## Basic Usage

### Simple JavaScript Execution

```python
from crawailer import get

# Extract dynamic content
content = await get(
    "https://example.com",
    script="document.querySelector('.dynamic-price').innerText"
)

print(f"Price: {content.script_result}")
print(f"Has script result: {content.has_script_result}")
```

### Waiting for Dynamic Content

```python
# Wait for element and extract data
content = await get(
    "https://spa-app.com",
    script="document.querySelector('.loaded-content').textContent",
    wait_for=".loaded-content"  # Wait for element to appear
)
```

### Complex JavaScript Operations

```python
# Execute complex JavaScript
complex_script = """
// Scroll to load more content
window.scrollTo(0, document.body.scrollHeight);

// Wait for new content to load
await new Promise(resolve => setTimeout(resolve, 2000));

// Extract all product data
const products = Array.from(document.querySelectorAll('.product')).map(p => ({
    name: p.querySelector('.name')?.textContent,
    price: p.querySelector('.price')?.textContent,
    rating: p.querySelector('.rating')?.textContent
}));

return products;
"""

content = await get("https://ecommerce-site.com", script=complex_script)
products = content.script_result
```

## Advanced Patterns

### Before/After Script Execution

```python
# Execute script before content extraction, then after
content = await get(
    "https://dynamic-site.com",
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)

if isinstance(content.script_result, dict):
    print(f"Triggered loading: {content.script_result['script_before']}")
    print(f"Items loaded: {content.script_result['script_after']}")
```

### Form Interaction and Submission

```python
# Fill and submit forms
form_script = """
// Fill login form
document.querySelector('#username').value = 'testuser';
document.querySelector('#password').value = 'testpass';

// Submit form
document.querySelector('#login-form').submit();

// Wait for redirect
await new Promise(resolve => setTimeout(resolve, 3000));

return 'form submitted';
"""

content = await get("https://app.com/login", script=form_script)
```

### Performance Monitoring

```python
# Extract performance metrics
perf_script = """
({
    loadTime: performance.timing.loadEventEnd - performance.timing.navigationStart,
    domReady: performance.timing.domContentLoadedEventEnd - performance.timing.navigationStart,
    resources: performance.getEntriesByType('resource').length,
    memory: performance.memory ? {
        used: Math.round(performance.memory.usedJSHeapSize / 1024 / 1024),
        total: Math.round(performance.memory.totalJSHeapSize / 1024 / 1024)
    } : null
})
"""

content = await get("https://example.com", script=perf_script)
metrics = content.script_result
print(f"Load time: {metrics['loadTime']}ms")
```

## Batch Processing

### Same Script for Multiple URLs

```python
from crawailer import get_many

urls = [
    "https://site1.com/product/1",
    "https://site1.com/product/2",
    "https://site1.com/product/3"
]

# Extract price from all products
results = await get_many(
    urls,
    script="document.querySelector('.price')?.textContent"
)

for result in results:
    if result and result.script_result:
        print(f"{result.url}: {result.script_result}")
```

### Different Scripts per URL

```python
# Custom script for each URL
urls = ["https://react-app.com", "https://vue-app.com", "https://angular-app.com"]
scripts = [
    "window.React ? 'React ' + React.version : 'No React'",
    "window.Vue ? 'Vue ' + Vue.version : 'No Vue'",
    "window.ng ? 'Angular detected' : 'No Angular'"
]

results = await get_many(urls, script=scripts)
```

## Intelligent Discovery

### Search Result Interaction

```python
from crawailer import discover

# Discover content with JavaScript interaction
results = await discover(
    "machine learning tutorials",
    script="document.querySelector('.show-more')?.click()",
    content_script="document.querySelector('.read-time')?.textContent",
    max_pages=5
)

for result in results:
    print(f"{result.title} - Reading time: {result.script_result}")
```

### Pagination Handling

```python
# Handle infinite scroll
pagination_script = """
let results = [];
let page = 0;

while (page < 3) {  // Load 3 pages
    // Scroll to bottom
    window.scrollTo(0, document.body.scrollHeight);

    // Wait for new content
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Extract current page items
    const items = Array.from(document.querySelectorAll('.item')).map(item =>
        item.textContent.trim()
    );

    results.push(...items);
    page++;
}

return results;
"""

content = await get("https://infinite-scroll-site.com", script=pagination_script)
```

## Error Handling

### JavaScript Error Capture

```python
content = await get(
    "https://example.com",
    script="document.querySelector('.nonexistent').click()"
)

if content.has_script_error:
    print(f"JavaScript error: {content.script_error}")
else:
    print(f"Result: {content.script_result}")
```

### Graceful Degradation

```python
# Try JavaScript, fall back to static content
try:
    content = await get(
        "https://dynamic-site.com",
        script="window.dynamicData || 'fallback'"
    )

    if content.has_script_error:
        # JavaScript failed, but we still have static content
        print(f"Using static content: {content.text[:100]}")
    else:
        print(f"Dynamic data: {content.script_result}")

except Exception as e:
    print(f"Complete failure: {e}")
```

## Modern Framework Integration

### React Applications

```python
# Extract React component data
react_script = """
// Find React root
const reactRoot = document.querySelector('[data-reactroot]') || document.querySelector('#root');

if (window.React && reactRoot) {
    // Get React fiber data (React 16+)
    const fiberKey = Object.keys(reactRoot).find(key => key.startsWith('__reactInternalInstance'));

    return {
        framework: 'React',
        version: React.version,
        hasRouter: !!window.ReactRouter,
        componentCount: document.querySelectorAll('[data-reactroot] *').length
    };
}

return null;
"""

content = await get("https://react-app.com", script=react_script)
```

### Vue Applications

```python
# Extract Vue app data
vue_script = """
if (window.Vue) {
    const app = document.querySelector('#app');

    return {
        framework: 'Vue',
        version: Vue.version,
        hasRouter: !!window.VueRouter,
        hasVuex: !!window.Vuex,
        rootComponent: app?.__vue__?.$options.name || 'unknown'
    };
}

return null;
"""

content = await get("https://vue-app.com", script=vue_script)
```

### Angular Applications

```python
# Extract Angular app data
angular_script = """
if (window.ng) {
    const platform = window.ng.platform || {};

    return {
        framework: 'Angular',
        version: window.ng.version?.full || 'unknown',
        hasRouter: !!window.ng.router,
        modules: Object.keys(platform).length
    };
}

return null;
"""

content = await get("https://angular-app.com", script=angular_script)
```

## WebContent Integration

### Accessing JavaScript Results

```python
content = await get("https://example.com", script="document.title")

# JavaScript result is available in WebContent object
print(f"Script result: {content.script_result}")
print(f"Has result: {content.has_script_result}")
print(f"Has error: {content.has_script_error}")

# Also access traditional content
print(f"Title: {content.title}")
print(f"Text: {content.text[:100]}")
print(f"Markdown: {content.markdown[:100]}")
```

### Combining Static and Dynamic Data

```python
# Extract both static content and dynamic data
dynamic_script = """
({
    dynamicPrice: document.querySelector('.dynamic-price')?.textContent,
    userCount: document.querySelector('.user-count')?.textContent,
    lastUpdated: document.querySelector('.last-updated')?.textContent
})
"""

content = await get("https://dashboard.com", script=dynamic_script)

# Use both static and dynamic content
analysis = {
    'title': content.title,
    'word_count': content.word_count,
    'reading_time': content.reading_time,
    'dynamic_data': content.script_result
}
```

## Performance Considerations

### Optimize JavaScript Execution

```python
# Lightweight scripts for better performance
fast_script = "document.title"  # Simple, fast

# Avoid heavy DOM operations
slow_script = """
// This is expensive - avoid if possible
const allElements = document.querySelectorAll('*');
return Array.from(allElements).map(el => el.tagName);
"""
```

### Batch Processing Optimization

```python
# Process in smaller batches for better memory usage
urls = [f"https://site.com/page/{i}" for i in range(100)]

batch_size = 10
results = []

for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]
    batch_results = await get_many(batch, script="document.title")
    results.extend(batch_results)

    # Optional: small delay between batches
    await asyncio.sleep(1)
```

## Best Practices

### 1. Script Design

```python
# ✅ Good: Simple, focused scripts
good_script = "document.querySelector('.price').textContent"

# ❌ Avoid: Complex scripts that could fail
bad_script = """
try {
    const price = document.querySelector('.price').textContent.split('$')[1];
    const discountedPrice = parseFloat(price) * 0.9;
    return `$${discountedPrice.toFixed(2)}`;
} catch (e) {
    return null;
}
"""
```

### 2. Error Handling

```python
# Always check for script errors
content = await get(url, script=script)

if content.has_script_error:
    # Handle the error appropriately
    logging.warning(f"JavaScript error on {url}: {content.script_error}")
    # Use fallback approach
else:
    # Process successful result
    process_result(content.script_result)
```

### 3. Performance Monitoring

```python
import time

start_time = time.time()
content = await get(url, script=script)
duration = time.time() - start_time

if duration > 10:  # If taking too long
    logging.warning(f"Slow JavaScript execution on {url}: {duration:.2f}s")
```

## Common Use Cases

### E-commerce Data Extraction

```python
# Extract product information
product_script = """
({
    name: document.querySelector('.product-name')?.textContent,
    price: document.querySelector('.price')?.textContent,
    rating: document.querySelector('.rating')?.textContent,
    availability: document.querySelector('.stock-status')?.textContent,
    images: Array.from(document.querySelectorAll('.product-image img')).map(img => img.src)
})
"""

content = await get("https://shop.com/product/123", script=product_script)
product_data = content.script_result
```

### Social Media Content

```python
# Extract social media posts (be respectful of terms of service)
social_script = """
Array.from(document.querySelectorAll('.post')).slice(0, 10).map(post => ({
    text: post.querySelector('.post-text')?.textContent,
    author: post.querySelector('.author')?.textContent,
    timestamp: post.querySelector('.timestamp')?.textContent,
    likes: post.querySelector('.likes-count')?.textContent
}))
"""

content = await get("https://social-site.com/feed", script=social_script)
posts = content.script_result
```

### News and Articles

```python
# Extract article metadata
article_script = """
({
    headline: document.querySelector('h1')?.textContent,
    author: document.querySelector('.author')?.textContent,
    publishDate: document.querySelector('.publish-date')?.textContent,
    readingTime: document.querySelector('.reading-time')?.textContent,
    tags: Array.from(document.querySelectorAll('.tag')).map(tag => tag.textContent),
    wordCount: document.querySelector('.article-body')?.textContent.split(' ').length
})
"""

content = await get("https://news-site.com/article/123", script=article_script)
```

## Integration with AI Workflows

### Content Preparation for LLMs

```python
# Extract structured content for AI processing
ai_script = """
({
    mainContent: document.querySelector('main')?.textContent,
    headings: Array.from(document.querySelectorAll('h1, h2, h3')).map(h => ({
        level: h.tagName,
        text: h.textContent
    })),
    keyPoints: Array.from(document.querySelectorAll('.highlight, .callout')).map(el => el.textContent),
    metadata: {
        wordCount: document.body.textContent.split(' ').length,
        readingLevel: 'advanced',  // Could be calculated
        topics: Array.from(document.querySelectorAll('.topic-tag')).map(tag => tag.textContent)
    }
})
"""

content = await get("https://technical-blog.com/post", script=ai_script)
structured_data = content.script_result

# Now ready for AI processing
ai_prompt = f"""
Analyze this content:

Title: {content.title}
Main Content: {structured_data['mainContent'][:1000]}...
Key Points: {structured_data['keyPoints']}
Topics: {structured_data['metadata']['topics']}

Provide a summary and key insights.
"""
```

## Troubleshooting

### Common Issues

1. **Script Timeout**
   ```python
   # Increase timeout for slow scripts
   content = await get(url, script=script, timeout=60)
   ```

2. **Element Not Found**
   ```python
   # Use optional chaining and fallbacks
   safe_script = """
   document.querySelector('.target')?.textContent || 'not found'
   """
   ```

3. **JavaScript Not Loaded**
   ```python
   # Wait for JavaScript frameworks to load
   content = await get(
       url,
       script="typeof React !== 'undefined' ? React.version : 'React not loaded'",
       wait_for="[data-reactroot]"
   )
   ```

### Debug Mode

```python
# Enable verbose logging for debugging
import logging
logging.basicConfig(level=logging.DEBUG)

content = await get(url, script=script)
```

This comprehensive JavaScript API enables Crawailer to handle modern web applications with the same ease as static sites, making it ideal for AI workflows that require rich, accurate content extraction.