crawailer/docs/JAVASCRIPT_API.md
Crawailer Developer d31395a166 Initial Crawailer implementation with comprehensive JavaScript API
- Complete browser automation with Playwright integration
- High-level API functions: get(), get_many(), discover()
- JavaScript execution support with script parameters
- Content extraction optimized for LLM workflows
- Comprehensive test suite with 18 test files (700+ scenarios)
- Local Caddy test server for reproducible testing
- Performance benchmarking vs Katana crawler
- Complete documentation including JavaScript API guide
- PyPI-ready packaging with professional metadata
- UNIX philosophy: do web scraping exceptionally well
2025-09-18 14:47:59 -06:00

15 KiB

Crawailer JavaScript API Documentation

Overview

Crawailer provides comprehensive JavaScript execution capabilities that enable dynamic content extraction from modern web applications. Unlike traditional HTTP scrapers, Crawailer uses a real browser (Playwright) to execute JavaScript and extract content from single-page applications (SPAs), dynamic sites, and JavaScript-heavy pages.

Key Features

  • Full JavaScript Execution: Execute arbitrary JavaScript code using page.evaluate()
  • Before/After Script Patterns: Run scripts before and after content extraction
  • SPA Support: Handle React, Vue, Angular, and other modern frameworks
  • Dynamic Content: Extract content that's loaded via AJAX or user interactions
  • Error Handling: Comprehensive error capture and graceful degradation
  • Performance Monitoring: Extract timing and memory metrics
  • User Interaction: Simulate clicks, form submissions, and complex workflows

Basic Usage

Simple JavaScript Execution

from crawailer import get

# Extract dynamic content
content = await get(
    "https://example.com",
    script="document.querySelector('.dynamic-price').innerText"
)

print(f"Price: {content.script_result}")
print(f"Has script result: {content.has_script_result}")

Waiting for Dynamic Content

# Wait for element and extract data
content = await get(
    "https://spa-app.com",
    script="document.querySelector('.loaded-content').textContent",
    wait_for=".loaded-content"  # Wait for element to appear
)

Complex JavaScript Operations

# Execute complex JavaScript
complex_script = """
// Scroll to load more content
window.scrollTo(0, document.body.scrollHeight);

// Wait for new content to load
await new Promise(resolve => setTimeout(resolve, 2000));

// Extract all product data
const products = Array.from(document.querySelectorAll('.product')).map(p => ({
    name: p.querySelector('.name')?.textContent,
    price: p.querySelector('.price')?.textContent,
    rating: p.querySelector('.rating')?.textContent
}));

return products;
"""

content = await get("https://ecommerce-site.com", script=complex_script)
products = content.script_result

Advanced Patterns

Before/After Script Execution

# Execute script before content extraction, then after
content = await get(
    "https://dynamic-site.com",
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)

if isinstance(content.script_result, dict):
    print(f"Triggered loading: {content.script_result['script_before']}")
    print(f"Items loaded: {content.script_result['script_after']}")

Form Interaction and Submission

# Fill and submit forms
form_script = """
// Fill login form
document.querySelector('#username').value = 'testuser';
document.querySelector('#password').value = 'testpass';

// Submit form
document.querySelector('#login-form').submit();

// Wait for redirect
await new Promise(resolve => setTimeout(resolve, 3000));

return 'form submitted';
"""

content = await get("https://app.com/login", script=form_script)

Performance Monitoring

# Extract performance metrics
perf_script = """
({
    loadTime: performance.timing.loadEventEnd - performance.timing.navigationStart,
    domReady: performance.timing.domContentLoadedEventEnd - performance.timing.navigationStart,
    resources: performance.getEntriesByType('resource').length,
    memory: performance.memory ? {
        used: Math.round(performance.memory.usedJSHeapSize / 1024 / 1024),
        total: Math.round(performance.memory.totalJSHeapSize / 1024 / 1024)
    } : null
})
"""

content = await get("https://example.com", script=perf_script)
metrics = content.script_result
print(f"Load time: {metrics['loadTime']}ms")

Batch Processing

Same Script for Multiple URLs

from crawailer import get_many

urls = [
    "https://site1.com/product/1",
    "https://site1.com/product/2", 
    "https://site1.com/product/3"
]

# Extract price from all products
results = await get_many(
    urls,
    script="document.querySelector('.price')?.textContent"
)

for result in results:
    if result and result.script_result:
        print(f"{result.url}: {result.script_result}")

Different Scripts per URL

# Custom script for each URL
urls = ["https://react-app.com", "https://vue-app.com", "https://angular-app.com"]
scripts = [
    "window.React ? 'React ' + React.version : 'No React'",
    "window.Vue ? 'Vue ' + Vue.version : 'No Vue'", 
    "window.ng ? 'Angular detected' : 'No Angular'"
]

results = await get_many(urls, script=scripts)

Intelligent Discovery

Search Result Interaction

from crawailer import discover

# Discover content with JavaScript interaction
results = await discover(
    "machine learning tutorials",
    script="document.querySelector('.show-more')?.click()",
    content_script="document.querySelector('.read-time')?.textContent",
    max_pages=5
)

for result in results:
    print(f"{result.title} - Reading time: {result.script_result}")

Pagination Handling

# Handle infinite scroll
pagination_script = """
let results = [];
let page = 0;

while (page < 3) {  // Load 3 pages
    // Scroll to bottom
    window.scrollTo(0, document.body.scrollHeight);
    
    // Wait for new content
    await new Promise(resolve => setTimeout(resolve, 2000));
    
    // Extract current page items
    const items = Array.from(document.querySelectorAll('.item')).map(item => 
        item.textContent.trim()
    );
    
    results.push(...items);
    page++;
}

return results;
"""

content = await get("https://infinite-scroll-site.com", script=pagination_script)

Error Handling

JavaScript Error Capture

content = await get(
    "https://example.com",
    script="document.querySelector('.nonexistent').click()"
)

if content.has_script_error:
    print(f"JavaScript error: {content.script_error}")
else:
    print(f"Result: {content.script_result}")

Graceful Degradation

# Try JavaScript, fall back to static content
try:
    content = await get(
        "https://dynamic-site.com",
        script="window.dynamicData || 'fallback'"
    )
    
    if content.has_script_error:
        # JavaScript failed, but we still have static content
        print(f"Using static content: {content.text[:100]}")
    else:
        print(f"Dynamic data: {content.script_result}")
        
except Exception as e:
    print(f"Complete failure: {e}")

Modern Framework Integration

React Applications

# Extract React component data
react_script = """
// Find React root
const reactRoot = document.querySelector('[data-reactroot]') || document.querySelector('#root');

if (window.React && reactRoot) {
    // Get React fiber data (React 16+)
    const fiberKey = Object.keys(reactRoot).find(key => key.startsWith('__reactInternalInstance'));
    
    return {
        framework: 'React',
        version: React.version,
        hasRouter: !!window.ReactRouter,
        componentCount: document.querySelectorAll('[data-reactroot] *').length
    };
}

return null;
"""

content = await get("https://react-app.com", script=react_script)

Vue Applications

# Extract Vue app data
vue_script = """
if (window.Vue) {
    const app = document.querySelector('#app');
    
    return {
        framework: 'Vue',
        version: Vue.version,
        hasRouter: !!window.VueRouter,
        hasVuex: !!window.Vuex,
        rootComponent: app?.__vue__?.$options.name || 'unknown'
    };
}

return null;
"""

content = await get("https://vue-app.com", script=vue_script)

Angular Applications

# Extract Angular app data
angular_script = """
if (window.ng) {
    const platform = window.ng.platform || {};
    
    return {
        framework: 'Angular',
        version: window.ng.version?.full || 'unknown',
        hasRouter: !!window.ng.router,
        modules: Object.keys(platform).length
    };
}

return null;
"""

content = await get("https://angular-app.com", script=angular_script)

WebContent Integration

Accessing JavaScript Results

content = await get("https://example.com", script="document.title")

# JavaScript result is available in WebContent object
print(f"Script result: {content.script_result}")
print(f"Has result: {content.has_script_result}")
print(f"Has error: {content.has_script_error}")

# Also access traditional content
print(f"Title: {content.title}")
print(f"Text: {content.text[:100]}")
print(f"Markdown: {content.markdown[:100]}")

Combining Static and Dynamic Data

# Extract both static content and dynamic data
dynamic_script = """
({
    dynamicPrice: document.querySelector('.dynamic-price')?.textContent,
    userCount: document.querySelector('.user-count')?.textContent,
    lastUpdated: document.querySelector('.last-updated')?.textContent
})
"""

content = await get("https://dashboard.com", script=dynamic_script)

# Use both static and dynamic content
analysis = {
    'title': content.title,
    'word_count': content.word_count,
    'reading_time': content.reading_time,
    'dynamic_data': content.script_result
}

Performance Considerations

Optimize JavaScript Execution

# Lightweight scripts for better performance
fast_script = "document.title"  # Simple, fast

# Avoid heavy DOM operations
slow_script = """
// This is expensive - avoid if possible
const allElements = document.querySelectorAll('*');
return Array.from(allElements).map(el => el.tagName);
"""

Batch Processing Optimization

# Process in smaller batches for better memory usage
urls = [f"https://site.com/page/{i}" for i in range(100)]

batch_size = 10
results = []

for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]
    batch_results = await get_many(batch, script="document.title")
    results.extend(batch_results)
    
    # Optional: small delay between batches
    await asyncio.sleep(1)

Best Practices

1. Script Design

# ✅ Good: Simple, focused scripts
good_script = "document.querySelector('.price').textContent"

# ❌ Avoid: Complex scripts that could fail
bad_script = """
try {
    const price = document.querySelector('.price').textContent.split('$')[1];
    const discountedPrice = parseFloat(price) * 0.9;
    return `$${discountedPrice.toFixed(2)}`;
} catch (e) {
    return null;
}
"""

2. Error Handling

# Always check for script errors
content = await get(url, script=script)

if content.has_script_error:
    # Handle the error appropriately
    logging.warning(f"JavaScript error on {url}: {content.script_error}")
    # Use fallback approach
else:
    # Process successful result
    process_result(content.script_result)

3. Performance Monitoring

import time

start_time = time.time()
content = await get(url, script=script)
duration = time.time() - start_time

if duration > 10:  # If taking too long
    logging.warning(f"Slow JavaScript execution on {url}: {duration:.2f}s")

Common Use Cases

E-commerce Data Extraction

# Extract product information
product_script = """
({
    name: document.querySelector('.product-name')?.textContent,
    price: document.querySelector('.price')?.textContent,
    rating: document.querySelector('.rating')?.textContent,
    availability: document.querySelector('.stock-status')?.textContent,
    images: Array.from(document.querySelectorAll('.product-image img')).map(img => img.src)
})
"""

content = await get("https://shop.com/product/123", script=product_script)
product_data = content.script_result

Social Media Content

# Extract social media posts (be respectful of terms of service)
social_script = """
Array.from(document.querySelectorAll('.post')).slice(0, 10).map(post => ({
    text: post.querySelector('.post-text')?.textContent,
    author: post.querySelector('.author')?.textContent,
    timestamp: post.querySelector('.timestamp')?.textContent,
    likes: post.querySelector('.likes-count')?.textContent
}))
"""

content = await get("https://social-site.com/feed", script=social_script)
posts = content.script_result

News and Articles

# Extract article metadata
article_script = """
({
    headline: document.querySelector('h1')?.textContent,
    author: document.querySelector('.author')?.textContent,
    publishDate: document.querySelector('.publish-date')?.textContent,
    readingTime: document.querySelector('.reading-time')?.textContent,
    tags: Array.from(document.querySelectorAll('.tag')).map(tag => tag.textContent),
    wordCount: document.querySelector('.article-body')?.textContent.split(' ').length
})
"""

content = await get("https://news-site.com/article/123", script=article_script)

Integration with AI Workflows

Content Preparation for LLMs

# Extract structured content for AI processing
ai_script = """
({
    mainContent: document.querySelector('main')?.textContent,
    headings: Array.from(document.querySelectorAll('h1, h2, h3')).map(h => ({
        level: h.tagName,
        text: h.textContent
    })),
    keyPoints: Array.from(document.querySelectorAll('.highlight, .callout')).map(el => el.textContent),
    metadata: {
        wordCount: document.body.textContent.split(' ').length,
        readingLevel: 'advanced',  // Could be calculated
        topics: Array.from(document.querySelectorAll('.topic-tag')).map(tag => tag.textContent)
    }
})
"""

content = await get("https://technical-blog.com/post", script=ai_script)
structured_data = content.script_result

# Now ready for AI processing
ai_prompt = f"""
Analyze this content:

Title: {content.title}
Main Content: {structured_data['mainContent'][:1000]}...
Key Points: {structured_data['keyPoints']}
Topics: {structured_data['metadata']['topics']}

Provide a summary and key insights.
"""

Troubleshooting

Common Issues

  1. Script Timeout

    # Increase timeout for slow scripts
    content = await get(url, script=script, timeout=60)
    
  2. Element Not Found

    # Use optional chaining and fallbacks
    safe_script = """
    document.querySelector('.target')?.textContent || 'not found'
    """
    
  3. JavaScript Not Loaded

    # Wait for JavaScript frameworks to load
    content = await get(
        url,
        script="typeof React !== 'undefined' ? React.version : 'React not loaded'",
        wait_for="[data-reactroot]"
    )
    

Debug Mode

# Enable verbose logging for debugging
import logging
logging.basicConfig(level=logging.DEBUG)

content = await get(url, script=script)

This comprehensive JavaScript API enables Crawailer to handle modern web applications with the same ease as static sites, making it ideal for AI workflows that require rich, accurate content extraction.