Crawailer Developer d31395a166 Initial Crawailer implementation with comprehensive JavaScript API

- Complete browser automation with Playwright integration
- High-level API functions: get(), get_many(), discover()
- JavaScript execution support with script parameters
- Content extraction optimized for LLM workflows
- Comprehensive test suite with 18 test files (700+ scenarios)
- Local Caddy test server for reproducible testing
- Performance benchmarking vs Katana crawler
- Complete documentation including JavaScript API guide
- PyPI-ready packaging with professional metadata
- UNIX philosophy: do web scraping exceptionally well

2025-09-18 14:47:59 -06:00

15 KiB


Crawailer JavaScript API Documentation

Overview

Crawailer runs pages in a real browser through Playwright, so it can execute JavaScript and extract content that plain HTTP scrapers never see: single-page applications (SPAs), dynamically loaded sections, and other JavaScript-heavy pages.

Key Features

Full JavaScript Execution: Execute arbitrary JavaScript code using page.evaluate()
Before/After Script Patterns: Run scripts before and after content extraction
SPA Support: Handle React, Vue, Angular, and other modern frameworks
Dynamic Content: Extract content that's loaded via AJAX or user interactions
Error Handling: Comprehensive error capture and graceful degradation
Performance Monitoring: Extract timing and memory metrics
User Interaction: Simulate clicks, form submissions, and complex workflows

Basic Usage

Simple JavaScript Execution

from crawailer import get

# Extract dynamic content
content = await get(
    "https://example.com",
    script="document.querySelector('.dynamic-price').innerText"
)

print(f"Price: {content.script_result}")
print(f"Has script result: {content.has_script_result}")
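
The snippets in this guide use a bare `await`, which only works inside a coroutine or an async-aware REPL. A minimal sketch of running one from a regular script follows; `fake_get` is a hypothetical stand-in for `crawailer.get` (it returns canned data) so the example runs without a browser.

```python
import asyncio
from types import SimpleNamespace


async def fake_get(url: str, script: str = "") -> SimpleNamespace:
    """Hypothetical stand-in for crawailer.get; returns canned data."""
    await asyncio.sleep(0)  # yield to the event loop like a real fetch would
    return SimpleNamespace(url=url, script_result="$19.99", has_script_result=True)


async def main() -> str:
    content = await fake_get(
        "https://example.com",
        script="document.querySelector('.dynamic-price').innerText",
    )
    return content.script_result


# asyncio.run creates the event loop, runs main() to completion, and closes it.
price = asyncio.run(main())
print(f"Price: {price}")
```

Swapping `fake_get` for the real `get` should leave the structure unchanged.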

Waiting for Dynamic Content

# Wait for element and extract data
content = await get(
    "https://spa-app.com",
    script="document.querySelector('.loaded-content').textContent",
    wait_for=".loaded-content"  # Wait for element to appear
)

Complex JavaScript Operations

# Execute a multi-step script. The body is wrapped in an async IIFE because
# page.evaluate() takes an expression or function, and `await`/`return` are
# only valid inside one.
complex_script = """
(async () => {
    // Scroll to load more content
    window.scrollTo(0, document.body.scrollHeight);

    // Wait for new content to load
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Extract all product data
    return Array.from(document.querySelectorAll('.product')).map(p => ({
        name: p.querySelector('.name')?.textContent,
        price: p.querySelector('.price')?.textContent,
        rating: p.querySelector('.rating')?.textContent
    }));
})()
"""

content = await get("https://ecommerce-site.com", script=complex_script)
products = content.script_result
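
A script body that uses `await` and `return` is only valid JavaScript inside an async function, and Playwright's `page.evaluate()` expects an expression or a function. The sketch below wraps a statement body in an async IIFE so it becomes a single expression. It assumes Crawailer passes the script string to the browser unchanged, which this guide does not confirm; `as_async_iife` is a hypothetical helper, not part of Crawailer.

```python
import textwrap


def as_async_iife(body: str) -> str:
    """Wrap a JavaScript statement body in an async IIFE.

    The result is one expression that evaluates to a Promise, so
    `await` and `return` inside `body` become legal.
    """
    indented = textwrap.indent(textwrap.dedent(body).strip(), "    ")
    return f"(async () => {{\n{indented}\n}})()"


script = as_async_iife("""
    window.scrollTo(0, document.body.scrollHeight);
    await new Promise(resolve => setTimeout(resolve, 2000));
    return document.querySelectorAll('.product').length;
""")
print(script)
```

The wrapped string can then be passed as the `script` argument in place of a hand-written expression.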

Advanced Patterns

Before/After Script Execution

# Execute script before content extraction, then after
content = await get(
    "https://dynamic-site.com",
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)

if isinstance(content.script_result, dict):
    print(f"Triggered loading: {content.script_result['script_before']}")
    print(f"Items loaded: {content.script_result['script_after']}")
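
The `isinstance` check above can be folded into a small helper. This is a sketch that assumes the dict layout shown in the example (`'script_before'` and `'script_after'` keys); `split_before_after` is a hypothetical name.

```python
from typing import Any, Tuple


def split_before_after(script_result: Any) -> Tuple[Any, Any]:
    """Return (before, after) from a combined script result.

    Any value that is not a dict with the expected keys is treated as
    the result of a single script and returned in the `after` slot.
    """
    if isinstance(script_result, dict) and (
        "script_before" in script_result or "script_after" in script_result
    ):
        return script_result.get("script_before"), script_result.get("script_after")
    return None, script_result


before, after = split_before_after({"script_before": None, "script_after": 12})
print(before, after)  # None 12
```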

Form Interaction and Submission

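# Fill and submit forms
form_script = """
(async () => {
    // Fill login form
    document.querySelector('#username').value = 'testuser';
    document.querySelector('#password').value = 'testpass';

    // Submit form; unlike submit(), requestSubmit() runs validation and
    // fires the submit event that in-page (AJAX) handlers listen for
    document.querySelector('#login-form').requestSubmit();

    // Wait for the response; a full-page navigation may end the script here
    await new Promise(resolve => setTimeout(resolve, 3000));

    return 'form submitted';
})()
"""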

content = await get("https://app.com/login", script=form_script)

Performance Monitoring

# Extract performance metrics
perf_script = """
(() => {
    // Navigation Timing Level 2 entry; performance.timing is deprecated
    const nav = performance.getEntriesByType('navigation')[0];
    return {
        loadTime: nav ? Math.round(nav.loadEventEnd - nav.startTime) : null,
        domReady: nav ? Math.round(nav.domContentLoadedEventEnd - nav.startTime) : null,
        resources: performance.getEntriesByType('resource').length,
        // performance.memory is non-standard and only available in Chromium
        memory: performance.memory ? {
            used: Math.round(performance.memory.usedJSHeapSize / 1024 / 1024),
            total: Math.round(performance.memory.totalJSHeapSize / 1024 / 1024)
        } : null
    };
})()
"""

content = await get("https://example.com", script=perf_script)
metrics = content.script_result
print(f"Load time: {metrics['loadTime']}ms")
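
The metrics dict can contain gaps: `memory` is `None` outside Chromium, and the load time is zero or negative when the script runs before the load event has finished. A minimal sketch of formatting it defensively, using only the keys from the script above; `summarize_metrics` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def summarize_metrics(metrics: Optional[Dict[str, Any]]) -> str:
    """Format the metrics dict from the script above as a single line."""
    if not isinstance(metrics, dict):
        return "no metrics"
    load = metrics.get("loadTime")
    # Zero or negative usually means the load event had not finished yet.
    valid_load = isinstance(load, (int, float)) and load > 0
    load_text = f"{load:.0f}ms" if valid_load else "n/a"
    memory = metrics.get("memory")  # None where performance.memory is missing
    mem_text = f"{memory['used']}MB" if isinstance(memory, dict) else "n/a"
    return f"load={load_text} resources={metrics.get('resources', 0)} heap={mem_text}"


print(summarize_metrics({"loadTime": 1234.4, "resources": 42, "memory": None}))
```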

Batch Processing

Same Script for Multiple URLs

from crawailer import get_many

urls = [
    "https://site1.com/product/1",
    "https://site1.com/product/2", 
    "https://site1.com/product/3"
]

# Extract price from all products
results = await get_many(
    urls,
    script="document.querySelector('.price')?.textContent"
)

for result in results:
    if result and result.script_result:
        print(f"{result.url}: {result.script_result}")

Different Scripts per URL

# Custom script for each URL
urls = ["https://react-app.com", "https://vue-app.com", "https://angular-app.com"]
scripts = [
    "window.React ? 'React ' + React.version : 'No React'",
    "window.Vue ? 'Vue ' + Vue.version : 'No Vue'", 
    "window.ng ? 'Angular detected' : 'No Angular'"
]

results = await get_many(urls, script=scripts)
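
When scripts are passed as a list, each one has to line up with its URL. The sketch below checks that before any browser work starts. How `get_many` itself reacts to mismatched lengths is not stated in this guide, so `pair_scripts` is a hypothetical pre-flight helper.

```python
from typing import List, Optional, Sequence, Tuple, Union


def pair_scripts(
    urls: Sequence[str], script: Union[str, Sequence[Optional[str]], None]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each URL with its script; a single string applies to every URL."""
    if script is None or isinstance(script, str):
        return [(url, script) for url in urls]
    if len(script) != len(urls):
        raise ValueError(f"{len(urls)} URLs but {len(script)} scripts")
    return list(zip(urls, script))


pairs = pair_scripts(
    ["https://react-app.com", "https://vue-app.com"],
    ["window.React ? 'React' : 'No React'", "window.Vue ? 'Vue' : 'No Vue'"],
)
for url, js in pairs:
    print(url, "->", js)
```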

Intelligent Discovery

Search Result Interaction

from crawailer import discover

# Discover content with JavaScript interaction
results = await discover(
    "machine learning tutorials",
    script="document.querySelector('.show-more')?.click()",
    content_script="document.querySelector('.read-time')?.textContent",
    max_pages=5
)

for result in results:
    print(f"{result.title} - Reading time: {result.script_result}")

Pagination Handling

# Handle infinite scroll
pagination_script = """
(async () => {
    for (let page = 0; page < 3; page++) {  // Load 3 pages
        // Scroll to bottom
        window.scrollTo(0, document.body.scrollHeight);

        // Wait for new content
        await new Promise(resolve => setTimeout(resolve, 2000));
    }

    // Extract once after scrolling; collecting on every pass would
    // repeat the items that were already on the page
    return Array.from(document.querySelectorAll('.item')).map(item =>
        item.textContent.trim()
    );
})()
"""

content = await get("https://infinite-scroll-site.com", script=pagination_script)

Error Handling

JavaScript Error Capture

content = await get(
    "https://example.com",
    script="document.querySelector('.nonexistent').click()"
)

if content.has_script_error:
    print(f"JavaScript error: {content.script_error}")
else:
    print(f"Result: {content.script_result}")

Graceful Degradation

# Try JavaScript, fall back to static content
try:
    content = await get(
        "https://dynamic-site.com",
        script="window.dynamicData || 'fallback'"
    )
    
    if content.has_script_error:
        # JavaScript failed, but we still have static content
        print(f"Using static content: {content.text[:100]}")
    else:
        print(f"Dynamic data: {content.script_result}")
        
except Exception as e:
    print(f"Complete failure: {e}")
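
The same fallback can be written as a reusable function. This sketch relies only on the attributes used in this guide (`has_script_error`, `script_result`, `text`) and uses a `SimpleNamespace` as a stand-in for a WebContent object; `best_available` is a hypothetical name.

```python
from types import SimpleNamespace


def best_available(content, preview_chars: int = 100) -> str:
    """Prefer the script result; fall back to a preview of the static text."""
    if content is None:
        return ""
    if not content.has_script_error and content.script_result is not None:
        return str(content.script_result)
    return (content.text or "")[:preview_chars]


# Stand-in for a page whose script failed but whose static text was extracted.
failed = SimpleNamespace(has_script_error=True, script_result=None, text="Static body")
print(best_available(failed))  # Static body
```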

Modern Framework Integration

React Applications

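# Extract React component data
react_script = """
(() => {
    // Find React root (data-reactroot is only emitted by older server-rendered apps)
    const reactRoot = document.querySelector('[data-reactroot]') || document.querySelector('#root');
    if (!reactRoot) return null;

    // React attaches internal keys to DOM nodes: __reactInternalInstance$ (16),
    // __reactFiber$ and __reactContainer$ (17+)
    const hasFiber = Object.keys(reactRoot).some(key =>
        key.startsWith('__reactInternalInstance') ||
        key.startsWith('__reactFiber') ||
        key.startsWith('__reactContainer')
    );
    if (!window.React && !hasFiber) return null;

    return {
        framework: 'React',
        // window.React exists only when React is loaded as a global script
        version: window.React?.version || 'unknown',
        hasRouter: !!window.ReactRouter,
        componentCount: reactRoot.querySelectorAll('*').length  // DOM elements under the root
    };
})()
"""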

content = await get("https://react-app.com", script=react_script)

Vue Applications

# Extract Vue app data
vue_script = """
(() => {
    if (!window.Vue) return null;

    const app = document.querySelector('#app');

    return {
        framework: 'Vue',
        version: Vue.version,
        hasRouter: !!window.VueRouter,
        hasVuex: !!window.Vuex,
        // __vue__ is set by Vue 2 on the mounted root element
        rootComponent: app?.__vue__?.$options.name || 'unknown'
    };
})()
"""

content = await get("https://vue-app.com", script=vue_script)

Angular Applications

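# Extract Angular app data
angular_script = """
(() => {
    // Angular stamps its version on the root component as an ng-version
    // attribute; window.ng is only exposed in development builds
    const root = document.querySelector('[ng-version]');
    if (!window.ng && !root) return null;

    const platform = window.ng?.platform || {};

    return {
        framework: 'Angular',
        version: root?.getAttribute('ng-version') || window.ng?.version?.full || 'unknown',
        hasRouter: !!window.ng?.router,
        modules: Object.keys(platform).length
    };
})()
"""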

content = await get("https://angular-app.com", script=angular_script)

WebContent Integration

Accessing JavaScript Results

content = await get("https://example.com", script="document.title")

# The JavaScript result is available on the WebContent object
print(f"Script result: {content.script_result}")
print(f"Has result: {content.has_script_result}")
print(f"Has error: {content.has_script_error}")

# Also access traditional content
print(f"Title: {content.title}")
print(f"Text: {content.text[:100]}")
print(f"Markdown: {content.markdown[:100]}")

Combining Static and Dynamic Data

# Extract both static content and dynamic data
dynamic_script = """
({
    dynamicPrice: document.querySelector('.dynamic-price')?.textContent,
    userCount: document.querySelector('.user-count')?.textContent,
    lastUpdated: document.querySelector('.last-updated')?.textContent
})
"""

content = await get("https://dashboard.com", script=dynamic_script)

# Use both static and dynamic content
analysis = {
    'title': content.title,
    'word_count': content.word_count,
    'reading_time': content.reading_time,
    'dynamic_data': content.script_result
}

Performance Considerations

Optimize JavaScript Execution

# Lightweight scripts for better performance
fast_script = "document.title"  # Simple, fast

# Avoid heavy DOM operations
slow_script = """
(() => {
    // This is expensive because it visits every element on the page
    const allElements = document.querySelectorAll('*');
    return Array.from(allElements).map(el => el.tagName);
})()
"""

Batch Processing Optimization

import asyncio

# Process in smaller batches to limit memory usage
urls = [f"https://site.com/page/{i}" for i in range(100)]

batch_size = 10
results = []

for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]
    batch_results = await get_many(batch, script="document.title")
    results.extend(batch_results)
    
    # Optional: small delay between batches
    await asyncio.sleep(1)
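
The batching loop above can be packaged as a coroutine that accepts any batch fetcher. In this sketch, `process_in_batches` is a hypothetical helper and `fake_fetch_many` is a stand-in for `get_many`, so the example runs without a browser.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def process_in_batches(
    urls: Sequence[str],
    fetch_many: Callable[[List[str]], Awaitable[list]],
    batch_size: int = 10,
    delay: float = 0.0,
) -> list:
    """Fetch URLs in fixed-size batches, pausing between batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: list = []
    for start in range(0, len(urls), batch_size):
        batch = list(urls[start:start + batch_size])
        results.extend(await fetch_many(batch))
        if delay and start + batch_size < len(urls):
            await asyncio.sleep(delay)  # no pause after the final batch
    return results


async def fake_fetch_many(batch: List[str]) -> list:
    """Stand-in for get_many; returns one fake title per URL."""
    return [f"title of {url}" for url in batch]


urls = [f"https://site.com/page/{i}" for i in range(25)]
titles = asyncio.run(process_in_batches(urls, fake_fetch_many, batch_size=10))
print(len(titles))  # 25
```

With Crawailer, the fetcher would presumably be something like `lambda batch: get_many(batch, script="document.title")`.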

Best Practices

1. Script Design

# ✅ Good: simple, focused scripts that return raw values
good_script = "document.querySelector('.price')?.textContent"

# ❌ Avoid: scripts that also parse and transform the data;
# that logic is easier to test and debug in Python
bad_script = """
(() => {
    try {
        const price = document.querySelector('.price').textContent.split('$')[1];
        const discountedPrice = parseFloat(price) * 0.9;
        return `$${discountedPrice.toFixed(2)}`;
    } catch (e) {
        return null;  // hides every failure as missing data
    }
})()
"""

2. Error Handling

import logging

# Always check for script errors
content = await get(url, script=script)

if content.has_script_error:
    # Handle the error appropriately
    logging.warning(f"JavaScript error on {url}: {content.script_error}")
    # Use fallback approach
else:
    # Process successful result
    process_result(content.script_result)
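
One possible fallback is to retry the fetch a few times before giving up, since script errors on dynamic pages are often timing related. The sketch below assumes only the `has_script_error` attribute shown above; `fetch_with_retry` and the `flaky` stand-in are hypothetical.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, Awaitable, Callable


async def fetch_with_retry(
    fetch: Callable[[], Awaitable[Any]], attempts: int = 3, delay: float = 0.0
) -> Any:
    """Call `fetch` until the result has no script error or attempts run out."""
    content = None
    for attempt in range(1, attempts + 1):
        content = await fetch()
        if not getattr(content, "has_script_error", False):
            return content
        if attempt < attempts:
            await asyncio.sleep(delay)
    return content  # the last attempt, still carrying its script error


calls = []


async def flaky() -> SimpleNamespace:
    """Stand-in fetcher whose script fails on the first two calls."""
    calls.append(1)
    return SimpleNamespace(has_script_error=len(calls) < 3, script_result="ok")


result = asyncio.run(fetch_with_retry(flaky, attempts=5))
print(len(calls), result.script_result)  # 3 ok
```

With Crawailer, the fetcher could be `lambda: get(url, script=script)`.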

3. Performance Monitoring

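import logging
import time

start_time = time.perf_counter()  # monotonic clock, suited to measuring durations
content = await get(url, script=script)
duration = time.perf_counter() - start_time

if duration > 10:  # Seconds; includes page load as well as script execution
    logging.warning(f"Slow JavaScript execution on {url}: {duration:.2f}s")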

Common Use Cases

E-commerce Data Extraction

# Extract product information
product_script = """
({
    name: document.querySelector('.product-name')?.textContent,
    price: document.querySelector('.price')?.textContent,
    rating: document.querySelector('.rating')?.textContent,
    availability: document.querySelector('.stock-status')?.textContent,
    images: Array.from(document.querySelectorAll('.product-image img')).map(img => img.src)
})
"""

content = await get("https://shop.com/product/123", script=product_script)
product_data = content.script_result
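
Values read with `textContent` arrive as strings, so a price usually needs parsing before it can be compared or summed. A minimal sketch, assuming `.` is the decimal separator and `,` a thousands separator; `parse_price` is a hypothetical helper.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Extract the first number from a price string such as '$1,299.99'."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Decimal avoids the rounding surprises of float for currency values.
    return Decimal(match.group().replace(",", ""))


print(parse_price(" $1,299.99 "))  # 1299.99
```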

Social Media Content

# Extract social media posts (respect each site's terms of service)
social_script = """
Array.from(document.querySelectorAll('.post')).slice(0, 10).map(post => ({
    text: post.querySelector('.post-text')?.textContent,
    author: post.querySelector('.author')?.textContent,
    timestamp: post.querySelector('.timestamp')?.textContent,
    likes: post.querySelector('.likes-count')?.textContent
}))
"""

content = await get("https://social-site.com/feed", script=social_script)
posts = content.script_result

News and Articles

# Extract article metadata
article_script = """
({
    headline: document.querySelector('h1')?.textContent,
    author: document.querySelector('.author')?.textContent,
    publishDate: document.querySelector('.publish-date')?.textContent,
    readingTime: document.querySelector('.reading-time')?.textContent,
    tags: Array.from(document.querySelectorAll('.tag')).map(tag => tag.textContent),
    wordCount: document.querySelector('.article-body')?.textContent.split(' ').length
})
"""

content = await get("https://news-site.com/article/123", script=article_script)

Integration with AI Workflows

Content Preparation for LLMs

# Extract structured content for AI processing
ai_script = """
({
    mainContent: document.querySelector('main')?.textContent,
    headings: Array.from(document.querySelectorAll('h1, h2, h3')).map(h => ({
        level: h.tagName,
        text: h.textContent
    })),
    keyPoints: Array.from(document.querySelectorAll('.highlight, .callout')).map(el => el.textContent),
    metadata: {
        wordCount: document.body.textContent.split(' ').length,
        readingLevel: 'advanced',  // Could be calculated
        topics: Array.from(document.querySelectorAll('.topic-tag')).map(tag => tag.textContent)
    }
})
"""

content = await get("https://technical-blog.com/post", script=ai_script)
structured_data = content.script_result

# Now ready for AI processing
ai_prompt = f"""
Analyze this content:

Title: {content.title}
Main Content: {structured_data['mainContent'][:1000]}...
Key Points: {structured_data['keyPoints']}
Topics: {structured_data['metadata']['topics']}

Provide a summary and key insights.
"""

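Any field read with `?.textContent` comes back as `None` when the element is missing, and slicing `None` raises a `TypeError`. The sketch below builds the same prompt with defaults for missing fields, using the keys from `ai_script` above; `build_prompt` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def build_prompt(
    title: Optional[str], data: Optional[Dict[str, Any]], limit: int = 1000
) -> str:
    """Build the analysis prompt, tolerating missing or empty fields."""
    data = data or {}
    main = (data.get("mainContent") or "").strip()
    excerpt = main[:limit] + ("..." if len(main) > limit else "")
    raw_points = data.get("keyPoints") or []
    key_points = [p.strip() for p in raw_points if p and p.strip()]
    topics = (data.get("metadata") or {}).get("topics") or []
    return "\n".join([
        "Analyze this content:",
        "",
        f"Title: {title or 'untitled'}",
        f"Main Content: {excerpt}",
        f"Key Points: {'; '.join(key_points) or 'none'}",
        f"Topics: {', '.join(topics) or 'none'}",
        "",
        "Provide a summary and key insights.",
    ])


print(build_prompt("Example post", {"mainContent": None, "keyPoints": []}))
```
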
Troubleshooting

Common Issues

Script Timeout

# Increase timeout for slow scripts
content = await get(url, script=script, timeout=60)

Element Not Found

# Use optional chaining and fallbacks
safe_script = """
document.querySelector('.target')?.textContent || 'not found'
"""

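When the selector comes from a variable, building the script with plain string formatting breaks as soon as the selector contains a quote. A minimal sketch that uses `json.dumps` to produce valid JavaScript string literals; `text_or_default` is a hypothetical helper. Because of the `||`, an element with empty text also yields the default.

```python
import json


def text_or_default(selector: str, default: str = "not found") -> str:
    """Build a JS expression that reads an element's text with a fallback.

    json.dumps emits a double-quoted, escaped literal, so quotes in the
    selector or the default cannot terminate the string early.
    """
    return (
        f"document.querySelector({json.dumps(selector)})?.textContent?.trim()"
        f" || {json.dumps(default)}"
    )


print(text_or_default("[data-id='42'] .price"))
```
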
JavaScript Not Loaded

# Wait for JavaScript frameworks to load
content = await get(
    url,
    script="typeof React !== 'undefined' ? React.version : 'React not loaded'",
    wait_for="[data-reactroot]"
)

Debug Mode

# Enable verbose logging for debugging
import logging
logging.basicConfig(level=logging.DEBUG)

content = await get(url, script=script)

With these JavaScript hooks, Crawailer can extract content from JavaScript-heavy applications through the same get(), get_many(), and discover() calls used for static sites, which suits AI workflows that depend on complete, accurate page content.
