
Crawailer JavaScript API Documentation
Overview
Crawailer can execute JavaScript in the pages it loads, which makes it possible to extract content that only exists after scripts have run. Unlike traditional HTTP scrapers, Crawailer drives a real browser through Playwright, so it can read single-page applications (SPAs), dynamically rendered sites, and other JavaScript-heavy pages.
Key Features
- Full JavaScript Execution: Execute arbitrary JavaScript code using page.evaluate()
- Before/After Script Patterns: Run scripts before and after content extraction
- SPA Support: Handle React, Vue, Angular, and other modern frameworks
- Dynamic Content: Extract content that's loaded via AJAX or user interactions
- Error Handling: Comprehensive error capture and graceful degradation
- Performance Monitoring: Extract timing and memory metrics
- User Interaction: Simulate clicks, form submissions, and complex workflows
Basic Usage
Simple JavaScript Execution
from crawailer import get

# Extract dynamic content
# (the examples in this guide use await, so run them inside an async function
# or an asyncio-aware REPL)
content = await get(
    "https://example.com",
    script="document.querySelector('.dynamic-price').innerText"
)
print(f"Price: {content.script_result}")
print(f"Has script result: {content.has_script_result}")
Waiting for Dynamic Content
# Wait for element and extract data
content = await get(
    "https://spa-app.com",
    script="document.querySelector('.loaded-content').textContent",
    wait_for=".loaded-content"  # Wait for element to appear
)
Complex JavaScript Operations
# Execute complex JavaScript
complex_script = """
// Scroll to load more content
window.scrollTo(0, document.body.scrollHeight);
// Wait for new content to load
await new Promise(resolve => setTimeout(resolve, 2000));
// Extract all product data
const products = Array.from(document.querySelectorAll('.product')).map(p => ({
name: p.querySelector('.name')?.textContent,
price: p.querySelector('.price')?.textContent,
rating: p.querySelector('.rating')?.textContent
}));
return products;
"""
content = await get("https://ecommerce-site.com", script=complex_script)
products = content.script_result
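The script above relies on top-level await and return. This guide does not spell out how Crawailer hands the string to the browser, and Playwright's page.evaluate() on its own treats a plain string as an expression, so one defensive option is to wrap the statements in an async immediately-invoked function. The sketch below shows that wrapping; wrap_async is a hypothetical helper, not part of Crawailer, and it is unnecessary (and should not be used) if your Crawailer version already wraps scripts in a function body.

```python
import textwrap


def wrap_async(body: str) -> str:
    """Wrap JavaScript statements in an async IIFE (hypothetical helper).

    The result is a single expression that evaluates to a Promise, which
    Playwright's page.evaluate() awaits before returning the value.
    """
    indented = textwrap.indent(textwrap.dedent(body).strip(), "  ")
    return f"(async () => {{\n{indented}\n}})()"


script = wrap_async("""
    await new Promise(resolve => setTimeout(resolve, 2000));
    return document.querySelectorAll('.product').length;
""")
print(script)
```

The wrapped string can then be passed as script= in place of the raw statements.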
Advanced Patterns
Before/After Script Execution
# Execute script before content extraction, then after
content = await get(
    "https://dynamic-site.com",
    script_before="document.querySelector('.load-more')?.click()",
    script_after="document.querySelectorAll('.item').length"
)
if isinstance(content.script_result, dict):
    print(f"Triggered loading: {content.script_result['script_before']}")
    print(f"Items loaded: {content.script_result['script_after']}")
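Because script_result is a dict in the before/after case and a single value otherwise, calling code may want one shape to work with. A minimal sketch, assuming the dict keys are exactly script_before and script_after as in the example above; split_script_result is a hypothetical helper, not a Crawailer function.

```python
from typing import Any, Tuple


def split_script_result(result: Any) -> Tuple[Any, Any]:
    """Return (before, after) for either result shape (hypothetical helper)."""
    if isinstance(result, dict) and (
        "script_before" in result or "script_after" in result
    ):
        return result.get("script_before"), result.get("script_after")
    # A plain script= run yields one value; treat it as the "after" result.
    return None, result


print(split_script_result({"script_before": None, "script_after": 12}))
print(split_script_result("Example Domain"))
```

A script that itself returns an object (for example product data) passes through untouched, because it lacks both keys.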
Form Interaction and Submission
# Fill and submit forms
form_script = """
// Fill login form
document.querySelector('#username').value = 'testuser';
document.querySelector('#password').value = 'testpass';
// Submit form
document.querySelector('#login-form').submit();
// Wait for redirect
await new Promise(resolve => setTimeout(resolve, 3000));
return 'form submitted';
"""
content = await get("https://app.com/login", script=form_script)
Performance Monitoring
# Extract performance metrics
perf_script = """
({
loadTime: performance.timing.loadEventEnd - performance.timing.navigationStart,
domReady: performance.timing.domContentLoadedEventEnd - performance.timing.navigationStart,
resources: performance.getEntriesByType('resource').length,
memory: performance.memory ? {
used: Math.round(performance.memory.usedJSHeapSize / 1024 / 1024),
total: Math.round(performance.memory.totalJSHeapSize / 1024 / 1024)
} : null
})
"""
content = await get("https://example.com", script=perf_script)
metrics = content.script_result
print(f"Load time: {metrics['loadTime']}ms")
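One caveat with the metrics above: performance.timing.loadEventEnd stays 0 until the load event has finished, so a script that runs early produces a large negative loadTime. The sketch below filters such values on the Python side; clean_timings is a hypothetical helper and the key names simply mirror the script above.

```python
from typing import Any, Dict, Optional


def clean_timings(metrics: Optional[Dict[str, Any]]) -> Dict[str, Optional[float]]:
    """Keep only plausible millisecond durations (hypothetical helper)."""
    cleaned: Dict[str, Optional[float]] = {}
    for key in ("loadTime", "domReady"):
        value = (metrics or {}).get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        # Negative durations mean the underlying event had not fired yet.
        cleaned[key] = float(value) if is_number and value >= 0 else None
    return cleaned


print(clean_timings({"loadTime": -1699999999999, "domReady": 420}))
```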
Batch Processing
Same Script for Multiple URLs
from crawailer import get_many
urls = [
"https://site1.com/product/1",
"https://site1.com/product/2",
"https://site1.com/product/3"
]
# Extract price from all products
results = await get_many(
    urls,
    script="document.querySelector('.price')?.textContent"
)
for result in results:
    if result and result.script_result:
        print(f"{result.url}: {result.script_result}")
Different Scripts per URL
# Custom script for each URL
urls = ["https://react-app.com", "https://vue-app.com", "https://angular-app.com"]
scripts = [
"window.React ? 'React ' + React.version : 'No React'",
"window.Vue ? 'Vue ' + Vue.version : 'No Vue'",
"window.ng ? 'Angular detected' : 'No Angular'"
]
results = await get_many(urls, script=scripts)
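This guide does not say what get_many() does when the script list and the URL list differ in length, so it can be worth checking before the call. A minimal sketch; pair_scripts is a hypothetical helper that only validates and pairs the inputs.

```python
from typing import List, Optional, Sequence, Tuple, Union


def pair_scripts(
    urls: Sequence[str],
    scripts: Union[None, str, Sequence[str]],
) -> List[Tuple[str, Optional[str]]]:
    """Pair each URL with its script, failing early on a length mismatch."""
    if scripts is None or isinstance(scripts, str):
        # One shared script (or none) applies to every URL.
        return [(url, scripts) for url in urls]
    if len(scripts) != len(urls):
        raise ValueError(f"{len(urls)} URLs but {len(scripts)} scripts")
    return list(zip(urls, scripts))


print(pair_scripts(["https://a.example", "https://b.example"], "document.title"))
```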
Intelligent Discovery
Search Result Interaction
from crawailer import discover
# Discover content with JavaScript interaction
results = await discover(
    "machine learning tutorials",
    script="document.querySelector('.show-more')?.click()",
    content_script="document.querySelector('.read-time')?.textContent",
    max_pages=5
)
for result in results:
    print(f"{result.title} - Reading time: {result.script_result}")
Pagination Handling
# Handle infinite scroll
pagination_script = """
let results = [];
let page = 0;
while (page < 3) { // Load 3 pages
    // Scroll to bottom
    window.scrollTo(0, document.body.scrollHeight);
    // Wait for new content
    await new Promise(resolve => setTimeout(resolve, 2000));
    // Extract current page items
    const items = Array.from(document.querySelectorAll('.item')).map(item =>
        item.textContent.trim()
    );
    results.push(...items);
    page++;
}
return results;
"""
content = await get("https://infinite-scroll-site.com", script=pagination_script)
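Each pass of the loop above re-reads every .item currently in the DOM. On a site that appends new items rather than replacing old ones, entries from earlier passes therefore show up in the result more than once. An order-preserving dedupe on the Python side is a small fix; dedupe_keep_order is a hypothetical helper name.

```python
from typing import Hashable, Iterable, List


def dedupe_keep_order(items: Iterable[Hashable]) -> List[Hashable]:
    """Drop repeated entries while keeping first-seen order."""
    # dict preserves insertion order, so its keys are the unique items in order.
    return list(dict.fromkeys(items))


scraped = ["Item 1", "Item 2", "Item 1", "Item 2", "Item 3"]
print(dedupe_keep_order(scraped))
```

Note that this also merges two distinct items that happen to share the same text, which may or may not be what you want.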
Error Handling
JavaScript Error Capture
content = await get(
    "https://example.com",
    script="document.querySelector('.nonexistent').click()"
)
if content.has_script_error:
    print(f"JavaScript error: {content.script_error}")
else:
    print(f"Result: {content.script_result}")
Graceful Degradation
# Try JavaScript, fall back to static content
try:
    content = await get(
        "https://dynamic-site.com",
        script="window.dynamicData || 'fallback'"
    )
    if content.has_script_error:
        # JavaScript failed, but we still have static content
        print(f"Using static content: {content.text[:100]}")
    else:
        print(f"Dynamic data: {content.script_result}")
except Exception as e:
    print(f"Complete failure: {e}")
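The same fallback decision can be pulled into a function so that it is easy to reuse and test without a browser. A minimal sketch: best_available is a hypothetical helper, and the SimpleNamespace objects are stand-ins that only mimic the has_script_error, script_result, and text attributes used above.

```python
from types import SimpleNamespace
from typing import Any, Tuple


def best_available(content: Any, preview_chars: int = 100) -> Tuple[str, Any]:
    """Prefer the script result; fall back to a preview of the static text."""
    if content.has_script_error or content.script_result is None:
        return ("static", content.text[:preview_chars])
    return ("dynamic", content.script_result)


# Stand-ins for the objects returned by get(); not real WebContent instances.
ok = SimpleNamespace(has_script_error=False, script_result={"n": 1}, text="static body")
failed = SimpleNamespace(has_script_error=True, script_result=None, text="static body")

print(best_available(ok))
print(best_available(failed, preview_chars=6))
```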
Modern Framework Integration
React Applications
# Extract React component data
react_script = """
// Find React root
const reactRoot = document.querySelector('[data-reactroot]') || document.querySelector('#root');
if (window.React && reactRoot) {
    // Look for React's internal keys on the root node; the prefix depends on the React version
    const fiberKey = Object.keys(reactRoot).find(key =>
        key.startsWith('__reactInternalInstance') ||
        key.startsWith('__reactFiber') ||
        key.startsWith('__reactContainer')
    );
    return {
        framework: 'React',
        version: React.version,
        hasFiber: Boolean(fiberKey),
        hasRouter: !!window.ReactRouter,
        // Count descendants of whichever root was found, not only [data-reactroot]
        componentCount: reactRoot.querySelectorAll('*').length
    };
}
return null;
"""
content = await get("https://react-app.com", script=react_script)
Vue Applications
# Extract Vue app data
vue_script = """
if (window.Vue) {
const app = document.querySelector('#app');
return {
framework: 'Vue',
version: Vue.version,
hasRouter: !!window.VueRouter,
hasVuex: !!window.Vuex,
rootComponent: app?.__vue__?.$options.name || 'unknown'
};
}
return null;
"""
content = await get("https://vue-app.com", script=vue_script)
Angular Applications
# Extract Angular app data
angular_script = """
if (window.ng) {
const platform = window.ng.platform || {};
return {
framework: 'Angular',
version: window.ng.version?.full || 'unknown',
hasRouter: !!window.ng.router,
modules: Object.keys(platform).length
};
}
return null;
"""
content = await get("https://angular-app.com", script=angular_script)
WebContent Integration
Accessing JavaScript Results
content = await get("https://example.com", script="document.title")
# JavaScript result is available in WebContent object
print(f"Script result: {content.script_result}")
print(f"Has result: {content.has_script_result}")
print(f"Has error: {content.has_script_error}")
# Also access traditional content
print(f"Title: {content.title}")
print(f"Text: {content.text[:100]}")
print(f"Markdown: {content.markdown[:100]}")
Combining Static and Dynamic Data
# Extract both static content and dynamic data
dynamic_script = """
({
dynamicPrice: document.querySelector('.dynamic-price')?.textContent,
userCount: document.querySelector('.user-count')?.textContent,
lastUpdated: document.querySelector('.last-updated')?.textContent
})
"""
content = await get("https://dashboard.com", script=dynamic_script)
# Use both static and dynamic content
analysis = {
'title': content.title,
'word_count': content.word_count,
'reading_time': content.reading_time,
'dynamic_data': content.script_result
}
Performance Considerations
Optimize JavaScript Execution
# Lightweight scripts for better performance
fast_script = "document.title" # Simple, fast
# Avoid heavy DOM operations
slow_script = """
// This is expensive - avoid if possible
const allElements = document.querySelectorAll('*');
return Array.from(allElements).map(el => el.tagName);
"""
Batch Processing Optimization
import asyncio
from crawailer import get_many

# Process in smaller batches for better memory usage
urls = [f"https://site.com/page/{i}" for i in range(100)]
batch_size = 10
results = []
for i in range(0, len(urls), batch_size):
    batch = urls[i:i+batch_size]
    batch_results = await get_many(batch, script="document.title")
    results.extend(batch_results)
    # Optional: small delay between batches
    await asyncio.sleep(1)
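The batching loop can also be factored into a reusable coroutine that takes the fetch function as a parameter, which makes it testable without a browser. This is a sketch under that assumption: chunked, run_in_batches, and fake_fetch are hypothetical names, and fake_fetch stands in for a call such as get_many(batch, script="document.title").

```python
import asyncio
from typing import Awaitable, Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


async def run_in_batches(
    urls: Sequence[str],
    fetch_batch: Callable[[List[str]], Awaitable[list]],
    batch_size: int = 10,
    pause: float = 0.0,
) -> list:
    """Call `fetch_batch` on each chunk in turn and concatenate the results."""
    results: list = []
    for batch in chunked(urls, batch_size):
        results.extend(await fetch_batch(batch))
        if pause:
            await asyncio.sleep(pause)  # optional delay between batches
    return results


async def fake_fetch(batch: List[str]) -> list:
    # Stand-in for the real network call; just upper-cases each URL.
    return [url.upper() for url in batch]


demo = asyncio.run(run_in_batches(["a", "b", "c"], fake_fetch, batch_size=2))
print(demo)
```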
Best Practices
1. Script Design
# ✅ Good: Simple, focused scripts
good_script = "document.querySelector('.price')?.textContent"
# ❌ Avoid: Complex scripts that could fail
bad_script = """
try {
const price = document.querySelector('.price').textContent.split('$')[1];
const discountedPrice = parseFloat(price) * 0.9;
return `$${discountedPrice.toFixed(2)}`;
} catch (e) {
return null;
}
"""
2. Error Handling
import logging

# Always check for script errors
content = await get(url, script=script)
if content.has_script_error:
    # Handle the error appropriately
    logging.warning(f"JavaScript error on {url}: {content.script_error}")
    # Use fallback approach
else:
    # Process successful result (process_result is your own handler)
    process_result(content.script_result)
3. Performance Monitoring
import logging
import time

# Note: this measures the whole get() call (navigation plus script), not the script alone
start_time = time.time()
content = await get(url, script=script)
duration = time.time() - start_time
if duration > 10:  # If taking too long
    logging.warning(f"Slow JavaScript execution on {url}: {duration:.2f}s")
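The timing pattern above can be wrapped once and reused for any awaitable. The sketch below uses time.perf_counter(), which is a monotonic clock intended for measuring intervals; timed is a hypothetical helper, and asyncio.sleep stands in for the real get(url, script=script) call.

```python
import asyncio
import time
from typing import Any, Awaitable, Tuple


async def timed(awaitable: Awaitable[Any]) -> Tuple[Any, float]:
    """Await something and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = await awaitable
    return result, time.perf_counter() - start


# Stand-in for `timed(get(url, script=script))`.
value, seconds = asyncio.run(timed(asyncio.sleep(0.01, result="ok")))
print(value, f"{seconds:.3f}s")
```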
Common Use Cases
E-commerce Data Extraction
# Extract product information
product_script = """
({
name: document.querySelector('.product-name')?.textContent,
price: document.querySelector('.price')?.textContent,
rating: document.querySelector('.rating')?.textContent,
availability: document.querySelector('.stock-status')?.textContent,
images: Array.from(document.querySelectorAll('.product-image img')).map(img => img.src)
})
"""
content = await get("https://shop.com/product/123", script=product_script)
product_data = content.script_result
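The price field arrives as display text (for example "$1,299.99"), so it usually needs parsing before any arithmetic. A minimal sketch, assuming US-style formatting with a comma as thousands separator and a dot as decimal separator; parse_price is a hypothetical helper and other locales would need different handling.

```python
import re
from decimal import Decimal
from typing import Optional

# Digits with optional thousands commas and an optional decimal part.
_PRICE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Extract the first number from a price string, or None if absent."""
    if not text:
        return None
    match = _PRICE.search(text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


print(parse_price(" $1,299.99 "))
```

Decimal avoids the rounding surprises that float introduces for currency values.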
Social Media Content
# Extract social media posts (be respectful of terms of service)
social_script = """
Array.from(document.querySelectorAll('.post')).slice(0, 10).map(post => ({
text: post.querySelector('.post-text')?.textContent,
author: post.querySelector('.author')?.textContent,
timestamp: post.querySelector('.timestamp')?.textContent,
likes: post.querySelector('.likes-count')?.textContent
}))
"""
content = await get("https://social-site.com/feed", script=social_script)
posts = content.script_result
News and Articles
# Extract article metadata
article_script = """
({
headline: document.querySelector('h1')?.textContent,
author: document.querySelector('.author')?.textContent,
publishDate: document.querySelector('.publish-date')?.textContent,
readingTime: document.querySelector('.reading-time')?.textContent,
tags: Array.from(document.querySelectorAll('.tag')).map(tag => tag.textContent),
wordCount: document.querySelector('.article-body')?.textContent.split(' ').length
})
"""
content = await get("https://news-site.com/article/123", script=article_script)
Integration with AI Workflows
Content Preparation for LLMs
# Extract structured content for AI processing
ai_script = """
({
mainContent: document.querySelector('main')?.textContent,
headings: Array.from(document.querySelectorAll('h1, h2, h3')).map(h => ({
level: h.tagName,
text: h.textContent
})),
keyPoints: Array.from(document.querySelectorAll('.highlight, .callout')).map(el => el.textContent),
metadata: {
wordCount: document.body.textContent.split(' ').length,
readingLevel: 'advanced', // Could be calculated
topics: Array.from(document.querySelectorAll('.topic-tag')).map(tag => tag.textContent)
}
})
"""
content = await get("https://technical-blog.com/post", script=ai_script)
structured_data = content.script_result
# Now ready for AI processing
# mainContent is empty when the page has no <main> element, so default to ''
main_content = structured_data.get('mainContent') or ''
ai_prompt = f"""
Analyze this content:
Title: {content.title}
Main Content: {main_content[:1000]}...
Key Points: {structured_data['keyPoints']}
Topics: {structured_data['metadata']['topics']}
Provide a summary and key insights.
"""
Troubleshooting
Common Issues
- Script Timeout
# Increase timeout for slow scripts
content = await get(url, script=script, timeout=60)
- Element Not Found
# Use optional chaining and fallbacks
safe_script = """
document.querySelector('.target')?.textContent || 'not found'
"""
- JavaScript Not Loaded
# Wait for JavaScript frameworks to load
content = await get(
    url,
    script="typeof React !== 'undefined' ? React.version : 'React not loaded'",
    wait_for="[data-reactroot]"
)
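For the "Element Not Found" case, selectors that contain quotes are easy to break when pasted into a JavaScript string by hand. One option is to build the script in Python and let json.dumps() produce the string literals. A minimal sketch; text_or_default is a hypothetical helper. It uses ??, which falls back only on null or undefined; swap in || to also treat empty text as missing.

```python
import json


def text_or_default(selector: str, default: str = "not found") -> str:
    """Build a JS expression that reads textContent with a fallback value."""
    # json.dumps yields a correctly escaped, double-quoted string literal.
    return (
        f"document.querySelector({json.dumps(selector)})?.textContent"
        f" ?? {json.dumps(default)}"
    )


print(text_or_default("a[title='x']"))
```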
Debug Mode
# Enable verbose logging for debugging
import logging
logging.basicConfig(level=logging.DEBUG)
content = await get(url, script=script)
With this JavaScript API, Crawailer can handle modern web applications as readily as static sites, which suits AI workflows that depend on accurate content extraction.