
- Complete browser automation with Playwright integration - High-level API functions: get(), get_many(), discover() - JavaScript execution support with script parameters - Content extraction optimized for LLM workflows - Comprehensive test suite with 18 test files (700+ scenarios) - Local Caddy test server for reproducible testing - Performance benchmarking vs Katana crawler - Complete documentation including JavaScript API guide - PyPI-ready packaging with professional metadata - UNIX philosophy: do web scraping exceptionally well
579 lines
15 KiB
Markdown
579 lines
15 KiB
Markdown
# Crawailer JavaScript API Documentation
|
|
|
|
## Overview
|
|
|
|
Crawailer provides comprehensive JavaScript execution capabilities that enable dynamic content extraction from modern web applications. Unlike traditional HTTP scrapers, Crawailer uses a real browser (Playwright) to execute JavaScript and extract content from single-page applications (SPAs), dynamic sites, and JavaScript-heavy pages.
|
|
|
|
## Key Features
|
|
|
|
- **Full JavaScript Execution**: Execute arbitrary JavaScript code using `page.evaluate()`
|
|
- **Before/After Script Patterns**: Run scripts before and after content extraction
|
|
- **SPA Support**: Handle React, Vue, Angular, and other modern frameworks
|
|
- **Dynamic Content**: Extract content that's loaded via AJAX or user interactions
|
|
- **Error Handling**: Comprehensive error capture and graceful degradation
|
|
- **Performance Monitoring**: Extract timing and memory metrics
|
|
- **User Interaction**: Simulate clicks, form submissions, and complex workflows
|
|
|
|
## Basic Usage
|
|
|
|
### Simple JavaScript Execution
|
|
|
|
```python
|
|
from crawailer import get
|
|
|
|
# Extract dynamic content
|
|
content = await get(
|
|
"https://example.com",
|
|
script="document.querySelector('.dynamic-price').innerText"
|
|
)
|
|
|
|
print(f"Price: {content.script_result}")
|
|
print(f"Has script result: {content.has_script_result}")
|
|
```
|
|
|
|
### Waiting for Dynamic Content
|
|
|
|
```python
|
|
# Wait for element and extract data
|
|
content = await get(
|
|
"https://spa-app.com",
|
|
script="document.querySelector('.loaded-content').textContent",
|
|
wait_for=".loaded-content" # Wait for element to appear
|
|
)
|
|
```
|
|
|
|
### Complex JavaScript Operations
|
|
|
|
```python
|
|
# Execute complex JavaScript
|
|
complex_script = """
|
|
// Scroll to load more content
|
|
window.scrollTo(0, document.body.scrollHeight);
|
|
|
|
// Wait for new content to load
|
|
await new Promise(resolve => setTimeout(resolve, 2000));
|
|
|
|
// Extract all product data
|
|
const products = Array.from(document.querySelectorAll('.product')).map(p => ({
|
|
name: p.querySelector('.name')?.textContent,
|
|
price: p.querySelector('.price')?.textContent,
|
|
rating: p.querySelector('.rating')?.textContent
|
|
}));
|
|
|
|
return products;
|
|
"""
|
|
|
|
content = await get("https://ecommerce-site.com", script=complex_script)
|
|
products = content.script_result
|
|
```
|
|
|
|
## Advanced Patterns
|
|
|
|
### Before/After Script Execution
|
|
|
|
```python
|
|
# Execute script before content extraction, then after
|
|
content = await get(
|
|
"https://dynamic-site.com",
|
|
script_before="document.querySelector('.load-more')?.click()",
|
|
script_after="document.querySelectorAll('.item').length"
|
|
)
|
|
|
|
if isinstance(content.script_result, dict):
|
|
print(f"Triggered loading: {content.script_result['script_before']}")
|
|
print(f"Items loaded: {content.script_result['script_after']}")
|
|
```
|
|
|
|
### Form Interaction and Submission
|
|
|
|
```python
|
|
# Fill and submit forms
|
|
form_script = """
|
|
// Fill login form
|
|
document.querySelector('#username').value = 'testuser';
|
|
document.querySelector('#password').value = 'testpass';
|
|
|
|
// Submit form
|
|
document.querySelector('#login-form').submit();
|
|
|
|
// Wait for redirect
|
|
await new Promise(resolve => setTimeout(resolve, 3000));
|
|
|
|
return 'form submitted';
|
|
"""
|
|
|
|
content = await get("https://app.com/login", script=form_script)
|
|
```
|
|
|
|
### Performance Monitoring
|
|
|
|
```python
|
|
# Extract performance metrics
|
|
perf_script = """
|
|
({
|
|
loadTime: performance.timing.loadEventEnd - performance.timing.navigationStart,
|
|
domReady: performance.timing.domContentLoadedEventEnd - performance.timing.navigationStart,
|
|
resources: performance.getEntriesByType('resource').length,
|
|
memory: performance.memory ? {
|
|
used: Math.round(performance.memory.usedJSHeapSize / 1024 / 1024),
|
|
total: Math.round(performance.memory.totalJSHeapSize / 1024 / 1024)
|
|
} : null
|
|
})
|
|
"""
|
|
|
|
content = await get("https://example.com", script=perf_script)
|
|
metrics = content.script_result
|
|
print(f"Load time: {metrics['loadTime']}ms")
|
|
```
|
|
|
|
## Batch Processing
|
|
|
|
### Same Script for Multiple URLs
|
|
|
|
```python
|
|
from crawailer import get_many
|
|
|
|
urls = [
|
|
"https://site1.com/product/1",
|
|
"https://site1.com/product/2",
|
|
"https://site1.com/product/3"
|
|
]
|
|
|
|
# Extract price from all products
|
|
results = await get_many(
|
|
urls,
|
|
script="document.querySelector('.price')?.textContent"
|
|
)
|
|
|
|
for result in results:
|
|
if result and result.script_result:
|
|
print(f"{result.url}: {result.script_result}")
|
|
```
|
|
|
|
### Different Scripts per URL
|
|
|
|
```python
|
|
# Custom script for each URL
|
|
urls = ["https://react-app.com", "https://vue-app.com", "https://angular-app.com"]
|
|
scripts = [
|
|
"window.React ? 'React ' + React.version : 'No React'",
|
|
"window.Vue ? 'Vue ' + Vue.version : 'No Vue'",
|
|
"window.ng ? 'Angular detected' : 'No Angular'"
|
|
]
|
|
|
|
results = await get_many(urls, script=scripts)
|
|
```
|
|
|
|
## Intelligent Discovery
|
|
|
|
### Search Result Interaction
|
|
|
|
```python
|
|
from crawailer import discover
|
|
|
|
# Discover content with JavaScript interaction
|
|
results = await discover(
|
|
"machine learning tutorials",
|
|
script="document.querySelector('.show-more')?.click()",
|
|
content_script="document.querySelector('.read-time')?.textContent",
|
|
max_pages=5
|
|
)
|
|
|
|
for result in results:
|
|
print(f"{result.title} - Reading time: {result.script_result}")
|
|
```
|
|
|
|
### Pagination Handling
|
|
|
|
```python
|
|
# Handle infinite scroll
|
|
pagination_script = """
|
|
let results = [];
|
|
let page = 0;
|
|
|
|
while (page < 3) { // Load 3 pages
|
|
// Scroll to bottom
|
|
window.scrollTo(0, document.body.scrollHeight);
|
|
|
|
// Wait for new content
|
|
await new Promise(resolve => setTimeout(resolve, 2000));
|
|
|
|
// Extract current page items
|
|
const items = Array.from(document.querySelectorAll('.item')).map(item =>
|
|
item.textContent.trim()
|
|
);
|
|
|
|
results.push(...items);
|
|
page++;
|
|
}
|
|
|
|
return results;
|
|
"""
|
|
|
|
content = await get("https://infinite-scroll-site.com", script=pagination_script)
|
|
```
|
|
|
|
## Error Handling
|
|
|
|
### JavaScript Error Capture
|
|
|
|
```python
|
|
content = await get(
|
|
"https://example.com",
|
|
script="document.querySelector('.nonexistent').click()"
|
|
)
|
|
|
|
if content.has_script_error:
|
|
print(f"JavaScript error: {content.script_error}")
|
|
else:
|
|
print(f"Result: {content.script_result}")
|
|
```
|
|
|
|
### Graceful Degradation
|
|
|
|
```python
|
|
# Try JavaScript, fall back to static content
|
|
try:
|
|
content = await get(
|
|
"https://dynamic-site.com",
|
|
script="window.dynamicData || 'fallback'"
|
|
)
|
|
|
|
if content.has_script_error:
|
|
# JavaScript failed, but we still have static content
|
|
print(f"Using static content: {content.text[:100]}")
|
|
else:
|
|
print(f"Dynamic data: {content.script_result}")
|
|
|
|
except Exception as e:
|
|
print(f"Complete failure: {e}")
|
|
```
|
|
|
|
## Modern Framework Integration
|
|
|
|
### React Applications
|
|
|
|
```python
|
|
# Extract React component data
|
|
react_script = """
|
|
// Find React root
|
|
const reactRoot = document.querySelector('[data-reactroot]') || document.querySelector('#root');
|
|
|
|
if (window.React && reactRoot) {
|
|
// Get React fiber data (React 16+)
|
|
const fiberKey = Object.keys(reactRoot).find(key => key.startsWith('__reactInternalInstance'));
|
|
|
|
return {
|
|
framework: 'React',
|
|
version: React.version,
|
|
hasRouter: !!window.ReactRouter,
|
|
componentCount: document.querySelectorAll('[data-reactroot] *').length
|
|
};
|
|
}
|
|
|
|
return null;
|
|
"""
|
|
|
|
content = await get("https://react-app.com", script=react_script)
|
|
```
|
|
|
|
### Vue Applications
|
|
|
|
```python
|
|
# Extract Vue app data
|
|
vue_script = """
|
|
if (window.Vue) {
|
|
const app = document.querySelector('#app');
|
|
|
|
return {
|
|
framework: 'Vue',
|
|
version: Vue.version,
|
|
hasRouter: !!window.VueRouter,
|
|
hasVuex: !!window.Vuex,
|
|
rootComponent: app?.__vue__?.$options.name || 'unknown'
|
|
};
|
|
}
|
|
|
|
return null;
|
|
"""
|
|
|
|
content = await get("https://vue-app.com", script=vue_script)
|
|
```
|
|
|
|
### Angular Applications
|
|
|
|
```python
|
|
# Extract Angular app data
|
|
angular_script = """
|
|
if (window.ng) {
|
|
const platform = window.ng.platform || {};
|
|
|
|
return {
|
|
framework: 'Angular',
|
|
version: window.ng.version?.full || 'unknown',
|
|
hasRouter: !!window.ng.router,
|
|
modules: Object.keys(platform).length
|
|
};
|
|
}
|
|
|
|
return null;
|
|
"""
|
|
|
|
content = await get("https://angular-app.com", script=angular_script)
|
|
```
|
|
|
|
## WebContent Integration
|
|
|
|
### Accessing JavaScript Results
|
|
|
|
```python
|
|
content = await get("https://example.com", script="document.title")
|
|
|
|
# JavaScript result is available in WebContent object
|
|
print(f"Script result: {content.script_result}")
|
|
print(f"Has result: {content.has_script_result}")
|
|
print(f"Has error: {content.has_script_error}")
|
|
|
|
# Also access traditional content
|
|
print(f"Title: {content.title}")
|
|
print(f"Text: {content.text[:100]}")
|
|
print(f"Markdown: {content.markdown[:100]}")
|
|
```
|
|
|
|
### Combining Static and Dynamic Data
|
|
|
|
```python
|
|
# Extract both static content and dynamic data
|
|
dynamic_script = """
|
|
({
|
|
dynamicPrice: document.querySelector('.dynamic-price')?.textContent,
|
|
userCount: document.querySelector('.user-count')?.textContent,
|
|
lastUpdated: document.querySelector('.last-updated')?.textContent
|
|
})
|
|
"""
|
|
|
|
content = await get("https://dashboard.com", script=dynamic_script)
|
|
|
|
# Use both static and dynamic content
|
|
analysis = {
|
|
'title': content.title,
|
|
'word_count': content.word_count,
|
|
'reading_time': content.reading_time,
|
|
'dynamic_data': content.script_result
|
|
}
|
|
```
|
|
|
|
## Performance Considerations
|
|
|
|
### Optimize JavaScript Execution
|
|
|
|
```python
|
|
# Lightweight scripts for better performance
|
|
fast_script = "document.title" # Simple, fast
|
|
|
|
# Avoid heavy DOM operations
|
|
slow_script = """
|
|
// This is expensive - avoid if possible
|
|
const allElements = document.querySelectorAll('*');
|
|
return Array.from(allElements).map(el => el.tagName);
|
|
"""
|
|
```
|
|
|
|
### Batch Processing Optimization
|
|
|
|
```python
|
|
# Process in smaller batches for better memory usage
|
|
urls = [f"https://site.com/page/{i}" for i in range(100)]
|
|
|
|
batch_size = 10
|
|
results = []
|
|
|
|
for i in range(0, len(urls), batch_size):
|
|
batch = urls[i:i+batch_size]
|
|
batch_results = await get_many(batch, script="document.title")
|
|
results.extend(batch_results)
|
|
|
|
# Optional: small delay between batches
|
|
await asyncio.sleep(1)
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
### 1. Script Design
|
|
|
|
```python
|
|
# ✅ Good: Simple, focused scripts
|
|
good_script = "document.querySelector('.price').textContent"
|
|
|
|
# ❌ Avoid: Complex scripts that could fail
|
|
bad_script = """
|
|
try {
|
|
const price = document.querySelector('.price').textContent.split('$')[1];
|
|
const discountedPrice = parseFloat(price) * 0.9;
|
|
return `$${discountedPrice.toFixed(2)}`;
|
|
} catch (e) {
|
|
return null;
|
|
}
|
|
"""
|
|
```
|
|
|
|
### 2. Error Handling
|
|
|
|
```python
|
|
# Always check for script errors
|
|
content = await get(url, script=script)
|
|
|
|
if content.has_script_error:
|
|
# Handle the error appropriately
|
|
logging.warning(f"JavaScript error on {url}: {content.script_error}")
|
|
# Use fallback approach
|
|
else:
|
|
# Process successful result
|
|
process_result(content.script_result)
|
|
```
|
|
|
|
### 3. Performance Monitoring
|
|
|
|
```python
|
|
import time
|
|
|
|
start_time = time.time()
|
|
content = await get(url, script=script)
|
|
duration = time.time() - start_time
|
|
|
|
if duration > 10: # If taking too long
|
|
logging.warning(f"Slow JavaScript execution on {url}: {duration:.2f}s")
|
|
```
|
|
|
|
## Common Use Cases
|
|
|
|
### E-commerce Data Extraction
|
|
|
|
```python
|
|
# Extract product information
|
|
product_script = """
|
|
({
|
|
name: document.querySelector('.product-name')?.textContent,
|
|
price: document.querySelector('.price')?.textContent,
|
|
rating: document.querySelector('.rating')?.textContent,
|
|
availability: document.querySelector('.stock-status')?.textContent,
|
|
images: Array.from(document.querySelectorAll('.product-image img')).map(img => img.src)
|
|
})
|
|
"""
|
|
|
|
content = await get("https://shop.com/product/123", script=product_script)
|
|
product_data = content.script_result
|
|
```
|
|
|
|
### Social Media Content
|
|
|
|
```python
|
|
# Extract social media posts (be respectful of terms of service)
|
|
social_script = """
|
|
Array.from(document.querySelectorAll('.post')).slice(0, 10).map(post => ({
|
|
text: post.querySelector('.post-text')?.textContent,
|
|
author: post.querySelector('.author')?.textContent,
|
|
timestamp: post.querySelector('.timestamp')?.textContent,
|
|
likes: post.querySelector('.likes-count')?.textContent
|
|
}))
|
|
"""
|
|
|
|
content = await get("https://social-site.com/feed", script=social_script)
|
|
posts = content.script_result
|
|
```
|
|
|
|
### News and Articles
|
|
|
|
```python
|
|
# Extract article metadata
|
|
article_script = """
|
|
({
|
|
headline: document.querySelector('h1')?.textContent,
|
|
author: document.querySelector('.author')?.textContent,
|
|
publishDate: document.querySelector('.publish-date')?.textContent,
|
|
readingTime: document.querySelector('.reading-time')?.textContent,
|
|
tags: Array.from(document.querySelectorAll('.tag')).map(tag => tag.textContent),
|
|
wordCount: document.querySelector('.article-body')?.textContent.split(' ').length
|
|
})
|
|
"""
|
|
|
|
content = await get("https://news-site.com/article/123", script=article_script)
|
|
```
|
|
|
|
## Integration with AI Workflows
|
|
|
|
### Content Preparation for LLMs
|
|
|
|
```python
|
|
# Extract structured content for AI processing
|
|
ai_script = """
|
|
({
|
|
mainContent: document.querySelector('main')?.textContent,
|
|
headings: Array.from(document.querySelectorAll('h1, h2, h3')).map(h => ({
|
|
level: h.tagName,
|
|
text: h.textContent
|
|
})),
|
|
keyPoints: Array.from(document.querySelectorAll('.highlight, .callout')).map(el => el.textContent),
|
|
metadata: {
|
|
wordCount: document.body.textContent.split(' ').length,
|
|
readingLevel: 'advanced', // Could be calculated
|
|
topics: Array.from(document.querySelectorAll('.topic-tag')).map(tag => tag.textContent)
|
|
}
|
|
})
|
|
"""
|
|
|
|
content = await get("https://technical-blog.com/post", script=ai_script)
|
|
structured_data = content.script_result
|
|
|
|
# Now ready for AI processing
|
|
ai_prompt = f"""
|
|
Analyze this content:
|
|
|
|
Title: {content.title}
|
|
Main Content: {structured_data['mainContent'][:1000]}...
|
|
Key Points: {structured_data['keyPoints']}
|
|
Topics: {structured_data['metadata']['topics']}
|
|
|
|
Provide a summary and key insights.
|
|
"""
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### Common Issues
|
|
|
|
1. **Script Timeout**
|
|
```python
|
|
# Increase timeout for slow scripts
|
|
content = await get(url, script=script, timeout=60)
|
|
```
|
|
|
|
2. **Element Not Found**
|
|
```python
|
|
# Use optional chaining and fallbacks
|
|
safe_script = """
|
|
document.querySelector('.target')?.textContent || 'not found'
|
|
"""
|
|
```
|
|
|
|
3. **JavaScript Not Loaded**
|
|
```python
|
|
# Wait for JavaScript frameworks to load
|
|
content = await get(
|
|
url,
|
|
script="typeof React !== 'undefined' ? React.version : 'React not loaded'",
|
|
wait_for="[data-reactroot]"
|
|
)
|
|
```
|
|
|
|
### Debug Mode
|
|
|
|
```python
|
|
# Enable verbose logging for debugging
|
|
import logging
|
|
logging.basicConfig(level=logging.DEBUG)
|
|
|
|
content = await get(url, script=script)
|
|
```
|
|
|
|
This comprehensive JavaScript API enables Crawailer to handle modern web applications with the same ease as static sites, making it ideal for AI workflows that require rich, accurate content extraction. |