crawailer/docs/JAVASCRIPT_API.md
Crawailer Developer d31395a166 Initial Crawailer implementation with comprehensive JavaScript API
- Complete browser automation with Playwright integration
- High-level API functions: get(), get_many(), discover()
- JavaScript execution support with script parameters
- Content extraction optimized for LLM workflows
- Comprehensive test suite with 18 test files (700+ scenarios)
- Local Caddy test server for reproducible testing
- Performance benchmarking vs Katana crawler
- Complete documentation including JavaScript API guide
- PyPI-ready packaging with professional metadata
- UNIX philosophy: do web scraping exceptionally well
2025-09-18 14:47:59 -06:00

579 lines
15 KiB
Markdown

# Crawailer JavaScript API Documentation
## Overview
Crawailer provides comprehensive JavaScript execution capabilities that enable dynamic content extraction from modern web applications. Unlike traditional HTTP scrapers, Crawailer uses a real browser (Playwright) to execute JavaScript and extract content from single-page applications (SPAs), dynamic sites, and JavaScript-heavy pages.
## Key Features
- **Full JavaScript Execution**: Execute arbitrary JavaScript code using `page.evaluate()`
- **Before/After Script Patterns**: Run scripts before and after content extraction
- **SPA Support**: Handle React, Vue, Angular, and other modern frameworks
- **Dynamic Content**: Extract content that's loaded via AJAX or user interactions
- **Error Handling**: Comprehensive error capture and graceful degradation
- **Performance Monitoring**: Extract timing and memory metrics
- **User Interaction**: Simulate clicks, form submissions, and complex workflows
## Basic Usage
### Simple JavaScript Execution
```python
from crawailer import get
# Extract dynamic content
content = await get(
"https://example.com",
script="document.querySelector('.dynamic-price').innerText"
)
print(f"Price: {content.script_result}")
print(f"Has script result: {content.has_script_result}")
```
### Waiting for Dynamic Content
```python
# Wait for element and extract data
content = await get(
"https://spa-app.com",
script="document.querySelector('.loaded-content').textContent",
wait_for=".loaded-content" # Wait for element to appear
)
```
### Complex JavaScript Operations
```python
# Execute complex JavaScript
complex_script = """
// Scroll to load more content
window.scrollTo(0, document.body.scrollHeight);
// Wait for new content to load
await new Promise(resolve => setTimeout(resolve, 2000));
// Extract all product data
const products = Array.from(document.querySelectorAll('.product')).map(p => ({
name: p.querySelector('.name')?.textContent,
price: p.querySelector('.price')?.textContent,
rating: p.querySelector('.rating')?.textContent
}));
return products;
"""
content = await get("https://ecommerce-site.com", script=complex_script)
products = content.script_result
```
## Advanced Patterns
### Before/After Script Execution
```python
# Execute script before content extraction, then after
content = await get(
"https://dynamic-site.com",
script_before="document.querySelector('.load-more')?.click()",
script_after="document.querySelectorAll('.item').length"
)
if isinstance(content.script_result, dict):
print(f"Triggered loading: {content.script_result['script_before']}")
print(f"Items loaded: {content.script_result['script_after']}")
```
### Form Interaction and Submission
```python
# Fill and submit forms
form_script = """
// Fill login form
document.querySelector('#username').value = 'testuser';
document.querySelector('#password').value = 'testpass';
// Submit form
document.querySelector('#login-form').submit();
// Wait for redirect
await new Promise(resolve => setTimeout(resolve, 3000));
return 'form submitted';
"""
content = await get("https://app.com/login", script=form_script)
```
### Performance Monitoring
```python
# Extract performance metrics
perf_script = """
({
loadTime: performance.timing.loadEventEnd - performance.timing.navigationStart,
domReady: performance.timing.domContentLoadedEventEnd - performance.timing.navigationStart,
resources: performance.getEntriesByType('resource').length,
memory: performance.memory ? {
used: Math.round(performance.memory.usedJSHeapSize / 1024 / 1024),
total: Math.round(performance.memory.totalJSHeapSize / 1024 / 1024)
} : null
})
"""
content = await get("https://example.com", script=perf_script)
metrics = content.script_result
print(f"Load time: {metrics['loadTime']}ms")
```
## Batch Processing
### Same Script for Multiple URLs
```python
from crawailer import get_many
urls = [
"https://site1.com/product/1",
"https://site1.com/product/2",
"https://site1.com/product/3"
]
# Extract price from all products
results = await get_many(
urls,
script="document.querySelector('.price')?.textContent"
)
for result in results:
if result and result.script_result:
print(f"{result.url}: {result.script_result}")
```
### Different Scripts per URL
```python
# Custom script for each URL
urls = ["https://react-app.com", "https://vue-app.com", "https://angular-app.com"]
scripts = [
"window.React ? 'React ' + React.version : 'No React'",
"window.Vue ? 'Vue ' + Vue.version : 'No Vue'",
"window.ng ? 'Angular detected' : 'No Angular'"
]
results = await get_many(urls, script=scripts)
```
## Intelligent Discovery
### Search Result Interaction
```python
from crawailer import discover
# Discover content with JavaScript interaction
results = await discover(
"machine learning tutorials",
script="document.querySelector('.show-more')?.click()",
content_script="document.querySelector('.read-time')?.textContent",
max_pages=5
)
for result in results:
print(f"{result.title} - Reading time: {result.script_result}")
```
### Pagination Handling
```python
# Handle infinite scroll
pagination_script = """
let results = [];
let page = 0;
while (page < 3) { // Load 3 pages
// Scroll to bottom
window.scrollTo(0, document.body.scrollHeight);
// Wait for new content
await new Promise(resolve => setTimeout(resolve, 2000));
// Extract current page items
const items = Array.from(document.querySelectorAll('.item')).map(item =>
item.textContent.trim()
);
results.push(...items);
page++;
}
return results;
"""
content = await get("https://infinite-scroll-site.com", script=pagination_script)
```
## Error Handling
### JavaScript Error Capture
```python
content = await get(
"https://example.com",
script="document.querySelector('.nonexistent').click()"
)
if content.has_script_error:
print(f"JavaScript error: {content.script_error}")
else:
print(f"Result: {content.script_result}")
```
### Graceful Degradation
```python
# Try JavaScript, fall back to static content
try:
content = await get(
"https://dynamic-site.com",
script="window.dynamicData || 'fallback'"
)
if content.has_script_error:
# JavaScript failed, but we still have static content
print(f"Using static content: {content.text[:100]}")
else:
print(f"Dynamic data: {content.script_result}")
except Exception as e:
print(f"Complete failure: {e}")
```
## Modern Framework Integration
### React Applications
```python
# Extract React component data
react_script = """
// Find React root
const reactRoot = document.querySelector('[data-reactroot]') || document.querySelector('#root');
if (window.React && reactRoot) {
// Get React fiber data (React 16+)
const fiberKey = Object.keys(reactRoot).find(key => key.startsWith('__reactInternalInstance'));
return {
framework: 'React',
version: React.version,
hasRouter: !!window.ReactRouter,
componentCount: document.querySelectorAll('[data-reactroot] *').length
};
}
return null;
"""
content = await get("https://react-app.com", script=react_script)
```
### Vue Applications
```python
# Extract Vue app data
vue_script = """
if (window.Vue) {
const app = document.querySelector('#app');
return {
framework: 'Vue',
version: Vue.version,
hasRouter: !!window.VueRouter,
hasVuex: !!window.Vuex,
rootComponent: app?.__vue__?.$options.name || 'unknown'
};
}
return null;
"""
content = await get("https://vue-app.com", script=vue_script)
```
### Angular Applications
```python
# Extract Angular app data
angular_script = """
if (window.ng) {
const platform = window.ng.platform || {};
return {
framework: 'Angular',
version: window.ng.version?.full || 'unknown',
hasRouter: !!window.ng.router,
modules: Object.keys(platform).length
};
}
return null;
"""
content = await get("https://angular-app.com", script=angular_script)
```
## WebContent Integration
### Accessing JavaScript Results
```python
content = await get("https://example.com", script="document.title")
# JavaScript result is available in WebContent object
print(f"Script result: {content.script_result}")
print(f"Has result: {content.has_script_result}")
print(f"Has error: {content.has_script_error}")
# Also access traditional content
print(f"Title: {content.title}")
print(f"Text: {content.text[:100]}")
print(f"Markdown: {content.markdown[:100]}")
```
### Combining Static and Dynamic Data
```python
# Extract both static content and dynamic data
dynamic_script = """
({
dynamicPrice: document.querySelector('.dynamic-price')?.textContent,
userCount: document.querySelector('.user-count')?.textContent,
lastUpdated: document.querySelector('.last-updated')?.textContent
})
"""
content = await get("https://dashboard.com", script=dynamic_script)
# Use both static and dynamic content
analysis = {
'title': content.title,
'word_count': content.word_count,
'reading_time': content.reading_time,
'dynamic_data': content.script_result
}
```
## Performance Considerations
### Optimize JavaScript Execution
```python
# Lightweight scripts for better performance
fast_script = "document.title" # Simple, fast
# Avoid heavy DOM operations
slow_script = """
// This is expensive - avoid if possible
const allElements = document.querySelectorAll('*');
return Array.from(allElements).map(el => el.tagName);
"""
```
### Batch Processing Optimization
```python
# Process in smaller batches for better memory usage
urls = [f"https://site.com/page/{i}" for i in range(100)]
batch_size = 10
results = []
for i in range(0, len(urls), batch_size):
batch = urls[i:i+batch_size]
batch_results = await get_many(batch, script="document.title")
results.extend(batch_results)
# Optional: small delay between batches
await asyncio.sleep(1)
```
## Best Practices
### 1. Script Design
```python
# ✅ Good: Simple, focused scripts
good_script = "document.querySelector('.price').textContent"
# ❌ Avoid: Complex scripts that could fail
bad_script = """
try {
const price = document.querySelector('.price').textContent.split('$')[1];
const discountedPrice = parseFloat(price) * 0.9;
return `$${discountedPrice.toFixed(2)}`;
} catch (e) {
return null;
}
"""
```
### 2. Error Handling
```python
# Always check for script errors
content = await get(url, script=script)
if content.has_script_error:
# Handle the error appropriately
logging.warning(f"JavaScript error on {url}: {content.script_error}")
# Use fallback approach
else:
# Process successful result
process_result(content.script_result)
```
### 3. Performance Monitoring
```python
import time
start_time = time.time()
content = await get(url, script=script)
duration = time.time() - start_time
if duration > 10: # If taking too long
logging.warning(f"Slow JavaScript execution on {url}: {duration:.2f}s")
```
## Common Use Cases
### E-commerce Data Extraction
```python
# Extract product information
product_script = """
({
name: document.querySelector('.product-name')?.textContent,
price: document.querySelector('.price')?.textContent,
rating: document.querySelector('.rating')?.textContent,
availability: document.querySelector('.stock-status')?.textContent,
images: Array.from(document.querySelectorAll('.product-image img')).map(img => img.src)
})
"""
content = await get("https://shop.com/product/123", script=product_script)
product_data = content.script_result
```
### Social Media Content
```python
# Extract social media posts (be respectful of terms of service)
social_script = """
Array.from(document.querySelectorAll('.post')).slice(0, 10).map(post => ({
text: post.querySelector('.post-text')?.textContent,
author: post.querySelector('.author')?.textContent,
timestamp: post.querySelector('.timestamp')?.textContent,
likes: post.querySelector('.likes-count')?.textContent
}))
"""
content = await get("https://social-site.com/feed", script=social_script)
posts = content.script_result
```
### News and Articles
```python
# Extract article metadata
article_script = """
({
headline: document.querySelector('h1')?.textContent,
author: document.querySelector('.author')?.textContent,
publishDate: document.querySelector('.publish-date')?.textContent,
readingTime: document.querySelector('.reading-time')?.textContent,
tags: Array.from(document.querySelectorAll('.tag')).map(tag => tag.textContent),
wordCount: document.querySelector('.article-body')?.textContent.split(' ').length
})
"""
content = await get("https://news-site.com/article/123", script=article_script)
```
## Integration with AI Workflows
### Content Preparation for LLMs
```python
# Extract structured content for AI processing
ai_script = """
({
mainContent: document.querySelector('main')?.textContent,
headings: Array.from(document.querySelectorAll('h1, h2, h3')).map(h => ({
level: h.tagName,
text: h.textContent
})),
keyPoints: Array.from(document.querySelectorAll('.highlight, .callout')).map(el => el.textContent),
metadata: {
wordCount: document.body.textContent.split(' ').length,
readingLevel: 'advanced', // Could be calculated
topics: Array.from(document.querySelectorAll('.topic-tag')).map(tag => tag.textContent)
}
})
"""
content = await get("https://technical-blog.com/post", script=ai_script)
structured_data = content.script_result
# Now ready for AI processing
ai_prompt = f"""
Analyze this content:
Title: {content.title}
Main Content: {structured_data['mainContent'][:1000]}...
Key Points: {structured_data['keyPoints']}
Topics: {structured_data['metadata']['topics']}
Provide a summary and key insights.
"""
```
## Troubleshooting
### Common Issues
1. **Script Timeout**
```python
# Increase timeout for slow scripts
content = await get(url, script=script, timeout=60)
```
2. **Element Not Found**
```python
# Use optional chaining and fallbacks
safe_script = """
document.querySelector('.target')?.textContent || 'not found'
"""
```
3. **JavaScript Not Loaded**
```python
# Wait for JavaScript frameworks to load
content = await get(
url,
script="typeof React !== 'undefined' ? React.version : 'React not loaded'",
wait_for="[data-reactroot]"
)
```
### Debug Mode
```python
# Enable verbose logging for debugging
import logging
logging.basicConfig(level=logging.DEBUG)
content = await get(url, script=script)
```
This comprehensive JavaScript API enables Crawailer to handle modern web applications with the same ease as static sites, making it ideal for AI workflows that require rich, accurate content extraction.