
- Comprehensive test suite (700+ lines) for JS execution in high-level API
- Test coverage analysis and validation infrastructure
- Enhancement proposal and implementation strategy
- Mock HTTP server with realistic JavaScript scenarios
- Parallel implementation strategy using expert agents and git worktrees

Ready for test-driven implementation of JavaScript enhancements.

# Enhancement Proposal: JavaScript Execution in High-Level API

## Summary

Add optional JavaScript execution capabilities to the high-level API functions (`get`, `get_many`, `discover`) to enable DOM manipulation and dynamic content interaction without requiring direct `Browser` class usage.

## Motivation

Currently, users must drop down to the `Browser` class to execute JavaScript:

```python
# Current approach - requires Browser class
from crawailer import Browser, BrowserConfig

browser = Browser(BrowserConfig())
await browser.start()
result = await browser.execute_script(url, script)
await browser.stop()
```

Many common use cases would benefit from JavaScript execution in the convenience API:
- Clicking "Load More" buttons before extraction
- Scrolling to trigger lazy loading
- Extracting computed values from JavaScript
- Interacting with dynamic UI elements

## Proposed API Changes

### 1. Enhanced `get` Function

```python
async def get(
    url: str,
    *,
    wait_for: Optional[str] = None,
    script: Optional[str] = None,         # NEW
    script_before: Optional[str] = None,  # NEW - run before extraction
    script_after: Optional[str] = None,   # NEW - run after extraction
    timeout: int = 30,
    clean: bool = True,
    extract_links: bool = True,
    extract_metadata: bool = True,
) -> WebContent:
    """
    Get content from a single URL with optional JavaScript execution.

    Args:
        script: JavaScript to execute before content extraction (alias for script_before)
        script_before: JavaScript to execute after page load, before extraction
        script_after: JavaScript to execute after extraction (result available as content.script_result)
    """
```
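
Because `script` is described as an alias for `script_before`, the implementation needs a rule for what happens when both are passed. The following is a minimal sketch of one possible rule, not part of the proposal itself: the helper name `resolve_script_before` is hypothetical, and rejecting conflicting values is an assumption (silently preferring one of them would be an alternative).

```python
from typing import Optional


def resolve_script_before(
    script: Optional[str] = None,
    script_before: Optional[str] = None,
) -> Optional[str]:
    """Return the script to run before extraction.

    Hypothetical helper: `script` is treated as an alias for `script_before`.
    Passing both with different values is rejected instead of silently
    preferring one of them.
    """
    if script is not None and script_before is not None and script != script_before:
        raise ValueError("Pass either `script` or `script_before`, not both")
    return script_before if script_before is not None else script
```

With this rule, `get(url, script=...)` and `get(url, script_before=...)` behave identically, and an ambiguous call fails early instead of running an unexpected script.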

### 2. Enhanced `get_many` Function

```python
async def get_many(
    urls: List[str],
    *,
    script: Optional[Union[str, List[Optional[str]]]] = None,  # NEW
    max_concurrent: int = 5,
    timeout: int = 30,
    **kwargs
) -> List[WebContent]:
    """
    Args:
        script: JavaScript to execute on each page (string for all, list for per-URL;
            a None entry in the list means no script for that URL)
    """
```
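
The string-or-list form of `script` has to be expanded to one entry per URL before the pages are fetched. A minimal sketch of that step follows; the helper name `normalize_scripts` is hypothetical, and raising on a length mismatch is an assumption about the desired behavior. List entries may be `None`, as in Example 4 below, to skip script execution for a URL.

```python
from typing import List, Optional, Sequence, Union

# Accepted shapes: None, one script for all URLs, or one entry per URL.
ScriptArg = Optional[Union[str, Sequence[Optional[str]]]]


def normalize_scripts(urls: Sequence[str], script: ScriptArg) -> List[Optional[str]]:
    """Expand `script` into a list aligned with `urls` (hypothetical helper)."""
    if script is None or isinstance(script, str):
        # A single value (or no script at all) applies to every URL.
        return [script] * len(urls)
    scripts = list(script)
    if len(scripts) != len(urls):
        raise ValueError(
            f"Got {len(scripts)} scripts for {len(urls)} URLs; lengths must match"
        )
    return scripts
```

Each URL can then be paired with its script using `zip(urls, normalize_scripts(urls, script))`.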

### 3. Enhanced `discover` Function

```python
async def discover(
    query: str,
    *,
    max_pages: int = 10,
    script: Optional[str] = None,          # NEW - for search results page
    content_script: Optional[str] = None,  # NEW - for each discovered page
    **kwargs
) -> List[WebContent]:
    """
    Args:
        script: JavaScript to execute on search results pages
        content_script: JavaScript to execute on each discovered content page
    """
```

## Usage Examples

### Example 1: E-commerce Price Extraction
```python
# Extract dynamic price that loads via JavaScript
content = await web.get(
    "https://shop.example.com/product",
    wait_for=".price-container",
    script="document.querySelector('.final-price').innerText"
)
print(f"Price: {content.script_result}")
```

### Example 2: Infinite Scroll Content
```python
# Scroll to bottom to load all content.
# The loop is wrapped in an async function so that `await` is valid JavaScript.
content = await web.get(
    "https://infinite-scroll.example.com",
    script_before="""
        (async () => {
            // Scroll to bottom multiple times
            for (let i = 0; i < 3; i++) {
                window.scrollTo(0, document.body.scrollHeight);
                await new Promise(r => setTimeout(r, 1000));
            }
        })()
    """,
    wait_for=".end-of-content"
)
```

### Example 3: Click to Expand Content
```python
# Click all "Read More" buttons before extraction
content = await web.get(
    "https://blog.example.com/article",
    script_before="""
        document.querySelectorAll('.read-more-btn').forEach(btn => btn.click());
    """
)
```

### Example 4: Batch Processing with Different Scripts
```python
# Different scripts for different URLs
urls = [
    "https://site1.com",  # Needs scrolling
    "https://site2.com",  # Needs button click
    "https://site3.com",  # No script needed
]

scripts = [
    "window.scrollTo(0, document.body.scrollHeight)",
    "document.querySelector('.load-all').click()",
    None
]

results = await web.get_many(urls, script=scripts)
```

### Example 5: Complex Discovery Flow
```python
# Advanced search with pagination
results = await web.discover(
    "machine learning papers",
    script="""
        // Click "Show More Results" on search page
        const moreBtn = document.querySelector('.show-more');
        if (moreBtn) moreBtn.click();
    """,
    content_script="""
        // Expand abstracts on each paper page
        document.querySelector('.expand-abstract')?.click();
    """
)
```

## Implementation Details

### WebContent Enhancement
```python
@dataclass
class WebContent:
    # ... existing fields ...
    script_result: Optional[Any] = None  # NEW - result from JavaScript execution
    script_error: Optional[str] = None   # NEW - any JS execution errors
```

### Browser Method Updates
```python
async def fetch_page(
    self,
    url: str,
    *,
    wait_for: Optional[str] = None,
    script_before: Optional[str] = None,  # NEW
    script_after: Optional[str] = None,   # NEW
    timeout: int = 30,
    stealth: bool = False,
) -> Dict[str, Any]:
    # ... existing code ...

    # After page load, before extraction
    if script_before:
        try:
            script_result = await page.evaluate(script_before)
            page_data["script_result"] = script_result
        except Exception as e:
            page_data["script_error"] = str(e)

    # ... extraction ...

    # After extraction if needed
    if script_after:
        try:
            after_result = await page.evaluate(script_after)
            page_data["script_after_result"] = after_result
        except Exception as e:
            # Report failures the same way as for script_before
            page_data["script_error"] = str(e)
```
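
The evaluate-and-capture pattern above could be factored into one helper so that both scripts degrade gracefully in the same way. The sketch below is illustrative only: `run_script` and `FakePage` are hypothetical names, and `FakePage` is a stand-in that imitates nothing more than an awaitable `evaluate` method, so the error handling can be exercised without a real browser.

```python
import asyncio
from typing import Any, Optional, Tuple


async def run_script(page: Any, script: Optional[str]) -> Tuple[Any, Optional[str]]:
    """Evaluate `script` on `page`; return (result, error) instead of raising."""
    if not script:
        return None, None
    try:
        return await page.evaluate(script), None
    except Exception as exc:
        # Record the failure as text so the page fetch itself can continue.
        return None, str(exc)


class FakePage:
    """Stand-in for a browser page, used only to exercise the helper."""

    async def evaluate(self, script: str) -> Any:
        if "throw" in script:
            raise RuntimeError("script failed")
        return f"evaluated: {script}"


async def demo():
    page = FakePage()
    ok = await run_script(page, "document.title")
    failed = await run_script(page, "throw new Error('x')")
    skipped = await run_script(page, None)
    return ok, failed, skipped


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Returning the error as a string instead of raising keeps the extracted content available even when a script fails, which is the graceful degradation this proposal aims for.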

## Benefits

1. **Simplified API**: No need to manage Browser instances for common JS tasks
2. **Backward Compatible**: All changes are optional parameters
3. **Flexible**: Supports before/after extraction scripts
4. **Batch Support**: Can apply different scripts to different URLs
5. **Error Handling**: Graceful degradation if scripts fail

## Considerations

1. **Security**: Scripts run in page context - users must trust their scripts
2. **Performance**: JavaScript execution adds latency
3. **Debugging**: Script errors should be clearly reported
4. **Documentation**: Need clear examples of common patterns

## Alternative Approaches Considered

1. **Predefined Actions**: Instead of raw JS, provide actions like `click`, `scroll`, `fill`
   - Pros: Safer, easier to use
   - Cons: Less flexible, can't cover all cases

2. **Separate Functions**: `get_with_script`, `get_many_with_script`
   - Pros: Cleaner separation
   - Cons: API proliferation

3. **Script Templates**: Provide common script templates
   - Pros: Easier for beginners
   - Cons: Maintenance burden
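
To make the first alternative concrete, predefined actions could be compiled into the same JavaScript strings that the `script` parameter accepts. This is a rough sketch under assumptions: the action dictionary shape (`type`, `selector`, `value`) and the function names are invented for illustration, and only three action types are covered.

```python
import json
from typing import Dict, List


def action_to_js(action: Dict[str, str]) -> str:
    """Translate one hypothetical action dict into a JavaScript statement."""
    kind = action["type"]
    if kind == "click":
        # json.dumps yields a safely quoted JavaScript string literal.
        selector = json.dumps(action["selector"])
        return f"document.querySelector({selector})?.click();"
    if kind == "scroll_to_bottom":
        return "window.scrollTo(0, document.body.scrollHeight);"
    if kind == "fill":
        selector = json.dumps(action["selector"])
        value = json.dumps(action["value"])
        return (
            f"(el => {{ if (el) el.value = {value}; }})"
            f"(document.querySelector({selector}));"
        )
    raise ValueError(f"Unsupported action type: {kind!r}")


def actions_to_script(actions: List[Dict[str, str]]) -> str:
    """Join the generated statements into one script, one per line."""
    return "\n".join(action_to_js(action) for action in actions)
```

The sketch also shows the limitation listed under Cons: any interaction outside the supported action types raises `ValueError` and would still need raw JavaScript.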

## Recommendation

Implement the proposed changes with optional script parameters. This provides maximum flexibility while maintaining backward compatibility. Start with the `script` parameter only, then add `script_before`/`script_after` if user feedback shows a need for them.

## Next Steps

1. Update `api.py` to accept script parameters
2. Modify `Browser.fetch_page` to execute scripts
3. Update `WebContent` to include script results
4. Add comprehensive tests for JS execution
5. Update documentation with examples
6. Consider adding script templates as utilities
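
For step 4, tests need pages whose content changes when JavaScript runs. One way to get them is a local mock HTTP server built from the standard library, sketched below. The page content, `MockHandler`, and `start_mock_server` are illustrative assumptions and are not taken from the existing test suite.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# A tiny page whose list grows each time the "Load More" button is clicked.
PAGE = b"""<!doctype html>
<html><body>
<ul id="items"><li>Item 1</li></ul>
<button class="load-more" onclick="addItem()">Load More</button>
<script>
function addItem() {
  const li = document.createElement('li');
  li.textContent = 'Item ' + (document.querySelectorAll('#items li').length + 1);
  document.getElementById('items').appendChild(li);
}
</script>
</body></html>"""


class MockHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the same page for every path.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_mock_server():
    """Start the server on a free local port; return (server, base_url)."""
    server = HTTPServer(("127.0.0.1", 0), MockHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, f"http://127.0.0.1:{server.server_port}/"
```

A test could then pass the returned URL to the proposed `get(url, script=...)` with a script that clicks `.load-more`, and assert that the extracted content contains the additional list item. Call `server.shutdown()` and `server.server_close()` when the test finishes.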