crawailer/ENHANCEMENT_JS_API.md
Crawailer Developer 7634f9fc32 Initial commit: JavaScript API enhancement preparation
- Comprehensive test suite (700+ lines) for JS execution in high-level API
- Test coverage analysis and validation infrastructure
- Enhancement proposal and implementation strategy
- Mock HTTP server with realistic JavaScript scenarios
- Parallel implementation strategy using expert agents and git worktrees

Ready for test-driven implementation of JavaScript enhancements.
2025-09-14 21:22:30 -06:00

244 lines
7.0 KiB
Markdown

# Enhancement Proposal: JavaScript Execution in High-Level API
## Summary
Add optional JavaScript execution capabilities to the high-level API functions (`get`, `get_many`, `discover`) to enable DOM manipulation and dynamic content interaction without requiring direct Browser class usage.
## Motivation
Currently, users must drop down to the `Browser` class to execute JavaScript:
```python
# Current approach - requires Browser class
from crawailer import Browser, BrowserConfig
browser = Browser(BrowserConfig())
await browser.start()
result = await browser.execute_script(url, script)
await browser.stop()
```
Many common use cases would benefit from JavaScript execution in the convenience API:
- Clicking "Load More" buttons before extraction
- Scrolling to trigger lazy loading
- Extracting computed values from JavaScript
- Interacting with dynamic UI elements
## Proposed API Changes
### 1. Enhanced `get` Function
```python
async def get(
url: str,
*,
wait_for: Optional[str] = None,
script: Optional[str] = None, # NEW
script_before: Optional[str] = None, # NEW - run before extraction
script_after: Optional[str] = None, # NEW - run after extraction
timeout: int = 30,
clean: bool = True,
extract_links: bool = True,
extract_metadata: bool = True,
) -> WebContent:
"""
Get content from a single URL with optional JavaScript execution.
Args:
script: JavaScript to execute before content extraction (alias for script_before)
script_before: JavaScript to execute after page load, before extraction
script_after: JavaScript to execute after extraction (result available as content.script_result)
"""
```
### 2. Enhanced `get_many` Function
```python
async def get_many(
urls: List[str],
*,
script: Optional[Union[str, List[str]]] = None, # NEW
max_concurrent: int = 5,
timeout: int = 30,
**kwargs
) -> List[WebContent]:
"""
Args:
script: JavaScript to execute on each page (string for all, list for per-URL)
"""
```
### 3. Enhanced `discover` Function
```python
async def discover(
query: str,
*,
max_pages: int = 10,
script: Optional[str] = None, # NEW - for search results page
content_script: Optional[str] = None, # NEW - for each discovered page
**kwargs
) -> List[WebContent]:
"""
Args:
script: JavaScript to execute on search results pages
content_script: JavaScript to execute on each discovered content page
"""
```
## Usage Examples
### Example 1: E-commerce Price Extraction
```python
# Extract dynamic price that loads via JavaScript
content = await web.get(
"https://shop.example.com/product",
wait_for=".price-container",
script="document.querySelector('.final-price').innerText"
)
print(f"Price: {content.script_result}")
```
### Example 2: Infinite Scroll Content
```python
# Scroll to bottom to load all content
content = await web.get(
"https://infinite-scroll.example.com",
script_before="""
// Scroll to bottom multiple times
for(let i = 0; i < 3; i++) {
window.scrollTo(0, document.body.scrollHeight);
await new Promise(r => setTimeout(r, 1000));
}
""",
wait_for=".end-of-content"
)
```
### Example 3: Click to Expand Content
```python
# Click all "Read More" buttons before extraction
content = await web.get(
"https://blog.example.com/article",
script_before="""
document.querySelectorAll('.read-more-btn').forEach(btn => btn.click());
"""
)
```
### Example 4: Batch Processing with Different Scripts
```python
# Different scripts for different URLs
urls = [
"https://site1.com", # Needs scrolling
"https://site2.com", # Needs button click
"https://site3.com", # No script needed
]
scripts = [
"window.scrollTo(0, document.body.scrollHeight)",
"document.querySelector('.load-all').click()",
None
]
results = await web.get_many(urls, script=scripts)
```
### Example 5: Complex Discovery Flow
```python
# Advanced search with pagination
results = await web.discover(
"machine learning papers",
script="""
// Click "Show More Results" on search page
const moreBtn = document.querySelector('.show-more');
if(moreBtn) moreBtn.click();
""",
content_script="""
// Expand abstracts on each paper page
document.querySelector('.expand-abstract')?.click();
"""
)
```
## Implementation Details
### WebContent Enhancement
```python
@dataclass
class WebContent:
# ... existing fields ...
script_result: Optional[Any] = None # NEW - result from JavaScript execution
script_error: Optional[str] = None # NEW - any JS execution errors
```
### Browser Method Updates
```python
async def fetch_page(
self,
url: str,
*,
wait_for: Optional[str] = None,
script_before: Optional[str] = None, # NEW
script_after: Optional[str] = None, # NEW
timeout: int = 30,
stealth: bool = False,
) -> Dict[str, Any]:
# ... existing code ...
# After page load, before extraction
if script_before:
try:
script_result = await page.evaluate(script_before)
page_data["script_result"] = script_result
except Exception as e:
page_data["script_error"] = str(e)
# ... extraction ...
# After extraction if needed
if script_after:
after_result = await page.evaluate(script_after)
page_data["script_after_result"] = after_result
```
## Benefits
1. **Simplified API**: No need to manage Browser instances for common JS tasks
2. **Backward Compatible**: All changes are optional parameters
3. **Flexible**: Supports before/after extraction scripts
4. **Batch Support**: Can apply different scripts to different URLs
5. **Error Handling**: Graceful degradation if scripts fail
## Considerations
1. **Security**: Scripts run in page context - users must trust their scripts
2. **Performance**: JavaScript execution adds latency
3. **Debugging**: Script errors should be clearly reported
4. **Documentation**: Need clear examples of common patterns
## Alternative Approaches Considered
1. **Predefined Actions**: Instead of raw JS, provide actions like `click`, `scroll`, `fill`
- Pros: Safer, easier to use
- Cons: Less flexible, can't cover all cases
2. **Separate Functions**: `get_with_script`, `get_many_with_script`
- Pros: Cleaner separation
- Cons: API proliferation
3. **Script Templates**: Provide common script templates
- Pros: Easier for beginners
- Cons: Maintenance burden
## Recommendation
Implement the proposed changes with optional script parameters. This provides maximum flexibility while maintaining backward compatibility. Start with `script` parameter only, then add `script_before`/`script_after` if needed based on user feedback.
## Next Steps
1. Update `api.py` to accept script parameters
2. Modify `Browser.fetch_page` to execute scripts
3. Update `WebContent` to include script results
4. Add comprehensive tests for JS execution
5. Update documentation with examples
6. Consider adding script templates as utilities