Crawailer Developer 7634f9fc32 Initial commit: JavaScript API enhancement preparation

- Comprehensive test suite (700+ lines) for JS execution in high-level API
- Test coverage analysis and validation infrastructure
- Enhancement proposal and implementation strategy
- Mock HTTP server with realistic JavaScript scenarios
- Parallel implementation strategy using expert agents and git worktrees

Ready for test-driven implementation of JavaScript enhancements.

2025-09-14 21:22:30 -06:00

Enhancement Proposal: JavaScript Execution in High-Level API

Summary

Add optional JavaScript execution capabilities to the high-level API functions (get, get_many, discover) to enable DOM manipulation and dynamic content interaction without requiring direct Browser class usage.

Motivation

Currently, users must drop down to the Browser class to execute JavaScript:

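# Current approach - requires Browser class
from crawailer import Browser, BrowserConfig

browser = Browser(BrowserConfig())
await browser.start()
try:
    # `url` and `script` are supplied by the caller
    result = await browser.execute_script(url, script)
finally:
    # Always release the browser, even if the script raises
    await browser.stop()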

Many common use cases would benefit from JavaScript execution in the convenience API:

- Clicking "Load More" buttons before extraction
- Scrolling to trigger lazy loading
- Extracting computed values from JavaScript
- Interacting with dynamic UI elements

Proposed API Changes

1. Enhanced `get` Function

async def get(
    url: str,
    *,
    wait_for: Optional[str] = None,
    script: Optional[str] = None,  # NEW
    script_before: Optional[str] = None,  # NEW - run before extraction
    script_after: Optional[str] = None,  # NEW - run after extraction
    timeout: int = 30,
    clean: bool = True,
    extract_links: bool = True,
    extract_metadata: bool = True,
) -> WebContent:
    """
    Get content from a single URL with optional JavaScript execution.
    
    Args:
        script: JavaScript to execute before content extraction (alias for script_before;
            result available as content.script_result)
        script_before: JavaScript to execute after page load, before extraction
        script_after: JavaScript to execute after extraction (result recorded separately
            as script_after_result)
    """

2. Enhanced `get_many` Function

async def get_many(
    urls: List[str],
    *,
    script: Optional[Union[str, List[Optional[str]]]] = None,  # NEW
    max_concurrent: int = 5,
    timeout: int = 30,
    **kwargs
) -> List[WebContent]:
    """
    Args:
        script: JavaScript to execute on each page (one string for all URLs, or a
            list with one entry per URL; a None entry runs no script for that URL)
    """

3. Enhanced `discover` Function

async def discover(
    query: str,
    *,
    max_pages: int = 10,
    script: Optional[str] = None,  # NEW - for search results page
    content_script: Optional[str] = None,  # NEW - for each discovered page
    **kwargs
) -> List[WebContent]:
    """
    Args:
        script: JavaScript to execute on search results pages
        content_script: JavaScript to execute on each discovered content page
    """

Usage Examples

Example 1: E-commerce Price Extraction

# Extract dynamic price that loads via JavaScript
content = await web.get(
    "https://shop.example.com/product",
    wait_for=".price-container",
    script="document.querySelector('.final-price').innerText"
)
print(f"Price: {content.script_result}")

Example 2: Infinite Scroll Content

# Scroll to bottom to load all content
content = await web.get(
    "https://infinite-scroll.example.com",
    script_before="""
        (async () => {
            // Scroll to bottom multiple times; `await` is only legal inside an async function
            for (let i = 0; i < 3; i++) {
                window.scrollTo(0, document.body.scrollHeight);
                await new Promise(r => setTimeout(r, 1000));
            }
        })()
    """,
    wait_for=".end-of-content"
)

Example 3: Click to Expand Content

# Click all "Read More" buttons before extraction
content = await web.get(
    "https://blog.example.com/article",
    script_before="""
        document.querySelectorAll('.read-more-btn').forEach(btn => btn.click());
    """
)

Example 4: Batch Processing with Different Scripts

# Different scripts for different URLs
urls = [
    "https://site1.com",  # Needs scrolling
    "https://site2.com",  # Needs button click
    "https://site3.com",  # No script needed
]

scripts = [
    "window.scrollTo(0, document.body.scrollHeight)",
    "document.querySelector('.load-all').click()",
    None
]

results = await web.get_many(urls, script=scripts)

Example 5: Complex Discovery Flow

# Advanced search with pagination
results = await web.discover(
    "machine learning papers",
    script="""
        // Click "Show More Results" on search page
        const moreBtn = document.querySelector('.show-more');
        if(moreBtn) moreBtn.click();
    """,
    content_script="""
        // Expand abstracts on each paper page
        document.querySelector('.expand-abstract')?.click();
    """
)

Implementation Details

WebContent Enhancement

@dataclass
class WebContent:
    # ... existing fields ...
    script_result: Optional[Any] = None  # NEW - result from JavaScript execution
    script_error: Optional[str] = None  # NEW - any JS execution errors
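
With these two fields, a `script_result` of `None` can mean that no script ran, that the script returned `null`, or that it failed, so callers should check `script_error` first. The sketch below illustrates that check with a stand-in `WebContent` that carries only the two proposed fields. The helper `script_value` is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class WebContent:
    """Stand-in with only the two proposed fields; the real class has more."""

    script_result: Optional[Any] = None
    script_error: Optional[str] = None


def script_value(content: WebContent, default: Any = None) -> Any:
    """Return the script result, or `default` if the script failed (hypothetical helper)."""
    if content.script_error is not None:
        return default
    return content.script_result
```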

Browser Method Updates

async def fetch_page(
    self,
    url: str,
    *,
    wait_for: Optional[str] = None,
    script_before: Optional[str] = None,  # NEW
    script_after: Optional[str] = None,  # NEW
    timeout: int = 30,
    stealth: bool = False,
) -> Dict[str, Any]:
    # ... existing code ...
    
    # After page load, before extraction
    if script_before:
        try:
            script_result = await page.evaluate(script_before)
            page_data["script_result"] = script_result
        except Exception as e:
            page_data["script_error"] = str(e)
    
    # ... extraction ...
    
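    # After extraction, if requested
    if script_after:
        try:
            after_result = await page.evaluate(script_after)
            page_data["script_after_result"] = after_result
        except Exception as e:
            # Mirror script_before: record the failure instead of aborting the fetch
            page_data["script_error"] = str(e)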
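
The before and after steps share the same evaluate-and-record pattern, so they could be factored into one helper. The sketch below is one way to do that, exercised against a fake page object so it runs without a browser. `run_script` and `FakePage` are hypothetical names, and keeping only the first error in `script_error` is an assumption.

```python
import asyncio
from typing import Any, Dict, Optional


async def run_script(
    page: Any, script: Optional[str], result_key: str, page_data: Dict[str, Any]
) -> None:
    """Evaluate `script` on `page`, storing the result or recording the error."""
    if not script:
        return
    try:
        page_data[result_key] = await page.evaluate(script)
    except Exception as exc:
        # Assumption: keep the first error so a later failure does not hide it.
        page_data.setdefault("script_error", str(exc))


class FakePage:
    """Test double standing in for a browser page; not part of Crawailer."""

    async def evaluate(self, script: str) -> Any:
        if "throw" in script:
            raise RuntimeError("script failed")
        return 42


async def demo() -> Dict[str, Any]:
    page_data: Dict[str, Any] = {}
    page = FakePage()
    await run_script(page, "1 + 1", "script_result", page_data)
    await run_script(page, "throw new Error('x')", "script_after_result", page_data)
    return page_data


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo the first script succeeds and the second fails, so `page_data` ends up with a `script_result` entry and a `script_error` entry but no `script_after_result`.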

Benefits

- Simplified API: No need to manage Browser instances for common JS tasks
- Backward Compatible: All changes are optional parameters
- Flexible: Supports before/after extraction scripts
- Batch Support: Can apply different scripts to different URLs
- Error Handling: Graceful degradation if scripts fail

Considerations

- Security: Scripts run in page context - users must trust their scripts
- Performance: JavaScript execution adds latency
- Debugging: Script errors should be clearly reported
- Documentation: Need clear examples of common patterns
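
One way to address the performance and debugging points together is to give each script its own time budget and report an overrun as a script error. The sketch below assumes such a budget exists. It is not part of the proposed API, and `evaluate_with_budget` is a hypothetical helper built on the standard `asyncio.wait_for`.

```python
import asyncio
from typing import Any, Awaitable, Dict


async def evaluate_with_budget(evaluation: Awaitable[Any], budget_s: float) -> Dict[str, Any]:
    """Await a script evaluation, converting a timeout into a script_error entry."""
    try:
        return {"script_result": await asyncio.wait_for(evaluation, timeout=budget_s)}
    except asyncio.TimeoutError:
        # Report the overrun in the same field used for other script failures.
        return {"script_error": f"script exceeded {budget_s}s budget"}
```

A caller would pass the pending evaluation, for example `evaluate_with_budget(page.evaluate(script), 5.0)`, and merge the returned dictionary into the page data.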

Alternative Approaches Considered

- Predefined Actions: Instead of raw JS, provide actions like click, scroll, fill
  - Pros: Safer, easier to use
  - Cons: Less flexible, can't cover all cases
- Separate Functions: get_with_script, get_many_with_script
  - Pros: Cleaner separation
  - Cons: API proliferation
- Script Templates: Provide common script templates
  - Pros: Easier for beginners
  - Cons: Maintenance burden
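
To make the Predefined Actions alternative concrete, the sketch below shows how actions such as click, scroll, and fill could be compiled into the same JavaScript strings the `script` parameter accepts. All function names here are hypothetical. Selectors and values are embedded with `json.dumps`, which produces a string literal that is also valid JavaScript, so quotes in a selector cannot break out of the generated script.

```python
import json
from typing import List


def click(selector: str) -> str:
    """JavaScript that clicks every element matching `selector`."""
    return f"document.querySelectorAll({json.dumps(selector)}).forEach(el => el.click());"


def scroll_to_bottom() -> str:
    """JavaScript that scrolls the window to the bottom of the page."""
    return "window.scrollTo(0, document.body.scrollHeight);"


def fill(selector: str, value: str) -> str:
    """JavaScript that sets the value of the first element matching `selector`."""
    sel, val = json.dumps(selector), json.dumps(value)
    return f"{{ const el = document.querySelector({sel}); if (el) el.value = {val}; }}"


def compile_actions(actions: List[str]) -> str:
    """Join action snippets into one script, run in order."""
    return "\n".join(actions)
```

For example, `compile_actions([click(".read-more-btn"), scroll_to_bottom()])` yields a two-statement script that could be passed as `script_before`, which suggests the two approaches can coexist.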

Recommendation

Implement the proposed changes with optional script parameters. This provides maximum flexibility while maintaining backward compatibility. Start with script parameter only, then add script_before/script_after if needed based on user feedback.

Next Steps

1. Update api.py to accept script parameters
2. Modify Browser.fetch_page to execute scripts
3. Update WebContent to include script results
4. Add comprehensive tests for JS execution
5. Update documentation with examples
6. Consider adding script templates as utilities
