crawailer/ENHANCEMENT_JS_API.md
Crawailer Developer 7634f9fc32 Initial commit: JavaScript API enhancement preparation
- Comprehensive test suite (700+ lines) for JS execution in high-level API
- Test coverage analysis and validation infrastructure
- Enhancement proposal and implementation strategy
- Mock HTTP server with realistic JavaScript scenarios
- Parallel implementation strategy using expert agents and git worktrees

Ready for test-driven implementation of JavaScript enhancements.
2025-09-14 21:22:30 -06:00

Enhancement Proposal: JavaScript Execution in High-Level API

Summary

Add optional JavaScript execution capabilities to the high-level API functions (get, get_many, discover) to enable DOM manipulation and dynamic content interaction without requiring direct Browser class usage.

Motivation

Currently, users must drop down to the Browser class to execute JavaScript:

# Current approach - requires Browser class
from crawailer import Browser, BrowserConfig

browser = Browser(BrowserConfig())
await browser.start()
result = await browser.execute_script(url, script)
await browser.stop()

Many common use cases would benefit from JavaScript execution in the convenience API:

  • Clicking "Load More" buttons before extraction
  • Scrolling to trigger lazy loading
  • Extracting computed values from JavaScript
  • Interacting with dynamic UI elements

Proposed API Changes

1. Enhanced get Function

async def get(
    url: str,
    *,
    wait_for: Optional[str] = None,
    script: Optional[str] = None,  # NEW
    script_before: Optional[str] = None,  # NEW - run before extraction
    script_after: Optional[str] = None,  # NEW - run after extraction
    timeout: int = 30,
    clean: bool = True,
    extract_links: bool = True,
    extract_metadata: bool = True,
) -> WebContent:
    """
    Get content from a single URL with optional JavaScript execution.
    
    Args:
        script: JavaScript to execute before content extraction (alias for script_before)
        script_before: JavaScript to execute after page load, before extraction (result available as content.script_result)
        script_after: JavaScript to execute after extraction (result stored under script_after_result in the page data)
    """
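
Because script is described as an alias for script_before, the two parameters have to be collapsed into one value somewhere before the page is fetched. A minimal sketch of one way to do that follows; `resolve_script_before` is a hypothetical helper, not part of the proposal, and rejecting conflicting values (rather than silently preferring one) is an assumption.

```python
from typing import Optional


def resolve_script_before(
    script: Optional[str] = None,
    script_before: Optional[str] = None,
) -> Optional[str]:
    """Collapse the `script` alias and `script_before` into a single value.

    Hypothetical helper: raises if the caller passes two different scripts,
    since there is no obvious way to decide which one should win.
    """
    if script is not None and script_before is not None and script != script_before:
        raise ValueError("Pass either 'script' or 'script_before', not both")
    # Prefer the explicit name; fall back to the alias (which may be None).
    return script_before if script_before is not None else script
```

With this shape, `get(url, script="...")` and `get(url, script_before="...")` would take the same code path downstream.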

2. Enhanced get_many Function

async def get_many(
    urls: List[str],
    *,
    script: Optional[Union[str, List[Optional[str]]]] = None,  # NEW - list entries may be None
    max_concurrent: int = 5,
    timeout: int = 30,
    **kwargs
) -> List[WebContent]:
    """
    Args:
        script: JavaScript to execute on each page (string for all, list for per-URL)
    """
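
The "string for all, list for per-URL" rule implies a normalization step that turns the script argument into exactly one entry per URL. The sketch below shows one possible version; `scripts_per_url` and the `ScriptArg` alias are hypothetical names, and raising on a length mismatch is an assumption about the desired behavior.

```python
from typing import List, Optional, Union

# Hypothetical alias for the accepted shapes of the `script` argument.
ScriptArg = Optional[Union[str, List[Optional[str]]]]


def scripts_per_url(urls: List[str], script: ScriptArg = None) -> List[Optional[str]]:
    """Return one script (or None) for each URL, in the same order as `urls`."""
    if script is None or isinstance(script, str):
        # A single string (or no script at all) applies to every URL.
        return [script] * len(urls)
    if len(script) != len(urls):
        raise ValueError(f"Got {len(script)} scripts for {len(urls)} URLs")
    return list(script)
```

Failing early on a mismatched list avoids scripts silently running against the wrong page.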

3. Enhanced discover Function

async def discover(
    query: str,
    *,
    max_pages: int = 10,
    script: Optional[str] = None,  # NEW - for search results page
    content_script: Optional[str] = None,  # NEW - for each discovered page
    **kwargs
) -> List[WebContent]:
    """
    Args:
        script: JavaScript to execute on search results pages
        content_script: JavaScript to execute on each discovered content page
    """

Usage Examples

Example 1: E-commerce Price Extraction

# Extract dynamic price that loads via JavaScript
content = await web.get(
    "https://shop.example.com/product",
    wait_for=".price-container",
    script="document.querySelector('.final-price').innerText"
)
print(f"Price: {content.script_result}")

Example 2: Infinite Scroll Content

# Scroll to bottom to load all content
content = await web.get(
    "https://infinite-scroll.example.com",
    script_before="""
        // Scroll to bottom multiple times; the async wrapper makes `await` legal
        (async () => {
            for (let i = 0; i < 3; i++) {
                window.scrollTo(0, document.body.scrollHeight);
                await new Promise(r => setTimeout(r, 1000));
            }
        })()
    """,
    wait_for=".end-of-content"
)

Example 3: Click to Expand Content

# Click all "Read More" buttons before extraction
content = await web.get(
    "https://blog.example.com/article",
    script_before="""
        document.querySelectorAll('.read-more-btn').forEach(btn => btn.click());
    """
)

Example 4: Batch Processing with Different Scripts

# Different scripts for different URLs
urls = [
    "https://site1.com",  # Needs scrolling
    "https://site2.com",  # Needs button click
    "https://site3.com",  # No script needed
]

scripts = [
    "window.scrollTo(0, document.body.scrollHeight)",
    "document.querySelector('.load-all').click()",
    None
]

results = await web.get_many(urls, script=scripts)

Example 5: Complex Discovery Flow

# Advanced search with pagination
results = await web.discover(
    "machine learning papers",
    script="""
        // Click "Show More Results" on search page
        const moreBtn = document.querySelector('.show-more');
        if(moreBtn) moreBtn.click();
    """,
    content_script="""
        // Expand abstracts on each paper page
        document.querySelector('.expand-abstract')?.click();
    """
)

Implementation Details

WebContent Enhancement

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class WebContent:
    # ... existing fields ...
    script_result: Optional[Any] = None  # NEW - result from JavaScript execution
    script_error: Optional[str] = None  # NEW - any JS execution errors
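
To show how the two new fields might be populated, here is a small sketch that copies script outcomes from a page-data dictionary onto a content object. `WebContentSketch` is a stand-in with only a `url` field plus the two new fields (the real WebContent has more), and `from_page_data` is a hypothetical helper; only the `script_result` and `script_error` keys come from this proposal.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class WebContentSketch:
    """Stand-in for WebContent; only the fields needed for the illustration."""
    url: str
    script_result: Optional[Any] = None
    script_error: Optional[str] = None


def from_page_data(url: str, page_data: Dict[str, Any]) -> WebContentSketch:
    """Build the stand-in object, leaving fields as None when no script ran."""
    return WebContentSketch(
        url=url,
        script_result=page_data.get("script_result"),
        script_error=page_data.get("script_error"),
    )
```

Using `dict.get` keeps pages fetched without any script indistinguishable from today's results, apart from the two extra None fields.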

Browser Method Updates

async def fetch_page(
    self,
    url: str,
    *,
    wait_for: Optional[str] = None,
    script_before: Optional[str] = None,  # NEW
    script_after: Optional[str] = None,  # NEW
    timeout: int = 30,
    stealth: bool = False,
) -> Dict[str, Any]:
    # ... existing code ...
    
    # After page load, before extraction
    if script_before:
        try:
            script_result = await page.evaluate(script_before)
            page_data["script_result"] = script_result
        except Exception as e:
            page_data["script_error"] = str(e)
    
    # ... extraction ...
    
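    # After extraction if needed; failures are recorded rather than raised
    if script_after:
        try:
            after_result = await page.evaluate(script_after)
            page_data["script_after_result"] = after_result
        except Exception as e:
            page_data["script_error"] = str(e)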
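
The capture-the-error pattern used for script_before can be factored into one helper and reused for both hooks. The sketch below runs against a fake page so that it is self-contained; `FakePage`, `run_script`, and `demo` are hypothetical, and the only assumption about the real page object is that it exposes an awaitable `evaluate(script)` as in the code above.

```python
import asyncio
from typing import Any, Dict, Optional


class FakePage:
    """Stand-in for a browser page; only `evaluate` is modelled."""

    async def evaluate(self, script: str) -> Any:
        if "throw" in script:
            raise RuntimeError("Simulated JavaScript error")
        return f"evaluated: {script}"


async def run_script(
    page: Any,
    script: Optional[str],
    page_data: Dict[str, Any],
    result_key: str,
) -> None:
    """Evaluate `script` if given; store the result or the error message."""
    if not script:
        return
    try:
        page_data[result_key] = await page.evaluate(script)
    except Exception as exc:
        # Record the failure so the fetch can continue with extraction.
        page_data["script_error"] = str(exc)


async def demo() -> Dict[str, Any]:
    page_data: Dict[str, Any] = {}
    page = FakePage()
    await run_script(page, "document.title", page_data, "script_result")
    await run_script(page, "throw new Error('x')", page_data, "script_after_result")
    return page_data


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

One open design question this leaves visible: a single `script_error` key cannot tell the caller which of the two scripts failed.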

Benefits

  1. Simplified API: No need to manage Browser instances for common JS tasks
  2. Backward Compatible: All changes are optional parameters
  3. Flexible: Supports before/after extraction scripts
  4. Batch Support: Can apply different scripts to different URLs
  5. Error Handling: Graceful degradation if scripts fail

Considerations

  1. Security: Scripts run in page context - users must trust their scripts
  2. Performance: JavaScript execution adds latency
  3. Debugging: Script errors should be clearly reported
  4. Documentation: Need clear examples of common patterns
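
The performance and debugging points could be addressed together by giving each script a time budget and reporting how long it took. This is only a sketch under assumptions: `evaluate_with_budget` is a hypothetical helper, the `elapsed_seconds` key is invented for illustration, and the awaitable passed in stands for whatever call actually evaluates the script.

```python
import asyncio
import time
from typing import Any, Awaitable, Dict


async def evaluate_with_budget(
    evaluation: Awaitable[Any], budget_seconds: float
) -> Dict[str, Any]:
    """Await a script evaluation, but give up after `budget_seconds`."""
    started = time.perf_counter()
    outcome: Dict[str, Any] = {"script_result": None, "script_error": None}
    try:
        outcome["script_result"] = await asyncio.wait_for(
            evaluation, timeout=budget_seconds
        )
    except asyncio.TimeoutError:
        outcome["script_error"] = f"Script exceeded {budget_seconds}s budget"
    except Exception as exc:
        outcome["script_error"] = str(exc)
    # Hypothetical extra field: wall-clock time spent on the script.
    outcome["elapsed_seconds"] = time.perf_counter() - started
    return outcome


async def fast_script() -> str:
    """Simulated evaluation that returns immediately."""
    return "done"


async def slow_script() -> str:
    """Simulated evaluation that takes longer than a tight budget."""
    await asyncio.sleep(0.2)
    return "done"
```

For example, `asyncio.run(evaluate_with_budget(slow_script(), 0.01))` would report a timeout in `script_error` instead of stalling the whole fetch.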

Alternative Approaches Considered

  1. Predefined Actions: Instead of raw JS, provide actions like click, scroll, fill

    • Pros: Safer, easier to use
    • Cons: Less flexible, can't cover all cases
  2. Separate Functions: get_with_script, get_many_with_script

    • Pros: Cleaner separation
    • Cons: API proliferation
  3. Script Templates: Provide common script templates

    • Pros: Easier for beginners
    • Cons: Maintenance burden
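
To make the first alternative concrete, predefined actions could be small objects that compile down to the same JavaScript strings the script parameter accepts. The following is a sketch of that idea only; `Click`, `ScrollToBottom`, and `compile_actions` are hypothetical names and not part of the proposal.

```python
import json
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Click:
    """Click the first element matching a CSS selector, if one exists."""
    selector: str

    def to_js(self) -> str:
        # json.dumps produces an escaped, double-quoted string literal.
        return f"document.querySelector({json.dumps(self.selector)})?.click();"


@dataclass(frozen=True)
class ScrollToBottom:
    """Scroll the window to the bottom of the document."""

    def to_js(self) -> str:
        return "window.scrollTo(0, document.body.scrollHeight);"


Action = Union[Click, ScrollToBottom]


def compile_actions(actions: List[Action]) -> str:
    """Join the actions into one script wrapped in an async function call."""
    body = "\n".join(action.to_js() for action in actions)
    return f"(async () => {{\n{body}\n}})()"
```

A call such as `compile_actions([Click(".load-all"), ScrollToBottom()])` yields a string that could be passed as script, so actions and raw JavaScript would not need separate execution paths.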

Recommendation

Implement the proposed changes with optional script parameters. Raw scripts cover cases that predefined actions cannot, and because every new parameter is optional, existing calls keep working unchanged. Start with the script parameter only, then add script_before/script_after if user feedback shows they are needed.

Next Steps

  1. Update api.py to accept script parameters
  2. Modify Browser.fetch_page to execute scripts
  3. Update WebContent to include script results
  4. Add comprehensive tests for JS execution
  5. Update documentation with examples
  6. Consider adding script templates as utilities