
🕷️ Crawailer

Browser control for robots - Delightful web automation and content extraction

Crawailer is a modern Python library designed for AI agents, automation scripts, and MCP servers that need to interact with the web. It provides a clean, intuitive API for browser control and intelligent content extraction.

Features

  • 🎯 Intuitive API: Simple, predictable functions that just work
  • 🚀 Modern & Fast: Built on Playwright with selectolax for 5-10x faster HTML processing
  • 🤖 AI-Friendly: Optimized outputs for LLMs and structured data extraction
  • 🔧 Flexible: Use as a library, CLI tool, or MCP server
  • 📦 Zero Config: Sensible defaults with optional customization
  • 🎨 Delightful DX: Rich output, helpful errors, progress tracking

🚀 Quick Start

import crawailer as web

# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown)  # Clean, LLM-ready markdown
print(content.text)      # Human-readable text
print(content.title)     # Extracted title

# Batch processing  
results = await web.get_many(["url1", "url2", "url3"])
for result in results:
    print(f"{result.title}: {result.word_count} words")

# Smart discovery
research = await web.discover("AI safety papers", limit=10)
# Returns the most relevant content, not just the first 10 results
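The quick-start examples above rely on a content object that exposes `title`, `text`, `markdown`, and `word_count`. A minimal sketch of such a container follows; the class name and exact field set are assumptions for illustration, not Crawailer's actual class:

```python
from dataclasses import dataclass

# Hypothetical sketch of the object returned by web.get(); the real
# class may differ, but the attributes below match the examples above.
@dataclass
class PageContent:
    url: str
    title: str
    text: str
    markdown: str

    @property
    def word_count(self) -> int:
        # Word count derived from the plain-text view of the page.
        return len(self.text.split())

content = PageContent(
    url="https://example.com",
    title="Example Domain",
    text="This domain is for use in illustrative examples.",
    markdown="# Example Domain\n\nThis domain is for use in illustrative examples.",
)
print(f"{content.title}: {content.word_count} words")
```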

🎯 Design Philosophy

For Robots, By Humans

  • Predictive: Anticipates what you need and provides it
  • Forgiving: Handles errors gracefully with helpful suggestions
  • Efficient: Fast by default, with smart caching and concurrency
  • Composable: Small, focused functions that work well together

Perfect for AI Workflows

  • LLM-Optimized: Clean markdown, structured data, semantic chunking
  • Context-Aware: Extracts relationships and metadata automatically
  • Quality-Focused: Built-in content quality assessment
  • Archive-Ready: Designed for long-term storage and retrieval

📖 Use Cases

AI Agents & LLM Applications

# Research assistant workflow
research = await web.discover("quantum computing breakthroughs")
for paper in research:
    summary = await llm.summarize(paper.markdown)
    insights = await llm.extract_insights(paper.content)

MCP Servers

# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server

server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools

Data Pipeline & Automation

# Monitor competitors
competitors = ["competitor1.com", "competitor2.com"] 
changes = await web.monitor_changes(competitors, check_interval="1h")
for change in changes:
    if change.significance > 0.7:
        await notify_team(change)

🛠️ Installation

# Basic installation
pip install crawailer

# With AI features (semantic search, entity extraction)
pip install crawailer[ai]

# With MCP server capabilities  
pip install crawailer[mcp]

# Everything
pip install crawailer[all]

# Post-install setup (installs Playwright browsers)
crawailer setup

🏗️ Architecture

Crawailer is built on modern, focused libraries:

  • 🎭 Playwright: Reliable browser automation
  • selectolax: 5-10x faster HTML parsing (C-based)
  • 📝 markdownify: Clean HTML→Markdown conversion
  • 🧹 justext: Intelligent content extraction and cleaning
  • 🔄 httpx: Modern async HTTP client
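The core of this stack is a parse-then-convert pipeline: parse the HTML, keep the content-bearing text, and emit markdown. The stdlib-only toy below sketches that flow (handling just `<h1>` and `<p>`); Crawailer itself delegates these steps to selectolax and markdownify, so this is an illustration of the idea, not the library's code:

```python
from html.parser import HTMLParser

# Toy HTML-to-markdown converter illustrating the extract-then-convert
# pipeline: track which content tag we are inside, collect its text,
# and prefix headings with markdown syntax.
class TinyMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = []
        self._tag = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return
        prefix = "# " if self._tag == "h1" else ""
        self.lines.append(prefix + text)

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self._tag = None

html = "<html><body><h1>Example Domain</h1><p>Illustrative text.</p></body></html>"
parser = TinyMarkdown()
parser.feed(html)
markdown = "\n\n".join(parser.lines)
print(markdown)
```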

🤝 Perfect for MCP Projects

MCP servers love Crawailer because it provides:

  • Focused tools: Each function does one thing well
  • Rich outputs: Structured data ready for LLM consumption
  • Smart defaults: Works out of the box with minimal configuration
  • Extensible: Easy to add domain-specific extraction logic

# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }

🎉 What Makes It Delightful

Predictive Intelligence

content = await web.get("blog-post-url")
# Automatically detects it's a blog post
# Extracts: author, date, reading time, topics

product = await web.get("ecommerce-url") 
# Recognizes product page
# Extracts: price, reviews, availability, specs

Beautiful Output

✨ Found 15 high-quality sources
📊 Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs  
📅 Date range: 2023-2024 (recent research)
⚡ Average quality score: 8.7/10
🔍 Key topics: transformers, safety, alignment

Helpful Errors

try:
    content = await web.get("problematic-site.com")
except web.CloudflareProtected:
    # "💡 Try: await web.get(url, stealth=True)"
    content = await web.get("problematic-site.com", stealth=True)
except web.PaywallDetected as e:
    # "🔍 Found archived version: {e.archive_url}"
    content = await web.get(e.archive_url)
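One way to build errors like these is to attach an actionable hint to every exception. The sketch below is hypothetical: the class names come from the snippet above, but the `suggestion` attribute and constructor shapes are assumptions about the design, not Crawailer's actual API:

```python
# Hypothetical sketch of "helpful error" exception classes; the
# `suggestion` field carrying an actionable hint is an assumption,
# not the library's documented interface.
class CrawlError(Exception):
    suggestion: str = ""

class CloudflareProtected(CrawlError):
    suggestion = "💡 Try: await web.get(url, stealth=True)"

class PaywallDetected(CrawlError):
    def __init__(self, archive_url: str):
        super().__init__("Paywall detected")
        self.archive_url = archive_url
        self.suggestion = f"🔍 Found archived version: {archive_url}"

try:
    raise PaywallDetected("https://archive.example/page")
except CrawlError as err:
    print(err.suggestion)
```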

📚 Documentation

🤝 Contributing

We love contributions! Crawailer is designed to be:

  • Easy to extend: Add new content extractors and browser capabilities
  • Well-tested: Comprehensive test suite with real websites
  • Documented: Every feature has examples and use cases

See CONTRIBUTING.md for details.

📄 License

MIT License - see LICENSE for details.


Built with ❤️ for the age of AI agents and automation

Crawailer: Because robots deserve delightful web experiences too 🤖
