
🕷️ Crawailer

Browser control for robots - Delightful web automation and content extraction

Crawailer is a modern Python library designed for AI agents, automation scripts, and MCP servers that need to interact with the web. It provides a clean, intuitive API for browser control and intelligent content extraction.

Features

  • 🎯 Intuitive API: Simple, predictable functions that just work
  • 🚀 Modern & Fast: Built on Playwright with selectolax for 5-10x faster HTML processing
  • 🤖 AI-Friendly: Optimized outputs for LLMs and structured data extraction
  • 🔧 Flexible: Use as a library, CLI tool, or MCP server
  • 📦 Zero Config: Sensible defaults with optional customization
  • 🎨 Delightful DX: Rich output, helpful errors, progress tracking

🚀 Quick Start

import crawailer as web

# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown)  # Clean, LLM-ready markdown
print(content.text)      # Human-readable text
print(content.title)     # Extracted title

# Batch processing  
results = await web.get_many(["url1", "url2", "url3"])
for result in results:
    print(f"{result.title}: {result.word_count} words")

# Smart discovery
research = await web.discover("AI safety papers", limit=10)
# Returns the most relevant content, not just the first 10 results
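The quick-start examples above rely on a content object that exposes `title`, `text`, `markdown`, and `word_count`. A minimal sketch of such a container follows; the class name and exact field set are assumptions for illustration, not Crawailer's actual class:

```python
from dataclasses import dataclass

# Hypothetical sketch of the object returned by web.get(); the real
# class may differ, but the attributes below match the examples above.
@dataclass
class PageContent:
    url: str
    title: str
    text: str
    markdown: str

    @property
    def word_count(self) -> int:
        # Word count derived from the plain-text view of the page.
        return len(self.text.split())

content = PageContent(
    url="https://example.com",
    title="Example Domain",
    text="This domain is for use in illustrative examples.",
    markdown="# Example Domain\n\nThis domain is for use in illustrative examples.",
)
print(f"{content.title}: {content.word_count} words")
```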

🎯 Design Philosophy

For Robots, By Humans

  • Predictive: Anticipates what you need and provides it
  • Forgiving: Handles errors gracefully with helpful suggestions
  • Efficient: Fast by default, with smart caching and concurrency
  • Composable: Small, focused functions that work well together

Perfect for AI Workflows

  • LLM-Optimized: Clean markdown, structured data, semantic chunking
  • Context-Aware: Extracts relationships and metadata automatically
  • Quality-Focused: Built-in content quality assessment
  • Archive-Ready: Designed for long-term storage and retrieval

📖 Use Cases

AI Agents & LLM Applications

# Research assistant workflow
research = await web.discover("quantum computing breakthroughs")
for paper in research:
    summary = await llm.summarize(paper.markdown)
    insights = await llm.extract_insights(paper.content)

MCP Servers

# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server

server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools

Data Pipeline & Automation

# Monitor competitors
competitors = ["competitor1.com", "competitor2.com"] 
changes = await web.monitor_changes(competitors, check_interval="1h")
for change in changes:
    if change.significance > 0.7:
        await notify_team(change)

🛠️ Installation

# Basic installation
pip install crawailer

# With AI features (semantic search, entity extraction)
pip install crawailer[ai]

# With MCP server capabilities  
pip install crawailer[mcp]

# Everything
pip install crawailer[all]

# Post-install setup (installs Playwright browsers)
crawailer setup

🏗️ Architecture

Crawailer is built on modern, focused libraries:

  • 🎭 Playwright: Reliable browser automation
  • selectolax: 5-10x faster HTML parsing (C-based)
  • 📝 markdownify: Clean HTML→Markdown conversion
  • 🧹 justext: Intelligent content extraction and cleaning
  • 🔄 httpx: Modern async HTTP client
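The core of this stack is a parse-then-convert pipeline: parse the HTML, keep the content-bearing text, and emit markdown. The stdlib-only toy below sketches that flow (handling just `<h1>` and `<p>`); Crawailer itself delegates these steps to selectolax and markdownify, so this is an illustration of the idea, not the library's code:

```python
from html.parser import HTMLParser

# Toy HTML-to-markdown converter illustrating the extract-then-convert
# pipeline: track which content tag we are inside, collect its text,
# and prefix headings with markdown syntax.
class TinyMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = []
        self._tag = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return
        prefix = "# " if self._tag == "h1" else ""
        self.lines.append(prefix + text)

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self._tag = None

html = "<html><body><h1>Example Domain</h1><p>Illustrative text.</p></body></html>"
parser = TinyMarkdown()
parser.feed(html)
markdown = "\n\n".join(parser.lines)
print(markdown)
```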

🤝 Perfect for MCP Projects

MCP servers love Crawailer because it provides:

  • Focused tools: Each function does one thing well
  • Rich outputs: Structured data ready for LLM consumption
  • Smart defaults: Works out of the box with minimal configuration
  • Extensible: Easy to add domain-specific extraction logic

# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }

🎉 What Makes It Delightful

Predictive Intelligence

content = await web.get("blog-post-url")
# Automatically detects it's a blog post
# Extracts: author, date, reading time, topics

product = await web.get("ecommerce-url") 
# Recognizes product page
# Extracts: price, reviews, availability, specs

Beautiful Output

✨ Found 15 high-quality sources
📊 Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs  
📅 Date range: 2023-2024 (recent research)
⚡ Average quality score: 8.7/10
🔍 Key topics: transformers, safety, alignment

Helpful Errors

try:
    content = await web.get("problematic-site.com")
except web.CloudflareProtected:
    # "💡 Try: await web.get(url, stealth=True)"
    content = await web.get("problematic-site.com", stealth=True)
except web.PaywallDetected as e:
    # "🔍 Found archived version: {e.archive_url}"
    content = await web.get(e.archive_url)
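One way to build errors like these is to attach an actionable hint to every exception. The sketch below is hypothetical: the class names come from the snippet above, but the `suggestion` attribute and constructor shapes are assumptions about the design, not Crawailer's actual API:

```python
# Hypothetical sketch of "helpful error" exception classes; the
# `suggestion` field carrying an actionable hint is an assumption,
# not the library's documented interface.
class CrawlError(Exception):
    suggestion: str = ""

class CloudflareProtected(CrawlError):
    suggestion = "💡 Try: await web.get(url, stealth=True)"

class PaywallDetected(CrawlError):
    def __init__(self, archive_url: str):
        super().__init__("Paywall detected")
        self.archive_url = archive_url
        self.suggestion = f"🔍 Found archived version: {archive_url}"

try:
    raise PaywallDetected("https://archive.example/page")
except CrawlError as err:
    print(err.suggestion)
```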

📚 Documentation

🤝 Contributing

We love contributions! Crawailer is designed to be:

  • Easy to extend: Add new content extractors and browser capabilities
  • Well-tested: Comprehensive test suite with real websites
  • Documented: Every feature has examples and use cases

See CONTRIBUTING.md for details.

📄 License

MIT License - see LICENSE for details.


Built with ❤️ for the age of AI agents and automation

Crawailer: Because robots deserve delightful web experiences too 🤖
