
- Enhanced get() function with script, script_before, script_after parameters
- Enhanced get_many() function with script parameter (str or List[str])
- Enhanced discover() function with script and content_script parameters
- Updated ContentExtractor to populate script fields from page_data
- Maintained 100% backward compatibility
- Added comprehensive parameter validation and error handling
- Implemented script parameter alias support (script -> script_before)
- Added smart script distribution for multi-URL operations
- Enabled two-stage JavaScript execution for discovery workflow

All API functions now support JavaScript execution while preserving existing functionality. The enhancement provides intuitive, optional JavaScript capabilities that integrate seamlessly with the browser automation layer.
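
The alias and multi-URL distribution rules mentioned above are not spelled out here, so the following is only a sketch of one plausible reading. The helper names `resolve_script_before` and `distribute_scripts` are hypothetical and are not part of Crawailer; the choices to reject conflicting values and to pad a short script list with None are assumptions.

```python
from typing import List, Optional, Union


def resolve_script_before(
    script: Optional[str], script_before: Optional[str]
) -> Optional[str]:
    """Treat `script` as an alias for `script_before` (hypothetical helper)."""
    # Assumption: passing two different values is an error, not a silent override.
    if script is not None and script_before is not None and script != script_before:
        raise ValueError("pass either 'script' or 'script_before', not both")
    return script_before if script_before is not None else script


def distribute_scripts(
    urls: List[str], script: Union[str, List[str], None]
) -> List[Optional[str]]:
    """Return exactly one script (or None) per URL (hypothetical helper)."""
    if script is None:
        return [None] * len(urls)
    if isinstance(script, str):
        # A single string is applied to every URL.
        return [script] * len(urls)
    if len(script) > len(urls):
        raise ValueError("more scripts than URLs")
    # Assumption: a shorter list is padded with None, so the remaining URLs run no script.
    return list(script) + [None] * (len(urls) - len(script))
```

With these rules, `distribute_scripts(["u1", "u2", "u3"], ["s1"])` pairs the first URL with `"s1"` and leaves the other two without a script.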
🕷️ Crawailer
Browser control for robots - Delightful web automation and content extraction
Crawailer is a modern Python library designed for AI agents, automation scripts, and MCP servers that need to interact with the web. It provides a clean, intuitive API for browser control and intelligent content extraction.
✨ Features
- 🎯 Intuitive API: Simple, predictable functions that just work
- 🚀 Modern & Fast: Built on Playwright with selectolax for 5-10x faster HTML processing
- 🤖 AI-Friendly: Optimized outputs for LLMs and structured data extraction
- 🔧 Flexible: Use as a library, CLI tool, or MCP server
- 📦 Zero Config: Sensible defaults with optional customization
- 🎨 Delightful DX: Rich output, helpful errors, progress tracking
🚀 Quick Start
import crawailer as web
# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown) # Clean, LLM-ready markdown
print(content.text) # Human-readable text
print(content.title) # Extracted title
# Batch processing
results = await web.get_many(["url1", "url2", "url3"])
for result in results:
    print(f"{result.title}: {result.word_count} words")
# Smart discovery
research = await web.discover("AI safety papers", limit=10)
# Returns the most relevant content, not just the first 10 results
🎯 Design Philosophy
For Robots, By Humans
- Predictive: Anticipates what you need and provides it
- Forgiving: Handles errors gracefully with helpful suggestions
- Efficient: Fast by default, with smart caching and concurrency
- Composable: Small, focused functions that work well together
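
To make "smart caching and concurrency" concrete, here is a minimal standard-library sketch of bounded concurrency with per-URL deduplication. It is not Crawailer's implementation: `fetch_all`, `fake_fetch`, and the `max_concurrent` parameter are hypothetical names chosen for illustration, and the fetch function is a stand-in that only simulates latency.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = 3,
) -> List[str]:
    """Fetch every URL with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    cache: Dict[str, "asyncio.Task[str]"] = {}

    async def bounded(url: str) -> str:
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return await fetch(url)

    def task_for(url: str) -> "asyncio.Task[str]":
        # Duplicate URLs share one task, so each page is fetched only once.
        if url not in cache:
            cache[url] = asyncio.ensure_future(bounded(url))
        return cache[url]

    # gather() returns results in the same order as the input URLs.
    return list(await asyncio.gather(*(task_for(u) for u in urls)))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real page fetch; it only simulates latency."""
    await asyncio.sleep(0.01)
    return f"content of {url}"


def demo() -> List[str]:
    return asyncio.run(fetch_all(["a", "b", "a"], fake_fetch))
```

Calling `demo()` fetches "a" once even though it appears twice in the input, while the result list still has three entries in input order.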
Perfect for AI Workflows
- LLM-Optimized: Clean markdown, structured data, semantic chunking
- Context-Aware: Extracts relationships and metadata automatically
- Quality-Focused: Built-in content quality assessment
- Archive-Ready: Designed for long-term storage and retrieval
📖 Use Cases
AI Agents & LLM Applications
# Research assistant workflow
research = await web.discover("quantum computing breakthroughs")
for paper in research:
    summary = await llm.summarize(paper.markdown)
    insights = await llm.extract_insights(paper.content)
MCP Servers
# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server
server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools
Data Pipeline & Automation
# Monitor competitors
competitors = ["competitor1.com", "competitor2.com"]
changes = await web.monitor_changes(competitors, check_interval="1h")
for change in changes:
    if change.significance > 0.7:
        await notify_team(change)
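
How a `significance` score is computed is not described here. One simple way to get a number between 0 and 1 is to compare two text snapshots word by word, as in the sketch below. `change_significance` is a hypothetical helper built on the standard-library `difflib`, not Crawailer's own scoring.

```python
import difflib


def change_significance(old_text: str, new_text: str) -> float:
    """Return 0.0 for identical snapshots and 1.0 for snapshots sharing no words."""
    old_words = old_text.split()
    new_words = new_text.split()
    # ratio() is 2 * matches / total words, so 1.0 means identical sequences.
    similarity = difflib.SequenceMatcher(None, old_words, new_words).ratio()
    return 1.0 - similarity
```

Under this measure a one-word price change on a short page scores well below the 0.7 threshold used above, while a fully rewritten page scores 1.0.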
🛠️ Installation
# Basic installation
pip install crawailer
# With AI features (semantic search, entity extraction)
pip install "crawailer[ai]"
# With MCP server capabilities
pip install "crawailer[mcp]"
# Everything (extras are quoted so shells like zsh do not expand the brackets)
pip install "crawailer[all]"
# Post-install setup (installs Playwright browsers)
crawailer setup
🏗️ Architecture
Crawailer is built on modern, focused libraries:
- 🎭 Playwright: Reliable browser automation
- ⚡ selectolax: 5-10x faster HTML parsing (C-based)
- 📝 markdownify: Clean HTML→Markdown conversion
- 🧹 justext: Intelligent content extraction and cleaning
- 🔄 httpx: Modern async HTTP client
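
As a rough picture of the extraction stage in this stack, the sketch below turns raw HTML into a title, plain text, and a word count. It deliberately swaps in the standard-library `html.parser` for selectolax and justext so that it runs without extra dependencies; `TextExtractor` and `extract` are hypothetical names, and the real pipeline is likely more sophisticated.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect the <title> and visible text, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.chunks: List[str] = []
        self._stack: List[str] = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIPPED for t in self._stack):
            return
        if "title" in self._stack:
            self.title = text
        else:
            self.chunks.append(text)


def extract(html: str) -> dict:
    """Return the title, visible text, and word count of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return {"title": parser.title, "text": text, "word_count": len(text.split())}
```

The returned keys mirror the `title`, `text`, and `word_count` attributes used in the Quick Start, but the dictionary shape itself is only an illustration.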
🤝 Perfect for MCP Projects
MCP servers love Crawailer because it provides:
- Focused tools: Each function does one thing well
- Rich outputs: Structured data ready for LLM consumption
- Smart defaults: Works out of the box with minimal configuration
- Extensible: Easy to add domain-specific extraction logic
# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }
🎉 What Makes It Delightful
Predictive Intelligence
content = await web.get("blog-post-url")
# Automatically detects it's a blog post
# Extracts: author, date, reading time, topics
product = await web.get("ecommerce-url")
# Recognizes product page
# Extracts: price, reviews, availability, specs
Beautiful Output
✨ Found 15 high-quality sources
📊 Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs
📅 Date range: 2023-2024 (recent research)
⚡ Average quality score: 8.7/10
🔍 Key topics: transformers, safety, alignment
Helpful Errors
try:
    content = await web.get("problematic-site.com")
except web.CloudflareProtected:
    # "💡 Try: await web.get(url, stealth=True)"
    pass
except web.PaywallDetected as e:
    # "🔍 Found archived version: {e.archive_url}"
    pass
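
One way exceptions can carry suggestions like these is to attach a hint to the exception object itself. The classes below are hypothetical stand-ins that reuse the names from the example above; the base class `CrawlError` and the `hint` attribute are assumptions, not Crawailer's actual class hierarchy.

```python
class CrawlError(Exception):
    """Base error that carries a human-readable suggestion (hypothetical)."""

    def __init__(self, message: str, hint: str) -> None:
        super().__init__(message)
        self.hint = hint

    def __str__(self) -> str:
        # Show the problem first, then the suggested next step.
        return f"{self.args[0]}\n💡 {self.hint}"


class CloudflareProtected(CrawlError):
    def __init__(self, url: str) -> None:
        super().__init__(
            f"{url} is behind Cloudflare protection",
            "Try: await web.get(url, stealth=True)",
        )


class PaywallDetected(CrawlError):
    def __init__(self, url: str, archive_url: str) -> None:
        super().__init__(
            f"{url} is paywalled",
            f"Found archived version: {archive_url}",
        )
        # Exposed so callers can retry against the archived copy.
        self.archive_url = archive_url
```

Keeping the hint on the exception lets a caller either print it for a human or act on it programmatically, for example by retrying with `e.archive_url`.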
📚 Documentation
- Getting Started: Installation and first steps
- API Reference: Complete function documentation
- MCP Integration: Building MCP servers with Crawailer
- Examples: Real-world usage patterns
- Architecture: How Crawailer works internally
🤝 Contributing
We love contributions! Crawailer is designed to be:
- Easy to extend: Add new content extractors and browser capabilities
- Well-tested: Comprehensive test suite with real websites
- Documented: Every feature has examples and use cases
See CONTRIBUTING.md for details.
📄 License
MIT License - see LICENSE for details.
Built with ❤️ for the age of AI agents and automation
Crawailer: Because robots deserve delightful web experiences too 🤖✨