
🕷️ Crawailer
Browser control for robots - Delightful web automation and content extraction
Crawailer is a modern Python library designed for AI agents, automation scripts, and MCP servers that need to interact with the web. It provides a clean, intuitive API for browser control and intelligent content extraction.
✨ Features
- 🎯 Intuitive API: Simple, predictable functions that just work
- 🚀 Modern & Fast: Built on Playwright with selectolax for 5-10x faster HTML processing
- 🤖 AI-Friendly: Optimized outputs for LLMs and structured data extraction
- 🔧 Flexible: Use as a library, CLI tool, or MCP server
- 📦 Zero Config: Sensible defaults with optional customization
- 🎨 Delightful DX: Rich output, helpful errors, progress tracking
🚀 Quick Start
import crawailer as web
# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown) # Clean, LLM-ready markdown
print(content.text) # Human-readable text
print(content.title) # Extracted title
# Batch processing
results = await web.get_many(["url1", "url2", "url3"])
for result in results:
    print(f"{result.title}: {result.word_count} words")
# Smart discovery
research = await web.discover("AI safety papers", limit=10)
# Returns the most relevant content, not just the first 10 results
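The snippets above use top-level `await`, which works in a notebook or an async REPL but not in a plain script. Here is a minimal sketch of wrapping the same flow in `asyncio.run`. `Page`, `get`, and `get_many` are hypothetical stand-ins that return canned data (they are not Crawailer's implementation), so the example runs without a browser. With Crawailer installed, you would presumably call `web.get_many` inside `main` instead.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for the result object; not Crawailer's class."""
    title: str
    text: str

    @property
    def word_count(self) -> int:
        return len(self.text.split())


async def get(url: str) -> Page:
    """Stand-in for web.get: returns canned content instead of opening a browser."""
    await asyncio.sleep(0)  # yield to the event loop, as a real fetch would
    return Page(title=f"Page at {url}", text="canned example text")


async def get_many(urls: list[str]) -> list[Page]:
    """Stand-in for web.get_many: fetch concurrently, keep input order."""
    return list(await asyncio.gather(*(get(u) for u in urls)))


async def main() -> list[str]:
    results = await get_many(["https://example.com/a", "https://example.com/b"])
    return [f"{r.title}: {r.word_count} words" for r in results]


# asyncio.run creates the event loop, runs main() to completion, and closes it
lines = asyncio.run(main())
for line in lines:
    print(line)
```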
🎯 Design Philosophy
For Robots, By Humans
- Predictive: Anticipates what you need and provides it
- Forgiving: Handles errors gracefully with helpful suggestions
- Efficient: Fast by default, with smart caching and concurrency
- Composable: Small, focused functions that work well together
Perfect for AI Workflows
- LLM-Optimized: Clean markdown, structured data, semantic chunking
- Context-Aware: Extracts relationships and metadata automatically
- Quality-Focused: Built-in content quality assessment
- Archive-Ready: Designed for long-term storage and retrieval
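To make "semantic chunking" concrete, here is one simple way to split markdown into LLM-sized pieces: one chunk per heading section. This is only an illustrative sketch; `chunk_markdown` is a hypothetical helper, and Crawailer's own chunking may work quite differently.

```python
import re


def chunk_markdown(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading section (illustrative only)."""

    def has_content(section: dict) -> bool:
        return section["heading"] is not None or any(
            line.strip() for line in section["lines"]
        )

    chunks = []
    current = {"heading": None, "lines": []}
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the previous section, if it holds anything
            if has_content(current):
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        chunks.append(current)
    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


doc = "# Intro\nHello world.\n\n## Details\nMore text here.\n"
sections = chunk_markdown(doc)
```

Note that this naive version would also split on `#` lines inside fenced code blocks; a real chunker would need to track fences.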
📖 Use Cases
AI Agents & LLM Applications
# Research assistant workflow
research = await web.discover("quantum computing breakthroughs")
for paper in research:
    summary = await llm.summarize(paper.markdown)
    insights = await llm.extract_insights(paper.content)
MCP Servers
# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server
server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools
Data Pipeline & Automation
# Monitor competitors
competitors = ["competitor1.com", "competitor2.com"]
changes = await web.monitor_changes(competitors, check_interval="1h")
for change in changes:
    if change.significance > 0.7:
        await notify_team(change)
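The example above filters on a `significance` score without saying how such a score could be derived. One plausible approach, sketched here with the standard library's `difflib`, is to score a change as one minus the word-level similarity of the old and new page text. The `significance` function below is a hypothetical illustration, not how Crawailer computes the value.

```python
from difflib import SequenceMatcher


def significance(old_text: str, new_text: str) -> float:
    """Return 0.0 for identical texts, rising toward 1.0 as they diverge."""
    old_words, new_words = old_text.split(), new_text.split()
    # ratio() is 2 * matches / total words across both sequences
    similarity = SequenceMatcher(None, old_words, new_words).ratio()
    return 1.0 - similarity


before = "Pricing starts at $10 per month"
after = "Pricing starts at $25 per month"
score = significance(before, after)
print(f"significance: {score:.2f}")
```

A one-word price change on a short page scores low under this measure, which shows why a production system would likely weight what changed and not only how much.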
🛠️ Installation
# Basic installation
pip install crawailer
# With AI features (semantic search, entity extraction)
pip install crawailer[ai]
# With MCP server capabilities
pip install crawailer[mcp]
# Everything
pip install crawailer[all]
# Post-install setup (installs Playwright browsers)
crawailer setup
🏗️ Architecture
Crawailer is built on modern, focused libraries:
- 🎭 Playwright: Reliable browser automation
- ⚡ selectolax: 5-10x faster HTML parsing (C-based)
- 📝 markdownify: Clean HTML→Markdown conversion
- 🧹 justext: Intelligent content extraction and cleaning
- 🔄 httpx: Modern async HTTP client
🤝 Perfect for MCP Projects
MCP servers love Crawailer because it provides:
- Focused tools: Each function does one thing well
- Rich outputs: Structured data ready for LLM consumption
- Smart defaults: Works out of the box with minimal configuration
- Extensible: Easy to add domain-specific extraction logic
# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }
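The example above uses an `mcp_tool` decorator without defining it. As a rough sketch of the idea, a decorator like that can be a small registry mapping tool names to async functions. Everything here (`TOOLS`, `mcp_tool`, `call_tool`) is a hypothetical stand-in; a real MCP SDK provides its own registration mechanism, and the tool body returns canned data so the sketch runs offline.

```python
import asyncio
import inspect
from typing import Awaitable, Callable

# Hypothetical registry; a real MCP SDK has its own registration mechanism
TOOLS: dict[str, Callable[..., Awaitable]] = {}


def mcp_tool(name: str):
    """Register an async function under a tool name (illustrative stand-in)."""
    def decorator(func):
        if not inspect.iscoroutinefunction(func):
            raise TypeError(f"tool {name!r} must be an async function")
        TOOLS[name] = func
        return func
    return decorator


@mcp_tool("web_research")
async def research_topic(topic: str) -> dict:
    # Canned result so the sketch runs without a browser or network
    return {"topic": topic, "sources": 0, "content": []}


async def call_tool(name: str, **kwargs):
    """Look up a registered tool by name and await it."""
    return await TOOLS[name](**kwargs)


result = asyncio.run(call_tool("web_research", topic="AI safety"))
print(result)
```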
🎉 What Makes It Delightful
Predictive Intelligence
content = await web.get("blog-post-url")
# Automatically detects it's a blog post
# Extracts: author, date, reading time, topics
product = await web.get("ecommerce-url")
# Recognizes product page
# Extracts: price, reviews, availability, specs
Beautiful Output
✨ Found 15 high-quality sources
📊 Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs
📅 Date range: 2023-2024 (recent research)
⚡ Average quality score: 8.7/10
🔍 Key topics: transformers, safety, alignment
Helpful Errors
try:
    content = await web.get("problematic-site.com")
except web.CloudflareProtected as e:
    print(e)  # "💡 Try: await web.get(url, stealth=True)"
except web.PaywallDetected as e:
    print(e)  # "🔍 Found archived version: {e.archive_url}"
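The exception names above suggest errors that carry their own remedy. One way such classes might be shaped is sketched below. `CrawlError`, the constructor arguments, and the message wording are assumptions made for illustration, not Crawailer's actual definitions.

```python
class CrawlError(Exception):
    """Hypothetical base class: an error message plus an optional suggestion."""

    def __init__(self, message: str, suggestion: str = ""):
        super().__init__(message)
        self.suggestion = suggestion

    def __str__(self) -> str:
        base = super().__str__()
        # Append the hint only when one was provided
        return f"{base}\n💡 Try: {self.suggestion}" if self.suggestion else base


class CloudflareProtected(CrawlError):
    def __init__(self, url: str):
        super().__init__(
            f"{url} is behind a Cloudflare challenge",
            suggestion="await web.get(url, stealth=True)",
        )


class PaywallDetected(CrawlError):
    def __init__(self, url: str, archive_url: str):
        super().__init__(f"{url} is paywalled")
        self.archive_url = archive_url  # where an archived copy was found


try:
    raise CloudflareProtected("problematic-site.com")
except CrawlError as e:
    message = str(e)
    print(message)
```

Putting the suggestion on the exception itself means callers can log `str(e)` for humans or read `e.suggestion` and `e.archive_url` programmatically, which is handy when the caller is an agent.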
📚 Documentation
- Getting Started: Installation and first steps
- API Reference: Complete function documentation
- MCP Integration: Building MCP servers with Crawailer
- Examples: Real-world usage patterns
- Architecture: How Crawailer works internally
🤝 Contributing
We love contributions! Crawailer is designed to be:
- Easy to extend: Add new content extractors and browser capabilities
- Well-tested: Comprehensive test suite with real websites
- Documented: Every feature has examples and use cases
See CONTRIBUTING.md for details.
📄 License
MIT License - see LICENSE for details.
Built with ❤️ for the age of AI agents and automation
Crawailer: Because robots deserve delightful web experiences too 🤖✨