# πŸ•·οΈ Crawailer **Browser control for robots** - Delightful web automation and content extraction Crawailer is a modern Python library designed for AI agents, automation scripts, and MCP servers that need to interact with the web. It provides a clean, intuitive API for browser control and intelligent content extraction. ## ✨ Features - **🎯 Intuitive API**: Simple, predictable functions that just work - **πŸš€ Modern & Fast**: Built on Playwright with selectolax for 5-10x faster HTML processing - **πŸ€– AI-Friendly**: Optimized outputs for LLMs and structured data extraction - **πŸ”§ Flexible**: Use as a library, CLI tool, or MCP server - **πŸ“¦ Zero Config**: Sensible defaults with optional customization - **🎨 Delightful DX**: Rich output, helpful errors, progress tracking ## πŸš€ Quick Start ```python import crawailer as web # Simple content extraction content = await web.get("https://example.com") print(content.markdown) # Clean, LLM-ready markdown print(content.text) # Human-readable text print(content.title) # Extracted title # Batch processing results = await web.get_many(["url1", "url2", "url3"]) for result in results: print(f"{result.title}: {result.word_count} words") # Smart discovery research = await web.discover("AI safety papers", limit=10) # Returns the most relevant content, not just the first 10 results ``` ## 🎯 Design Philosophy ### For Robots, By Humans - **Predictive**: Anticipates what you need and provides it - **Forgiving**: Handles errors gracefully with helpful suggestions - **Efficient**: Fast by default, with smart caching and concurrency - **Composable**: Small, focused functions that work well together ### Perfect for AI Workflows - **LLM-Optimized**: Clean markdown, structured data, semantic chunking - **Context-Aware**: Extracts relationships and metadata automatically - **Quality-Focused**: Built-in content quality assessment - **Archive-Ready**: Designed for long-term storage and retrieval ## πŸ“– Use Cases ### AI Agents & LLM 
Applications ```python # Research assistant workflow research = await web.discover("quantum computing breakthroughs") for paper in research: summary = await llm.summarize(paper.markdown) insights = await llm.extract_insights(paper.content) ``` ### MCP Servers ```python # Easy MCP integration (with crawailer[mcp]) from crawailer.mcp import create_mcp_server server = create_mcp_server() # Automatically exposes web.get, web.discover, etc. as MCP tools ``` ### Data Pipeline & Automation ```python # Monitor competitors competitors = ["competitor1.com", "competitor2.com"] changes = await web.monitor_changes(competitors, check_interval="1h") for change in changes: if change.significance > 0.7: await notify_team(change) ``` ## πŸ› οΈ Installation ```bash # Basic installation pip install crawailer # With AI features (semantic search, entity extraction) pip install crawailer[ai] # With MCP server capabilities pip install crawailer[mcp] # Everything pip install crawailer[all] # Post-install setup (installs Playwright browsers) crawailer setup ``` ## πŸ—οΈ Architecture Crawailer is built on modern, focused libraries: - **🎭 Playwright**: Reliable browser automation - **⚑ selectolax**: 5-10x faster HTML parsing (C-based) - **πŸ“ markdownify**: Clean HTMLβ†’Markdown conversion - **🧹 justext**: Intelligent content extraction and cleaning - **πŸ”„ httpx**: Modern async HTTP client ## 🀝 Perfect for MCP Projects MCP servers love Crawailer because it provides: - **Focused tools**: Each function does one thing well - **Rich outputs**: Structured data ready for LLM consumption - **Smart defaults**: Works out of the box with minimal configuration - **Extensible**: Easy to add domain-specific extraction logic ```python # Example MCP server tool @mcp_tool("web_research") async def research_topic(topic: str, depth: str = "comprehensive"): results = await web.discover(topic, max_pages=20) return { "sources": len(results), "content": [r.summary for r in results], "insights": await 
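A batch helper in the spirit of `web.get_many` needs bounded concurrency so large batches don't open too many pages at once. The sketch below is a stdlib-only illustration of that pattern, not Crawailer's actual implementation; `fetch_one` is a hypothetical stand-in for a real browser fetch:

```python
import asyncio

async def fetch_one(url: str) -> dict:
    # Placeholder for a real browser fetch (e.g. a Playwright page.goto);
    # here we just simulate latency and return a stub result.
    await asyncio.sleep(0.01)
    return {"url": url, "title": f"Title of {url}"}

async def get_many(urls: list[str], max_concurrency: int = 5) -> list[dict]:
    # A semaphore caps how many fetches run at once.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> dict:
        async with sem:
            return await fetch_one(url)

    # gather preserves input order, keeping results aligned with urls.
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(get_many(["https://a.example", "https://b.example"]))
print([r["title"] for r in results])
```

Because `asyncio.gather` returns results in input order, callers can zip results back against their URL list without bookkeeping.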
## πŸŽ‰ What Makes It Delightful

### Predictive Intelligence

```python
content = await web.get("blog-post-url")
# Automatically detects it's a blog post
# Extracts: author, date, reading time, topics

product = await web.get("ecommerce-url")
# Recognizes a product page
# Extracts: price, reviews, availability, specs
```

### Beautiful Output

```
✨ Found 15 high-quality sources
πŸ“Š Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs
πŸ“… Date range: 2023-2024 (recent research)
⚑ Average quality score: 8.7/10
πŸ” Key topics: transformers, safety, alignment
```

### Helpful Errors

```python
try:
    content = await web.get("problematic-site.com")
except web.CloudflareProtected:
    ...  # "πŸ’‘ Try: await web.get(url, stealth=True)"
except web.PaywallDetected as e:
    ...  # "πŸ” Found archived version: {e.archive_url}"
```

## πŸ“š Documentation

- **[Getting Started](docs/getting-started.md)**: Installation and first steps
- **[API Reference](docs/api.md)**: Complete function documentation
- **[MCP Integration](docs/mcp.md)**: Building MCP servers with Crawailer
- **[Examples](examples/)**: Real-world usage patterns
- **[Architecture](docs/architecture.md)**: How Crawailer works internally

## 🀝 Contributing

We love contributions! Crawailer is designed to be:

- **Easy to extend**: Add new content extractors and browser capabilities
- **Well-tested**: Comprehensive test suite exercised against real websites
- **Documented**: Every feature has examples and use cases

See [CONTRIBUTING.md](CONTRIBUTING.md) for details.

## πŸ“„ License

MIT License - see [LICENSE](LICENSE) for details.

---

**Built with ❀️ for the age of AI agents and automation**

*Crawailer: Because robots deserve delightful web experiences too* πŸ€–βœ¨