
# 🕷️ Crawailer

**Browser control for robots** - Delightful web automation and content extraction

Crawailer is a modern Python library designed for AI agents, automation scripts, and MCP servers that need to interact with the web. It provides a clean, intuitive API for browser control and intelligent content extraction.

## ✨ Features

- **🎯 Intuitive API**: Simple, predictable functions that just work
- **🚀 Modern & Fast**: Built on Playwright with selectolax for 5-10x faster HTML processing
- **🤖 AI-Friendly**: Optimized outputs for LLMs and structured data extraction
- **🔧 Flexible**: Use as a library, CLI tool, or MCP server
- **📦 Zero Config**: Sensible defaults with optional customization
- **🎨 Delightful DX**: Rich output, helpful errors, progress tracking

## 🚀 Quick Start

```python
import crawailer as web

# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown)  # Clean, LLM-ready markdown
print(content.text)      # Human-readable text
print(content.title)     # Extracted title

# Batch processing
results = await web.get_many(["url1", "url2", "url3"])
for result in results:
    print(f"{result.title}: {result.word_count} words")

# Smart discovery
research = await web.discover("AI safety papers", limit=10)
# Returns the most relevant content, not just the first 10 results
```

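The snippets above use top-level `await`, which works in notebooks and other async contexts; in a plain script you can wrap the same calls in `asyncio.run`. A minimal sketch, assuming only the `web.get` API shown above:

```python
import asyncio

import crawailer as web


async def main() -> None:
    # Same call as in the Quick Start, wrapped in an asyncio entry point
    content = await web.get("https://example.com")
    print(content.title)
    print(content.markdown)


if __name__ == "__main__":
    asyncio.run(main())
```
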
## 🎯 Design Philosophy

### For Robots, By Humans

- **Predictive**: Anticipates what you need and provides it
- **Forgiving**: Handles errors gracefully with helpful suggestions
- **Efficient**: Fast by default, with smart caching and concurrency
- **Composable**: Small, focused functions that work well together (see the sketch below)

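"Composable" here mostly means plain Python: functions return rich content objects that chain naturally with ordinary comprehensions. A small sketch reusing only the `web.get_many` call and `word_count` field from the Quick Start (the helper name and threshold are illustrative):

```python
import crawailer as web


async def long_reads(urls: list[str], min_words: int = 1000):
    # Fetch all pages concurrently, then keep only the substantial ones
    results = await web.get_many(urls)
    return [r for r in results if r.word_count >= min_words]
```
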
### Perfect for AI Workflows

- **LLM-Optimized**: Clean markdown, structured data, semantic chunking
- **Context-Aware**: Extracts relationships and metadata automatically
- **Quality-Focused**: Built-in content quality assessment
- **Archive-Ready**: Designed for long-term storage and retrieval

## 📖 Use Cases

### AI Agents & LLM Applications

```python
# Research assistant workflow
research = await web.discover("quantum computing breakthroughs")
for paper in research:
    summary = await llm.summarize(paper.markdown)
    insights = await llm.extract_insights(paper.content)
```

### MCP Servers

```python
# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server

server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools
```

### Data Pipeline & Automation

```python
# Monitor competitors
competitors = ["competitor1.com", "competitor2.com"]
changes = await web.monitor_changes(competitors, check_interval="1h")
for change in changes:
    if change.significance > 0.7:
        await notify_team(change)
```

## 🛠️ Installation

```bash
# Basic installation
pip install crawailer

# With AI features (semantic search, entity extraction)
pip install crawailer[ai]

# With MCP server capabilities
pip install crawailer[mcp]

# Everything
pip install crawailer[all]

# Post-install setup (installs Playwright browsers)
crawailer setup
```

## 🏗️ Architecture

Crawailer is built on modern, focused libraries:

- **🎭 Playwright**: Reliable browser automation
- **⚡ selectolax**: 5-10x faster HTML parsing (C-based)
- **📝 markdownify**: Clean HTML→Markdown conversion
- **🧹 justext**: Intelligent content extraction and cleaning
- **🔄 httpx**: Modern async HTTP client

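To make the division of labor concrete, here is a rough, hand-rolled sketch of how these pieces can be combined; it is illustrative only, not Crawailer's actual internals:

```python
import asyncio

from markdownify import markdownify as md
from playwright.async_api import async_playwright
from selectolax.parser import HTMLParser


async def fetch_and_extract(url: str) -> dict:
    # Playwright renders the page, including JavaScript-driven content
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()

    # selectolax parses the rendered HTML quickly (C-backed parser)
    title = HTMLParser(html).css_first("title")

    # markdownify turns the HTML into LLM-friendly Markdown
    return {
        "title": title.text() if title else "",
        "markdown": md(html),
    }


if __name__ == "__main__":
    print(asyncio.run(fetch_and_extract("https://example.com")))
```
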
## 🤝 Perfect for MCP Projects

MCP servers love Crawailer because it provides:

- **Focused tools**: Each function does one thing well
- **Rich outputs**: Structured data ready for LLM consumption
- **Smart defaults**: Works out of the box with minimal configuration
- **Extensible**: Easy to add domain-specific extraction logic

```python
# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }
```

## 🎉 What Makes It Delightful

### Predictive Intelligence

```python
content = await web.get("blog-post-url")
# Automatically detects it's a blog post
# Extracts: author, date, reading time, topics

product = await web.get("ecommerce-url")
# Recognizes product page
# Extracts: price, reviews, availability, specs
```

### Beautiful Output

```
✨ Found 15 high-quality sources
📊 Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs
📅 Date range: 2023-2024 (recent research)
⚡ Average quality score: 8.7/10
🔍 Key topics: transformers, safety, alignment
```

### Helpful Errors

```python
try:
    content = await web.get("problematic-site.com")
except web.CloudflareProtected:
    # "💡 Try: await web.get(url, stealth=True)"
    content = await web.get("problematic-site.com", stealth=True)
except web.PaywallDetected as e:
    # "🔍 Found archived version: {e.archive_url}"
    content = await web.get(e.archive_url)
```

## 📚 Documentation

- **[Getting Started](docs/getting-started.md)**: Installation and first steps
- **[API Reference](docs/api.md)**: Complete function documentation
- **[MCP Integration](docs/mcp.md)**: Building MCP servers with Crawailer
- **[Examples](examples/)**: Real-world usage patterns
- **[Architecture](docs/architecture.md)**: How Crawailer works internally

## 🤝 Contributing

We love contributions! Crawailer is designed to be:

- **Easy to extend**: Add new content extractors and browser capabilities
- **Well-tested**: Comprehensive test suite with real websites
- **Documented**: Every feature has examples and use cases

See [CONTRIBUTING.md](CONTRIBUTING.md) for details.

## 📄 License

MIT License - see [LICENSE](LICENSE) for details.

---

**Built with ❤️ for the age of AI agents and automation**

*Crawailer: Because robots deserve delightful web experiences too* 🤖✨