
# 🕷️ Crawailer

**The JavaScript-first web scraper that actually works with modern websites**

> **Finally!** A Python library that handles React, Vue, Angular, and dynamic content without the headaches. When `requests` fails and Selenium feels like overkill, Crawailer delivers clean, AI-ready content extraction with bulletproof JavaScript execution.

```bash
pip install crawailer
```

[PyPI version](https://badge.fury.io/py/crawailer)
[Python support](https://pypi.org/project/crawailer/)

## ✨ Features

- **🎯 JavaScript-First**: Executes real JavaScript on React, Vue, Angular sites (unlike `requests`)
- **⚡ Lightning Fast**: 5-10x faster HTML processing with C-based selectolax
- **🤖 AI-Optimized**: Clean markdown output perfect for LLM training and RAG
- **🔧 Three Ways to Use**: Library, CLI tool, or MCP server - your choice
- **📦 Zero Config**: Works immediately with sensible defaults
- **🧪 Battle-Tested**: 18 comprehensive test suites with 70+ real-world scenarios
- **🎨 Developer Joy**: Rich terminal output, helpful errors, progress tracking

## 🚀 Quick Start

```python
import crawailer as web

# Note: these snippets use top-level `await`. Run them inside an async
# function, or in an async REPL such as `python -m asyncio`.

# Simple content extraction
content = await web.get("https://example.com")
print(content.markdown)  # Clean, LLM-ready markdown
print(content.text)      # Human-readable text
print(content.title)     # Extracted title

# JavaScript execution for dynamic content
content = await web.get(
    "https://spa-app.com",
    script="document.querySelector('.dynamic-price').textContent"
)
print(f"Price: {content.script_result}")

# Batch processing with JavaScript
results = await web.get_many(
    ["url1", "url2", "url3"],
    script="document.title + ' | ' + document.querySelector('.description')?.textContent"
)
for result in results:
    print(f"{result.title}: {result.script_result}")

# Smart discovery with interaction
research = await web.discover(
    "AI safety papers",
    script="document.querySelector('.show-more')?.click()",
    max_pages=10
)
# Returns the most relevant content with enhanced extraction

# Compare: Traditional scraping fails on modern sites
# requests.get("https://react-app.com") → Empty <div id="root"></div>
# Crawailer → Full content + dynamic data
```
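
The Quick Start calls are coroutines, so a plain script needs an event loop to run them. A minimal sketch of that wrapper follows. `fake_get` and `FakeContent` are hypothetical stand-ins for `web.get` and its return value, used here only so the example runs without a browser; they are not part of Crawailer.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeContent:
    """Hypothetical stand-in for the object returned by web.get()."""
    url: str
    title: str
    markdown: str


async def fake_get(url: str) -> FakeContent:
    """Hypothetical stand-in for web.get(); swap in the real call in your code."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return FakeContent(url=url, title="Example Domain", markdown="# Example Domain")


async def main() -> FakeContent:
    # In real code this line would be: content = await web.get("https://example.com")
    content = await fake_get("https://example.com")
    print(content.title)
    return content


if __name__ == "__main__":
    asyncio.run(main())
```

In a notebook or an async REPL the wrapper is unnecessary, since `await` is already allowed at the top level there.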

## 🎯 Design Philosophy

### For Robots, By Humans

- **Predictive**: Anticipates what you need and provides it
- **Forgiving**: Handles errors gracefully with helpful suggestions
- **Efficient**: Fast by default, with smart caching and concurrency
- **Composable**: Small, focused functions that work well together
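
How Crawailer schedules concurrent work internally is not described here. As a general illustration of the "concurrency" point, the sketch below shows one common pattern: capping the number of in-flight fetches with `asyncio.Semaphore`. The names `fetch_all`, `fake_fetch`, and `max_concurrency` are assumptions made for this example.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> List[str]:
    """Run `fetch` over `urls` with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url: str) -> str:
    """Hypothetical fetcher used only for this demo."""
    await asyncio.sleep(0.01)
    return f"content of {url}"
```

A call such as `asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=2))` keeps the result order stable while limiting load on the target site.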

### Perfect for AI Workflows

- **LLM-Optimized**: Clean markdown, structured data, semantic chunking
- **Context-Aware**: Extracts relationships and metadata automatically
- **Quality-Focused**: Built-in content quality assessment
- **Archive-Ready**: Designed for long-term storage and retrieval
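
To make "chunking" concrete, here is a rough sketch of one simple approach: split markdown at headings, then cap each piece at a maximum size. This is an illustration only, not Crawailer's chunking logic; `chunk_markdown` and `max_chars` are hypothetical names. It treats any line starting with `#` as a heading, so `#` comments inside fenced code would also split a chunk.

```python
import re
from typing import List

HEADING = re.compile(r"^#{1,6}\s")


def chunk_markdown(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split markdown at headings, then cap each chunk at `max_chars`."""
    sections: List[List[str]] = [[]]
    for line in markdown.splitlines():
        # Start a new section at each heading (unless the current one is empty)
        if HEADING.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    chunks: List[str] = []
    for lines in sections:
        text = "\n".join(lines).strip()
        # Oversized sections fall back to fixed-size slices
        for start in range(0, len(text), max_chars):
            chunks.append(text[start:start + max_chars])
    return chunks
```

Heading-based splits tend to keep related sentences together, which is usually what a retrieval step wants; the fixed-size fallback only applies to unusually long sections.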

## 📖 Use Cases

### 🤖 AI Agents & LLM Applications

**Problem**: Training data scattered across JavaScript-heavy academic sites

```python
# Research assistant workflow with JavaScript interaction
# (`llm` is a placeholder for your own LLM client)
research = await web.discover(
    "quantum computing breakthroughs",
    script="document.querySelector('.show-abstract')?.click(); return document.querySelector('.full-text')?.textContent"
)

for paper in research:
    # Rich content includes JavaScript-extracted data
    summary = await llm.summarize(paper.markdown)
    # JavaScript execution result; empty if the selector matched nothing
    dynamic_content = paper.script_result or ""
    insights = await llm.extract_insights(paper.content + dynamic_content)
```

### 🛒 E-commerce Price Monitoring

**Problem**: Product prices loaded via AJAX, `requests` sees loading spinners

```python
# Monitor competitor pricing with dynamic content
products = await web.get_many(
    competitor_urls,
    script="return {price: document.querySelector('.price')?.textContent, stock: document.querySelector('.inventory')?.textContent}"
)

for product in products:
    if product.script_result['price'] != cached_price:
        await alert_price_change(product.url, product.script_result)
```
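
The price returned by the script above is raw `textContent`, so comparing it as a string can flag a change when only whitespace or formatting differs. One way to make the comparison more robust is sketched below. `parse_price` and `price_changed` are hypothetical helpers, and the parser assumes US-style formatting (comma thousands separator, dot decimal point).

```python
import re
from decimal import Decimal
from typing import Dict, Optional


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Extract a decimal amount from text such as ' $1,299.99 '."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


def price_changed(url: str, raw_price: Optional[str], cache: Dict[str, Decimal]) -> bool:
    """Update `cache` and report whether the price for `url` differs from the last one seen."""
    price = parse_price(raw_price)
    if price is None:
        return False  # element missing or unparseable: nothing to compare
    previous = cache.get(url)
    cache[url] = price
    return previous is not None and previous != price
```

Keeping the cache keyed by URL means each product is compared against its own last price rather than a single shared value.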

### 🔗 MCP Servers

**Problem**: Claude needs reliable web content extraction tools

```python
# Easy MCP integration (with crawailer[mcp])
from crawailer.mcp import create_mcp_server

server = create_mcp_server()
# Automatically exposes web.get, web.discover, etc. as MCP tools
```

### 📊 Social Media & Content Analysis

**Problem**: Posts and comments load infinitely via JavaScript

```python
# Extract social media discussions with infinite scroll
content = await web.get(
    "https://social-platform.com/topic/ai-safety",
    script="window.scrollTo(0, document.body.scrollHeight); return document.querySelectorAll('.post').length"
)
# Scrolls to the bottom to trigger lazy loading, so the result covers more
# than the initial page load; script_result holds the number of posts found
```
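
A single scroll usually loads only one more batch of posts. To reach the end of a long thread, a common pattern is to keep scrolling until the item count stops growing. The sketch below shows just that control loop. `scroll_until_stable` is a hypothetical helper, and `scroll_and_count` stands for whatever callable scrolls the page once and returns the current item count; `make_fake_page` exists only to demonstrate it without a browser.

```python
from typing import Callable, Iterable


def scroll_until_stable(
    scroll_and_count: Callable[[], int],
    max_rounds: int = 20,
    patience: int = 2,
) -> int:
    """Call `scroll_and_count` until the count stops growing.

    Stops after `patience` consecutive rounds without growth, or after
    `max_rounds` rounds in total, and returns the highest count seen.
    """
    best = 0
    stalled = 0
    for _ in range(max_rounds):
        count = scroll_and_count()
        if count > best:
            best, stalled = count, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return best


def make_fake_page(counts: Iterable[int]) -> Callable[[], int]:
    """Demo helper: returns a callable that replays `counts`, one per call."""
    it = iter(counts)
    return lambda: next(it)
```

The `max_rounds` cap matters on feeds that never end; without it the loop would run for as long as new items keep arriving.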

## 🛠️ Installation

```bash
# Basic installation
pip install crawailer

# With AI features (semantic search, entity extraction)
# (extras are quoted so shells like zsh do not treat the brackets as a glob)
pip install "crawailer[ai]"

# With MCP server capabilities
pip install "crawailer[mcp]"

# Everything
pip install "crawailer[all]"

# Post-install setup (installs Playwright browsers)
crawailer setup
```

## 🏗️ Architecture

Crawailer is built on modern, focused libraries:

- **🎭 Playwright**: Reliable browser automation
- **⚡ selectolax**: 5-10x faster HTML parsing (C-based)
- **📝 markdownify**: Clean HTML→Markdown conversion
- **🧹 justext**: Intelligent content extraction and cleaning
- **🔄 httpx**: Modern async HTTP client

## 🧪 Battle-Tested Quality

Crawailer includes **18 comprehensive test suites** with real-world scenarios:

- **Modern Frameworks**: React, Vue, Angular demos with full JavaScript APIs
- **Mobile Compatibility**: Safari iOS, Chrome Android, responsive designs
- **Production Edge Cases**: Network failures, memory pressure, browser differences
- **Performance Testing**: Stress tests, concurrency, resource management

**Want to contribute?** We welcome PRs with new test scenarios! Our test sites library shows exactly how different frameworks should behave with JavaScript execution.

> 📝 **Future TODO**: Move examples to dedicated repository for community contributions

## 🤝 Perfect for MCP Projects

MCP servers love Crawailer because it provides:

- **Focused tools**: Each function does one thing well
- **Rich outputs**: Structured data ready for LLM consumption
- **Smart defaults**: Works out of the box with minimal configuration
- **Extensible**: Easy to add domain-specific extraction logic

```python
# Example MCP server tool
@mcp_tool("web_research")
async def research_topic(topic: str, depth: str = "comprehensive"):
    results = await web.discover(topic, max_pages=20)
    return {
        "sources": len(results),
        "content": [r.summary for r in results],
        "insights": await analyze_patterns(results)
    }
```

## 🥊 Crawailer vs Traditional Tools

| Challenge | `requests` & HTTP libs | Selenium | **Crawailer** |
|-----------|------------------------|----------|---------------|
| **React/Vue/Angular** | ❌ Empty templates | 🟡 Slow, complex setup | ✅ **Just works** |
| **Dynamic Pricing** | ❌ Shows loading spinner | 🟡 Requires waits/timeouts | ✅ **Intelligent waiting** |
| **JavaScript APIs** | ❌ No access | 🟡 Clunky WebDriver calls | ✅ **Native page.evaluate()** |
| **Speed** | 🟢 100-500ms | ❌ 5-15 seconds | ✅ **2-5 seconds** |
| **Memory** | 🟢 1-5MB | ❌ 200-500MB | 🟡 **100-200MB** |
| **AI-Ready Output** | ❌ Raw HTML | ❌ Raw HTML | ✅ **Clean Markdown** |
| **Developer Experience** | 🟡 Manual parsing | ❌ Complex WebDriver | ✅ **Intuitive API** |

> **The bottom line**: When JavaScript matters, Crawailer delivers. When it doesn't, use `requests`.
>
> 📖 **[See complete tool comparison →](docs/COMPARISON.md)** (includes Scrapy, Playwright, BeautifulSoup, and more)
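
One rough way to tell whether JavaScript "matters" for a page is to check how much visible text its static HTML contains: a single-page-app shell typically has almost none. The sketch below uses only the standard library. `looks_js_rendered` and the 50-character threshold are assumptions for illustration, not part of Crawailer, and the heuristic will misjudge some pages.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts characters of text outside <script>, <style>, and <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_js_rendered(html: str, min_visible_chars: int = 50) -> bool:
    """Return True when the static HTML carries too little text to be the real page."""
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_visible_chars
```

A check like this can run on a cheap `requests` response first, with a browser-based fetch reserved for pages where it returns `True`.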

## 🎉 What Makes It Delightful

### JavaScript-Powered Intelligence

```python
# Dynamic content extraction from SPAs
content = await web.get(
    "https://react-app.com",
    script="window.testData?.framework + ' v' + window.React?.version"
)
# Automatically detects: React application with version info
# Extracts: Dynamic content + framework details

# E-commerce with JavaScript-loaded prices
product = await web.get(
    "https://shop.com/product",
    script="document.querySelector('.dynamic-price')?.textContent",
    wait_for=".price-loaded"
)
# Recognizes product page with dynamic pricing
# Extracts: Real-time price, reviews, availability, specs
```

### Beautiful Output

```
✨ Found 15 high-quality sources
📊 Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs
📅 Date range: 2023-2024 (recent research)
⚡ Average quality score: 8.7/10
🔍 Key topics: transformers, safety, alignment
```

### Helpful Errors

```python
url = "https://problematic-site.com"
try:
    content = await web.get(url)
except web.CloudflareProtected:
    # Error suggests: "💡 Try: await web.get(url, stealth=True)"
    content = await web.get(url, stealth=True)
except web.PaywallDetected as e:
    # Error reports: "🔍 Found archived version: {e.archive_url}"
    content = await web.get(e.archive_url)
```
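
For readers building similar errors into their own extractors, the idea of an exception that carries its own suggestion can be sketched as follows. `ScrapeError` and `BotProtectionError` are hypothetical classes written for this example; they are not Crawailer's exception hierarchy.

```python
class ScrapeError(Exception):
    """Hypothetical base error that carries a human-readable suggestion."""

    def __init__(self, message: str, suggestion: str = "") -> None:
        super().__init__(message)
        self.suggestion = suggestion

    def __str__(self) -> str:
        base = super().__str__()
        # Append the hint only when one was provided
        return f"{base}\n💡 Try: {self.suggestion}" if self.suggestion else base


class BotProtectionError(ScrapeError):
    """Hypothetical subclass for pages behind a bot-protection challenge."""
```

Keeping the suggestion as an attribute, not only in the message, lets calling code act on it programmatically as well as print it.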

## 📚 Documentation

- **[Tool Comparison](docs/COMPARISON.md)**: How Crawailer compares to Scrapy, Selenium, BeautifulSoup, etc.
- **[Getting Started](docs/getting-started.md)**: Installation and first steps
- **[JavaScript API](docs/JAVASCRIPT_API.md)**: Complete JavaScript execution guide
- **[API Reference](docs/API_REFERENCE.md)**: Complete function documentation
- **[Benchmarks](docs/BENCHMARKS.md)**: Performance comparison with other tools
- **[MCP Integration](docs/mcp.md)**: Building MCP servers with Crawailer
- **[Examples](examples/)**: Real-world usage patterns
- **[Architecture](docs/architecture.md)**: How Crawailer works internally

## 🤝 Contributing

We love contributions! Crawailer is designed to be:

- **Easy to extend**: Add new content extractors and browser capabilities
- **Well-tested**: Comprehensive test suite with real websites
- **Documented**: Every feature has examples and use cases

See [CONTRIBUTING.md](CONTRIBUTING.md) for details.

## 📄 License

MIT License - see [LICENSE](LICENSE) for details.

---

## 🚀 Ready to Stop Fighting JavaScript?

```bash
pip install crawailer
crawailer setup  # Install browser engines
```

**Join the revolution**: Stop losing data to `requests.get()` failures. Start extracting **real content** from **real websites** that actually use JavaScript.

⭐ **Star us on GitHub** if Crawailer saves your scraping sanity!

---

**Built with ❤️ for the age of AI agents and automation**

*Crawailer: Because robots deserve delightful web experiences too* 🤖✨