Compare commits
8 Commits
feature/js
...
main
Author | SHA1 | Date | |
---|---|---|---|
![]() |
f4088bf607 | ||
![]() |
cffd04e4ca | ||
![]() |
fad0b60b66 | ||
![]() |
3d5a2f3dec | ||
![]() |
21c4a63477 | ||
![]() |
2e1e7d49eb | ||
![]() |
8b0fa8ef77 | ||
![]() |
ad37776018 |
@ -80,4 +80,4 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
|
||||
---
|
||||
|
||||
For more details about changes, see the [commit history](https://github.com/anthropics/crawailer/commits/main).
|
||||
For more details about changes, see the [commit history](https://git.supported.systems/MCP/crawailer/commits/branch/main).
|
90
README.md
90
README.md
@ -1,29 +1,65 @@
|
||||
# 🕷️ Crawailer
|
||||
|
||||
**The JavaScript-first web scraper that actually works with modern websites**
|
||||
**The web scraper that doesn't suck at JavaScript** ✨
|
||||
|
||||
> **Finally!** A Python library that handles React, Vue, Angular, and dynamic content without the headaches. When `requests` fails and Selenium feels like overkill, Crawailer delivers clean, AI-ready content extraction with bulletproof JavaScript execution.
|
||||
> **Stop fighting modern websites.** While `requests` gives you empty `<div id="root"></div>`, Crawailer actually executes JavaScript and extracts real content from React, Vue, and Angular apps. Finally, web scraping that works in 2025.
|
||||
|
||||
> ⚡ **Claude Code's new best friend** - Your AI assistant can now access ANY website
|
||||
|
||||
```python
|
||||
pip install crawailer
|
||||
```
|
||||
|
||||
[](https://badge.fury.io/py/crawailer)
|
||||
[](https://pepy.tech/project/crawailer)
|
||||
[](https://pypi.org/project/crawailer/)
|
||||
|
||||
## ✨ Features
|
||||
## ✨ Why Developers Choose Crawailer
|
||||
|
||||
- **🎯 JavaScript-First**: Executes real JavaScript on React, Vue, Angular sites (unlike `requests`)
|
||||
- **⚡ Lightning Fast**: 5-10x faster HTML processing with C-based selectolax
|
||||
- **🤖 AI-Optimized**: Clean markdown output perfect for LLM training and RAG
|
||||
- **🔧 Three Ways to Use**: Library, CLI tool, or MCP server - your choice
|
||||
- **📦 Zero Config**: Works immediately with sensible defaults
|
||||
- **🧪 Battle-Tested**: 18 comprehensive test suites with 70+ real-world scenarios
|
||||
- **🎨 Developer Joy**: Rich terminal output, helpful errors, progress tracking
|
||||
**🔥 JavaScript That Actually Works**
|
||||
While other tools timeout or crash, Crawailer executes real JavaScript like a human browser
|
||||
|
||||
**⚡ Stupidly Fast**
|
||||
5-10x faster than BeautifulSoup with C-based parsing that doesn't make you wait
|
||||
|
||||
**🤖 AI Assistant Ready**
|
||||
Perfect markdown output that your Claude/GPT/local model will love
|
||||
|
||||
**🎯 Zero Learning Curve**
|
||||
`pip install` → works immediately → no 47-page configuration guides
|
||||
|
||||
**🧪 Production Battle-Tested**
|
||||
18 comprehensive test suites covering every edge case we could think of
|
||||
|
||||
**🎨 Actually Enjoyable**
|
||||
Rich terminal output, helpful errors, progress bars that don't lie
|
||||
|
||||
## 🚀 Quick Start
|
||||
|
||||
> *(Honestly, you probably don't need to read these examples - just ask your AI assistant to figure it out. That's what models are for! But here they are anyway...)*
|
||||
|
||||
### 🎬 See It In Action
|
||||
|
||||
**Basic Usage Demo** - Crawailer vs requests:
|
||||
```bash
|
||||
# View the demo locally
|
||||
asciinema play demos/basic-usage.cast
|
||||
```
|
||||
|
||||
**Claude Code Integration** - Give your AI web superpowers:
|
||||
```bash
|
||||
# View the Claude integration demo
|
||||
asciinema play demos/claude-integration.cast
|
||||
```
|
||||
|
||||
*Don't have asciinema? `pip install asciinema` or run the demos yourself:*
|
||||
```bash
|
||||
# Clone the repo and run demos interactively
|
||||
git clone https://git.supported.systems/MCP/crawailer.git
|
||||
cd crawailer
|
||||
python demo_basic_usage.py
|
||||
python demo_claude_integration.py
|
||||
```
|
||||
|
||||
```python
|
||||
import crawailer as web
|
||||
|
||||
@ -61,6 +97,30 @@ research = await web.discover(
|
||||
# Crawailer → Full content + dynamic data
|
||||
```
|
||||
|
||||
### 🧠 Claude Code MCP Integration
|
||||
|
||||
> *"Hey Claude, go grab that data from the React app"* ← This actually works now
|
||||
|
||||
```python
|
||||
# Add to your Claude Code MCP server
|
||||
from crawailer.mcp import create_mcp_server
|
||||
|
||||
@mcp_tool("web_extract")
|
||||
async def extract_content(url: str, script: str = ""):
|
||||
"""Extract content from any website with optional JavaScript execution"""
|
||||
content = await web.get(url, script=script)
|
||||
return {
|
||||
"title": content.title,
|
||||
"markdown": content.markdown,
|
||||
"script_result": content.script_result,
|
||||
"word_count": content.word_count
|
||||
}
|
||||
|
||||
# 🎉 No more "I can't access that site"
|
||||
# 🎉 No more copy-pasting content manually
|
||||
# 🎉 Your AI can now browse the web like a human
|
||||
```
|
||||
|
||||
## 🎯 Design Philosophy
|
||||
|
||||
### For Robots, By Humans
|
||||
@ -272,16 +332,18 @@ MIT License - see [LICENSE](LICENSE) for details.
|
||||
|
||||
---
|
||||
|
||||
## 🚀 Ready to Stop Fighting JavaScript?
|
||||
## 🚀 Ready to Stop Losing Your Mind?
|
||||
|
||||
```bash
|
||||
pip install crawailer
|
||||
crawailer setup # Install browser engines
|
||||
```
|
||||
|
||||
**Join the revolution**: Stop losing data to `requests.get()` failures. Start extracting **real content** from **real websites** that actually use JavaScript.
|
||||
**Life's too short** for empty `<div>` tags and "JavaScript required" messages.
|
||||
|
||||
⭐ **Star us on GitHub** if Crawailer saves your scraping sanity!
|
||||
Get content that actually exists. From websites that actually work.
|
||||
|
||||
⭐ **Star us if this saves your sanity** → [git.supported.systems/MCP/crawailer](https://git.supported.systems/MCP/crawailer)
|
||||
|
||||
---
|
||||
|
||||
|
58
demo_basic_usage.py
Normal file
58
demo_basic_usage.py
Normal file
@ -0,0 +1,58 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Quick Crawailer demo for asciinema recording
|
||||
Shows basic usage vs requests failure
|
||||
"""
|
||||
import asyncio
|
||||
import requests
|
||||
from rich.console import Console
|
||||
from rich.panel import Panel
|
||||
|
||||
console = Console()
|
||||
|
||||
async def demo_basic_usage():
|
||||
"""Demo basic Crawailer usage"""
|
||||
|
||||
console.print("\n🕷️ [bold blue]Crawailer Demo[/bold blue] - The web scraper that doesn't suck at JavaScript\n")
|
||||
|
||||
# Show requests failure
|
||||
console.print("📉 [red]What happens with requests:[/red]")
|
||||
console.print("```python")
|
||||
console.print("import requests")
|
||||
console.print("response = requests.get('https://react-app.example.com')")
|
||||
console.print("print(response.text)")
|
||||
console.print("```")
|
||||
|
||||
# Simulate requests response
|
||||
console.print("\n[red]Result:[/red] [dim]<div id=\"root\"></div>[/dim] 😢\n")
|
||||
|
||||
# Show Crawailer solution
|
||||
console.print("🚀 [green]What happens with Crawailer:[/green]")
|
||||
console.print("```python")
|
||||
console.print("import crawailer as web")
|
||||
console.print("content = await web.get('https://react-app.example.com')")
|
||||
console.print("print(content.markdown)")
|
||||
console.print("```")
|
||||
|
||||
await asyncio.sleep(2)
|
||||
|
||||
# Simulate Crawailer response
|
||||
console.print("\n[green]Result:[/green]")
|
||||
console.print(Panel.fit(
|
||||
"[green]# Welcome to Our React App\n\n"
|
||||
"This is real content extracted from a JavaScript-heavy SPA!\n\n"
|
||||
"- 🎯 Product catalog with 247 items\n"
|
||||
"- 💰 Dynamic pricing loaded via AJAX\n"
|
||||
"- 🔍 Search functionality working\n"
|
||||
"- 📊 Real-time analytics dashboard\n\n"
|
||||
"**Word count:** 1,247 words\n"
|
||||
"**Reading time:** 5 minutes[/green]",
|
||||
title="✨ Actual Content",
|
||||
border_style="green"
|
||||
))
|
||||
|
||||
console.print("\n⚡ [bold]The difference?[/bold] Crawailer executes JavaScript like a real browser!")
|
||||
console.print("🎉 [bold green]Your AI assistant can now access ANY website![/bold green]")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(demo_basic_usage())
|
76
demo_claude_integration.py
Normal file
76
demo_claude_integration.py
Normal file
@ -0,0 +1,76 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Claude Code MCP integration demo for asciinema recording
|
||||
Shows how easy it is to give Claude web access
|
||||
"""
|
||||
import asyncio
|
||||
from rich.console import Console
|
||||
from rich.panel import Panel
|
||||
from rich.syntax import Syntax
|
||||
|
||||
console = Console()
|
||||
|
||||
async def demo_claude_integration():
|
||||
"""Demo Claude Code MCP integration"""
|
||||
|
||||
console.print("\n🧠 [bold blue]Claude Code MCP Integration Demo[/bold blue]\n")
|
||||
|
||||
console.print("[dim]\"Hey Claude, go grab that data from the React app\"[/dim] ← This actually works now!\n")
|
||||
|
||||
# Show the MCP server code
|
||||
mcp_code = '''# Add to your Claude Code MCP server
|
||||
from crawailer.mcp import create_mcp_server
|
||||
|
||||
@mcp_tool("web_extract")
|
||||
async def extract_content(url: str, script: str = ""):
|
||||
"""Extract content from any website with optional JavaScript execution"""
|
||||
content = await web.get(url, script=script)
|
||||
return {
|
||||
"title": content.title,
|
||||
"markdown": content.markdown,
|
||||
"script_result": content.script_result,
|
||||
"word_count": content.word_count
|
||||
}'''
|
||||
|
||||
syntax = Syntax(mcp_code, "python", theme="monokai", line_numbers=True)
|
||||
console.print(Panel(syntax, title="🔧 MCP Server Setup", border_style="blue"))
|
||||
|
||||
await asyncio.sleep(3)
|
||||
|
||||
# Show Claude conversation
|
||||
console.print("\n💬 [bold]Claude Code Conversation:[/bold]")
|
||||
|
||||
await asyncio.sleep(1)
|
||||
console.print("\n[cyan]You:[/cyan] Claude, can you extract the pricing info from https://shop.example.com/product/123?")
|
||||
|
||||
await asyncio.sleep(2)
|
||||
console.print("\n[green]Claude:[/green] I'll extract that pricing information for you!")
|
||||
console.print("[green]Claude:[/green] [dim]Using web_extract tool...[/dim]")
|
||||
|
||||
await asyncio.sleep(2)
|
||||
console.print("\n[green]Claude:[/green] Here's the pricing information I extracted:")
|
||||
|
||||
# Show extracted content
|
||||
result_panel = Panel.fit(
|
||||
"[green]**Product:** Premium Widget Pro\n"
|
||||
"**Price:** $299.99 (was $399.99)\n"
|
||||
"**Discount:** 25% off - Limited time!\n"
|
||||
"**Stock:** 47 units available\n"
|
||||
"**Shipping:** Free 2-day delivery\n"
|
||||
"**Reviews:** 4.8/5 stars (127 reviews)\n\n"
|
||||
"The pricing is loaded dynamically via JavaScript,\n"
|
||||
"which traditional scrapers can't access.[/green]",
|
||||
title="📊 Extracted Data",
|
||||
border_style="green"
|
||||
)
|
||||
console.print(result_panel)
|
||||
|
||||
console.print("\n🎉 [bold green]Benefits:[/bold green]")
|
||||
console.print(" • No more \"I can't access that site\"")
|
||||
console.print(" • No more copy-pasting content manually")
|
||||
console.print(" • Your AI can now browse the web like a human")
|
||||
|
||||
console.print("\n⚡ [bold]Claude Code + Crawailer = Web superpowers for your AI![/bold]")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(demo_claude_integration())
|
20
demos/basic-usage.cast
Normal file
20
demos/basic-usage.cast
Normal file
@ -0,0 +1,20 @@
|
||||
{"version":3,"term":{"cols":80,"rows":24,"type":"xterm-kitty"},"timestamp":1758237889,"command":"python demo_basic_usage.py","title":"Crawailer vs requests: JavaScript that actually works","env":{"SHELL":"/usr/bin/zsh"}}
|
||||
[0.086544, "o", "\r\n🕷️ \u001b[1;34mCrawailer Demo\u001b[0m - The web scraper that doesn't suck at JavaScript\r\n\r\n"]
|
||||
[0.000139, "o", "📉 \u001b[31mWhat happens with requests:\u001b[0m\r\n"]
|
||||
[0.000067, "o", "```python\r\n"]
|
||||
[0.000063, "o", "import requests\r\n"]
|
||||
[0.000108, "o", "response = \u001b[1;35mrequests.get\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'https://react-app.example.com'\u001b[0m\u001b[1m)\u001b[0m\r\n"]
|
||||
[0.000078, "o", "\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0mresponse.text\u001b[1m)\u001b[0m\r\n"]
|
||||
[0.000044, "o", "```\r\n"]
|
||||
[0.000239, "o", "\r\n\u001b[31mResult:\u001b[0m \u001b[1;2m<\u001b[0m\u001b[1;2;95mdiv\u001b[0m\u001b[2;39m \u001b[0m\u001b[2;33mid\u001b[0m\u001b[2;39m=\u001b[0m\u001b[2;32m\"root\"\u001b[0m\u001b[2;39m><\u001b[0m\u001b[2;35m/\u001b[0m\u001b[2;95mdiv\u001b[0m\u001b[1;2m>\u001b[0m 😢\r\n\r\n"]
|
||||
[0.000106, "o", "🚀 \u001b[32mWhat happens with Crawailer:\u001b[0m\r\n"]
|
||||
[0.000054, "o", "```python\r\n"]
|
||||
[0.000058, "o", "import crawailer as web\r\n"]
|
||||
[0.000092, "o", "content = await \u001b[1;35mweb.get\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'https://react-app.example.com'\u001b[0m\u001b[1m)\u001b[0m\r\n"]
|
||||
[0.000075, "o", "\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0mcontent.markdown\u001b[1m)\u001b[0m\r\n"]
|
||||
[0.000045, "o", "```\r\n"]
|
||||
[2.001535, "o", "\r\n\u001b[32mResult:\u001b[0m\r\n"]
|
||||
[0.000583, "o", "\u001b[32m╭─\u001b[0m\u001b[32m────────────────────\u001b[0m\u001b[32m ✨ Actual Content \u001b[0m\u001b[32m────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m# Welcome to Our React App\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32mThis is real content extracted from a JavaScript-heavy SPA!\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 🎯 Product catalog with 247 items\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 💰 Dynamic pricing loaded via AJAX\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 🔍 Search functionality working\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 📊 Real-time analytics dashboard\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m**Word count:** 1,247 words\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m**Reading time:** 5 minutes\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m╰─────────────────────────────────────────────────────────────╯\u001b[0m\r\n"]
|
||||
[0.000174, "o", "\r\n⚡ \u001b[1mThe difference?\u001b[0m Crawailer executes JavaScript like a real browser!\r\n"]
|
||||
[0.000134, "o", "🎉 \u001b[1;32mYour AI assistant can now access ANY website!\u001b[0m\r\n"]
|
||||
[0.017314, "x", "0"]
|
17
demos/claude-integration.cast
Normal file
17
demos/claude-integration.cast
Normal file
File diff suppressed because one or more lines are too long
@ -115,15 +115,15 @@ all = [
|
||||
]
|
||||
|
||||
[project.urls]
|
||||
Homepage = "https://github.com/anthropics/crawailer"
|
||||
Repository = "https://github.com/anthropics/crawailer"
|
||||
Documentation = "https://github.com/anthropics/crawailer/blob/main/docs/README.md"
|
||||
"Bug Tracker" = "https://github.com/anthropics/crawailer/issues"
|
||||
"Source Code" = "https://github.com/anthropics/crawailer"
|
||||
"API Reference" = "https://github.com/anthropics/crawailer/blob/main/docs/API_REFERENCE.md"
|
||||
"JavaScript Guide" = "https://github.com/anthropics/crawailer/blob/main/docs/JAVASCRIPT_API.md"
|
||||
"Benchmarks" = "https://github.com/anthropics/crawailer/blob/main/docs/BENCHMARKS.md"
|
||||
Changelog = "https://github.com/anthropics/crawailer/releases"
|
||||
Homepage = "https://git.supported.systems/MCP/crawailer"
|
||||
Repository = "https://git.supported.systems/MCP/crawailer"
|
||||
Documentation = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/README.md"
|
||||
"Bug Tracker" = "https://git.supported.systems/MCP/crawailer/issues"
|
||||
"Source Code" = "https://git.supported.systems/MCP/crawailer"
|
||||
"API Reference" = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/API_REFERENCE.md"
|
||||
"JavaScript Guide" = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/JAVASCRIPT_API.md"
|
||||
"Benchmarks" = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/BENCHMARKS.md"
|
||||
Changelog = "https://git.supported.systems/MCP/crawailer/releases"
|
||||
|
||||
[project.scripts]
|
||||
crawailer = "crawailer.cli:main"
|
||||
|
@ -5,7 +5,7 @@ A delightful library for web automation and content extraction,
|
||||
designed for AI agents, MCP servers, and automation scripts.
|
||||
"""
|
||||
|
||||
__version__ = "0.1.0"
|
||||
__version__ = "0.1.1"
|
||||
|
||||
# Core browser control
|
||||
from .browser import Browser
|
||||
|
Loading…
x
Reference in New Issue
Block a user