Compare commits
8 Commits: feature/js ... main

| Author | SHA1 | Date |
| --- | --- | --- |
|  | f4088bf607 |  |
|  | cffd04e4ca |  |
|  | fad0b60b66 |  |
|  | 3d5a2f3dec |  |
|  | 21c4a63477 |  |
|  | 2e1e7d49eb |  |
|  | 8b0fa8ef77 |  |
|  | ad37776018 |  |
@@ -80,4 +80,4 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

-For more details about changes, see the [commit history](https://github.com/anthropics/crawailer/commits/main).
+For more details about changes, see the [commit history](https://git.supported.systems/MCP/crawailer/commits/branch/main).
90  README.md

@@ -1,29 +1,65 @@
 # 🕷️ Crawailer

-**The JavaScript-first web scraper that actually works with modern websites**
+**The web scraper that doesn't suck at JavaScript** ✨

-> **Finally!** A Python library that handles React, Vue, Angular, and dynamic content without the headaches. When `requests` fails and Selenium feels like overkill, Crawailer delivers clean, AI-ready content extraction with bulletproof JavaScript execution.
+> **Stop fighting modern websites.** While `requests` gives you empty `<div id="root"></div>`, Crawailer actually executes JavaScript and extracts real content from React, Vue, and Angular apps. Finally, web scraping that works in 2025.
+
+> ⚡ **Claude Code's new best friend** - Your AI assistant can now access ANY website

 ```python
 pip install crawailer
 ```

 [](https://badge.fury.io/py/crawailer)
-[](https://pepy.tech/project/crawailer)
 [](https://pypi.org/project/crawailer/)

-## ✨ Features
+## ✨ Why Developers Choose Crawailer

-- **🎯 JavaScript-First**: Executes real JavaScript on React, Vue, Angular sites (unlike `requests`)
-- **⚡ Lightning Fast**: 5-10x faster HTML processing with C-based selectolax
-- **🤖 AI-Optimized**: Clean markdown output perfect for LLM training and RAG
-- **🔧 Three Ways to Use**: Library, CLI tool, or MCP server - your choice
-- **📦 Zero Config**: Works immediately with sensible defaults
-- **🧪 Battle-Tested**: 18 comprehensive test suites with 70+ real-world scenarios
-- **🎨 Developer Joy**: Rich terminal output, helpful errors, progress tracking
+**🔥 JavaScript That Actually Works**
+While other tools timeout or crash, Crawailer executes real JavaScript like a human browser
+
+**⚡ Stupidly Fast**
+5-10x faster than BeautifulSoup with C-based parsing that doesn't make you wait
+
+**🤖 AI Assistant Ready**
+Perfect markdown output that your Claude/GPT/local model will love
+
+**🎯 Zero Learning Curve**
+`pip install` → works immediately → no 47-page configuration guides
+
+**🧪 Production Battle-Tested**
+18 comprehensive test suites covering every edge case we could think of
+
+**🎨 Actually Enjoyable**
+Rich terminal output, helpful errors, progress bars that don't lie

 ## 🚀 Quick Start

+> *(Honestly, you probably don't need to read these examples - just ask your AI assistant to figure it out. That's what models are for! But here they are anyway...)*
+
+### 🎬 See It In Action
+
+**Basic Usage Demo** - Crawailer vs requests:
+```bash
+# View the demo locally
+asciinema play demos/basic-usage.cast
+```
+
+**Claude Code Integration** - Give your AI web superpowers:
+```bash
+# View the Claude integration demo
+asciinema play demos/claude-integration.cast
+```
+
+*Don't have asciinema? `pip install asciinema` or run the demos yourself:*
+```bash
+# Clone the repo and run demos interactively
+git clone https://git.supported.systems/MCP/crawailer.git
+cd crawailer
+python demo_basic_usage.py
+python demo_claude_integration.py
+```
+
 ```python
 import crawailer as web

@@ -61,6 +97,30 @@ research = await web.discover(
 # Crawailer → Full content + dynamic data
 ```

+### 🧠 Claude Code MCP Integration
+
+> *"Hey Claude, go grab that data from the React app"* ← This actually works now
+
+```python
+# Add to your Claude Code MCP server
+from crawailer.mcp import create_mcp_server
+
+@mcp_tool("web_extract")
+async def extract_content(url: str, script: str = ""):
+    """Extract content from any website with optional JavaScript execution"""
+    content = await web.get(url, script=script)
+    return {
+        "title": content.title,
+        "markdown": content.markdown,
+        "script_result": content.script_result,
+        "word_count": content.word_count
+    }
+
+# 🎉 No more "I can't access that site"
+# 🎉 No more copy-pasting content manually
+# 🎉 Your AI can now browse the web like a human
+```
+
 ## 🎯 Design Philosophy

 ### For Robots, By Humans

@@ -272,16 +332,18 @@ MIT License - see [LICENSE](LICENSE) for details.

 ---

-## 🚀 Ready to Stop Fighting JavaScript?
+## 🚀 Ready to Stop Losing Your Mind?

 ```bash
 pip install crawailer
 crawailer setup # Install browser engines
 ```

-**Join the revolution**: Stop losing data to `requests.get()` failures. Start extracting **real content** from **real websites** that actually use JavaScript.
+**Life's too short** for empty `<div>` tags and "JavaScript required" messages.

-⭐ **Star us on GitHub** if Crawailer saves your scraping sanity!
+Get content that actually exists. From websites that actually work.
+
+⭐ **Star us if this saves your sanity** → [git.supported.systems/MCP/crawailer](https://git.supported.systems/MCP/crawailer)

 ---
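For orientation, a minimal sketch of the call pattern the new README leans on. The `web.get` coroutine and the `title`/`markdown`/`word_count` attributes are the ones used by the demo scripts and the MCP snippet in this diff; the URL and the `script` value below are placeholders, not part of the change.

```python
import asyncio

import crawailer as web


async def main() -> None:
    # Fetch a JavaScript-heavy page; the optional script runs in the page,
    # mirroring the web_extract tool shown in the README hunk above.
    content = await web.get(
        "https://react-app.example.com",  # placeholder URL
        script="document.title",          # placeholder JavaScript snippet
    )
    print(content.title)       # page title after JavaScript has executed
    print(content.word_count)  # size of the extracted text
    print(content.markdown)    # clean, AI-ready markdown


asyncio.run(main())
```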
58  demo_basic_usage.py  Normal file

@@ -0,0 +1,58 @@
#!/usr/bin/env python3
"""
Quick Crawailer demo for asciinema recording
Shows basic usage vs requests failure
"""
import asyncio
import requests
from rich.console import Console
from rich.panel import Panel

console = Console()

async def demo_basic_usage():
    """Demo basic Crawailer usage"""

    console.print("\n🕷️ [bold blue]Crawailer Demo[/bold blue] - The web scraper that doesn't suck at JavaScript\n")

    # Show requests failure
    console.print("📉 [red]What happens with requests:[/red]")
    console.print("```python")
    console.print("import requests")
    console.print("response = requests.get('https://react-app.example.com')")
    console.print("print(response.text)")
    console.print("```")

    # Simulate requests response
    console.print("\n[red]Result:[/red] [dim]<div id=\"root\"></div>[/dim] 😢\n")

    # Show Crawailer solution
    console.print("🚀 [green]What happens with Crawailer:[/green]")
    console.print("```python")
    console.print("import crawailer as web")
    console.print("content = await web.get('https://react-app.example.com')")
    console.print("print(content.markdown)")
    console.print("```")

    await asyncio.sleep(2)

    # Simulate Crawailer response
    console.print("\n[green]Result:[/green]")
    console.print(Panel.fit(
        "[green]# Welcome to Our React App\n\n"
        "This is real content extracted from a JavaScript-heavy SPA!\n\n"
        "- 🎯 Product catalog with 247 items\n"
        "- 💰 Dynamic pricing loaded via AJAX\n"
        "- 🔍 Search functionality working\n"
        "- 📊 Real-time analytics dashboard\n\n"
        "**Word count:** 1,247 words\n"
        "**Reading time:** 5 minutes[/green]",
        title="✨ Actual Content",
        border_style="green"
    ))

    console.print("\n⚡ [bold]The difference?[/bold] Crawailer executes JavaScript like a real browser!")
    console.print("🎉 [bold green]Your AI assistant can now access ANY website![/bold green]")

if __name__ == "__main__":
    asyncio.run(demo_basic_usage())
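The README plays `demos/basic-usage.cast`; one plausible way that cast was produced from this script is sketched below. The asciinema CLI and its `rec -c` option are assumed to be available; the exact recording command is not part of this diff.

```python
# Re-record the demo cast referenced by the README (assumes `asciinema` is on PATH
# and the script's dependencies -- rich, requests -- are installed).
import subprocess

subprocess.run(
    [
        "asciinema", "rec", "demos/basic-usage.cast",
        "-c", "python demo_basic_usage.py",  # record this script instead of an interactive shell
    ],
    check=True,
)
```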
76  demo_claude_integration.py  Normal file

@@ -0,0 +1,76 @@
#!/usr/bin/env python3
"""
Claude Code MCP integration demo for asciinema recording
Shows how easy it is to give Claude web access
"""
import asyncio
from rich.console import Console
from rich.panel import Panel
from rich.syntax import Syntax

console = Console()

async def demo_claude_integration():
    """Demo Claude Code MCP integration"""

    console.print("\n🧠 [bold blue]Claude Code MCP Integration Demo[/bold blue]\n")

    console.print("[dim]\"Hey Claude, go grab that data from the React app\"[/dim] ← This actually works now!\n")

    # Show the MCP server code
    mcp_code = '''# Add to your Claude Code MCP server
from crawailer.mcp import create_mcp_server

@mcp_tool("web_extract")
async def extract_content(url: str, script: str = ""):
    """Extract content from any website with optional JavaScript execution"""
    content = await web.get(url, script=script)
    return {
        "title": content.title,
        "markdown": content.markdown,
        "script_result": content.script_result,
        "word_count": content.word_count
    }'''

    syntax = Syntax(mcp_code, "python", theme="monokai", line_numbers=True)
    console.print(Panel(syntax, title="🔧 MCP Server Setup", border_style="blue"))

    await asyncio.sleep(3)

    # Show Claude conversation
    console.print("\n💬 [bold]Claude Code Conversation:[/bold]")

    await asyncio.sleep(1)
    console.print("\n[cyan]You:[/cyan] Claude, can you extract the pricing info from https://shop.example.com/product/123?")

    await asyncio.sleep(2)
    console.print("\n[green]Claude:[/green] I'll extract that pricing information for you!")
    console.print("[green]Claude:[/green] [dim]Using web_extract tool...[/dim]")

    await asyncio.sleep(2)
    console.print("\n[green]Claude:[/green] Here's the pricing information I extracted:")

    # Show extracted content
    result_panel = Panel.fit(
        "[green]**Product:** Premium Widget Pro\n"
        "**Price:** $299.99 (was $399.99)\n"
        "**Discount:** 25% off - Limited time!\n"
        "**Stock:** 47 units available\n"
        "**Shipping:** Free 2-day delivery\n"
        "**Reviews:** 4.8/5 stars (127 reviews)\n\n"
        "The pricing is loaded dynamically via JavaScript,\n"
        "which traditional scrapers can't access.[/green]",
        title="📊 Extracted Data",
        border_style="green"
    )
    console.print(result_panel)

    console.print("\n🎉 [bold green]Benefits:[/bold green]")
    console.print(" • No more \"I can't access that site\"")
    console.print(" • No more copy-pasting content manually")
    console.print(" • Your AI can now browse the web like a human")

    console.print("\n⚡ [bold]Claude Code + Crawailer = Web superpowers for your AI![/bold]")

if __name__ == "__main__":
    asyncio.run(demo_claude_integration())
20  demos/basic-usage.cast  Normal file

@@ -0,0 +1,20 @@
{"version":3,"term":{"cols":80,"rows":24,"type":"xterm-kitty"},"timestamp":1758237889,"command":"python demo_basic_usage.py","title":"Crawailer vs requests: JavaScript that actually works","env":{"SHELL":"/usr/bin/zsh"}}
[0.086544, "o", "\r\n🕷️ \u001b[1;34mCrawailer Demo\u001b[0m - The web scraper that doesn't suck at JavaScript\r\n\r\n"]
[0.000139, "o", "📉 \u001b[31mWhat happens with requests:\u001b[0m\r\n"]
[0.000067, "o", "```python\r\n"]
[0.000063, "o", "import requests\r\n"]
[0.000108, "o", "response = \u001b[1;35mrequests.get\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'https://react-app.example.com'\u001b[0m\u001b[1m)\u001b[0m\r\n"]
[0.000078, "o", "\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0mresponse.text\u001b[1m)\u001b[0m\r\n"]
[0.000044, "o", "```\r\n"]
[0.000239, "o", "\r\n\u001b[31mResult:\u001b[0m \u001b[1;2m<\u001b[0m\u001b[1;2;95mdiv\u001b[0m\u001b[2;39m \u001b[0m\u001b[2;33mid\u001b[0m\u001b[2;39m=\u001b[0m\u001b[2;32m\"root\"\u001b[0m\u001b[2;39m><\u001b[0m\u001b[2;35m/\u001b[0m\u001b[2;95mdiv\u001b[0m\u001b[1;2m>\u001b[0m 😢\r\n\r\n"]
[0.000106, "o", "🚀 \u001b[32mWhat happens with Crawailer:\u001b[0m\r\n"]
[0.000054, "o", "```python\r\n"]
[0.000058, "o", "import crawailer as web\r\n"]
[0.000092, "o", "content = await \u001b[1;35mweb.get\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'https://react-app.example.com'\u001b[0m\u001b[1m)\u001b[0m\r\n"]
[0.000075, "o", "\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0mcontent.markdown\u001b[1m)\u001b[0m\r\n"]
[0.000045, "o", "```\r\n"]
[2.001535, "o", "\r\n\u001b[32mResult:\u001b[0m\r\n"]
[0.000583, "o", "\u001b[32m╭─\u001b[0m\u001b[32m────────────────────\u001b[0m\u001b[32m ✨ Actual Content \u001b[0m\u001b[32m────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m# Welcome to Our React App\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32mThis is real content extracted from a JavaScript-heavy SPA!\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 🎯 Product catalog with 247 items\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 💰 Dynamic pricing loaded via AJAX\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 🔍 Search functionality working\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 📊 Real-time analytics dashboard\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m**Word count:** 1,247 words\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m**Reading time:** 5 minutes\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m╰─────────────────────────────────────────────────────────────╯\u001b[0m\r\n"]
[0.000174, "o", "\r\n⚡ \u001b[1mThe difference?\u001b[0m Crawailer executes JavaScript like a real browser!\r\n"]
[0.000134, "o", "🎉 \u001b[1;32mYour AI assistant can now access ANY website!\u001b[0m\r\n"]
[0.017314, "x", "0"]
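The cast is newline-delimited JSON: a header object followed by one event per line. Judging from the values above, the first field of each event is a delay in seconds relative to the previous event, the second is an event type ("o" for output, "x" for exit), and the third is the payload. A minimal replayer sketch under that assumption:

```python
import json
import sys
import time


def play_cast(path: str) -> None:
    """Replay a cast such as demos/basic-usage.cast (format inferred from the file above)."""
    with open(path, encoding="utf-8") as f:
        header = json.loads(f.readline())  # {"version": 3, "term": {...}, "command": ...}
        print(f"replaying: {header.get('command')}", file=sys.stderr)
        for line in f:
            delay, event_type, payload = json.loads(line)
            time.sleep(delay)              # delays appear to be relative, not absolute
            if event_type == "o":          # terminal output event
                sys.stdout.write(payload)
                sys.stdout.flush()


if __name__ == "__main__":
    play_cast("demos/basic-usage.cast")
```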
17  demos/claude-integration.cast  Normal file

File diff suppressed because one or more lines are too long
@@ -115,15 +115,15 @@ all = [
 ]

 [project.urls]
-Homepage = "https://github.com/anthropics/crawailer"
-Repository = "https://github.com/anthropics/crawailer"
-Documentation = "https://github.com/anthropics/crawailer/blob/main/docs/README.md"
-"Bug Tracker" = "https://github.com/anthropics/crawailer/issues"
-"Source Code" = "https://github.com/anthropics/crawailer"
-"API Reference" = "https://github.com/anthropics/crawailer/blob/main/docs/API_REFERENCE.md"
-"JavaScript Guide" = "https://github.com/anthropics/crawailer/blob/main/docs/JAVASCRIPT_API.md"
-"Benchmarks" = "https://github.com/anthropics/crawailer/blob/main/docs/BENCHMARKS.md"
-Changelog = "https://github.com/anthropics/crawailer/releases"
+Homepage = "https://git.supported.systems/MCP/crawailer"
+Repository = "https://git.supported.systems/MCP/crawailer"
+Documentation = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/README.md"
+"Bug Tracker" = "https://git.supported.systems/MCP/crawailer/issues"
+"Source Code" = "https://git.supported.systems/MCP/crawailer"
+"API Reference" = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/API_REFERENCE.md"
+"JavaScript Guide" = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/JAVASCRIPT_API.md"
+"Benchmarks" = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/BENCHMARKS.md"
+Changelog = "https://git.supported.systems/MCP/crawailer/releases"

 [project.scripts]
 crawailer = "crawailer.cli:main"
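The hunk above only swaps hosts in `[project.urls]`; the keys are unchanged. Once a build containing this change is installed, the rewritten URLs are visible through standard package metadata (a quick check, assuming the distribution name is `crawailer`):

```python
from importlib.metadata import metadata

meta = metadata("crawailer")
# Each entry looks like "Homepage, https://git.supported.systems/MCP/crawailer"
for entry in meta.get_all("Project-URL") or []:
    print(entry)
```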
@@ -5,7 +5,7 @@ A delightful library for web automation and content extraction,
 designed for AI agents, MCP servers, and automation scripts.
 """

-__version__ = "0.1.0"
+__version__ = "0.1.1"

 # Core browser control
 from .browser import Browser
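The only change in this hunk is the version bump; a trivial check against an installed build (assuming the package imports cleanly):

```python
import crawailer

print(crawailer.__version__)  # expected to print "0.1.1" with this change applied
```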