Compare commits: main..feature/js-api-integration

No commits in common. "main" and "feature/js-api-integration" have entirely different histories.

8 changed files with 25 additions and 258 deletions

View File

@@ -80,4 +80,4 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
---
For more details about changes, see the [commit history](https://git.supported.systems/MCP/crawailer/commits/branch/main).
For more details about changes, see the [commit history](https://github.com/anthropics/crawailer/commits/main).

View File

@@ -1,65 +1,29 @@
# 🕷️ Crawailer
**The web scraper that doesn't suck at JavaScript** ✨
**The JavaScript-first web scraper that actually works with modern websites**
> **Stop fighting modern websites.** While `requests` gives you empty `<div id="root"></div>`, Crawailer actually executes JavaScript and extracts real content from React, Vue, and Angular apps. Finally, web scraping that works in 2025.
> ⚡ **Claude Code's new best friend** - Your AI assistant can now access ANY website
> **Finally!** A Python library that handles React, Vue, Angular, and dynamic content without the headaches. When `requests` fails and Selenium feels like overkill, Crawailer delivers clean, AI-ready content extraction with bulletproof JavaScript execution.
```python
pip install crawailer
```
[![PyPI version](https://badge.fury.io/py/crawailer.svg)](https://badge.fury.io/py/crawailer)
[![Downloads](https://pepy.tech/badge/crawailer)](https://pepy.tech/project/crawailer)
[![Python Support](https://img.shields.io/pypi/pyversions/crawailer.svg)](https://pypi.org/project/crawailer/)
## ✨ Why Developers Choose Crawailer
## ✨ Features
**🔥 JavaScript That Actually Works**
While other tools time out or crash, Crawailer executes real JavaScript like a human browser
**⚡ Stupidly Fast**
5-10x faster than BeautifulSoup with C-based parsing that doesn't make you wait
**🤖 AI Assistant Ready**
Perfect markdown output that your Claude/GPT/local model will love
**🎯 Zero Learning Curve**
`pip install` → works immediately → no 47-page configuration guides
**🧪 Production Battle-Tested**
18 comprehensive test suites covering every edge case we could think of
**🎨 Actually Enjoyable**
Rich terminal output, helpful errors, progress bars that don't lie
- **🎯 JavaScript-First**: Executes real JavaScript on React, Vue, Angular sites (unlike `requests`)
- **⚡ Lightning Fast**: 5-10x faster HTML processing with C-based selectolax
- **🤖 AI-Optimized**: Clean markdown output perfect for LLM training and RAG
- **🔧 Three Ways to Use**: Library, CLI tool, or MCP server - your choice
- **📦 Zero Config**: Works immediately with sensible defaults
- **🧪 Battle-Tested**: 18 comprehensive test suites with 70+ real-world scenarios
- **🎨 Developer Joy**: Rich terminal output, helpful errors, progress tracking
## 🚀 Quick Start
> *(Honestly, you probably don't need to read these examples - just ask your AI assistant to figure it out. That's what models are for! But here they are anyway...)*
### 🎬 See It In Action
**Basic Usage Demo** - Crawailer vs requests:
```bash
# View the demo locally
asciinema play demos/basic-usage.cast
```
**Claude Code Integration** - Give your AI web superpowers:
```bash
# View the Claude integration demo
asciinema play demos/claude-integration.cast
```
*Don't have asciinema? `pip install asciinema` or run the demos yourself:*
```bash
# Clone the repo and run demos interactively
git clone https://git.supported.systems/MCP/crawailer.git
cd crawailer
python demo_basic_usage.py
python demo_claude_integration.py
```
```python
import crawailer as web
@@ -97,30 +61,6 @@ research = await web.discover(
# Crawailer → Full content + dynamic data
```
### 🧠 Claude Code MCP Integration
> *"Hey Claude, go grab that data from the React app"* ← This actually works now
```python
# Add to your Claude Code MCP server
import crawailer as web
from crawailer.mcp import create_mcp_server
@mcp_tool("web_extract")
async def extract_content(url: str, script: str = ""):
"""Extract content from any website with optional JavaScript execution"""
content = await web.get(url, script=script)
return {
"title": content.title,
"markdown": content.markdown,
"script_result": content.script_result,
"word_count": content.word_count
}
# 🎉 No more "I can't access that site"
# 🎉 No more copy-pasting content manually
# 🎉 Your AI can now browse the web like a human
```
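The handler can also be exercised outside the MCP server for local testing. A minimal sketch, assuming the `@mcp_tool` decorator leaves `extract_content` callable as a plain async function (the product URL and CSS selector are placeholders):

```python
import asyncio

async def main():
    # Call the handler directly to sanity-check extraction before wiring it
    # into Claude Code; the URL and selector below are illustrative only.
    result = await extract_content(
        "https://shop.example.com/product/123",
        script="document.querySelector('.price').textContent",
    )
    print(result["title"], result["word_count"])

asyncio.run(main())
```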
## 🎯 Design Philosophy
### For Robots, By Humans
@@ -332,18 +272,16 @@ MIT License - see [LICENSE](LICENSE) for details.
---
## 🚀 Ready to Stop Losing Your Mind?
## 🚀 Ready to Stop Fighting JavaScript?
```bash
pip install crawailer
crawailer setup # Install browser engines
```
**Life's too short** for empty `<div>` tags and "JavaScript required" messages.
**Join the revolution**: Stop losing data to `requests.get()` failures. Start extracting **real content** from **real websites** that actually use JavaScript.
Get content that actually exists. From websites that actually work.
**Star us if this saves your sanity** → [git.supported.systems/MCP/crawailer](https://git.supported.systems/MCP/crawailer)
**Star us on GitHub** if Crawailer saves your scraping sanity!
---

View File

@@ -1,58 +0,0 @@
#!/usr/bin/env python3
"""
Quick Crawailer demo for asciinema recording
Shows basic usage vs requests failure
"""
import asyncio
import requests
from rich.console import Console
from rich.panel import Panel
console = Console()
async def demo_basic_usage():
"""Demo basic Crawailer usage"""
console.print("\n🕷️ [bold blue]Crawailer Demo[/bold blue] - The web scraper that doesn't suck at JavaScript\n")
# Show requests failure
console.print("📉 [red]What happens with requests:[/red]")
console.print("```python")
console.print("import requests")
console.print("response = requests.get('https://react-app.example.com')")
console.print("print(response.text)")
console.print("```")
# Simulate requests response
console.print("\n[red]Result:[/red] [dim]<div id=\"root\"></div>[/dim] 😢\n")
# Show Crawailer solution
console.print("🚀 [green]What happens with Crawailer:[/green]")
console.print("```python")
console.print("import crawailer as web")
console.print("content = await web.get('https://react-app.example.com')")
console.print("print(content.markdown)")
console.print("```")
await asyncio.sleep(2)
# Simulate Crawailer response
console.print("\n[green]Result:[/green]")
console.print(Panel.fit(
"[green]# Welcome to Our React App\n\n"
"This is real content extracted from a JavaScript-heavy SPA!\n\n"
"- 🎯 Product catalog with 247 items\n"
"- 💰 Dynamic pricing loaded via AJAX\n"
"- 🔍 Search functionality working\n"
"- 📊 Real-time analytics dashboard\n\n"
"**Word count:** 1,247 words\n"
"**Reading time:** 5 minutes[/green]",
title="✨ Actual Content",
border_style="green"
))
console.print("\n⚡ [bold]The difference?[/bold] Crawailer executes JavaScript like a real browser!")
console.print("🎉 [bold green]Your AI assistant can now access ANY website![/bold green]")
if __name__ == "__main__":
asyncio.run(demo_basic_usage())

View File

@@ -1,76 +0,0 @@
#!/usr/bin/env python3
"""
Claude Code MCP integration demo for asciinema recording
Shows how easy it is to give Claude web access
"""
import asyncio
from rich.console import Console
from rich.panel import Panel
from rich.syntax import Syntax
console = Console()
async def demo_claude_integration():
"""Demo Claude Code MCP integration"""
console.print("\n🧠 [bold blue]Claude Code MCP Integration Demo[/bold blue]\n")
console.print("[dim]\"Hey Claude, go grab that data from the React app\"[/dim] ← This actually works now!\n")
# Show the MCP server code
mcp_code = '''# Add to your Claude Code MCP server
import crawailer as web
from crawailer.mcp import create_mcp_server
@mcp_tool("web_extract")
async def extract_content(url: str, script: str = ""):
"""Extract content from any website with optional JavaScript execution"""
content = await web.get(url, script=script)
return {
"title": content.title,
"markdown": content.markdown,
"script_result": content.script_result,
"word_count": content.word_count
}'''
syntax = Syntax(mcp_code, "python", theme="monokai", line_numbers=True)
console.print(Panel(syntax, title="🔧 MCP Server Setup", border_style="blue"))
await asyncio.sleep(3)
# Show Claude conversation
console.print("\n💬 [bold]Claude Code Conversation:[/bold]")
await asyncio.sleep(1)
console.print("\n[cyan]You:[/cyan] Claude, can you extract the pricing info from https://shop.example.com/product/123?")
await asyncio.sleep(2)
console.print("\n[green]Claude:[/green] I'll extract that pricing information for you!")
console.print("[green]Claude:[/green] [dim]Using web_extract tool...[/dim]")
await asyncio.sleep(2)
console.print("\n[green]Claude:[/green] Here's the pricing information I extracted:")
# Show extracted content
result_panel = Panel.fit(
"[green]**Product:** Premium Widget Pro\n"
"**Price:** $299.99 (was $399.99)\n"
"**Discount:** 25% off - Limited time!\n"
"**Stock:** 47 units available\n"
"**Shipping:** Free 2-day delivery\n"
"**Reviews:** 4.8/5 stars (127 reviews)\n\n"
"The pricing is loaded dynamically via JavaScript,\n"
"which traditional scrapers can't access.[/green]",
title="📊 Extracted Data",
border_style="green"
)
console.print(result_panel)
console.print("\n🎉 [bold green]Benefits:[/bold green]")
console.print(" • No more \"I can't access that site\"")
console.print(" • No more copy-pasting content manually")
console.print(" • Your AI can now browse the web like a human")
console.print("\n⚡ [bold]Claude Code + Crawailer = Web superpowers for your AI![/bold]")
if __name__ == "__main__":
asyncio.run(demo_claude_integration())

View File

@@ -1,20 +0,0 @@
{"version":3,"term":{"cols":80,"rows":24,"type":"xterm-kitty"},"timestamp":1758237889,"command":"python demo_basic_usage.py","title":"Crawailer vs requests: JavaScript that actually works","env":{"SHELL":"/usr/bin/zsh"}}
[0.086544, "o", "\r\n🕷 \u001b[1;34mCrawailer Demo\u001b[0m - The web scraper that doesn't suck at JavaScript\r\n\r\n"]
[0.000139, "o", "📉 \u001b[31mWhat happens with requests:\u001b[0m\r\n"]
[0.000067, "o", "```python\r\n"]
[0.000063, "o", "import requests\r\n"]
[0.000108, "o", "response = \u001b[1;35mrequests.get\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'https://react-app.example.com'\u001b[0m\u001b[1m)\u001b[0m\r\n"]
[0.000078, "o", "\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0mresponse.text\u001b[1m)\u001b[0m\r\n"]
[0.000044, "o", "```\r\n"]
[0.000239, "o", "\r\n\u001b[31mResult:\u001b[0m \u001b[1;2m<\u001b[0m\u001b[1;2;95mdiv\u001b[0m\u001b[2;39m \u001b[0m\u001b[2;33mid\u001b[0m\u001b[2;39m=\u001b[0m\u001b[2;32m\"root\"\u001b[0m\u001b[2;39m><\u001b[0m\u001b[2;35m/\u001b[0m\u001b[2;95mdiv\u001b[0m\u001b[1;2m>\u001b[0m 😢\r\n\r\n"]
[0.000106, "o", "🚀 \u001b[32mWhat happens with Crawailer:\u001b[0m\r\n"]
[0.000054, "o", "```python\r\n"]
[0.000058, "o", "import crawailer as web\r\n"]
[0.000092, "o", "content = await \u001b[1;35mweb.get\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'https://react-app.example.com'\u001b[0m\u001b[1m)\u001b[0m\r\n"]
[0.000075, "o", "\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0mcontent.markdown\u001b[1m)\u001b[0m\r\n"]
[0.000045, "o", "```\r\n"]
[2.001535, "o", "\r\n\u001b[32mResult:\u001b[0m\r\n"]
[0.000583, "o", "\u001b[32m╭─\u001b[0m\u001b[32m────────────────────\u001b[0m\u001b[32m ✨ Actual Content \u001b[0m\u001b[32m────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m# Welcome to Our React App\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32mThis is real content extracted from a JavaScript-heavy SPA!\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 🎯 Product catalog with 247 items\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 💰 Dynamic pricing loaded via AJAX\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 🔍 Search functionality working\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m- 📊 Real-time analytics dashboard\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m**Word count:** 1,247 words\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m│\u001b[0m \u001b[32m**Reading time:** 5 minutes\u001b[0m \u001b[32m│\u001b[0m\r\n\u001b[32m╰─────────────────────────────────────────────────────────────╯\u001b[0m\r\n"]
[0.000174, "o", "\r\n⚡ \u001b[1mThe difference?\u001b[0m Crawailer executes JavaScript like a real browser!\r\n"]
[0.000134, "o", "🎉 \u001b[1;32mYour AI assistant can now access ANY website!\u001b[0m\r\n"]
[0.017314, "x", "0"]

File diff suppressed because one or more lines are too long

View File

@@ -115,15 +115,15 @@ all = [
]
[project.urls]
Homepage = "https://git.supported.systems/MCP/crawailer"
Repository = "https://git.supported.systems/MCP/crawailer"
Documentation = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/README.md"
"Bug Tracker" = "https://git.supported.systems/MCP/crawailer/issues"
"Source Code" = "https://git.supported.systems/MCP/crawailer"
"API Reference" = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/API_REFERENCE.md"
"JavaScript Guide" = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/JAVASCRIPT_API.md"
"Benchmarks" = "https://git.supported.systems/MCP/crawailer/src/branch/main/docs/BENCHMARKS.md"
Changelog = "https://git.supported.systems/MCP/crawailer/releases"
Homepage = "https://github.com/anthropics/crawailer"
Repository = "https://github.com/anthropics/crawailer"
Documentation = "https://github.com/anthropics/crawailer/blob/main/docs/README.md"
"Bug Tracker" = "https://github.com/anthropics/crawailer/issues"
"Source Code" = "https://github.com/anthropics/crawailer"
"API Reference" = "https://github.com/anthropics/crawailer/blob/main/docs/API_REFERENCE.md"
"JavaScript Guide" = "https://github.com/anthropics/crawailer/blob/main/docs/JAVASCRIPT_API.md"
"Benchmarks" = "https://github.com/anthropics/crawailer/blob/main/docs/BENCHMARKS.md"
Changelog = "https://github.com/anthropics/crawailer/releases"
[project.scripts]
crawailer = "crawailer.cli:main"

View File

@@ -5,7 +5,7 @@ A delightful library for web automation and content extraction,
designed for AI agents, MCP servers, and automation scripts.
"""
__version__ = "0.1.1"
__version__ = "0.1.0"
# Core browser control
from .browser import Browser