# π·οΈ Crawailer **The JavaScript-first web scraper that actually works with modern websites** > **Finally!** A Python library that handles React, Vue, Angular, and dynamic content without the headaches. When `requests` fails and Selenium feels like overkill, Crawailer delivers clean, AI-ready content extraction with bulletproof JavaScript execution. ```python pip install crawailer ``` [](https://badge.fury.io/py/crawailer) [](https://pepy.tech/project/crawailer) [](https://pypi.org/project/crawailer/) ## β¨ Features - **π― JavaScript-First**: Executes real JavaScript on React, Vue, Angular sites (unlike `requests`) - **β‘ Lightning Fast**: 5-10x faster HTML processing with C-based selectolax - **π€ AI-Optimized**: Clean markdown output perfect for LLM training and RAG - **π§ Three Ways to Use**: Library, CLI tool, or MCP server - your choice - **π¦ Zero Config**: Works immediately with sensible defaults - **π§ͺ Battle-Tested**: 18 comprehensive test suites with 70+ real-world scenarios - **π¨ Developer Joy**: Rich terminal output, helpful errors, progress tracking ## π Quick Start ```python import crawailer as web # Simple content extraction content = await web.get("https://example.com") print(content.markdown) # Clean, LLM-ready markdown print(content.text) # Human-readable text print(content.title) # Extracted title # JavaScript execution for dynamic content content = await web.get( "https://spa-app.com", script="document.querySelector('.dynamic-price').textContent" ) print(f"Price: {content.script_result}") # Batch processing with JavaScript results = await web.get_many( ["url1", "url2", "url3"], script="document.title + ' | ' + document.querySelector('.description')?.textContent" ) for result in results: print(f"{result.title}: {result.script_result}") # Smart discovery with interaction research = await web.discover( "AI safety papers", script="document.querySelector('.show-more')?.click()", max_pages=10 ) # Returns the most relevant content with enhanced extraction # Compare: Traditional scraping fails on modern sites # requests.get("https://react-app.com") β Empty
# Crawailer β Full content + dynamic data ``` ## π― Design Philosophy ### For Robots, By Humans - **Predictive**: Anticipates what you need and provides it - **Forgiving**: Handles errors gracefully with helpful suggestions - **Efficient**: Fast by default, with smart caching and concurrency - **Composable**: Small, focused functions that work well together ### Perfect for AI Workflows - **LLM-Optimized**: Clean markdown, structured data, semantic chunking - **Context-Aware**: Extracts relationships and metadata automatically - **Quality-Focused**: Built-in content quality assessment - **Archive-Ready**: Designed for long-term storage and retrieval ## π Use Cases ### π€ AI Agents & LLM Applications **Problem**: Training data scattered across JavaScript-heavy academic sites ```python # Research assistant workflow with JavaScript interaction research = await web.discover( "quantum computing breakthroughs", script="document.querySelector('.show-abstract')?.click(); return document.querySelector('.full-text')?.textContent" ) for paper in research: # Rich content includes JavaScript-extracted data summary = await llm.summarize(paper.markdown) dynamic_content = paper.script_result # JavaScript execution result insights = await llm.extract_insights(paper.content + dynamic_content) ``` ### π E-commerce Price Monitoring **Problem**: Product prices loaded via AJAX, `requests` sees loading spinners ```python # Monitor competitor pricing with dynamic content products = await web.get_many( competitor_urls, script="return {price: document.querySelector('.price')?.textContent, stock: document.querySelector('.inventory')?.textContent}" ) for product in products: if product.script_result['price'] != cached_price: await alert_price_change(product.url, product.script_result) ``` ### π MCP Servers **Problem**: Claude needs reliable web content extraction tools ```python # Easy MCP integration (with crawailer[mcp]) from crawailer.mcp import create_mcp_server server = create_mcp_server() # Automatically exposes web.get, web.discover, etc. as MCP tools ``` ### π Social Media & Content Analysis **Problem**: Posts and comments load infinitely via JavaScript ```python # Extract social media discussions with infinite scroll content = await web.get( "https://social-platform.com/topic/ai-safety", script="window.scrollTo(0, document.body.scrollHeight); return document.querySelectorAll('.post').length" ) # Gets full thread content, not just initial page load ``` ## π οΈ Installation ```bash # Basic installation pip install crawailer # With AI features (semantic search, entity extraction) pip install crawailer[ai] # With MCP server capabilities pip install crawailer[mcp] # Everything pip install crawailer[all] # Post-install setup (installs Playwright browsers) crawailer setup ``` ## ποΈ Architecture Crawailer is built on modern, focused libraries: - **π Playwright**: Reliable browser automation - **β‘ selectolax**: 5-10x faster HTML parsing (C-based) - **π markdownify**: Clean HTMLβMarkdown conversion - **π§Ή justext**: Intelligent content extraction and cleaning - **π httpx**: Modern async HTTP client ## π§ͺ Battle-Tested Quality Crawailer includes **18 comprehensive test suites** with real-world scenarios: - **Modern Frameworks**: React, Vue, Angular demos with full JavaScript APIs - **Mobile Compatibility**: Safari iOS, Chrome Android, responsive designs - **Production Edge Cases**: Network failures, memory pressure, browser differences - **Performance Testing**: Stress tests, concurrency, resource management **Want to contribute?** We welcome PRs with new test scenarios! Our test sites library shows exactly how different frameworks should behave with JavaScript execution. > π **Future TODO**: Move examples to dedicated repository for community contributions ## π€ Perfect for MCP Projects MCP servers love Crawailer because it provides: - **Focused tools**: Each function does one thing well - **Rich outputs**: Structured data ready for LLM consumption - **Smart defaults**: Works out of the box with minimal configuration - **Extensible**: Easy to add domain-specific extraction logic ```python # Example MCP server tool @mcp_tool("web_research") async def research_topic(topic: str, depth: str = "comprehensive"): results = await web.discover(topic, max_pages=20) return { "sources": len(results), "content": [r.summary for r in results], "insights": await analyze_patterns(results) } ``` ## π₯ Crawailer vs Traditional Tools | Challenge | `requests` & HTTP libs | Selenium | **Crawailer** | |-----------|------------------------|----------|---------------| | **React/Vue/Angular** | β Empty templates | π‘ Slow, complex setup | β **Just works** | | **Dynamic Pricing** | β Shows loading spinner | π‘ Requires waits/timeouts | β **Intelligent waiting** | | **JavaScript APIs** | β No access | π‘ Clunky WebDriver calls | β **Native page.evaluate()** | | **Speed** | π’ 100-500ms | β 5-15 seconds | β **2-5 seconds** | | **Memory** | π’ 1-5MB | β 200-500MB | π‘ **100-200MB** | | **AI-Ready Output** | β Raw HTML | β Raw HTML | β **Clean Markdown** | | **Developer Experience** | π‘ Manual parsing | β Complex WebDriver | β **Intuitive API** | > **The bottom line**: When JavaScript matters, Crawailer delivers. When it doesn't, use `requests`. > > π **[See complete tool comparison β](docs/COMPARISON.md)** (includes Scrapy, Playwright, BeautifulSoup, and more) ## π What Makes It Delightful ### JavaScript-Powered Intelligence ```python # Dynamic content extraction from SPAs content = await web.get( "https://react-app.com", script="window.testData?.framework + ' v' + window.React?.version" ) # Automatically detects: React application with version info # Extracts: Dynamic content + framework details # E-commerce with JavaScript-loaded prices product = await web.get( "https://shop.com/product", script="document.querySelector('.dynamic-price')?.textContent", wait_for=".price-loaded" ) # Recognizes product page with dynamic pricing # Extracts: Real-time price, reviews, availability, specs ``` ### Beautiful Output ``` β¨ Found 15 high-quality sources π Sources: 4 arxiv, 3 journals, 2 conferences, 6 blogs π Date range: 2023-2024 (recent research) β‘ Average quality score: 8.7/10 π Key topics: transformers, safety, alignment ``` ### Helpful Errors ```python try: content = await web.get("problematic-site.com") except web.CloudflareProtected: # "π‘ Try: await web.get(url, stealth=True)" except web.PaywallDetected as e: # "π Found archived version: {e.archive_url}" ``` ## π Documentation - **[Tool Comparison](docs/COMPARISON.md)**: How Crawailer compares to Scrapy, Selenium, BeautifulSoup, etc. - **[Getting Started](docs/getting-started.md)**: Installation and first steps - **[JavaScript API](docs/JAVASCRIPT_API.md)**: Complete JavaScript execution guide - **[API Reference](docs/API_REFERENCE.md)**: Complete function documentation - **[Benchmarks](docs/BENCHMARKS.md)**: Performance comparison with other tools - **[MCP Integration](docs/mcp.md)**: Building MCP servers with Crawailer - **[Examples](examples/)**: Real-world usage patterns - **[Architecture](docs/architecture.md)**: How Crawailer works internally ## π€ Contributing We love contributions! Crawailer is designed to be: - **Easy to extend**: Add new content extractors and browser capabilities - **Well-tested**: Comprehensive test suite with real websites - **Documented**: Every feature has examples and use cases See [CONTRIBUTING.md](CONTRIBUTING.md) for details. ## π License MIT License - see [LICENSE](LICENSE) for details. --- ## π Ready to Stop Fighting JavaScript? ```bash pip install crawailer crawailer setup # Install browser engines ``` **Join the revolution**: Stop losing data to `requests.get()` failures. Start extracting **real content** from **real websites** that actually use JavaScript. β **Star us on GitHub** if Crawailer saves your scraping sanity! --- **Built with β€οΈ for the age of AI agents and automation** *Crawailer: Because robots deserve delightful web experiences too* π€β¨