
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
Added
- Initial release of Crawailer
- Full JavaScript execution support with `page.evaluate()`
- Modern framework support (React, Vue, Angular)
- Comprehensive content extraction with rich metadata
- High-level API functions: `get()`, `get_many()`, `discover()`
- Browser automation with Playwright integration
- Fast HTML processing with selectolax (5-10x faster than BeautifulSoup)
- WebContent dataclass with computed properties
- Async-first design with concurrent processing
- Command-line interface
- MCP (Model Context Protocol) server integration
- Comprehensive test suite with 357+ scenarios
- Local Docker test server for development
- Security hardening with XSS prevention
- Memory management and leak detection
- Cross-browser engine compatibility
- Performance optimization strategies
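As an illustration of the "WebContent dataclass with computed properties" entry above, here is a minimal sketch of how such a record could derive values on access. The class name `WebContentSketch`, its fields, and the 200 words-per-minute reading speed are assumptions made for this example; they are not taken from the Crawailer source.

```python
from dataclasses import dataclass
import math


@dataclass
class WebContentSketch:
    """Hypothetical stand-in for a WebContent-style record (not the library's class)."""

    url: str
    title: str
    text: str

    @property
    def word_count(self) -> int:
        # Derived from the extracted text each time it is accessed.
        return len(self.text.split())

    @property
    def reading_time_minutes(self) -> int:
        # Assumes roughly 200 words per minute, rounded up, with a 1-minute floor.
        return max(1, math.ceil(self.word_count / 200))


page = WebContentSketch(
    url="https://example.com",
    title="Example",
    text="word " * 450,
)
print(page.word_count, page.reading_time_minutes)
```

Keeping derived values as properties means they cannot drift out of sync with the stored text, at the cost of recomputing them on each access.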
Features
- JavaScript Execution: Execute arbitrary JavaScript with `script`, `script_before`, `script_after` parameters
- SPA Support: Handle React, Vue, Angular, and other modern frameworks
- Dynamic Content: Extract content loaded via AJAX, user interactions, and lazy loading
- Batch Processing: Process multiple URLs concurrently with intelligent batching
- Content Quality: Rich metadata extraction including author, reading time, quality scores
- Error Handling: Comprehensive error capture with graceful degradation
- Performance Monitoring: Extract timing and memory metrics from pages
- Framework Detection: Automatic detection of JavaScript frameworks and versions
- User Interaction: Simulate clicks, form submissions, scrolling, and complex workflows
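The batch-processing and error-handling entries above can be pictured with a small asyncio sketch. It is not Crawailer's implementation: `fetch_stub`, `fetch_many`, and the `max_concurrent` parameter are hypothetical names, and the stub sleeps instead of opening a browser so the example runs without network access.

```python
import asyncio


async def fetch_stub(url: str) -> str:
    # Placeholder for a real page fetch; sleeps briefly instead of using the network.
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def fetch_many(urls, max_concurrent: int = 3):
    # The semaphore caps how many fetches are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            try:
                return await fetch_stub(url)
            except Exception as exc:
                # Graceful degradation: report the failure for this URL
                # instead of aborting the whole batch.
                return exc

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]
results = asyncio.run(fetch_many(urls))
print(len(results))
```

A semaphore is one simple way to express a configurable concurrency limit; a real crawler might also batch by host or reuse browser contexts.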
Documentation
- Complete JavaScript API guide with examples
- Comprehensive API reference documentation
- Performance benchmarks vs Katana crawler
- Testing infrastructure documentation
- Strategic positioning and use case guidance
Testing
- 18 test files with 16,554+ lines of test code
- Modern framework integration tests
- Mobile browser compatibility tests
- Security and penetration testing
- Memory management and leak detection
- Network resilience and error handling
- Performance under pressure validation
- Browser engine compatibility testing
Performance
- Intelligent content extraction optimized for LLM consumption
- Concurrent processing with configurable limits
- Memory-efficient batch processing
- Resource cleanup and garbage collection
- Connection pooling and request optimization
Security
- XSS prevention and input validation
- Script execution sandboxing
- Safe error handling without information leakage
- Comprehensive security test suite
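To make the "input validation" entry concrete, the sketch below shows one common check a crawler might apply before navigating: accepting only `http` and `https` URLs that have a host. The function name `is_safe_url` and the allow-list are assumptions for illustration, and this covers only URL validation, not XSS prevention as a whole.

```python
from urllib.parse import urlsplit

# Assumed allow-list; a real project may permit other schemes.
ALLOWED_SCHEMES = {"http", "https"}


def is_safe_url(url: str) -> bool:
    """Return True only for http(s) URLs that include a hostname."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # Malformed input (for example, a broken IPv6 literal) is rejected.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)


for candidate in ("https://example.com/a", "javascript:alert(1)", "file:///etc/passwd"):
    print(candidate, is_safe_url(candidate))
```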
[0.1.0] - 2024-09-18
Added
- Initial public release
- Core browser automation functionality
- JavaScript execution capabilities
- Content extraction and processing
- MCP server integration
- Comprehensive documentation
- Production-ready test suite
For more details about changes, see the commit history.