A performant, extensible, and lean web crawler that uses all available CPUs by default.
It uses an event loop for I/O and separate processes for analyzing pages.
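
The split between an event loop for I/O and worker processes for analysis can be illustrated roughly as follows. This is a minimal sketch using only the standard library, not the crawler's actual code: `download`, `analyze`, and `crawl` are hypothetical names, and the download is simulated with a short sleep instead of a real HTTP request.

```python
import asyncio
import os
from concurrent.futures import Executor, ProcessPoolExecutor


def analyze(html: str) -> int:
    """CPU-bound stand-in for page analysis: count the links in a page."""
    return html.count("<a ")


async def download(url: str) -> str:
    """I/O-bound stand-in for a real HTTP request (no network access)."""
    await asyncio.sleep(0.01)
    return f'<html><a href="{url}/next">next</a></html>'


async def crawl(urls: list[str], executor: Executor) -> dict[str, int]:
    """Download pages concurrently and analyze each one in the executor."""
    loop = asyncio.get_running_loop()

    async def handle(url: str) -> tuple[str, int]:
        html = await download(url)  # awaited on the event loop
        # Hand the CPU-bound step to the pool so the loop stays responsive.
        links = await loop.run_in_executor(executor, analyze, html)
        return url, links

    return dict(await asyncio.gather(*(handle(u) for u in urls)))


if __name__ == "__main__":
    # One worker process per CPU, mirroring the "all available CPUs" default.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        urls = ["https://example.com/a", "https://example.com/b"]
        print(asyncio.run(crawl(urls, pool)))
```

Passing the executor in as a parameter keeps the sketch easy to test, since a thread pool can be substituted for the process pool.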

- Basic `httpx` page downloader
- S3 page storage
- Local filesystem page storage

- Have a look at `tests/integration/test_crawl.py`
- Implement your own `PageAnalyzer` and `PageDownloader` classes
- Optionally customize `structlog` logging, see configuration
- Have fun!

All classes in the modules folder can be replaced with your custom implementation.
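
As a rough illustration of what a replaceable class could look like, the sketch below defines an analyzer against a small interface. The interface shown here (`Analyzer` with an `analyze(url, content)` method) is an assumption made for the example; the crawler's real `PageAnalyzer` base class may have a different name, signature, and return type, so check the modules folder before adapting it.

```python
import re
from typing import Optional, Protocol


class Analyzer(Protocol):
    """Hypothetical interface; the project's real PageAnalyzer may differ."""

    def analyze(self, url: str, content: str) -> dict: ...


class TitleAnalyzer:
    """Example replacement that extracts the <title> of a page."""

    _title = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

    def analyze(self, url: str, content: str) -> dict:
        match = self._title.search(content)
        title: Optional[str] = match.group(1).strip() if match else None
        return {"url": url, "title": title}


def run(analyzer: Analyzer, pages: dict[str, str]) -> list[dict]:
    # The caller depends only on the interface, so any implementation fits.
    return [analyzer.analyze(url, html) for url, html in pages.items()]
```

Any object with a matching `analyze` method can be passed to `run`, which is the general idea behind swapping in a custom implementation.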