Skip to content

datek/web-crawler

Repository files navigation

codecov

Web Crawler

Performant, extensible and lean web crawler, utilizes all available CPUs by default.

Uses event loop for I/O and processes for analyzing the pages.

Batteries included

  • Basic httpx page downloader
  • S3 page storage
  • Local filesystem page storage

Usage

  • Have a look at tests/integration/test_crawl.py
  • Implement your own PageAnalyzer and PageDownloader classes
  • Optionally customize structlog logging, see configuration
  • Have fun!

Customization

All classes in the modules folder can be replaced with your custom implementation.

About

Performant, lean, highly customizable python web crawler

Resources

License

Stars

Watchers

Forks

Packages

No packages published