@commoncrawl

Common Crawl Foundation

Common Crawl provides an archive of webpages going back to 2007.

Pinned

  1. cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python, 442 stars, 90 forks

  2. cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python, 195 stars, 16 forks

  3. cc-index-table Public

    Index Common Crawl archives in tabular format

    Java, 122 stars, 14 forks

  4. cc-warc-examples Public

    Forked from Smerity/cc-warc-examples

    CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java, 37 stars, 18 forks

  5. cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook, 26 stars, 5 forks

  6. cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook, 59 stars, 10 forks

Repositories

Showing 10 of 75 repositories
  • cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python, 195 stars, Apache-2.0, 16 forks, 3 issues, 1 pull request, updated Nov 3, 2025
  • cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python, 442 stars, MIT, 90 forks, 5 issues, 1 pull request, updated Nov 1, 2025
  • web-languages Public

    Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

    60 stars, 78 forks, 3 issues, 1 pull request, updated Nov 1, 2025
  • cc-index-table Public

    Index Common Crawl archives in tabular format

    Java, 122 stars, Apache-2.0, 14 forks, 7 issues, 2 pull requests, updated Oct 31, 2025
  • whirlwind-python-notebook Public

    A Jupyter notebook illustrating the basics of Common Crawl's datasets.

    Jupyter Notebook, 1 star, Apache-2.0, 0 forks, 0 issues, 0 pull requests, updated Oct 28, 2025
  • whirlwind-python Public

    A whirlwind tour of Common Crawl's data using Python

    Python, 28 stars, Apache-2.0, 6 forks, 0 issues, 1 pull request, updated Oct 27, 2025
  • cc-webgraph-statistics Public

    Statistics of Common Crawl monthly Web Graphs

    Python, 5 stars, Apache-2.0, 1 fork, 2 issues, 0 pull requests, updated Oct 25, 2025
  • robotstxt-experiments Public

    How is the Robots Exclusion Protocol (robots.txt) used on the WWW? This project tries to gain some insights by mining Common Crawl's robots.txt captures from the years 2016 – 2024.

    Jupyter Notebook, 0 stars, MIT, 0 forks, 0 issues, 0 pull requests, updated Oct 23, 2025
  • cc-webgraph Public

    Tools to construct and process Common Crawl webgraphs

    Java, 101 stars, Apache-2.0, 4 forks, 2 issues (1 issue needs help), 0 pull requests, updated Oct 23, 2025
  • presentations Public

    A collection of public presentations from the Common Crawl Foundation

    9 stars, 1 fork, 0 issues, 0 pull requests, updated Oct 22, 2025