Cost-efficient alternatives to HAR files #1092

@nrllh

Currently, we only host HAR files for the most recent crawl (as discussed in #1011), and all older HAR files have been removed. I've used them extensively in a project, and I know others in the community have also relied on them for research across different domains.

Since their removal, reproducing HAR-like data from BigQuery has been difficult and expensive. Querying the raw request/response data across multiple tables at page-level granularity quickly becomes cost-prohibitive for many users.

One idea could be to provide a UDF that reassembles HAR-like structures, but that still risks being costly depending on the size of the crawl and query.
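To make the reassembly idea concrete, here is a rough client-side sketch in Python of what such a function might do: group flat request rows by page and wrap each group in a HAR-like `log` object. The row field names (`page`, `url`, `method`, `status`, `started`, `request_headers`, `response_headers`) and the function name `rows_to_har` are hypothetical placeholders, not the actual table schema or an existing helper, and the output only covers a small subset of the HAR 1.2 fields.

```python
from collections import defaultdict


def rows_to_har(rows, creator_name="har-sketch"):
    """Group flat request rows by page and build one HAR-like dict per page.

    Each row is assumed (hypothetically) to be a dict with the keys
    page, url, method, status, started, and optional header lists.
    """
    by_page = defaultdict(list)
    for row in rows:
        by_page[row["page"]].append(row)

    hars = {}
    for page_url, page_rows in by_page.items():
        # ISO 8601 timestamps in one consistent format sort correctly as strings.
        page_rows.sort(key=lambda r: r["started"])
        entries = [
            {
                "pageref": "page_1",
                "startedDateTime": r["started"],
                "request": {
                    "method": r["method"],
                    "url": r["url"],
                    "headers": r.get("request_headers", []),
                },
                "response": {
                    "status": r["status"],
                    "headers": r.get("response_headers", []),
                },
            }
            for r in page_rows
        ]
        hars[page_url] = {
            "log": {
                "version": "1.2",
                "creator": {"name": creator_name, "version": "0"},
                "pages": [
                    {
                        "id": "page_1",
                        "title": page_url,
                        "startedDateTime": page_rows[0]["started"],
                    }
                ],
                "entries": entries,
            }
        }
    return hars
```

A sketch like this only moves the assembly step; the cost concern remains, because the underlying rows still have to be scanned for every page of interest.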

We should make historical crawl data more accessible again, in a way that's sustainable and doesn't shift high costs to users. Ideally, the community should be able to query or download HAR-like data efficiently.

It would be good to discuss options for restoring this kind of access, either through BigQuery optimizations or external exports.
