Description
Currently, we only host HAR files for the most recent crawl (as discussed in #1011), and all older HAR files have been removed. I've used them extensively in a project, and I know others in the community have also relied on them for research across different domains.
Since their removal, reproducing HAR-like data from BigQuery is difficult and expensive. Querying the raw request/response data across multiple tables at page-level granularity quickly becomes cost-prohibitive for many users.
One idea could be to provide a UDF that reassembles HAR-like structures, but that still risks being costly depending on the size of the crawl and query.
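To make the reassembly idea concrete, here is a rough sketch of what the logic might look like. It is written as client-side Python rather than as an actual BigQuery UDF (which would be SQL or JavaScript), and the row field names (`url`, `started`, `time_ms`, `status`, and so on) are hypothetical placeholders, not the real table schema. Only the top-level layout (`log`, `pages`, `entries`) follows the HAR 1.2 format.

```python
import json
from typing import Any


def rows_to_har(page_url: str, rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Assemble a minimal HAR 1.2-style document from flat request rows.

    The row keys used here are assumptions for illustration only; they
    would need to be mapped to the actual column names.
    """
    page_id = "page_1"
    # Order requests by start time so entries appear in load order.
    ordered = sorted(rows, key=lambda r: r.get("started", ""))
    entries = []
    for row in ordered:
        entries.append({
            "pageref": page_id,
            "startedDateTime": row.get("started", ""),
            "time": row.get("time_ms", 0),
            "request": {
                "method": row.get("method", "GET"),
                "url": row["url"],
                "headers": row.get("request_headers", []),
            },
            "response": {
                "status": row.get("status", 0),
                "headers": row.get("response_headers", []),
                "content": {
                    "size": row.get("body_size", 0),
                    "mimeType": row.get("mime_type", ""),
                },
            },
        })
    return {
        "log": {
            "version": "1.2",
            "creator": {"name": "har-sketch", "version": "0"},  # placeholder
            "pages": [{
                "id": page_id,
                "title": page_url,
                "startedDateTime": ordered[0].get("started", "") if ordered else "",
                "pageTimings": {},
            }],
            "entries": entries,
        }
    }


if __name__ == "__main__":
    sample = [
        {"url": "https://example.com/app.js", "started": "2024-01-01T00:00:01Z", "status": 200},
        {"url": "https://example.com/", "started": "2024-01-01T00:00:00Z", "status": 200},
    ]
    print(json.dumps(rows_to_har("https://example.com/", sample), indent=2))
```

The assembly itself is cheap. The cost concern is in fetching the per-request rows for each page in the first place, which this sketch does not address.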
We should make historical crawl data more accessible again, in a way that's sustainable and doesn't shift high costs to users. Ideally, the community should be able to query or download HAR-like data efficiently.
It would be good to discuss options for restoring this kind of access, either through BigQuery optimizations or external exports.