Cache file listing results and Parquet metadata #516

@linhr

Description

Currently, the plan resolver lists all files and fetches metadata for each individual Parquet file every time it plans a query, even if the dataset has already been registered as a temporary view. This adds overhead, especially when the data is remote (e.g. in an object store such as AWS S3) and when the query involves multiple datasets with a large number of partitioned files.

We may want to cache file listing results and Parquet metadata in the plan resolver. The downside is that there is no way to detect when the cache is stale. This is acceptable, though, since we usually assume the files do not change. If the dataset does change (for example, after being overwritten), the user can restart the session as a workaround.
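
To make the idea concrete, here is a minimal sketch of a session-scoped cache, written in Python purely for illustration. Everything in it is hypothetical: `SessionFileCache`, `list_files`, `read_metadata`, and `plan` are made-up names rather than the resolver's actual API, and the two stub functions stand in for an object-store listing call and a Parquet footer read. The `clear()` method is an addition not proposed above; restarting the session would have the same effect by discarding the cache.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SessionFileCache:
    """Hypothetical per-session cache; not the project's real API."""

    list_files: Callable[[str], List[str]]  # stands in for an object-store listing
    read_metadata: Callable[[str], dict]    # stands in for a Parquet footer read
    _listings: Dict[str, List[str]] = field(default_factory=dict)
    _metadata: Dict[str, dict] = field(default_factory=dict)

    def files(self, dataset: str) -> List[str]:
        # List the dataset only on first use within the session.
        if dataset not in self._listings:
            self._listings[dataset] = list(self.list_files(dataset))
        return self._listings[dataset]

    def metadata(self, path: str) -> dict:
        # Fetch per-file metadata only on first use within the session.
        if path not in self._metadata:
            self._metadata[path] = self.read_metadata(path)
        return self._metadata[path]

    def plan(self, dataset: str) -> List[dict]:
        # What query planning needs: metadata for every file in the dataset.
        return [self.metadata(path) for path in self.files(dataset)]

    def clear(self) -> None:
        # No staleness detection: entries live until cleared or the session ends.
        self._listings.clear()
        self._metadata.clear()


# Stubs that count how often the "remote" calls are made.
calls = {"list": 0, "meta": 0}


def fake_list(dataset: str) -> List[str]:
    calls["list"] += 1
    return [f"{dataset}/part-0.parquet", f"{dataset}/part-1.parquet"]


def fake_meta(path: str) -> dict:
    calls["meta"] += 1
    return {"path": path, "num_rows": 100}


cache = SessionFileCache(fake_list, fake_meta)
first = cache.plan("s3://bucket/events")
second = cache.plan("s3://bucket/events")  # served entirely from the cache
```

With these stubs, planning the same dataset twice performs one listing and one metadata read per file. The trade-off described above also shows up here: if the files under `s3://bucket/events` were overwritten between the two calls, the second plan would still see the old listing.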

Metadata

    Labels

    non-trivial: Likely not a quick addition and may require design discussions
