Conversation

PeterKeDer
Contributor

Description

This adds a custom object store that caches files to disk. I've seen some discussion around this but no implementation yet, so I'm opening this quick PR to gather opinions. Very open to changing the API or implementation.

This solution is very simple (possibly overly naive) and specific to our use case of delta-rs, so there might be some edge cases from newer features that I did not consider. To enable it, you specify a file_cache_path, and it downloads all files requested by get requests and saves them to this path. For range requests, this still downloads the entire file, which means it will often be worse for performance on the first run; I think there's a lot of potential for optimization here.
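To illustrate the mechanism, here is a minimal sketch of the get path as a free function rather than the PR's actual ObjectStore wrapper; the cached_get name and the best-effort error handling are my simplifications:

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Serve `get` from the local cache if present; otherwise download the whole
// object and persist it first. Range requests read from this same full-file
// copy, hence the slower first read mentioned above.
async fn cached_get(
    store: &dyn ObjectStore,
    cache_root: &std::path::Path,
    location: &Path,
) -> object_store::Result<Bytes> {
    let local = cache_root.join(location.as_ref());
    if let Ok(bytes) = tokio::fs::read(&local).await {
        return Ok(bytes.into());
    }
    let bytes = store.get(location).await?.bytes().await?;
    if let Some(parent) = local.parent() {
        // Cache writes are best-effort: a failed write should not fail the read.
        tokio::fs::create_dir_all(parent).await.ok();
    }
    tokio::fs::write(&local, &bytes).await.ok();
    Ok(bytes)
}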

There is also an option to specify a cache duration for _last_checkpoint, which to my knowledge is the only mutable file. We found that this improves performance by using the local (and potentially outdated) checkpoint parquet, because loading the incremental json files is much faster than loading a new checkpoint parquet.
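The freshness check for the cached _last_checkpoint can be as simple as comparing the file's mtime against the configured duration; a sketch of the assumed logic, not the PR's exact code:

use std::time::{Duration, SystemTime};

// Reuse the cached `_last_checkpoint` only while it is younger than the
// configured validity window (e.g. the "30m" in the example below).
fn is_fresh(path: &std::path::Path, valid_for: Duration) -> bool {
    std::fs::metadata(path)
        .and_then(|meta| meta.modified())
        .ok()
        .and_then(|mtime| SystemTime::now().duration_since(mtime).ok())
        .map(|age| age < valid_for)
        .unwrap_or(false)
}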

We've been using this at our company for the last couple weeks. It has been very helpful for loading large datasets, so we think a caching feature like this would be quite valuable.

Usage example:

from deltalake import DeltaTable

dt = DeltaTable(
    "s3://bucket/table",
    storage_options={
        "file_cache_path": "/Users/peter/delta-cache",
        "file_cache_last_checkpoint_valid_duration": "30m",
    },
)
# do stuff with dt

github-actions bot added the binding/rust (Issues for the Rust crate) label on May 17, 2025
@ion-elgreco
Collaborator

@roeap was actually going to add a caching object store based on foyer this weekend, which supports hybrid caching :)

@roeap
Collaborator

roeap commented May 17, 2025

Just opened a draft PR - still need to wire things through, but the cache implementation should be close to finished.

@PeterKeDer - would love your feedback and happy to discuss which route we should take.

The most obvious difference is that I used an external implementation of a cache. The policy is also more selective: right now we only cache specific log files (the json ones) and omit the others. We do, however, try to make the cache sharable, such that it can be used for multiple table instances; i.e. we use a fully qualified URL as the cache key.
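Hypothetically sketched, the policy and key derivation could look like this (assumed shape, not the draft PR's actual code):

use object_store::path::Path;
use url::Url;

// Only commit json files under the delta log are cached; everything else
// passes through to the inner store.
fn should_cache(location: &Path) -> bool {
    location.as_ref().contains("_delta_log/") && location.as_ref().ends_with(".json")
}

// Fully qualified URL as the cache key, so one cache can be shared across
// table instances, e.g. "s3://bucket/table/_delta_log/...0001.json".
fn cache_key(store_url: &Url, location: &Path) -> String {
    format!("{}/{}", store_url.as_str().trim_end_matches('/'), location)
}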

To enable additional caching there have been various updates to delta kernel, which allow us to incrementally update a snapshot and scan state.

These would, however, only get activated once we use kernel for log handling ...


codecov bot commented May 17, 2025

Codecov Report

❌ Patch coverage is 5.48780% with 155 lines in your changes missing coverage. Please review.
✅ Project coverage is 14.57%. Comparing base (6a4f249) to head (3643c91).
⚠️ Report is 28 commits behind head on main.

Files with missing lines                          Patch %   Lines
crates/core/src/logstore/storage/file_cache.rs    0.00%     145 Missing ⚠️
crates/core/src/logstore/config.rs                52.94%    6 Missing and 2 partials ⚠️
crates/core/src/logstore/mod.rs                   0.00%     0 Missing and 2 partials ⚠️

❗ There is a different number of reports uploaded between BASE (6a4f249) and HEAD (3643c91): HEAD has 5 fewer uploads than BASE (1 vs 6).
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3439       +/-   ##
===========================================
- Coverage   74.50%   14.57%   -59.94%     
===========================================
  Files         146       71       -75     
  Lines       44876    12273    -32603     
  Branches    44876    12273    -32603     
===========================================
- Hits        33437     1789    -31648     
- Misses       9248    10270     +1022     
+ Partials     2191      214     -1977     

☔ View full report in Codecov by Sentry.

@PeterKeDer
Contributor Author

@roeap looks cool! Glad you're already working on it 😄

I'm curious about the decision to only cache json log files. What's fundamentally blocking your implementation from also caching checkpoints and data files? I'm not too familiar with the delta kernel, so there may be a knowledge gap there.

From my experience, downloading the data files generally consumes a lot more time than updating the transaction log. Having a cache work for the data files too would be very valuable for our use cases.

@roeap
Collaborator

roeap commented May 17, 2025

@PeterKeDer - there may be a few things to unpack before that choice makes sense 😆.

  1. we have 2 phases to look at: a) metadata, b) data ... in the metadata phase we can be very opinionated about caching, as we will always be reading files from the log. In the data phase, it's "just" about reading the data files.

  2. delta-rs more or less is a cache, since we keep all file actions for a snapshot in memory. In kernel we already added methods to take advantage of that and replay from a given state.

  3. In a kernel log replay, we discard some information from the json files that we may later need.

With that, the idea is to use the filesystem / byte level caching only for the json files, so we can re-read them to extract other actions.

Kernel also introduces file format level abstractions, i.e. read_parquet, which allow us to also avoid additional decoding of parquet data by e.g. caching the data in feather format (basically flat arrow).
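For instance, a decoded batch could be persisted as Arrow IPC with arrow-rs, so a later read skips parquet decoding entirely; a minimal sketch, assuming the batch is already in memory:

use std::fs::File;

use arrow::ipc::writer::FileWriter;
use arrow::record_batch::RecordBatch;

// Write an already-decoded batch to the IPC file format ("feather"), which
// can later be read back without any parquet decoding.
fn write_feather(batch: &RecordBatch, path: &str) -> Result<(), arrow::error::ArrowError> {
    let file = File::create(path)?;
    let mut writer = FileWriter::try_new(file, batch.schema().as_ref())?;
    writer.write(batch)?;
    writer.finish()
}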

All of this mainly applies to the metadata phase; in the data phase we can then use general-purpose caching. Here I hope that we can leverage mechanics built into datafusion ...

All that said, if we make the caching here configurable (just enable/disable), we can see that we get this PR merged, as all of the above is obviously a multi-step and longer-duration effort 😆. I would then maybe deconstruct this over time?

@ion-elgreco
Copy link
Collaborator

> @roeap: there may be a few things to unpack before that choice makes sense 😆. […]

Regarding Datafusion and parquet caching, there is another OSS database that does this, with some clever caching at the byte level instead of the file level.

@roeap
Collaborator

roeap commented May 17, 2025

> Regarding Datafusion and parquet caching, there is another OSS database that does this, with some clever caching at the byte level instead of the file level.

Which one is that? When the time comes I wanted to do some research... I also think that datafusion-comet has some clever caching built in that I wanted to look at.

@ion-elgreco
Collaborator

> Which one is that? […]

This one from seafowl: https://github.com/splitgraph/seafowl/blob/main/src/object_store/cache.rs
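The core idea there, hypothetically sketched (RangeCache is an illustrative name; a real version needs eviction and chunk alignment, and the range element type should match the object_store version in use):

use std::collections::HashMap;
use std::ops::Range;

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Cache individual byte ranges keyed on (path, range) instead of whole
// files, so repeated range reads of large parquet files can hit the cache
// without ever downloading the full object.
struct RangeCache {
    entries: HashMap<(String, (u64, u64)), Bytes>,
}

impl RangeCache {
    async fn get_range(
        &mut self,
        store: &dyn ObjectStore,
        location: &Path,
        range: Range<u64>,
    ) -> object_store::Result<Bytes> {
        let key = (location.to_string(), (range.start, range.end));
        if let Some(bytes) = self.entries.get(&key) {
            return Ok(bytes.clone());
        }
        let bytes = store.get_range(location, range).await?;
        self.entries.insert(key, bytes.clone());
        Ok(bytes)
    }
}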

@PeterKeDer
Contributor Author

> @roeap: there may be a few things to unpack before that choice makes sense 😆. […]

Makes sense, thanks for the explanation!

Yep, the caching here is entirely configurable; it is disabled by default, but you can just pass in a storage option to enable it. If you think this PR is fine to merge, I'm happy to clean this up a bit and add a few test cases. We should also make the potential downsides clear, i.e. slower initial reads if we rely on range requests, and a growing cache size as we load more files.

@PeterKeDer
Contributor Author

@roeap @ion-elgreco I cleaned this up and added simple tests; I'd appreciate it if you could take a look and let me know your thoughts.

Collaborator

@roeap roeap left a comment


Overall looking great @PeterKeDer.

There is one thing I hope we can maybe change: how we treat configuration and setup.

Specifically, since recently we can use deltalake_derive::DeltaConfig to derive extracting config from string maps (some details here). The basic idea would be to include a new config and add it to StorageConfig.

I also tried to start establishing a pattern for adding layers to object stores via decorate_store methods on the config.

If we like this pattern, maybe we can update to use that approach? I could then, in a follow-up, update this config to also allow for the more minimal cache we discussed above.
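The basic shape of the pattern, hypothetically sketched (illustrative names, not the actual delta-rs API):

use std::sync::Arc;

use object_store::ObjectStore;

// Each config section gets a chance to wrap ("decorate") the store it is
// handed; the final store is the fold of all configured layers over the root.
trait DecorateStore {
    fn decorate_store(&self, store: Arc<dyn ObjectStore>) -> Arc<dyn ObjectStore>;
}

fn build_store(
    root: Arc<dyn ObjectStore>,
    layers: &[&dyn DecorateStore],
) -> Arc<dyn ObjectStore> {
    layers
        .iter()
        .fold(root, |store, layer| layer.decorate_store(store))
}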

@PeterKeDer
Contributor Author

@roeap I took a shot at moving the config over to DeltaConfig, let me know if I'm using it correctly. Also, I noticed decorate_store is not being applied to the root store, but it probably should be if we add functionality other than prefix. I added a separate function to handle that; let me know if that makes sense.

Separately, do you happen to know why this test is failing? e.g. here. It passes for me locally when running cargo test --verbose --features azure,datafusion,s3,gcs,glue,hdfs

@roeap
Collaborator

roeap commented May 21, 2025

@PeterKeDer - looking good, the config system is used as intended. In case you have feedback w.r.t. ease of use, happy to hear it, since this is very new.

You are absolutely right about the decoration being applied to the root store. We will hopefully eventually be able to get rid of the prefixed store altogether. That said, I'll likely have to tinker a bit with how we apply things, since we may need to access async code when setting things up (e.g. to get a dedicated IO runtime without risking being outside of an async context) while also avoiding making all the setup code async.

I'll have a look at the test failure later today and see if I can support here.

@PeterKeDer
Contributor Author

@roeap were you able to take a look at the test failures? I wonder if there are differences in how files are written when the test runs in the pipeline vs locally that cause the cache to not work as intended.

@roeap roeap self-requested a review May 29, 2025 09:01
@roeap
Collaborator

roeap commented May 29, 2025

@PeterKeDer - after a bit of investigating, I believe the reason is that the temporary directory gets dropped (and with that, deleted) prior to the last access, so we effectively purged our own cache. Why this shows up only in CI and not locally 🤷‍♂️.

We have seen some flakiness with tempdir() in the past as well. While we could try to manage the lifetime of the dir differently, I'd suggest a slight refactor to a) work around this and b) be a bit more flexible, I think. I tried this out here, essentially making the store generic over the file cache store.
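For reference, the pitfall in miniature: tempfile deletes the directory when the TempDir guard is dropped, so binding only the path lets the cache dir vanish while the store still points at it.

use tempfile::tempdir;

fn main() -> std::io::Result<()> {
    // Bug: the TempDir guard is a temporary here and is dropped at the end
    // of the statement, deleting the directory before anyone can use it.
    let path = tempdir()?.path().to_path_buf();
    assert!(!path.exists());

    // Fix: keep the guard alive for as long as the cache needs the dir.
    let dir = tempdir()?;
    let path = dir.path().to_path_buf();
    assert!(path.exists());
    drop(dir); // directory removed here, at a point we control
    Ok(())
}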

@PeterKeDer
Contributor Author

Ahh interesting, thanks for looking into it! I did use tempdir in some other tests, though; I wonder why they didn't fail 🤔

The change to make the store more generic makes sense; people can customize the cache if needed. Let me add that change and update the tests.

@PeterKeDer
Contributor Author

@roeap updated per your changes, let me know what you think

@rtyler rtyler force-pushed the local-file-cache branch 2 times, most recently from a5cef59 to 8bdb076 on July 15, 2025 16:24
@rtyler rtyler force-pushed the local-file-cache branch from 8bdb076 to 3643c91 on July 15, 2025 16:42
@abhi-airspace-intelligence
Contributor

> Regarding Datafusion and parquet caching, there is another OSS database that does this, with some clever caching at the byte level instead of the file level. […]

Regarding this, in DF50 there's now a parquet file cache that can be instantiated. Since the parquet handling is done with the delta kernel and not Datafusion, there's probably no way to directly benefit from it, but this epic probably has nice snippets for what it could look like: apache/datafusion#17000
