Conversation

PeterKeDer
Contributor

Description

This adds a custom object store that caches files to disk. I've seen some discussion around this but no implementation yet, so I'm opening this quick PR to gather opinions. Very open to changing the API or implementation.

This solution is very simple (possibly overly naive) and specific to our use case of delta-rs, so there might be some edge cases from newer features that I did not consider. To enable it, you specify a file_cache_path, and it downloads all files requested by get requests and saves them to this path. For range requests, this still downloads the entire file, which means it will often be worse for performance on the first run; I think there's a lot of potential for optimization here.
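To illustrate the mechanism, here is a minimal sketch of the get path as a free function rather than the PR's actual ObjectStore wrapper; the cached_get name and the best-effort error handling are my simplifications:

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Serve `get` from the local cache if present; otherwise download the whole
// object and persist it first. Range requests read from this same full-file
// copy, hence the slower first read mentioned above.
async fn cached_get(
    store: &dyn ObjectStore,
    cache_root: &std::path::Path,
    location: &Path,
) -> object_store::Result<Bytes> {
    let local = cache_root.join(location.as_ref());
    if let Ok(bytes) = tokio::fs::read(&local).await {
        return Ok(bytes.into());
    }
    let bytes = store.get(location).await?.bytes().await?;
    if let Some(parent) = local.parent() {
        // Cache writes are best-effort: a failed write should not fail the read.
        tokio::fs::create_dir_all(parent).await.ok();
    }
    tokio::fs::write(&local, &bytes).await.ok();
    Ok(bytes)
}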

There is also an option to specify a cache duration for _last_checkpoint, which to my knowledge is the only mutable file. We found that this improves performance by using the local (and potentially outdated) checkpoint parquet, because loading the incremental json files is much faster than loading a new checkpoint parquet.
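The freshness check for the cached _last_checkpoint can be as simple as comparing the file's mtime against the configured duration; a sketch of the assumed logic, not the PR's exact code:

use std::time::{Duration, SystemTime};

// Reuse the cached `_last_checkpoint` only while it is younger than the
// configured validity window (e.g. the "30m" in the example below).
fn is_fresh(path: &std::path::Path, valid_for: Duration) -> bool {
    std::fs::metadata(path)
        .and_then(|meta| meta.modified())
        .ok()
        .and_then(|mtime| SystemTime::now().duration_since(mtime).ok())
        .map(|age| age < valid_for)
        .unwrap_or(false)
}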

We've been using this at our company for the last couple weeks. It has been very helpful for loading large datasets, so we think a caching feature like this would be quite valuable.

Usage example:

from deltalake import DeltaTable

dt = DeltaTable(
    "s3://bucket/table",
    storage_options={
        "file_cache_path": "/Users/peter/delta-cache",
        "file_cache_last_checkpoint_valid_duration": "30m",
    },
)
# do stuff with dt

github-actions bot added the binding/rust (Issues for the Rust crate) label on May 17, 2025
@ion-elgreco
Collaborator

@roeap was actually going to add a caching object store based on foyer this weekend, which supports hybrid caching :)

@roeap
Collaborator

roeap commented May 17, 2025

Just opened a draft PR - still need to wire things through, but the cache implementation should be close to finished.

@PeterKeDer - would love your feedback and happy to discuss which route we should take.

The most obvious difference is that I used an external implementation of a cache. The policy is also more selective: right now we only cache specific log files (the json ones) and omit the others. We do, however, try to make the cache sharable, such that it can be used for multiple table instances; i.e. we use a fully qualified URL as the cache key.
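Hypothetically sketched, the policy and key derivation could look like this (assumed shape, not the draft PR's actual code):

use object_store::path::Path;
use url::Url;

// Only commit json files under the delta log are cached; everything else
// passes through to the inner store.
fn should_cache(location: &Path) -> bool {
    location.as_ref().contains("_delta_log/") && location.as_ref().ends_with(".json")
}

// Fully qualified URL as the cache key, so one cache can be shared across
// table instances, e.g. "s3://bucket/table/_delta_log/...0001.json".
fn cache_key(store_url: &Url, location: &Path) -> String {
    format!("{}/{}", store_url.as_str().trim_end_matches('/'), location)
}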

To enable additional caching there have been various updates to delta kernel, which allow us to incrementally update a snapshot and scan state.

These would, however, only get activated once we use kernel for log handling ...


codecov bot commented May 17, 2025

Codecov Report

❌ Patch coverage is 5.48780% with 155 lines in your changes missing coverage. Please review.
✅ Project coverage is 14.57%. Comparing base (6a4f249) to head (3643c91).
⚠️ Report is 28 commits behind head on main.

Files with missing lines                          Patch %   Lines
crates/core/src/logstore/storage/file_cache.rs    0.00%     145 Missing ⚠️
crates/core/src/logstore/config.rs                52.94%    6 Missing and 2 partials ⚠️
crates/core/src/logstore/mod.rs                   0.00%     0 Missing and 2 partials ⚠️

❗ There is a different number of reports uploaded between BASE (6a4f249) and HEAD (3643c91): HEAD has 5 fewer uploads than BASE (1 vs 6).
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3439       +/-   ##
===========================================
- Coverage   74.50%   14.57%   -59.94%     
===========================================
  Files         146       71       -75     
  Lines       44876    12273    -32603     
  Branches    44876    12273    -32603     
===========================================
- Hits        33437     1789    -31648     
- Misses       9248    10270     +1022     
+ Partials     2191      214     -1977     

☔ View full report in Codecov by Sentry.

@PeterKeDer
Contributor Author

@roeap looks cool! Glad you're already working on it 😄

I'm curious about the decision to only cache json log files. What's fundamentally blocking your implementation from also caching checkpoints and data files? I'm not too familiar with the delta kernel, so there may be a knowledge gap there.

From my experience, downloading the data files generally consumes a lot more time than updating the transaction log. Having a cache work for the data files too would be very valuable for our use cases.

@roeap
Collaborator

roeap commented May 17, 2025

@PeterKeDer - there may be a few things to unpack before that choice makes sense 😆.

  1. we have 2 phases to look at: a) metadata, b) data ... in the metadata phase we can be very opinionated about caching, as we will always be reading files from the log. In the data phase, it's "just" about reading the data files.

  2. delta-rs more or less is a cache, since we keep all file actions for a snapshot in memory. In kernel we already added methods to take advantage of that and replay from a given state.

  3. In a kernel log replay, we discard some information from the json files that we may later need.

With that, the idea is to use the filesystem / byte level caching only for the json files, so we can re-read them to extract other actions.

Kernel also introduces file format level abstractions, i.e. read_parquet, which allow us to also avoid additional decoding of parquet data by e.g. caching the data in feather format (basically flat arrow).
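For instance, a decoded batch could be persisted as Arrow IPC with arrow-rs, so a later read skips parquet decoding entirely; a minimal sketch, assuming the batch is already in memory:

use std::fs::File;

use arrow::ipc::writer::FileWriter;
use arrow::record_batch::RecordBatch;

// Write an already-decoded batch to the IPC file format ("feather"), which
// can later be read back without any parquet decoding.
fn write_feather(batch: &RecordBatch, path: &str) -> Result<(), arrow::error::ArrowError> {
    let file = File::create(path)?;
    let mut writer = FileWriter::try_new(file, batch.schema().as_ref())?;
    writer.write(batch)?;
    writer.finish()
}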

All of this mainly applies to the metadata phase; in the data phase we can then use general-purpose caching. Here I hope that we can leverage mechanics built into datafusion ...

All that said, if we make the caching here configurable (just enable/disable), we can see that we get this PR merged, as all of the above is obviously a multi-step and longer-duration effort 😆. I would then maybe deconstruct this over time?

@ion-elgreco
Copy link
Collaborator

> @roeap: there may be a few things to unpack before that choice makes sense 😆. […]

Regarding Datafusion and parquet caching, there is another OSS database that does this, with some clever caching at the byte level instead of the file level.

@roeap
Collaborator

roeap commented May 17, 2025

> Regarding Datafusion and parquet caching, there is another OSS database that does this, with some clever caching at the byte level instead of the file level.

Which one is that? When the time comes I wanted to do some research... I also think that datafusion-comet has some clever caching built in that I wanted to look at.

@ion-elgreco
Collaborator

> Which one is that? […]

This one from seafowl: https://github.com/splitgraph/seafowl/blob/main/src/object_store/cache.rs
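The core idea there, hypothetically sketched (RangeCache is an illustrative name; a real version needs eviction and chunk alignment, and the range element type should match the object_store version in use):

use std::collections::HashMap;
use std::ops::Range;

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Cache individual byte ranges keyed on (path, range) instead of whole
// files, so repeated range reads of large parquet files can hit the cache
// without ever downloading the full object.
struct RangeCache {
    entries: HashMap<(String, (u64, u64)), Bytes>,
}

impl RangeCache {
    async fn get_range(
        &mut self,
        store: &dyn ObjectStore,
        location: &Path,
        range: Range<u64>,
    ) -> object_store::Result<Bytes> {
        let key = (location.to_string(), (range.start, range.end));
        if let Some(bytes) = self.entries.get(&key) {
            return Ok(bytes.clone());
        }
        let bytes = store.get_range(location, range).await?;
        self.entries.insert(key, bytes.clone());
        Ok(bytes)
    }
}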

@PeterKeDer
Contributor Author

> @roeap: there may be a few things to unpack before that choice makes sense 😆. […]

Makes sense, thanks for the explanation!

Yep, the caching here is entirely configurable; it is disabled by default, but you can just pass in a storage option to enable it. If you think this PR is fine to merge, I'm happy to clean this up a bit and add a few test cases. We should also make the potential downsides clear, i.e. slower initial reads if we rely on range requests, and a growing cache size as we load more files.

@PeterKeDer
Contributor Author

@roeap @ion-elgreco I cleaned this up and added simple tests; I'd appreciate it if you could take a look and let me know your thoughts.

Collaborator

@roeap roeap left a comment


Overall looking great @PeterKeDer.

There is one thing I hope we can maybe change: how we treat configuration and setup.

Specifically, since recently we can use deltalake_derive::DeltaConfig to derive extracting config from string maps (some details here). The basic idea would be to include a new config and add it to StorageConfig.

I also tried to start establishing a pattern for adding layers to object stores via decorate_store methods on the config.

If we like this pattern, maybe we can update to use that approach? I could then, in a follow-up, update this config to also allow for the more minimal cache we discussed above.
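The basic shape of the pattern, hypothetically sketched (illustrative names, not the actual delta-rs API):

use std::sync::Arc;

use object_store::ObjectStore;

// Each config section gets a chance to wrap ("decorate") the store it is
// handed; the final store is the fold of all configured layers over the root.
trait DecorateStore {
    fn decorate_store(&self, store: Arc<dyn ObjectStore>) -> Arc<dyn ObjectStore>;
}

fn build_store(
    root: Arc<dyn ObjectStore>,
    layers: &[&dyn DecorateStore],
) -> Arc<dyn ObjectStore> {
    layers
        .iter()
        .fold(root, |store, layer| layer.decorate_store(store))
}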

@PeterKeDer
Contributor Author

@roeap I took a shot at moving the config over to DeltaConfig, let me know if I'm using it correctly. Also, I noticed decorate_store is not being applied to the root store, but it probably should be if we add functionality other than prefix. I added a separate function to handle that; let me know if that makes sense.

Separately, do you happen to know why this test is failing? e.g. here. It passes for me locally when running cargo test --verbose --features azure,datafusion,s3,gcs,glue,hdfs

@roeap
Collaborator

roeap commented May 21, 2025

@PeterKeDer - looking good, the config system is used as intended. In case you have feedback w.r.t. ease of use, happy to hear it, since this is very new.

You are absolutely right about the decoration being applied to the root store. We will hopefully eventually be able to get rid of the prefixed store altogether. That said, I'll likely have to tinker a bit with how we apply things, since we may need to access async code when setting things up (e.g. to get a dedicated IO runtime without risking being outside of an async context) while also avoiding making all the setup code async.

I'll have a look at the test failure later today and see if I can support here.

@PeterKeDer
Contributor Author

@roeap were you able to take a look at the test failures? I wonder if there are differences in how files are written when the test runs in the pipeline vs locally that cause the cache to not work as intended.

@roeap roeap self-requested a review May 29, 2025 09:01
@roeap
Collaborator

roeap commented May 29, 2025

@PeterKeDer - after a bit of investigating, I believe the reason is that the temporary directory gets dropped (and with that, deleted) prior to the last access, so we effectively purged our own cache. Why this shows up only in CI and not locally 🤷‍♂️.

We have seen some flakiness with tempdir() in the past as well. While we could try to manage the lifetime of the dir differently, I'd suggest a slight refactor to a) work around this and b) be a bit more flexible, I think. I tried this out here, essentially making the store generic over the file cache store.
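For reference, the pitfall in miniature: tempfile deletes the directory when the TempDir guard is dropped, so binding only the path lets the cache dir vanish while the store still points at it.

use tempfile::tempdir;

fn main() -> std::io::Result<()> {
    // Bug: the TempDir guard is a temporary here and is dropped at the end
    // of the statement, deleting the directory before anyone can use it.
    let path = tempdir()?.path().to_path_buf();
    assert!(!path.exists());

    // Fix: keep the guard alive for as long as the cache needs the dir.
    let dir = tempdir()?;
    let path = dir.path().to_path_buf();
    assert!(path.exists());
    drop(dir); // directory removed here, at a point we control
    Ok(())
}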

@PeterKeDer
Contributor Author

Ahh interesting, thanks for looking into it! I did use tempdir in some other tests, though; I wonder why they didn't fail 🤔

The change to make the store more generic makes sense; people can customize the cache if needed. Let me add that change and update the tests.

@PeterKeDer
Contributor Author

@roeap updated per your changes, let me know what you think

@rtyler rtyler force-pushed the local-file-cache branch 2 times, most recently from a5cef59 to 8bdb076 on July 15, 2025 16:24
@rtyler rtyler force-pushed the local-file-cache branch from 8bdb076 to 3643c91 on July 15, 2025 16:42
@abhi-airspace-intelligence
Contributor

> Regarding Datafusion and parquet caching, there is another OSS database that does this, with some clever caching at the byte level instead of the file level. […]

Regarding this, in DF50 there's now a parquet file cache that can be instantiated. Since the parquet handling is done with the delta kernel and not Datafusion, there's probably no way to directly benefit from it, but this epic probably has nice snippets for what it could look like: apache/datafusion#17000
