feat: implement object store that caches to disk #3439
base: main
Conversation
@roeap was actually going to add a caching object store based on …
Just opened a draft PR - still need to wire things through, but the cache implementation should be close to finished. @PeterKeDer - would love your feedback and happy to discuss which route we should take. The most obvious difference would be that I used an external implementation of a cache. The policy is also more selective: right now we only cache specific log files (the json ones) and omit the others. We do however try to make the cache shareable, so that it can be used by multiple table instances - i.e. we use a fully qualified URL as the cache key. To enable additional caching there have been various updates to delta kernel which allow us to incrementally update a snapshot and scan state. These would however only get activated once we have kernel for log handling …
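The selective policy described above - cache only the json log files, keyed by fully qualified URL so multiple table instances can share entries - could look roughly like this sketch. The function names and the exact key scheme are assumptions for illustration, not the PR's actual code.

```rust
/// Decide whether a path should be cached. In this selective policy
/// only json commit files under _delta_log qualify; parquet data and
/// checkpoint files are omitted. (Assumed policy shape.)
fn should_cache(path: &str) -> bool {
    path.contains("_delta_log/") && path.ends_with(".json")
}

/// Build a cache key from the fully qualified URL, so several table
/// instances pointing at the same storage location share entries.
fn cache_key(base_url: &str, path: &str) -> String {
    format!("{}/{}", base_url.trim_end_matches('/'), path)
}

fn main() {
    assert!(should_cache("table/_delta_log/00000000000000000010.json"));
    assert!(!should_cache("table/part-00000.parquet"));
    assert_eq!(
        cache_key("s3://bucket/table/", "_delta_log/0.json"),
        "s3://bucket/table/_delta_log/0.json"
    );
}
```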
Codecov Report ❌

Additional details and impacted files:

@@           Coverage Diff            @@
##             main    #3439      +/- ##
===========================================
- Coverage   74.50%   14.57%   -59.94%
===========================================
  Files         146       71       -75
  Lines       44876    12273    -32603
  Branches    44876    12273    -32603
===========================================
- Hits        33437     1789    -31648
- Misses       9248    10270     +1022
+ Partials     2191      214     -1977
@roeap looks cool! Glad you're already working on it 😄 I'm curious about the decision to only cache json log files. What's fundamentally blocking your implementation from also caching checkpoints and data files? I'm not too familiar with the delta kernel, so there may be a knowledge gap there. From my experience, downloading the data files generally consumes a lot more time than updating the transaction log. Having the cache work for the data files too would be very valuable for our use cases.
@PeterKeDer - there may be a few things to unpack before that choice makes sense 😆.
With that, the idea is to use the filesystem / byte level caching only for the json files, so we can re-read them to extract other actions. Kernel also introduces file format level abstractions - i.e. … All of this mainly applies to the metadata phase; in the data phase we can then use general purpose caching - here I hope that we can leverage mechanics built into datafusion …

All that said, if we make the caching here configurable - just enable / disable - we can see to getting this PR merged, as all of the above is obviously a multi-step, longer duration effort 😆. I would then maybe de-construct this over time?
Regarding Datafusion and parquet caching, there is another oss database that does this, with some clever caching at the byte level instead of the file level.
Which one is that? When the time comes I wanted to do some research... I also think that …
This one from seafowl: https://github.com/splitgraph/seafowl/blob/main/src/object_store/cache.rs
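The byte-level approach linked above caches fixed-size chunks of an object rather than whole files, so a range request only fetches the chunks it touches. A rough sketch of the range-to-chunk mapping (the 1 MiB chunk size and function name are assumptions, not seafowl's actual values):

```rust
const CHUNK_SIZE: u64 = 1 << 20; // 1 MiB chunks (assumed size)

/// Map a byte range [start, end) onto the indices of the fixed-size
/// chunks that cover it, so each chunk can be fetched and cached
/// independently of the others.
fn chunks_for_range(start: u64, end: u64) -> std::ops::RangeInclusive<u64> {
    let first = start / CHUNK_SIZE;
    let last = end.saturating_sub(1) / CHUNK_SIZE;
    first..=last
}

fn main() {
    // A small range inside the first chunk touches only chunk 0.
    assert_eq!(chunks_for_range(0, 100), 0..=0);
    // A range crossing the 1 MiB boundary touches chunks 0 and 1.
    assert_eq!(chunks_for_range(1_000_000, 1_200_000), 0..=1);
}
```

The appeal over file-level caching is that the first read of a huge parquet footer no longer pays for downloading the whole file.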
Makes sense, thanks for the explanation! Yep, the caching here is entirely configurable; it is disabled by default, but you can just pass in a storage option to enable it. If you think this PR is fine to merge, I'm happy to clean this up a bit and add a few test cases. We should also make the potential downsides clear, i.e. slower initial reads if we rely on range requests, and a growing cache size as we load more files.
(force-pushed from 51afa2f to ad7ba08)
@roeap @ion-elgreco I cleaned this up & added simple tests - I'd appreciate it if you could take a look and let me know your thoughts.
Overall looking great @PeterKeDer.

There is one thing that I hope we can maybe change: how we treat configuration and setup. Specifically, since recently we can use deltalake_derive::DeltaConfig to derive extracting config from string maps (some details here). The basic idea would be to include a new config and add it to StorageConfig.

I also tried to start establishing a pattern for adding layers to object stores via decorate_store methods on the config. If we like this pattern, maybe we can update to use that approach? I could then, in a follow-up, update this config to also allow for the more minimal cache we discussed above.
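The decorate_store idea - wrapping a base store with optional layers driven by config - is essentially the decorator pattern. A minimal, self-contained stand-in (the trait and types here are illustrative, not delta-rs's actual ObjectStore or StorageConfig APIs):

```rust
use std::collections::HashMap;

// Illustrative stand-in for an object store interface.
trait Store {
    fn get(&mut self, path: &str) -> Option<String>;
}

// A trivial in-memory "remote" store.
struct MemoryStore(HashMap<String, String>);

impl Store for MemoryStore {
    fn get(&mut self, path: &str) -> Option<String> {
        self.0.get(path).cloned()
    }
}

// A caching layer that decorates any inner store: hits are served
// locally, misses fall through to the inner store and are recorded.
struct CachingStore<S: Store> {
    inner: S,
    cache: HashMap<String, String>,
    hits: usize,
}

impl<S: Store> CachingStore<S> {
    fn new(inner: S) -> Self {
        Self { inner, cache: HashMap::new(), hits: 0 }
    }
}

impl<S: Store> Store for CachingStore<S> {
    fn get(&mut self, path: &str) -> Option<String> {
        if let Some(v) = self.cache.get(path) {
            self.hits += 1;
            return Some(v.clone());
        }
        let v = self.inner.get(path)?;
        self.cache.insert(path.to_string(), v.clone());
        Some(v)
    }
}

fn main() {
    let mut inner = MemoryStore(HashMap::new());
    inner.0.insert("a.json".into(), "data".into());
    let mut store = CachingStore::new(inner);
    assert_eq!(store.get("a.json").as_deref(), Some("data"));
    assert_eq!(store.get("a.json").as_deref(), Some("data"));
    assert_eq!(store.hits, 1); // second read served from cache
}
```

Because the decorator implements the same trait as the store it wraps, further layers (logging, prefixing, retries) can stack in any order the config dictates.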
(force-pushed from ad7ba08 to fb41a90)
@roeap I took a shot at moving the config over to … Separately, do you happen to know why this test is failing? e.g. here. It passes for me locally when running …
@PeterKeDer - looking good, the config system is used as intended. In case you have feedback w.r.t. ease of use, happy to hear it, since this is very new. You are absolutely right about the decoration being applied to the root store. We will hopefully eventually be able to get rid of the prefixed store altogether. That said, I'll likely have to tinker a bit with how we apply things, since we may have some requirements to access async code when setting things up (e.g. to get a dedicated io runtime without risking being outside of an async context) while also avoiding making all the setup code async. I'll have a look at the test failure later today and see if I can support here.
@roeap were you able to take a look at the test failures? I wonder if there are differences in how files are written when the test runs in the pipeline vs locally that cause the cache to not work as intended.
@PeterKeDer - after a bit of investigating, I believe the reason is that the temporary directory gets dropped (and with that deleted) prior to the last access, so we effectively purged our own cache. Why this shows up only in CI and not locally 🤷♂️. We have seen some flakiness with …
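The failure mode described here - a temp directory whose destructor deletes the cache before the last read - can be reproduced with a minimal RAII guard. The TempDir below is a hand-rolled stand-in for the tempfile crate's type, just to show the drop-deletes-directory behavior:

```rust
use std::fs;
use std::path::PathBuf;

/// Minimal stand-in for tempfile::TempDir: removes its directory
/// when dropped, which is exactly what purges the cache too early.
struct TempDir(PathBuf);

impl TempDir {
    fn new(name: &str) -> std::io::Result<Self> {
        let p = std::env::temp_dir().join(name);
        fs::create_dir_all(&p)?;
        Ok(TempDir(p))
    }
}

impl Drop for TempDir {
    fn drop(&mut self) {
        let _ = fs::remove_dir_all(&self.0);
    }
}

fn main() -> std::io::Result<()> {
    let path;
    {
        let dir = TempDir::new("cache_drop_demo")?;
        path = dir.0.clone();
        fs::write(path.join("entry.json"), b"cached")?;
        assert!(path.join("entry.json").exists());
    } // `dir` dropped here: the whole cache directory is deleted.
    assert!(!path.exists());
    Ok(())
}
```

The fix is simply to keep the guard alive for the full lifetime of the test (or the store that uses it), e.g. by binding it in the outermost scope.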
Ahh interesting, thanks for looking into it! I did use … The change to make the store more generic makes sense; people can customize the cache if needed. Let me add that change and update the tests.
(force-pushed from fb41a90 to a6dbb4e)
@roeap updated per your changes, let me know what you think |
(force-pushed from a5cef59 to 8bdb076)
Signed-off-by: Peter Ke <[email protected]>
Regarding this, in DF50 there's now a parquet file cache that can be instantiated. Since the parquet handling is done with the delta kernel and not DataFusion, there's probably no way to directly benefit from it, but this epic probably has nice snippets for what it could look like: apache/datafusion#17000
Description
This adds a custom object store that caches files to disk. I've seen some discussions around this but no implementation of it yet, so I'm opening this quick PR to gather opinions. I'm very open to changing the API or implementation.
This solution is very simple (possibly overly naive) and specific to our use case of delta-rs, so there might be some edge cases from newer features that I did not consider. To enable it, you specify a file_cache_path, and it downloads all files requested by get requests and saves them to this path. For range requests, this still downloads the entire file, which means it will be worse for performance in many cases on the first run - I think there is a lot of potential optimization to be done here.

There is also an option to specify a cache duration for _last_checkpoint, which to my knowledge is the only file that is mutable. We found that this increases performance by using the local (and potentially outdated) checkpoint parquet, because loading the incremental json files is much faster than loading a new checkpoint parquet.

We've been using this at our company for the last couple of weeks. It has been very helpful for loading large datasets, so we think a caching feature like this would be quite valuable.
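The _last_checkpoint cache-duration option amounts to a per-file freshness rule: commit and checkpoint files are immutable once written, while _last_checkpoint is mutable and only stays valid for the configured duration. A sketch of that rule (function and parameter names are assumptions):

```rust
use std::time::Duration;

/// Per-file cache validity: json commits and checkpoint parquets are
/// immutable, so a cached copy never expires; _last_checkpoint can be
/// rewritten, so it is only trusted while younger than the TTL.
fn is_cache_valid(path: &str, age: Duration, checkpoint_ttl: Duration) -> bool {
    if path.ends_with("_last_checkpoint") {
        age <= checkpoint_ttl
    } else {
        true // immutable once written
    }
}

fn main() {
    let ttl = Duration::from_secs(60);
    // Immutable commit file: stays valid no matter how old.
    assert!(is_cache_valid(
        "_delta_log/00000000000000000001.json",
        Duration::from_secs(9_999),
        ttl
    ));
    // Mutable pointer file: valid inside the TTL, refetched after.
    assert!(is_cache_valid("_delta_log/_last_checkpoint", Duration::from_secs(30), ttl));
    assert!(!is_cache_valid("_delta_log/_last_checkpoint", Duration::from_secs(120), ttl));
}
```

Serving a slightly stale _last_checkpoint is safe in this design because the reader then catches up via the newer incremental json commits, which is exactly the speedup described above.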
Usage example:
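The original example snippet did not survive the page export. As a hedged reconstruction of the core get path described above - download the whole object on a miss, persist it under file_cache_path, and serve from disk on a hit - here is a self-contained sketch over plain std::fs; the function name and the fetch closure are illustrative stand-ins, not the PR's actual API:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// On a cache miss, run `fetch` (a stand-in for the remote get),
/// write the whole object under file_cache_path, and return it.
/// On a hit, serve the bytes straight from the local file.
fn get_cached(
    file_cache_path: &Path,
    key: &str,
    fetch: impl FnOnce() -> Vec<u8>,
) -> std::io::Result<Vec<u8>> {
    let local: PathBuf = file_cache_path.join(key);
    if local.exists() {
        return fs::read(&local); // cache hit: no network call
    }
    let bytes = fetch(); // cache miss: full download
    if let Some(parent) = local.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::write(&local, &bytes)?;
    Ok(bytes)
}

fn main() -> std::io::Result<()> {
    let cache = std::env::temp_dir().join("file_cache_demo");
    let _ = fs::remove_dir_all(&cache);
    // First get: fetch runs and the file lands on disk.
    let a = get_cached(&cache, "table/_delta_log/0.json", || b"remote".to_vec())?;
    assert_eq!(a, b"remote");
    // Second get: served from disk; the fetch closure is never called.
    let b = get_cached(&cache, "table/_delta_log/0.json", || panic!("no refetch"))?;
    assert_eq!(b, b"remote");
    fs::remove_dir_all(&cache)?;
    Ok(())
}
```

In the actual store, enabling this would just mean passing the file_cache_path storage option when opening the table, per the description above.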