-
Notifications
You must be signed in to change notification settings - Fork 524
feat: add deletion_vector_descriptor method #3721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Ze'ev Maor <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a new public method deletion_vector_descriptor
to the LogicalFileView
struct that provides access to the deletion vector descriptor. This enhances the API by allowing external consumers to retrieve deletion vector descriptors without directly accessing the internal deletion vector implementation.
- Adds a new public method
deletion_vector_descriptor
that returnsOption<DeletionVectorDescriptor>
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
.and_then(|c| Scalar::from_array(c.as_ref(), self.index)) | ||
.map(|s| round_ms_datetimes(s, &ceil_datetime)) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new public method deletion_vector_descriptor
lacks documentation. Consider adding a doc comment explaining what the method returns and when it would be None
.
Copilot uses AI. Check for mistakes.
@zeevm Would you share more about the use-case on bubbling this transaction log information further up to the consumer? |
@rtyler sure, our query engine has several "Delta query" modes, one of the modes builds an internal representation of the Delta transaction log to accelerate queries, this internal representation also includes the deletion vectors of course, hence we need this exposed. |
qq: @zeevm - is this by any chance related to caching deletion vectors read from disk? If so there may be better ways to achieve this - i.e. doing this on the kernel engine side. As it so happens this is something I am working on right now as well. |
@roeap we're caching the entire transaction log, including the DVs in a proprietary internal format. |
@zeevm - yeah, if you are caching the whole log data this is maybe a bit different.
I should have been more specific. this would mean having a custom engine implementation for kernel which all IO goes through. I am currently exploring a datafusion engine and will be adding some caching here soon. I can ping you once this gets wired up to see where one could inject custom caches etc ... |
@roeap again, I don't understand how a custom kernel engine would be used here, delta-rs doesn't exposed the underlying delta-kernel implementation so there's no way for me to "inject" any customer engine into delta-kernel. |
@zeevm - certainly not. And you are right, right now, you can only customize logstore / objectstore. However with the ongoing work around moving the scans to leverage the kernel metadata, we should also allow a custom engine implementation and for the caching efforts also a custom cache which is used in conjunction the default engine used in delta-rs. w.r.t. to this specific PR, while we are hoping to hide more and more delta specific things, right now we already expose the raw deletion vector data, so we should be able to expose the descriptor. We are however hoping that we can eventually just expose all of this as arrow. In a way we do this already since much of this is just a view into the arrow data. |
@roeap yep, it's just nicer to already get a parsed descriptor than the raw view and duplicate the code on my end to get the nicely structured descriptor, so for time being until more low level functionality is exposed can we merge this? |
Signed-off-by: R. Tyler Croy <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some rustdoc, I am comfortable adding this now with the understanding that we can/will yoink it in the future once we're further along
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3721 +/- ##
==========================================
- Coverage 75.38% 75.36% -0.02%
==========================================
Files 144 145 +1
Lines 43661 43949 +288
Branches 43661 43949 +288
==========================================
+ Hits 32912 33123 +211
- Misses 9157 9222 +65
- Partials 1592 1604 +12 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@rtyler anything blocking this from being merged? |
Description
add deletion_vector_descriptor method to get the DV descriptor