[For Review Only] Add API for Remove files to Transaction #1242
base: main
Conversation
Codecov Report ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1242      +/- ##
==========================================
- Coverage   83.65%   83.62%   -0.03%
==========================================
  Files         106      106
  Lines       24966    24976      +10
==========================================
+ Hits        20886    20887       +1
- Misses       3046     3055       +9
  Partials     1034     1034
```

☔ View full report in Codecov by Sentry.
|
I think generally we prefer to implement pieces in relatively complete chunks - ideally we don't want to ever be in a state with panics. Perhaps it would make sense to split this into a few PRs like: (1) add `data_change`, (2) add the ability to start a scan from a transaction (though perhaps we just want to expose a flow like txn -> snapshot -> scan?), and (3) add the `remove_files` API? LMK what you think!
Discussed offline, but I think the plan is to review these APIs, then I'll split each API into one or more self-contained PRs.
A few comments, but the `data_change` one def seems fine to move forward with a PR!
```rust
/// Set whether this transaction represents a logical data change for the table.
/// Operations are not a data change when they just reorganize rows within files
/// but do not add, remove or modify the data in the table otherwise.
pub fn with_is_data_change(mut self, is_data_change: bool) -> Self {
```
seems reasonable - and this feels like it will be a quick PR :)
nit: I wonder if `with_data_change` would be reasonable? Just since with-is feels weird?
`with_data_change` seems fine to me.
BTW, I think this should be sequenced after remove. Right now, without a remove API, the user has no business setting this (isDataChange should always be true), and we should likely enforce that the user provides at least some adds and removes before allowing isDataChange=false.
Realized we can have multiple add actions for back-fills, so I'll tackle `with_data_change` first.
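For illustration, here is a minimal sketch of what the renamed consuming builder method could look like. The `Transaction` struct below is a stand-in (the real kernel type carries far more state); only the `with_data_change` signature shape is the point:

```rust
// Illustrative stand-in for the kernel's Transaction type; the real struct
// carries much more state. This only sketches the builder-style rename.
struct Transaction {
    // Whether this commit represents a logical data change; defaults to true.
    data_change: bool,
}

impl Transaction {
    fn new() -> Self {
        // isDataChange defaults to true; only pure file reorganizations
        // (e.g. compaction) would set it to false.
        Transaction { data_change: true }
    }

    /// Set whether this transaction represents a logical data change.
    /// Consumes and returns `self`, matching the crate's `with_xxx` convention.
    fn with_data_change(mut self, data_change: bool) -> Self {
        self.data_change = data_change;
        self
    }
}

fn main() {
    let txn = Transaction::new().with_data_change(false);
    println!("data_change={}", txn.data_change);
}
```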
```rust
/// Creates a new [`ScanBuilder`] that can read the table in the given transaction.
/// The transaction offers snapshot isolation: this method can be called multiple
/// times, and each Scan produced by the builder will read at a consistent snapshot.
pub fn scan_builder(self: Arc<Self>) -> ScanBuilder {
```
Do we want a direct scan API, or do we want to just expose the underlying snapshot? (Which we kinda already do, since you have to hand us a snapshot to create a Transaction, but I could see this getting more complex as we rebase transactions, etc.)
I think we want scan. My thinking is that what we actually return from the scan may depend on what type of DML we potentially want to perform (i.e. using DVs or not):
- For DVs we always want to retrieve full parsed stats from checkpoints (we need to propagate all the Add stats with the new DV).
- Without DVs we are free to project a smaller number of stats columns.

I think the way to control this would be to build it in as either pre- or post-processing on the ScanBuilder (but maybe there is a better concept someplace between snapshot and ScanBuilder here)?
Mmm, yeah this is a bit tricky (and I now understand your comment on the DV PR, which I left a confused comment on :) ).
I don't love APIs that require that you call things in a particular order, so I'm not a huge fan of "call `with_enable_deletion_vectors` before you call this, otherwise things break". There's also nothing that enforces getting the scan data from here (you could always go around the txn and get a scan), so we'll probably have to validate that the supplied metadata has the right shape for the operation no matter what.
Today we just return the full unparsed stats string anyway, so we should consider whether this is an optimization we even care about.
If we really want this, we could have a new type (`ScannableTxn`?) that you can build from a `Txn`, where the requirement is that you first call all the `with_xxx` on the `Txn`, then turn it into a `ScannableTxn`, and then you can get scans from that.
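The `ScannableTxn` idea above is essentially a typestate pattern. As a hedged sketch (all names here - `Txn`, `ScannableTxn`, `into_scannable`, `needs_full_stats` - are hypothetical, not the kernel's API), the ordering constraint can be enforced by the type system rather than by documentation:

```rust
// Hypothetical typestate sketch: all with_xxx configuration lives on Txn,
// and scans are only obtainable after converting into a frozen ScannableTxn,
// so "configure before you scan" is enforced at compile time.
struct Txn {
    enable_deletion_vectors: bool,
}

struct ScannableTxn {
    enable_deletion_vectors: bool,
}

impl Txn {
    fn new() -> Self {
        Txn { enable_deletion_vectors: false }
    }

    // Builder-style config; only available before freezing.
    fn with_enable_deletion_vectors(mut self, enabled: bool) -> Self {
        self.enable_deletion_vectors = enabled;
        self
    }

    // Consumes the Txn: no further with_xxx calls are possible afterwards.
    fn into_scannable(self) -> ScannableTxn {
        ScannableTxn {
            enable_deletion_vectors: self.enable_deletion_vectors,
        }
    }
}

impl ScannableTxn {
    // Mirrors the discussion above: with DVs enabled the scan must keep
    // full parsed stats; without them a narrower projection is possible.
    fn needs_full_stats(&self) -> bool {
        self.enable_deletion_vectors
    }
}

fn main() {
    let scannable = Txn::new()
        .with_enable_deletion_vectors(true)
        .into_scannable();
    println!("needs_full_stats={}", scannable.needs_full_stats());
}
```

Because `into_scannable` takes `self` by value, any attempt to call a `with_xxx` method after freezing is a compile error rather than a runtime surprise.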
```rust
/// The metadata passed in here is expected to be taken from the output of a scan of the
/// table (i.e. using [`Transaction::scan_builder`]).
/// The expected schema for `remove_metadata` is given by [`crate::scan::scan_row_schema`].
pub fn remove_files(&mut self, remove_metadata: Box<dyn EngineData>) {
```
Oh interesting - we are taking `scan_row_schema` directly?
It does seem the most reasonable way of doing a file-based remove is to just ask for the scan row schema directly, but then that would require handing back more fields than necessary (`stats`, etc.). Perhaps we should split out a specific schema just for removing files that can be easily deduced from the `scan_row_schema`?
This seemed simplest. We could define a subset. There is also maybe a subtler issue here: the remove file action has a lot of optional fields, most of which are populated from AddFile actions.
The main things that would be discarded if everything gets passed back in are:
- modificationTime (long)
- dataChange (boolean)
- clusteringProvider (string)

For the rest, it seems desirable to have most of the additional fields (e.g. extendedFileMetadata can be made true if partition values, size and tags are present). `stats` is also on the remove file action.
So I think it comes down to a few questions:
- How opinionated kernel should be about these optional fields.
- Contract-change concerns if we want a smaller schema in the future.
- Implementation churn.

My thoughts on these:
- Not sure about this one; does Kernel have a philosophy here?
- It seems we can always add a trimmed-down schema later if needed. A connector passing in more fields than necessary shouldn't really break any contracts.
- For V4 metadata we will need to track additional information on AddFile (the source the action was derived from). It seems like it would create slightly less code churn to only have to add this in one place.

Based on these I would lean towards not doing the extra schema work here, but I'm happy to accept whatever Kernel maintainers decide.
Aside: I went back and forth on `EngineData` vs `FilteredEngineData` here. Thoughts?
Yeah, I think let's keep it as simple as possible for now and use the same schema.
Re `EngineData` or `FilteredEngineData`: I think it might be nice to use `FilteredEngineData`, because then engines can just set bools in order to remove files, rather than having to construct a new set of rows, but it is a bit more fiddly. Engines can always pass an empty array of bools to have `FilteredEngineData` be equivalent to `EngineData`.
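To make the trade-off concrete, here is a toy sketch of the selection-vector idea (the struct and methods below are illustrative stand-ins, not the kernel's actual `FilteredEngineData` type): rows plus a vector of bools, where an empty vector means "all rows selected" and so degenerates to plain `EngineData` behavior.

```rust
// Toy stand-in for the FilteredEngineData idea: engine row data plus a
// selection vector. Engines flip bools to deselect rows instead of
// rebuilding the row set.
struct FilteredEngineData {
    rows: Vec<String>,    // stand-in for real engine row data
    selection: Vec<bool>, // true = row participates in the remove
}

impl FilteredEngineData {
    // Rows a remove_files-style call would actually consume.
    fn selected_rows(&self) -> Vec<&String> {
        if self.selection.is_empty() {
            // Empty selection vector: behaves like unfiltered EngineData.
            self.rows.iter().collect()
        } else {
            self.rows
                .iter()
                .zip(self.selection.iter())
                .filter(|(_, keep)| **keep)
                .map(|(row, _)| row)
                .collect()
        }
    }
}

fn main() {
    let filtered = FilteredEngineData {
        rows: vec!["file_a".to_string(), "file_b".to_string()],
        selection: vec![true, false],
    };
    println!("selected: {:?}", filtered.selected_rows());
}
```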
Add API to support removing files (implementation will follow once we have agreement). This is adapted/extended from prior work by @zachschuermann (#1020). This is the first step towards completing #1146.
These APIs are added to support the following use-cases: