
Conversation

@emkornfield (Collaborator) commented Sep 2, 2025

🥞 Stacked PR



Add API to support removing files (implementation will follow once we have agreement). This is adapted and extended from prior work by @zachschuermann (#1020), and is the first step toward completing #1146.

These APIs are added to support the following use cases:

  1. Remove a file after identifying that it isn't needed as part of a scan (this requires scanning, so we add the ability to construct a ScanBuilder directly on a transaction).
  2. Add an API to record whether the transaction is a data change or not. Since the ability to remove data was added, engines can now re-organize data, so it feels like we should have this. Based on conversations, it seems that having this at the transaction level (vs. the file level allowed in the protocol) is preferred.
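To make the two use cases concrete, here is a hypothetical end-to-end sketch of the proposed flow. All type and method names below are illustrative stand-ins modeled on the description above, not the actual delta-kernel-rs API:

```rust
// Hypothetical sketch only; not the actual delta-kernel-rs types.
struct Transaction {
    data_change: bool,          // use case 2: is this commit a logical data change?
    removed_files: Vec<String>, // use case 1: files slated for removal
}

impl Transaction {
    fn new() -> Self {
        // Defaults to true: plain appends are always a data change.
        Transaction { data_change: true, removed_files: Vec::new() }
    }

    // Use case 2: record whether the transaction is a logical data change.
    fn with_data_change(mut self, data_change: bool) -> Self {
        self.data_change = data_change;
        self
    }

    // Use case 1: remove files identified by a (stand-in) scan of the table.
    fn remove_files(&mut self, files: Vec<String>) {
        self.removed_files.extend(files);
    }
}

fn main() {
    // A compaction-style transaction: removes files but is not a data change.
    let mut txn = Transaction::new().with_data_change(false);
    txn.remove_files(vec!["part-00001.parquet".to_string()]);
    assert!(!txn.data_change);
    assert_eq!(txn.removed_files.len(), 1);
}
```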


codecov bot commented Sep 2, 2025

Codecov Report

❌ Patch coverage is 10.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.62%. Comparing base (447767e) to head (bf76f46).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/transaction/mod.rs 10.00% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1242      +/-   ##
==========================================
- Coverage   83.65%   83.62%   -0.03%     
==========================================
  Files         106      106              
  Lines       24966    24976      +10     
==========================================
+ Hits        20886    20887       +1     
- Misses       3046     3055       +9     
  Partials     1034     1034              

☔ View full report in Codecov by Sentry.

@zachschuermann zachschuermann changed the title Add API for Remove files to Transaction feat: Add API for Remove files to Transaction Sep 3, 2025
@zachschuermann (Member) left a comment:

I think generally we prefer to implement pieces in relatively complete chunks - ideally we don't want to really ever be in a state with panics. Perhaps it would make sense to split this into a few PRs like: (1) add data_change (2) add ability to start a scan from transaction (though perhaps we just want to expose a flow like txn -> snapshot -> scan?) and (3) add remove_files API? LMK what you think!

@emkornfield emkornfield changed the title feat: Add API for Remove files to Transaction [For Review Only] Add API for Remove files to Transaction Sep 3, 2025
@emkornfield (Collaborator, Author) replied:

> I think generally we prefer to implement pieces in relatively complete chunks - ideally we don't want to really ever be in a state with panics. Perhaps it would make sense to split this into a few PRs like: (1) add data_change (2) add ability to start a scan from transaction (though perhaps we just want to expose a flow like txn -> snapshot -> scan?) and (3) add remove_files API? LMK what you think!

Discussed offline, but I think the plan is to review these APIs, then I'll split up each API into one or more self-contained PRs.

@zachschuermann zachschuermann added the merge hold Don't allow the PR to merge label Sep 4, 2025
@zachschuermann (Member) left a comment:

A few comments, but the data_change one definitely seems fine to move forward with as a PR!

/// Sets whether this transaction represents a logical data change for the table.
/// Operations are not a data change when they merely reorganize rows within
/// files but do not otherwise add, remove, or modify the table's data.
pub fn with_is_data_change(mut self, is_data_change: bool) -> Self {
Member:

seems reasonable - and this feels like it will be a quick PR :)

nit: I wonder if with_data_change would be reasonable, just since with_is feels weird?

Collaborator (Author):

with_data_change seems fine to me.

Collaborator (Author):

BTW, I think this should be sequenced after remove. Right now, without a remove API, the user has no business setting this (isDataChange should always be true), and we should likely enforce that the user provides at least some adds and removes before allowing isDataChange=false.

Collaborator (Author):

Realized we can have multiple add actions for back-fills, so I'll tackle with_data_change first.

/// Creates a new [`ScanBuilder`] that can read the table in the given transaction.
/// The transaction offers snapshot isolation: this method can be called multiple
/// times, and each Scan produced by the builder will read at a consistent snapshot.
pub fn scan_builder(self: Arc<Self>) -> ScanBuilder {
Member:

do we want a direct scan API or do we want to just expose the underlying snapshot? (which we kinda already do since you have to hand us a snapshot to create Transaction but could see this getting more complex as we rebase transactions, etc.)

Collaborator (Author):

I think we want scan. My thinking is that what we actually return from the scan may depend on what type of DML we want to perform (i.e. using DVs or not):

  1. For DVs we always want to retrieve full parsed stats from checkpoints (we need to propagate all the Add stats with the new DV)
  2. Without DVs we are free to project a smaller number of stats columns

I think the way to control this would be to build it in as either pre- or post-processing on the ScanBuilder (but maybe there is a better concept somewhere between snapshot and ScanBuilder here)?

Collaborator:

Mmm, yeah, this is a bit tricky (and I now understand your comment on the DV PR, which I left a confused comment on :) ).

I don't love APIs that require that you call things in a particular order, so I'm not a huge fan of "call with_enable_deletion_vectors before you call this otherwise things break". There's also nothing that enforces getting the scan data from here, you could always go around the txn and get a scan, so we'll probably have to validate that the supplied metadata has the right shape for the operation no matter what.

Today we just return the full unparsed stats string anyway, so we should consider if this is an optimization we even care about.

If we really want this, we could have a new type (ScannableTxn?) that you can build from a Txn and the requirement is that you first call all the with_xxx on the Txn and then turn it into a ScannableTxn and then you can get scans from that.
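The ScannableTxn idea can be enforced at compile time with a type-state pattern: the conversion consumes the Txn, so no with_xxx call can follow it. A minimal, self-contained sketch under that assumption; every name here is hypothetical, not the actual kernel API:

```rust
// Hypothetical type-state sketch: configure Txn first, then freeze it into a
// ScannableTxn that can produce scans. Not the actual delta-kernel-rs API.
struct Txn {
    enable_deletion_vectors: bool,
}

struct ScannableTxn {
    enable_deletion_vectors: bool,
}

impl Txn {
    fn new() -> Self {
        Txn { enable_deletion_vectors: false }
    }

    fn with_enable_deletion_vectors(mut self, enable: bool) -> Self {
        self.enable_deletion_vectors = enable;
        self
    }

    // Consuming `self` makes any later with_xxx call a compile error,
    // enforcing the "configure first, then scan" ordering by construction.
    fn into_scannable(self) -> ScannableTxn {
        ScannableTxn { enable_deletion_vectors: self.enable_deletion_vectors }
    }
}

impl ScannableTxn {
    // Stand-in for the DV-dependent stats projection discussed above:
    // with DVs we keep full parsed stats, without them we can project fewer.
    fn stats_columns(&self) -> Vec<&'static str> {
        if self.enable_deletion_vectors {
            vec!["numRecords", "minValues", "maxValues", "nullCount"]
        } else {
            vec!["numRecords"]
        }
    }
}

fn main() {
    let dv = Txn::new().with_enable_deletion_vectors(true).into_scannable();
    assert_eq!(dv.stats_columns().len(), 4);
    let plain = Txn::new().into_scannable();
    assert_eq!(plain.stats_columns(), vec!["numRecords"]);
}
```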

/// The metadata passed in here is expected to be taken from the output of a scan of the
/// table (i.e. using [`Transaction::scan_builder`]).
/// The expected schema for `remove_metadata` is given by [`crate::scan::scan_row_schema`].
pub fn remove_files(&mut self, remove_metadata: Box<dyn EngineData>) {
Member:

oh interesting - we are taking scan_row_schema directly?

it does seem the most reasonable way of doing a file-based remove is to just ask for the scan row schema directly but then that would require handing back more fields than necessary (stats, etc.) - perhaps we should split out just a specific schema for removing files that can be easily deduced from the scan_row_schema?

Collaborator (Author):

This seemed simplest. We could define a subset. There is also maybe a subtler issue here. Remove file has a lot of optional fields, most of which are populated from AddFile actions.

The main fields that would be discarded, if everything from the scan gets passed back in, are:

  1. modificationTime (long)
  2. dataChange (boolean)
  3. clusteringProvider (string)

For the rest, it seems desirable to have most of the additional fields (e.g. extendedFileMetadata can be set to true if partition values, size, and tags are present). stats is also on the remove file action.

So I think it comes down to a few questions:

  1. How opinionated kernel should be for these optional fields
  2. Contract change concerns if we want a smaller schema in the future.
  3. Implementation churn.

My thoughts on these:

  1. Not sure about this one; does Kernel have a philosophy here?
  2. It seems we can always add a trimmed-down schema later if needed. A connector passing in more fields than necessary shouldn't really break any contracts.
  3. For V4 metadata we will need to track additional information on AddFile (the source the action was derived from). It seems like it would create slightly less code churn to only have to add this in one place.

Based on these I would lean towards not doing the extra schema work here, but I'm happy to accept whatever the Kernel maintainers decide.

Collaborator (Author):

As an aside, I went back and forth on EngineData vs. FilteredEngineData here. Thoughts?

Collaborator:

Yeah, I think let's keep it as simple as possible for now and use the same schema.

Re EngineData or FilteredEngineData: I think it might be nice to use FilteredEngineData, because then engines can just set bools in order to remove files, rather than having to construct a new set of rows, but it is a bit more fiddly. Engines can always pass an empty array of bools to make FilteredEngineData equivalent to EngineData.
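A minimal sketch of the selection-vector behavior described above, assuming the convention that rows past the end of the bool array count as selected (so an empty vector selects everything). The struct and method names are illustrative, not the actual kernel types:

```rust
// Illustrative stand-ins for the real EngineData / FilteredEngineData types.
struct EngineData {
    file_paths: Vec<String>,
}

struct FilteredEngineData {
    data: EngineData,
    // One bool per row; rows past the end of the vector count as selected,
    // so an empty vector selects every row (equivalent to plain EngineData).
    selection: Vec<bool>,
}

impl FilteredEngineData {
    // Returns the file paths the engine kept selected (i.e. to remove).
    fn selected_paths(&self) -> Vec<&str> {
        self.data
            .file_paths
            .iter()
            .enumerate()
            .filter(|(i, _)| self.selection.get(*i).copied().unwrap_or(true))
            .map(|(_, p)| p.as_str())
            .collect()
    }
}

fn main() {
    // Engine flips one bool to drop "b" instead of rebuilding the row set.
    let batch = FilteredEngineData {
        data: EngineData {
            file_paths: vec!["a".into(), "b".into(), "c".into()],
        },
        selection: vec![true, false],
    };
    assert_eq!(batch.selected_paths(), vec!["a", "c"]);

    // Empty selection vector: behaves exactly like plain EngineData.
    let all = FilteredEngineData {
        data: EngineData { file_paths: vec!["a".into()] },
        selection: vec![],
    };
    assert_eq!(all.selected_paths(), vec!["a"]);
}
```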

