
Conversation

@emkornfield (Collaborator) commented Sep 2, 2025

🥞 Stacked PR



Add API to support removing files (implementation will follow once we have agreement). This is adapted and extended from prior work by @zachschuermann (#1020), and is the first step toward completing #1146.

These APIs are added to support the following use cases:

  1. Remove a file after identifying that it isn't needed as part of a scan (this requires scanning, so we add the ability to construct a ScanBuilder directly on a transaction).
  2. Add an API to record whether the transaction is a data change or not. Since the ability to remove data was added, engines can now re-organize data, so it feels like we should have this. Based on conversations, it seems that having this at the transaction level (vs. the file level allowed in the protocol) is preferred.
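To make the two use cases concrete, here is a hypothetical end-to-end sketch of the proposed flow. All type and method names below are illustrative stand-ins modeled on the description above, not the actual delta-kernel-rs API:

```rust
// Hypothetical sketch only; not the actual delta-kernel-rs types.
struct Transaction {
    data_change: bool,          // use case 2: is this commit a logical data change?
    removed_files: Vec<String>, // use case 1: files slated for removal
}

impl Transaction {
    fn new() -> Self {
        // Defaults to true: plain appends are always a data change.
        Transaction { data_change: true, removed_files: Vec::new() }
    }

    // Use case 2: record whether the transaction is a logical data change.
    fn with_data_change(mut self, data_change: bool) -> Self {
        self.data_change = data_change;
        self
    }

    // Use case 1: remove files identified by a (stand-in) scan of the table.
    fn remove_files(&mut self, files: Vec<String>) {
        self.removed_files.extend(files);
    }
}

fn main() {
    // A compaction-style transaction: removes files but is not a data change.
    let mut txn = Transaction::new().with_data_change(false);
    txn.remove_files(vec!["part-00001.parquet".to_string()]);
    assert!(!txn.data_change);
    assert_eq!(txn.removed_files.len(), 1);
}
```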


codecov bot commented Sep 2, 2025

Codecov Report

❌ Patch coverage is 10.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.62%. Comparing base (447767e) to head (bf76f46).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/transaction/mod.rs 10.00% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1242      +/-   ##
==========================================
- Coverage   83.65%   83.62%   -0.03%     
==========================================
  Files         106      106              
  Lines       24966    24976      +10     
==========================================
+ Hits        20886    20887       +1     
- Misses       3046     3055       +9     
  Partials     1034     1034              

☔ View full report in Codecov by Sentry.

@zachschuermann zachschuermann changed the title Add API for Remove files to Transaction feat: Add API for Remove files to Transaction Sep 3, 2025
@zachschuermann (Member) left a comment:

I think generally we prefer to implement pieces in relatively complete chunks - ideally we don't want to really ever be in a state with panics. Perhaps it would make sense to split this into a few PRs like: (1) add data_change (2) add ability to start a scan from transaction (though perhaps we just want to expose a flow like txn -> snapshot -> scan?) and (3) add remove_files API? LMK what you think!

@emkornfield emkornfield changed the title feat: Add API for Remove files to Transaction [For Review Only] Add API for Remove files to Transaction Sep 3, 2025
@emkornfield (Collaborator, Author) replied:

> I think generally we prefer to implement pieces in relatively complete chunks - ideally we don't want to really ever be in a state with panics. Perhaps it would make sense to split this into a few PRs like: (1) add data_change (2) add ability to start a scan from transaction (though perhaps we just want to expose a flow like txn -> snapshot -> scan?) and (3) add remove_files API? LMK what you think!

Discussed offline, but I think the plan is to review these APIs, then I'll split up each API into one or more self-contained PRs.

@zachschuermann zachschuermann added the merge hold Don't allow the PR to merge label Sep 4, 2025
@zachschuermann (Member) left a comment:

A few comments, but the data_change one definitely seems fine to move forward with as a PR!

/// Sets whether this transaction represents a logical data change for the table.
/// Operations are not a data change when they merely reorganize rows within
/// files but do not otherwise add, remove, or modify the table's data.
pub fn with_is_data_change(mut self, is_data_change: bool) -> Self {
Member:

seems reasonable - and this feels like it will be a quick PR :)

nit: I wonder if with_data_change would be reasonable, just since with_is feels weird?

Collaborator (Author):

with_data_change seems fine to me.

Collaborator (Author):

BTW, I think this should be sequenced after remove. Right now, without a remove API, the user has no business setting this (isDataChange should always be true), and we should likely enforce that the user provides at least some adds and removes before allowing isDataChange=false.

Collaborator (Author):

Realized we can have multiple add actions for back-fills, so I'll tackle with_data_change first.

/// Creates a new [`ScanBuilder`] that can read the table in the given transaction.
/// The transaction offers snapshot isolation: this method can be called multiple
/// times, and each Scan produced by the builder will read at a consistent snapshot.
pub fn scan_builder(self: Arc<Self>) -> ScanBuilder {
Member:

do we want a direct scan API or do we want to just expose the underlying snapshot? (which we kinda already do since you have to hand us a snapshot to create Transaction but could see this getting more complex as we rebase transactions, etc.)

Collaborator (Author):

I think we want scan. My thinking is that what we actually return from the scan may depend on what type of DML we want to perform (i.e. using DVs or not):

  1. For DVs we always want to retrieve full parsed stats from checkpoints (we need to propagate all the Add stats with the new DV)
  2. Without DVs we are free to project a smaller number of stats columns

I think the way to control this would be to build it in as either pre- or post-processing on the ScanBuilder (but maybe there is a better concept somewhere between snapshot and ScanBuilder here)?

Collaborator:

Mmm, yeah, this is a bit tricky (and I now understand your comment on the DV PR, which I left a confused comment on :) ).

I don't love APIs that require that you call things in a particular order, so I'm not a huge fan of "call with_enable_deletion_vectors before you call this otherwise things break". There's also nothing that enforces getting the scan data from here, you could always go around the txn and get a scan, so we'll probably have to validate that the supplied metadata has the right shape for the operation no matter what.

Today we just return the full unparsed stats string anyway, so we should consider if this is an optimization we even care about.

If we really want this, we could have a new type (ScannableTxn?) that you can build from a Txn and the requirement is that you first call all the with_xxx on the Txn and then turn it into a ScannableTxn and then you can get scans from that.
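The ScannableTxn idea can be enforced at compile time with a type-state pattern: the conversion consumes the Txn, so no with_xxx call can follow it. A minimal, self-contained sketch under that assumption; every name here is hypothetical, not the actual kernel API:

```rust
// Hypothetical type-state sketch: configure Txn first, then freeze it into a
// ScannableTxn that can produce scans. Not the actual delta-kernel-rs API.
struct Txn {
    enable_deletion_vectors: bool,
}

struct ScannableTxn {
    enable_deletion_vectors: bool,
}

impl Txn {
    fn new() -> Self {
        Txn { enable_deletion_vectors: false }
    }

    fn with_enable_deletion_vectors(mut self, enable: bool) -> Self {
        self.enable_deletion_vectors = enable;
        self
    }

    // Consuming `self` makes any later with_xxx call a compile error,
    // enforcing the "configure first, then scan" ordering by construction.
    fn into_scannable(self) -> ScannableTxn {
        ScannableTxn { enable_deletion_vectors: self.enable_deletion_vectors }
    }
}

impl ScannableTxn {
    // Stand-in for the DV-dependent stats projection discussed above:
    // with DVs we keep full parsed stats, without them we can project fewer.
    fn stats_columns(&self) -> Vec<&'static str> {
        if self.enable_deletion_vectors {
            vec!["numRecords", "minValues", "maxValues", "nullCount"]
        } else {
            vec!["numRecords"]
        }
    }
}

fn main() {
    let dv = Txn::new().with_enable_deletion_vectors(true).into_scannable();
    assert_eq!(dv.stats_columns().len(), 4);
    let plain = Txn::new().into_scannable();
    assert_eq!(plain.stats_columns(), vec!["numRecords"]);
}
```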

/// The metadata passed in here is expected to be taken from the output of a scan of the
/// table (i.e. using [`Transaction::scan_builder`]).
/// The expected schema for `remove_metadata` is given by [`crate::scan::scan_row_schema`].
pub fn remove_files(&mut self, remove_metadata: Box<dyn EngineData>) {
Member:

oh interesting - we are taking scan_row_schema directly?

it does seem the most reasonable way of doing a file-based remove is to just ask for the scan row schema directly but then that would require handing back more fields than necessary (stats, etc.) - perhaps we should split out just a specific schema for removing files that can be easily deduced from the scan_row_schema?

Collaborator (Author):

This seemed simplest. We could define a subset. There is also maybe a subtler issue here. Remove file has a lot of optional fields, most of which are populated from AddFile actions.

The main fields that would be discarded, if everything from the scan gets passed back in, are:

  1. modificationTime (long)
  2. dataChange (boolean)
  3. clusteringProvider (string)

For the rest, it seems desirable to have most of the additional fields (e.g. extendedFileMetadata can be set to true if partition values, size, and tags are present). stats is also on the remove file action.

So I think it comes down to a few questions:

  1. How opinionated kernel should be for these optional fields
  2. Contract change concerns if we want a smaller schema in the future.
  3. Implementation churn.

My thoughts on these:

  1. Not sure about this one; does Kernel have a philosophy here?
  2. It seems we can always add a trimmed-down schema later if needed. A connector passing in more fields than necessary shouldn't really break any contracts.
  3. For V4 metadata we will need to track additional information on AddFile (the source the action was derived from). It seems like it would create slightly less code churn to only have to add this in one place.

Based on these I would lean towards not doing the extra schema work here, but I'm happy to accept whatever the Kernel maintainers decide.

Collaborator (Author):

As an aside, I went back and forth on EngineData vs. FilteredEngineData here. Thoughts?

Collaborator:

Yeah, I think let's keep it as simple as possible for now and use the same schema.

Re EngineData or FilteredEngineData: I think it might be nice to use FilteredEngineData, because then engines can just set bools in order to remove files, rather than having to construct a new set of rows, but it is a bit more fiddly. Engines can always pass an empty array of bools to make FilteredEngineData equivalent to EngineData.
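A minimal sketch of the selection-vector behavior described above, assuming the convention that rows past the end of the bool array count as selected (so an empty vector selects everything). The struct and method names are illustrative, not the actual kernel types:

```rust
// Illustrative stand-ins for the real EngineData / FilteredEngineData types.
struct EngineData {
    file_paths: Vec<String>,
}

struct FilteredEngineData {
    data: EngineData,
    // One bool per row; rows past the end of the vector count as selected,
    // so an empty vector selects every row (equivalent to plain EngineData).
    selection: Vec<bool>,
}

impl FilteredEngineData {
    // Returns the file paths the engine kept selected (i.e. to remove).
    fn selected_paths(&self) -> Vec<&str> {
        self.data
            .file_paths
            .iter()
            .enumerate()
            .filter(|(i, _)| self.selection.get(*i).copied().unwrap_or(true))
            .map(|(_, p)| p.as_str())
            .collect()
    }
}

fn main() {
    // Engine flips one bool to drop "b" instead of rebuilding the row set.
    let batch = FilteredEngineData {
        data: EngineData {
            file_paths: vec!["a".into(), "b".into(), "c".into()],
        },
        selection: vec![true, false],
    };
    assert_eq!(batch.selected_paths(), vec!["a", "c"]);

    // Empty selection vector: behaves exactly like plain EngineData.
    let all = FilteredEngineData {
        data: EngineData { file_paths: vec!["a".into()] },
        selection: vec![],
    };
    assert_eq!(all.selected_paths(), vec!["a"]);
}
```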

