Skip to content

Conversation

@abhi-airspace-intelligence
Copy link
Contributor

Description

I was trying to investigate the issue in 3855, and I believe I've figured it out. The problem is that when we perform a compaction, we create a bin of files where their total sum is > the target file size, and then, when we go to write the file, we also apply the exact same target file size. This means if, say, the size of all the files in your bin is 105MB, you write a 100MB file and a 5MB file, which I would argue is undesired behavior. Additionally, due to compression/rounding, I've observed that the 100MB file might be written as 97MB and 4MB respectively, which then leads to infinite compactions on the same set of file that generates wasteful IO.

Instead, this changes compactions to allow a target_file_size to be unbounded when actually performing the rewrite operation. To do this, I had to replumb a lot of the code since everything makes the assumption of target-size writes. Additionally, I found there are three separate places where a default target size is set, so I consolidated them to just one place.

Related Issue(s)

Closes #3855

Documentation

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Oct 27, 2025
@github-actions
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@abhi-airspace-intelligence abhi-airspace-intelligence changed the title Consolidate target_file_size and allow unbounded writes feat: consolidate target_file_size and allow unbounded writes Oct 27, 2025
@codecov
Copy link

codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 73.23944% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.73%. Comparing base (8f22045) to head (d4ec741).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
python/src/lib.rs 0.00% 16 Missing ⚠️
crates/core/src/operations/write/writer.rs 93.33% 0 Missing and 2 partials ⚠️
crates/core/src/operations/write/mod.rs 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3901      +/-   ##
==========================================
- Coverage   73.76%   73.73%   -0.04%     
==========================================
  Files         151      151              
  Lines       39396    39391       -5     
  Branches    39396    39391       -5     
==========================================
- Hits        29061    29045      -16     
- Misses       9023     9036      +13     
+ Partials     1312     1310       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

/// Size above which we will write a buffered parquet file to disk.
target_file_size: Option<usize>,
/// If None, the writer will not create a new file until the writer is closed.
target_file_size: Option<Option<NonZeroUsize>>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just flatten it instead of nesting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was done this way to support the following usecases:

None: bounded writes, fallback to the delta table configuration
Some(None): unbounded writes
Some(val): bounded writes, overriding delta table configuration

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but maybe we can use an enum for this, it's not quite apparent when you look at it :)

let target_file_size = this.target_file_size.or_else(|| {
Some(super::get_target_file_size(config, &this.configuration) as usize)
let target_file_size = this.target_file_size.unwrap_or_else(|| {
this.snapshot.as_ref().and_then(|snapshot| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will cause a first write to be unbounded, no? Not sure if thats something we want

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it should fallback to the default size.

partition_values.clone(),
Some(task_parameters.writer_properties.clone()),
Some(task_parameters.input_parameters.target_size as usize),
// Since we know the total size of the bin, we can set the target file size to None.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@abhi-airspace-intelligence
Copy link
Contributor Author

This moves a bunch of a bunch of parameters to be Option<NonZeroUsize>, which enables niche optimization, but is also a breaking API change. https://doc.rust-lang.org/std/num/type.NonZeroUsize.html

@abhi-airspace-intelligence
Copy link
Contributor Author

Also, now that I think about this, using usize for a file size is probably an antipattern... perhaps, it should be switched to an u64? object_store docs reference this here. https://docs.rs/object_store/latest/object_store/struct.ObjectMeta.html#structfield.size.

@rtyler thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

binding/python Issues for the Python package binding/rust Issues for the Rust crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Compacting a delta table consistently undershoots the target_file_size, creating an unnecessary extra file

2 participants