feat: provide delta writer option to flush buffer after every batch #3675
Conversation
Signed-off-by: Sam Meyer-Reed <[email protected]>
Codecov Report

@@ Coverage Diff @@
##             main    #3675      +/-   ##
==========================================
+ Coverage   75.58%   75.64%   +0.06%
==========================================
  Files         146      146
  Lines       45172    45342     +170
  Branches    45172    45342     +170
==========================================
+ Hits        34141    34299     +158
- Misses       9210     9215       +5
- Partials     1821     1828       +7
```rust
// does not extract write_batch_size from WriterProperties
if let Some(write_batch_size) = writer_props.write_batch_size {
    builder = builder.with_write_batch_size(write_batch_size);
}
```
This feels weird to me, but I wasn't sure how much I wanted to change the write_deltalake function signature here, since we provide this write_batch_size setting in the WriterProperties rather than as a standalone parameter like target_file_size.
Oops, I also realized I had pyarrow in my local uv env, so that failing test passed for me locally but not here.
This has the side effect of creating a lot of small files, right?
I was thinking: couldn't we tie the flushing more closely to file writing, so the arrow writer becomes aware of the amount of flushed data?
Typing this on a phone, so I hope it's clear:
write_batch --> bytes --> flush to disk --> check if the flushed data is greater than file_size_limit; if so, close the multipart upload and start a new one.
The only catch is that a part can only be smaller than the minimum part size when you close the multipart write, so you have to make sure the writer only flushes full parts, and it somehow needs to know to check whether the last part was the final one.
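The flow described above could be sketched roughly as follows. This is a hypothetical simulation with plain byte buffers; `PART_SIZE`, `FILE_SIZE_LIMIT`, and the `MultipartSim` class are all illustrative stand-ins, not the real object-store multipart API:

```python
# Hypothetical simulation of the proposed flow: encode a batch to bytes,
# flush full parts to the current multipart upload, and roll over to a
# new file once the flushed size exceeds the file size limit.

PART_SIZE = 4          # minimum multipart part size (tiny for illustration)
FILE_SIZE_LIMIT = 10   # target file size before starting a new file

class MultipartSim:
    def __init__(self):
        self.files = []        # completed "files" (each a list of parts)
        self.parts = []        # parts flushed for the current file
        self.buffer = b""      # bytes not yet forming a full part
        self.flushed = 0       # bytes flushed for the current file

    def write_batch(self, data: bytes, final: bool = False):
        self.buffer += data
        # Flush only full parts; a short part is allowed only when
        # closing the multipart upload.
        while len(self.buffer) >= PART_SIZE:
            self.parts.append(self.buffer[:PART_SIZE])
            self.flushed += PART_SIZE
            self.buffer = self.buffer[PART_SIZE:]
            if self.flushed >= FILE_SIZE_LIMIT:
                self._close_file()
        if final:
            self._close_file()

    def _close_file(self):
        if self.buffer:                 # short final part permitted on close
            self.parts.append(self.buffer)
            self.buffer = b""
        if self.parts:
            self.files.append(self.parts)
        self.parts, self.flushed = [], 0

sink = MultipartSim()
for batch in [b"aaaa", b"bbbbbb", b"cc", b"ddd"]:
    sink.write_batch(batch)
sink.write_batch(b"", final=True)
# sink.files -> two "files": the first rolled over after exceeding the
# size limit, the second closed with a short final part.
```

The key invariant is that only `_close_file` may emit a part shorter than `PART_SIZE`, which mirrors the constraint that only the final part of a multipart upload may be undersized.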
Yes, this does currently write many more small files. Ahh, ok, I see; yeah, that makes a lot of sense. Ok, I'll work on that, thanks!
@ion-elgreco Ok, follow-up question about this. Currently (see delta-rs/crates/core/src/operations/write/writer.rs, lines 433 to 444 at 2920177) the file metadata comes from the call to writer.close(), but now I think we have to call writer.close() multiple times per file to get the buffered data. The way I'm handling this at the moment is to use the metadata from the first flush of a file and accumulate row counts, but that loses the column statistics for the rest of the batches (we only keep the stats from the first batch), which doesn't seem great.
Another option I can think of is merging the stats from each flush and rebuilding the whole metadata at the end, but that seems pretty heavy for this. I'm also seeing this, which seems potentially helpful: https://github.com/delta-io/delta-rs/blob/2920177ac5215e192e0182bed93c42c0b4a98b6f/crates/core/src/writer/stats.rs#L436C1-L489C2 Just wanted to run this past you and see if you had an opinion on the best course of action, thanks!
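Merging per-flush statistics would amount to taking elementwise min/max and summing counts. A minimal sketch, assuming hypothetical dict-shaped stats (the real implementation would derive these from Parquet file metadata, as in the stats.rs code linked above):

```python
# Hypothetical per-flush column stats shaped like Delta Add-action stats:
# num_records plus per-column min/max/null counts.
def merge_stats(a, b):
    merged = {"num_records": a["num_records"] + b["num_records"],
              "min_values": {}, "max_values": {}, "null_count": {}}
    for col in a["min_values"]:
        # min/max combine elementwise; null counts add up.
        merged["min_values"][col] = min(a["min_values"][col], b["min_values"][col])
        merged["max_values"][col] = max(a["max_values"][col], b["max_values"][col])
        merged["null_count"][col] = a["null_count"][col] + b["null_count"][col]
    return merged

flush1 = {"num_records": 100, "min_values": {"x": 1}, "max_values": {"x": 50},
          "null_count": {"x": 2}}
flush2 = {"num_records": 80, "min_values": {"x": 0}, "max_values": {"x": 90},
          "null_count": {"x": 1}}
total = merge_stats(flush1, flush2)
# total: 180 records, x in [0, 90], 3 nulls
```

Since min, max, and sum are all associative, the merge can be folded across any number of flushes in order, which is what makes "rebuild the metadata at the end" feasible even if it feels heavy.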
I think we should avoid calling AsyncArrowWriter.close() before reaching the max file size. reset_writer currently returns the old buffer and writer and creates a new buffer and writer, but that should perhaps change 🤔
Description
This provides the delta writer with the option to flush the in-memory buffer to disk after each record batch, as opposed to waiting for the targeted file size, by passing the flush_per_batch parameter. This prevents accumulating lots of memory in some cases.
Note: I believe there is still some accumulated memory usage through things like transaction metadata.
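The memory trade-off can be illustrated with a toy buffer model. flush_per_batch is the parameter this PR proposes; everything else below is a hypothetical simulation, not the deltalake API:

```python
def peak_buffer_bytes(batch_sizes, flush_per_batch, target_file_size):
    """Simulate peak in-memory buffer usage for a sequence of batches."""
    peak = buffered = 0
    for size in batch_sizes:
        buffered += size
        peak = max(peak, buffered)
        # Flush after every batch, or only once the target file size is hit.
        if flush_per_batch or buffered >= target_file_size:
            buffered = 0
    return peak

batches = [10, 10, 10, 10]
# Without per-batch flushing the buffer grows toward the target file size;
# with it, peak memory stays bounded by a single batch.
print(peak_buffer_bytes(batches, False, 100))  # grows across batches
print(peak_buffer_bytes(batches, True, 100))   # bounded by one batch
```

This is why the option trades memory for file count: bounding the buffer to one batch necessarily produces smaller (and more) flushed chunks, which motivates the part-aware multipart approach discussed in the conversation.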
Related Issue(s)
Documentation