feat: consolidate target_file_size and allow unbounded writes (#3901)
              Conversation
Signed-off-by: Abhi Agarwal <[email protected]>
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
Codecov Report

```
@@            Coverage Diff             @@
##             main    #3901      +/-   ##
==========================================
- Coverage   73.76%   73.73%   -0.04%
==========================================
  Files         151      151
  Lines       39396    39391       -5
  Branches    39396    39391       -5
==========================================
- Hits        29061    29045      -16
- Misses       9023     9036      +13
+ Partials     1312     1310       -2
```
```diff
-/// Size above which we will write a buffered parquet file to disk.
-target_file_size: Option<usize>,
+/// If None, the writer will not create a new file until the writer is closed.
+target_file_size: Option<Option<NonZeroUsize>>,
```
Maybe just flatten it instead of nesting
It was done this way to support the following use cases:
- `None`: bounded writes, falling back to the delta table configuration
- `Some(None)`: unbounded writes
- `Some(val)`: bounded writes, overriding the delta table configuration
I see, but maybe we can use an enum for this, it's not quite apparent when you look at it :)
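A minimal sketch of the reviewer's suggestion, replacing the nested `Option<Option<NonZeroUsize>>` with a self-describing enum. The enum and method names here are hypothetical, not delta-rs API:

```rust
use std::num::NonZeroUsize;

/// Hypothetical replacement for `Option<Option<NonZeroUsize>>`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TargetFileSize {
    /// Fall back to the delta table configuration (the old `None`).
    FromTableConfig,
    /// Never roll to a new file until the writer closes (the old `Some(None)`).
    Unbounded,
    /// Explicit per-write override in bytes (the old `Some(val)`).
    Bytes(NonZeroUsize),
}

impl TargetFileSize {
    /// Resolve to a concrete byte bound; `None` means unbounded.
    pub fn resolve(self, table_config: Option<NonZeroUsize>) -> Option<NonZeroUsize> {
        match self {
            TargetFileSize::FromTableConfig => table_config,
            TargetFileSize::Unbounded => None,
            TargetFileSize::Bytes(n) => Some(n),
        }
    }
}
```

Each variant documents its intent at the call site, which avoids the "what does `Some(None)` mean here?" question the nested option raises.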
```diff
-let target_file_size = this.target_file_size.or_else(|| {
-    Some(super::get_target_file_size(config, &this.configuration) as usize)
+let target_file_size = this.target_file_size.unwrap_or_else(|| {
+    this.snapshot.as_ref().and_then(|snapshot| {
```
I think this will cause a first write to be unbounded, no? Not sure if that's something we want.
Good catch, it should fall back to the default size.
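The fix discussed above can be sketched as a three-level resolution: an explicit setting wins (including `Some(None)` for deliberately unbounded writes), otherwise the snapshot's table configuration applies, otherwise a global default, so a first write with no snapshot is still bounded. The constant and function names here are illustrative assumptions, not delta-rs code:

```rust
use std::num::NonZeroUsize;

// Assumed default; delta-rs defines its own default target size.
const DEFAULT_TARGET_FILE_SIZE: usize = 100 * 1024 * 1024; // 100 MiB

/// Resolve the effective target size. `None` in the result means unbounded.
fn effective_target_size(
    explicit: Option<Option<NonZeroUsize>>,
    snapshot_config: Option<NonZeroUsize>,
) -> Option<NonZeroUsize> {
    match explicit {
        // Explicit setting: either an override or an intentional unbounded write.
        Some(setting) => setting,
        // No explicit setting: table config, then the global default,
        // so a first write (no snapshot yet) never becomes unbounded by accident.
        None => snapshot_config.or_else(|| NonZeroUsize::new(DEFAULT_TARGET_FILE_SIZE)),
    }
}
```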
```diff
 partition_values.clone(),
 Some(task_parameters.writer_properties.clone()),
-Some(task_parameters.input_parameters.target_size as usize),
+// Since we know the total size of the bin, we can set the target file size to None.
```
👍
This moves a bunch of parameters to be …
Also, now that I think about this, using … @rtyler thoughts?
Description
I was trying to investigate the issue in #3855, and I believe I've figured it out. The problem is that when we perform a compaction, we create a bin of files whose total size exceeds the target file size, and then, when we go to write the file, we apply that exact same target file size again. This means that if, say, the files in your bin sum to 105 MB, you write a 100 MB file and a 5 MB file, which I would argue is undesired behavior. Additionally, due to compression/rounding, I've observed those two files come out at 97 MB and 4 MB respectively, which then leads to infinite compactions over the same set of files and generates wasteful IO.
Instead, this changes compactions to allow target_file_size to be unbounded when actually performing the rewrite operation. To do this, I had to replumb a lot of the code, since everything assumes target-size writes. Additionally, I found three separate places where a default target size is set, so I consolidated them into one.
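The bug and the fix described above can be modeled with a toy function (names and MB-based sizes are illustrative, not the actual compaction code): the old path re-applies the target size to a bin that was already packed past it, splitting one bin into a large file plus a small remainder, while the new path writes the whole bin as a single file.

```rust
/// Toy model of the rewrite step for one compaction bin.
/// `bounded = true` mimics the old behaviour; `false` mimics this PR.
fn rewrite_sizes_mb(bin_total_mb: u64, target_mb: u64, bounded: bool) -> Vec<u64> {
    if !bounded {
        // New behaviour: the bin was already sized by bin-packing,
        // so write it as one file, e.g. a 105 MB bin -> one 105 MB file.
        return vec![bin_total_mb];
    }
    // Old behaviour: split at the target again, e.g. 105 MB -> 100 MB + 5 MB,
    // leaving a small file that triggers the next (pointless) compaction.
    let mut files = Vec::new();
    let mut remaining = bin_total_mb;
    while remaining > target_mb {
        files.push(target_mb);
        remaining -= target_mb;
    }
    if remaining > 0 {
        files.push(remaining);
    }
    files
}
```

The small remainder file is what keeps re-qualifying for compaction, which is why the unbounded rewrite breaks the loop.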
Related Issue(s)
Closes #3855
Documentation