fix(reader): filter row groups when FileScanTask contains byte ranges #1779
What issue does this PR close?
Partially address #1749.
Rationale for this change

Iceberg's file splitting feature allows large Parquet files to be divided across multiple tasks for parallel processing. When Iceberg Java splits a file using `splitOffsets()`, it returns byte positions corresponding to row group boundaries, and each `FileScanTask` contains `start` and `length` fields specifying which byte range to read.

However, iceberg-rust ignores these fields for row group pruning. This manifested as a test failure in Comet, where the Iceberg Java test `TestRewriteDataFilesAction` returned duplicate rows.

Root cause: The `process_file_scan_task` function in iceberg-rust does not filter row groups based on the `start` and `length` byte ranges, despite these fields being passed into each `FileScanTask`.

What changes are included in this PR?
New method `filter_row_groups_by_byte_range` (lines 733-776):
- Filters row groups against the `[start, start+length)` byte range

Integrated byte range filtering into `process_file_scan_task` (lines 245-291):
- Calls `filter_row_groups_by_byte_range` when `start != 0 || length != 0`, to maintain backwards compatibility

Technical details:
- Overlap check: `rg_start < end && start < rg_end`

Are these changes tested?
New test `test_file_splits_respect_byte_ranges` (lines 1325-1523):
- Uses `FileScanTask` instances with different byte ranges

Iceberg Java's `TestRewriteDataFilesAction` tests now pass with Comet.
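
For reference, the overlap condition listed under Technical details (`rg_start < end && start < rg_end`) can be sketched in isolation as follows. This is a minimal illustration and not the code from this PR: `RowGroupSpan` and `select_row_groups` are hypothetical names, and each row group's byte offset and size are assumed to be known up front rather than derived from Parquet metadata.

```rust
/// Hypothetical stand-in for a row group's position in the file
/// (not a type from iceberg-rust or the parquet crate).
#[derive(Debug, Clone, Copy)]
pub struct RowGroupSpan {
    /// Byte offset of the first byte of the row group.
    pub start: u64,
    /// Size of the row group in bytes.
    pub len: u64,
}

/// Returns the indices of row groups whose byte span overlaps the
/// half-open task range `[start, start + length)`.
pub fn select_row_groups(row_groups: &[RowGroupSpan], start: u64, length: u64) -> Vec<usize> {
    // Exclusive end of the task's byte range.
    let end = start.saturating_add(length);
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| {
            let rg_start = rg.start;
            let rg_end = rg.start.saturating_add(rg.len);
            // Two half-open intervals overlap iff each starts before the other ends.
            rg_start < end && start < rg_end
        })
        .map(|(idx, _)| idx)
        .collect()
}
```

With three contiguous 100-byte row groups starting at offset 4, a task covering `start = 4, length = 100` selects only the first row group, and a task covering `start = 104, length = 200` selects the other two. Because the ranges are half-open, split ranges that meet at a row group boundary never select the same row group twice, which is the property that prevents duplicate rows.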