# feat: Introduce row index metadata column #1272

## Conversation
**Codecov Report** ❌ — additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1272      +/-   ##
==========================================
+ Coverage   84.23%   84.28%    +0.04%
==========================================
  Files         111      111
  Lines       27497    27641      +144
  Branches    27497    27641      +144
==========================================
+ Hits        23163    23296      +133
- Misses       3203     3215       +12
+ Partials     1131     1130        -1
```
Force-pushed from 58651a0 to b66e562.
Approach LGTM. All the blockers are lower in the PR stack, I believe.
Force-pushed from b66e562 to 38d4963.
```rust
Some(MetadataColumnSpec::RowIndex) => {
    debug!("Inserting a row index column: {}", field.name());
    reorder_indices.push(ReorderIndex::row_index(
        requested_position,
        Arc::new(field.try_into_arrow()?),
    ));
}
```
From the other PR discussions, it seems like we have two possible approaches for resolving name collisions with metadata:
- Forbid name collisions (at least vs. the table schema). Note that this can still lead to name collisions at the file level, because there's no rule forbidding the parquet file from having extra columns in it.
- Treat metadata columns as special, and don't even look at the parquet schema when resolving them.

If we go with approach 2/, then this loop doesn't even need to check for metadata columns because the requested schema never contains them. If we go with approach 1/... not sure. Do we blow up with an "incompatible parquet file" error of some kind?
If we go with approach 2/, my understanding is that this loop can stay as is because it loops over the fields from the requested schema that couldn't be found in the parquet columns. Since we modify `match_parquet_fields` to always consider metadata columns as not matched/found in #1266, this loop should be good as is?
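A minimal sketch of approach 2/ as described here; the simplified `Field` type and the `match_parquet_field` helper are assumptions for illustration, not the actual kernel code:

```rust
// Hypothetical, simplified field type for illustration; the real kernel
// types (StructField, MetadataColumnSpec) carry more information.
struct Field {
    name: String,
    is_metadata_column: bool, // e.g. a row-index metadata column
}

// Approach 2/: metadata columns are never resolved against the parquet
// schema, so a same-named parquet column cannot collide with them; the
// reader materializes the metadata column itself.
fn match_parquet_field(requested: &Field, parquet_fields: &[Field]) -> Option<usize> {
    if requested.is_metadata_column {
        return None; // always "not matched/found", as #1266 does
    }
    parquet_fields.iter().position(|pf| pf.name == requested.name)
}
```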
looks good! would just like to make the row index computation optional
Force-pushed from 333bd45 to 8eceb51.
Commits: Address review comments · Address 2nd review comments · Make row indexes optional
Force-pushed from 8eceb51 to 465784a.
lgtm! thanks, excited to have this support
I think I can stamp to unblock and let's take follow-ups, but my main concern is lacking a test for whenever we have a predicate? (lmk if I've overlooked it)
```rust
let starting_offsets = match self.row_group_ordinals {
    Some(ordinals) => ordinals
        .iter()
        .map(|i| self.row_group_row_index_ranges[*i].clone())
```
do we need the `clone`?
also I think raw indexing can cause panics - can we see if we can do better here, either docs for the invariant or adding a bounds check, etc.?
Good point about panicking in raw indexing! I changed this to `get()` to ignore invalid indices in optimized builds and added a `debug_assert!` that should catch improper use of `RowIndexBuilder::select_row_groups` (i.e., out-of-bounds indices in `self.row_group_ordinals`) during tests.

As for the `clone`: I think `clone` is a reasonable trade-off here, since (1) we are only cloning `Range`s (i.e., two `i64`s) and (2) preventing the clone would require us to modify the underlying vector (costly) or design something like a `FilteredRowIndexIterator` that does not collect intermediate results.
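A minimal sketch of the `get()` + `debug_assert!` pattern described here; the field names are taken from the snippet above, but the surrounding struct and method are assumptions for illustration:

```rust
use std::ops::Range;

// Hypothetical, simplified stand-in for the builder state discussed above;
// the real RowIndexBuilder in the kernel carries more state.
struct RowIndexBuilder {
    row_group_row_index_ranges: Vec<Range<i64>>,
    row_group_ordinals: Option<Vec<usize>>,
}

impl RowIndexBuilder {
    fn starting_offsets(&self) -> Vec<Range<i64>> {
        match &self.row_group_ordinals {
            Some(ordinals) => ordinals
                .iter()
                .filter_map(|i| {
                    // debug_assert! catches out-of-bounds ordinals in tests...
                    debug_assert!(
                        *i < self.row_group_row_index_ranges.len(),
                        "row group ordinal out of bounds"
                    );
                    // ...while get() skips them in optimized builds instead of
                    // panicking. Cloning is cheap: a Range<i64> is two i64s.
                    self.row_group_row_index_ranges.get(*i).cloned()
                })
                .collect(),
            None => self.row_group_row_index_ranges.clone(),
        }
    }
}
```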
```rust
pub(crate) fn fixup_parquet_read<T>(
    batch: RecordBatch,
    requested_ordering: &[ReorderIndex],
    row_indexes: Option<&mut <RowIndexBuilder as IntoIterator>::IntoIter>,
```
the `<Struct as Trait>::AssociatedType` feels a little hard to parse (I understand writing the full type above would be worse) - wonder if we can improve just with a type alias?
I introduced `type FlattenedRangeIterator<T>`.
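One plausible shape for such an alias; the exact definition below is an assumption for illustration, since the thread only names the alias:

```rust
use std::ops::Range;

// Hypothetical definition: flattening a vector of per-row-group index ranges
// into a single stream of row indexes. Range<T> is itself an iterator for
// integer T, so Flatten turns Vec<Range<i64>> into an iterator of i64s.
type FlattenedRangeIterator<T> = std::iter::Flatten<std::vec::IntoIter<Range<T>>>;

// With the alias, the parameter from the snippet above could read as:
// row_indexes: Option<&mut FlattenedRangeIterator<i64>>
```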
```rust
])?);

let snapshot = Snapshot::builder_for(location).build(engine.as_ref())?;
let scan = snapshot.scan_builder().with_schema(schema).build()?;
```
do we have a test with a predicate to test the skipping logic?
I added unit tests for row group skipping in `arrow_utils.rs`. As for integration tests, we first need to address #860 to support predicate push-down in scans, right?
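To make the skipping discussion concrete, such a unit test might look roughly like the sketch below; the `RowIndexBuilder::new` constructor and the per-group row counts are assumptions, not the actual `arrow_utils.rs` tests:

```rust
// Hypothetical test sketch: with two row groups of 10 rows each, selecting
// only the second group should yield row indexes 10..20.
#[test]
fn row_index_builder_respects_row_group_selection() {
    let mut builder = RowIndexBuilder::new(&[10, 10]); // rows per row group (assumed ctor)
    builder.select_row_groups(&[1]); // skipping kept only the second group
    let indexes: Vec<i64> = builder.into_iter().collect();
    assert_eq!(indexes, (10..20).collect::<Vec<i64>>());
}
```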
I quickly vibe-coded a test to derisk merge + release. Seems like everything is fine, will follow up with it shortly after release. cc @nicklan
…Builder (#1334)

## What changes are proposed in this pull request?

This PR follows up on #1272 and addresses open review comments.

- It removes raw indexing that can theoretically lead to panics.
- It refactors `impl IntoIterator` into a fallible `RowIndexBuilder::build` method that verifies that row group ordinals are unique and within bounds.
- It replaces complex type signatures with more intuitive type aliases.
- It adds more unit tests for `RowIndexBuilder`.

Note: We cannot add integration tests for row index reads with filter predicates before addressing #860 and implementing predicate pushdown.

## How was this change tested?

New UT.

Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request?

This PR follows up on #1266 and adds support for reading the row index metadata column to the default engine. The implementation directly follows the approach proposed in #920 and slightly modifies it to match the new metadata column API.

Quoting from #920

## This PR affects the following public APIs

None - the breaking changes were introduced in #1266.

## How was this change tested?

New UT.
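Read end to end, the test snippet quoted earlier in this thread suggests a usage shape like the following; the schema-construction calls (`StructType::try_new`, `StructField::not_null`, `StructField::create_metadata_column`) and the column name are assumptions for illustration:

```rust
// Hypothetical usage sketch based on the snippet quoted above; only the
// snapshot/scan lines are taken verbatim from the quoted test.
let schema = Arc::new(StructType::try_new(vec![
    StructField::not_null("value", DataType::INTEGER),
    // Request the row index as a metadata column alongside regular data.
    StructField::create_metadata_column("row_index", MetadataColumnSpec::RowIndex),
])?);

let snapshot = Snapshot::builder_for(location).build(engine.as_ref())?;
let scan = snapshot.scan_builder().with_schema(schema).build()?;
// Each scanned batch should then carry a "row_index" column computed by the
// default engine's parquet reader rather than read from the file itself.
```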