Commit 9880c2c

lbhm and zachschuermann authored
feat!: Introduce row index metadata column (#1272)
## What changes are proposed in this pull request?

This PR follows up on #1266 and adds support for reading the row index metadata column to the default engine. The implementation directly follows the approach proposed in #920 and slightly modifies it to match the new metadata column API.

Quoting from #920:

> Deletion vectors (and row tracking, eventually) rely on accurate file-level row indexes. But they're not implemented in the kernel's default parquet reader. That means we must rely on the position of rows in data batches returned by each read, and we cannot apply optimizations such as stats-based row group skipping (see #860).
>
> Add row index support to the default Parquet reader, in the form of a new `RowIndex` variant of `ReorderIndexTransform`. [...] The default parquet reader recognizes (the `RowIndex` metadata) column and injects a transform to generate row indexes (with appropriate adjustments for any row group skipping that might occur).
>
> Fixes #919
>
> NOTE: If/when arrow-rs parquet reader gains native support for row indexes, e.g. apache/arrow-rs#7307, we should switch to using that. Our solution here is not robust to advanced parquet reader features like page-level skipping, row-level predicate pushdown, etc.

### This PR affects the following public APIs

None - the breaking changes were introduced in #1266.

## How was this change tested?

New UT.

Co-authored-by: Zach Schuermann <[email protected]>
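
For illustration only, the row group skipping adjustment mentioned in the quote from #920 can be sketched as follows. This is a simplified model, not the code in this commit: the function names `row_group_starts`, `selected_row_index_ranges`, and `row_indexes` are hypothetical. It assumes the reader knows the row count of every row group in the file and which row groups survived skipping. Each surviving row group then contributes a contiguous range of file-level row indexes, starting at the total row count of all preceding row groups, skipped or not.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of the kernel API): computes the file-level
/// row index at which each row group starts, given per-group row counts.
fn row_group_starts(row_counts: &[i64]) -> Vec<i64> {
    let mut starts = Vec::with_capacity(row_counts.len());
    let mut next = 0i64;
    for &n in row_counts {
        starts.push(next);
        next += n;
    }
    starts
}

/// Returns one file-level row index range per selected row group.
/// `selected` holds the ordinals of row groups that were NOT skipped.
/// Panics if an ordinal is out of bounds for `row_counts`.
fn selected_row_index_ranges(row_counts: &[i64], selected: &[usize]) -> Vec<Range<i64>> {
    let starts = row_group_starts(row_counts);
    selected
        .iter()
        .map(|&g| starts[g]..starts[g] + row_counts[g])
        .collect()
}

/// Flattens the ranges into the row index of every row actually read,
/// in the order the rows would appear across the returned batches.
fn row_indexes(row_counts: &[i64], selected: &[usize]) -> Vec<i64> {
    selected_row_index_ranges(row_counts, selected)
        .into_iter()
        .flatten()
        .collect()
}
```

For a file with row groups of 3, 2, and 4 rows where the middle group is skipped, `row_indexes(&[3, 2, 4], &[0, 2])` gives `[0, 1, 2, 5, 6, 7, 8]`: the rows of the last group keep their file-level positions rather than being renumbered from 3. As the NOTE in the quote points out, a scheme at row group granularity like this one does not account for page-level skipping or row-level predicate pushdown.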
1 parent 459e832 commit 9880c2c

5 files changed, +491 -57 lines changed
