feat!: Introduce row index metadata column (#1272)
## What changes are proposed in this pull request?
This PR follows up on #1266 and adds support for reading the row index
metadata column to the default engine. The implementation directly
follows the approach proposed in #920 and slightly modifies it to match
the new metadata column API.
Quoting from #920:
> Deletion vectors (and row tracking, eventually) rely on accurate
file-level row indexes. But they're not implemented in the kernel's
default parquet reader. That means we must rely on the position of rows
in data batches returned by each read, and we cannot apply optimizations
such as stats-based row group skipping (see
#860).
>
> Add row index support to the default Parquet reader, in the form of a
new RowIndex variant of ReorderIndexTransform. [...] The default parquet
reader recognizes (the RowIndex metadata) column and injects a transform
to generate row indexes (with appropriate adjustments for any row group
skipping that might occur).
>
> Fixes #919
>
> NOTE: If/when arrow-rs parquet reader gains native support for row
indexes, e.g. apache/arrow-rs#7307, we should
switch to using that. Our solution here is not robust to advanced
parquet reader features like page-level skipping, row-level predicate
pushdown, etc.
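
To illustrate the "appropriate adjustments for any row group skipping" mentioned in the quote, here is a minimal sketch of how file-level row indexes could be derived from row group sizes and a skip mask. It is not the kernel's actual implementation: `kept_row_ranges`, `row_indexes`, the `keep` mask, and the choice of `i64` for indexes are all assumptions made for this example.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of the kernel API): returns the file-level
/// row index range covered by each row group that is actually read.
/// `row_group_sizes[i]` is the row count of row group `i`, and `keep[i]`
/// says whether that row group survived skipping.
fn kept_row_ranges(row_group_sizes: &[i64], keep: &[bool]) -> Vec<Range<i64>> {
    assert_eq!(row_group_sizes.len(), keep.len());
    let mut ranges = Vec::new();
    let mut start = 0i64;
    for (&size, &kept) in row_group_sizes.iter().zip(keep) {
        let end = start + size;
        if kept {
            ranges.push(start..end);
        }
        // Advance past the row group even when it is skipped, so that later
        // row groups keep their true position in the file.
        start = end;
    }
    ranges
}

/// Hypothetical helper: flattens the kept ranges into a stream of row
/// indexes, in the order the reader would return the rows.
fn row_indexes(row_group_sizes: &[i64], keep: &[bool]) -> impl Iterator<Item = i64> {
    kept_row_ranges(row_group_sizes, keep).into_iter().flatten()
}
```

A reader could then pull one slice of this stream per data batch, for example `indexes.by_ref().take(batch_len).collect::<Vec<i64>>()`, so that a batch spanning a skipped row group still reports the original file positions. As the note in the quote says, this kind of bookkeeping only accounts for whole row groups being skipped and would not stay correct under page-level skipping or row-level predicate pushdown.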
### This PR affects the following public APIs
None - the breaking changes were introduced in #1266.
## How was this change tested?
New unit tests.
Co-authored-by: Zach Schuermann <[email protected]>