Commit 9880c2c

feat!: Introduce row index metadata column (#1272)

## What changes are proposed in this pull request?

This PR follows up on #1266 and adds support for reading the row index metadata column to the default engine. The implementation directly follows the approach proposed in #920 and slightly modifies it to match the new metadata column API.

Quoting from #920:

> Deletion vectors (and row tracking, eventually) rely on accurate file-level row indexes. But they're not implemented in the kernel's default parquet reader. That means we must rely on the position of rows in data batches returned by each read, and we cannot apply optimizations such as stats-based row group skipping (see #860).
>
> Add row index support to the default Parquet reader, in the form of a new RowIndex variant of ReorderIndexTransform. [...] The default parquet reader recognizes (the RowIndex metadata) column and injects a transform to generate row indexes (with appropriate adjustments for any row group skipping that might occur).
>
> Fixes #919
>
> NOTE: If/when arrow-rs parquet reader gains native support for row indexes, e.g. apache/arrow-rs#7307, we should switch to using that. Our solution here is not robust to advanced parquet reader features like page-level skipping, row-level predicate pushdown, etc.

### This PR affects the following public APIs

None - the breaking changes were introduced in #1266.

## How was this change tested?

New UT.

Co-authored-by: Zach Schuermann <[email protected]>
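
For readers unfamiliar with the adjustment for row group skipping mentioned in the quote from #920, the following is a minimal sketch of the idea and not the code from this commit. It assumes only that the reader knows the row count of every row group in the file and which row groups survived skipping. The function names `selected_row_index_ranges` and `row_indexes` are hypothetical and do not appear in the kernel.

```rust
use std::ops::Range;

/// Hypothetical helper (not the kernel's API): given the row count of every
/// row group in a file and the ordinals of the row groups that survived
/// skipping, return the file-level row index range covered by each survivor.
fn selected_row_index_ranges(row_group_rows: &[u64], selected: &[usize]) -> Vec<Range<u64>> {
    // The starting offset of a row group is the sum of the row counts of all
    // row groups before it, including the ones that were skipped.
    let mut starts = Vec::with_capacity(row_group_rows.len());
    let mut offset = 0u64;
    for &rows in row_group_rows {
        starts.push(offset);
        offset += rows;
    }
    selected
        .iter()
        .map(|&rg| starts[rg]..starts[rg] + row_group_rows[rg])
        .collect()
}

/// Flatten the ranges into the row index values, in the order the surviving
/// rows would be returned.
fn row_indexes(ranges: &[Range<u64>]) -> Vec<u64> {
    ranges.iter().cloned().flatten().collect()
}
```

With row groups of 3, 2, and 4 rows and the middle group skipped, `selected_row_index_ranges(&[3, 2, 4], &[0, 2])` yields the ranges `0..3` and `5..9`, so the surviving rows keep their file-level indexes 0, 1, 2, 5, 6, 7, 8 rather than being renumbered by batch position. As the NOTE in the quote points out, an approach at this granularity does not account for page-level skipping or row-level predicate pushdown.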

1 parent 459e832 commit 9880c2c

5 files changed, +491 -57 lines changed

kernel
- src/engine
  - arrow_utils.rs
  - default
    - parquet.rs
  - parquet_row_group_skipping.rs
  - sync
    - parquet.rs
- tests
  - read.rs
