-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Support typed access for numeric types in variant_get #8179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Variant] Support typed access for numeric types in variant_get #8179
Conversation
} | ||
} | ||
|
||
impl From<half::f16> for Variant<'_, '_> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed to support DataType::Float16
.
Unrelated to this issue: Seems like it would make sense to also add Variant::as_f16
for symmetry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// in this case dispatch on the typed_value and | ||
// TODO macro'ize this using downcast! to handle all other primitive types | ||
// TODO(perf): avoid builders entirely (and write the raw variant directly as we know the metadata is the same) | ||
let mut array_builder = VariantArrayBuilder::new(variant_array.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this variable into the macro itself, although passing this into the macro would also work.
FYI @carpecodeum and @scovich do you have some time to review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for i in 0..$variant_array.len() { | ||
if primitive_array.is_null(i) { | ||
array_builder.append_null(); | ||
} else { | ||
let value = primitive_array.value(i); | ||
array_builder.append_variant(Variant::from(value)); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if https://github.com/apache/arrow-rs/pull/8105/files#diff-ed469f33b865066aa56a96aa5279ce67d4bf85ea34c7a8c12ffb16afef107b6fR23-R31 merges, this simplifies to:
for i in 0..$variant_array.len() { | |
if primitive_array.is_null(i) { | |
array_builder.append_null(); | |
} else { | |
let value = primitive_array.value(i); | |
array_builder.append_variant(Variant::from(value)); | |
} | |
} | |
primitive_conversion_array!($arrow_type, $typed_value, array_builder); |
TBD whether it's worth defining the macro just to turn 3 lines into one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, re-using that macro after it merges would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched over to using the merged macro.
DataType::Int8 => { | ||
cast_partially_shredded_primitive!( | ||
typed_value, | ||
variant_array, | ||
Int8Type, | ||
DataType::Int8 | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: I'm a bit surprised that fmt
didn't simplify these match arms:
DataType::Int8 => { | |
cast_partially_shredded_primitive!( | |
typed_value, | |
variant_array, | |
Int8Type, | |
DataType::Int8 | |
) | |
} | |
DataType::Int8 => cast_partially_shredded_primitive!( | |
typed_value, | |
variant_array, | |
Int8Type, | |
DataType::Int8 | |
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's having trouble inferring things across the macro boundary, so it doesn't know that it can simplify the match arms. I've noticed that macros sometimes behave strangely when it comes to formatting(and LSP diagnostics).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually suspect fmt
just plain isn't running, for some reason? I've seen other places in this file that aren't formatted as I would have expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did an experiment with messing up the formatting of the file and running cargo fmt
manually. It modifies the file back to it's current state. I don't have any rustfmt.toml
customizations, so I think this is the formatted result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok... weird, but I guess fmt does what it does
95a4fab
to
db5183d
Compare
let typed_value = $array_type::from(vec![ | ||
Some(<$primitive_type>::try_from(34u8).unwrap()), // row 0 is shredded, so it has a value | ||
None, // row 1 is null, so no value | ||
None, // row 2 is a string, so no typed value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why cargo fmt
aligns the comment like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wild guess: it would pass 100 char limit if indented like the others
for i in 0..$variant_array.len() { | ||
if primitive_array.is_null(i) { | ||
array_builder.append_null(); | ||
} else { | ||
let value = primitive_array.value(i); | ||
array_builder.append_variant(Variant::from(value)); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, re-using that macro after it merges would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @superserious-dev and @scovich -- this PR looks good to me. I think it needs to have some merge conflicts resolved and we'll be ready go merge it.
} | ||
} | ||
|
||
impl From<half::f16> for Variant<'_, '_> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) -> arrow::error::Result<ArrayRef> { | ||
// in this case dispatch on the typed_value and | ||
// TODO macro'ize this using downcast! to handle all other primitive types | ||
// TODO(perf): avoid builders entirely (and write the raw variant directly as we know the metadata is the same) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that #8208 is done, we could use the ReadOnlyMetadataBuilder (as a follow on PR)
db5183d
to
49390f4
Compare
Incorporated @scovich's suggestion and rebased on the latest main. One question: This PR introduces a few macros(eg. arrow-rs/parquet-variant-compute/src/variant_get/mod.rs Lines 368 to 387 in 1dacecb
Any suggestions for whether to keep the macros in this PR or switch over to the generic functions? There are also quite a few conversion left, so trying to consider what would work well for those.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good enough to merge, possibly with a couple tweaks to tests (see below)
Open questions/follow-ups in my mind:
-
re
Can use
variant_get
for shredded numeric typesWhat happens if
variant_get
encounters atyped_value: LONG
at the requested path, but the caller requestedSome(Int32)
type that doesn't match? Is there code somewhere else to handle the casting? (I don't see any unit tests that cover the case, either) -
this PR modifies
OutputBuilder
and the types that implement it, while another open PR aims to delete the trait entirely:But that other PR seems to have stalled due to questions around null mask handling (**), and the tests in this PR will remain useful even if the implementation changes later.
Should we go ahead and merge this PR even knowing a lot of the code it touches will need to change Real Soon?
(**) Honestly, we should probably let the other PR merge without addressing the null mask question, because it's not making the problem worse.
-
re
the macro approach reduces some boilerplate while making it harder to read the tests
Any suggestions for whether to keep the macros in this PR or switch over to the generic functions?
I left a couple comments in the code with my suggestions. Nothing drastic.
($test:ident, $data_fn:ident, $primitive_type:ty) => { | ||
#[test] | ||
fn $test() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're worried about readability -- I'm not sure we gain a lot by pulling the test function definition inside the macro? What if we left that to the caller, and swapped the remaining two args, so that:
numeric_partially_shredded_test!(
get_variant_partially_shredded_int8_as_variant,
partially_shredded_int8_variant_array,
i8
);
becomes
#[test]
fn get_variant_partially_shredded_int8_as_variant() {
numeric_partially_shredded_test!(i8, partially_shredded_int8_variant_array);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I agree that it looks more readable this way.
let typed_value = $array_type::from(vec) and after ([0ba91ae..f6fd915](https://github.com/apache/arrow-rs/compare/0ba91aed9..f6fd91583)) diffs gives the following non-trivial differences vs. the original PR * Ran `cargo fmt` * `typed_value_to_variant` now supports all primitive numeric types (previously only int16) * cast options plumbed through and respected * Fix a null buffer bug in `shredded_get_path` -- the original code was wrongly unioning in the null buffer from `typed_value` column: ```patch // Path exhausted! Create a new `VariantArray` for the location we landed on. - // Also union nulls from the final typed_value field we landed on - if let Some(typed_value) = shredding_state.typed_value_field() { - accumulated_nulls = arrow::buffer::NullBuffer::union( - accumulated_nulls.as_ref(), - typed_value.nulls(), - ); - } let target = make_target_variant( shredding_state.value_field().cloned(), shredding_state.typed_value_field().cloned(), accumulated_nulls, ); ``` * Remove the `get_variant_perfectly_shredded_int32_as_variant` test case, because #8179 introduced a battery of unit tests that cover the same functionality. * Remove now-unnecessary `.unwrap()` calls from object builder `finish` calls in unit tests * Fixed broken test code in `create_depth_1_shredded_test_data_working`, which captured the return value of a nested builder's `finish` (`()`) instead of the return value of the top-level builder. I'm not quite sure what this code was trying to do, but I changed it to just create a nested builder instead of a second top-level builder: ```patch fn create_depth_1_shredded_test_data_working() -> ArrayRef { // Create metadata following the working pattern from shredded_object_with_x_field_variant_array let (metadata, _) = { - let a_variant = { - let mut a_builder = parquet_variant::VariantBuilder::new(); - let mut a_obj = a_builder.new_object(); - a_obj.insert("x", Variant::Int32(55)); // "a.x" field (shredded when possible) - a_obj.finish().unwrap() - }; let mut builder = parquet_variant::VariantBuilder::new(); let mut obj = builder.new_object(); - obj.insert("a", a_variant); + + // Create the nested "a" object + let mut a_obj = obj.new_object("a"); + a_obj.insert("x", Variant::Int32(55)); + a_obj.finish(); + obj.finish().unwrap(); builder.finish() }; ``` * Similar fix (twice, `a_variant` and `b_variant`) for `create_depth_2_shredded_test_data_working` * `make_shredding_row_builder` now supports signed int and float types (unsigned int not supported yet) * A new `get_type_name` helper in row_builder.rs that gives human-readable data type names. I'm not convinced it's necessary (and the code is in the wrong spot, jammed in the middle of `VariantAsPrimitive` code. * `impl VariantAsPrimitive` for all signed int and float types * `PrimitiveVariantShreddingRowBuilder` now has a lifetime param because it takes a reference to cast options (it now respects unsafe vs. safe casting) # What changes are included in this PR? Everything in the original PR, plus merge in the main branch, fix logical conflicts and fix various broken tests. # Are these changes tested? All unit tests now pass. # Are there any user-facing changes? No (variant is not public yet) --------- Co-authored-by: carpecodeum <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Are these changes tested?
Yes
Are there any user-facing changes?
Can use
variant_get
for shredded numeric types