Skip to content

Conversation

superserious-dev
Copy link
Contributor

Which issue does this PR close?

Are these changes tested?

Yes

Are there any user-facing changes?

Can use variant_get for shredded numeric types

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Aug 20, 2025
}
}

impl From<half::f16> for Variant<'_, '_> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to support DataType::Float16.

Unrelated to this issue: Seems like it would make sense to also add Variant::as_f16 for symmetry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// in this case dispatch on the typed_value and
// TODO macro'ize this using downcast! to handle all other primitive types
// TODO(perf): avoid builders entirely (and write the raw variant directly as we know the metadata is the same)
let mut array_builder = VariantArrayBuilder::new(variant_array.len());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this variable into the macro itself, although passing this into the macro would also work.

@alamb
Copy link
Contributor

alamb commented Aug 20, 2025

FYI @carpecodeum and @scovich do you have some time to review this PR?

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to

... it will be interesting to see how this ends up meshing with

(I'm guessing the infrastructure for casting arrays remains useful, and the call/use site would just change if/when the other PR merges?)

Comment on lines 33 to 40
for i in 0..$variant_array.len() {
if primitive_array.is_null(i) {
array_builder.append_null();
} else {
let value = primitive_array.value(i);
array_builder.append_variant(Variant::from(value));
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if https://github.com/apache/arrow-rs/pull/8105/files#diff-ed469f33b865066aa56a96aa5279ce67d4bf85ea34c7a8c12ffb16afef107b6fR23-R31 merges, this simplifies to:

Suggested change
for i in 0..$variant_array.len() {
if primitive_array.is_null(i) {
array_builder.append_null();
} else {
let value = primitive_array.value(i);
array_builder.append_variant(Variant::from(value));
}
}
primitive_conversion_array!($arrow_type, $typed_value, array_builder);

TBD whether it's worth defining the macro just to turn 3 lines into one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, re-using that macro after it merges would work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched over to using the merged macro.

Comment on lines 90 to 76
DataType::Int8 => {
cast_partially_shredded_primitive!(
typed_value,
variant_array,
Int8Type,
DataType::Int8
)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: I'm a bit surprised that fmt didn't simplify these match arms:

Suggested change
DataType::Int8 => {
cast_partially_shredded_primitive!(
typed_value,
variant_array,
Int8Type,
DataType::Int8
)
}
DataType::Int8 => cast_partially_shredded_primitive!(
typed_value,
variant_array,
Int8Type,
DataType::Int8
),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's having trouble inferring things across the macro boundary, so it doesn't know that it can simplify the match arms. I've noticed that macros sometimes behave strangely when it comes to formatting(and LSP diagnostics).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually suspect fmt just plain isn't running, for some reason? I've seen other places in this file that aren't formatted as I would have expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did an experiment with messing up the formatting of the file and running cargo fmt manually. It modifies the file back to it's current state. I don't have any rustfmt.toml customizations, so I think this is the formatted result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok... weird, but I guess fmt does what it does

@superserious-dev superserious-dev force-pushed the typed-access-numeric-shredded branch from 95a4fab to db5183d Compare August 22, 2025 00:34
let typed_value = $array_type::from(vec![
Some(<$primitive_type>::try_from(34u8).unwrap()), // row 0 is shredded, so it has a value
None, // row 1 is null, so no value
None, // row 2 is a string, so no typed value
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why cargo fmt aligns the comment like this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wild guess: it would pass 100 char limit if indented like the others

Comment on lines 33 to 40
for i in 0..$variant_array.len() {
if primitive_array.is_null(i) {
array_builder.append_null();
} else {
let value = primitive_array.value(i);
array_builder.append_variant(Variant::from(value));
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, re-using that macro after it merges would work.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @superserious-dev and @scovich -- this PR looks good to me. I think it needs to have some merge conflicts resolved and we'll be ready go merge it.

}
}

impl From<half::f16> for Variant<'_, '_> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

) -> arrow::error::Result<ArrayRef> {
// in this case dispatch on the typed_value and
// TODO macro'ize this using downcast! to handle all other primitive types
// TODO(perf): avoid builders entirely (and write the raw variant directly as we know the metadata is the same)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that #8208 is done, we could use the ReadOnlyMetadataBuilder (as a follow on PR)

@superserious-dev superserious-dev force-pushed the typed-access-numeric-shredded branch from db5183d to 49390f4 Compare August 27, 2025 18:12
@superserious-dev
Copy link
Contributor Author

Incorporated @scovich's suggestion and rebased on the latest main.

One question: This PR introduces a few macros(eg. numeric_perfectly_shredded_test,numeric_partially_shredded_test, numeric_perfectly_shredded_variant_array_fn, numeric_partially_shredded_variant_array_fn) to reduce some boilerplate with the numeric tests; @klion26 Introduced an alternate approach for using functions that are generic over PrimitiveArray

fn perfectly_shredded_variant_array<T>(typed_value: PrimitiveArray<T>) -> ArrayRef
where
T: arrow::datatypes::ArrowPrimitiveType,
{
// At the time of writing, the `VariantArrayBuilder` does not support shredding.
// so we must construct the array manually. see https://github.com/apache/arrow-rs/issues/7895
let (metadata, _value) = { parquet_variant::VariantBuilder::new().finish() };
let metadata =
BinaryViewArray::from_iter_values(std::iter::repeat_n(&metadata, typed_value.len()));
let struct_array = StructArrayBuilder::new()
.with_field("metadata", Arc::new(metadata))
.with_field("typed_value", Arc::new(typed_value))
.build();
Arc::new(
VariantArray::try_new(Arc::new(struct_array)).expect("should create variant array"),
)
}
. I can see pros and cons to each:

  • the macro approach reduces some boilerplate while making it harder to read the tests
  • the generic approach makes the tests easier to understand at the cost of some more boilerplate

Any suggestions for whether to keep the macros in this PR or switch over to the generic functions? There are also quite a few conversion left, so trying to consider what would work well for those.

  • StringTypes: DataType::Utf8, Utf8View, etc
  • BinaryTypes: Binary, BinaryView, etc
  • Boolean
  • Time Types
  • Date Types
  • Timestamp Types

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good enough to merge, possibly with a couple tweaks to tests (see below)

Open questions/follow-ups in my mind:

  1. re

    Can use variant_get for shredded numeric types

    What happens if variant_get encounters a typed_value: LONG at the requested path, but the caller requested Some(Int32) type that doesn't match? Is there code somewhere else to handle the casting? (I don't see any unit tests that cover the case, either)

  2. this PR modifies OutputBuilder and the types that implement it, while another open PR aims to delete the trait entirely:

    But that other PR seems to have stalled due to questions around null mask handling (**), and the tests in this PR will remain useful even if the implementation changes later.

    Should we go ahead and merge this PR even knowing a lot of the code it touches will need to change Real Soon?

    (**) Honestly, we should probably let the other PR merge without addressing the null mask question, because it's not making the problem worse.

  3. re

    the macro approach reduces some boilerplate while making it harder to read the tests

    Any suggestions for whether to keep the macros in this PR or switch over to the generic functions?

    I left a couple comments in the code with my suggestions. Nothing drastic.

Comment on lines 208 to 210
($test:ident, $data_fn:ident, $primitive_type:ty) => {
#[test]
fn $test() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're worried about readability -- I'm not sure we gain a lot by pulling the test function definition inside the macro? What if we left that to the caller, and swapped the remaining two args, so that:

    numeric_partially_shredded_test!(
        get_variant_partially_shredded_int8_as_variant,
        partially_shredded_int8_variant_array,
        i8
    );

becomes

    #[test]
    fn get_variant_partially_shredded_int8_as_variant() {
        numeric_partially_shredded_test!(i8, partially_shredded_int8_variant_array);
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. I agree that it looks more readable this way.

let typed_value = $array_type::from(vec![
Some(<$primitive_type>::try_from(34u8).unwrap()), // row 0 is shredded, so it has a value
None, // row 1 is null, so no value
None, // row 2 is a string, so no typed value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wild guess: it would pass 100 char limit if indented like the others

use std::sync::Arc;

macro_rules! cast_partially_shredded_primitive {
($typed_value:expr, $variant_array:expr, $arrow_type:ty, $data_type:expr) => {{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$data_type arg is unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch; removed.


macro_rules! cast_partially_shredded_primitive {
($typed_value:expr, $variant_array:expr, $arrow_type:ty, $data_type:expr) => {{
let mut array_builder = VariantArrayBuilder::new($variant_array.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look quite right? Seems like we should either

  1. Propagate the existing metadata column unchanged, or
  2. Create a new column of empty metadata, since these are all primitives that don't need field names in the first place?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually... the current code is taking approach 2/, and is correct.

It could be optimized to just stamp out the empty metadata directly (or even use a dictionary array to reference the same one over and over).

If we pursued approach 1/, we wouldn't have to do any work at all -- just arc clone the existing column -- but we'd need to create a new kind of variant array builder that doesn't try to create a metadata column.

Either way, it seems like a potential TODO to track, rather than something to pursue in this PR.

Comment on lines 73 to 74
match typed_value.data_type() {
DataType::Int32 => {
let primitive_array = typed_value.as_primitive::<Int32Type>();
for i in 0..variant_array.len() {
if variant_array.is_null(i) {
array_builder.append_null();
continue;
}

if typed_value.is_null(i) {
// fall back to the value (variant) field
// (TODO could copy the variant bytes directly)
let value = variant_array.value(i);
array_builder.append_variant(value);
continue;
}

// otherwise we have a typed value, so we can use it directly
let int_value = primitive_array.value(i);
array_builder.append_variant(Variant::from(int_value));
}
}
DataType::Int8 => cast_partially_shredded_primitive!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like something that should be handled by downcast_primitive_array, except we don't support all primitive types yet?

This match is mostly dealing with arrow data types and primitive downcasting, which that macro should handle automagically?

downcast_primitive_array!(
    typed_value => {
        let mut array_builder = VariantArrayBuilder::new(variant_array.len());
        for i in 0..variant_array.len() {
            if variant_array.is_null(i) {
                array_builder.append_null();
                continue;
            }
            
            // We must use the typed_value, if present; otherwise fall back to the value
            let value = if typed_value.is_null(i) {
                variant_array.value(i) // TODO: copy the variant bytes directly?
            } else {
                typed_value.value(i).into()
            };
            array_builder.append_variant(value);
        }
        Ok(Arc::new(array_builder.build()))            
    }
    dt => { 
        ... same error as today ...
    }
}

@superserious-dev
Copy link
Contributor Author

re casting, I left a comment here: #8087 (comment). Since casting is an extra step after typed access, I suggested implementing the conversion logic in #8086.

@scovich
Copy link
Contributor

scovich commented Aug 28, 2025

Can use variant_get for shredded numeric types

What happens if variant_get encounters a typed_value: LONG at the requested path, but the caller requested Some(Int32) type that doesn't match? Is there code somewhere else to handle the casting? (I don't see any unit tests that cover the case, either)

Any thoughts on this, @superserious-dev ?

@superserious-dev
Copy link
Contributor Author

Can use variant_get for shredded numeric types

What happens if variant_get encounters a typed_value: LONG at the requested path, but the caller requested Some(Int32) type that doesn't match? Is there code somewhere else to handle the casting? (I don't see any unit tests that cover the case, either)

Any thoughts on this, @superserious-dev ?

Currently this PR only covers the case where the user has not requested a specific output type. I noticed there was a separate issue/discussion #8086 for casting, so I didn't implement it here. However I'd be fine with amending this PR to include the casting behavior for numeric types, if that is preferred.

For reference, the code for determining typed output(as_type is not None) vs. variant output(as_type is None) is here:

pub(crate) fn instantiate_output_builder<'a>(
options: GetOptions<'a>,
) -> Result<Box<dyn OutputBuilder + 'a>> {
let GetOptions {
as_type,
path,
cast_options,
} = options;
let Some(as_type) = as_type else {
return Ok(Box::new(VariantOutputBuilder::new(path)));
};
// handle typed output
match as_type.data_type() {
DataType::Int32 => Ok(Box::new(PrimitiveOutputBuilder::<Int32Type>::new(
path,
as_type,
cast_options,
))),
dt => Err(ArrowError::NotYetImplemented(format!(
"variant_get with as_type={dt} is not implemented yet",
))),
}
}
and it's called in variant_get here:
let output_builder = instantiate_output_builder(options.clone())?;

There's a set of pre-existing tests that cover the specific case of int32 -> int32. This could be macro-ified for coverage of all the different combinations of numeric types, eg. int32->int64.
/// Shredding: extract a value as an Int32Array
#[test]
fn get_variant_shredded_int32_as_int32_safe_cast() {
// Extract the typed value as Int32Array
let array = shredded_int32_variant_array();
// specify we want the typed value as Int32
let field = Field::new("typed_value", DataType::Int32, true);
let options = GetOptions::new().with_as_type(Some(FieldRef::from(field)));
let result = variant_get(&array, options).unwrap();
let expected: ArrayRef = Arc::new(Int32Array::from(vec![
Some(34),
None,
None, // "n/a" is not an Int32 so converted to null
Some(100),
]));
assert_eq!(&result, &expected)
}
/// Shredding: extract a value as an Int32Array, unsafe cast (should error on "n/a")
#[test]
fn get_variant_shredded_int32_as_int32_unsafe_cast() {
// Extract the typed value as Int32Array
let array = shredded_int32_variant_array();
let field = Field::new("typed_value", DataType::Int32, true);
let cast_options = CastOptions {
safe: false, // unsafe cast
..Default::default()
};
let options = GetOptions::new()
.with_as_type(Some(FieldRef::from(field)))
.with_cast_options(cast_options);
let err = variant_get(&array, options).unwrap_err();
// TODO make this error message nicer (not Debug format)
assert_eq!(err.to_string(), "Cast error: Failed to extract primitive of type Int32 from variant ShortString(ShortString(\"n/a\")) at path VariantPath([])");
}

@scovich
Copy link
Contributor

scovich commented Aug 28, 2025

Can use variant_get for shredded numeric types

What happens if variant_get encounters a typed_value: LONG at the requested path, but the caller requested Some(Int32) type that doesn't match? Is there code somewhere else to handle the casting? (I don't see any unit tests that cover the case, either)

Any thoughts on this, @superserious-dev ?

Currently this PR only covers the case where the user has not requested a specific output type.

Ah, that makes it easy then!

Thanks for the clear explanation.

@alamb
Copy link
Contributor

alamb commented Sep 4, 2025

Sorry for the delay here. I was out for the last week but I am back now and catching up on PRs. I plan to merge this once CI passes

@alamb alamb merged commit 25cc570 into apache:main Sep 4, 2025
12 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 4, 2025

Thanks again @superserious-dev and @scovich

@superserious-dev superserious-dev deleted the typed-access-numeric-shredded branch September 4, 2025 23:38
alamb added a commit that referenced this pull request Sep 8, 2025
# Which issue does this PR close?

- Closes #8150
- closes #8166

# Rationale for this change

Add support for extracting fields from both shredded and non-shredded
variant arrays at any depth (like "x", "a.x", "a.b.x") and casting them
to Int32 with proper NULL handling for type mismatches.

NOTE: This is a second attempt at  
* #8166 
which suffered strong logical conflicts with
* #8179
and which itself grew out of
* #8122

See the other two PR for the vast majority of review commentary relating
to this change.

I started from the original PR commits (first three), performed the
merge, and fixed up a bunch of issues.

Manually diffing the before
([76b75ee..882aa4d](https://github.com/apache/arrow-rs/compare/76b75eebc..882aa4d69))
and after
([0ba91ae..f6fd915](https://github.com/apache/arrow-rs/compare/0ba91aed9..f6fd91583))
diffs gives the following non-trivial differences vs. the original PR
* Ran `cargo fmt`
* `typed_value_to_variant` now supports all primitive numeric types
(previously only int16)
* cast options plumbed through and respected
* Fix a null buffer bug in `shredded_get_path` -- the original code was
wrongly unioning in the null buffer from `typed_value` column:
    ```patch
// Path exhausted! Create a new `VariantArray` for the location we
landed on.
- // Also union nulls from the final typed_value field we landed on
    - if let Some(typed_value) = shredding_state.typed_value_field() {  
    -     accumulated_nulls = arrow::buffer::NullBuffer::union(         
    -         accumulated_nulls.as_ref(),    
    -         typed_value.nulls(),         
    -     );                  
    - }                
    let target = make_target_variant(    
        shredding_state.value_field().cloned(),     
        shredding_state.typed_value_field().cloned(),   
        accumulated_nulls,               
    );     
    ```
* Remove the `get_variant_perfectly_shredded_int32_as_variant` test
case, because #8179 introduced a
battery of unit tests that cover the same functionality.
* Remove now-unnecessary `.unwrap()` calls from object builder `finish`
calls in unit tests
* Fixed broken test code in `create_depth_1_shredded_test_data_working`,
which captured the return value of a nested builder's `finish` (`()`)
instead of the return value of the top-level builder. I'm not quite sure
what this code was trying to do, but I changed it to just create a
nested builder instead of a second top-level builder:
    ```patch
      fn create_depth_1_shredded_test_data_working() -> ArrayRef {
// Create metadata following the working pattern from
shredded_object_with_x_field_variant_array
          let (metadata, _) = { 
    -         let a_variant = {
- let mut a_builder = parquet_variant::VariantBuilder::new();
    -             let mut a_obj = a_builder.new_object();
- a_obj.insert("x", Variant::Int32(55)); // "a.x" field (shredded when
possible)
    -             a_obj.finish().unwrap()
    -         }; 
              let mut builder = parquet_variant::VariantBuilder::new();
              let mut obj = builder.new_object();
    -         obj.insert("a", a_variant);
    +
    +         // Create the nested "a" object 
    +         let mut a_obj = obj.new_object("a");
    +         a_obj.insert("x", Variant::Int32(55)); 
    +         a_obj.finish();
    +
              obj.finish().unwrap();
              builder.finish()
          };
    ```
* Similar fix (twice, `a_variant` and `b_variant`) for
`create_depth_2_shredded_test_data_working`
* `make_shredding_row_builder` now supports signed int and float types
(unsigned int not supported yet)
* A new `get_type_name` helper in row_builder.rs that gives
human-readable data type names. I'm not convinced it's necessary (and
the code is in the wrong spot, jammed in the middle of
`VariantAsPrimitive` code.
* `impl VariantAsPrimitive` for all signed int and float types
* `PrimitiveVariantShreddingRowBuilder` now has a lifetime param because
it takes a reference to cast options (it now respects unsafe vs. safe
casting)

# What changes are included in this PR?

Everything in the original PR, plus merge in the main branch, fix
logical conflicts and fix various broken tests.

# Are these changes tested?

All unit tests now pass.

# Are there any user-facing changes?

No (variant is not public yet)

---------

Co-authored-by: carpecodeum <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Support typed access for numeric types in variant_get

3 participants