-
Notifications
You must be signed in to change notification settings - Fork 1k
[Varint] Implement ShreddingState::AllNull variant #8093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 7 commits
1ceeae9
bd43fbf
1120b07
b5d4c5d
b237cb1
6705cb5
d266031
4b52a03
b16c95b
f790be7
987d112
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -187,6 +187,7 @@ impl VariantArray { | |||||||
| typed_value_to_variant(typed_value, index) | ||||||||
| } | ||||||||
| } | ||||||||
| ShreddingState::AllNull { .. } => Variant::Null, | ||||||||
| } | ||||||||
| } | ||||||||
|
|
||||||||
|
|
@@ -226,8 +227,6 @@ impl VariantArray { | |||||||
| /// [Parquet Variant Shredding Spec]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding | ||||||||
| #[derive(Debug)] | ||||||||
| pub enum ShreddingState { | ||||||||
| // TODO: add missing state where there is neither value nor typed_value | ||||||||
| // Missing { metadata: BinaryViewArray }, | ||||||||
| /// This variant has no typed_value field | ||||||||
| Unshredded { | ||||||||
| metadata: BinaryViewArray, | ||||||||
|
|
@@ -250,6 +249,8 @@ pub enum ShreddingState { | |||||||
| value: BinaryViewArray, | ||||||||
| typed_value: ArrayRef, | ||||||||
| }, | ||||||||
| /// All values are null, only metadata is present | ||||||||
| AllNull { metadata: BinaryViewArray }, | ||||||||
| } | ||||||||
|
|
||||||||
| impl ShreddingState { | ||||||||
|
|
@@ -270,9 +271,7 @@ impl ShreddingState { | |||||||
| metadata, | ||||||||
| typed_value, | ||||||||
| }), | ||||||||
| (_metadata_field, None, None) => Err(ArrowError::InvalidArgumentError(String::from( | ||||||||
| "VariantArray has neither value nor typed_value field", | ||||||||
| ))), | ||||||||
| (metadata, None, None) => Ok(Self::AllNull { metadata }), | ||||||||
| } | ||||||||
| } | ||||||||
|
|
||||||||
|
|
@@ -282,6 +281,7 @@ impl ShreddingState { | |||||||
| ShreddingState::Unshredded { metadata, .. } => metadata, | ||||||||
| ShreddingState::Typed { metadata, .. } => metadata, | ||||||||
| ShreddingState::PartiallyShredded { metadata, .. } => metadata, | ||||||||
| ShreddingState::AllNull { metadata } => metadata, | ||||||||
| } | ||||||||
| } | ||||||||
|
|
||||||||
|
|
@@ -291,6 +291,7 @@ impl ShreddingState { | |||||||
| ShreddingState::Unshredded { value, .. } => Some(value), | ||||||||
| ShreddingState::Typed { .. } => None, | ||||||||
| ShreddingState::PartiallyShredded { value, .. } => Some(value), | ||||||||
| ShreddingState::AllNull { .. } => None, | ||||||||
| } | ||||||||
| } | ||||||||
|
|
||||||||
|
|
@@ -300,6 +301,7 @@ impl ShreddingState { | |||||||
| ShreddingState::Unshredded { .. } => None, | ||||||||
| ShreddingState::Typed { typed_value, .. } => Some(typed_value), | ||||||||
| ShreddingState::PartiallyShredded { typed_value, .. } => Some(typed_value), | ||||||||
| ShreddingState::AllNull { .. } => None, | ||||||||
| } | ||||||||
| } | ||||||||
|
|
||||||||
|
|
@@ -326,6 +328,9 @@ impl ShreddingState { | |||||||
| value: value.slice(offset, length), | ||||||||
| typed_value: typed_value.slice(offset, length), | ||||||||
| }, | ||||||||
| ShreddingState::AllNull { metadata } => ShreddingState::AllNull { | ||||||||
| metadata: metadata.slice(offset, length), | ||||||||
| }, | ||||||||
| } | ||||||||
| } | ||||||||
| } | ||||||||
|
|
@@ -434,15 +439,17 @@ mod test { | |||||||
| } | ||||||||
|
|
||||||||
| #[test] | ||||||||
| fn invalid_missing_value() { | ||||||||
| fn all_null_missing_value_and_typed_value() { | ||||||||
| let fields = Fields::from(vec![Field::new("metadata", DataType::BinaryView, false)]); | ||||||||
| let array = StructArray::new(fields, vec![make_binary_view_array()], None); | ||||||||
| // Should fail because the StructArray does not contain a 'value' field | ||||||||
| let err = VariantArray::try_new(Arc::new(array)); | ||||||||
| assert_eq!( | ||||||||
| err.unwrap_err().to_string(), | ||||||||
| "Invalid argument error: VariantArray has neither value nor typed_value field" | ||||||||
| ); | ||||||||
| // Should succeed and create an AllNull variant when neither value nor typed_value are present | ||||||||
| let variant_array = VariantArray::try_new(Arc::new(array)).unwrap(); | ||||||||
|
||||||||
| value | typed_value | Meaning |
|---|---|---|
| null | null | The value is missing; only valid for shredded object fields |
But maybe that's a validation VariantArray::try_new should perform, not ShreddingState::try_new?
Also, it would quickly become annoying if variant_get has to replace a missing or all-null value column with an all-Variant::Null column just to comply with the spec? Maybe that's why there's this additional tidbit?
If a Variant is missing in a context where a value is required, readers must return a Variant null (
00): basic type 0 (primitive) and physical type 0 (null). For example, if a Variant is required (likemeasurementabove) and bothvalueandtyped_valueare null, the returned value must be00(Variant null).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure that this case (where there is a value field present, but it is all null) should be treated as though there was no value field?
I haven't double checked the spec (probably the Arrow one) but this feels like it may be out of compliance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know, the spec's requirement that "both value and typed_value are null" meaning "the value is missing."
We can see from the spec's example:
refer to here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I got it. I will add a test for the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comprehensive test that demonstrates when a value field exists in the schema but contains all null values, it correctly remains in the Unshredded state rather than AllNull.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand, the treatment of NULL/NULL depends on context:
- For a top-level variant value, it's interpreted as
Variant::Null - For a shredded variant object field, it's interpreted as missing (SQL NULL)
So I guess there are two ways to get SQL NULL -- null mask on the struct(value, typed_value), or if both value and typed_value are themselves NULL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah -- the spec requires that the struct ("group") containing value and typed_value columns must be non-nullable:
The group for each named field must use repetition level
required. A field'svalueandtyped_valueare set to null (missing) to indicate that the field does not exist in the variant. To encode a field that is present with a [variant/JSON, not SQL]nullvalue, thevaluemust contain a Variant null: basic type 0 (primitive) and physical type 0 (null).
So effectively, the NULL/NULL combo becomes the null mask for that nested field. Which is why a top-level NULL/NULL combo is incorrect -- the top-level field already has its own nullability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tricky... see #8122 (comment)
Variant::Nullis arguably a correct way to compensate)Variant::Nullis arguably incorrect (causes SQLIS NULLoperator to returnFALSE). But we don't even have a way to return SQL NULL here (it would probably correspond toOption::None?)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are absolutely correct. Subsequently, we might need a value_opt() method that returns Option to properly handle SQL NULL semantics for shredded fields.