-
Notifications
You must be signed in to change notification settings - Fork 1k
Variant: Write Variant Values as JSON #7670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
FYI @scovich @PinkCrow007 and @mkarbo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @carpecodeum -- this is very cool and welcome to the community!
I think we need to resolve the following items before merging this PR:
- Add unit tests for writing objects and lists to json
- Get the CI tests passing (looks like there are some errors with
cargo clippy
andcargo fmt
) - Figure out the conflict with the variant/array PR from @scovich #7666
Something I think is important to resolve soon (I can help here) is making the serde_json depenency "opt in" (aka not required)
In terms of sorting out the conflict with @scovich 's PR, I think we should try and get that one mergable asap and merge it first and then get this one
Given that you are now familiar with Variant, perhaps you can help review #7666 ? I am sure @scovich would love some more feedback
parquet-variant/src/variant.rs
Outdated
todo!(); | ||
#[allow(unreachable_code)] // Just to infer the return type | ||
Ok(vec![].into_iter()) | ||
let len = self.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be resolved now, i hope
[dependencies] | ||
arrow-schema = "55.1.0" | ||
chrono = { workspace = true } | ||
serde_json = "1.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we could avoid adding these dependencies if possible as serdejson brings non trivial code.
This probably looks like adding a optional feature flag (like serde-json) or something
I think we can do it as a follow on PR though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
convert different Variant types to JSON format, including primitives, decimals, dates, timestamps, and binary data with proper escaping and base64 encoding.
There's a very large (and sadly completely unspecified) design space for how to represent some variant subtypes as JSON. Date/time/timestamp and binary being the most glaring examples. I thought spark documented some loose suggestions for best practices, but I can't find it anywhere now.
Storing binary data as base64, in particular, is a non-portable design choice in the sense that other systems might choose to use a different base64 dialect, or even something more exotic like z85 that pays 5/4 overhead instead of 4/3.
For date/time/timestamp values, one could store as numeric (days/µs/ns since unix epoch) or as a formatted string (ISO? or something else?), and there's a decent chance most readers could handle either one. But still technically non-portable.
I don't know the best solution -- maybe this PR is it -- but I wanted to point out the conundrum.
a simple interface between the Variant implementation and the Serde JSON library.
What is the thinking about serde JSON vs. arrow-json here? Do we plan to eventually support both and this is just the easier first step? (the arrow-json tape decoder is definitely a bit more complex than the serde JSON Value
, but both implementations would have the same "shape" -- see e.g. #7403 that prototypes both)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could probably just call it to_json.rs
? We already know it's variant?
Also, is there a reason this is under the encoder
module? It seems like its own thing?
let divisor = 10_i64.pow(*scale as u32); | ||
let decimal_value = *integer as f64 / divisor as f64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f64
(53-bit precision) cannot accurately represent all decimal8 values (63-bit precision), let alone decimal16 (127-bit precision). For example, round tripping a value like 0x4000_0000_0000_ffff
from i64 to f64 and loses information about the least-significant bits, producing 0x4000_0000_0000_fe48
. See playground.
I have no idea why it clips to that specific value, but you get the idea.
Instead, we will need to extract the quotient and remainder resulting from division by an integer power of ten, and then print the two parts back to back with e.g. "{quotient}.{remainder}"
.
Once we have that capability, we should probably also use it for decimal4 values, just for uniformity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally -- floating point (base 2) can't accurately represent powers of ten (and both bases have different errors when representing e.g. powers of 3), so even a decimal4 could potentially lose information through a round trip. Tho I think most floating point parsers work very hard to avoid exposing that kind of inaccuracy, by ensuring that every representable value round trips cleanly (I have vague memories of seeing a bug report about a case that slipped past the java floating point parser several years ago).
fn convert_array_to_json<W: Write>( | ||
buffer: &mut W, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Any particular reason not to use impl
in these signatures?
fn convert_array_to_json<W: Write>( | |
buffer: &mut W, | |
fn convert_array_to_json( | |
buffer: &mut impl Write, |
Variant::Decimal8 { integer, scale } => { | ||
let divisor = 10_i64.pow(*scale as u32); | ||
let decimal_value = *integer as f64 / divisor as f64; | ||
serde_json::Number::from_f64(decimal_value) | ||
.map(Value::Number) | ||
.ok_or_else(|| ArrowError::InvalidArgumentError("Invalid decimal value".to_string())) | ||
} | ||
Variant::Decimal16 { integer, scale } => { | ||
let divisor = 10_i128.pow(*scale as u32); | ||
let decimal_value = *integer as f64 / divisor as f64; | ||
serde_json::Number::from_f64(decimal_value) | ||
.map(Value::Number) | ||
.ok_or_else(|| ArrowError::InvalidArgumentError("Invalid decimal value".to_string())) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as above... f64 cannot accurately represent all decimal8/decimal16 values. But f64 might be the best we can do here, because serde_json::Number
is compatible with the JSON spec which states:
This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as
1E400
or3.141592653589793238462643383279
may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.Note that when such software is used, numbers that are integers and are in the range
[-(2**53)+1, (2**53)-1]
are interoperable in the sense that implementations will agree exactly on their numeric values.
(it actually goes above and beyond by allowing to capture i64/u64, in addition to f64)
Variant::Date(date) => Ok(Value::String(date.format("%Y-%m-%d").to_string())), | ||
Variant::TimestampMicros(ts) => Ok(Value::String(ts.to_rfc3339())), | ||
Variant::TimestampNtzMicros(ts) => Ok(Value::String(ts.format("%Y-%m-%dT%H:%M:%S%.6f").to_string())), | ||
Variant::Binary(bytes) => Ok(Value::String(general_purpose::STANDARD.encode(bytes))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated logic. Probably better to define helper functions to completely isolate the logic from callers. At a minimum, we should hoist out those format strings as named const
that can be reused, since they're especially error-prone.
let mut map = serde_json::Map::new(); | ||
let fields = obj.fields()?; | ||
|
||
for (key, value) in fields { | ||
let json_value = variant_to_json_value(&value)?; | ||
map.insert(key.to_string(), json_value); | ||
} | ||
|
||
Ok(Value::Object(map)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a profitable place to use Iterator::collect
?
let mut map = serde_json::Map::new(); | |
let fields = obj.fields()?; | |
for (key, value) in fields { | |
let json_value = variant_to_json_value(&value)?; | |
map.insert(key.to_string(), json_value); | |
} | |
Ok(Value::Object(map)) | |
let fields = obj | |
.fields()?. | |
.map(|(key, value)| { | |
Ok((key.to_string(), variant_to_json_value(&value)?)) | |
}) | |
.collect::<Result<_, _>>()?; | |
Ok(Value::Object(fields)) |
(again below for Variant::Array
, using values()
method)
7c9c341
to
449bafe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contrbution @carpecodeum and the reviews @scovich
My comments are mostly about consolidating the tests, improving the examples, and more clearly separating the serde_json dependencies
In my opinion, we could merge this PR as is and keep iterating in follow on PRs.
Let me know what you prefer
fn test_buffer_writing() { | ||
use parquet_variant::variant_to_json; | ||
use std::io::Write; | ||
|
||
let variant = Variant::String("test buffer writing"); | ||
|
||
// Test writing to a Vec<u8> | ||
let mut buffer = Vec::new(); | ||
variant_to_json(&mut buffer, &variant).unwrap(); | ||
let result = String::from_utf8(buffer).unwrap(); | ||
assert_eq!(result, "\"test buffer writing\""); | ||
|
||
// Test writing to a custom writer | ||
struct CustomWriter { | ||
data: Vec<u8>, | ||
} | ||
|
||
impl Write for CustomWriter { | ||
fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> { | ||
self.data.extend_from_slice(buf); | ||
Ok(buf.len()) | ||
} | ||
|
||
fn flush(&mut self) -> std::io::Result<()> { | ||
Ok(()) | ||
} | ||
} | ||
|
||
let mut custom_writer = CustomWriter { data: Vec::new() }; | ||
variant_to_json(&mut custom_writer, &variant).unwrap(); | ||
let result = String::from_utf8(custom_writer.data).unwrap(); | ||
assert_eq!(result, "\"test buffer writing\""); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Vec
already implements Write
I think you can simplify test to something like
fn test_buffer_writing() { | |
use parquet_variant::variant_to_json; | |
use std::io::Write; | |
let variant = Variant::String("test buffer writing"); | |
// Test writing to a Vec<u8> | |
let mut buffer = Vec::new(); | |
variant_to_json(&mut buffer, &variant).unwrap(); | |
let result = String::from_utf8(buffer).unwrap(); | |
assert_eq!(result, "\"test buffer writing\""); | |
// Test writing to a custom writer | |
struct CustomWriter { | |
data: Vec<u8>, | |
} | |
impl Write for CustomWriter { | |
fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> { | |
self.data.extend_from_slice(buf); | |
Ok(buf.len()) | |
} | |
fn flush(&mut self) -> std::io::Result<()> { | |
Ok(()) | |
} | |
} | |
let mut custom_writer = CustomWriter { data: Vec::new() }; | |
variant_to_json(&mut custom_writer, &variant).unwrap(); | |
let result = String::from_utf8(custom_writer.data).unwrap(); | |
assert_eq!(result, "\"test buffer writing\""); | |
} | |
fn test_buffer_writing() { | |
use parquet_variant::variant_to_json; | |
use std::io::Write; | |
let variant = Variant::String("test buffer writing"); | |
// Test writing to a Vec<u8> | |
let mut buffer = Vec::new(); | |
variant_to_json(&mut buffer, &variant).unwrap(); | |
let result = String::from_utf8(buffer).unwrap(); | |
assert_eq!(result, "\"test buffer writing\""); | |
let mut buffer = vec![]; | |
variant_to_json(&mut buffer, &variant).unwrap(); | |
let result = String::from_utf8(buffer).unwrap(); | |
assert_eq!(result, "\"test buffer writing\""); | |
} |
|
||
#[test] | ||
fn test_json_roundtrip_compatibility() { | ||
// Test that our JSON output can be parsed back by serde_json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and test_comprehensive_roundtrip_compatibility are great tests
One thing that might make it easier to evaluate what was covered would be to combine the different tests into a structure as follows, which would also make the values more self descrbing
/// Each Test has the equivalent variant, json string and serde_json::Value
Test {
variant: Variant,
json: &'static str,
value: serde_json::Value,
}
And then you could write a driver function
impl Test {
fn run(self) {
// verify that converting the variant --> Json matches json and value
// also verify that converting back to variant results in the same variant
}
}
And then the tests in this file could look something like
Test {
variant: Variant::String("simple string"),
json: "\"simple string\"",
value: Value::from("simple string")
}.run()
use parquet_variant::{variant_to_json, variant_to_json_string, variant_to_json_value, Variant}; | ||
|
||
fn main() -> Result<(), Box<dyn std::error::Error>> { | ||
let variants = vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great.
Here are some suggestions:
- Make this a doc comment / example on
variant_to_json
/variant_to_json_string
-- that way it will be easier to discover as well as being run as part of the CI / tests of this crate - Reduce the example to a single API example for each function -- for example, I wouldn't bother with all the different types and corner cases here (those are great unit tests, but might obscure the API use from new users). Instead I think an example that made a single Variant and convert it to JSON, and then assert the output would be the best as it would quickly show how to use the API and what it does
/// use parquet_variant::{Variant, variant_to_json}; | ||
/// use arrow_schema::ArrowError; | ||
/// | ||
/// fn example() -> Result<(), ArrowError> { | ||
/// let variant = Variant::Int8(42); | ||
/// let mut buffer = Vec::new(); | ||
/// variant_to_json(&mut buffer, &variant)?; | ||
/// assert_eq!(String::from_utf8(buffer).unwrap(), "42"); | ||
/// Ok(()) | ||
/// } | ||
/// example().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can make this significantly shorter like this -- the #
in the front hides the line when rendered as docs and the test driver will make the stub program necessary
/// use parquet_variant::{Variant, variant_to_json}; | |
/// use arrow_schema::ArrowError; | |
/// | |
/// fn example() -> Result<(), ArrowError> { | |
/// let variant = Variant::Int8(42); | |
/// let mut buffer = Vec::new(); | |
/// variant_to_json(&mut buffer, &variant)?; | |
/// assert_eq!(String::from_utf8(buffer).unwrap(), "42"); | |
/// Ok(()) | |
/// } | |
/// example().unwrap(); | |
/// # use parquet_variant::{Variant, variant_to_json}; | |
/// # use arrow_schema::ArrowError; | |
/// let variant = Variant::Int8(42); | |
/// let mut buffer = Vec::new(); | |
/// variant_to_json(&mut buffer, &variant)?; |
Also I think it would be awesome now to show an example of a VariantObject here (which is the most likely type that people will encode)
We could use the new builder that @PinkCrow007 added in #7653
Here is an example:
arrow-rs/parquet-variant/src/builder.rs
Lines 90 to 124 in fe65b8d
/// # Example: Create a [`Variant::Object`] | |
/// | |
/// This example shows how to create an object with two fields: | |
/// ```json | |
/// { | |
/// "first_name": "Jiaying", | |
/// "last_name": "Li" | |
/// } | |
/// ``` | |
/// | |
/// ``` | |
/// # use parquet_variant::{Variant, VariantBuilder}; | |
/// let mut builder = VariantBuilder::new(); | |
/// // Create an object builder that will write fields to the object | |
/// let mut object_builder = builder.new_object(); | |
/// object_builder.append_value("first_name", "Jiaying"); | |
/// object_builder.append_value("last_name", "Li"); | |
/// object_builder.finish(); | |
/// // Finish the builder to get the metadata and value | |
/// let (metadata, value) = builder.finish(); | |
/// // use the Variant API to verify the result | |
/// let variant = Variant::try_new(&metadata, &value).unwrap(); | |
/// let Variant::Object(variant_object) = variant else { | |
/// panic!("unexpected variant type") | |
/// }; | |
/// assert_eq!( | |
/// variant_object.field_by_name("first_name").unwrap(), | |
/// Some(Variant::ShortString("Jiaying")) | |
/// ); | |
/// assert_eq!( | |
/// variant_object.field_by_name("last_name").unwrap(), | |
/// Some(Variant::ShortString("Li")) | |
/// ); | |
/// ``` | |
/// |
|
||
use crate::variant::{Variant, VariantList, VariantObject}; | ||
|
||
/// Converts a Variant to JSON and writes it to the provided buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Converts a Variant to JSON and writes it to the provided buffer | |
/// Converts a Variant to JSON and writes it to the provided `Write` |
/// | ||
/// # Example | ||
/// | ||
/// ```rust |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as above
Ok(()) | ||
} | ||
|
||
/// Convert Variant to JSON string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth commenting that this is the same as calling variant_to_json
with a Vec
/// | ||
/// # Example | ||
/// | ||
/// ```rust |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
//! Integration tests for JSON conversion functionality |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the difference in coverage provided by these tests compared to the tests that are in the parquet-variant/src/encoder/variant_to_json.rs? If there isn't any substantial difference in what is covered I think combing the tests in parquet-variant/src/encoder/variant_to_json.rs
makes the most send
I also found there seems to be quite a bit of overlap in test coverage -- reducing the repetition I think would be valuable so it is easier to understand what is covered and what is not
if let Value::Object(obj) = parsed { | ||
assert_eq!(obj.len(), 6); | ||
assert_eq!( | ||
obj.get("string_field"), | ||
Some(&Value::String("test_string".to_string())) | ||
); | ||
assert_eq!(obj.get("int_field"), Some(&Value::Number(123.into()))); | ||
assert_eq!(obj.get("bool_field"), Some(&Value::Bool(true))); | ||
assert!(matches!(obj.get("float_field"), Some(Value::Number(_)))); | ||
assert_eq!(obj.get("null_field"), Some(&Value::Null)); | ||
assert_eq!(obj.get("long_field"), Some(&Value::Number(999.into()))); | ||
} else { | ||
panic!("Expected JSON object"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW if you like less intending you could use this pattern too (just pointing this out)
if let Value::Object(obj) = parsed { | |
assert_eq!(obj.len(), 6); | |
assert_eq!( | |
obj.get("string_field"), | |
Some(&Value::String("test_string".to_string())) | |
); | |
assert_eq!(obj.get("int_field"), Some(&Value::Number(123.into()))); | |
assert_eq!(obj.get("bool_field"), Some(&Value::Bool(true))); | |
assert!(matches!(obj.get("float_field"), Some(Value::Number(_)))); | |
assert_eq!(obj.get("null_field"), Some(&Value::Null)); | |
assert_eq!(obj.get("long_field"), Some(&Value::Number(999.into()))); | |
} else { | |
panic!("Expected JSON object"); | |
} | |
let Value::Object(obj) = parsed else { | |
panic!("Expected JSON object"); | |
} | |
assert_eq!(obj.len(), 6); | |
assert_eq!( | |
obj.get("string_field"), | |
Some(&Value::String("test_string".to_string())) | |
); | |
assert_eq!(obj.get("int_field"), Some(&Value::Number(123.into()))); | |
assert_eq!(obj.get("bool_field"), Some(&Value::Bool(true))); | |
assert!(matches!(obj.get("float_field"), Some(Value::Number(_)))); | |
assert_eq!(obj.get("null_field"), Some(&Value::Null)); | |
assert_eq!(obj.get("long_field"), Some(&Value::Number(999.into()))); |
79e63f2
to
8b63662
Compare
@alamb @scovich Thank you so much guys for your time and dedication for reviewing this PR, I have tried to fix all the suggestions given by you guys, Sorry for the delay in resolving all of them. If I missed anything please let me know I can probably fix that, as far as I am concerned, if you guys think this PR is good to be merged then I am also open to merging it, again thank you for your efforts |
as far as the serde_json dependency issues are concerend, I would like to discuss more on that perhaps, if we can connect for that @alamb ? |
Sounds good -- thank you @carpecodeum -- I plan to give this one another review later today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR is I think this is great -- thank you so much @carpecodeum -- it is very well tested and I think the APIs make sense to me.
I will file a follow on ticket about the serde_json dependency and then I'll plan to merge it tomorrow.
THanks again
Please let me know if anyone else would like additional time to review this
const MAX_SHORT_STRING_BYTES: usize = 0x3F; | ||
|
||
/// A Variant [`ShortString`] | ||
/// Represents a variant array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is incorrect
/// Represents a variant array. | |
/// Represents a [`Variant::ShortString`]. |
|
||
/// Convert Variant to serde_json::Value | ||
/// | ||
/// This function converts a Variant to a [`serde_json::Value`], which is useful |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
} | ||
|
||
#[test] | ||
fn test_primitive_json_conversion() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is great
parquet-variant/src/to_json.rs
Outdated
// Parse the JSON to verify structure - handle JSON parsing errors manually | ||
let parsed: Value = serde_json::from_str(&json) | ||
.map_err(|e| ArrowError::ParseError(format!("JSON parse error: {}", e)))?; | ||
let Value::Object(obj) = parsed else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we added a nicer api here recently (this morning):
let Value::Object(obj) = parsed else { | |
let obj = parsed.as_ob().unwrap() |
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
//! Example showing how to convert Variant values to JSON |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be great to include this inline in the docs rather than a standalone example, but I will make a follow on PR to do so
@carpecodeum I can not push to this branch directly as i don't have access to the cmu-db fork, which is fine I made a PR to update this PR from main and fix a logical conflict: Can you please check it out when you have some time? |
Close/reopen this PR to get CI to re-run |
The CI failures should be fixed if this is merged: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
One question -- how to render Inf/NaN float values? I think they should turn to Value::Null
instead of throwing an error, see #7670 (comment)
Also, a bunch of opportunities to simplify quite a bit.
parquet-variant/src/variant.rs
Outdated
impl From<bool> for Variant<'_, '_> { | ||
fn from(value: bool) -> Self { | ||
match value { | ||
true => Variant::BooleanTrue, | ||
false => Variant::BooleanFalse, | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why move it here, out of curiosity? (between decimal4 and decimal8)? Was there some merge conflict?
(the original location wasn't amazing either... but if we're going to move it why not move to near the top, below the impl From<()>
?)
parquet-variant/src/to_json.rs
Outdated
Variant::Null => { | ||
write!(json_buffer, "null")?; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write!
returns Result<()>
, so should be able to simplify a bunch of these:
Variant::Null => { | |
write!(json_buffer, "null")?; | |
} | |
Variant::Null => write!(json_buffer, "null")?, |
parquet-variant/src/to_json.rs
Outdated
if *scale == 0 { | ||
write!(json_buffer, "{}", integer)?; | ||
} else { | ||
let divisor = 10_i32.pow(*scale as u32); | ||
let quotient = integer / divisor; | ||
let remainder = (integer % divisor).abs(); | ||
let formatted_remainder = format!("{:0width$}", remainder, width = *scale as usize); | ||
let trimmed_remainder = formatted_remainder.trim_end_matches('0'); | ||
if trimmed_remainder.is_empty() { | ||
write!(json_buffer, "{}", quotient)?; | ||
} else { | ||
write!(json_buffer, "{}.{}", quotient, trimmed_remainder)?; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is begging for a generic method... three syntactically ~identical copies of some rather complex code. I don't know a good way to do it, tho
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should work:
fn write_decimal(
json_buffer: &mut impl Write,
integer: impl Into<i128>,
scale: u8,
) -> Result<(), ArrowError> {
let integer = integer.into();
// ... the current Decimal16 logic ...
}
and then all three call sites look the same:
write_decimal(json_buffer, *integer, *scale)
parquet-variant/src/to_json.rs
Outdated
let formatted_remainder = format!("{:0width$}", remainder, width = *scale as usize); | ||
let trimmed_remainder = formatted_remainder.trim_end_matches('0'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is clever enough to merit a code comment? Given a remainder like 100
:
- The
{remainder:0width$}
format ensures it correctly renders with leading zeros (e.g..00000100
) - The
trime_end_matches
then strips away any trailing zeros (e.g..000001
)
parquet-variant/src/to_json.rs
Outdated
let remainder = (integer % divisor).abs(); | ||
let formatted_remainder = format!("{:0width$}", remainder, width = *scale as usize); | ||
let trimmed_remainder = formatted_remainder.trim_end_matches('0'); | ||
if trimmed_remainder.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simpler to check for remainder == 0
, before formatting?
if remainder == 0 {
write!(json_buffer, "{quotient}")?;
} else {
let remainder = format!("{remainder:0width$}", width = *scale as usize)
let remainder = remainder.trim_end_matches('0');
write!(json_buffer, "{quotient}.{remainder}")?;
}
let mut buffer = Vec::new(); | ||
variant_to_json(&mut buffer, variant)?; | ||
String::from_utf8(buffer) | ||
.map_err(|e| ArrowError::InvalidArgumentError(format!("UTF-8 conversion error: {}", e))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String also implements Write:
let mut buffer = Vec::new(); | |
variant_to_json(&mut buffer, variant)?; | |
String::from_utf8(buffer) | |
.map_err(|e| ArrowError::InvalidArgumentError(format!("UTF-8 conversion error: {}", e))) | |
let mut string = String::new(); | |
variant_to_json(&mut string, variant)?; | |
Ok(string) |
parquet-variant/src/to_json.rs
Outdated
Variant::Int16(i) => Ok(Value::Number((*i).into())), | ||
Variant::Int32(i) => Ok(Value::Number((*i).into())), | ||
Variant::Int64(i) => Ok(Value::Number((*i).into())), | ||
Variant::Float(f) => serde_json::Number::from_f64(*f as f64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variant::Float(f) => serde_json::Number::from_f64(*f as f64) | |
Variant::Float(f) => serde_json::Number::from_f64((*f).into()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should NaN/Inf be rendered as Variant::Null
? Seems a bit harsh to just fail the conversion of a deeply nested variant because a stray divide by zero slipped in somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: impl From<f64> for serde_json::Value
converts them to Value::Null
.
parquet-variant/src/to_json.rs
Outdated
format!("{}.{}", quotient, trimmed_remainder) | ||
}; | ||
|
||
// Parse as serde_json::Number to preserve precision |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting conundrum... I don't think parsing actually preserves precision at all, other than the difference between f64
and i64
? Also, serde_json::Value
provides an extensive set of impl From
:
let integer = if *scale == 0 {
*integer
} else {
let divisor = ...;
if integer % divisor != 0 {
// fall back to floating point
return Ok(Value::from(integer as f64 / divisor as f64))
}
integer / divisor
}
Ok(Value::from(integer));
The above works for i32 and i64, but i128 would need somewhat different handling of the final integer
:
// Prefer to emit as i64, but fall back to u64 or even f64 (lossy) if necessary
let value = i64::try_from(integer)
.map(Value::from)
.or_else(|| u64::try_from(integer))
.map(Value::from)
.unwrap_or_else(|| Value::from(integer as f64));
Ok(value)
parquet-variant/src/to_json.rs
Outdated
let mut vec = Vec::new(); | ||
let len = arr.len(); | ||
|
||
for i in 0..len { | ||
let element = arr.get(i)?; | ||
let json_value = variant_to_json_value(&element)?; | ||
vec.push(json_value); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to above:
let mut vec = Vec::new(); | |
let len = arr.len(); | |
for i in 0..len { | |
let element = arr.get(i)?; | |
let json_value = variant_to_json_value(&element)?; | |
vec.push(json_value); | |
} | |
let vec = arr | |
.iter() | |
.map(|element| variant_to_json_value(&element)) | |
.collect::<Result<_, _>>(); |
parquet-variant/src/to_json.rs
Outdated
let mut map = serde_json::Map::new(); | ||
|
||
for (key, value) in obj.iter() { | ||
let json_value = variant_to_json_value(&value)?; | ||
map.insert(key.to_string(), json_value); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
serde_json::Map
implements FromIterator:
let mut map = serde_json::Map::new(); | |
for (key, value) in obj.iter() { | |
let json_value = variant_to_json_value(&value)?; | |
map.insert(key.to_string(), json_value); | |
} | |
let map = obj | |
.iter | |
.map(|(k, v)| Ok((k.to_string(), variant_to_json_value(&v)?))) | |
.collect::<Result<_, _>>()?; |
…modify the examples to a single function
6e65287
to
d8f43f2
Compare
Given the size and history of this PR I think we should merge it and keep iterating on main. I plan to do so later today |
Thanks again everyone. I am working on filing follow on tickets Update: |
# Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Follow on to #7670 from @carpecodeum - Part of #6736 # Rationale for this change I was going through the code and examples and I felt that there was some redundancy and that the example file was unlikely to be found as not many crates in this repo have examples I would like to propose moving the examples as close to the actual code as possible to give it the best chance to be discovered. # What changes are included in this PR? 1. Remove `parquet-variant/examples/variant_to_json_examples.rs` 2. Update some of the other examples and docs for the json functions with content from that example # Are these changes tested? The examples are covered by CI tests. # Are there any user-facing changes? Different docs
# Which issue does this PR close? * Part of #6736 # Rationale for this change Follow-up to #7670, which accidentally introduced a lossy to-string conversion for variant decimals. # What changes are included in this PR? Use integer + string operations to convert decimal values to string, instead of floating point that could lose precision. Also, the `VariantDecimalXX` structs now `impl Display`, which greatly simplifies the to-json path. A new (private) macro encapsulates the to-string logic, since it's syntactically identical for all three decimal sizes. # Are these changes tested? Yes, new unit tests added. # Are there any user-facing changes? The `VariantDecimalXX` structs now `impl Display` --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes Variant: Write Variant Values as JSON #7426
Part of [EPIC] [Parquet] Implement Variant type support in Parquet #6736
Rationale for this change
This is an initial version, serving as a simple interface between the Variant implementation and the Serde JSON library.
A huge thank you to @PinkCrow007, @mprammer, @alamb, the rest of the CMU variant team, and everyone else we've interacted with who has helped me get started with contributing to this project. This is my first Arrow-related PR, and I thank you all for your insight and support.
What changes are included in this PR?
This PR implements a comprehensive JSON conversion API for Variant types with three main functions (
variant_to_json
,variant_to_json_string
, andvariant_to_json_value
) that convert different Variant types to JSON format, including primitives, decimals, dates, timestamps, and binary data with proper escaping and base64 encoding. The implementation adds missing methods toVariantObject
andVariantArray
for field/element access, includes two new dependencies (serde_json
andbase64
), and provides comprehensive test coverage with unit, integration, and documentation test suites.Open to input for improving any part of this implementation.
Are there any user-facing changes?
The new API's added in parquet-variant will be user-facing.