Skip to content

Conversation

carpecodeum
Copy link
Contributor

Which issue does this PR close?

Closes Variant: Write Variant Values as JSON #7426
Part of [EPIC] [Parquet] Implement Variant type support in Parquet #6736

Rationale for this change

This is an initial version, serving as a simple interface between the Variant implementation and the Serde JSON library.

A huge thank you to @PinkCrow007, @mprammer, @alamb, the rest of the CMU variant team, and everyone else we've interacted with who has helped me get started with contributing to this project. This is my first Arrow-related PR, and I thank you all for your insight and support.

What changes are included in this PR?

This PR implements a comprehensive JSON conversion API for Variant types with three main functions (variant_to_json, variant_to_json_string, and variant_to_json_value) that convert different Variant types to JSON format, including primitives, decimals, dates, timestamps, and binary data with proper escaping and base64 encoding. The implementation adds missing methods to VariantObject and VariantArray for field/element access, includes two new dependencies (serde_json and base64), and provides comprehensive test coverage with unit, integration, and documentation test suites.

Open to input for improving any part of this implementation.

Are there any user-facing changes?

The new API's added in parquet-variant will be user-facing.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 15, 2025
@alamb
Copy link
Contributor

alamb commented Jun 16, 2025

FYI @scovich @PinkCrow007 and @mkarbo

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @carpecodeum -- this is very cool and welcome to the community!

I think we need to resolve the following items before merging this PR:

  1. Add unit tests for writing objects and lists to json
  2. Get the CI tests passing (looks like there are some errors with cargo clippy and cargo fmt)
  3. Figure out the conflict with the variant/array PR from @scovich #7666

Something I think is important to resolve soon (I can help here) is making the serde_json depenency "opt in" (aka not required)

In terms of sorting out the conflict with @scovich 's PR, I think we should try and get that one mergable asap and merge it first and then get this one

Given that you are now familiar with Variant, perhaps you can help review #7666 ? I am sure @scovich would love some more feedback

todo!();
#[allow(unreachable_code)] // Just to infer the return type
Ok(vec![].into_iter())
let len = self.len();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to conflict somewhat with #7666 from @scovich

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be resolved now, i hope

[dependencies]
arrow-schema = "55.1.0"
chrono = { workspace = true }
serde_json = "1.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could avoid adding these dependencies if possible as serdejson brings non trivial code.

This probably looks like adding a optional feature flag (like serde-json) or something

I think we can do it as a follow on PR though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert different Variant types to JSON format, including primitives, decimals, dates, timestamps, and binary data with proper escaping and base64 encoding.

There's a very large (and sadly completely unspecified) design space for how to represent some variant subtypes as JSON. Date/time/timestamp and binary being the most glaring examples. I thought spark documented some loose suggestions for best practices, but I can't find it anywhere now.

Storing binary data as base64, in particular, is a non-portable design choice in the sense that other systems might choose to use a different base64 dialect, or even something more exotic like z85 that pays 5/4 overhead instead of 4/3.

For date/time/timestamp values, one could store as numeric (days/µs/ns since unix epoch) or as a formatted string (ISO? or something else?), and there's a decent chance most readers could handle either one. But still technically non-portable.

I don't know the best solution -- maybe this PR is it -- but I wanted to point out the conundrum.

a simple interface between the Variant implementation and the Serde JSON library.

What is the thinking about serde JSON vs. arrow-json here? Do we plan to eventually support both and this is just the easier first step? (the arrow-json tape decoder is definitely a bit more complex than the serde JSON Value, but both implementations would have the same "shape" -- see e.g. #7403 that prototypes both)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could probably just call it to_json.rs? We already know it's variant?

Also, is there a reason this is under the encoder module? It seems like its own thing?

Comment on lines 94 to 92
let divisor = 10_i64.pow(*scale as u32);
let decimal_value = *integer as f64 / divisor as f64;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f64 (53-bit precision) cannot accurately represent all decimal8 values (63-bit precision), let alone decimal16 (127-bit precision). For example, round tripping a value like 0x4000_0000_0000_ffff from i64 to f64 and loses information about the least-significant bits, producing 0x4000_0000_0000_fe48. See playground.

I have no idea why it clips to that specific value, but you get the idea.

Instead, we will need to extract the quotient and remainder resulting from division by an integer power of ten, and then print the two parts back to back with e.g. "{quotient}.{remainder}".

Once we have that capability, we should probably also use it for decimal4 values, just for uniformity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally -- floating point (base 2) can't accurately represent powers of ten (and both bases have different errors when representing e.g. powers of 3), so even a decimal4 could potentially lose information through a round trip. Tho I think most floating point parsers work very hard to avoid exposing that kind of inaccuracy, by ensuring that every representable value round trips cleanly (I have vague memories of seeing a bug report about a case that slipped past the java floating point parser several years ago).

Comment on lines 167 to 168
fn convert_array_to_json<W: Write>(
buffer: &mut W,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Any particular reason not to use impl in these signatures?

Suggested change
fn convert_array_to_json<W: Write>(
buffer: &mut W,
fn convert_array_to_json(
buffer: &mut impl Write,

Comment on lines 271 to 280
Variant::Decimal8 { integer, scale } => {
let divisor = 10_i64.pow(*scale as u32);
let decimal_value = *integer as f64 / divisor as f64;
serde_json::Number::from_f64(decimal_value)
.map(Value::Number)
.ok_or_else(|| ArrowError::InvalidArgumentError("Invalid decimal value".to_string()))
}
Variant::Decimal16 { integer, scale } => {
let divisor = 10_i128.pow(*scale as u32);
let decimal_value = *integer as f64 / divisor as f64;
serde_json::Number::from_f64(decimal_value)
.map(Value::Number)
.ok_or_else(|| ArrowError::InvalidArgumentError("Invalid decimal value".to_string()))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above... f64 cannot accurately represent all decimal8/decimal16 values. But f64 might be the best we can do here, because serde_json::Number is compatible with the JSON spec which states:

This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

Note that when such software is used, numbers that are integers and are in the range [-(2**53)+1, (2**53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.

(it actually goes above and beyond by allowing to capture i64/u64, in addition to f64)

Comment on lines 285 to 286
Variant::Date(date) => Ok(Value::String(date.format("%Y-%m-%d").to_string())),
Variant::TimestampMicros(ts) => Ok(Value::String(ts.to_rfc3339())),
Variant::TimestampNtzMicros(ts) => Ok(Value::String(ts.format("%Y-%m-%dT%H:%M:%S%.6f").to_string())),
Variant::Binary(bytes) => Ok(Value::String(general_purpose::STANDARD.encode(bytes))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated logic. Probably better to define helper functions to completely isolate the logic from callers. At a minimum, we should hoist out those format strings as named const that can be reused, since they're especially error-prone.

Comment on lines 291 to 297
let mut map = serde_json::Map::new();
let fields = obj.fields()?;

for (key, value) in fields {
let json_value = variant_to_json_value(&value)?;
map.insert(key.to_string(), json_value);
}

Ok(Value::Object(map))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a profitable place to use Iterator::collect?

Suggested change
let mut map = serde_json::Map::new();
let fields = obj.fields()?;
for (key, value) in fields {
let json_value = variant_to_json_value(&value)?;
map.insert(key.to_string(), json_value);
}
Ok(Value::Object(map))
let fields = obj
.fields()?.
.map(|(key, value)| {
Ok((key.to_string(), variant_to_json_value(&value)?))
})
.collect::<Result<_, _>>()?;
Ok(Value::Object(fields))

(again below for Variant::Array, using values() method)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contrbution @carpecodeum and the reviews @scovich

My comments are mostly about consolidating the tests, improving the examples, and more clearly separating the serde_json dependencies

In my opinion, we could merge this PR as is and keep iterating in follow on PRs.

Let me know what you prefer

Comment on lines 358 to 390
fn test_buffer_writing() {
use parquet_variant::variant_to_json;
use std::io::Write;

let variant = Variant::String("test buffer writing");

// Test writing to a Vec<u8>
let mut buffer = Vec::new();
variant_to_json(&mut buffer, &variant).unwrap();
let result = String::from_utf8(buffer).unwrap();
assert_eq!(result, "\"test buffer writing\"");

// Test writing to a custom writer
struct CustomWriter {
data: Vec<u8>,
}

impl Write for CustomWriter {
fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
self.data.extend_from_slice(buf);
Ok(buf.len())
}

fn flush(&mut self) -> std::io::Result<()> {
Ok(())
}
}

let mut custom_writer = CustomWriter { data: Vec::new() };
variant_to_json(&mut custom_writer, &variant).unwrap();
let result = String::from_utf8(custom_writer.data).unwrap();
assert_eq!(result, "\"test buffer writing\"");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Vec already implements Write I think you can simplify test to something like

Suggested change
fn test_buffer_writing() {
use parquet_variant::variant_to_json;
use std::io::Write;
let variant = Variant::String("test buffer writing");
// Test writing to a Vec<u8>
let mut buffer = Vec::new();
variant_to_json(&mut buffer, &variant).unwrap();
let result = String::from_utf8(buffer).unwrap();
assert_eq!(result, "\"test buffer writing\"");
// Test writing to a custom writer
struct CustomWriter {
data: Vec<u8>,
}
impl Write for CustomWriter {
fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
self.data.extend_from_slice(buf);
Ok(buf.len())
}
fn flush(&mut self) -> std::io::Result<()> {
Ok(())
}
}
let mut custom_writer = CustomWriter { data: Vec::new() };
variant_to_json(&mut custom_writer, &variant).unwrap();
let result = String::from_utf8(custom_writer.data).unwrap();
assert_eq!(result, "\"test buffer writing\"");
}
fn test_buffer_writing() {
use parquet_variant::variant_to_json;
use std::io::Write;
let variant = Variant::String("test buffer writing");
// Test writing to a Vec<u8>
let mut buffer = Vec::new();
variant_to_json(&mut buffer, &variant).unwrap();
let result = String::from_utf8(buffer).unwrap();
assert_eq!(result, "\"test buffer writing\"");
let mut buffer = vec![];
variant_to_json(&mut buffer, &variant).unwrap();
let result = String::from_utf8(buffer).unwrap();
assert_eq!(result, "\"test buffer writing\"");
}


#[test]
fn test_json_roundtrip_compatibility() {
// Test that our JSON output can be parsed back by serde_json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and test_comprehensive_roundtrip_compatibility are great tests

One thing that might make it easier to evaluate what was covered would be to combine the different tests into a structure as follows, which would also make the values more self descrbing

/// Each Test has the equivalent variant, json string and serde_json::Value
Test {
  variant: Variant,
  json: &'static str,
  value: serde_json::Value,
}

And then you could write a driver function

impl Test {
  fn run(self) {
   // verify that converting the variant --> Json matches json and value
   // also verify that converting back to variant results in the same variant
  }
}

And then the tests in this file could look something like

Test {
  variant: Variant::String("simple string"),
  json: "\"simple string\"",
  value: Value::from("simple string")
}.run()

use parquet_variant::{variant_to_json, variant_to_json_string, variant_to_json_value, Variant};

fn main() -> Result<(), Box<dyn std::error::Error>> {
let variants = vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great.

Here are some suggestions:

  1. Make this a doc comment / example on variant_to_json / variant_to_json_string -- that way it will be easier to discover as well as being run as part of the CI / tests of this crate
  2. Reduce the example to a single API example for each function -- for example, I wouldn't bother with all the different types and corner cases here (those are great unit tests, but might obscure the API use from new users). Instead I think an example that made a single Variant and convert it to JSON, and then assert the output would be the best as it would quickly show how to use the API and what it does

Comment on lines 42 to 52
/// use parquet_variant::{Variant, variant_to_json};
/// use arrow_schema::ArrowError;
///
/// fn example() -> Result<(), ArrowError> {
/// let variant = Variant::Int8(42);
/// let mut buffer = Vec::new();
/// variant_to_json(&mut buffer, &variant)?;
/// assert_eq!(String::from_utf8(buffer).unwrap(), "42");
/// Ok(())
/// }
/// example().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can make this significantly shorter like this -- the # in the front hides the line when rendered as docs and the test driver will make the stub program necessary

Suggested change
/// use parquet_variant::{Variant, variant_to_json};
/// use arrow_schema::ArrowError;
///
/// fn example() -> Result<(), ArrowError> {
/// let variant = Variant::Int8(42);
/// let mut buffer = Vec::new();
/// variant_to_json(&mut buffer, &variant)?;
/// assert_eq!(String::from_utf8(buffer).unwrap(), "42");
/// Ok(())
/// }
/// example().unwrap();
/// # use parquet_variant::{Variant, variant_to_json};
/// # use arrow_schema::ArrowError;
/// let variant = Variant::Int8(42);
/// let mut buffer = Vec::new();
/// variant_to_json(&mut buffer, &variant)?;

Also I think it would be awesome now to show an example of a VariantObject here (which is the most likely type that people will encode)

We could use the new builder that @PinkCrow007 added in #7653

Here is an example:

/// # Example: Create a [`Variant::Object`]
///
/// This example shows how to create an object with two fields:
/// ```json
/// {
/// "first_name": "Jiaying",
/// "last_name": "Li"
/// }
/// ```
///
/// ```
/// # use parquet_variant::{Variant, VariantBuilder};
/// let mut builder = VariantBuilder::new();
/// // Create an object builder that will write fields to the object
/// let mut object_builder = builder.new_object();
/// object_builder.append_value("first_name", "Jiaying");
/// object_builder.append_value("last_name", "Li");
/// object_builder.finish();
/// // Finish the builder to get the metadata and value
/// let (metadata, value) = builder.finish();
/// // use the Variant API to verify the result
/// let variant = Variant::try_new(&metadata, &value).unwrap();
/// let Variant::Object(variant_object) = variant else {
/// panic!("unexpected variant type")
/// };
/// assert_eq!(
/// variant_object.field_by_name("first_name").unwrap(),
/// Some(Variant::ShortString("Jiaying"))
/// );
/// assert_eq!(
/// variant_object.field_by_name("last_name").unwrap(),
/// Some(Variant::ShortString("Li"))
/// );
/// ```
///


use crate::variant::{Variant, VariantList, VariantObject};

/// Converts a Variant to JSON and writes it to the provided buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Converts a Variant to JSON and writes it to the provided buffer
/// Converts a Variant to JSON and writes it to the provided `Write`

///
/// # Example
///
/// ```rust
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as above

Ok(())
}

/// Convert Variant to JSON string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth commenting that this is the same as calling variant_to_json with a Vec

///
/// # Example
///
/// ```rust
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

// specific language governing permissions and limitations
// under the License.

//! Integration tests for JSON conversion functionality
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the difference in coverage provided by these tests compared to the tests that are in the parquet-variant/src/encoder/variant_to_json.rs? If there isn't any substantial difference in what is covered I think combing the tests in parquet-variant/src/encoder/variant_to_json.rs makes the most send

I also found there seems to be quite a bit of overlap in test coverage -- reducing the repetition I think would be valuable so it is easier to understand what is covered and what is not

Comment on lines 976 to 989
if let Value::Object(obj) = parsed {
assert_eq!(obj.len(), 6);
assert_eq!(
obj.get("string_field"),
Some(&Value::String("test_string".to_string()))
);
assert_eq!(obj.get("int_field"), Some(&Value::Number(123.into())));
assert_eq!(obj.get("bool_field"), Some(&Value::Bool(true)));
assert!(matches!(obj.get("float_field"), Some(Value::Number(_))));
assert_eq!(obj.get("null_field"), Some(&Value::Null));
assert_eq!(obj.get("long_field"), Some(&Value::Number(999.into())));
} else {
panic!("Expected JSON object");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW if you like less intending you could use this pattern too (just pointing this out)

Suggested change
if let Value::Object(obj) = parsed {
assert_eq!(obj.len(), 6);
assert_eq!(
obj.get("string_field"),
Some(&Value::String("test_string".to_string()))
);
assert_eq!(obj.get("int_field"), Some(&Value::Number(123.into())));
assert_eq!(obj.get("bool_field"), Some(&Value::Bool(true)));
assert!(matches!(obj.get("float_field"), Some(Value::Number(_))));
assert_eq!(obj.get("null_field"), Some(&Value::Null));
assert_eq!(obj.get("long_field"), Some(&Value::Number(999.into())));
} else {
panic!("Expected JSON object");
}
let Value::Object(obj) = parsed else {
panic!("Expected JSON object");
}
assert_eq!(obj.len(), 6);
assert_eq!(
obj.get("string_field"),
Some(&Value::String("test_string".to_string()))
);
assert_eq!(obj.get("int_field"), Some(&Value::Number(123.into())));
assert_eq!(obj.get("bool_field"), Some(&Value::Bool(true)));
assert!(matches!(obj.get("float_field"), Some(Value::Number(_))));
assert_eq!(obj.get("null_field"), Some(&Value::Null));
assert_eq!(obj.get("long_field"), Some(&Value::Number(999.into())));

@carpecodeum carpecodeum force-pushed the variant-to-json branch 2 times, most recently from 79e63f2 to 8b63662 Compare June 24, 2025 03:07
@carpecodeum
Copy link
Contributor Author

@alamb @scovich Thank you so much guys for your time and dedication for reviewing this PR, I have tried to fix all the suggestions given by you guys, Sorry for the delay in resolving all of them. If I missed anything please let me know I can probably fix that, as far as I am concerned, if you guys think this PR is good to be merged then I am also open to merging it, again thank you for your efforts

@carpecodeum
Copy link
Contributor Author

@alamb @scovich Thank you so much guys for your time and dedication for reviewing this PR, I have tried to fix all the suggestions given by you guys, Sorry for the delay in resolving all of them. If I missed anything please let me know I can probably fix that, as far as I am concerned, if you guys think this PR is good to be merged then I am also open to merging it, again thank you for your efforts

as far as the serde_json dependency issues are concerend, I would like to discuss more on that perhaps, if we can connect for that @alamb ?

@alamb
Copy link
Contributor

alamb commented Jun 24, 2025

@alamb @scovich Thank you so much guys for your time and dedication for reviewing this PR, I have tried to fix all the suggestions given by you guys, Sorry for the delay in resolving all of them. If I missed anything please let me know I can probably fix that, as far as I am concerned, if you guys think this PR is good to be merged then I am also open to merging it, again thank you for your efforts

Sounds good -- thank you @carpecodeum -- I plan to give this one another review later today

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR is I think this is great -- thank you so much @carpecodeum -- it is very well tested and I think the APIs make sense to me.

I will file a follow on ticket about the serde_json dependency and then I'll plan to merge it tomorrow.

THanks again

Please let me know if anyone else would like additional time to review this

const MAX_SHORT_STRING_BYTES: usize = 0x3F;

/// A Variant [`ShortString`]
/// Represents a variant array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is incorrect

Suggested change
/// Represents a variant array.
/// Represents a [`Variant::ShortString`].


/// Convert Variant to serde_json::Value
///
/// This function converts a Variant to a [`serde_json::Value`], which is useful
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

}

#[test]
fn test_primitive_json_conversion() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great

// Parse the JSON to verify structure - handle JSON parsing errors manually
let parsed: Value = serde_json::from_str(&json)
.map_err(|e| ArrowError::ParseError(format!("JSON parse error: {}", e)))?;
let Value::Object(obj) = parsed else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we added a nicer api here recently (this morning):

Suggested change
let Value::Object(obj) = parsed else {
let obj = parsed.as_ob().unwrap()

// specific language governing permissions and limitations
// under the License.

//! Example showing how to convert Variant values to JSON
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be great to include this inline in the docs rather than a standalone example, but I will make a follow on PR to do so

@alamb
Copy link
Contributor

alamb commented Jun 24, 2025

@carpecodeum I can not push to this branch directly as i don't have access to the cmu-db fork, which is fine

I made a PR to update this PR from main and fix a logical conflict:

Can you please check it out when you have some time?

@alamb
Copy link
Contributor

alamb commented Jun 24, 2025

Close/reopen this PR to get CI to re-run

@alamb alamb closed this Jun 24, 2025
@alamb alamb reopened this Jun 24, 2025
@alamb
Copy link
Contributor

alamb commented Jun 24, 2025

The CI failures should be fixed if this is merged:

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

One question -- how to render Inf/NaN float values? I think they should turn to Value::Null instead of throwing an error, see #7670 (comment)

Also, a bunch of opportunities to simplify quite a bit.

Comment on lines 1012 to 1019
impl From<bool> for Variant<'_, '_> {
fn from(value: bool) -> Self {
match value {
true => Variant::BooleanTrue,
false => Variant::BooleanFalse,
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why move it here, out of curiosity? (between decimal4 and decimal8)? Was there some merge conflict?

(the original location wasn't amazing either... but if we're going to move it why not move to near the top, below the impl From<()>?)

Comment on lines 83 to 85
Variant::Null => {
write!(json_buffer, "null")?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write! returns Result<()>, so should be able to simplify a bunch of these:

Suggested change
Variant::Null => {
write!(json_buffer, "null")?;
}
Variant::Null => write!(json_buffer, "null")?,

Comment on lines 112 to 125
if *scale == 0 {
write!(json_buffer, "{}", integer)?;
} else {
let divisor = 10_i32.pow(*scale as u32);
let quotient = integer / divisor;
let remainder = (integer % divisor).abs();
let formatted_remainder = format!("{:0width$}", remainder, width = *scale as usize);
let trimmed_remainder = formatted_remainder.trim_end_matches('0');
if trimmed_remainder.is_empty() {
write!(json_buffer, "{}", quotient)?;
} else {
write!(json_buffer, "{}.{}", quotient, trimmed_remainder)?;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is begging for a generic method... three syntactically ~identical copies of some rather complex code. I don't know a good way to do it, tho

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work:

fn write_decimal(
    json_buffer: &mut impl Write, 
    integer: impl Into<i128>, 
    scale: u8,
) -> Result<(), ArrowError> {
    let integer = integer.into();
    // ... the current Decimal16 logic ...
}

and then all three call sites look the same:

write_decimal(json_buffer, *integer, *scale)

Comment on lines 118 to 119
let formatted_remainder = format!("{:0width$}", remainder, width = *scale as usize);
let trimmed_remainder = formatted_remainder.trim_end_matches('0');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is clever enough to merit a code comment? Given a remainder like 100:

  • The {remainder:0width$} format ensures it correctly renders with leading zeros (e.g. .00000100)
  • The trime_end_matches then strips away any trailing zeros (e.g. .000001)

let remainder = (integer % divisor).abs();
let formatted_remainder = format!("{:0width$}", remainder, width = *scale as usize);
let trimmed_remainder = formatted_remainder.trim_end_matches('0');
if trimmed_remainder.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simpler to check for remainder == 0, before formatting?

if remainder == 0 {
    write!(json_buffer, "{quotient}")?;
} else {
    let remainder = format!("{remainder:0width$}", width = *scale as usize)
    let remainder = remainder.trim_end_matches('0');
    write!(json_buffer, "{quotient}.{remainder}")?;
}

Comment on lines +311 to +313
let mut buffer = Vec::new();
variant_to_json(&mut buffer, variant)?;
String::from_utf8(buffer)
.map_err(|e| ArrowError::InvalidArgumentError(format!("UTF-8 conversion error: {}", e)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String also implements Write:

Suggested change
let mut buffer = Vec::new();
variant_to_json(&mut buffer, variant)?;
String::from_utf8(buffer)
.map_err(|e| ArrowError::InvalidArgumentError(format!("UTF-8 conversion error: {}", e)))
let mut string = String::new();
variant_to_json(&mut string, variant)?;
Ok(string)

Variant::Int16(i) => Ok(Value::Number((*i).into())),
Variant::Int32(i) => Ok(Value::Number((*i).into())),
Variant::Int64(i) => Ok(Value::Number((*i).into())),
Variant::Float(f) => serde_json::Number::from_f64(*f as f64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Variant::Float(f) => serde_json::Number::from_f64(*f as f64)
Variant::Float(f) => serde_json::Number::from_f64((*f).into())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should NaN/Inf be rendered as Variant::Null? Seems a bit harsh to just fail the conversion of a deeply nested variant because a stray divide by zero slipped in somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: impl From<f64> for serde_json::Value converts them to Value::Null.

format!("{}.{}", quotient, trimmed_remainder)
};

// Parse as serde_json::Number to preserve precision
Copy link
Contributor

@scovich scovich Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting conundrum... I don't think parsing actually preserves precision at all, other than the difference between f64 and i64? Also, serde_json::Value provides an extensive set of impl From:

let integer = if *scale == 0 {
    *integer
} else {
    let divisor = ...;
    if integer % divisor != 0 {
        // fall back to floating point
        return Ok(Value::from(integer as f64 / divisor as f64))
    }
    integer / divisor
}
Ok(Value::from(integer));

The above works for i32 and i64, but i128 would need somewhat different handling of the final integer:

// Prefer to emit as i64, but fall back to u64 or even f64 (lossy) if necessary
let value = i64::try_from(integer)
    .map(Value::from)
    .or_else(|| u64::try_from(integer))
    .map(Value::from)
    .unwrap_or_else(|| Value::from(integer as f64));
Ok(value)

Comment on lines 463 to 471
let mut vec = Vec::new();
let len = arr.len();

for i in 0..len {
let element = arr.get(i)?;
let json_value = variant_to_json_value(&element)?;
vec.push(json_value);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to above:

Suggested change
let mut vec = Vec::new();
let len = arr.len();
for i in 0..len {
let element = arr.get(i)?;
let json_value = variant_to_json_value(&element)?;
vec.push(json_value);
}
let vec = arr
.iter()
.map(|element| variant_to_json_value(&element))
.collect::<Result<_, _>>();

Comment on lines 453 to 459
let mut map = serde_json::Map::new();

for (key, value) in obj.iter() {
let json_value = variant_to_json_value(&value)?;
map.insert(key.to_string(), json_value);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serde_json::Map implements FromIterator:

Suggested change
let mut map = serde_json::Map::new();
for (key, value) in obj.iter() {
let json_value = variant_to_json_value(&value)?;
map.insert(key.to_string(), json_value);
}
let map = obj
.iter
.map(|(k, v)| Ok((k.to_string(), variant_to_json_value(&v)?)))
.collect::<Result<_, _>>()?;

@alamb
Copy link
Contributor

alamb commented Jun 25, 2025

Given the size and history of this PR I think we should merge it and keep iterating on main. I plan to do so later today

@alamb alamb merged commit 0366140 into apache:main Jun 25, 2025
13 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 25, 2025

Thanks again everyone. I am working on filing follow on tickets

Update:

alamb added a commit that referenced this pull request Jun 26, 2025
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Follow on to #7670 from
@carpecodeum
- Part of #6736

# Rationale for this change

I was going through the code and examples and I felt that there was some
redundancy and that the example file was unlikely to be found as not
many crates in this repo have examples

I would like to propose moving the examples as close to the actual code
as possible to give it the best chance to be discovered.

# What changes are included in this PR?

1. Remove `parquet-variant/examples/variant_to_json_examples.rs`
2. Update some of the other examples and docs for the json functions
with content from that example

# Are these changes tested?

The examples are covered by CI tests.

# Are there any user-facing changes?

Different docs
alamb added a commit that referenced this pull request Jun 26, 2025
# Which issue does this PR close?

* Part of #6736

# Rationale for this change

Follow-up to #7670, which
accidentally introduced a lossy to-string conversion for variant
decimals.

# What changes are included in this PR?

Use integer + string operations to convert decimal values to string,
instead of floating point that could lose precision.

Also, the `VariantDecimalXX` structs now `impl Display`, which greatly
simplifies the to-json path. A new (private) macro encapsulates the
to-string logic, since it's syntactically identical for all three
decimal sizes.

# Are these changes tested?

Yes, new unit tests added.

# Are there any user-facing changes?

The `VariantDecimalXX` structs now `impl Display`

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Variant: Write Variant Values as JSON
3 participants