[Variant] [Pathfinding] Variant builder rollback, variant array builder, and read-only metadata builder #8167

scovich · 2025-08-18T19:21:28Z

Which issue does this PR close?

Pathfinding. Even if we like what we see here, it needs to split into a half dozen other PR in order to actually merge. But first we need to see the big picture to know whether it's worth going further.

Several flavors of pathfinding rolled up in one, but with the overarching goal of:

An alternate approach to

[Variant] VariantBuilder with VariantMetadata instead of MetadataBuilder #7915

Rationale for this change

While trying to rebase #7915, I kept running into more and more ugliness. And it was already kind of ugly.

Also, I realized that the ValueBuffer use of builders with a generic ParentState::Variant was incorrect in some rollback scenarios where the attempt to insert a value fails (try_insert_object or try_insert_list).

Further, the interaction between VariantBuilder and VariantArrayBuilder was quite convoluted.

It turns out all of those issues are a bit related, and this PR is the result of my exploration.

What changes are included in this PR?

A lot! In no particular order:

Rename ValueBuffer as ValueBuilder and make it public
Heavily rework the design of ParentState to be much clearer and hopefully easier to use
- Track finished status on behalf of the builders that use it (their drop glue goes away)
- Each variant tracks the state that needs to be rolled back in case finish is not called. In particular, the value and metadata buffer offsets are tracked and rolled back uniformly.
- New constructors eagerly update the offset/field info for list/object, and track the info necessary to roll back that change if the builder fails to finish after all.
The ObjectBuilder methods become fallible, because they immediately detect and report duplicated field names when that checking is enabled.
VariantBuilder::finish becomes infallible (because any failed field insertion is caught immediately by the now-fallible ObjectBuilder methods). However, the method still returns Result::Ok for now, to avoid excessive unit test churn (removing dozens of unwrap and ? calls scattered all over the place)
Rework ValueBuilder::[try_]append_variant to be an associated function instead of a method, taking a ParentState from its caller in order to ensure correct rollback semantics.
Make ParentState public
The MetadataBuilder struct becomes reusable, similar to normal arrow array builders -- its finish method takes&mut self.
VariantArrayBuilder now directly instantiates a ValueBuilder and MetadataBuilder pair up front, and converts them to byte slices only when its own finish method is called. It no longer attempts to create a VariantBuilder at any point, and works directly with the ValueBuilder API instead. No more mem::take magic needed -- the VariantArrayVariantBuilder helper class just takes a mut ref to its owner and its finish method handles the rest; the drop glue becomes empty (just to help catch forgotten finish calls) because the ParentState now deals with all needed cleanup.
VariantArrayBuilder also tracks starting offsets instead of (offset, len) pairs -- the latter is easily derived from the former.
Turn MetadataBuilder into a trait (the original struct is renamed as MetadataBuilderXX for now, final naming still TBD). Most methods that previously took a MetadataBuilder struct now take a &mut dyn MetadataBuilder instead.
Deleted a bunch of code originally added to support VariantArrayBuilder, now that it's no longer needed.
Added fallible and infallible pairs for new_object and new_list to VariantBuilderExt trait. That way, the existing unit test code that calls those methods can keep using the infallible versions (panicking if there's a duplicate field name). Prod code that actually cares calls the new infallible versions instead.
Introduced a basic ReadOnlyMetadataBuilder that implements the MetadataBuilder trait but "mounts" an existing VariantMetadata. It takes advantage of the fact that object field insertions are now fallible, in order to return an error if the requested field name does not exist in the underlying metadata dictionary.

Are these changes tested?

Partly. I need to port over the unit tests from #7915 to fully exercise the new ReadOnlyVariantBuilder.

Are there any user-facing changes?

Yes, lots of stuff became public.

Also, the semantics of some functions changed (breaking change)

scovich · 2025-08-18T19:21:55Z

Attn @alamb @carpecodeum @friendlymatthew , very interested in your thoughts?

scovich

Some hopefully helpful context for reviewers

scovich · 2025-08-18T19:31:02Z

parquet-variant-compute/src/variant_array_builder.rs

    nulls: NullBufferBuilder,
    /// buffer for all the metadata
-    metadata_buffer: Vec<u8>,
+    metadata_builder: MetadataBuilderXX,


Because the metadata builder is now "reusable" (each new finish call just appends another chunk of bytes to the underlying byte buffer), we can instantiate it directly and just pass it by reference to the VariantArrayVariantBuilder helper class.

scovich · 2025-08-18T19:31:53Z

parquet-variant-compute/src/variant_array_builder.rs

+    metadata_offsets: Vec<usize>,
    /// buffer for values
-    value_buffer: Vec<u8>,
+    value_builder: ValueBuilder,


This used to be called ValueBuffer, and it turns out to have a better API for our purposes than VariantBuilder. So we no longer create a VariantBuilder at all, and just pass the value and metadata builders around by reference as needed.

scovich · 2025-08-18T19:32:46Z

parquet-variant-compute/src/variant_array_builder.rs

+    metadata_builder: MetadataBuilderXX,
    /// (offset, len) pairs for locations of metadata in the buffer
-    metadata_locations: Vec<(usize, usize)>,
+    metadata_offsets: Vec<usize>,


Track starting offsets (similar to normal variant code), and infer the length of each entry as the difference between adjacent offsets. Ditto for value_offsets below.

scovich · 2025-08-18T19:33:47Z

parquet-variant-compute/src/variant_array_builder.rs

-        let metadata_buffer = std::mem::take(&mut self.metadata_buffer);
-        let value_buffer = std::mem::take(&mut self.value_buffer);
-        VariantArrayVariantBuilder::new(self, metadata_buffer, value_buffer)
+        VariantArrayVariantBuilder::new(self)


Move all the magic into the constructor (this could be done as a separate PR)

scovich · 2025-08-18T19:34:16Z

parquet-variant-compute/src/variant_array_builder.rs

 impl VariantBuilderExt for VariantArrayVariantBuilder<'_> {
    fn append_value<'m, 'v>(&mut self, value: impl Into<Variant<'m, 'v>>) {
-        self.variant_builder.append_value(value);
+        ValueBuilder::try_append_variant(self.parent_state(), value.into()).unwrap()


Missed one

Suggested change

ValueBuilder::try_append_variant(self.parent_state(), value.into()).unwrap()

ValueBuilder::append_variant(self.parent_state(), value.into())

scovich · 2025-08-18T20:12:37Z

parquet-variant/src/builder.rs

-        buffer.try_append_variant(value.into(), metadata_builder)?;
-        Ok(())
+        let (state, _) = self.parent_state(key)?;
+        ValueBuilder::try_append_variant(state, value.into())


Again, this used to not roll back fully on failure.

scovich · 2025-08-18T20:20:04Z

parquet-variant/src/builder.rs

+    fn new_list(&mut self) -> ListBuilder<'_> {
+        self.try_new_list().unwrap()
+    }
+
+    fn new_object(&mut self) -> ObjectBuilder<'_> {
+        self.try_new_object().unwrap()
+    }

-    fn new_object(&mut self) -> ObjectBuilder<'_>;
+    fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>;
+
+    fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>;


At first I just changed the new_XX methods to return Result, but that churned a lot of unit test code.

I'm still on the fence whether this is better in the long run tho?

Yeah, me too -- it is pretty annoying to have to check for errors on every single call to creating these objects, even when it can not fail (e.g. creating in memory variants)

Another thought I had was we could potentially defer errors that could arise into the call to build() -- but then that might results in hard to track down errors 🤔

The thing that really put me off of tracking deferred errors is, it became yet another thing that has to be rolled back if the offending builder never finishes. I'm pretty sure we weren't rolling it back properly before, and I didn't want to complicate the ParentState stuff even further by "fixing" it, if we can just eliminate it.

scovich · 2025-08-18T20:22:24Z

parquet-variant/src/builder.rs

+            .new_object()
+            .with_field("a", 1)
+            .with_field("b", 2)
+            .try_with_field("a", 3);


The duplicate field is now detected immediately upon insertion, not at finish

scovich · 2025-08-18T20:22:36Z

parquet-variant/src/builder.rs

+        let nested_result = inner_list
+            .new_object()
+            .with_field("x", 1)
+            .try_with_field("x", 2);


Again, duplicates are now detected immediately.

scovich · 2025-08-18T20:22:51Z

parquet-variant/src/builder.rs


-    /// Test reusing buffers with nested objects
-    #[test]
-    fn test_with_existing_buffers_nested() {


This was testing support code for VariantArrayBuilder that has been deleted.

alamb · 2025-08-19T11:20:26Z

I will review this PR carefully today

friendlymatthew · 2025-08-19T12:10:14Z

Hi, I have to catch up on all the variant progress (work is busy). I'll review this by the end of this week, but no need to wait for me. If I have questions, I will leave them post-merge

alamb

I took a pretty close look at this PR and I think the direction makes a lot of sense to me. I left some inline comments, but basically I think it is ready to polish up and get ready to merge (if you can break it into smaller PRs that would be nice)

cc @klion26 and @codephage2020 who may also be interested and have some additional comments to share

alamb · 2025-08-19T18:09:37Z

parquet-variant-compute/src/variant_array_builder.rs

-        let value_length = 0;
-        self.value_locations.push((value_offset, value_length));
+        self.metadata_offsets.push(self.metadata_builder.offset());
+        self.value_offsets.push(self.value_builder.offset());


Not as part of this PR, but I think the Null elements are supposed to have a Variant::Null rather than empty for all elements that are marked as null.

For top-level (which this is), I believe you are correct. Good catch!

alamb · 2025-08-19T18:14:10Z

parquet-variant-json/src/from_json.rs

+        // Size of innermost_list: 1 + 1 + 2*(128 + 1) + 2*128 = 516
+        // Size of inner object: 1 + 4 + 2*256 + 3*(256 + 1) + 256 * 516 = 133384
+        // Size of json: 1 + 4 + 2*256 + 4*(256 + 1) + 256 * 133384 = 34147849
+        assert_eq!(value.len(), 34147849);


I agree with this analysis -- the test seems overly sensitive to small pturbatons

alamb · 2025-08-19T18:16:11Z

parquet-variant/src/variant/metadata.rs

+    /// Same as `Index::index`, but with the correct lifetime.
+    pub(crate) fn get_infallible(&self, i: usize) -> &'m str {


I don't understand this comment.

Minor nit: directly using get(i).unwrap() at relevant locations rather than add a new function might be easier to understand than understanding when to use get_infallible 🤔

alamb · 2025-08-19T18:17:41Z

parquet-variant/src/builder.rs

-impl From<Vec<u8>> for ValueBuffer {
-    fn from(value: Vec<u8>) -> Self {
-        Self(value)
-    }
-}
-
-impl From<ValueBuffer> for Vec<u8> {
-    fn from(value_buffer: ValueBuffer) -> Self {
-        value_buffer.0
-    }
-}


if it isn't needed I support removing it (we can always add it back later)

alamb · 2025-08-19T18:18:34Z

parquet-variant/src/builder.rs

+    fn append_object(state: ParentState<'_>, obj: VariantObject) {
+        let mut object_builder = ObjectBuilder::new(state, false);


I like this idea -- it makes a lot of sense to me

alamb · 2025-08-19T18:19:32Z

parquet-variant/src/builder.rs

+pub trait MetadataBuilder: std::fmt::Debug {
+    fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, ArrowError>;
+    fn field_name(&self, field_id: usize) -> &str;
+    fn num_field_names(&self) -> usize;
+    fn truncate_field_names(&mut self, new_size: usize);
+}


The API makes sense to me -- I think it would be nice to add some comments, but in general 👍

alamb · 2025-08-19T18:20:15Z

parquet-variant/src/builder.rs

 /// You can use an existing `Vec<u8>` as the metadata buffer by using the `from` impl.
 #[derive(Default, Debug)]
-struct MetadataBuilder {
+pub struct MetadataBuilderXX {


How about OwnedMetadataBuilder or DefaultMetadataBuilder

alamb · 2025-08-19T18:21:52Z

parquet-variant/src/builder.rs

-    /// The starting offset in the parent's buffer where this object starts
-    parent_value_offset_base: usize,
-    /// The starting offset in the parent's metadata buffer where this object starts
-    /// used to truncate the written fields in `drop` if the current object has not been finished
-    parent_metadata_offset_base: usize,
-    /// Whether the object has been finished, the written content of the current object
-    /// will be truncated in `drop` if `has_been_finished` is false
-    has_been_finished: bool,


it is very nice that the parent state handles this nicely now

alamb · 2025-08-19T18:22:41Z

parquet-variant/src/builder.rs


    /// Returns an object builder that can be used to append a new (nested) object to this object.
    ///
+    /// Panics if the requested key cannot be inserted (e.g. because it is a duplicate).


Another reason it could panic is that the builder has read only metadata, right?

Another way we could avoid panic's would be to defer errors until the final finish() is called 🤔

I think this is fine for now

it could panic [if] the builder has read only metadata, right?

Yep! And having the same method fail for both error cases is really appealing to me.

(I just wrote the code comment before adding the read-only builder to this PR)

We could avoid panic's [by deferring] errors until the final finish() is called

The goal was to make finish signature infallible, now that it actually is infallible in practice. I just didn't want to bloat this PR with the extra churn that would cause.

Plus, it always seemed weird to do extra work to defer an obvious known failure until later. Is there a disadvantage to failing immediately?

Plus, it always seemed weird to do extra work to defer an obvious known failure until later. Is there a disadvantage to failing immediately?

the only disadvantage is that it is annoying if you have to do error checking for operations that can't fail - like making a new object with owned metadata.

The top-level builder was, and remains infallible in both its new_object and finish methods?

Here we're dealing with the object builder's (nested) new_object, which takes a field name and so could fail due to duplicates.

And actually... it wasn't even checking in the first place:

[Variant] ObjectBuilder does not reliably check for duplicates #8170

alamb · 2025-08-19T18:25:32Z

parquet-variant/src/builder.rs

+    fn new_list(&mut self) -> ListBuilder<'_> {
+        self.try_new_list().unwrap()
+    }
+
+    fn new_object(&mut self) -> ObjectBuilder<'_> {
+        self.try_new_object().unwrap()
+    }

-    fn new_object(&mut self) -> ObjectBuilder<'_>;
+    fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>;
+
+    fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>;


Yeah, me too -- it is pretty annoying to have to check for errors on every single call to creating these objects, even when it can not fail (e.g. creating in memory variants)

Another thought I had was we could potentially defer errors that could arise into the call to build() -- but then that might results in hard to track down errors 🤔

scovich

I think the direction makes a lot of sense to me. I left some inline comments, but basically I think it is ready to polish up and get ready to merge (if you can break it into smaller PRs that would be nice)

Thanks for the quick review!

This will definitely become a stream of smaller/scoped PR, I just wanted a sanity check before starting down that path.

scovich · 2025-08-19T18:37:53Z

parquet-variant/src/variant/metadata.rs

+    /// Same as `Index::index`, but with the correct lifetime.
+    pub(crate) fn get_infallible(&self, i: usize) -> &'m str {


TheIndex::index method has the following signature:

fn index(&self, i: usize) -> &Self::Output

which is really

fn index<'a>(&'a self, i: usize) -> &'a Self::Output

and because the (elided) 'a is not related to 'm, we can't use the return value in a place that requires &'m str.

You're right that self.get(i).unwrap() would also work; I was just trying to preserve the expect at both call sites. Should we just unwrap everywhere and be done with it?

NOTE: The get_infallible should be private, except then Index can't access it.

scovich · 2025-08-19T18:46:51Z

parquet-variant-compute/src/variant_array_builder.rs

-        let value_length = 0;
-        self.value_locations.push((value_offset, value_length));
+        self.metadata_offsets.push(self.metadata_builder.offset());
+        self.value_offsets.push(self.value_builder.offset());


For top-level (which this is), I believe you are correct. Good catch!

scovich · 2025-08-19T18:48:06Z

parquet-variant/src/builder.rs

+pub trait MetadataBuilder: std::fmt::Debug {
+    fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, ArrowError>;
+    fn field_name(&self, field_id: usize) -> &str;
+    fn num_field_names(&self) -> usize;
+    fn truncate_field_names(&mut self, new_size: usize);
+}


Needs comments for sure!

scovich · 2025-08-19T18:50:27Z

parquet-variant/src/builder.rs

 /// You can use an existing `Vec<u8>` as the metadata buffer by using the `from` impl.
 #[derive(Default, Debug)]
-struct MetadataBuilder {
+pub struct MetadataBuilderXX {


Yeah my other PR used DefaultMetadataBuilder, but I wasn't sure the name conveyed what it actually does differently than a "not default" (or "not owned") metadata builder might do? Maybe BasicMetadataBuilder would convey the fact that any other impl must be doing something fancy/special (= not basic)?

scovich · 2025-08-19T18:54:50Z

parquet-variant/src/builder.rs


    /// Returns an object builder that can be used to append a new (nested) object to this object.
    ///
+    /// Panics if the requested key cannot be inserted (e.g. because it is a duplicate).


it could panic [if] the builder has read only metadata, right?

Yep! And having the same method fail for both error cases is really appealing to me.

(I just wrote the code comment before adding the read-only builder to this PR)

We could avoid panic's [by deferring] errors until the final finish() is called

The goal was to make finish signature infallible, now that it actually is infallible in practice. I just didn't want to bloat this PR with the extra churn that would cause.

Plus, it always seemed weird to do extra work to defer an obvious known failure until later. Is there a disadvantage to failing immediately?

scovich · 2025-08-19T18:56:48Z

parquet-variant/src/builder.rs

+    fn new_list(&mut self) -> ListBuilder<'_> {
+        self.try_new_list().unwrap()
+    }
+
+    fn new_object(&mut self) -> ObjectBuilder<'_> {
+        self.try_new_object().unwrap()
+    }

-    fn new_object(&mut self) -> ObjectBuilder<'_>;
+    fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>;
+
+    fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>;


The thing that really put me off of tracking deferred errors is, it became yet another thing that has to be rolled back if the offending builder never finishes. I'm pretty sure we weren't rolling it back properly before, and I didn't want to complicate the ParentState stuff even further by "fixing" it, if we can just eliminate it.

klion26

Thanks for the contribution. This is very nice; we can now handle the rollback logic more elegantly.

klion26 · 2025-08-20T09:22:15Z

parquet-variant/src/builder.rs

+    }
+
+    // Performs any parent-specific aspects of rolling back a builder if an insertion failed.
+    fn rollback(&mut self) {


Will rollback_if_needed be better here?

klion26 · 2025-08-20T09:24:54Z

parquet-variant/src/builder.rs

+                "Field name '{field_name}' not found in metadata",
+            )));
+        };
+        self.known_field_names.insert(self.metadata.get_infallible(field_id), field_id);


Are we not filling known_field_names in at initialization due to performance considerations?

Yes -- this specific object's field names are an arbitrarily small subset of the metadata dictionary. In an extreme case, the dictionary could contain thousands of entries while the specific object being manipulated contains only a handful of fields.

scovich · 2025-08-20T18:06:43Z

@klion26 -- thanks for the review! This pathfinding PR cannot merge (too big and messy), but I pulled out the first round of rollback changes as a cleaned up and better scoped PR here:

[Variant] ParentState handles finish/rollback for builders #8185

PTAL, if you have time?

scovich · 2025-08-23T13:34:12Z

I'll go ahead and close this now. The work items are all tracked by

[Variant] Variant builder improvements #8171

And almost everything has merged except

[Variant] Support read-only metadata builders #8208

scovich added 12 commits August 18, 2025 07:30

Object builder detects and reports duplicate fields immediately

bdb0c99

checkpoint - impl seems to work

9e502d5

Only create one parent state when inserting an object or list

78cba4d

Add try_new_list and try_new_object to reduce unit test churn

984dc0b

checkpoint - use offsets instead of (start,len) pairs

55e4cc8

checkpoint - use starting offset instead of ending offset

b1478c2

checkpoint - seems to work!

c82b7f0

remove useless impl Drop

f551f9b

remove dead code

714c5a0

more cleanups, esp ValueBuilder::append_variant

8da01b4

fmt

bf39e9c

Make MetadataBuilder a trait

00ba715

github-actions bot added the parquet-variant parquet-variant* crates label Aug 18, 2025

scovich commented Aug 18, 2025

View reviewed changes

alamb reviewed Aug 19, 2025

View reviewed changes

scovich commented Aug 19, 2025

View reviewed changes

This was referenced Aug 20, 2025

[Variant] ObjectBuilder does not reliably check for duplicates #8170

Closed

[Variant] Variant builder improvements #8171

Open

klion26 approved these changes Aug 20, 2025

View reviewed changes

alamb marked this pull request as draft August 20, 2025 18:53

scovich closed this Aug 23, 2025

	ValueBuilder::try_append_variant(self.parent_state(), value.into()).unwrap()
	ValueBuilder::append_variant(self.parent_state(), value.into())

		/// Same as `Index::index`, but with the correct lifetime.
		pub(crate) fn get_infallible(&self, i: usize) -> &'m str {

		fn append_object(state: ParentState<'_>, obj: VariantObject) {
		let mut object_builder = ObjectBuilder::new(state, false);

[Variant] [Pathfinding] Variant builder rollback, variant array builder, and read-only metadata builder #8167

[Variant] [Pathfinding] Variant builder rollback, variant array builder, and read-only metadata builder #8167

Uh oh!

Conversation

scovich commented Aug 18, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

scovich commented Aug 18, 2025

Uh oh!

scovich left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Aug 19, 2025

Uh oh!

friendlymatthew commented Aug 19, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scovich left a comment