-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] [Pathfinding] Variant builder rollback, variant array builder, and read-only metadata builder #8167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Attn @alamb @carpecodeum @friendlymatthew , very interested in your thoughts? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some hopefully helpful context for reviewers
nulls: NullBufferBuilder, | ||
/// buffer for all the metadata | ||
metadata_buffer: Vec<u8>, | ||
metadata_builder: MetadataBuilderXX, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the metadata builder is now "reusable" (each new finish
call just appends another chunk of bytes to the underlying byte buffer), we can instantiate it directly and just pass it by reference to the VariantArrayVariantBuilder
helper class.
metadata_offsets: Vec<usize>, | ||
/// buffer for values | ||
value_buffer: Vec<u8>, | ||
value_builder: ValueBuilder, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This used to be called ValueBuffer
, and it turns out to have a better API for our purposes than VariantBuilder
. So we no longer create a VariantBuilder
at all, and just pass the value and metadata builders around by reference as needed.
metadata_builder: MetadataBuilderXX, | ||
/// (offset, len) pairs for locations of metadata in the buffer | ||
metadata_locations: Vec<(usize, usize)>, | ||
metadata_offsets: Vec<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Track starting offsets (similar to normal variant code), and infer the length of each entry as the difference between adjacent offsets. Ditto for value_offsets
below.
let metadata_buffer = std::mem::take(&mut self.metadata_buffer); | ||
let value_buffer = std::mem::take(&mut self.value_buffer); | ||
VariantArrayVariantBuilder::new(self, metadata_buffer, value_buffer) | ||
VariantArrayVariantBuilder::new(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move all the magic into the constructor (this could be done as a separate PR)
impl VariantBuilderExt for VariantArrayVariantBuilder<'_> { | ||
fn append_value<'m, 'v>(&mut self, value: impl Into<Variant<'m, 'v>>) { | ||
self.variant_builder.append_value(value); | ||
ValueBuilder::try_append_variant(self.parent_state(), value.into()).unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed one
ValueBuilder::try_append_variant(self.parent_state(), value.into()).unwrap() | |
ValueBuilder::append_variant(self.parent_state(), value.into()) |
buffer.try_append_variant(value.into(), metadata_builder)?; | ||
Ok(()) | ||
let (state, _) = self.parent_state(key)?; | ||
ValueBuilder::try_append_variant(state, value.into()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this used to not roll back fully on failure.
fn new_list(&mut self) -> ListBuilder<'_> { | ||
self.try_new_list().unwrap() | ||
} | ||
|
||
fn new_object(&mut self) -> ObjectBuilder<'_> { | ||
self.try_new_object().unwrap() | ||
} | ||
|
||
fn new_object(&mut self) -> ObjectBuilder<'_>; | ||
fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>; | ||
|
||
fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first I just changed the new_XX
methods to return Result
, but that churned a lot of unit test code.
I'm still on the fence whether this is better in the long run tho?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, me too -- it is pretty annoying to have to check for errors on every single call to creating these objects, even when it can not fail (e.g. creating in memory variants)
Another thought I had was we could potentially defer errors that could arise into the call to build()
-- but then that might results in hard to track down errors 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing that really put me off of tracking deferred errors is, it became yet another thing that has to be rolled back if the offending builder never finishes. I'm pretty sure we weren't rolling it back properly before, and I didn't want to complicate the ParentState
stuff even further by "fixing" it, if we can just eliminate it.
.new_object() | ||
.with_field("a", 1) | ||
.with_field("b", 2) | ||
.try_with_field("a", 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The duplicate field is now detected immediately upon insertion, not at finish
let nested_result = inner_list | ||
.new_object() | ||
.with_field("x", 1) | ||
.try_with_field("x", 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, duplicates are now detected immediately.
|
||
/// Test reusing buffers with nested objects | ||
#[test] | ||
fn test_with_existing_buffers_nested() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was testing support code for VariantArrayBuilder
that has been deleted.
I will review this PR carefully today |
Hi, I have to catch up on all the variant progress (work is busy). I'll review this by the end of this week, but no need to wait for me. If I have questions, I will leave them post-merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a pretty close look at this PR and I think the direction makes a lot of sense to me. I left some inline comments, but basically I think it is ready to polish up and get ready to merge (if you can break it into smaller PRs that would be nice)
cc @klion26 and @codephage2020 who may also be interested and have some additional comments to share
let value_length = 0; | ||
self.value_locations.push((value_offset, value_length)); | ||
self.metadata_offsets.push(self.metadata_builder.offset()); | ||
self.value_offsets.push(self.value_builder.offset()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as part of this PR, but I think the Null elements are supposed to have a Variant::Null
rather than empty for all elements that are marked as null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For top-level (which this is), I believe you are correct. Good catch!
// Size of innermost_list: 1 + 1 + 2*(128 + 1) + 2*128 = 516 | ||
// Size of inner object: 1 + 4 + 2*256 + 3*(256 + 1) + 256 * 516 = 133384 | ||
// Size of json: 1 + 4 + 2*256 + 4*(256 + 1) + 256 * 133384 = 34147849 | ||
assert_eq!(value.len(), 34147849); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this analysis -- the test seems overly sensitive to small pturbatons
/// Same as `Index::index`, but with the correct lifetime. | ||
pub(crate) fn get_infallible(&self, i: usize) -> &'m str { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment.
Minor nit: directly using get(i).unwrap()
at relevant locations rather than add a new function might be easier to understand than understanding when to use get_infallible
🤔
impl From<Vec<u8>> for ValueBuffer { | ||
fn from(value: Vec<u8>) -> Self { | ||
Self(value) | ||
} | ||
} | ||
|
||
impl From<ValueBuffer> for Vec<u8> { | ||
fn from(value_buffer: ValueBuffer) -> Self { | ||
value_buffer.0 | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it isn't needed I support removing it (we can always add it back later)
fn append_object(state: ParentState<'_>, obj: VariantObject) { | ||
let mut object_builder = ObjectBuilder::new(state, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea -- it makes a lot of sense to me
pub trait MetadataBuilder: std::fmt::Debug { | ||
fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, ArrowError>; | ||
fn field_name(&self, field_id: usize) -> &str; | ||
fn num_field_names(&self) -> usize; | ||
fn truncate_field_names(&mut self, new_size: usize); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API makes sense to me -- I think it would be nice to add some comments, but in general 👍
/// You can use an existing `Vec<u8>` as the metadata buffer by using the `from` impl. | ||
#[derive(Default, Debug)] | ||
struct MetadataBuilder { | ||
pub struct MetadataBuilderXX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about OwnedMetadataBuilder
or DefaultMetadataBuilder
/// The starting offset in the parent's buffer where this object starts | ||
parent_value_offset_base: usize, | ||
/// The starting offset in the parent's metadata buffer where this object starts | ||
/// used to truncate the written fields in `drop` if the current object has not been finished | ||
parent_metadata_offset_base: usize, | ||
/// Whether the object has been finished, the written content of the current object | ||
/// will be truncated in `drop` if `has_been_finished` is false | ||
has_been_finished: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is very nice that the parent state handles this nicely now
|
||
/// Returns an object builder that can be used to append a new (nested) object to this object. | ||
/// | ||
/// Panics if the requested key cannot be inserted (e.g. because it is a duplicate). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason it could panic is that the builder has read only metadata, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way we could avoid panic's would be to defer errors until the final finish()
is called 🤔
I think this is fine for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it could panic [if] the builder has read only metadata, right?
Yep! And having the same method fail for both error cases is really appealing to me.
(I just wrote the code comment before adding the read-only builder to this PR)
We could avoid panic's [by deferring] errors until the final finish() is called
The goal was to make finish
signature infallible, now that it actually is infallible in practice. I just didn't want to bloat this PR with the extra churn that would cause.
Plus, it always seemed weird to do extra work to defer an obvious known failure until later. Is there a disadvantage to failing immediately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus, it always seemed weird to do extra work to defer an obvious known failure until later. Is there a disadvantage to failing immediately?
the only disadvantage is that it is annoying if you have to do error checking for operations that can't fail - like making a new object with owned metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The top-level builder was, and remains infallible in both its new_object
and finish
methods?
Here we're dealing with the object builder's (nested) new_object
, which takes a field name and so could fail due to duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And actually... it wasn't even checking in the first place:
fn new_list(&mut self) -> ListBuilder<'_> { | ||
self.try_new_list().unwrap() | ||
} | ||
|
||
fn new_object(&mut self) -> ObjectBuilder<'_> { | ||
self.try_new_object().unwrap() | ||
} | ||
|
||
fn new_object(&mut self) -> ObjectBuilder<'_>; | ||
fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>; | ||
|
||
fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, me too -- it is pretty annoying to have to check for errors on every single call to creating these objects, even when it can not fail (e.g. creating in memory variants)
Another thought I had was we could potentially defer errors that could arise into the call to build()
-- but then that might results in hard to track down errors 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the direction makes a lot of sense to me. I left some inline comments, but basically I think it is ready to polish up and get ready to merge (if you can break it into smaller PRs that would be nice)
Thanks for the quick review!
This will definitely become a stream of smaller/scoped PR, I just wanted a sanity check before starting down that path.
/// Same as `Index::index`, but with the correct lifetime. | ||
pub(crate) fn get_infallible(&self, i: usize) -> &'m str { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TheIndex::index
method has the following signature:
fn index(&self, i: usize) -> &Self::Output
which is really
fn index<'a>(&'a self, i: usize) -> &'a Self::Output
and because the (elided) 'a
is not related to 'm
, we can't use the return value in a place that requires &'m str
.
You're right that self.get(i).unwrap()
would also work; I was just trying to preserve the expect
at both call sites. Should we just unwrap
everywhere and be done with it?
NOTE: The get_infallible
should be private, except then Index
can't access it.
let value_length = 0; | ||
self.value_locations.push((value_offset, value_length)); | ||
self.metadata_offsets.push(self.metadata_builder.offset()); | ||
self.value_offsets.push(self.value_builder.offset()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For top-level (which this is), I believe you are correct. Good catch!
pub trait MetadataBuilder: std::fmt::Debug { | ||
fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, ArrowError>; | ||
fn field_name(&self, field_id: usize) -> &str; | ||
fn num_field_names(&self) -> usize; | ||
fn truncate_field_names(&mut self, new_size: usize); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs comments for sure!
/// You can use an existing `Vec<u8>` as the metadata buffer by using the `from` impl. | ||
#[derive(Default, Debug)] | ||
struct MetadataBuilder { | ||
pub struct MetadataBuilderXX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah my other PR used DefaultMetadataBuilder
, but I wasn't sure the name conveyed what it actually does differently than a "not default" (or "not owned") metadata builder might do? Maybe BasicMetadataBuilder
would convey the fact that any other impl must be doing something fancy/special (= not basic)?
|
||
/// Returns an object builder that can be used to append a new (nested) object to this object. | ||
/// | ||
/// Panics if the requested key cannot be inserted (e.g. because it is a duplicate). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it could panic [if] the builder has read only metadata, right?
Yep! And having the same method fail for both error cases is really appealing to me.
(I just wrote the code comment before adding the read-only builder to this PR)
We could avoid panic's [by deferring] errors until the final finish() is called
The goal was to make finish
signature infallible, now that it actually is infallible in practice. I just didn't want to bloat this PR with the extra churn that would cause.
Plus, it always seemed weird to do extra work to defer an obvious known failure until later. Is there a disadvantage to failing immediately?
fn new_list(&mut self) -> ListBuilder<'_> { | ||
self.try_new_list().unwrap() | ||
} | ||
|
||
fn new_object(&mut self) -> ObjectBuilder<'_> { | ||
self.try_new_object().unwrap() | ||
} | ||
|
||
fn new_object(&mut self) -> ObjectBuilder<'_>; | ||
fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>; | ||
|
||
fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing that really put me off of tracking deferred errors is, it became yet another thing that has to be rolled back if the offending builder never finishes. I'm pretty sure we weren't rolling it back properly before, and I didn't want to complicate the ParentState
stuff even further by "fixing" it, if we can just eliminate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution. This is very nice; we can now handle the rollback logic more elegantly.
} | ||
|
||
// Performs any parent-specific aspects of rolling back a builder if an insertion failed. | ||
fn rollback(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will rollback_if_needed
be better here?
"Field name '{field_name}' not found in metadata", | ||
))); | ||
}; | ||
self.known_field_names.insert(self.metadata.get_infallible(field_id), field_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we not filling known_field_names
in at initialization due to performance considerations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes -- this specific object's field names are an arbitrarily small subset of the metadata dictionary. In an extreme case, the dictionary could contain thousands of entries while the specific object being manipulated contains only a handful of fields.
@klion26 -- thanks for the review! This pathfinding PR cannot merge (too big and messy), but I pulled out the first round of rollback changes as a cleaned up and better scoped PR here: PTAL, if you have time? |
I'll go ahead and close this now. The work items are all tracked by And almost everything has merged except |
Which issue does this PR close?
Pathfinding. Even if we like what we see here, it needs to split into a half dozen other PR in order to actually merge. But first we need to see the big picture to know whether it's worth going further.
Several flavors of pathfinding rolled up in one, but with the overarching goal of:
An alternate approach to
Rationale for this change
While trying to rebase #7915, I kept running into more and more ugliness. And it was already kind of ugly.
Also, I realized that the
ValueBuffer
use of builders with a genericParentState::Variant
was incorrect in some rollback scenarios where the attempt to insert a value fails (try_insert_object
ortry_insert_list
).Further, the interaction between
VariantBuilder
andVariantArrayBuilder
was quite convoluted.It turns out all of those issues are a bit related, and this PR is the result of my exploration.
What changes are included in this PR?
A lot! In no particular order:
ValueBuffer
asValueBuilder
and make it publicParentState
to be much clearer and hopefully easier to usefinished
status on behalf of the builders that use it (their drop glue goes away)finish
is not called. In particular, the value and metadata buffer offsets are tracked and rolled back uniformly.finish
after all.ObjectBuilder
methods become fallible, because they immediately detect and report duplicated field names when that checking is enabled.VariantBuilder::finish
becomes infallible (because any failed field insertion is caught immediately by the now-fallibleObjectBuilder
methods). However, the method still returnsResult::Ok
for now, to avoid excessive unit test churn (removing dozens ofunwrap
and?
calls scattered all over the place)ValueBuilder::[try_]append_variant
to be an associated function instead of a method, taking aParentState
from its caller in order to ensure correct rollback semantics.ParentState
publicMetadataBuilder
struct becomes reusable, similar to normal arrow array builders -- itsfinish
method takes&mut self
.VariantArrayBuilder
now directly instantiates aValueBuilder
andMetadataBuilder
pair up front, and converts them to byte slices only when its ownfinish
method is called. It no longer attempts to create aVariantBuilder
at any point, and works directly with theValueBuilder
API instead. No moremem::take
magic needed -- theVariantArrayVariantBuilder
helper class just takes a mut ref to its owner and its finish method handles the rest; the drop glue becomes empty (just to help catch forgottenfinish
calls) because theParentState
now deals with all needed cleanup.VariantArrayBuilder
also tracks starting offsets instead of(offset, len)
pairs -- the latter is easily derived from the former.MetadataBuilder
into a trait (the original struct is renamed asMetadataBuilderXX
for now, final naming still TBD). Most methods that previously took aMetadataBuilder
struct now take a&mut dyn MetadataBuilder
instead.VariantArrayBuilder
, now that it's no longer needed.new_object
andnew_list
toVariantBuilderExt
trait. That way, the existing unit test code that calls those methods can keep using the infallible versions (panicking if there's a duplicate field name). Prod code that actually cares calls the new infallible versions instead.ReadOnlyMetadataBuilder
that implements theMetadataBuilder
trait but "mounts" an existingVariantMetadata
. It takes advantage of the fact that object field insertions are now fallible, in order to return an error if the requested field name does not exist in the underlying metadata dictionary.Are these changes tested?
Partly. I need to port over the unit tests from #7915 to fully exercise the new
ReadOnlyVariantBuilder
.Are there any user-facing changes?
Yes, lots of stuff became public.
Also, the semantics of some functions changed (breaking change)