-
Notifications
You must be signed in to change notification settings - Fork 45
feat: avro schema add sanitize field name #190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: avro schema add sanitize field name #190
Conversation
Please add test cases. Thanks! |
Ok |
@wgtmac Hi, cpp-linter report an error, maybe I forgot to open an issue cause it. Can you retry it? Thanks!
|
@@ -163,4 +163,9 @@ Result<::avro::NodePtr> MakeAvroNodeWithFieldIds(const ::avro::NodePtr& original | |||
Result<::avro::NodePtr> MakeAvroNodeWithFieldIds(const ::avro::NodePtr& original_node, | |||
const NameMapping& mapping); | |||
|
|||
/// \brief Sanitize a field name to make it compatible with Avro field name requirements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please also add a bool ValidateFieldName(std::string_view)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, it would be good to explicitly state what are the Avro field name requirements
in the comment.
@@ -65,6 +65,31 @@ ::avro::CustomAttributes GetAttributesWithFieldId(int32_t field_id) { | |||
|
|||
} // namespace | |||
|
|||
std::string SanitizeFieldName(std::string_view field_name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation is obviously wrong. Please double check the Java impl.
// TODO(gangwu): sanitize field name | ||
(*node)->addName(std::string(sub_field.name())); | ||
// Sanitize field name to ensure it follows Avro field name requirements | ||
std::string sanitized_name = SanitizeFieldName(sub_field.name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In most cases, field names do not need sanitization. This may introduce a large regression.
std::string sanitized_name = SanitizeFieldName(sub_field.name()); | ||
|
||
// Store original name as a custom attribute if it was modified | ||
if (sanitized_name != sub_field.name()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, we can avoid string comparison by validating the original field name at the beginning.
@@ -839,7 +922,15 @@ Result<::avro::NodePtr> CreateRecordNodeWithFieldIds(const ::avro::NodePtr& orig | |||
// Recursively apply field IDs to nested fields | |||
ICEBERG_ASSIGN_OR_RAISE(auto new_nested_node, | |||
MakeAvroNodeWithFieldIds(field_node, *nested_field)); | |||
new_record_node->addName(field_name); | |||
std::string sanitized_name = SanitizeFieldName(field_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It worths a common function.
EXPECT_EQ(SanitizeFieldName(""), "_empty"); | ||
|
||
// Field name with only special characters | ||
EXPECT_EQ(SanitizeFieldName("@#$"), "____"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this PR mostly generated by AI? This is obviously wrong. I'd suggest to read the Java implementation in detail and be responsible of every line of code to avoid wasting time of reviewers.
avro schema add sanitize field name issue #192