Skip to content

Conversation

MisterRaindrop
Copy link
Contributor

@MisterRaindrop MisterRaindrop commented Aug 22, 2025

avro schema add sanitize field name issue #192

@wgtmac
Copy link
Member

wgtmac commented Aug 23, 2025

Please add test cases. Thanks!

@MisterRaindrop
Copy link
Contributor Author

Please add test cases. Thanks!

Ok

@MisterRaindrop
Copy link
Contributor Author

MisterRaindrop commented Aug 24, 2025

@wgtmac Hi, cpp-linter report an error, maybe I forgot to open an issue cause it. Can you retry it? Thanks!

INFO:CPP Linter:0 clang-format-checks-failed
  INFO:CPP Linter:1 clang-tidy-checks-failed
  INFO:CPP Linter:1 checks-failed
  ERROR:CPP Linter:response returned 403 from POST https://api.github.com/repos/apache/iceberg-cpp/issues/190/comments with message: {"message":"Resource not accessible by integration","documentation_url":"https://docs.github.com/rest/issues/comments#create-an-issue-comment","status":"403"}
  Traceback (most recent call last):
    File "/home/runner/work/_actions/cpp-linter/cpp-linter-action/f91c446a32ae3eb9f98fef8c9ed4c7cb613a4f8a/venv/bin/cpp-linter", line 8, in <module>
      sys.exit(main())
    File "/home/runner/work/_actions/cpp-linter/cpp-linter-action/f91c446a32ae3eb9f98fef8c9ed4c7cb613a4f8a/venv/lib/python3.10/site-packages/cpp_linter/__init__.py", line 81, in main
      rest_api_client.post_feedback(files=files, args=args, clang_versions=clang_versions)
    File "/home/runner/work/_actions/cpp-linter/cpp-linter-action/f91c446a32ae3eb9f98fef8c9ed4c7cb613a4f8a/venv/lib/python3.10/site-packages/cpp_linter/rest_api/github_api.py", line 264, in post_feedback
      self.update_comment(
    File "/home/runner/work/_actions/cpp-linter/cpp-linter-action/f91c446a32ae3eb9f98fef8c9ed4c7cb613a4f8a/venv/lib/python3.10/site-packages/cpp_linter/rest_api/github_api.py", line 361, in update_comment
      self.api_request(url=comments_url, method=req_meth, data=payload)
    File "/home/runner/work/_actions/cpp-linter/cpp-linter-action/f91c446a32ae3eb9f98fef8c9ed4c7cb613a4f8a/venv/lib/python3.10/site-packages/cpp_linter/rest_api/__init__.py", line 124, in api_request
      response.raise_for_status()
    File "/home/runner/work/_actions/cpp-linter/cpp-linter-action/f91c446a32ae3eb9f98fef8c9ed4c7cb613a4f8a/venv/lib/python3.10/site-packages/requests/models.py", line 1026, in raise_for_status
      raise HTTPError(http_error_msg, response=self)
  requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://api.github.com/repos/apache/iceberg-cpp/issues/190/comments
  Error: Process completed with exit code 1.

@wgtmac
Copy link
Member

wgtmac commented Aug 24, 2025

image

You need to go to the files page to check specific linter errors.

@@ -163,4 +163,9 @@ Result<::avro::NodePtr> MakeAvroNodeWithFieldIds(const ::avro::NodePtr& original
Result<::avro::NodePtr> MakeAvroNodeWithFieldIds(const ::avro::NodePtr& original_node,
const NameMapping& mapping);

/// \brief Sanitize a field name to make it compatible with Avro field name requirements.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also add a bool ValidateFieldName(std::string_view)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, it would be good to explicitly state what are the Avro field name requirements in the comment.

@@ -65,6 +65,31 @@ ::avro::CustomAttributes GetAttributesWithFieldId(int32_t field_id) {

} // namespace

std::string SanitizeFieldName(std::string_view field_name) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is obviously wrong. Please double check the Java impl.

// TODO(gangwu): sanitize field name
(*node)->addName(std::string(sub_field.name()));
// Sanitize field name to ensure it follows Avro field name requirements
std::string sanitized_name = SanitizeFieldName(sub_field.name());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases, field names do not need sanitization. This may introduce a large regression.

std::string sanitized_name = SanitizeFieldName(sub_field.name());

// Store original name as a custom attribute if it was modified
if (sanitized_name != sub_field.name()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, we can avoid string comparison by validating the original field name at the beginning.

@@ -839,7 +922,15 @@ Result<::avro::NodePtr> CreateRecordNodeWithFieldIds(const ::avro::NodePtr& orig
// Recursively apply field IDs to nested fields
ICEBERG_ASSIGN_OR_RAISE(auto new_nested_node,
MakeAvroNodeWithFieldIds(field_node, *nested_field));
new_record_node->addName(field_name);
std::string sanitized_name = SanitizeFieldName(field_name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It worths a common function.

EXPECT_EQ(SanitizeFieldName(""), "_empty");

// Field name with only special characters
EXPECT_EQ(SanitizeFieldName("@#$"), "____");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this PR mostly generated by AI? This is obviously wrong. I'd suggest to read the Java implementation in detail and be responsible of every line of code to avoid wasting time of reviewers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants