Skip to content

Conversation

nullccxsy
Copy link
Contributor

  1. add insensitive way to find schemafield(list, struct, map)
  2. change the complexity of find name to O(1)
  3. test insensitive way to find schemafield(list, struct, map)

@nullccxsy nullccxsy force-pushed the feature/case-insensitive-field-find branch from adc4817 to ec73749 Compare August 15, 2025 11:16
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a duplicate of #145, perhaps @mapleFU can take a look as well.

@nullccxsy nullccxsy force-pushed the feature/case-insensitive-field-find branch 4 times, most recently from 3b0653a to ae4515e Compare August 20, 2025 14:34
@nullccxsy nullccxsy changed the title feat(type): add insensitive way to find schemafield & test feat(type): add insensitive way to find schemafield Aug 21, 2025
@nullccxsy nullccxsy changed the title feat(type): add insensitive way to find schemafield feat(schema): add insensitive way to find schemafield Aug 21, 2025
@nullccxsy nullccxsy force-pushed the feature/case-insensitive-field-find branch 2 times, most recently from 0394c01 to ef4b2de Compare August 22, 2025 06:33
@nullccxsy nullccxsy requested a review from wgtmac August 22, 2025 07:03
NametoIdVisitor::GetPath(path, std::string(field.name()), case_sensitive_);
short_full_path = GetPath(short_path, std::string(field.name()), case_sensitive_);
name_to_id_[full_path] = field.field_id();
name_to_id_.emplace(short_full_path, field.field_id());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the return value and error out if failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it fails, it means that the short name conflicts with the canonical name. In this case, the short name is discarded, so don't need to check the return value.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! I think we are close to the finish line.

@wgtmac
Copy link
Member

wgtmac commented Aug 23, 2025

Please rename to title to feat: add find field (by id and name) support to schema

@nullccxsy nullccxsy changed the title feat(schema): add insensitive way to find schemafield feat: add find field (by id and name) support to schema Aug 23, 2025
@wgtmac
Copy link
Member

wgtmac commented Aug 23, 2025

Please note that the java impl has Function<String, String> quotingFunc. This is not required in this PR and we need to make sure the current implementation is not too hard to add the quoting function.

Comment on lines 88 to 90
Result<Status> InitIdToIndexMap() const;
Result<Status> InitNameToIndexMap() const;
Result<Status> InitLowerCaseNameToIndexMap() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they "Ensure" rather than Init?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May name it LazyIdToIndexMap ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose here is not "Init" ( Maybe reinitialize ) Mapping, it do Ensure the mapping to be "initialized", I guess EnusreIdToIndexMap() is ok to me.

Generally, LazyIdToIndexMap seems to be another api like Result<const map<...>&> LazyIdToIndexMap

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I feel a bit afraid about lazy initialize mapping, since it would be thread-unsafetly. But maybe it's not a severe problem here.

@wgtmac
Copy link
Member

wgtmac commented Aug 23, 2025

Personally I feel a bit afraid about lazy initialize mapping, since it would be thread-unsafetly. But maybe it's not a severe problem here.

For lazy initialization, we may leverage std::once_flag for thread-safety.

@mapleFU
Copy link
Member

mapleFU commented Aug 23, 2025

For lazy initialization, we may leverage std::once_flag for thread-safety.

So this requires a mutex and three once_flag? It might be huge but it's ok to me...

@nullccxsy
Copy link
Contributor Author

For lazy initialization, we may leverage std::once_flag for thread-safety.

So this requires a mutex and three once_flag? It might be huge but it's ok to me...

ok, this will be fixed

@mapleFU
Copy link
Member

mapleFU commented Aug 24, 2025

ok, this will be fixed

We can support this is another patch, currently we can add a todo here.

@nullccxsy nullccxsy force-pushed the feature/case-insensitive-field-find branch 3 times, most recently from 0f85748 to f0dbff4 Compare August 25, 2025 09:28
@nullccxsy nullccxsy force-pushed the feature/case-insensitive-field-find branch from 9de800c to c3b4bd3 Compare August 25, 2025 12:28
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @nullccxsy!

@nullccxsy nullccxsy force-pushed the feature/case-insensitive-field-find branch 2 times, most recently from d5f8d92 to 4df8eb7 Compare August 27, 2025 08:29
///
/// Enables std::unordered_map to directly accept std::string_view lookup keys
/// without creating temporary std::string objects, using C++20's transparent lookup.
struct string_hash {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to move it to string_utils.h so type.h can reuse it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can do this altogether in #194

nullccxsy added 10 commits August 27, 2025 18:52
- Add GetFieldByName method to support nested field queries with dot notation
- Implement InitNameToIndexMap and InitIdToIndexMap for field mapping
- Introduce SchemaFieldVisitor for schema traversal
- Implement name pruning logic for element and value in nested structure
- Introduce IdVisitor for schema traversal to init id_to_index_(Map)
- Introduce NameVisitor for schema traversal to init name_to_index_(Map), lowercase_name_to_index_(map)
…ava impl

1. Refactor IdVisitor: Add VisitNested() to handle NestedType, make Visit() generic for all Type hierarchy (avoid anti-pattern)
2. Merge id_to_index_ + full_schemafield_ into id_to_field_ (std::unordered_map<int32_t, const ref<SchemaField>>) to reduce memory footprint
3. Align with Java impl's idToField design (support future alias/identifier fields without refactor)
- change the type of id from size_t to int_32
- rename GetPath to BuildPath
- fix bug new_short_path should build from short_path, not path
…sult call in macros

- Added logic to detect and handle duplicate names in NameToIdVisitor to prevent invalid schema errors.
- Fixed duplicate result call issue in src/iceberg/util/macros.h by optimizing ICEBERG_RETURN_UNEXPECTED macro.
… conversion function

- Implemented detection for duplicate field IDs, Names in visitors, returning kInvalidSchema error with message.
- Updated lowercase conversion in NameToIdVisitor
…reduce copies and unify initialization

- Switched field_ and schema_ members in test classes to std::unique_ptr for unique ownership and to avoid duplicate and avoid creating duplicate objects.
@nullccxsy nullccxsy force-pushed the feature/case-insensitive-field-find branch from 17bcac2 to 9aa221d Compare August 27, 2025 10:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants