-
Notifications
You must be signed in to change notification settings - Fork 154
refactor: proper class for field info #1730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
pierrecamilleri
wants to merge
23
commits into
main
Choose a base branch
from
refactor/field_info
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Test passes, surprisingly. No special effort has been made to support `header_case` option, or "required" columns with `schema_sync`
996f64d to
8b72ae8
Compare
TODO still some tidy up : - Remove FieldsInfo, use header instead - Less error-prone way for `_normalize`
pierrecamilleri
commented
Feb 7, 2025
| ) | ||
| report = resource.validate() | ||
| assert report.valid | ||
| assert resource.schema.to_descriptor() == { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is removed as the schema is not modified anymore.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a refactoring PR.
Currently, a complex private object
field_infois created in "Table.__open_row_stream" (resources/table.py) and used in the (non-public) Row__init__method.In addition, taking into account the
schema_syncoption leads to a lot of changes at many places, and it is very intricate and error-prone.This PR introduces the same functionality without the field info, and with a proper implementation for
schema_sync.Sorry for the complicated review ! I try to provide a clear explanation to help with it.
Details
As a reminder, the
schema_syncoption allows to change the order of columns in the data, to drop columns (except if required) and to add extra columns. Even if it will soon be probably deprecated (better way to control this in the v2 spec), the changes introduced here will help to implement the v2 changes.First, the
schema_syncoption modifies the schema itself, which is a bad idea because 1. it deceives the expectation to find the schema as provided and not modified, 2. some schema fields need to be kept on hand, e.g. missing required columns, to be able to properly raise appropriate errors. This leads to very intricate code, where these fields would be kept for the header validation and dropped for the row validation.Taking into account the
schema_syncoption needs to happen with both schema and labels on hand : so allschema_syncspecific code has been moved from the "detector.py" to the "header.py" file.The
Headerclass can now directly deal with identifying missing required columns (_get_missing_fieldsmethod) or extra labels (_get_extra_labels). Before this change, they were determined by comparing the (possibly modified) schema and the data. The header now also provides schema fields associated to the columns expected in the data (which depend on schema sync option), in a single step (get_expected_fieldsmethod). These methods deal withschema_sync, but will be able to deal with a large range of expectations as with the v2 specfieldsMatchproperty in the future.These expected fields are provided to the Row, and serve the same role as the former
FieldInfo.Changes orthogonal to the refactoring
Row.__str__andRow.__repr__were having side effects - the row was processed if it was observed. This is error-prone, and bit me as setting breakpoints for a debugger would change the behavior because of this.WIP notes
Next steps / investigation:
create_cell_readerfunction, instead of a more directread_cell, which at first glance would simplify the logic.create_cell_reader(maybe to reuse thevalue_reader). This does not seem the right place.value_readeronce and for all (but same question, why create avalue_readerinstead of aread_rowmethod..indexeach time it is needed.schema_sync=True, instead, create a separate list or mapping of the actual data fields.I found a (bad) reason for why there is these
create_value_readerandcreate_cell_reader. My attempt was to replace with the following mechanism: define what needs to be defined once and for all at field creation, and then use directread_cellandparse_value. This fails because fields are not initialized in a valid or definitive state, but some properties are changed after its initialization. For instance, fields are mutated at Schema initialization, that changes their behavior (for instance, taking into account "Schema.missing_fields" property). This is hard to change, because of the comboattrs+Metadata.from_descriptor, that make changing the initialization painful.