refactor: proper class for field info #1730

pierrecamilleri · 2025-01-24T14:47:02Z

This is a refactoring PR.

Currently, a complex private object field_info is created in "Table.__open_row_stream" (resources/table.py) and used in the (non-public) Row __init__ method.

In addition, supporting the schema_sync option requires changes in many places, which makes the code very intricate and error-prone.

This PR provides the same functionality without the field info, and with a proper implementation of schema_sync.

Sorry for the complicated review! I have tried to provide a clear explanation to help with it.

Details

As a reminder, the schema_sync option allows the columns in the data to be reordered, dropped (unless required), or supplemented with extra columns. Even though it will probably be deprecated soon (the v2 spec offers a better way to control this), the changes introduced here will help to implement the v2 changes.
First, the schema_sync option modifies the schema itself, which is a bad idea because: 1. it breaks the expectation that the schema is available as provided, unmodified; 2. some schema fields need to be kept on hand, e.g. missing required columns, in order to raise the appropriate errors. This leads to very intricate code, where these fields are kept for the header validation and dropped for the row validation.
Handling the schema_sync option requires both the schema and the labels on hand, so all schema_sync-specific code has been moved from the "detector.py" file to the "header.py" file.
The Header class can now directly identify missing required columns (_get_missing_fields method) and extra labels (_get_extra_labels method). Before this change, they were determined by comparing the (possibly modified) schema and the data. The header now also provides, in a single step, the schema fields associated with the columns expected in the data, which depend on the schema_sync option (get_expected_fields method). These methods handle schema_sync, but in the future they will also be able to handle a wider range of expectations, such as the v2 spec fieldsMatch property.
These expected fields are provided to the Row, and serve the same role as the former FieldInfo.
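
To make the division of labour described above easier to picture, here is a minimal standalone sketch. It is not the code from this PR: the `Field` dataclass, the free functions and the placeholder field created for undeclared labels are all assumptions chosen for illustration. The point is only that the three answers can be computed from the schema fields and the labels without ever modifying the schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Field:
    # Hypothetical stand-in for a schema field; not the frictionless class.
    name: str
    required: bool = False


def get_missing_fields(fields: List[Field], labels: List[str]) -> List[Field]:
    """Required schema fields that have no matching label in the data."""
    return [f for f in fields if f.required and f.name not in labels]


def get_extra_labels(fields: List[Field], labels: List[str]) -> List[str]:
    """Labels found in the data that the schema does not declare."""
    names = {f.name for f in fields}
    return [label for label in labels if label not in names]


def get_expected_fields(
    fields: List[Field], labels: List[str], schema_sync: bool
) -> List[Field]:
    """Fields aligned with the data columns; `fields` itself is left untouched."""
    if not schema_sync:
        return list(fields)
    by_name = {f.name: f for f in fields}
    # Follow the label order; an undeclared label gets a placeholder field
    # (an assumption of this sketch).
    return [by_name.get(label, Field(name=label)) for label in labels]


schema_fields = [Field("id", required=True), Field("name"), Field("age", required=True)]
data_labels = ["name", "id", "extra"]

missing = get_missing_fields(schema_fields, data_labels)
extra = get_extra_labels(schema_fields, data_labels)
expected = get_expected_fields(schema_fields, data_labels, schema_sync=True)
```

Because `schema_fields` is never mutated, the missing required field stays available for header errors while the row only ever sees `expected`.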

Changes orthogonal to the refactoring

Row.__str__ and Row.__repr__ had side effects: the row was processed whenever it was displayed. This is error-prone, and it bit me: setting breakpoints in a debugger changed the behavior.
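
As an illustration of the problem, here is a small sketch of a lazily processed row whose `__repr__` only reports state and never triggers processing. `LazyRow` and its methods are hypothetical names for this example, not the frictionless Row class.

```python
class LazyRow:
    # Hypothetical minimal row; not the frictionless Row class.
    def __init__(self, cells):
        self._cells = list(cells)
        self._processed = None  # filled on first explicit access

    def _process(self):
        # Processing happens only when a caller asks for the values.
        if self._processed is None:
            self._processed = [cell.strip() for cell in self._cells]
        return self._processed

    def to_list(self):
        return list(self._process())

    def __repr__(self):
        # Read-only: report the current state without changing it.
        state = "processed" if self._processed is not None else "unprocessed"
        return f"LazyRow({self._cells!r}, {state})"


row = LazyRow([" a ", "b "])
shown_before = repr(row)   # what a debugger would display
values = row.to_list()     # explicit access triggers processing
shown_after = repr(row)
```

With this shape, inspecting the object in a debugger cannot change what later code observes.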

WIP notes

Next steps / investigation:

Explore the reason why there is a create_cell_reader function instead of a more direct read_cell, which at first glance would simplify the logic.
1. Some constraint parsing happens in create_cell_reader (maybe to reuse the value_reader). This does not seem to be the right place.
2. To create the value_reader once and for all (but same question: why create a value_reader instead of a read_row method?).
Can (should?) the performance be improved by not looking up the field number with .index each time it is needed?
Do not mess with schema fields when schema_sync=True; instead, create a separate list or mapping of the actual data fields.
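
For the `.index` question, the usual alternative is a mapping built once. The snippet below is a generic sketch with made-up field names, not a measurement or a proposal for the actual code.

```python
field_names = ["id", "name", "age"]

# Repeated lookup: each call scans the list from the start.
slow_position = field_names.index("age")

# Built once: later lookups are dictionary accesses.
position_by_name = {name: position for position, name in enumerate(field_names)}
fast_position = position_by_name["age"]
```

Whether this is worth it depends on how many fields a schema typically has; for a handful of columns the difference is likely negligible.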

I found a (bad) reason why create_value_reader and create_cell_reader exist. I tried to replace them with the following mechanism: define once and for all, at field creation, whatever needs to be defined, and then use direct read_cell and parse_value methods. This fails because fields are not initialized in a valid or definitive state: some properties are changed after initialization. For instance, fields are mutated at Schema initialization, which changes their behavior (for instance, to take the "Schema.missing_fields" property into account). This is hard to change because the combination of attrs and Metadata.from_descriptor makes changing the initialization painful.
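
The failure mode can be reproduced in a few lines. The class below is a hypothetical toy, not the frictionless Field, and its `missing_values` attribute is only an example of a property that gets changed after construction.

```python
class ToyField:
    # Hypothetical field; not the frictionless Field class.
    def __init__(self, missing_values=None):
        self.missing_values = missing_values or [""]
        # Precomputed at creation: a frozen copy of the state at that moment.
        self._missing_at_init = frozenset(self.missing_values)

    def read_cell_precomputed(self, cell):
        # Uses the snapshot taken in __init__, so later mutations are ignored.
        return None if cell in self._missing_at_init else cell

    def create_cell_reader(self):
        # Built on demand, so it captures the state the field has right now.
        missing = frozenset(self.missing_values)

        def read_cell(cell):
            return None if cell in missing else cell

        return read_cell


field = ToyField()
field.missing_values = ["", "NA"]  # mutation after init, as a schema might do

stale = field.read_cell_precomputed("NA")
fresh = field.create_cell_reader()("NA")
```

The precomputed method still treats "NA" as a regular value, while the reader created after the mutation treats it as missing. That is why precomputing at field creation only works once fields are fully configured at initialization.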

Test passes, surprisingly. No special effort has been made to support `header_case` option, or "required" columns with `schema_sync`

TODO still some tidy up : - Remove FieldsInfo, use header instead - Less error-prone way for `_normalize`

pierrecamilleri · 2025-02-07T15:48:03Z

frictionless/resource/__spec__/test_validate.py

    )
    report = resource.validate()
    assert report.valid
-    assert resource.schema.to_descriptor() == {


This test is removed as the schema is not modified anymore.

fix: parallel file validation for a datapackage

bef5fdc

pierrecamilleri marked this pull request as draft January 24, 2025 14:47

pierrecamilleri added 11 commits January 27, 2025 11:35

🔵 rename local variable

b0554a2

🔵 improve documentation

f20aab1

Mv validate methods & soft deprecate Validator

bc27f9b

fix: repair tests of parallel validation

0498e9b

Dispatching tests according to method change

e2f2a33

first attempt

ee4ad2f

squash! first attempt

39aa0f2

🔵 Mv to resources/table

70aadac

🔵 rename

add8dac

Schema sync functionnality inside FieldsInfo

e26393b

Test passes, surprisingly. No special effort has been made to support `header_case` option, or "required" columns with `schema_sync`

🔵 remove empty / unused file

8b72ae8

pierrecamilleri force-pushed the refactor/field_info branch from 996f64d to 8b72ae8 Compare January 29, 2025 14:18

pierrecamilleri changed the base branch from main to fix/parallel-datapackage January 29, 2025 14:18

pierrecamilleri added 2 commits January 29, 2025 18:10

🟢 Test passes

c7479c6

TODO still some tidy up : - Remove FieldsInfo, use header instead - Less error-prone way for `_normalize`

🟢 get rid of FieldInfo

9b7d0ad

pierrecamilleri commented Feb 7, 2025


pierrecamilleri added 9 commits February 7, 2025 16:51

remove unnecessary review noise

a80e79a

remove unused function

b7f693d

remove unused FieldInfo

77a2c9f

linting

305d2b9

Information on processing for Row.__str__ and Row.__repr__

91d3ffb

typo

575885b

unintended rename

c41590f

Remove __repr__ change as it is used for tests

37fd709

fix: oopsie

c6c90a2

Base automatically changed from fix/parallel-datapackage to main March 25, 2025 21:19
