This commit updates the proposal based on the suggestions in the PR.
The biggest change is that readers are no longer expected to check
min == max == NaN to find only-NaN pages. Instead, they should check
nan_count + null_count == num_values in pages and nan_pages[x] == true
in the column index. This way, we no longer rely on NaN comparison
rules and readers can and should continue to ignore NaN values they
find in bounds.
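
The reader-side check described above can be sketched roughly as follows.
This is only an illustration: PageStats, is_only_nan_page and
only_nan_page_indexes are hypothetical names, not part of parquet.thrift
or of any existing reader. The field names (nan_count, null_count,
num_values, nan_pages) are the ones used in this proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageStats:
    # Hypothetical stand-in for page-level statistics, not the Thrift struct.
    num_values: int                  # all values in the page, including nulls
    null_count: int
    nan_count: Optional[int] = None  # None if the writer did not set it


def is_only_nan_page(stats: PageStats) -> Optional[bool]:
    """Apply the page-level check from this proposal.

    Returns None when nan_count is missing, since nothing can be concluded.
    Note that a page containing only nulls also satisfies the equation.
    """
    if stats.nan_count is None:
        return None
    return stats.nan_count + stats.null_count == stats.num_values


def only_nan_page_indexes(nan_pages: List[bool]) -> List[int]:
    """Column-index check: pages x for which nan_pages[x] is true."""
    return [x for x, flag in enumerate(nan_pages) if flag]
```

Neither check compares anything against NaN, so the result does not
depend on NaN ordering rules.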
However, as was pointed out, we cannot write "no value" into min and max
bounds in the column index, as this would not be compatible with legacy
readers. Instead, writers must write something here. They are therefore
advised to write NaN, but readers will still ignore these bounds and
must rely on the new nan_pages field instead.
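
As a rough sketch of the writer side, assuming a page's non-null values
are available as Python floats: column_index_entry is a hypothetical
helper, not an existing API, and it ignores all-null pages, which are
handled separately.

```python
import math
from typing import Sequence, Tuple


def column_index_entry(non_null_values: Sequence[float]) -> Tuple[float, float, bool]:
    """Return (min, max, nan_page) for one page of a floating-point column.

    Assumes the page has at least one non-null value.
    """
    if not non_null_values:
        raise ValueError("expected at least one non-null value")
    # NaN values never contribute to the bounds.
    comparable = [v for v in non_null_values if not math.isnan(v)]
    if comparable:
        return min(comparable), max(comparable), False
    # Only-NaN page: write NaN as a placeholder so legacy readers still find
    # a value in min and max; new readers rely on the nan_page flag instead.
    return math.nan, math.nan, True
```

For example, column_index_entry([1.0, float("nan"), 3.0]) yields bounds
of 1.0 and 3.0 with the flag unset, while a page holding only NaN values
yields NaN bounds with the flag set.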
In addition, two further suggestions were implemented:
* Removed the duplicate explanation from README.md. It now only points
to parquet.thrift as the source of truth.
* Softened the wording from "nan_count fields should always be set" to
"it is suggested to always set the nan_count fields".