Commit e850c1b

PARQUET-2249: Update with suggestions
This commit updates the proposal based on the suggestions in the PR.

The biggest change is that readers are no longer expected to check min == max == NaN to find only-NaN pages. Instead, they should check nan_count + null_count == num_values in pages and nan_pages[x] == true in the column index. This way, we no longer rely on NaN comparison rules, and readers can and should continue to ignore NaN values they find in bounds.

However, as was pointed out, we cannot write "no value" into the min and max bounds in the column index, as this would not be compatible with legacy readers. Writers must write something there, so they are now advised to write NaN. Readers will still ignore that value and must instead rely on the new nan_pages field.

In addition, two further suggestions were implemented:

* Removed the duplicate explanation from README.md. It now only points to parquet.thrift as the source of truth.
* Softened the wording from "nan_count fields should always be set" to "it is suggested to always set the nan_count fields".
1 parent 2f3449e commit e850c1b
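
The page-level reader check described in the commit message can be sketched as follows. This is only an illustration: `PageStats` and `is_only_nan_page` are hypothetical names, not part of any Parquet library. The sketch also assumes that an all-null page should not count as "only NaN", so it additionally requires `nan_count > 0`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStats:
    """Hypothetical stand-in for per-page statistics; not a real Parquet API."""
    num_values: int
    null_count: Optional[int] = None
    nan_count: Optional[int] = None


def is_only_nan_page(stats: PageStats) -> Optional[bool]:
    """Return True/False when the counts decide the question, None when unknown."""
    if stats.nan_count is None or stats.null_count is None:
        # Counts were not written (e.g. by a legacy writer): nothing can be deduced.
        return None
    if stats.nan_count == 0:
        # Assumption: a page without NaNs (including an all-null page) is not "only NaN".
        return False
    # Every value is either null or NaN.
    return stats.nan_count + stats.null_count == stats.num_values
```

Note that min and max are never consulted here, so the result does not depend on how NaN compares to other values.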

2 files changed: +31 -31 lines changed

README.md

Lines changed: 1 addition & 22 deletions
@@ -161,28 +161,7 @@ following rules:
  * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
  signed zeros. The details are documented in the
  [Thrift definition](src/main/thrift/parquet.thrift) in the
- `ColumnOrder` union. They are summarized here but the Thrift definition
- is considered authoritative:
- * The following compatibility rules should be applied when reading statistics:
- * If the nan_count field is set to > 0 and both min and max are
- NaN, a reader can rely on that all non-NULL values are NaN
- * Otherwise, if the min or the max is a NaN, it should be ignored.
- * When looking for NaN values, min and max should be ignored;
- if the nan_count field is set, it should be used to check whether
- NaNs are present.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When writing statistics the following rules should be followed:
- * The nan_count fields should always be set for FLOAT and DOUBLE columns.
- * NaNs should not be written to min or max statistics fields except
- when all non-NULL values are NaN, in which case min and max should
- both be written as NaN. If the nan_count field is set, this semantics
- is mandated and readers may rely on it.
- * If the computed max value is zero (whether negative or positive),
- `+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
- `-0.0` should be written into the min statistics field.
-
+ `ColumnOrder` union.
  * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
  comparison.

src/main/thrift/parquet.thrift

Lines changed: 30 additions & 9 deletions
@@ -223,7 +223,10 @@ struct Statistics {
  */
  5: optional binary max_value;
  6: optional binary min_value;
- /** count of NaN values in the column; only present if type is FLOAT or DOUBLE */
+ /**
+ * count of NaN values in the column; only present if physical type is FLOAT
+ * or DOUBLE
+ */
  7: optional i64 nan_count;
  }

@@ -890,8 +893,11 @@ union ColumnOrder {
  * (*) Because the sorting order is not specified properly for floating
  * point values (relations vs. total ordering), the following compatibility
  * rules should be applied when reading statistics:
- * - If the nan_count field is set to > 0 and both min and max are
- * NaN, a reader can rely on that all non-NULL values are NaN
+ * - If the min is a NaN, it should be ignored.
+ * - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-NULL
+ * values are NaN.
  * - Otherwise, if the min or the max is a NaN, it should be ignored.
  * - When looking for NaN values, min and max should be ignored;
  * if the nan_count field is set, it can be used to check whether
@@ -900,11 +906,11 @@ union ColumnOrder {
  * - If the max is -0, the row group may contain +0 values as well.
  *
  * When writing statistics the following rules should be followed:
- * - The nan_count fields should always be set for FLOAT and DOUBLE columns.
+ * - It is suggested to always set the nan_count fields for FLOAT and
+ DOUBLE columns.
  * - NaNs should not be written to min or max statistics fields except
- * when all non-NULL values are NaN, in which case min and max should
- * both be written as NaN. If the nan_count field is set, this semantics
- * is mandated and readers may rely on it.
+ * in the column index, where a valid value has to be written in
+ * case of only-NaN pages.
  * - If the computed max value is zero (whether negative or positive),
  * `+0.0` should be written into the max statistics field.
  * - If the computed min value is zero (whether negative or positive),
@@ -963,7 +969,9 @@ struct ColumnIndex {
  * using them by inspecting null_pages.
  * For columns of type FLOAT and DOUBLE, NaN values are not to be included
  * in these bounds unless all non-null values in a page are NaN, in which
- * case min and max are to be set to NaN.
+ * case min and max should be set to NaN. Readers should always ignore NaN
+ * values in the bounds; they should check nan_pages to detect the "all
+ * non-null values are NaN" case.
  */
  2: required list<binary> min_values
  3: required list<binary> max_values
@@ -979,9 +987,22 @@ struct ColumnIndex {
  /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts

+ /**
+ * A list of Boolean values to determine pages that contain only NaNs. Only
+ * present for columns of type FLOAT and DOUBLE. If true, all non-null
+ * values in a page are NaN. Writers are suggested to set the corresponding
+ * entries in min_values and max_values to NaN, so that all lists have the same
+ * length and contain valid values. If false, then either all values in the
+ * page are null or there is at least one non-null non-NaN value in the page.
+ * As readers are supposed to ignore all NaN values in bounds, legacy readers
+ * who do not consider nan_pages yet are still able to use the column index
+ * but are not able to skip only-NaN pages.
+ */
+ 6: optional list<bool> nan_pages
+
  /** A list containing the number of NaN values for each page. Only present
  * for columns of type FLOAT and DOUBLE. **/
- 6: optional list<i64> nan_counts
+ 7: optional list<i64> nan_counts
  }

  struct AesGcmV1 {
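
The reader behavior that the `ColumnIndex` comments describe (ignore NaN bounds, use `nan_pages` to skip only-NaN pages) might look roughly like the sketch below for a range filter `lo <= v <= hi` on a DOUBLE column. The function `pages_to_read` and its parameter layout are illustrative assumptions, not an existing reader API. Bounds are assumed to be PLAIN-encoded, that is, 8-byte little-endian IEEE 754 doubles.

```python
import math
import struct
from typing import List, Optional, Sequence


def decode_double(raw: bytes) -> float:
    """Decode an 8-byte little-endian IEEE 754 double bound."""
    return struct.unpack("<d", raw)[0]


def pages_to_read(
    min_values: Sequence[bytes],
    max_values: Sequence[bytes],
    null_pages: Sequence[bool],
    nan_pages: Optional[Sequence[bool]],
    lo: float,
    hi: float,
) -> List[int]:
    """Indices of pages that may hold a non-NaN value v with lo <= v <= hi."""
    keep = []
    for i in range(len(min_values)):
        if null_pages[i]:
            continue  # page holds only nulls; its bounds are not meaningful
        if nan_pages is not None and nan_pages[i]:
            continue  # all non-null values are NaN, so no value can be in range
        page_min = decode_double(min_values[i])
        page_max = decode_double(max_values[i])
        # A NaN bound carries no information and is ignored rather than compared.
        if not math.isnan(page_max) and page_max < lo:
            continue
        if not math.isnan(page_min) and page_min > hi:
            continue
        keep.append(i)
    return keep
```

When `nan_pages` is `None`, the sketch behaves like a legacy reader: an only-NaN page has NaN bounds, which are ignored, so the page is kept and read rather than skipped.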

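On the writer side, the rules in the `ColumnOrder` and `ColumnIndex` comments (exclude NaN from bounds, normalize zero bounds, write NaN bounds only for only-NaN pages in the column index) could be sketched as below. `page_bounds` is a hypothetical helper, values are modeled as Python floats with `None` for null, and returning `None` bounds for an all-null page is an assumption of this sketch rather than something the commit specifies.

```python
import math
from typing import Optional, Sequence, Tuple


def page_bounds(
    values: Sequence[Optional[float]],
) -> Tuple[Optional[float], Optional[float], int, int, bool]:
    """Return (min, max, null_count, nan_count, only_nan) for one page."""
    null_count = sum(1 for v in values if v is None)
    nan_count = sum(1 for v in values if v is not None and math.isnan(v))
    # Values that take part in ordering: non-null and non-NaN.
    ordered = [v for v in values if v is not None and not math.isnan(v)]

    only_nan = nan_count > 0 and not ordered
    if only_nan:
        # Column index entry for an only-NaN page: a valid placeholder is required.
        return math.nan, math.nan, null_count, nan_count, True
    if not ordered:
        # All-null page (assumption: the caller marks it via null_pages).
        return None, None, null_count, nan_count, False

    lo, hi = min(ordered), max(ordered)
    if lo == 0.0:
        lo = -0.0  # a zero min is written as -0.0
    if hi == 0.0:
        hi = 0.0  # a zero max is written as +0.0
    return lo, hi, null_count, nan_count, False
```

The `only_nan` flag corresponds to the entry a writer would place in `nan_pages`, and `nan_count` to the entry in `nan_counts`.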