Commit aecffd7

PARQUET-2249: Update with suggestions
This commit updates the proposal based on the suggestions in the PR.

The biggest change is that readers are no longer expected to check min == max == NaN to find only-NaN pages. Instead, they should check nan_count + null_count == num_values in pages and nan_pages[x] == true in the column index. This way, we no longer rely on NaN comparison rules, and readers can and should continue to ignore NaN values they find in bounds.

However, as was pointed out, we cannot write "no value" into min and max bounds in the column index, as this would not be compatible with legacy readers. Instead, writers must write something here. Therefore, they are now advised to write NaN here, but readers will still ignore this and must instead rely on the new nan_pages field.

In addition, two further suggestions were implemented:

* Removed the duplicate explanation from README.md. It now only points to parquet.thrift as the source of truth.
* Softened the wording from "nan_count fields should always be set" to "it is suggested to always set the nan_count fields".
1 parent 2f3449e commit aecffd7
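
The page-level check described in the commit message can be sketched as follows. This is a minimal illustration, not code from the commit: the function name and the plain-integer arguments are hypothetical stand-ins for decoded statistics fields, and the extra `nan_count > 0` guard is an assumption added here so that all-null pages are not reported as only-NaN, in line with the `nan_pages` description in the diff.

```python
from typing import Optional


def is_only_nan_page(
    nan_count: Optional[int],
    null_count: Optional[int],
    num_values: int,
) -> Optional[bool]:
    """Decide whether every non-null value in a page is NaN.

    Returns None when the optional counts are missing, because a reader
    cannot deduce anything in that case.
    """
    if nan_count is None or null_count is None:
        return None
    # Assumption: require at least one NaN so that an all-null page
    # (nan_count == 0, null_count == num_values) is not treated as only-NaN.
    return nan_count > 0 and nan_count + null_count == num_values
```

A reader would call this once per page; a `None` result means the page has to be treated as possibly containing non-NaN values.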

2 files changed: 33 additions, 34 deletions

README.md

Lines changed: 1 addition & 22 deletions
@@ -161,28 +161,7 @@ following rules:
  * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
  signed zeros. The details are documented in the
  [Thrift definition](src/main/thrift/parquet.thrift) in the
- `ColumnOrder` union. They are summarized here but the Thrift definition
- is considered authoritative:
- * The following compatibility rules should be applied when reading statistics:
- * If the nan_count field is set to > 0 and both min and max are
- NaN, a reader can rely on that all non-NULL values are NaN
- * Otherwise, if the min or the max is a NaN, it should be ignored.
- * When looking for NaN values, min and max should be ignored;
- if the nan_count field is set, it should be used to check whether
- NaNs are present.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When writing statistics the following rules should be followed:
- * The nan_count fields should always be set for FLOAT and DOUBLE columns.
- * NaNs should not be written to min or max statistics fields except
- when all non-NULL values are NaN, in which case min and max should
- both be written as NaN. If the nan_count field is set, this semantics
- is mandated and readers may rely on it.
- * If the computed max value is zero (whether negative or positive),
- `+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
- `-0.0` should be written into the min statistics field.
-
+ `ColumnOrder` union.
  * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
  comparison.

src/main/thrift/parquet.thrift

Lines changed: 32 additions & 12 deletions
@@ -223,7 +223,10 @@ struct Statistics {
  */
  5: optional binary max_value;
  6: optional binary min_value;
- /** count of NaN values in the column; only present if type is FLOAT or DOUBLE */
+ /**
+ * count of NaN values in the column; only present if physical type is FLOAT
+ * or DOUBLE
+ */
  7: optional i64 nan_count;
  }

@@ -890,21 +893,23 @@ union ColumnOrder {
  * (*) Because the sorting order is not specified properly for floating
  * point values (relations vs. total ordering), the following compatibility
  * rules should be applied when reading statistics:
- * - If the nan_count field is set to > 0 and both min and max are
- * NaN, a reader can rely on that all non-NULL values are NaN
- * - Otherwise, if the min or the max is a NaN, it should be ignored.
- * - When looking for NaN values, min and max should be ignored;
- * if the nan_count field is set, it can be used to check whether
+ * - If the min is a NaN, it should be ignored.
+ * - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-NULL
+ * values are NaN.
+ * - When looking for NaN values, min and max should be ignored.
+ * If the nan_count field is set, it can be used to check whether
  * NaNs are present.
  * - If the min is +0, the row group may contain -0 values as well.
  * - If the max is -0, the row group may contain +0 values as well.
  *
  * When writing statistics the following rules should be followed:
- * - The nan_count fields should always be set for FLOAT and DOUBLE columns.
+ * - It is suggested to always set the nan_count fields for FLOAT and
+ DOUBLE columns.
  * - NaNs should not be written to min or max statistics fields except
- * when all non-NULL values are NaN, in which case min and max should
- * both be written as NaN. If the nan_count field is set, this semantics
- * is mandated and readers may rely on it.
+ * in the column index, where a value has to be written incase of
+ * only-NaN pages.
  * - If the computed max value is zero (whether negative or positive),
  * `+0.0` should be written into the max statistics field.
  * - If the computed min value is zero (whether negative or positive),
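
The writer-side rules in this hunk can be illustrated with a short sketch. It is not taken from any Parquet implementation: `FloatStats` and `compute_float_statistics` are hypothetical names, values are assumed to be already decoded Python floats with `None` standing for NULL, and the sketch targets the `Statistics` struct only (so min and max stay unset when every non-null value is NaN).

```python
import math
from typing import Iterable, NamedTuple, Optional


class FloatStats(NamedTuple):
    min_value: Optional[float]  # None means "leave the field unset"
    max_value: Optional[float]
    nan_count: int
    null_count: int


def compute_float_statistics(values: Iterable[Optional[float]]) -> FloatStats:
    """Compute min/max/nan_count/null_count following the writer rules."""
    lo: Optional[float] = None
    hi: Optional[float] = None
    nan_count = 0
    null_count = 0
    for v in values:
        if v is None:
            null_count += 1
        elif math.isnan(v):
            # NaNs are counted but never enter min or max.
            nan_count += 1
        else:
            if lo is None or v < lo:
                lo = v
            if hi is None or v > hi:
                hi = v
    # Zero normalization: -0.0 == 0.0 compares equal, so both signs match here.
    if lo == 0.0:
        lo = -0.0  # a zero min is always written as -0.0
    if hi == 0.0:
        hi = 0.0   # a zero max is always written as +0.0
    return FloatStats(lo, hi, nan_count, null_count)
```

For example, `compute_float_statistics([0.0, float("nan"), None, -0.0])` yields a min of `-0.0`, a max of `+0.0`, and counts of one NaN and one null.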
@@ -963,7 +968,9 @@ struct ColumnIndex {
  * using them by inspecting null_pages.
  * For columns of type FLOAT and DOUBLE, NaN values are not to be included
  * in these bounds unless all non-null values in a page are NaN, in which
- * case min and max are to be set to NaN.
+ * case min and max should be set to NaN. Readers should always ignore NaN
+ * values in the bounds; they should check nan_pages to detect the "all
+ * non-null values are NaN" case.
  */
  2: required list<binary> min_values
  3: required list<binary> max_values
@@ -979,9 +986,22 @@ struct ColumnIndex {
  /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts

+ /**
+ * A list of Boolean values to determine pages that contain only NaNs. Only
+ * present for columns of type FLOAT and DOUBLE. If true, all non-null
+ * values in a page are NaN. Writers are suggested to set the corresponding
+ * entries in min_values and max_values to NaN, so that all lists have the same
+ * length and contain valid values. If false, then either all values in the
+ * page are null or there is at least one non-null non-NaN value in the page.
+ * As readers are supposed to ignore all NaN values in bounds, legacy readers
+ * who do not consider nan_pages yet are still able to use the column index
+ * but are not able to skip only-NaN pages.
+ */
+ 6: optional list<bool> nan_pages
+
  /** A list containing the number of NaN values for each page. Only present
  * for columns of type FLOAT and DOUBLE. **/
- 6: optional list<i64> nan_counts
+ 7: optional list<i64> nan_counts
  }

  struct AesGcmV1 {
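
To show how a reader might combine `null_pages`, `nan_pages`, and the NaN-ignoring rule for bounds, here is a hedged sketch of page selection for a range filter on non-NaN values. The function name is hypothetical, and the bounds are assumed to be already decoded into Python floats (in the format itself they are `binary`). Passing `nan_pages=None` models a legacy file or a reader without the new field.

```python
import math
from typing import List, Optional, Sequence


def pages_to_read(
    min_values: Sequence[float],
    max_values: Sequence[float],
    null_pages: Sequence[bool],
    nan_pages: Optional[Sequence[bool]],
    lower: float,
    upper: float,
) -> List[int]:
    """Return indices of pages that may hold a non-NaN value in [lower, upper]."""
    selected = []
    for i, (mn, mx) in enumerate(zip(min_values, max_values)):
        if null_pages[i]:
            continue  # page holds only nulls
        if nan_pages is not None and nan_pages[i]:
            continue  # all non-null values are NaN, none can match the range
        # A NaN bound carries no information and is ignored, so the
        # corresponding side of the range check passes unconditionally.
        above_lower = math.isnan(mx) or mx >= lower
        below_upper = math.isnan(mn) or mn <= upper
        if above_lower and below_upper:
            selected.append(i)
    return selected
```

Without `nan_pages`, an only-NaN page with NaN bounds is still read, which matches the note in the field comment that legacy readers can use the column index but cannot skip such pages.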
