
Commit c2116b0

PARQUET-2249: Introduce IEEE 754 total order & NaN-counts
This commit is a combination of the following PRs:

* Introduce IEEE 754 total order apache#221
* Add nan_count to handle NaNs in statistics apache#196

Both these PRs try to solve the same problems; read the description of the respective PRs for explanation.

This PR is the result of an extended discussion in which it was repeatedly brought up that another possible solution to the problem could be the combination of the two approaches. Please refer to this discussion on the mailing list and in the two PRs for details. The mailing list discussion can be found here: https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s

The contents of this PR are basically a straightforward combination of the two approaches:

* IEEE total order is introduced as a new order for floating point types
* nan_count and nan_counts fields are added

Legacy writers may not write nan_count(s) fields, so readers have to handle them being absent. Also, legacy writers may have included NaNs into min/max bounds, so readers also have to handle that.

As there are no legacy writers writing IEEE total order, nan_count(s) are defined to be mandatory if this order is used, so readers can assume their presence when this order is used.

This commit removes `nan_pages` from the ColumnIndex, which the nan_counts PR mandated. We don't need them anymore, as IEEE total order solves this: As this is a new order and there are thus no legacy writers and readers, we have the freedom to define that for only-NaN pages using this order, we can actually write NaNs into min & max bounds in this case, and readers can assume that isNaN(min) signals an only-NaN page.
1 parent 1dbc814 commit c2116b0
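
The reader-side consequences described in the commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: `FloatStats`, `usable_bounds`, and `contains_nan` are hypothetical names that are not part of parquet-format or of any Parquet library, and the statistics are assumed to be already decoded into `f64` values.

```rust
/// Hypothetical, simplified view of already-decoded statistics for a DOUBLE
/// column. This struct is an assumption made for illustration only.
#[derive(Debug, Clone, Copy)]
pub struct FloatStats {
    pub min: Option<f64>,
    pub max: Option<f64>,
    /// `None` models a legacy writer that did not write nan_count.
    pub nan_count: Option<i64>,
}

/// Returns (min, max) only if both bounds are present and neither is NaN.
/// Legacy writers may have put NaN into the bounds, in which case the
/// bounds carry no usable information and are ignored.
pub fn usable_bounds(stats: &FloatStats) -> Option<(f64, f64)> {
    match (stats.min, stats.max) {
        (Some(min), Some(max)) if !min.is_nan() && !max.is_nan() => Some((min, max)),
        _ => None,
    }
}

/// Returns `Some(true)` or `Some(false)` when nan_count is present and
/// `None` when it is absent. An absent nan_count must not be read as zero.
pub fn contains_nan(stats: &FloatStats) -> Option<bool> {
    stats.nan_count.map(|n| n > 0)
}
```

A caller that gets `None` from `contains_nan` would have to treat the data as possibly containing NaNs and could not skip it for a NaN lookup.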

3 files changed: +129 -9 lines changed

LogicalTypes.md

Lines changed: 4 additions & 1 deletion
@@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti
 
 The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
-The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
+The type-defined sort order for `FLOAT16` is signed (with special handling of NaNs and signed zeros),
+as for `FLOAT` and `DOUBLE`. It is recommended that writers use IEEE754TotalOrder when writing columns
+of this type for a well-defined handling of NaNs and signed zeros. See the `ColumnOrder` union in the
+[Thrift definition](src/main/thrift/parquet.thrift) for details.
 
 ## Temporal Types
 
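
For `FLOAT16`, the same bit-level approach that the Thrift definition shows for `DOUBLE` should carry over to 16-bit values. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, and it assumes the 2-byte value is stored in little-endian byte order.

```rust
/// Maps the raw bits of a half-precision float to an `i16` key whose signed
/// integer order matches IEEE 754 total order. Hypothetical helper name.
pub fn f16_total_order_key(bits: u16) -> i16 {
    let v = bits as i16;
    // For values with the sign bit set, flip all bits except the sign bit so
    // that larger magnitudes sort lower. Non-negative values are unchanged.
    v ^ (((v >> 15) as u16) >> 1) as i16
}

/// Returns true if x orders at or below y in IEEE 754 total order.
/// Assumption: both values are given as 2 little-endian bytes.
pub fn f16_total_order(x: [u8; 2], y: [u8; 2]) -> bool {
    f16_total_order_key(u16::from_le_bytes(x)) <= f16_total_order_key(u16::from_le_bytes(y))
}
```

With this key, `-0` (bits `0x8000`) sorts directly below `+0` (bits `0x0000`), negative NaNs sort below negative infinity, and positive NaNs sort above positive infinity.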
README.md

Lines changed: 3 additions & 1 deletion
@@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types].
 Parquet stores min/max statistics at several levels (such as Column Chunk,
 Column Index, and Data Page). These statistics are according to a sort order,
 which is defined for each column in the file footer. Parquet supports common
-sort orders for logical and primitive types. The details are documented in the
+sort orders for logical and primitive types and also special orders for types
+where the common sort order is not unambiguously defined (e.g., NaN ordering
+for floating point types). The details are documented in the
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 
 ## Nested Encoding

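The ambiguity mentioned in the README text can be made concrete with a small sketch. It contrasts the usual numeric comparison, which has no answer for NaN and treats `-0` and `+0` as equal, with IEEE 754 total order as exposed by Rust's standard `f64::total_cmp`. The helper name `compare_both` is hypothetical and exists only for this illustration.

```rust
use std::cmp::Ordering;

/// Compares two doubles twice: first with the usual partial numeric
/// comparison, then with IEEE 754 total order.
pub fn compare_both(x: f64, y: f64) -> (Option<Ordering>, Ordering) {
    (x.partial_cmp(&y), x.total_cmp(&y))
}
```

For a positive quiet NaN and `1.0`, the partial comparison yields `None` while the total order places the NaN above `1.0`. For `-0.0` and `0.0`, the partial comparison reports equality while the total order places `-0.0` first.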
src/main/thrift/parquet.thrift

Lines changed: 122 additions & 7 deletions
@@ -309,6 +309,13 @@ struct Statistics {
   7: optional bool is_max_value_exact;
   /** If true, min_value is the actual minimum value for a column */
   8: optional bool is_min_value_exact;
+  /**
+   * count of NaN values in the column; only present if physical type is FLOAT
+   * or DOUBLE, or logical type is FLOAT16.
+   * Readers MUST distinguish between nan_count not being present and nan_count == 0.
+   * If nan_count is not present, readers MUST NOT assume nan_count == 0.
+   */
+  9: optional i64 nan_count;
 }
 
 /** Empty structs to use as logical type annotations */
@@ -670,7 +677,7 @@ enum BoundaryOrder {
 /** Data page header */
 struct DataPageHeader {
   /**
-   * Number of values, including NULLs, in this data page.
+   * Number of values, including nulls, in this data page.
    *
    * If an OffsetIndex is present, a page must begin at a row
    * boundary (repetition_level = 0). Otherwise, pages may begin
@@ -717,9 +724,9 @@ struct DictionaryPageHeader {
  * The remaining section containing the data is compressed if is_compressed is true
  **/
 struct DataPageHeaderV2 {
-  /** Number of values, including NULLs, in this data page. **/
+  /** Number of values, including nulls, in this data page. **/
   1: required i32 num_values
-  /** Number of NULL values, in this data page.
+  /** Number of null values, in this data page.
       Number of non-null = num_values - num_nulls which is also the number of values in the data section **/
   2: required i32 num_nulls
   /**
@@ -1030,6 +1037,9 @@ struct RowGroup {
 /** Empty struct to signal the order defined by the physical or logical type */
 struct TypeDefinedOrder {}
 
+/** Empty struct to signal IEEE 754 total order for floating point types */
+struct IEEE754TotalOrder {}
+
 /**
  * Union to specify the order used for the min_value and max_value fields for a
  * column. This union takes the role of an enhanced enum that allows rich
@@ -1038,6 +1048,7 @@ struct TypeDefinedOrder {}
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
  *                      physical type (if there is no logical type).
+ * * IEEE754TotalOrder - the floating point column uses IEEE 754 total order.
  *
  * If the reader does not support the value of this union, min and max stats
  * for this column should be ignored.
@@ -1082,23 +1093,105 @@ union ColumnOrder {
    *   BYTE_ARRAY - unsigned byte-wise comparison
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating
+   *     point types due to underspecified handling of NaN and -0/+0,
+   *     it is recommended that writers use IEEE_754_TOTAL_ORDER
+   *     for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following
    *     compatibility rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-null
+   *       values are NaN.
+   *     - When looking for NaN values, min and max should be ignored.
+   *       If the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
    *     - When looking for NaN values, min and max should be ignored.
    *
    *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - It is suggested to always set the nan_count field for floating
+   *       point types, even if it is zero.
+   *     - NaNs should not be written to min or max statistics fields except
+   *       in the column index, where min_values and max_values are not optional
+   *       so a NaN value must be written if all non-null values in a page
+   *       are NaN.
    *     - If the computed max value is zero (whether negative or positive),
    *       `+0.0` should be written into the max statistics field.
    *     - If the computed min value is zero (whether negative or positive),
    *       `-0.0` should be written into the min statistics field.
    */
   1: TypeDefinedOrder TYPE_ORDER;
+
+  /*
+   * The floating point type is ordered according to the totalOrder predicate,
+   * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
+   * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering.
+   *
+   * Intuitively, this orders floats mathematically, but defines -0 to be less
+   * than +0, -NaN to be less than anything else, and +NaN to be greater than
+   * anything else. It also defines an order between different bit representations
+   * of the same value.
+   *
+   * The formal definition is as follows:
+   *   a) If x<y, totalOrder(x, y) is true.
+   *   b) If x>y, totalOrder(x, y) is false.
+   *   c) If x=y:
+   *     1) totalOrder(−0, +0) is true.
+   *     2) totalOrder(+0, −0) is false.
+   *     3) If x and y represent the same floating-point datum:
+   *       i) If x and y have negative sign, totalOrder(x, y) is true if and
+   *          only if the exponent of x ≥ the exponent of y
+   *       ii) otherwise totalOrder(x, y) is true if and only if the exponent
+   *           of x ≤ the exponent of y.
+   *   d) If x and y are unordered numerically because x or y is NaN:
+   *     1) totalOrder(−NaN, y) is true where −NaN represents a NaN with
+   *        negative sign bit and y is a non-NaN floating-point number.
+   *     2) totalOrder(x, +NaN) is true where +NaN represents a NaN with
+   *        positive sign bit and x is a non-NaN floating-point number.
+   *     3) If x and y are both NaNs, then totalOrder reflects a total ordering
+   *        based on:
+   *       i) negative sign orders below positive sign
+   *       ii) signaling orders below quiet for +NaN, reverse for −NaN
+   *       iii) lesser payload, when regarded as an integer, orders below
+   *            greater payload for +NaN, reverse for −NaN.
+   *
+   * Note that this ordering can be implemented efficiently in software by bit-wise
+   * operations on the integer representation of the floating point values.
+   * E.g., this is a possible implementation for DOUBLE in Rust:
+   *
+   *   pub fn totalOrder(x: f64, y: f64) -> bool {
+   *     let mut x_int = x.to_bits() as i64;
+   *     let mut y_int = y.to_bits() as i64;
+   *     x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
+   *     y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
+   *     return x_int <= y_int;
+   *   }
+   *
+   * When writing statistics for columns with this order, the following rules
+   * must be followed:
+   * - Writing the nan_count field is mandatory when using this ordering,
+   *   even if it is zero.
+   * - NaNs should not be written to min or max statistics fields except
+   *   in the column index, where min_values and max_values are not optional
+   *   so a NaN value must be written if all non-null values in a page
+   *   are NaN. In this case, the min_values[i] and max_values[i] fields
+   *   should be set to the smallest and largest NaN values contained
+   *   in the page, as defined by the IEEE 754 total order.
+   *
+   * When reading statistics for columns with this order, the following rules
+   * should be followed:
+   * - Readers should consult the nan_count field to determine whether NaNs
+   *   are present.
+   * - A reader can compute nan_count + null_count == num_values to deduce
+   *   whether all non-null values are NaN. In the page index, which does not
+   *   have a num_values field, the presence of a NaN value in min_values
+   *   or max_values indicates that all non-null values are NaN.
+   */
+  2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER;
 }
 
 struct PageLocation {
@@ -1170,6 +1263,19 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated before
    * using them by inspecting null_pages.
+   * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+   * NaN values are not to be included in these bounds. If all non-null values
+   * of a page are NaN, then a writer must do the following:
+   * - If the order of this column is TypeDefinedOrder, then a column index
+   *   must not be written for this column chunk. While this is unfortunate for
+   *   performance, it is necessary to avoid conflict with legacy files that
+   *   still included NaN in min_values and max_values even if the page had
+   *   non-NaN values. To mitigate this, IEEE_754_TOTAL_ORDER is recommended.
+   * - If the order of this column is IEEE_754_TOTAL_ORDER, then min_values[i]
+   *   and max_values[i] must be set to the smallest and largest NaN values
+   *   contained in the page, as defined by the IEEE 754 total order. Readers
+   *   can then treat a NaN in min_values[i] as the signal for an only-NaN
+   *   page.
    */
   2: required list<binary> min_values
   3: required list<binary> max_values
@@ -1193,7 +1299,6 @@ struct ColumnIndex {
    * null counts are 0.
    */
   5: optional list<i64> null_counts
-
   /**
    * Contains repetition level histograms for each page
    * concatenated together.  The repetition_level_histogram field on
@@ -1211,6 +1316,16 @@ struct ColumnIndex {
    * Same as repetition_level_histograms except for definitions levels.
    **/
   7: optional list<i64> definition_level_histograms;
+
+  /**
+   * A list containing the number of NaN values for each page. Only present
+   * for columns of physical type FLOAT or DOUBLE, or logical type FLOAT16.
+   * If this field is not present, readers MUST assume that there might or
+   * might not be NaN values in any page, as NaNs should not be included
+   * in min_values or max_values.
+   */
+  8: optional list<i64> nan_counts
+
 }
 
 struct AesGcmV1 {

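The writer rules for `IEEE_754_TOTAL_ORDER` in the column index could be sketched along the following lines. This is an illustration under stated assumptions rather than the behaviour of any existing writer: `PageBounds` and `page_bounds` are hypothetical names, the input slice is assumed to hold only the non-null values of one page, and the comparison relies on Rust's standard `f64::total_cmp`.

```rust
use std::cmp::Ordering;

/// Hypothetical per-page result: bounds for min_values[i] and max_values[i]
/// plus the entry for nan_counts[i].
#[derive(Debug, Clone, Copy)]
pub struct PageBounds {
    pub min: f64,
    pub max: f64,
    pub nan_count: i64,
}

/// Computes bounds over the non-null values of one page.
/// NaNs are excluded from the bounds unless every value is NaN, in which
/// case the smallest and largest NaN under total order are returned.
/// Returns `None` for a page without non-null values.
pub fn page_bounds(values: &[f64]) -> Option<PageBounds> {
    let nan_count = values.iter().filter(|v| v.is_nan()).count();
    let all_nan = nan_count == values.len();
    // Keep NaNs as candidates only when the page holds nothing else.
    let mut candidates = values.iter().copied().filter(|v| all_nan || !v.is_nan());
    let first = candidates.next()?;
    let (min, max) = candidates.fold((first, first), |(lo, hi), v| {
        (
            if v.total_cmp(&lo) == Ordering::Less { v } else { lo },
            if v.total_cmp(&hi) == Ordering::Greater { v } else { hi },
        )
    });
    Some(PageBounds { min, max, nan_count: nan_count as i64 })
}
```

A reader that finds a NaN in the returned `min` can conclude that the page contains only NaNs, which matches the reading rule for the page index in the diff above.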