Commit aedf129

JFinis authored and jfinis-salesforce committed
PARQUET-2249: Introduce IEEE 754 total order for floats
This commit adds a new column order `IEEE754TotalOrder`, which can be used for floating point types (FLOAT, DOUBLE, FLOAT16).

The advantage of the new order is a well-defined ordering between -0, +0, and the various possible bit patterns of NaNs. Thus, every possible bit pattern of a floating point value now has a well-defined order, so two implementations can no longer apply different orders when the new column order is used.

With the default column order, there were many problems w.r.t. NaN values, which led to reading engines not being able to use statistics of floating point columns for scan pruning, even when no NaNs were in the data set. The problems are discussed in detail in the next section.

This solution is the result of the extended discussion in apache#196, which ended with the consensus that IEEE 754 total ordering is the best approach to solve the problem in a simple manner without introducing special fields for floating point columns (such as `nan_counts`, which was proposed in that PR). Please refer to the discussion in that PR for all the details on why this solution was chosen over various design alternatives.

Note that this solution is fully backward compatible and should break neither old readers nor old writers, as it only adds a new column order. Legacy writers can continue not writing this new order and instead write the default type-defined order. Legacy readers should avoid using any statistics on columns that have a column order they do not understand, and therefore should simply not use the statistics for columns ordered using the new order.

The remainder of this message explains in detail what the problems are and how the proposed solution fixes them.

Problem Description
===================

Currently, the way NaN values are to be handled in statistics inhibits most scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely, the following problems exist:

Statistics don't tell whether NaNs are present
----------------------------------------------

As NaN values are not to be incorporated in min/max bounds, a reader cannot know whether NaN values are present. This might not seem too problematic, as most queries will not filter for NaNs. However, NaN is ordered in most database systems. For example, Postgres, DB2, and Oracle treat NaN as greater than any other value, while MSSQL and MySQL treat it as less than any other value. An overview of what different systems are doing can be found here. The gist of it is that different systems with different semantics exist w.r.t. NaNs, and most of the systems do order NaNs, either less than or greater than all other values.

For example, if the semantics of the reading query engine mandate that NaN is to be treated as greater than all other values, the predicate x > 1.0 should include NaN values. If a page has max = 0.0, the engine would not be able to skip the page, as the page might contain NaNs, which would need to be included in the query result.

Likewise, the predicate x < 1.0 should include NaN if the reading engine treats NaN as less than all other values. Again, a page with min = 2.0 couldn't be skipped by the reader in this case.

Thus, even if a user doesn't query for NaN explicitly, they might use other predicates that need to filter or retain NaNs in the semantics of the reading engine, so the fact that we currently can't know whether a page or row group contains NaN is a bigger problem than it might seem at first sight.

Currently, any predicate that needs to retain NaNs cannot use min and max bounds in Parquet and therefore cannot be used for scan pruning at all. As stated, this affects many seemingly innocuous greater-than or less-than predicates in most database systems. Conversely, it would be nice if Parquet enabled scan pruning in these cases, regardless of whether the reader and writer agree on whether NaN is smaller than, greater than, or incomparable to all other values.

Note that the problem exists especially if the Parquet file doesn't include any NaNs, so this is not only a problem in the edge case where NaNs are present; it is a problem in the far more common case of NaNs not being present.

Handling NaNs in a ColumnIndex
------------------------------

There is currently no well-defined way to write a spec-conforming ColumnIndex once a page has only NaN (and possibly null) values. NaN values should not be included in min/max bounds, but if a page contains only NaN values, then there is no other value to put into the min/max bounds. However, bounds in a ColumnIndex are non-optional, so something has to be written there. The spec does not describe what engines should do in this case. Parquet-mr takes the safe route and does not write a column index once NaNs are present. But this is a huge pessimization, as a single page containing NaNs will prevent writing a column index for the column chunk containing that page, so even pages in that chunk that don't contain NaNs will not be indexed.

It would be nice if there was a defined way of writing the ColumnIndex when NaNs (and especially only-NaN pages) are present.

Handling only-NaN pages & column chunks
---------------------------------------

Note: Hereinafter, whenever the term only-NaN is used, it refers to a page or column chunk whose only non-null values are NaNs. E.g., an only-NaN page is allowed to have a mixture of null values and NaNs, or only NaNs, but no non-NaN non-null values.

The Statistics objects stored in page headers and in the file footer have a similar, albeit smaller, problem: min_value and max_value are optional here, so it is easier to not include NaNs in the min/max in case of an only-NaN page or column chunk: simply omit these optional fields. However, this brings a semantic ambiguity with it, as it is now unclear whether the min/max value wasn't written because there were only NaNs, or simply because the writing engine decided to omit them for some other reason, which the spec allows because the fields are optional.

Consequently, a reader cannot know whether missing min_value and max_value means "only NaNs, you can skip this page if you are looking for only non-NaN values" or "no stats written, you have to read this page as it is undefined what values it contains".

It would be nice if we could handle NaNs in a way that would allow scan pruning for these only-NaN pages.

Solution
========

IEEE 754 total order solves all the mentioned problems. As NaNs now have a defined place in the ordering, they can be incorporated into min and max bounds. In fact, in contrast to the default ordering, they do not need any special casing anymore, so all the remarks on how readers and writers should special-handle NaNs and -0/+0 no longer apply to the new ordering.

As NaNs are incorporated into min and max, a reader can now see from the statistics whether NaNs are contained. Thus, a reading engine just has to map its NaN semantics to the NaN semantics of total ordering. For example, if the semantics of the reading engine treat all NaNs (including -NaNs) as greater than all other values, a reading engine having a predicate `x > 5.0` (which should include NaNs) must not skip a page or row group if either its min or its max is (+/-)NaN.

Only-NaN pages can now also be included in the column index, as they are no longer a special case.

In conclusion, all mentioned problems are solved by using IEEE 754 total ordering.
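The reader-side mapping described in the Solution section can be sketched in Rust as follows. This is only an illustration under stated assumptions, not part of the commit: `PageStats`, `compute_stats`, and `can_skip_greater_than` are hypothetical names, the engine is assumed to treat every NaN (of either sign) as greater than all other values, and the predicate bound is assumed not to be NaN. The only library call used is the standard library's `f64::total_cmp`, which implements the IEEE 754 totalOrder predicate.

```rust
/// Hypothetical min/max statistics of one page, computed under
/// IEEE 754 total order (illustrative type, not a Parquet API).
#[derive(Debug, Clone, Copy)]
pub struct PageStats {
    pub min: f64,
    pub max: f64,
}

/// Computes min and max under total order. NaNs take part in the
/// comparison like any other value, so no special casing is needed.
/// Returns `None` for a page without non-null values.
pub fn compute_stats(values: &[f64]) -> Option<PageStats> {
    let min = values.iter().copied().min_by(|a, b| a.total_cmp(b))?;
    let max = values.iter().copied().max_by(|a, b| a.total_cmp(b))?;
    Some(PageStats { min, max })
}

/// Decides whether a page can be skipped for the predicate `x > bound`.
/// Assumption: the engine orders every NaN above all other values, so
/// NaNs always satisfy the predicate; `bound` itself is not NaN.
pub fn can_skip_greater_than(stats: &PageStats, bound: f64) -> bool {
    // Under total order, a -NaN in the page would be its min and a +NaN
    // would be its max, so a NaN bound proves that the page holds NaNs.
    if stats.min.is_nan() || stats.max.is_nan() {
        return false;
    }
    // Neither bound is NaN, hence the page contains no NaN at all and
    // an ordinary numeric comparison against max is sufficient.
    stats.max <= bound
}
```

With this mapping, a page whose values are all at most 5.0 can be skipped for `x > 5.0`, while the same page with a single NaN of either sign added cannot, because the NaN surfaces in `min` or `max`.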
1 parent 066f981 commit aedf129

3 files changed, +83 -49 lines changed

LogicalTypes.md

Lines changed: 1 addition & 1 deletion
@@ -253,7 +253,7 @@ Used in contexts where precision is traded off for smaller footprint and potenti
 
 The primitive type is a 2-byte fixed length binary.
 
-The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
+The type-defined sort order for `FLOAT16` is signed (with special handling of NaNs and signed zeros), as for `FLOAT` and `DOUBLE`. It is recommended that writers use IEEE754TotalOrder when writing columns of this type for a well-defined handling of NaNs and signed zeros. See the `ColumnOrder` union in the [Thrift definition](src/main/thrift/parquet.thrift) for details.
 
 ## Temporal Types

README.md

Lines changed: 6 additions & 33 deletions
@@ -146,40 +146,13 @@ documented in [LogicalTypes.md][logical-types].
 [logical-types]: LogicalTypes.md
 
 ### Sort Order
-
 Parquet stores min/max statistics at several levels (such as Column Chunk,
-Column Index and Data Page). Comparison for values of a type obey the
-following rules:
-
-1. Each logical type has a specified comparison order. If a column is
-   annotated with an unknown logical type, statistics may not be used
-   for pruning data. The sort order for logical types is documented in
-   the [LogicalTypes.md][logical-types] page.
-2. For primitive types, the following rules apply:
-
-   * BOOLEAN - false, true
-   * INT32, INT64 - Signed comparison.
-   * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
-     signed zeros. The details are documented in the
-     [Thrift definition](src/main/thrift/parquet.thrift) in the
-     `ColumnOrder` union. They are summarized here but the Thrift definition
-     is considered authoritative:
-     * NaNs should not be written to min or max statistics fields.
-     * If the computed max value is zero (whether negative or positive),
-       `+0.0` should be written into the max statistics field.
-     * If the computed min value is zero (whether negative or positive),
-       `-0.0` should be written into the min statistics field.
-
-     For backwards compatibility when reading files:
-     * If the min is a NaN, it should be ignored.
-     * If the max is a NaN, it should be ignored.
-     * If the min is +0, the row group may contain -0 values as well.
-     * If the max is -0, the row group may contain +0 values as well.
-     * When looking for NaN values, min and max should be ignored.
-
-   * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
-     comparison.
-
+Column Index, and Data Page). These statistics are according to a sort order,
+which is defined for each column in the file footer. Parquet supports common
+sort orders for logical and primitve types and also special orders for types
+where the common sort order is not unambiguously defined (e.g., NaN ordering
+for floating point types). The details are documented in the
+[Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 
 ## Nested Encoding
 To encode nested columns, Parquet uses the Dremel encoding with definition and

src/main/thrift/parquet.thrift

Lines changed: 76 additions & 15 deletions
@@ -288,7 +288,7 @@ struct MapType {} // see LogicalTypes.md
 struct ListType {} // see LogicalTypes.md
 struct EnumType {} // allowed for BINARY, must be encoded with UTF-8
 struct DateType {} // allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes (see LogicalTypes.md)
 
 /**
  * Logical type to annotate a column that is always null.
@@ -788,7 +788,7 @@ struct ColumnMetaData {
   /** total byte size of all uncompressed pages in this column chunk (including the headers) **/
   6: required i64 total_uncompressed_size
 
-  /** total byte size of all compressed, and potentially encrypted, pages
+  /** total byte size of all compressed, and potentially encrypted, pages
    * in this column chunk (including the headers) **/
   7: required i64 total_compressed_size
 
@@ -903,17 +903,20 @@ struct RowGroup {
    * in this row group **/
   5: optional i64 file_offset
 
-  /** Total byte size of all compressed (and potentially encrypted) column data
+  /** Total byte size of all compressed (and potentially encrypted) column data
    * in this row group **/
   6: optional i64 total_compressed_size
-
+
   /** Row group ordinal in the file **/
   7: optional i16 ordinal
 }
 
 /** Empty struct to signal the order defined by the physical or logical type */
 struct TypeDefinedOrder {}
 
+/** Empty struct to signal IEEE 754 total order for floating point types */
+struct IEEE754TotalOrder {}
+
 /**
  * Union to specify the order used for the min_value and max_value fields for a
  * column. This union takes the role of an enhanced enum that allows rich
@@ -922,6 +925,7 @@ struct TypeDefinedOrder {}
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
  *                      physical type (if there is no logical type).
+ * * IEEE754TotalOrder - the floating point column uses IEEE 754 total order.
  *
  * If the reader does not support the value of this union, min and max stats
  * for this column should be ignored.
@@ -941,6 +945,7 @@ union ColumnOrder {
    *   UINT64 - unsigned comparison
    *   DECIMAL - signed comparison of the represented value
    *   DATE - signed comparison
+   *   FLOAT16 - signed comparison of the represented value (*)
    *   TIME_MILLIS - signed comparison
    *   TIME_MICROS - signed comparison
    *   TIMESTAMP_MILLIS - signed comparison
@@ -962,15 +967,19 @@ union ColumnOrder {
    *   BYTE_ARRAY - unsigned byte-wise comparison
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating
+   *     point types due to underspecified handling of NaN and -0/+0,
+   *     it is recommended that writers use IEEE_754_TOTAL_ORDER
+   *     for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following
    *     compatibility rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
    *     - When looking for NaN values, min and max should be ignored.
-   *
+   *
    *     When writing statistics the following rules should be followed:
    *     - NaNs should not be written to min or max statistics fields.
    *     - If the computed max value is zero (whether negative or positive),
@@ -979,6 +988,58 @@ union ColumnOrder {
    *       `-0.0` should be written into the min statistics field.
    */
   1: TypeDefinedOrder TYPE_ORDER;
+
+  /*
+   * The floating point type is ordered according to the totalOrder predicate,
+   * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
+   * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering.
+
+   * Intuitively, this orders floats mathematically, but defines -0 to be less
+   * than +0, -NaN to be less than anything else, and +NaN to be greater than
+   * anything else. It also defines an order between different bit representations
+   * of the same value.
+   *
+   * The formal definition is as follows:
+   *   a) If x<y, totalOrder(x, y) is true.
+   *   b) If x>y, totalOrder(x, y) is false.
+   *   c) If x=y:
+   *     1) totalOrder(−0, +0) is true.
+   *     2) totalOrder(+0, −0) is false.
+   *     3) If x and y represent the same floating-point datum:
+   *       i) If x and y have negative sign, totalOrder(x, y) is true if and
+   *          only if the exponent of x ≥ the exponent of y
+   *       ii) otherwise totalOrder(x, y) is true if and only if the exponent
+   *           of x ≤ the exponent of y.
+   *   d) If x and y are unordered numerically because x or y is NaN:
+   *     1) totalOrder(−NaN, y) is true where −NaN represents a NaN with
+   *        negative sign bit and y is a floating-point number.
+   *     2) totalOrder(x, +NaN) is true where +NaN represents a NaN with
+   *        positive sign bit and x is a floating-point number.
+   *     3) If x and y are both NaNs, then totalOrder reflects a total ordering
+   *        based on:
+   *       i) negative sign orders below positive sign
+   *       ii) signaling orders below quiet for +NaN, reverse for −NaN
+   *       iii) lesser payload, when regarded as an integer, orders below
+   *            greater payload for +NaN, reverse for −NaN.
+   *
+   * Note that this ordering can be implemented efficiently in software
+   * by flipping all non-sign bits in case of a set sign bit to achieve a
+   * two's-complement-like representation and then performing a signed
+   * integer comparison on the resulting bits.
+   * E.g., this is a possible implementation for DOUBLE in Rust:
+   *
+   * pub fn totalOrder(x: f64, y: f64) -> bool {
+   *   // view bits as signed integers
+   *   let mut x_int = x.to_bits() as i64;
+   *   let mut y_int = y.to_bits() as i64;
+   *   // flip all non-sign bits if sign bit is set
+   *   x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
+   *   y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
+   *   // perform signed integer comparison
+   *   return x_int <= y_int;
+   * }
+   */
+  2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER;
 }
 
 struct PageLocation {
@@ -1148,30 +1209,30 @@ struct FileMetaData {
    */
   7: optional list<ColumnOrder> column_orders;
 
-  /**
+  /**
    * Encryption algorithm. This field is set only in encrypted files
    * with plaintext footer. Files with encrypted footer store algorithm id
    * in FileCryptoMetaData structure.
    */
   8: optional EncryptionAlgorithm encryption_algorithm
 
-  /**
-   * Retrieval metadata of key used for signing the footer.
-   * Used only in encrypted files with plaintext footer.
-   */
+  /**
+   * Retrieval metadata of key used for signing the footer.
+   * Used only in encrypted files with plaintext footer.
+   */
   9: optional binary footer_signing_key_metadata
 }
 
 /** Crypto metadata for files with encrypted footer **/
 struct FileCryptoMetaData {
-  /**
+  /**
    * Encryption algorithm. This field is only used for files
    * with encrypted footer. Files with plaintext footer store algorithm id
    * inside footer (FileMetaData structure).
    */
   1: required EncryptionAlgorithm encryption_algorithm
-
-  /** Retrieval metadata of key used for encryption of footer,
+
+  /** Retrieval metadata of key used for encryption of footer,
    * and (possibly) columns **/
   2: optional binary key_metadata
 }
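
The bit-flipping implementation suggested in the `IEEE_754_TOTAL_ORDER` comment can be cross-checked against the Rust standard library, whose `f64::total_cmp` implements the same IEEE 754 totalOrder predicate. The sketch below is an illustration only and not part of the commit; `total_order_key`, `total_order_le`, and `ascending_samples` are hypothetical helper names, and the NaN bit patterns are example values chosen for the demonstration.

```rust
/// Maps a double to an i64 key whose signed integer order matches
/// IEEE 754 total order: if the sign bit is set, all non-sign bits
/// are flipped (hypothetical helper, same trick as in the comment).
pub fn total_order_key(x: f64) -> i64 {
    let bits = x.to_bits() as i64;
    // `bits >> 63` is an arithmetic shift: all ones for a set sign bit,
    // zero otherwise. Dropping the top bit yields the flip mask.
    bits ^ (((bits >> 63) as u64) >> 1) as i64
}

/// totalOrder(x, y): true if x orders at or below y.
pub fn total_order_le(x: f64, y: f64) -> bool {
    total_order_key(x) <= total_order_key(y)
}

/// Example values listed in ascending total order.
pub fn ascending_samples() -> Vec<f64> {
    vec![
        f64::from_bits(0xfff8_0000_0000_0000), // -NaN (quiet)
        f64::NEG_INFINITY,
        -1.0,
        -0.0,
        0.0,
        1.0,
        f64::INFINITY,
        f64::from_bits(0x7ff8_0000_0000_0000), // +NaN (quiet, payload 0)
        f64::from_bits(0x7ff8_0000_0000_0001), // +NaN (quiet, payload 1)
    ]
}
```

For every pair taken from `ascending_samples()`, `total_order_le(a, b)` should agree with `a.total_cmp(&b) != Ordering::Greater`, which gives implementers a simple way to validate a hand-written comparison.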
