PARQUET-2249: Introduce IEEE 754 total order & NaN-counts
This commit is a combination of the following PRs:
* Introduce IEEE 754 total order
apache#221
* Add nan_count to handle NaNs in statistics
apache#196
Both these PRs try to solve the same problems; read
the description of the respective PRs for explanation.
This PR is the result of an extended discussion in
which it was repeatedly brought up that another possible
solution to the problem could be the combination of
the two approaches. Please refer to this discussion
on the mailing list and in the two PRs for details.
The mailing list discussion can be found here:
https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s
The contents of this PR are basically a straightforward
combination of the two approaches:
* IEEE total order is introduced as a new order for floating
point types
* nan_count and nan_counts fields are added
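
For illustration, IEEE 754 total order on doubles can be expressed as an integer sort key. The following is a minimal sketch, not text from either PR; `total_order_key` and `from_bits` are hypothetical helper names, and the bit trick is the commonly used one (reinterpret the double as a signed 64-bit integer, then flip the magnitude bits of negative values).

```python
import struct


def total_order_key(x: float) -> int:
    """Map a double to an int whose ordering matches IEEE 754 total order."""
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    bits = struct.unpack(">q", struct.pack(">d", x))[0]
    # If the sign bit is set, flip the 63 magnitude bits so that
    # larger magnitudes of negative values sort lower.
    return bits ^ ((bits >> 63) & 0x7FFFFFFFFFFFFFFF)


def from_bits(bits: int) -> float:
    """Build a double from its raw 64-bit pattern (used to get signed NaNs)."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


NEG_NAN = from_bits(0xFFF8000000000000)  # quiet NaN with the sign bit set
POS_NAN = from_bits(0x7FF8000000000000)  # quiet NaN with the sign bit clear

values = [POS_NAN, 1.0, 0.0, -0.0, float("-inf"), NEG_NAN, float("inf"), -1.0]
# Sorts as: -NaN, -inf, -1.0, -0.0, +0.0, 1.0, +inf, +NaN
ordered = sorted(values, key=total_order_key)
```

Under this key every value, including NaNs and signed zeros, has a well-defined position, which is what makes min/max bounds unambiguous for this order.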
Legacy writers may not write the nan_count(s) fields,
so readers have to handle their absence. Legacy writers
may also have included NaNs in min/max bounds, so readers
have to handle that as well.
As there are no legacy writers writing IEEE total order,
nan_count(s) are defined to be mandatory if this order is used,
so readers can assume their presence when this order is used.
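
One way a reader might apply these rules is sketched below. This is an assumption-laden illustration, not an API from any Parquet implementation: the order labels, the function name `nan_presence`, and the choice to raise an error when `nan_count` is missing under IEEE total order are all hypothetical.

```python
import math
from typing import Optional

# Hypothetical labels for the two column orders discussed above.
TYPE_DEFINED = "TYPE_DEFINED_ORDER"
IEEE_TOTAL = "IEEE_754_TOTAL_ORDER"


def nan_presence(order: str, min_value: float, max_value: float,
                 nan_count: Optional[int]) -> Optional[bool]:
    """Return True/False if NaN presence is known, or None if it is unknown."""
    if nan_count is not None:
        # The writer recorded the count, so the answer is exact.
        return nan_count > 0
    if order == IEEE_TOTAL:
        # nan_count is mandatory for this order; rejecting the statistics
        # here is an assumption of this sketch.
        raise ValueError("nan_count missing under IEEE 754 total order")
    # Legacy writer: the field is absent. A NaN bound shows that NaNs were
    # written; otherwise nothing can be concluded.
    if math.isnan(min_value) or math.isnan(max_value):
        return True
    return None
```

A result of `None` means the reader cannot rule NaNs out and should fall back to conservative behaviour.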
This commit removes `nan_pages` from the ColumnIndex,
which the nan_counts PR mandated. They are no longer needed,
because IEEE total order solves the problem: as this is a new
order with no legacy writers or readers, we are free to define
that writers put NaNs into the min & max bounds of only-NaN
pages using this order, and that readers can assume that
isNaN(min) signals an only-NaN page.
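
The only-NaN check can be sketched as follows, assuming the column uses IEEE total order and the per-page min values have already been decoded to floats (in the file format they are stored as encoded bytes). `only_nan_pages` is a hypothetical helper name.

```python
import math
from typing import List


def only_nan_pages(min_values: List[float]) -> List[int]:
    """Return the indexes of pages whose min bound is NaN.

    Under IEEE total order, as described above, a NaN min bound
    signals that the page contains only NaNs.
    """
    return [i for i, v in enumerate(min_values) if math.isnan(v)]


# Example: pages 1 and 3 contain only NaNs.
pages = only_nan_pages([1.0, float("nan"), -3.5, float("nan")])
```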
LogicalTypes.md: 4 additions & 1 deletion
@@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti
 The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
-The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
+The type-defined sort order for `FLOAT16` is signed (with special handling of NaNs and signed zeros),
+as for `FLOAT` and `DOUBLE`. It is recommended that writers use IEEE754TotalOrder when writing columns
+of this type for a well-defined handling of NaNs and signed zeros. See the `ColumnOrder` union in the
+[Thrift definition](src/main/thrift/parquet.thrift) for details.