PARQUET-2249: Update with suggestions

JFinis · JFinis · commit aecffd7c3c84 · 2023-06-12T12:55:31.000+02:00
This commit updates the proposal based on the suggestions in the PR.

The biggest change is that readers are no longer expected to check
min == max == NaN to find only-NaN pages. Instead, they should check
nan_count + null_count == num_values in pages and nan_pages[x] == true
in the column index. This way, we no longer rely on NaN comparison
rules and readers can and should continue to ignore NaN values they
find in bounds.

However, as was pointed out, we cannot write "no value" into min and max
bounds in the column index, as this would not be compatible with legacy
readers. Instead, writers must write something here. Therefore, they
are now suggested to write NaN here, but readers will still ignore this
and instead must rely on the new nan_pages field.

In addition, two further suggestions were implemented:

* Removed the duplicate explanation from README.md. It now only points
  to parquet.thrift as the source of truth.
* Softened the wording from "nan_count fields should always be set" to
  "it is suggested to always set the nan_coutn fields".
diff --git a/README.md b/README.md
@@ -161,28 +161,7 @@ following rules:
     * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
       signed zeros.   The details are documented in the
       [Thrift definition](src/main/thrift/parquet.thrift) in the
-      `ColumnOrder` union. They are summarized here but the Thrift definition
-      is considered authoritative:
-      * The following compatibility rules should be applied when reading statistics:
-        * If the nan_count field is set to > 0 and both min and max are
-          NaN, a reader can rely on that all non-NULL values are NaN
-        * Otherwise, if the min or the max is a NaN, it should be ignored.
-        * When looking for NaN values, min and max should be ignored;
-          if the nan_count field is set, it should be used to check whether
-          NaNs are present.
-        * If the min is +0, the row group may contain -0 values as well.
-        * If the max is -0, the row group may contain +0 values as well.
-      * When writing statistics the following rules should be followed:
-        * The nan_count fields should always be set for FLOAT and DOUBLE columns.
-        * NaNs should not be written to min or max statistics fields except
-          when all non-NULL values are NaN, in which case min and max should
-          both be written as NaN. If the nan_count field is set, this semantics
-          is mandated and readers may rely on it.
-        * If the computed max value is zero (whether negative or positive),
-          `+0.0` should be written into the max statistics field.
-        * If the computed min value is zero (whether negative or positive),
-          `-0.0` should be written into the min statistics field.
-      
+      `ColumnOrder` union.
     * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
       comparison.
 
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
@@ -223,7 +223,10 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
-   /** count of NaN values in the column; only present if type is FLOAT or DOUBLE */
+   /** 
+    * count of NaN values in the column; only present if physical type is FLOAT
+    * or DOUBLE
+    */
    7: optional i64 nan_count;
 }
 
@@ -890,21 +893,23 @@ union ColumnOrder {
    * (*) Because the sorting order is not specified properly for floating
    *     point values (relations vs. total ordering), the following compatibility
    *     rules should be applied when reading statistics:
-   *     - If the nan_count field is set to > 0 and both min and max are
-   *       NaN, a reader can rely on that all non-NULL values are NaN
-   *     - Otherwise, if the min or the max is a NaN, it should be ignored.
-   *     - When looking for NaN values, min and max should be ignored;
-   *       if the nan_count field is set, it can be used to check whether
+   *     - If the min is a NaN, it should be ignored.
+   *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-NULL
+   *       values are NaN.
+   *     - When looking for NaN values, min and max should be ignored.
+   *       If the nan_count field is set, it can be used to check whether
    *       NaNs are present.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
    * 
    *     When writing statistics the following rules should be followed:
-   *     - The nan_count fields should always be set for FLOAT and DOUBLE columns.
+   *     - It is suggested to always set the nan_count fields for FLOAT and
+           DOUBLE columns.
    *     - NaNs should not be written to min or max statistics fields except
-   *       when all non-NULL values are NaN, in which case min and max should
-   *       both be written as NaN. If the nan_count field is set, this semantics
-   *       is mandated and readers may rely on it.
+   *       in the column index, where a value has to be written incase of
+   *       only-NaN pages.
    *     - If the computed max value is zero (whether negative or positive),
    *       `+0.0` should be written into the max statistics field.
    *     - If the computed min value is zero (whether negative or positive),
@@ -963,7 +968,9 @@ struct ColumnIndex {
    * using them by inspecting null_pages.
    * For columns of type FLOAT and DOUBLE, NaN values are not to be included
    * in these bounds unless all non-null values in a page are NaN, in which
-   * case min and max are to be set to NaN.
+   * case min and max should be set to NaN. Readers should always ignore NaN
+   * values in the bounds; they should check nan_pages to detect the "all
+   * non-null values are NaN" case.
    */
   2: required list<binary> min_values
   3: required list<binary> max_values
@@ -979,9 +986,22 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
 
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages
+
   /** A list containing the number of NaN values for each page. Only present
    *  for columns of type FLOAT and DOUBLE. **/
-  6: optional list<i64> nan_counts
+  7: optional list<i64> nan_counts
 }
 
 struct AesGcmV1 {