[BUG] Parquet float/double statistic is wrong when float/double column contains NaN #13948

@res-life

Describe the bug
Parquet float/double statistics are wrong when a float/double column contains NaN.
For example, take a double column that contains the two values [NaN, 1.0d].

The CPU writes org.apache.parquet.format.Statistics with min_value 1.0 and max_value NaN:

max_value = b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'

The GPU writes org.apache.parquet.format.Statistics with min_value 1.0 and max_value 1.0:

max_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
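The raw values above are the 8-byte little-endian IEEE 754 encodings that Parquet uses for DOUBLE statistics. A minimal sketch for decoding them by hand follows; the class name `DecodeParquetDoubleStat` and the helper `decode` are hypothetical names chosen for illustration, and the trailing `?` in the printed byte strings is the ASCII character for 0x3f.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class DecodeParquetDoubleStat {
    // Interpret 8 raw statistic bytes as a little-endian IEEE 754 double.
    static double decode(byte[] raw) {
        return ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).getDouble();
    }

    public static void main(String[] args) {
        // b'\x00\x00\x00\x00\x00\x00\xf8\x7f' (the CPU max_value)
        byte[] cpuMax = {0, 0, 0, 0, 0, 0, (byte) 0xf8, 0x7f};
        // b'\x00\x00\x00\x00\x00\x00\xf0?' where '?' is 0x3f (min_value, and the GPU max_value)
        byte[] one = {0, 0, 0, 0, 0, 0, (byte) 0xf0, 0x3f};

        System.out.println(decode(cpuMax)); // prints NaN
        System.out.println(decode(one));    // prints 1.0
    }
}
```

So the CPU file records max = NaN, while the GPU file records max = 1.0.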

Steps/Code to reproduce bug
CPU:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

public class TestDoubleStatistic {

  @Test
  public void testDouble() throws IOException {
    MessageType schema = MessageTypeParser.parseMessageType(
      "message schema {\n" +
        "    repeated double d;\n" +
        "}");

    File file = new File("/tmp/test-001.parquet");
    if (file.exists()) {
      file.delete();
    }
    Path fsPath = new Path(file.getAbsolutePath());
    SimpleGroupFactory factory = new SimpleGroupFactory(schema);

    ParquetWriter<Group> writer = ExampleParquetWriter.builder(fsPath)
      .withType(schema)
      .build();

    Group group = factory.newGroup()
      .append("d", Double.NaN)
      .append("d", 1.0d);
    writer.write(group);
    writer.close();

    System.out.println("ok");
  }
}

parquet-tools inspect --detail /tmp/test-001.parquet

max_value = b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'

GPU:

TEST_F(MyDebugTests, TestDoubleStatistic)
{
  auto nan = 0.0 / 0.0;
  auto d = std::vector<double> {nan, 1.0};
  float64_col dc(d.begin(), d.end());
  cudf::table_view expected({dc});

  cudf::io::table_input_metadata expected_metadata(expected);
  expected_metadata.column_metadata[0].set_name("c1");

  auto filepath = "/tmp/TestDoubleStatistic.parquet";
  cudf::io::parquet_writer_options out_opts =
    cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected)
      .metadata(expected_metadata);
  cudf::io::write_parquet(out_opts);
}

parquet-tools inspect --detail TestDoubleStatistic.parquet

max_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'

Expected behavior
Make the float/double statistics written by the GPU consistent with those written by the CPU.

Environment details
cuDF: branch-23.10
parquet: apache-parquet-1.12.2

Additional context
parquet-tools link

Parquet converts org.apache.parquet.format.Statistics to org.apache.parquet.column.statistics as follows:

https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.2/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L123-L126

        if (min.isNaN() || max.isNaN()) {
          stats.setMinMax(0.0, 0.0);
          ((Statistics<?>) stats).hasNonNullValue = false;
        }

If min or max is NaN, the org.apache.parquet.column.statistics min/max are reset to 0.0/0.0 and the statistics are marked invalid:

hasNonNullValue = false

So the GPU should make sure that min/max contains NaN when the column contains NaN.
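To illustrate how the two outcomes can arise, here is a minimal sketch, not the actual parquet-mr or cuDF code. It assumes the CPU result comes from a total ordering in which NaN sorts above every other value (as `Double.compare` does), and that the GPU result comes from ignoring NaN while reducing. All names (`NanMinMaxSketch`, `minMaxTotalOrder`, `minMaxSkippingNaN`, `statsUsable`) are hypothetical.

```java
class NanMinMaxSketch {
    // Assumption: min/max under a total order where NaN is the greatest value.
    static double[] minMaxTotalOrder(double[] values) {
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            if (Double.compare(v, min) < 0) min = v;
            if (Double.compare(v, max) > 0) max = v;
        }
        return new double[] {min, max};
    }

    // Assumption: min/max that drops NaN values before comparing.
    static double[] minMaxSkippingNaN(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            if (Double.isNaN(v)) continue;
            if (v < min) min = v;
            if (v > max) max = v;
        }
        return new double[] {min, max};
    }

    // Mirrors the quoted reader-side check: a NaN in min or max invalidates the stats.
    static boolean statsUsable(double min, double max) {
        return !(Double.isNaN(min) || Double.isNaN(max));
    }

    public static void main(String[] args) {
        double[] column = {Double.NaN, 1.0};
        double[] cpuLike = minMaxTotalOrder(column);
        double[] gpuLike = minMaxSkippingNaN(column);
        System.out.println(cpuLike[0] + " " + cpuLike[1] + " usable=" + statsUsable(cpuLike[0], cpuLike[1]));
        System.out.println(gpuLike[0] + " " + gpuLike[1] + " usable=" + statsUsable(gpuLike[0], gpuLike[1]));
    }
}
```

With the column [NaN, 1.0], the first variant yields min 1.0 and max NaN, which the reader-side check rejects. The second yields 1.0/1.0, which a reader would treat as valid bounds even though the column holds a NaN.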

    Labels

    bug (Something isn't working), cuIO (cuIO issue)