Description
Describe the bug
Parquet float/double statistics are wrong when a float/double column contains NaN.
For example, take a double column that contains 2 values: [NaN, 1.0d].
The CPU writer produces org.apache.parquet.format.Statistics with min_value: 1.0, max_value: NaN.
max_value = b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
The GPU writer produces org.apache.parquet.format.Statistics with min_value: 1.0, max_value: 1.0.
max_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
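The byte strings above are the raw statistics fields as printed by parquet-tools. As a minimal sketch for reading them, assuming the usual Parquet plain encoding of a DOUBLE as 8 little-endian IEEE 754 bytes, they can be decoded like this (the class name `DecodeStatBytes` is made up for illustration; note that `?` in the printed form is the byte 0x3f):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class DecodeStatBytes {
    // Interpret 8 raw statistics bytes as a little-endian IEEE 754 double.
    static double decode(byte[] raw) {
        return ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).getDouble();
    }

    public static void main(String[] args) {
        // b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
        byte[] nanBytes = {0, 0, 0, 0, 0, 0, (byte) 0xf8, 0x7f};
        // b'\x00\x00\x00\x00\x00\x00\xf0?'  ('?' is 0x3f)
        byte[] oneBytes = {0, 0, 0, 0, 0, 0, (byte) 0xf0, 0x3f};

        System.out.println(decode(nanBytes)); // prints NaN
        System.out.println(decode(oneBytes)); // prints 1.0
    }
}
```

So the CPU file carries max_value = NaN, while the GPU file carries 1.0 for both bounds.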
Steps/Code to reproduce bug
CPU:
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;
import org.junit.Test;
import java.io.File;
import java.io.IOException;
public class TestDoubleStatistic {
  @Test
  public void testDouble() throws IOException {
    // Single repeated double field, so one record can hold both values.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message schema {\n" +
        " repeated double d;\n" +
        "}");
    File file = new File("/tmp/test-001.parquet");
    if (file.exists()) {
      file.delete();
    }
    Path fsPath = new Path(file.getAbsolutePath());
    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    ParquetWriter<Group> writer = ExampleParquetWriter.builder(fsPath)
        .withType(schema)
        .build();
    // Column values: [NaN, 1.0]
    Group group = factory.newGroup()
        .append("d", Double.NaN)
        .append("d", 1.0d);
    writer.write(group);
    writer.close();
    System.out.println("ok");
  }
}
parquet-tools inspect --detail /tmp/test-001.parquet
max_value = b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
GPU:
TEST_F(MyDebugTests, TestDoubleStatistic)
{
  auto nan = 0.0 / 0.0;
  auto d = std::vector<double> {nan, 1.0};
  float64_col dc(d.begin(), d.end());
  cudf::table_view expected({dc});
  cudf::io::table_input_metadata expected_metadata(expected);
  expected_metadata.column_metadata[0].set_name("c1");
  auto filepath = "/tmp/TestDoubleStatistic.parquet";
  cudf::io::parquet_writer_options out_opts =
    cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected)
      .metadata(expected_metadata);
  cudf::io::write_parquet(out_opts);
}
parquet-tools inspect --detail TestDoubleStatistic.parquet
max_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
Expected behavior
Make the float/double statistics written by the GPU consistent with those written by the CPU.
Environment overview (please complete the following information)
Environment details
cuDF: branch-23.10
parquet: apache-parquet-1.12.2
Additional context
parquet-tools link
Parquet converts org.apache.parquet.format.Statistics to org.apache.parquet.column.statistics as follows:
if (min.isNaN() || max.isNaN()) {
stats.setMinMax(0.0, 0.0);
((Statistics<?>) stats).hasNonNullValue = false;
}
If either bound is NaN, the org.apache.parquet.column.statistics min/max are reset to 0.0/0.0 and the statistics are marked invalid:
hasNonNullValue = false
So the GPU writer should make sure that min/max contains NaN whenever the column contains NaN.
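To make the difference concrete, here is a small sketch of two ways to reduce a column to min/max, plus the read-side check quoted above. It rests on assumptions: the CPU result (min 1.0, max NaN) is consistent with an ordering like `Double.compare`, where NaN sorts above every other value, and the GPU result is consistent with a reduction that skips NaN. Whether either writer is implemented exactly this way is not established here, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class NanMinMaxSketch {
    // min/max under Double.compare's total order (NaN is greater than any other value).
    // Assumes a non-empty array.
    static double[] minMaxTotalOrder(double[] values) {
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            if (Double.compare(v, min) < 0) min = v;
            if (Double.compare(v, max) > 0) max = v;
        }
        return new double[] {min, max};
    }

    // min/max that ignores NaN entirely; this reproduces the GPU numbers in this report.
    // If every value is NaN, the bounds stay at +Infinity / -Infinity.
    static double[] minMaxSkippingNaN(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            if (Double.isNaN(v)) continue;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new double[] {min, max};
    }

    // Model of the read-side rule: statistics with a NaN bound are not trusted.
    static boolean readerTrusts(double min, double max) {
        return !(Double.isNaN(min) || Double.isNaN(max));
    }

    public static void main(String[] args) {
        double[] column = {Double.NaN, 1.0};

        double[] cpuLike = minMaxTotalOrder(column);
        double[] gpuLike = minMaxSkippingNaN(column);

        System.out.println(Arrays.toString(cpuLike));               // [1.0, NaN]
        System.out.println(readerTrusts(cpuLike[0], cpuLike[1]));   // false
        System.out.println(Arrays.toString(gpuLike));               // [1.0, 1.0]
        System.out.println(readerTrusts(gpuLike[0], gpuLike[1]));   // true
    }
}
```

With the NaN-skipping reduction the reader keeps statistics of [1.0, 1.0] even though the column also holds a NaN, which is the inconsistency this issue describes.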