[PYTHON] Add `nan_count` to `RowGroupMetaData`

### Describe the enhancement requested

Iceberg relies on statistics (called Metrics in Iceberg) to speed up queries. Most of the metrics are available and can be easily extracted using the `metadata_collector`, except for the NaN counts. When a query applies an `isNaN` expression to a FLOAT/DOUBLE field, Iceberg tries to skip Parquet files by looking at the metrics it has stored in the manifest files. It would be great if `nan_count` could be added next to `null_count`. This is what the statistics expose today:

```python
➜  Desktop python3 
Python 3.11.3 (main, Apr  7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> metadata_collector = []
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(
...     table, '/tmp/table',
...      metadata_collector=metadata_collector)
>>> metadata_collector
[<pyarrow._parquet.FileMetaData object at 0x11f955850>
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 2
  num_rows: 6
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 0]

>>> metadata_collector[0].row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x105837d80>
  num_columns: 2
  num_rows: 6
  total_byte_size: 256

>>> metadata_collector[0].row_group(0).to_dict()
{
	'num_columns': 2,
	'num_rows': 6,
	'total_byte_size': 256,
	'columns': [{
		'file_offset': 119,
		'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
		'physical_type': 'INT64',
		'num_values': 6,
		'path_in_schema': 'n_legs',
		'is_stats_set': True,
		'statistics': {
			'has_min_max': True,
			'min': 2,
			'max': 100,
			'null_count': 0,
			'distinct_count': 0,
			'num_values': 6,
			'physical_type': 'INT64'
		},
		'compression': 'SNAPPY',
		'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
		'has_dictionary_page': True,
		'dictionary_page_offset': 4,
		'data_page_offset': 46,
		'total_compressed_size': 115,
		'total_uncompressed_size': 117
	}, {
		'file_offset': 359,
		'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
		'physical_type': 'BYTE_ARRAY',
		'num_values': 6,
		'path_in_schema': 'animal',
		'is_stats_set': True,
		'statistics': {
			'has_min_max': True,
			'min': 'Brittle stars',
			'max': 'Parrot',
			'null_count': 0,
			'distinct_count': 0,
			'num_values': 6,
			'physical_type': 'BYTE_ARRAY'
		},
		'compression': 'SNAPPY',
		'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
		'has_dictionary_page': True,
		'dictionary_page_offset': 215,
		'data_page_offset': 302,
		'total_compressed_size': 144,
		'total_uncompressed_size': 139
	}]
}
```
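To illustrate why the count matters, here is a minimal pure-Python sketch of the kind of file skipping described above. It is not Iceberg's or PyArrow's actual implementation: `column_metrics`, `can_skip_for_is_nan`, and the `nan_count` key are hypothetical names used only for this example, and plain Python lists stand in for a FLOAT/DOUBLE column.

```python
import math
from typing import Optional, Sequence


def column_metrics(values: Sequence[Optional[float]]) -> dict:
    """Collect null and NaN counts for one float column (hypothetical helper)."""
    null_count = sum(1 for v in values if v is None)
    nan_count = sum(1 for v in values if v is not None and math.isnan(v))
    return {"null_count": null_count, "nan_count": nan_count}


def can_skip_for_is_nan(metrics: dict) -> bool:
    """Return True only when the metrics prove no row matches isNaN(column)."""
    nan_count = metrics.get("nan_count")
    # Without a nan_count the file cannot be ruled out and has to be read.
    return nan_count is not None and nan_count == 0


clean = column_metrics([1.0, 2.5, None])
dirty = column_metrics([1.0, float("nan"), None])
legacy = {"null_count": 0}  # mimics today's statistics, which carry no nan_count

print(can_skip_for_is_nan(clean))   # True
print(can_skip_for_is_nan(dirty))   # False
print(can_skip_for_is_nan(legacy))  # False
```

The last case is the situation this issue is about: with only `null_count` available, every file has to be treated as possibly containing NaN values.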

In addition, Parquet itself is also looking into adding NaN counts: https://github.com/apache/parquet-format/pull/196


### Component(s)

Python
