@shangxinli
Contributor

Why this change?

This implementation provides significant performance improvements for Parquet
file merging operations by eliminating serialization/deserialization overhead.
Benchmark results show 13x faster file merging compared to traditional
read-rewrite approaches.

The change leverages an existing Parquet library capability (the ParquetFileWriter.appendFile() API) to perform zero-copy row-group merging, making it well suited to compaction and maintenance operations on large Iceberg tables.
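As a rough illustration of the zero-copy approach, the sketch below merges several Parquet files by copying their row groups byte-for-byte with ParquetFileWriter.appendFile(). This is a minimal sketch assuming Hadoop paths and identical schemas across inputs; the actual ParquetFileMerger in this PR may differ in structure and options.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;

// Hypothetical sketch (not the PR's actual class): merge row groups from
// several Parquet files into one output file without decoding any pages.
public class AppendMergeSketch {

  public static void merge(Configuration conf, List<Path> inputs, Path output)
      throws IOException {
    // appendFile() requires every input to share an identical schema,
    // so read the schema from the first file's footer.
    MessageType schema;
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(inputs.get(0), conf))) {
      schema = reader.getFooter().getFileMetaData().getSchema();
    }

    ParquetFileWriter writer = new ParquetFileWriter(
        HadoopOutputFile.fromPath(output, conf),
        schema,
        ParquetFileWriter.Mode.CREATE,
        ParquetFileWriter.DEFAULT_BLOCK_SIZE,
        ParquetFileWriter.MAX_PADDING_SIZE_DEFAULT);
    writer.start();
    for (Path input : inputs) {
      // Copies the file's row groups (column chunks and page data)
      // directly into the output, skipping decode/re-encode entirely.
      writer.appendFile(HadoopInputFile.fromPath(input, conf));
    }
    writer.end(Collections.emptyMap());
  }
}
```

Because pages are never decompressed or re-encoded, the cost of the merge is dominated by raw I/O, which is where the reported speedup over read-rewrite comes from.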

Encrypted tables are not supported yet.

What changed?

  • Added ParquetFileMerger class for row-group level file merging
    • Performs zero-copy merging using ParquetFileWriter.appendFile()
    • Validates schema compatibility across all input files
    • Supports merging multiple Parquet files into a single output file
  • Reuses existing Apache Parquet library functionality instead of a custom implementation
  • Strict schema validation ensures data integrity during merge operations
  • Added comprehensive error handling for schema mismatches
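The strict schema validation can be sketched as a simple pre-check: since appendFile() splices row groups byte-for-byte, every input must carry an identical schema, and any mismatch is rejected before writing begins. The helper below is hypothetical (not the PR's code) and compares schemas by their canonical string form, as one could do with Parquet's MessageType.toString().

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the strict schema check performed before a
// zero-copy merge: all input schemas must match the first one exactly.
public class SchemaCheckSketch {

  public static void validateIdenticalSchemas(List<String> schemas) {
    String first = schemas.get(0);
    for (int i = 1; i < schemas.size(); i++) {
      if (!Objects.equals(first, schemas.get(i))) {
        // Fail fast with a message identifying the offending file.
        throw new IllegalArgumentException(
            "Cannot merge: file " + i + " has a different schema than file 0");
      }
    }
  }
}
```

Rejecting mismatches up front keeps the merge all-or-nothing: no partially written output file is produced for incompatible inputs.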

Testing

  • Validated in staging test environment
  • Verified schema compatibility checks work correctly
  • Confirmed 13x performance improvement over traditional approach
  • Tested with various file sizes and row group configurations
