@shangxinli
Contributor

Why this change?

This implementation provides significant performance improvements for Parquet
file merging operations by eliminating serialization/deserialization overhead.
Benchmark results show 13x faster file merging compared to traditional
read-rewrite approaches.

The change leverages an existing Parquet library capability (the ParquetFileWriter.appendFile() API) to perform zero-copy row-group merging, making it well suited to compaction and maintenance operations on large Iceberg tables.
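As a rough illustration of the zero-copy approach, the sketch below merges several Parquet files by copying their row groups byte-for-byte with ParquetFileWriter.appendFile(). This is a minimal sketch assuming Hadoop paths and identical schemas across inputs; the actual ParquetFileMerger in this PR may differ in structure and options.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;

// Hypothetical sketch (not the PR's actual class): merge row groups from
// several Parquet files into one output file without decoding any pages.
public class AppendMergeSketch {

  public static void merge(Configuration conf, List<Path> inputs, Path output)
      throws IOException {
    // appendFile() requires every input to share an identical schema,
    // so read the schema from the first file's footer.
    MessageType schema;
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(inputs.get(0), conf))) {
      schema = reader.getFooter().getFileMetaData().getSchema();
    }

    ParquetFileWriter writer = new ParquetFileWriter(
        HadoopOutputFile.fromPath(output, conf),
        schema,
        ParquetFileWriter.Mode.CREATE,
        ParquetFileWriter.DEFAULT_BLOCK_SIZE,
        ParquetFileWriter.MAX_PADDING_SIZE_DEFAULT);
    writer.start();
    for (Path input : inputs) {
      // Copies the file's row groups (column chunks and page data)
      // directly into the output, skipping decode/re-encode entirely.
      writer.appendFile(HadoopInputFile.fromPath(input, conf));
    }
    writer.end(Collections.emptyMap());
  }
}
```

Because pages are never decompressed or re-encoded, the cost of the merge is dominated by raw I/O, which is where the reported speedup over read-rewrite comes from.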

Encrypted tables are not supported yet.

What changed?

  • Added ParquetFileMerger class for row-group level file merging
    • Performs zero-copy merging using ParquetFileWriter.appendFile()
    • Validates schema compatibility across all input files
    • Supports merging multiple Parquet files into a single output file
  • Reuses existing Apache Parquet library functionality instead of a custom implementation
  • Strict schema validation ensures data integrity during merge operations
  • Added comprehensive error handling for schema mismatches
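The strict schema validation can be sketched as a simple pre-check: since appendFile() splices row groups byte-for-byte, every input must carry an identical schema, and any mismatch is rejected before writing begins. The helper below is hypothetical (not the PR's code) and compares schemas by their canonical string form, as one could do with Parquet's MessageType.toString().

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the strict schema check performed before a
// zero-copy merge: all input schemas must match the first one exactly.
public class SchemaCheckSketch {

  public static void validateIdenticalSchemas(List<String> schemas) {
    String first = schemas.get(0);
    for (int i = 1; i < schemas.size(); i++) {
      if (!Objects.equals(first, schemas.get(i))) {
        // Fail fast with a message identifying the offending file.
        throw new IllegalArgumentException(
            "Cannot merge: file " + i + " has a different schema than file 0");
      }
    }
  }
}
```

Rejecting mismatches up front keeps the merge all-or-nothing: no partially written output file is produced for incompatible inputs.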

Testing

  • Validated in staging test environment
  • Verified schema compatibility checks work correctly
  • Confirmed 13x performance improvement over traditional approach
  • Tested with various file sizes and row group configurations
