feat(io): UnpartitionedWriter + TaskWriter #1769

CTTY · 2025-10-20T23:40:27Z

Which issue does this PR close?

Closes #1770 (Add TaskWriter).

What changes are included in this PR?

Are these changes tested?

CTTY · 2025-10-20T23:55:15Z

crates/iceberg/src/writer/task/mod.rs

+    /// - A partitioned table is provided without a partition splitter
+    pub fn new(
+        writer: W,
+        partition_splitter: Option<RecordBatchPartitionSplitter>,


It may make more sense to construct partition_splitter inside new.

Splitter::new takes an input_schema: ArrowSchemaRef, but in practice we can get the input schema directly from the RecordBatch. If we initialize the partition splitter lazily, users don't need to build a partition_splitter themselves.

Maybe it's a good idea to change RecordBatchPartitionSplitter to PartitionSplitter<I = RecordBatch>? That way the TaskWriter implementation can also be generic. I haven't explored that much yet, though. What I have in mind right now:

    trait PartitionSplitter<I = DefaultInput> {
        fn split(&self, input: I) -> Result<Vec<(PartitionKey, I)>>;
    }

    impl<I: PositionalDeletes> PartitionSplitter<I> for PositionalDeletePartitionSplitter

cc @ZENOTME
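To make the lazy-initialization idea concrete, here is a minimal sketch. It does not use the real iceberg or arrow types: `Batch`, `Splitter`, and `LazyTaskWriter` are hypothetical stand-ins, and the partition value is simply the content of one integer column. The only point is that the splitter can be built from the schema of the first batch instead of being passed into `new`.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for an Arrow RecordBatch: column names plus integer rows.
#[derive(Clone, Debug, PartialEq)]
pub struct Batch {
    pub schema: Vec<String>,
    pub rows: Vec<Vec<i64>>,
}

// Hypothetical stand-in for a partition splitter bound to one input schema.
pub struct Splitter {
    partition_col: usize,
}

impl Splitter {
    // Resolves the partition column against the input schema once.
    pub fn new(input_schema: &[String], partition_field: &str) -> Result<Self, String> {
        let partition_col = input_schema
            .iter()
            .position(|name| name == partition_field)
            .ok_or_else(|| format!("column '{partition_field}' not found in input schema"))?;
        Ok(Self { partition_col })
    }

    // Groups rows by the value of the partition column.
    // Assumes every row has as many values as the schema has columns.
    pub fn split(&self, batch: &Batch) -> Vec<(i64, Batch)> {
        let mut groups: BTreeMap<i64, Vec<Vec<i64>>> = BTreeMap::new();
        for row in &batch.rows {
            groups.entry(row[self.partition_col]).or_default().push(row.clone());
        }
        groups
            .into_iter()
            .map(|(key, rows)| (key, Batch { schema: batch.schema.clone(), rows }))
            .collect()
    }
}

// Hypothetical task writer that builds its splitter on the first write.
pub struct LazyTaskWriter {
    partition_field: String,
    splitter: Option<Splitter>,
    // Collected (partition value, batch) pairs; a real writer would forward these.
    pub written: Vec<(i64, Batch)>,
}

impl LazyTaskWriter {
    pub fn new(partition_field: &str) -> Self {
        Self {
            partition_field: partition_field.to_string(),
            splitter: None,
            written: Vec::new(),
        }
    }

    pub fn is_initialized(&self) -> bool {
        self.splitter.is_some()
    }

    pub fn write(&mut self, batch: Batch) -> Result<(), String> {
        // Build the splitter from the first batch's schema, then reuse it.
        if self.splitter.is_none() {
            self.splitter = Some(Splitter::new(&batch.schema, &self.partition_field)?);
        }
        let splitter = self.splitter.as_ref().expect("splitter initialized above");
        self.written.extend(splitter.split(&batch));
        Ok(())
    }
}
```

One trade-off of this shape is that a schema mismatch surfaces on the first `write` call rather than at construction time.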


CTTY · 2025-10-21T20:20:35Z

crates/iceberg/src/arrow/record_batch_partition_splitter.rs

 /// 2. Split the input record batch into multiple record batches based on the partitioned record batch.
 // # TODO
 // Remove this after partition writer supported.
 #[allow(dead_code)]


Remove this

CTTY · 2025-10-21T20:20:41Z

crates/iceberg/src/arrow/record_batch_partition_splitter.rs


 // # TODO
 // Remove this after partition writer supported.
 #[allow(dead_code)]


Remove this


liurenjie1024

I think we are on the right track. I left some comments, and we need to split this into smaller PRs.

liurenjie1024 · 2025-10-23T09:52:19Z

crates/iceberg/src/arrow/record_batch_partition_splitter.rs

    schema: SchemaRef,
    partition_spec: PartitionSpecRef,
-    projector: RecordBatchProjector,
+    projector: Option<RecordBatchProjector>,


This change is somewhat ugly. The splitter could be split into two parts:

1. Calculate the partition value.
2. Split the record batch according to the partition value.

We could abstract out the process of calculating the partition value.

I think we can reuse the PartitionValueCalculator here.

Are you suggesting that we should have two functions, "calculate" and "split", within the splitter? I think that would make it hard for people to tell when they should call calculate before calling split.

Maybe this would look better?

    impl RecordBatchPartitionSplitter {
        pub fn new(
            iceberg_schema: SchemaRef,
            partition_spec: PartitionSpecRef,
            // if some, then calculate the partition value, otherwise use `_partition`
            calculator: Option<PartitionValueCalculator>,
        )
    }
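A rough sketch of how the optional calculator could keep "calculate" an internal step, so callers only ever call split. Everything here is a toy stand-in, not the real API: `Row`, `Calculator` (playing the role of PartitionValueCalculator), and the modulo "transform" are assumptions for illustration, and the `partition` field plays the role of a precomputed `_partition` column.

```rust
use std::collections::BTreeMap;

// Hypothetical row: one data value plus an optional precomputed partition value.
#[derive(Clone, Debug, PartialEq)]
pub struct Row {
    pub value: i64,
    // Plays the role of a precomputed `_partition` column; None when absent.
    pub partition: Option<i64>,
}

// Hypothetical stand-in for a partition value calculator.
// Assumes `buckets` is greater than zero.
pub struct Calculator {
    pub buckets: i64,
}

impl Calculator {
    pub fn calculate(&self, row: &Row) -> i64 {
        row.value.rem_euclid(self.buckets)
    }
}

pub struct Splitter {
    calculator: Option<Calculator>,
}

impl Splitter {
    // If a calculator is given, partition values are computed from the data;
    // otherwise the precomputed partition value of each row is used.
    pub fn new(calculator: Option<Calculator>) -> Self {
        Self { calculator }
    }

    // Step 1 (private): obtain one partition value per row.
    fn partition_values(&self, rows: &[Row]) -> Result<Vec<i64>, String> {
        match &self.calculator {
            Some(calc) => Ok(rows.iter().map(|row| calc.calculate(row)).collect()),
            None => rows
                .iter()
                .map(|row| {
                    row.partition
                        .ok_or_else(|| "row has no precomputed partition value".to_string())
                })
                .collect(),
        }
    }

    // Step 2 (public): group rows by their partition value.
    pub fn split(&self, rows: &[Row]) -> Result<BTreeMap<i64, Vec<Row>>, String> {
        let values = self.partition_values(rows)?;
        let mut groups: BTreeMap<i64, Vec<Row>> = BTreeMap::new();
        for (row, value) in rows.iter().zip(values) {
            groups.entry(value).or_default().push(row.clone());
        }
        Ok(groups)
    }
}
```

Because the calculation step is private, there is no ordering between calculate and split for callers to get wrong; the choice between the two sources of partition values is made once, in `new`.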

liurenjie1024 · 2025-10-23T09:54:45Z

crates/integrations/datafusion/src/writer/task.rs

+    Fanout(FanoutWriter<B>),
+    /// Writer for partitioned tables with sorted data (maintains single active writer)
+    Clustered(ClusteredWriter<B>),


We could simplify this to a single variant:

    Partitioned {
        splitter: RecordBatchSplitter,
        partitioned_writer: Arc<dyn PartitionedWriter>,
    }
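As a sketch of what collapsing the Fanout and Clustered variants into one could look like, the toy below puts both strategies behind a trait object. All names and signatures are assumptions for illustration, not the crate's API. It also uses Box<dyn PartitionedWriter> rather than the Arc suggested above, purely so the sketch can take &mut self without interior mutability.

```rust
use std::collections::BTreeMap;

// Hypothetical trait; the real PartitionedWriter may look quite different.
pub trait PartitionedWriter {
    fn write(&mut self, key: i64, rows: Vec<i64>) -> Result<(), String>;
    fn rows_written(&self) -> usize;
}

// Keeps one open buffer per partition, so input order does not matter.
#[derive(Default)]
pub struct Fanout {
    open: BTreeMap<i64, Vec<i64>>,
}

impl PartitionedWriter for Fanout {
    fn write(&mut self, key: i64, rows: Vec<i64>) -> Result<(), String> {
        self.open.entry(key).or_default().extend(rows);
        Ok(())
    }
    fn rows_written(&self) -> usize {
        self.open.values().map(|rows| rows.len()).sum()
    }
}

// Keeps a single active partition; rejects a partition that was already closed.
#[derive(Default)]
pub struct Clustered {
    current: Option<i64>,
    closed: Vec<i64>,
    rows_written: usize,
}

impl PartitionedWriter for Clustered {
    fn write(&mut self, key: i64, rows: Vec<i64>) -> Result<(), String> {
        if self.current != Some(key) {
            if self.closed.contains(&key) {
                return Err(format!("partition {key} was already closed; input is not clustered"));
            }
            if let Some(previous) = self.current.replace(key) {
                self.closed.push(previous);
            }
        }
        self.rows_written += rows.len();
        Ok(())
    }
    fn rows_written(&self) -> usize {
        self.rows_written
    }
}

// Toy splitter: the partition key is value mod buckets (assumes buckets > 0);
// it emits runs of consecutive rows that share a key, preserving input order.
pub struct Splitter {
    pub buckets: i64,
}

impl Splitter {
    pub fn split(&self, rows: &[i64]) -> Vec<(i64, Vec<i64>)> {
        let mut runs: Vec<(i64, Vec<i64>)> = Vec::new();
        for &row in rows {
            let key = row.rem_euclid(self.buckets);
            let extends_last = runs.last().map_or(false, |(last_key, _)| *last_key == key);
            if extends_last {
                runs.last_mut().expect("runs is non-empty").1.push(row);
            } else {
                runs.push((key, vec![row]));
            }
        }
        runs
    }
}

pub enum TaskWriter {
    Unpartitioned(Vec<i64>),
    Partitioned {
        splitter: Splitter,
        partitioned_writer: Box<dyn PartitionedWriter>,
    },
}

impl TaskWriter {
    pub fn write(&mut self, rows: &[i64]) -> Result<(), String> {
        match self {
            Self::Unpartitioned(sink) => {
                sink.extend_from_slice(rows);
                Ok(())
            }
            Self::Partitioned { splitter, partitioned_writer } => {
                for (key, group) in splitter.split(rows) {
                    partitioned_writer.write(key, group)?;
                }
                Ok(())
            }
        }
    }

    pub fn rows_written(&self) -> usize {
        match self {
            Self::Unpartitioned(sink) => sink.len(),
            Self::Partitioned { partitioned_writer, .. } => partitioned_writer.rows_written(),
        }
    }
}
```

With this shape the enum only distinguishes partitioned from unpartitioned tables; whether the partitioned path is fanout or clustered is decided by which writer is boxed at construction time.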

CTTY commented Oct 20, 2025


CTTY commented Oct 21, 2025


liurenjie1024 reviewed Oct 22, 2025


CTTY added 7 commits October 22, 2025 16:25

Return partition key in the partition splitter

650ad7e

Add unpartitioned writer

03ea356

Add TaskWriter trait

e87d19f

Implement BaseTaskWriter

9ea5b30

fix clippy for unpartitioned writer

b91844b

Add unit tests for DefaultTaskWriter

2cd60b0

add a flag in splitter to skip projection

f4b72ef

CTTY force-pushed the ctty/task-writer branch from e875e8e to f4b72ef Compare October 23, 2025 03:05

CTTY added 4 commits October 22, 2025 21:31

better naming

58b2ab4

Implement actual datafusion task writer

9eb3bc2

fmt and cleanup

4ecc496

minor

039cc86

liurenjie1024 reviewed Oct 23, 2025


trying to separate calculator and splitter

c5061e2

CTTY mentioned this pull request Oct 23, 2025

refactor(arrow,datafusion): Reuse PartitionValueCalculator in RecordBatchPartitionSplitter #1781

Open
