Labels
enhancement (New feature or request), help wanted (Extra attention is needed)
Description
Environment
Delta-rs version: 1.0.2
Binding: Python (deltalake)
Environment:
- Cloud provider: AWS (S3 tested)
- OS: Ubuntu 22.04
Bug
What happened:
When using write_deltalake() with a generator, memory usage grows continuously during the write process. This happens both locally and when writing to S3, suggesting batches are not being released or flushed as expected.
What you expected to happen:
I expected memory usage to remain stable or plateau over time, as only one batch is yielded at a time by the generator. Each batch is 100,000 rows, and there are 1,000 batches in total. The generator should allow batch-wise memory usage, not cumulative growth.
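To make the expectation concrete, here is a minimal standard-library sketch of the difference between a consumer that streams a generator and one that buffers it. It does not call deltalake or pyarrow; make_batches, peak_bytes, streaming_consumer and buffering_consumer are hypothetical names used only for illustration, and plain Python lists stand in for record batches.

```python
import tracemalloc


def make_batches(n_batches, rows_per_batch):
    """Yield one freshly built list per iteration (stand-in for a record batch)."""
    for i in range(n_batches):
        yield [i * rows_per_batch + j for j in range(rows_per_batch)]


def streaming_consumer(batches):
    # Keeps at most the current batch alive (plus the one being built).
    total = 0
    for batch in batches:
        total += len(batch)
    return total


def buffering_consumer(batches):
    # Holds every batch until the end, like a writer that retains its input.
    held = list(batches)
    return sum(len(b) for b in held)


def peak_bytes(consume):
    """Run consume() on a fresh generator and return peak traced bytes."""
    tracemalloc.start()
    try:
        consume(make_batches(40, 10_000))
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


streaming_peak = peak_bytes(streaming_consumer)
buffering_peak = peak_bytes(buffering_consumer)
print(f"streaming peak: {streaming_peak / 1024**2:.1f} MB")
print(f"buffering peak: {buffering_peak / 1024**2:.1f} MB")
```

The streaming consumer's peak stays near the size of one or two batches regardless of how many batches there are, while the buffering consumer's peak scales with the batch count. The report expects write_deltalake() to behave like the former.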
How to reproduce it:
import os

import psutil
import pyarrow as pa
from deltalake import write_deltalake


def log_memory():
    process = psutil.Process(os.getpid())
    print(f"Memory usage: {process.memory_info().rss / 1024**2:.2f} MB")


schema = pa.schema([("id", pa.int64()), ("value", pa.string())])
nbatch = 1_000
batch_size = 100_000


def generate_batches():
    for i in range(nbatch):
        log_memory()
        batch = pa.record_batch(
            {"id": [i * 10 + j for j in range(batch_size)],
             "value": [f"value_{j}" for j in range(batch_size)]},
            schema=schema,
        )
        yield batch


reader = pa.RecordBatchReader.from_batches(schema, generate_batches())
write_deltalake(
    "./test",
    reader,
    mode="overwrite",
)
More details:
- The issue is not tied to local storage; we see the same behavior when writing to S3 using the AWS bindings.
- Memory usage (via psutil and CloudWatch) steadily increases batch-by-batch, even though the generator should not accumulate state.
- It appears that write_deltalake() is buffering or retaining batches in memory instead of processing and releasing them incrementally.
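One possible way to quantify "steadily increases" from the values printed by log_memory() is to fit a straight line through the per-batch readings and look at the slope. The sketch below uses only plain Python; growth_per_batch is a hypothetical helper, and the sample lists are made-up readings for illustration, not measurements from this report.

```python
def growth_per_batch(samples_mb):
    """Least-squares slope of memory samples (MB) against batch index."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_mb))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Hypothetical readings in MB, one per batch (not real data).
plateau = [210.0, 215.0, 214.0, 216.0, 215.0, 214.0]
growing = [210.0, 222.0, 234.0, 246.0, 258.0, 270.0]

print(f"plateau slope: {growth_per_batch(plateau):.2f} MB/batch")
print(f"growing slope: {growth_per_batch(growing):.2f} MB/batch")
```

A slope close to zero after the first few batches would match the expected plateau. A slope close to the size of one batch would suggest that every batch is being retained.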