Description
Apache Iceberg version
version = "0.9.1"
Please describe the bug 🐞
There appears to be a memory leak in pyiceberg/avro/reader.py.
I have a long-running service that keeps crashing. I tried to reproduce the problem locally, and the same behavior shows up there.
The following code creates an in-memory catalog and generates some sample data to ingest into Iceberg.
import tracemalloc
from datetime import datetime, timezone

import polars as pl
from pyiceberg.catalog.memory import InMemoryCatalog


def generate_df():
    # Build 1000 rows of sample events with a nested struct column.
    df = pl.DataFrame(
        {
            "event_type": ["playback"] * 1000,
            "event_origin": ["origin1"] * 1000,
            "event_send_at": [datetime.now(timezone.utc)] * 1000,
            "event_saved_at": [datetime.now(timezone.utc)] * 1000,
            "data": [
                {
                    "calendarKey": "calendarKey",
                    "id": str(i),
                    "referenceId": f"ref-{i}",
                }
                for i in range(1000)
            ],
        }
    )
    return df


# Create the catalog, namespace, and table once, using the DataFrame's schema.
df = generate_df()
catalog = InMemoryCatalog("default", warehouse="/tmp/iceberg")
catalog.create_namespace("default")
table = catalog.create_table(
    "default.leak", schema=df.to_arrow().schema, location="/tmp/iceberg/leak"
)

tracemalloc.start()
for i in range(1000):
    df = generate_df()
    df.write_iceberg(table, mode="append")
    # Print the ten largest allocation sites after each append.
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics("lineno")
    for stat in top_stats[:10]:
        print(stat)
Slowly but steadily, the memory attributed to the Avro reader increases:
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=370 KiB, count=3782, average=100 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=222 KiB, count=1891, average=120 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=184 KiB, count=5673, average=33 B
After some more writes, the output looks like this:
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=420 KiB, count=4290, average=100 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=251 KiB, count=2145, average=120 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=208 KiB, count=6435, average=33 B
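To make the growth between two points easier to read than eyeballing raw snapshot output, the snapshots can be diffed and filtered to a single file. The following is a minimal sketch using only the standard-library tracemalloc API; the helper name reader_growth and the default filename pattern are assumptions made for illustration, not part of pyiceberg.

```python
import tracemalloc


def reader_growth(before, after, pattern="*/pyiceberg/avro/reader.py", limit=10):
    """Return the largest per-line allocation diffs for files matching `pattern`.

    `before` and `after` are tracemalloc.Snapshot objects. The helper name and
    the default pattern are hypothetical and chosen for this issue only.
    """
    # Keep only traces whose filename matches the pattern (fnmatch-style).
    filters = [tracemalloc.Filter(True, pattern)]
    diffs = after.filter_traces(filters).compare_to(
        before.filter_traces(filters), "lineno"
    )
    return diffs[:limit]


if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    # Stand-in for a batch of appends; replace with the write loop above.
    retained = [bytearray(2048) for _ in range(50)]
    after = tracemalloc.take_snapshot()
    for stat in reader_growth(before, after, pattern="*"):
        print(stat)
```

Taking one snapshot before a batch of appends and one after, then passing both to the helper, should show whether the reader.py lines keep a positive size difference from batch to batch.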
The AvroFile class implements the __enter__ and __exit__ dunder methods. The __enter__ method assigns the reader to an attribute on the instance, but it seems the reader objects it creates stick around afterwards.
https://github.com/apache/iceberg-python/blob/main/pyiceberg/avro/file.py#L192
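To illustrate the general pattern I suspect, here is a minimal sketch with hypothetical classes (Reader, LeakyFile, TidyFile are stand-ins, not pyiceberg code). It only shows that an object assigned to an instance attribute in __enter__ stays alive after the with block unless __exit__ drops the reference; whether this is what actually happens in AvroFile is an assumption I have not confirmed.

```python
import gc
import weakref


class Reader:
    """Hypothetical stand-in for a reader object built in __enter__."""


class LeakyFile:
    def __enter__(self):
        self.reader = Reader()  # stored on the instance
        return self

    def __exit__(self, exc_type, exc, tb):
        return None  # the reader is still referenced by self


class TidyFile(LeakyFile):
    def __exit__(self, exc_type, exc, tb):
        self.reader = None  # drop the reference when the block ends
        return None


def reader_survives_exit(file_cls):
    """Return True if the reader is still alive after the with block."""
    f = file_cls()
    with f:
        ref = weakref.ref(f.reader)
    gc.collect()
    # f is still in scope here, so anything it references stays alive.
    return ref() is not None


if __name__ == "__main__":
    print("LeakyFile keeps reader:", reader_survives_exit(LeakyFile))
    print("TidyFile keeps reader:", reader_survives_exit(TidyFile))
```

If the file object itself is short-lived, this alone would not explain unbounded growth, so something else (a cache or a longer-lived owner) would also have to be holding the file or reader objects.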