Avro reader memory leak #2325

@Declow

Apache Iceberg version

version = "0.9.1"

Please describe the bug 🐞

There appears to be a memory leak in pyiceberg/avro/reader.py.
I have a long-running service that keeps crashing. I tried to replicate the issue locally, and the same memory growth shows up there.

The following code creates an in-memory catalog and generates test data to ingest into Iceberg.

from pyiceberg.catalog.memory import InMemoryCatalog
import tracemalloc
from datetime import datetime, timezone
import polars as pl

def generate_df():
    df = pl.DataFrame(
        {
            "event_type": ["playback"] * 1000,
            "event_origin": ["origin1"] * 1000,
            "event_send_at": [datetime.now(timezone.utc)] * 1000,
            "event_saved_at": [datetime.now(timezone.utc)] * 1000,
            "data": [
                {
                    "calendarKey": "calendarKey",
                    "id": str(i),
                    "referenceId": f"ref-{i}",
                }
                for i in range(1000)
            ],
        }
    )
    return df

# Build one frame up front so its Arrow schema can define the table.
df = generate_df()
catalog = InMemoryCatalog("default", warehouse="/tmp/iceberg")
catalog.create_namespace("default")
table = catalog.create_table(
    "default.leak", schema=df.to_arrow().schema, location="/tmp/iceberg/leak"
)

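tracemalloc.start()
for i in range(1000):
    # Append a fresh batch of 1000 rows on every iteration.
    df = generate_df()
    df.write_iceberg(table, mode="append")
    # Print the ten source lines currently holding the most memory.
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics("lineno")
    for stat in top_stats[:10]:
        print(stat)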

Slowly but steadily, the memory attributed to the Avro reader increases:

/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=370 KiB, count=3782, average=100 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=222 KiB, count=1891, average=120 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=184 KiB, count=5673, average=33 B

After some more writes, the output looks like this:

/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=420 KiB, count=4290, average=100 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=251 KiB, count=2145, average=120 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=208 KiB, count=6435, average=33 B
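
To make the growth easier to read than eyeballing two top-10 lists, the snapshots can be diffed directly. The following is a minimal sketch using only the standard-library tracemalloc API; `growth_by_line` is a hypothetical helper name, and the `retained` list is a toy stand-in for the leak, not pyiceberg code. Calling `gc.collect()` before each snapshot helps rule out garbage that is merely waiting for the cycle collector.

```python
import gc
import tracemalloc


def growth_by_line(baseline, current, filename_pattern="*", limit=10):
    """Return the source lines whose traced memory grew between two snapshots."""
    # Keep only traces from files matching the pattern (fnmatch-style).
    filters = [tracemalloc.Filter(True, filename_pattern)]
    diff = current.filter_traces(filters).compare_to(
        baseline.filter_traces(filters), "lineno"
    )
    # compare_to returns entries ordered by the magnitude of the change.
    return [stat for stat in diff if stat.size_diff > 0][:limit]


retained = []  # toy stand-in for whatever keeps objects alive

tracemalloc.start()
gc.collect()
baseline = tracemalloc.take_snapshot()

for _ in range(1000):
    retained.append(bytearray(1024))  # simulated leak: 1 KiB per iteration

gc.collect()
current = tracemalloc.take_snapshot()
tracemalloc.stop()

for stat in growth_by_line(baseline, current):
    print(stat)
```

In the reproduction script above, the baseline would be taken before the loop and the comparison run every N iterations, with `filename_pattern="*/pyiceberg/avro/reader.py"` to isolate the reader allocations.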

If we take a look at the AvroFile class, it implements the __enter__ and __exit__ dunder methods. The __enter__ method assigns the reader to an attribute on the instance, but it seems that the instances of the various reader classes stick around afterwards.
https://github.com/apache/iceberg-python/blob/main/pyiceberg/avro/file.py#L192
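
To illustrate the suspicion (this is a toy model, not pyiceberg's actual code), here is a sketch of how an attribute assigned in `__enter__` outlives the `with` block unless `__exit__` clears it. All class and function names below are hypothetical, and the result relies on CPython's reference counting.

```python
import weakref


class Reader:
    """Hypothetical stand-in for a tree of reader objects."""


class FileKeepsReader:
    """Assigns the reader in __enter__ and never releases it."""

    def __enter__(self):
        self.reader = Reader()
        return self

    def __exit__(self, exc_type, exc, tb):
        return None  # self.reader is still referenced after the block


class FileDropsReader(FileKeepsReader):
    """Same as above, but drops the reference on exit."""

    def __exit__(self, exc_type, exc, tb):
        self.reader = None
        return None


def reader_alive_after_exit(file_cls):
    """Report whether the reader survives the with block while the file object lives."""
    f = file_cls()
    with f:
        ref = weakref.ref(f.reader)
    return ref() is not None


print(reader_alive_after_exit(FileKeepsReader))
print(reader_alive_after_exit(FileDropsReader))
```

This only matters if the file object itself, or something else referencing the reader, stays alive after the block; whether that is the case in pyiceberg still needs to be confirmed.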

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
