
Very slow performance when using iter_lines on an s3 object with long lines #2774

@mike-roberts-healx


Describe the bug

StreamingBody.iter_lines seems to be extremely slow when dealing with a file with very long lines. We were using this in ECS to parse a ~500 MB JSON Lines file containing fairly large objects, and the parse was taking upwards of 10 minutes to run.

Expected Behavior

Performance should be comparable to read or iter_chunks, regardless of how long the lines are.

Current Behavior

Iterating by line is orders of magnitude slower than reading the same object with read or iter_chunks.

Reproduction Steps

Minimal repro (requires moto) that reads a 10MB file with no line breaks:

from moto import mock_s3
import boto3
import time

BUCKET = "test_bucket"
KEY = "long_lines.txt"
KILOBYTES = 1024
MEGABYTES = 1024 * KILOBYTES

@mock_s3
def slow_iter_lines():
    s3 = boto3.resource("s3")
    s3.create_bucket(Bucket=BUCKET)
    obj = s3.Object(BUCKET, KEY)
    obj.put(Body=b"a" * (10*MEGABYTES))

    start1 = time.perf_counter()
    obj.get()["Body"].read()
    end1 = time.perf_counter()
    print(f"Normal read took {end1-start1}s")

    start2 = time.perf_counter()
    list(obj.get()["Body"].iter_chunks())
    end2 = time.perf_counter()
    print(f"Chunk iterator took {end2-start2}s")

    start3 = time.perf_counter()
    list(obj.get()["Body"].iter_lines())
    end3 = time.perf_counter()
    print(f"Line iterator took {end3-start3}s")

slow_iter_lines()

Output on botocore==1.27.87:

Normal read took 0.003736798000318231s
Chunk iterator took 0.008655662000819575s
Line iterator took 26.232641663998947s

Possible Solution

The implementation of iter_lines looks to be quadratic in the length of the lines:

    def iter_lines(self, chunk_size=_DEFAULT_CHUNK_SIZE, keepends=False):
        pending = b''
        for chunk in self.iter_chunks(chunk_size):
            lines = (pending + chunk).splitlines(True)
            for line in lines[:-1]:
                yield line.splitlines(keepends)[0]
            pending = lines[-1]
        if pending:
            yield pending.splitlines(keepends)[0]

If none of the chunks contain a line break, then on every pass through this loop it does two things:

  • (pending + chunk), which allocates a new buffer and copies both operands into it
  • splitlines, which scans the whole combined buffer again looking for line breaks

So pending keeps growing, and each iteration copies and rescans all of it, so the total cost grows quadratically until a line break finally appears (the standalone snippet below shows the same blow-up without boto3).
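
A standalone snippet (no boto3 needed; the 1 KiB chunk size and 10 MiB total are chosen to mirror the repro above and are assumptions about the effective chunking) demonstrates the same quadratic pattern:

import time

chunk = b"a" * 1024              # small chunks, as iter_chunks would yield
pending = b""
start = time.perf_counter()
for _ in range(10 * 1024):       # 10 MiB total, mirroring the repro above
    pending = pending + chunk    # copies the ever-growing buffer
    pending.splitlines(True)     # rescans the whole buffer, finds no breaks
print(f"Quadratic loop took {time.perf_counter() - start}s")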

A better implementation would probably be to maintain a list of pending chunks and concatenate them only when a line break is reached.
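
A rough sketch of that idea (not an actual botocore patch; iter_lines_buffered and the body parameter are illustrative names, and body is assumed to expose iter_chunks the way StreamingBody does):

def iter_lines_buffered(body, chunk_size=1024, keepends=False):
    # Sketch only: accumulate raw chunks in a list and join them only when a
    # line break actually shows up, so old bytes are not recopied every pass.
    pending = []
    for chunk in body.iter_chunks(chunk_size):
        if b"\n" not in chunk and b"\r" not in chunk:
            pending.append(chunk)        # no break yet: O(1) append, no copy
            continue
        lines = (b"".join(pending) + chunk).splitlines(True)
        for line in lines[:-1]:
            yield line.splitlines(keepends)[0]
        pending = [lines[-1]]
    # Flush whatever is left; it may span several buffered chunks.
    tail = b"".join(pending)
    for line in tail.splitlines(True):
        yield line.splitlines(keepends)[0]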

Additional Information/Context

Increasing chunk_size to 1MB fixed the immediate problem we were having.
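
For reference, the workaround is just passing a larger chunk_size to iter_lines (the parameter is visible in the signature quoted above); process() here is a placeholder for whatever per-line handling you do:

for line in obj.get()["Body"].iter_lines(chunk_size=1024 * 1024):
    process(line)  # placeholder for your own per-line handling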

SDK version used

1.27.87

Environment details (OS name and version, etc.)

Ubuntu 22.04, Python 3.10.4


Labels

bug, needs-review, p3, s3
