Describe the bug
StreamingBody.iter_lines seems to be extremely slow when dealing with a file with very long lines. We were using this in ECS to parse a ~500 MB JSON Lines file containing fairly large objects, and it was taking upwards of 10 minutes to run.
Expected Behavior
Performance should be comparable to read or iter_chunks, regardless of how long the lines are.
Current Behavior
iter_lines is far slower than reading the same file via read or iter_chunks.
Reproduction Steps
Minimal repro (requires moto) that reads a 10 MB file with no line breaks:
from moto import mock_s3
import boto3
import time

BUCKET = "test_bucket"
KEY = "long_lines.txt"

KILOBYTES = 1024
MEGABYTES = 1024 * KILOBYTES

@mock_s3
def slow_iter_lines():
    s3 = boto3.resource("s3")
    s3.create_bucket(Bucket=BUCKET)
    obj = s3.Object(BUCKET, KEY)
    obj.put(Body=b"a" * (10 * MEGABYTES))

    start1 = time.perf_counter()
    obj.get()["Body"].read()
    end1 = time.perf_counter()
    print(f"Normal read took {end1 - start1}s")

    start2 = time.perf_counter()
    list(obj.get()["Body"].iter_chunks())
    end2 = time.perf_counter()
    print(f"Chunk iterator took {end2 - start2}s")

    start3 = time.perf_counter()
    list(obj.get()["Body"].iter_lines())
    end3 = time.perf_counter()
    print(f"Line iterator took {end3 - start3}s")

slow_iter_lines()
Output on botocore==1.27.87:
Normal read took 0.003736798000318231s
Chunk iterator took 0.008655662000819575s
Line iterator took 26.232641663998947s
Possible Solution
The implementation of iter_lines looks to be quadratic in the length of the lines:
def iter_lines(self, chunk_size=_DEFAULT_CHUNK_SIZE, keepends=False):
    pending = b''
    for chunk in self.iter_chunks(chunk_size):
        lines = (pending + chunk).splitlines(True)
        for line in lines[:-1]:
            yield line.splitlines(keepends)[0]
        pending = lines[-1]
    if pending:
        yield pending.splitlines(keepends)[0]
If there are no line breaks in any of the chunks, then every time it goes round this loop it is doing:
- (pending + chunk), which requires allocating and copying into a new buffer
- splitlines, which requires iterating through the whole buffer again looking for line breaks
So pending keeps growing, and every iteration copies the whole buffer and scans it again, so the loop gets quadratically slower until a line break is reached. The standalone snippet below illustrates the same pattern in isolation.
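The quadratic behaviour is easy to reproduce outside of botocore. This is only an illustrative sketch (arbitrary chunk counts, a 1 KiB chunk of newline-free data, not botocore code):

import time

def scan_like_iter_lines(n_chunks, chunk=b"a" * 1024):
    # Mirrors the hot loop above: grow a single buffer and rescan all of it
    # for every chunk, even though no line break is ever found.
    pending = b""
    for _ in range(n_chunks):
        pending = pending + chunk   # copies the whole buffer
        pending.splitlines(True)    # scans the whole buffer

for n in (1000, 2000, 4000):
    start = time.perf_counter()
    scan_like_iter_lines(n)
    print(f"{n} chunks of 1 KiB -> {time.perf_counter() - start:.2f}s")
# Doubling the number of chunks roughly quadruples the runtime, i.e. O(n^2)
# in the length of the line.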
A better implementation would probably be to maintain a list of pending chunks and concatenate them only when a line break is reached.
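A rough, untested sketch of that idea, written against the method quoted above (it keeps the unfinished line as a list of chunks and only joins/scans it when a chunk actually contains a line break):

def iter_lines(self, chunk_size=_DEFAULT_CHUNK_SIZE, keepends=False):
    # Keep the unfinished line as a list of chunks so each byte is copied a
    # bounded number of times instead of once per incoming chunk.
    pending = []
    for chunk in self.iter_chunks(chunk_size):
        if b'\n' not in chunk and b'\r' not in chunk:
            # No break in this chunk: if pending already ends with a
            # terminator, that line is now definitely complete.
            if pending and pending[-1].endswith((b'\n', b'\r')):
                yield b''.join(pending).splitlines(keepends)[0]
                pending = []
            pending.append(chunk)
            continue
        lines = (b''.join(pending) + chunk).splitlines(True)
        for line in lines[:-1]:
            yield line.splitlines(keepends)[0]
        pending = [lines[-1]]
    if pending:
        yield b''.join(pending).splitlines(keepends)[0]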
Additional Information/Context
Increasing chunk_size to 1 MB fixed the immediate problem we were having.
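For reference, the workaround was along these lines (obj as in the repro above; process is just a placeholder for whatever consumes each line):

body = obj.get()["Body"]
# Larger chunks mean far fewer rescans of the growing pending buffer.
for line in body.iter_lines(chunk_size=1024 * 1024):
    process(line)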
SDK version used
1.27.87
Environment details (OS name and version, etc.)
Ubuntu 22.04, python 3.10.4