Problem description
smart_open is effectively crashing SeaweedFS/S3 and Ceph/S3 filesystems when doing many small reads over large files (e.g. a 32k read on a 4GB file).
On a SeaweedFS/S3 filesystem (also demonstrated on Ceph/S3), using the code shown in the reproduce section below, I am reading a small segment of data (32k in this example) from a large file (4GB in this example). This simple request causes SeaweedFS to move the full file for just a 32k range read. It appears that this is expected behavior based on a reading of the protocol specifications. Notably, boto3 does not trigger the same behavior.
When I run the code and inspect the HTTP traffic being generated, I see the following GET request:
GET /hengenlab/CAF77/Neural_Data/highpass_750/Headstages_256_Channels_int16_2021-02-02_14-28-24.bin HTTP/1.1
Host: seaweed-filer.seaweedfs:8333
Accept-Encoding: identity
Range: bytes=0-
User-Agent: Boto3/1.24.59 Python/3.8.2 Linux/5.4.0-125-generic Botocore/1.27.59
X-Amz-Date: 20220921T204638Z
X-Amz-Content-SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Authorization: AWS4-HMAC-SHA256 Credential=KL2PPBIZ4OYKR420C28D/20220921/us-east-1/s3/aws4_request, SignedHeaders=host;range;x-amz-content-sha256;x-amz-date, Signature=8009c4fdc85311977066c6988047a72658579e02c02b544fa8d48d8d8b9e8d57
amz-sdk-invocation-id: eace9bc1-a1d1-4244-84ee-1caa164bc294
amz-sdk-request: attempt=1
Notably, Range: bytes=0- is our culprit. My assumption is that smart_open intends to open the file for streaming and read data from the stream as dictated by calls to f.read(...).
When performing a ranged read with just boto3 code, the header looks like this:
GET /hengenlab/CAF77/Neural_Data/highpass_750/Headstages_256_Channels_int16_2021-02-02_14-28-24.bin?partNumber=1 HTTP/1.1
Host: seaweed-filer.seaweedfs:8333
Accept-Encoding: identity
Range: bytes=1-32768
User-Agent: Boto3/1.24.59 Python/3.8.2 Linux/5.4.0-125-generic Botocore/1.27.59
X-Amz-Date: 20220921T205137Z
X-Amz-Content-SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Authorization: AWS4-HMAC-SHA256 Credential=KL2PPBIZ4OYKR420C28D/20220921/us-east-1/s3/aws4_request, SignedHeaders=host;range;x-amz-content-sha256;x-amz-date, Signature=5d3398396c217f4a87284479bc6bc947344c256a30552fe98b6057167d7143fb
amz-sdk-invocation-id: 489f165b-cb4d-4ca0-bc49-cc1e70618518
amz-sdk-request: attempt=1
Using boto3 does not cause any issue. With smart_open, the SeaweedFS/S3 filesystem interprets the lack of a to-bytes value as a request for the full file. It then moves the full (4 GB) data file from a volume server to an S3 server, where just 32k is passed to the end-user job. This very quickly oversaturates the network.
The protocol specifications seem to agree that SeaweedFS/S3 is interpreting this Range header correctly. After all, how can the filesystem know that the user won't need to read the whole file given this header? RFC 2616, section 14.35, says:
If the last-byte-pos value is absent, or if the value is greater than or equal to the current length of the entity-body, last-byte-pos is taken to be equal to one less than the current length of the entity-body in bytes.
https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
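To make the quoted rule concrete, here is a minimal sketch of how a server could resolve a single byte range under that wording. The function names resolve_byte_range and bytes_requested are hypothetical, "4 GB" is assumed to mean 4 * 1024**3 bytes, and suffix ranges (bytes=-N) and multi-range requests are deliberately not handled. It is not SeaweedFS or Ceph code.

```python
import re
from typing import Tuple

# Matches a single "bytes=first-[last]" range; the last position is optional.
_RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")


def resolve_byte_range(header: str, entity_length: int) -> Tuple[int, int]:
    """Return the inclusive (first, last) byte positions for one byte range."""
    match = _RANGE_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unsupported Range header: {header!r}")
    first = int(match.group(1))
    last_text = match.group(2)
    if first >= entity_length:
        raise ValueError("range not satisfiable")
    # Quoted rule: an absent or too-large last-byte-pos becomes length - 1.
    if last_text == "" or int(last_text) >= entity_length:
        last = entity_length - 1
    else:
        last = int(last_text)
    if last < first:
        raise ValueError("last-byte-pos is smaller than first-byte-pos")
    return first, last


def bytes_requested(header: str, entity_length: int) -> int:
    """Number of bytes the server is asked to serve for this range."""
    first, last = resolve_byte_range(header, entity_length)
    return last - first + 1


file_size = 4 * 1024**3  # assumed size of the 4 GB file
print(bytes_requested("bytes=0-", file_size))       # the whole file
print(bytes_requested("bytes=1-32768", file_size))  # only the bounded range
```

Under this reading, the open-ended header asks for every byte of the file, while the bounded header asks for 32768 bytes.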
Steps/code to reproduce the problem
This code triggers the full (4 GB) file transfer (internally, not to the end user application) for a small 32k ranged read.
import smart_open

while True:
    # Each iteration opens the object and reads only the first 32k.
    with smart_open.open('s3://bucket/path/to/4gb/file.bin', 'rb') as f:
        b = f.read(32768)
    print(f'Read {len(b)}')
This boto3 version of the code does not trigger the same issue:
import boto3

while True:
    obj = boto3.resource('s3', endpoint_url='https://swfs-s3.endpoint').Object('bucket', 'path/to/4gb/file.bin')
    # The explicit end position keeps the request bounded to 32k.
    stream = obj.get(Range='bytes=1-32768')['Body']
    res = stream.read()
    print(f'Read: {len(res)} bytes')
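As one possible direction for the discussion, here is a rough sketch of a reader that turns every read(size) call into a bounded range request instead of a single open-ended one. BoundedRangeReader, range_header and make_in_memory_fetch are hypothetical names invented for illustration, and this is not how smart_open is implemented. The fetch_range callable is an assumption: with boto3 it could wrap obj.get(Range=range_header(first, last))['Body'].read() as in the example above; here it is backed by an in-memory buffer so the sketch runs without a server. Reads past the end of the object are not handled the way a real S3 endpoint would handle them.

```python
from typing import Callable


def range_header(first: int, last: int) -> str:
    """Build a bounded Range header value with inclusive byte positions."""
    return f"bytes={first}-{last}"


class BoundedRangeReader:
    """Hypothetical reader that issues one bounded range request per read()."""

    def __init__(self, fetch_range: Callable[[int, int], bytes]) -> None:
        # fetch_range(first, last) returns the bytes in that inclusive range.
        self._fetch_range = fetch_range
        self._position = 0

    def read(self, size: int) -> bytes:
        if size <= 0:
            return b""
        first = self._position
        last = first + size - 1  # explicit end, so the request stays bounded
        data = self._fetch_range(first, last)
        self._position += len(data)
        return data


def make_in_memory_fetch(blob: bytes) -> Callable[[int, int], bytes]:
    """Stand-in for a real ranged GET, used only to exercise the sketch."""
    def fetch(first: int, last: int) -> bytes:
        return blob[first:last + 1]
    return fetch


blob = bytes(range(256)) * 1024  # 256 KiB of sample data
reader = BoundedRangeReader(make_in_memory_fetch(blob))
chunk = reader.read(32768)
print(f"Read {len(chunk)} bytes using {range_header(0, 32767)}")
```

The trade-off is that a sequential read of the whole file would need many requests rather than one stream, so an approach like this would probably have to be opt-in or combined with buffering.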
I'd like to open a discussion to determine how to properly interpret the S3 protocol and figure out whether an issue like this should be in the domain of the filesystem(s) or should be changed in smart_open.
Versions
>>> print(platform.platform())
Linux-5.19.0-76051900-generic-x86_64-with-glibc2.17
>>> print("Python", sys.version)
Python 3.8.10 (default, Jun 4 2021, 15:09:15)
[GCC 7.5.0]
>>> print("smart_open", smart_open.__version__)
smart_open 6.0.0