
[s3] Ranged read http header leaves TO bytes blank, causing S3 filesystems to read full file length #725

@davidparks21

Description


Problem description

smart_open is effectively crashing SeaweedFS/S3 and Ceph/S3 filesystems when doing many small reads over large files (e.g., a 32 KB read of a 4 GB file).

On a SeaweedFS/S3 filesystem (also demonstrated on Ceph/S3), using the code shown in the reproduce section below, I am reading a small segment of data (32 KB in this example) from a large file (4 GB in this example). This simple request causes SeaweedFS to move the full file internally for just a 32 KB range read. This appears to be expected behavior based on a reading of the protocol specifications. Notably, boto3 does not trigger the same behavior.

When I run the code and look at the HTTP traffic being generated we see the following GET request:

GET /hengenlab/CAF77/Neural_Data/highpass_750/Headstages_256_Channels_int16_2021-02-02_14-28-24.bin HTTP/1.1
Host: seaweed-filer.seaweedfs:8333
Accept-Encoding: identity
Range: bytes=0-
User-Agent: Boto3/1.24.59 Python/3.8.2 Linux/5.4.0-125-generic Botocore/1.27.59
X-Amz-Date: 20220921T204638Z
X-Amz-Content-SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Authorization: AWS4-HMAC-SHA256 Credential=KL2PPBIZ4OYKR420C28D/20220921/us-east-1/s3/aws4_request, SignedHeaders=host;range;x-amz-content-sha256;x-amz-date, Signature=8009c4fdc85311977066c6988047a72658579e02c02b544fa8d48d8d8b9e8d57
amz-sdk-invocation-id: eace9bc1-a1d1-4244-84ee-1caa164bc294
amz-sdk-request: attempt=1

Notably, Range: bytes=0- is the culprit. My assumption is that smart_open intends to open the file for streaming and read data from the stream as dictated by calls to f.read(...).

When performing a ranged read with just boto3 code the header looks like this:

GET /hengenlab/CAF77/Neural_Data/highpass_750/Headstages_256_Channels_int16_2021-02-02_14-28-24.bin?partNumber=1 HTTP/1.1
Host: seaweed-filer.seaweedfs:8333
Accept-Encoding: identity
Range: bytes=1-32768
User-Agent: Boto3/1.24.59 Python/3.8.2 Linux/5.4.0-125-generic Botocore/1.27.59
X-Amz-Date: 20220921T205137Z
X-Amz-Content-SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Authorization: AWS4-HMAC-SHA256 Credential=KL2PPBIZ4OYKR420C28D/20220921/us-east-1/s3/aws4_request, SignedHeaders=host;range;x-amz-content-sha256;x-amz-date, Signature=5d3398396c217f4a87284479bc6bc947344c256a30552fe98b6057167d7143fb
amz-sdk-invocation-id: 489f165b-cb4d-4ca0-bc49-cc1e70618518
amz-sdk-request: attempt=1

Using boto3 does not cause any issue. With smart_open, the SeaweedFS/S3 filesystem interprets the missing to-bytes value as a request for the full file. It then moves the full 4 GB data file from a volume server to an S3 server, of which just 32 KB are passed to the end-user job. This very quickly oversaturates the network.

The protocol specifications seem to agree that SeaweedFS/S3's interpretation of this Range header is correct. After all, how could the filesystem know from this header alone that the user won't read the whole file?

If the last-byte-pos value is absent, or if the value is greater than or equal to the current length of the entity-body, last-byte-pos is taken to be equal to one less than the current length of the entity-body in bytes.

https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
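The fix on the client side would be to always send a bounded range. A minimal sketch of the header construction, following the RFC 2616 semantics quoted above (byte positions are zero-indexed and the range is inclusive, so a read of `length` bytes ends at `start + length - 1`) — the helper name is hypothetical, not part of smart_open or boto3:

```python
def bounded_range_header(start, length):
    """Build a bounded Range header for `length` bytes starting at `start`.

    Per RFC 2616 sec. 14.35, byte ranges are zero-indexed and inclusive,
    so the last-byte-pos is start + length - 1. Always supplying it avoids
    the open-ended 'bytes=N-' form that requests the rest of the file.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return f"bytes={start}-{start + length - 1}"

# A 32 KB read from the start of the file:
print(bounded_range_header(0, 32768))  # bytes=0-32767
```

With this form, the server knows up front that at most 32 KB are wanted and never needs to stage the full object.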

Steps/code to reproduce the problem

This code triggers the full (4 GB) file transfer (internally, not to the end-user application) for a small 32 KB ranged read.

import smart_open

while True:
    with smart_open.open('s3://bucket/path/to/4gb/file.bin', 'rb') as f:
        b = f.read(32768)
        print(f'Read {len(b)}')

This boto3 version of the code does not trigger the same issue:

import boto3

while True:
    obj = boto3.resource('s3', endpoint_url='https://swfs-s3.endpoint').Object('bucket', 'path/to/4gb/file.bin')
    stream = obj.get(Range='bytes=1-32768')['Body']
    res = stream.read()
    print(f'Read: {len(res)} bytes')

I'd like to open a discussion to determine how to properly interpret the S3 protocol and figure out whether an issue like this should be in the domain of the filesystem(s) or should be changed in smart_open.
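If the change does belong in smart_open, one possible direction is a reader that issues one bounded ranged GET per read() call instead of a single open-ended bytes=0- request. A minimal sketch of that idea follows; the fetch callable is injected so the sketch stays self-contained and testable (in practice it would wrap boto3's get_object with a Range like 'bytes={start}-{end}'), and none of these names come from smart_open's actual internals:

```python
class BoundedRangeReader:
    """Sketch: a file-like reader that requests only the bytes each
    read() actually needs, via an inclusive (start, end) byte range."""

    def __init__(self, fetch, size):
        self._fetch = fetch  # fetch(start, end) -> bytes for the inclusive range
        self._size = size    # total object size in bytes
        self._pos = 0

    def read(self, n):
        if n <= 0 or self._pos >= self._size:
            return b""
        end = min(self._pos + n, self._size) - 1  # inclusive last-byte-pos
        data = self._fetch(self._pos, end)
        self._pos += len(data)
        return data

# Demo with an in-memory blob standing in for the remote object:
blob = bytes(range(256)) * 16  # 4 KB of sample data
reader = BoundedRangeReader(lambda s, e: blob[s:e + 1], len(blob))
print(len(reader.read(1024)))  # 1024
```

Each read() here maps to a bounded Range request, so the server-side transfer is capped at the caller's chunk size rather than the file size. The trade-off is one round trip per read(), which buffering could amortize.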

Versions

>>> print(platform.platform())
Linux-5.19.0-76051900-generic-x86_64-with-glibc2.17
>>> print("Python", sys.version)
Python 3.8.10 (default, Jun  4 2021, 15:09:15) 
[GCC 7.5.0]
>>> print("smart_open", smart_open.__version__)
smart_open 6.0.0
