klaudworks (Contributor) commented Oct 18, 2025

Current Approach

The current approach forwards the input stream from the incoming request directly to the Azure SDK. I used a Flux<ByteBuffer> to work around the limitation that our input stream does not support mark/reset. I looked into azure-sdk-for-java after seeing your issue (Azure/azure-sdk-for-java#42603); this seems to be the only workaround.
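For illustration, a minimal sketch of the idea, assuming the part is staged as a block on a BlockBlobAsyncClient (the class, method, and variable names below are mine, not the PR's actual code):

```java
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

import com.azure.core.util.FluxUtil;
import com.azure.storage.blob.specialized.BlockBlobAsyncClient;
import reactor.core.publisher.Flux;

final class StagePartSketch {
    // Stream one part to Azure without requiring mark/reset on the request stream.
    static String stagePart(BlockBlobAsyncClient blockClient,
            InputStream partStream, long partSize) {
        // Block IDs must be Base64-encoded and of equal length within one blob.
        String blockId = Base64.getEncoder()
                .encodeToString(UUID.randomUUID().toString().getBytes());

        // Wrap the non-resettable request stream in a Flux<ByteBuffer>; the SDK
        // consumes it exactly once, so SDK-level retries are not possible here.
        Flux<ByteBuffer> data = FluxUtil.toFluxByteBuffer(partStream);

        blockClient.stageBlock(blockId, data, partSize).block();
        return blockId;
    }
}
```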

Current Limitations

  1. Given that we don't really have a place to store MD5-based ETags for each part, we return them but don't actually use them to complete the multipart upload. Instead, we find all parts belonging to a multipart upload via the upload ID. For the final assembled blob, we just return Azure's ETag.
  2. We can't make use of the Azure SDK's retry logic on network issues. There is no way to solve this without persisting the input stream, so the problem is pushed down to the s3proxy client's AWS SDK, which will retry uploading failed parts.
  3. We don't handle If-Match / If-None-Match headers for conditional writes, which S3 supports for CompleteMultipartUpload requests. EDIT: I added a commit to support conditional writes for CompleteMultipartUpload requests (see the sketch after this list).
  4. AWS supports part sizes up to 5 GiB. I updated the max part size to 4000 MiB as documented here: https://learn.microsoft.com/en-us/azure/storage/blobs/scalability-targets. The default part size in the AWS SDK is 8 MB, so I would rather not implement some complicated chunking of parts on our side to cover the edge case where people upload 4-5 GiB parts.
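For reference, a hedged sketch of how the conditional CompleteMultipartUpload could map onto Azure's request conditions when committing the block list (the wrapper class and method signature are illustrative; the PR's actual wiring may differ):

```java
import java.util.List;

import com.azure.storage.blob.models.BlobRequestConditions;
import com.azure.storage.blob.specialized.BlockBlobClient;

final class CompleteUploadSketch {
    // Commit the blocks collected for one upload ID and translate the S3
    // If-Match / If-None-Match headers into Azure request conditions.
    static void completeMultipartUpload(BlockBlobClient blockClient,
            List<String> blockIdsForUpload, String ifMatch, String ifNoneMatch) {
        BlobRequestConditions conditions = new BlobRequestConditions()
                .setIfMatch(ifMatch)           // S3 If-Match      -> Azure If-Match
                .setIfNoneMatch(ifNoneMatch);  // S3 If-None-Match -> Azure If-None-Match

        blockClient.commitBlockListWithResponse(blockIdsForUpload,
                /* headers */ null, /* metadata */ null, /* tier */ null,
                conditions, /* timeout */ null, /* context */ null);
    }
}
```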

Alternative Approaches

Using a BufferedInputStream

I discarded this solution due to potentially high memory usage when uploading multiple parts in parallel.

Persisting the input stream on disk

Limitation 1: Potentially doubles the time until the part upload completes. This can be compensated for by uploading more parts in parallel.

Limitation 2: We would need to inform users that they must provide sufficient /tmp storage on sufficiently fast SSDs; on an HDD the user would be heavily I/O bound.

Benefit: We could actually calculate proper ETags for each part and store the MD5 hash, e.g. encoded in the block name. This is not possible with the current solution because the block name must already be provided when we pass the input stream from the request along to the Azure SDK.
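A rough sketch of what that alternative could look like (purely illustrative, not part of this PR; all names are made up), assuming parts are spilled to a temp directory before being staged:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Base64;

final class PersistPartSketch {
    // Spill the part to disk while computing its MD5, so the hash is known
    // before a block name has to be chosen and can be encoded into it.
    static String persistPartAndDeriveBlockId(InputStream partStream, Path tmpDir)
            throws Exception {
        Path partFile = Files.createTempFile(tmpDir, "s3proxy-part-", ".bin");
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (DigestInputStream in = new DigestInputStream(partStream, md5);
             OutputStream out = Files.newOutputStream(partFile)) {
            in.transferTo(out); // buffers the part on disk, hashing as it copies
        }
        // partFile would then be uploaded via stageBlock (not shown). Encoding the
        // MD5 into the Base64 block name lets the part's ETag be reconstructed
        // later from the block list.
        return Base64.getEncoder().encodeToString(md5.digest());
    }
}
```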

s3-tests Update

gaul/s3-tests#4

EDIT: I am also running this on a test Kubernetes cluster. It seems to work just fine. Some tooling (CNPG, a Kafka connector) can run their backups through it.

Fixes

#709 #553 #552

klaudworks (Contributor, Author) commented:

FYI @gaul I'm running this in a staging k8s cluster and it seems to work just fine.
