klaudworks (Contributor) commented Oct 18, 2025

Current Approach

The current approach forwards the input stream from the incoming request directly to the Azure SDK. I used a Flux<ByteBuffer> to work around the limitation that our input stream does not support mark/reset. I looked into azure-sdk-for-java after seeing your issue (Azure/azure-sdk-for-java#42603); this seems to be the only workaround.
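For illustration, a minimal sketch of the idea, assuming the part is staged as a block on a BlockBlobAsyncClient (the class, method, and variable names below are mine, not the PR's actual code):

```java
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

import com.azure.core.util.FluxUtil;
import com.azure.storage.blob.specialized.BlockBlobAsyncClient;
import reactor.core.publisher.Flux;

final class StagePartSketch {
    // Stream one part to Azure without requiring mark/reset on the request stream.
    static String stagePart(BlockBlobAsyncClient blockClient,
            InputStream partStream, long partSize) {
        // Block IDs must be Base64-encoded and of equal length within one blob.
        String blockId = Base64.getEncoder()
                .encodeToString(UUID.randomUUID().toString().getBytes());

        // Wrap the non-resettable request stream in a Flux<ByteBuffer>; the SDK
        // consumes it exactly once, so SDK-level retries are not possible here.
        Flux<ByteBuffer> data = FluxUtil.toFluxByteBuffer(partStream);

        blockClient.stageBlock(blockId, data, partSize).block();
        return blockId;
    }
}
```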

Current Limitations

  1. Given that we don't really have a place to store MD5-based ETags for each part, we return them but don't actually use them to complete the multipart upload. Instead, we find all parts belonging to a multipart upload via the upload ID. For the final assembled blob, we just return Azure's ETag.
  2. We can't make use of the Azure SDK's retry logic on network issues. There is no way to solve this without persisting the input stream, so the problem is pushed down to the s3proxy client's AWS SDK, which will retry uploading failed parts.
  3. We don't handle If-Match / If-None-Match headers for conditional writes, which S3 supports for CompleteMultipartUpload requests. EDIT: I added a commit to support conditional writes for CompleteMultipartUpload requests (see the sketch after this list).
  4. AWS supports part sizes up to 5 GiB. I updated the max part size to 4000 MiB as documented here: https://learn.microsoft.com/en-us/azure/storage/blobs/scalability-targets. The default part size in the AWS SDK is 8 MB, so I would rather not implement some complicated chunking of parts on our side to cover the edge case where people upload 4-5 GiB parts.
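For reference, a hedged sketch of how the conditional CompleteMultipartUpload could map onto Azure's request conditions when committing the block list (the wrapper class and method signature are illustrative; the PR's actual wiring may differ):

```java
import java.util.List;

import com.azure.storage.blob.models.BlobRequestConditions;
import com.azure.storage.blob.specialized.BlockBlobClient;

final class CompleteUploadSketch {
    // Commit the blocks collected for one upload ID and translate the S3
    // If-Match / If-None-Match headers into Azure request conditions.
    static void completeMultipartUpload(BlockBlobClient blockClient,
            List<String> blockIdsForUpload, String ifMatch, String ifNoneMatch) {
        BlobRequestConditions conditions = new BlobRequestConditions()
                .setIfMatch(ifMatch)           // S3 If-Match      -> Azure If-Match
                .setIfNoneMatch(ifNoneMatch);  // S3 If-None-Match -> Azure If-None-Match

        blockClient.commitBlockListWithResponse(blockIdsForUpload,
                /* headers */ null, /* metadata */ null, /* tier */ null,
                conditions, /* timeout */ null, /* context */ null);
    }
}
```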

Alternative Approaches

Using a BufferedInputStream

I discarded this solution due to potentially high memory usage when uploading multiple parts in parallel.

Persisting the input stream on disk

Limitation 1: Potentially doubles the time until the part upload completes. This can be compensated for by uploading more parts in parallel.

Limitation 2: We would need to inform users that they must provide sufficient /tmp storage on sufficiently fast SSDs; on an HDD the user would be heavily I/O bound.

Benefit: We could actually calculate proper ETags for each part and store the MD5 hash, e.g. encoded in the block name. This is not possible with the current solution because the block name must already be provided when we pass the input stream from the request along to the Azure SDK.
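A rough sketch of what that alternative could look like (purely illustrative, not part of this PR; all names are made up), assuming parts are spilled to a temp directory before being staged:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Base64;

final class PersistPartSketch {
    // Spill the part to disk while computing its MD5, so the hash is known
    // before a block name has to be chosen and can be encoded into it.
    static String persistPartAndDeriveBlockId(InputStream partStream, Path tmpDir)
            throws Exception {
        Path partFile = Files.createTempFile(tmpDir, "s3proxy-part-", ".bin");
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (DigestInputStream in = new DigestInputStream(partStream, md5);
             OutputStream out = Files.newOutputStream(partFile)) {
            in.transferTo(out); // buffers the part on disk, hashing as it copies
        }
        // partFile would then be uploaded via stageBlock (not shown). Encoding the
        // MD5 into the Base64 block name lets the part's ETag be reconstructed
        // later from the block list.
        return Base64.getEncoder().encodeToString(md5.digest());
    }
}
```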

s3-tests Update

gaul/s3-tests#4

EDIT: I am also running this on a test Kubernetes cluster. It seems to work just fine. Some tooling (CNPG, a Kafka connector) can run their backups through it.

Fixes

#709 #553 #552

klaudworks (Contributor, Author) commented:

FYI @gaul I'm running this in a staging k8s cluster and it seems to work just fine.
