Implement multipart upload for azureblob-sdk provider #904
Current Approach
The current approach forwards the input stream from the incoming request to the Azure SDK. I used a `Flux<ByteBuffer>` to work around the limitation that our input stream does not support mark/reset. I looked into azure-sdk-for-java after seeing your issue: Azure/azure-sdk-for-java#42603. It seems this is the only workaround; a rough sketch of the idea is shown below.
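The shape of the approach is roughly the following. This is only a minimal sketch, not the code in this PR; the helper names, chunk size, and block-id handling are assumptions, while `BlockBlobAsyncClient.stageBlock` is the real SDK call.

```java
import com.azure.storage.blob.specialized.BlockBlobAsyncClient;
import reactor.core.publisher.Flux;

import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: exposes a non-resettable InputStream as a Flux<ByteBuffer>
// so the async Azure client can stream it without needing mark/reset support.
final class StreamingPartUpload {
    private static final int CHUNK_SIZE = 64 * 1024;

    static Flux<ByteBuffer> toFlux(InputStream stream) {
        return Flux.generate(sink -> {
            byte[] chunk = new byte[CHUNK_SIZE];
            try {
                int read = stream.read(chunk);
                if (read < 0) {
                    sink.complete();
                } else {
                    sink.next(ByteBuffer.wrap(chunk, 0, read));
                }
            } catch (IOException ioe) {
                sink.error(new UncheckedIOException(ioe));
            }
        });
    }

    static void stagePart(BlockBlobAsyncClient client, String blockId,
                          InputStream partStream, long partSize) {
        // stageBlock needs the block id and the exact length up front, while the
        // content itself is consumed lazily from the Flux, not buffered in memory.
        client.stageBlock(
                Base64.getEncoder().encodeToString(blockId.getBytes(StandardCharsets.UTF_8)),
                toFlux(partStream),
                partSize)
            .block();
    }
}
```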
Current Limitations
We don't handle `If-Match` / `If-None-Match` headers for conditional writes, which S3 supports for CompleteMultipartUpload requests.
EDIT: I added a commit to support conditional writes for CompleteMultipartUpload requests.
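The conditional-write mapping works roughly like this. This is a hedged sketch rather than the code from the added commit; `BlobRequestConditions` and `commitBlockListWithResponse` are the real SDK API, the surrounding class and parameter plumbing are assumed.

```java
import com.azure.storage.blob.models.BlobRequestConditions;
import com.azure.storage.blob.specialized.BlockBlobClient;

import java.util.List;

// Hypothetical sketch: translate S3 If-Match / If-None-Match headers from a
// CompleteMultipartUpload request into Azure request conditions when the
// accumulated block list is committed.
final class ConditionalCommit {
    static void commit(BlockBlobClient client, List<String> blockIds,
                       String ifMatch, String ifNoneMatch) {
        BlobRequestConditions conditions = new BlobRequestConditions();
        if (ifMatch != null) {
            conditions.setIfMatch(ifMatch);         // 412 if the current ETag differs
        }
        if (ifNoneMatch != null) {
            conditions.setIfNoneMatch(ifNoneMatch); // "*" fails if the blob already exists
        }
        client.commitBlockListWithResponse(blockIds, null, null, null,
                conditions, null, null);
    }
}
```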
Alternative Approaches
Using a BufferedInputStream
I discarded this solution due to potentially high memory usage when uploading multiple parts in parallel.
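As a rough illustration of why this is memory-hungry (an assumed sketch, not code from this PR): to make the stream resettable for SDK retries, the buffer has to retain up to a whole part, so heap usage scales with part size times the number of parallel part uploads.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

// Hypothetical sketch of the discarded alternative: mark() forces the buffer to
// retain up to an entire part in memory so the SDK could reset() and retry.
// With N parts uploading in parallel this costs roughly N * partSize of heap.
final class BufferedPartStream {
    static InputStream markable(InputStream partStream, int partSize) {
        BufferedInputStream buffered = new BufferedInputStream(partStream, 64 * 1024);
        // The internal buffer may grow up to partSize bytes while the mark is valid.
        buffered.mark(partSize);
        return buffered;
    }
}
```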
Persisting the input stream on disk
Limitation 1: Potentially doubles the time until a part upload completes. This can be compensated for by uploading more files in parallel.
Limitation 2: We would need to tell users to provide sufficient /tmp storage on sufficiently fast SSDs; on an HDD the user would be heavily I/O bound.
Benefit: We could compute proper ETags for each part and store the MD5 hash, e.g. encoded in the block name. This is not possible with the current solution because we must already provide the block name when we hand the request's input stream to the Azure SDK. A sketch of this alternative follows.
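A minimal sketch of the spool-to-disk variant might look like this (assumed code, not part of this PR; a real implementation would also encode the part number in the block id so identical parts don't collide):

```java
import com.azure.storage.blob.specialized.BlockBlobClient;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Base64;

// Hypothetical sketch of the spool-to-disk alternative: the part is written to
// a temp file first, so its MD5 is known before the block name must be chosen.
final class SpooledPartUpload {
    static String stagePartWithMd5(BlockBlobClient client, InputStream partStream)
            throws Exception {
        Path tmp = Files.createTempFile("part-", ".bin");
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            try (DigestInputStream digesting = new DigestInputStream(partStream, md5)) {
                Files.copy(digesting, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
            // Encode the MD5 into the block id so a proper per-part ETag can be
            // reconstructed later from the committed block list.
            String blockId = Base64.getEncoder().encodeToString(md5.digest());
            try (InputStream spooled = Files.newInputStream(tmp)) {
                client.stageBlock(blockId, spooled, Files.size(tmp));
            }
            return blockId;
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```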
s3-tests Update
gaul/s3-tests#4
EDIT: I am also running this on a test Kubernetes cluster, and it seems to work just fine. Some tooling (cnpg, Kafka connector) can run their backups through it.
Fixes
#709 #553 #552