Skip to content

Conversation

carlopi
Copy link
Collaborator

@carlopi carlopi commented Sep 1, 2025

We currently follow always same path on uploading to S3-like storage:

  • 1 POST to start upload
  • 1 or more PUT requests to upload buffers
  • 1 POST to finalize upload

While this is general and battle tested, there is an improvement, we could check the count of buffers to be uploaded, and if that is at 1 we could perform a single PUT (moving from 3 requests over the network to a single one).

This is particularly significant for writes to data lakes in general and Iceberg in particular, due to the fact that both JSON and AVRO files have to be uploaded (and that means a INSERT INTO <table> VALUES () costs currently 10+ sequential network requests)

Next up: figuring out how to cut the HEAD request currently performed when opening a FileHandle, that could properly move to single request.

For the reviewer(s): I am assuming we always fully upload a file, and never just a single part of an existing file. It would be much better to check that explicitly, unsure if there is else needed.

@carlopi carlopi requested a review from samansmink September 1, 2025 07:20
Copy link
Collaborator

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea! as agreed while discussing this in person: we should expand the httpfs ci before merging this

@@ -403,7 +418,7 @@ void S3FileSystem::UploadBuffer(S3FileHandle &file_handle, shared_ptr<S3WriteBuf
// Free up space for another thread to acquire an S3WriteBuffer
write_buffer.reset();

NotifyUploadsInProgress(file_handle);
//NotifyUploadsInProgress(file_handle);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: uncomment?

@carlopi carlopi requested a review from Tmonster September 1, 2025 08:10
@carlopi
Copy link
Collaborator Author

carlopi commented Sep 1, 2025

Leaving a comment for later:

        bool use_sse_kms = auth_params.kms_key_id.length() > 0 && (method == "POST" || method == "PUT") &&
                           query.find("uploadId") == std::string::npos;

code-path needs also to be looked at in light of this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants