Skip to content

Conversation

@carlopi
Copy link
Collaborator

@carlopi carlopi commented Sep 1, 2025

We currently follow always same path on uploading to S3-like storage:

  • 1 POST to start upload
  • 1 or more PUT requests to upload buffers
  • 1 POST to finalize upload

While this is general and battle tested, there is an improvement, we could check the count of buffers to be uploaded, and if that is at 1 we could perform a single PUT (moving from 3 requests over the network to a single one).

This is particularly significant for writes to data lakes in general and Iceberg in particular, due to the fact that both JSON and AVRO files have to be uploaded (and that means a INSERT INTO <table> VALUES () costs currently 10+ sequential network requests)

Next up: figuring out how to cut the HEAD request currently performed when opening a FileHandle, that could properly move to single request.

For the reviewer(s): I am assuming we always fully upload a file, and never just a single part of an existing file. It would be much better to check that explicitly, unsure if there is else needed.

@carlopi carlopi requested a review from samansmink September 1, 2025 07:20
Copy link
Collaborator

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea! as agreed while discussing this in person: we should expand the httpfs ci before merging this

@carlopi carlopi requested a review from Tmonster September 1, 2025 08:10
@carlopi

This comment was marked as outdated.

@carlopi carlopi changed the base branch from main to v1.4-andium September 16, 2025 09:18
@carlopi carlopi marked this pull request as ready for review September 16, 2025 09:18
@carlopi carlopi changed the title Single PUT when uploading "small" files to S3 Single PUT when uploading single buffer-files to S3 Sep 19, 2025
Copy link
Collaborator

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!


multipart_upload_id = s3fs.InitializeMultipartUpload(*this);
initialized_multipart_upload = false;
//multipart_upload_id = s3fs.InitializeMultipartUpload(*this);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove altogether?

@samansmink
Copy link
Collaborator

Before merging though, we should probably add some tests here like https://github.com/duckdb/duckdb-httpfs/pull/113/files#diff-6eb7d9e04e92888e7c8615ffe9fab2e708520b65455c7225a7a3833b419037ce that confirm this works as expected

@carlopi
Copy link
Collaborator Author

carlopi commented Sep 19, 2025

Thanks for the review, I forgot I had a better version of this lying around (see #128), closing since I believe that one is better / more proper

@carlopi carlopi closed this Sep 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants