Skip to content

Conversation

npow
Copy link
Contributor

@npow npow commented Sep 23, 2025

Currently, it's difficult to tell why transient S3 operations are retrying. Added the original error code to the logs.

Example:

Transient S3 failure (attempt #1) -- total success: 1, last attempt 1/2 -- remaining: 1 (ParamValidationError)
Transient S3 failure (attempt #2) -- total success: 1, last attempt 0/1 -- remaining: 1 (ParamValidationError)
Upload result: []

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances S3 operation error logging by including the original error code in transient error messages. This makes it easier to debug and understand why S3 operations are retrying instead of just knowing they failed transiently.

  • Modified error handling to capture and log original error codes for transient failures
  • Updated result parsing to handle the additional error code information
  • Enhanced retry logging to display the specific error type causing the transient failure

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
metaflow/plugins/datatools/s3/s3op.py Modified error handling to write error codes to result files and updated parsing logic to extract transient error types
metaflow/plugins/datatools/s3/s3.py Enhanced retry logging to display the original error code that caused the transient failure

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

romain-intel
romain-intel previously approved these changes Oct 21, 2025
Copy link
Contributor

@romain-intel romain-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor nits. Did all the tests run internally too? If so, we can merge.

@npow
Copy link
Contributor Author

npow commented Oct 21, 2025

Two minor nits. Did all the tests run internally too? If so, we can merge.

yes all good: https://github.netflix.net/corp/mli-metaflow-custom/pull/1388

@npow npow merged commit 53ab4a8 into master Oct 21, 2025
34 of 45 checks passed
@npow npow deleted the npow/s3-log-original-error-code branch October 21, 2025 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants