[S3] Log original error code for transient errors #2616
Pull Request Overview
This PR enhances S3 operation error logging by including the original error code in transient error messages. This makes it easier to debug and understand why S3 operations are retrying instead of just knowing they failed transiently.
- Modified error handling to capture and log original error codes for transient failures
- Updated result parsing to handle the additional error code information
- Enhanced retry logging to display the specific error type causing the transient failure
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| metaflow/plugins/datatools/s3/s3op.py | Modified error handling to write error codes to result files and updated parsing logic to extract transient error types |
| metaflow/plugins/datatools/s3/s3.py | Enhanced retry logging to display the original error code that caused the transient failure |
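To make the flow described above concrete, here is a minimal sketch of the general idea: the worker records the original error code alongside a transient marker, and the caller parses it back out to build a more informative retry message. This is not the code from this PR. The JSON result format and the names `TRANSIENT_MARKER`, `encode_result`, `parse_result`, and `retry_message` are all hypothetical and chosen only for illustration; the actual format used by `s3op.py` may differ.

```python
import json

# Hypothetical marker; the real s3op.py result format may differ.
TRANSIENT_MARKER = "TRANSIENT"


def encode_result(ok, error_code=None):
    """Worker side (assumed): serialize one operation's outcome as one line."""
    if ok:
        return json.dumps({"status": "OK"})
    # Keep the original error code so the caller can report it later.
    return json.dumps(
        {"status": TRANSIENT_MARKER, "error_code": error_code or "Unknown"}
    )


def parse_result(line):
    """Caller side (assumed): return (is_transient, error_code) for one line."""
    record = json.loads(line)
    if record["status"] == TRANSIENT_MARKER:
        return True, record.get("error_code", "Unknown")
    return False, None


def retry_message(lines):
    """Build a retry log message that names the transient error codes seen."""
    parsed = [parse_result(line) for line in lines]
    codes = sorted({code for transient, code in parsed if transient})
    pending = sum(1 for transient, _ in parsed if transient)
    if not pending:
        return "All operations succeeded."
    return "Transient S3 failure (%s): retrying %d operation(s)." % (
        ", ".join(codes),
        pending,
    )
```

With a format like this, a retry log can name codes such as `SlowDown` or `RequestTimeout` rather than only reporting that some operations failed transiently.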
Two minor nits. Did all the tests run internally too? If so, we can merge.
Yes, all good: https://github.netflix.net/corp/mli-metaflow-custom/pull/1388
Currently, it's difficult to tell why S3 operations are being retried after a transient failure. This PR adds the original error code to the retry logs.
Example: