Skip to content

Add Readiness/Liveness EC2 API call probe to Controller Service #2590

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

AndrewSirenko
Copy link
Contributor

@AndrewSirenko AndrewSirenko commented Jul 30, 2025

What type of PR is this?

/kind feature

What is this PR about? / Why do we need it?

Lets customers realize that their EBS CSI Driver ebs-csi-controller pods aren't actually ready due to some Auth/networking error.

Should really help customer troubleshooting for new customers (and ensuring networking and auth issues aren't root cause of user problems)

Should work for other COs because this is done through CSI Probe RPC

Fixes #1551

How was this change tested?

Deploy driver without right IAM role, see not ready

Add IAM role, see ready

Remove role again, see driver eventually become not ready

Add role back and driver becomes ready again

Does this PR introduce a user-facing change?

Add Readiness probe via EC2 API call to Controller Service. 

Warning: Ensure that the IAM policy associated with your EBS CSI Driver has permission for ec2:DescribeAvailabilityZones. Clusters with missing IAM roles or networking issues may see ebs-csi-controller pod restarts.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Jul 30, 2025
@k8s-ci-robot k8s-ci-robot requested a review from ConnorJC3 July 30, 2025 17:45
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 30, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 30, 2025
@AndrewSirenko AndrewSirenko marked this pull request as draft July 30, 2025 17:50
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 30, 2025
@AndrewSirenko AndrewSirenko marked this pull request as ready for review July 30, 2025 18:04
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 30, 2025
@k8s-ci-robot k8s-ci-robot requested a review from torredil July 30, 2025 18:04
Copy link

github-actions bot commented Jul 30, 2025

Code Coverage Diff

This PR does not change the code coverage

@AndrewSirenko
Copy link
Contributor Author

/retest

@ConnorJC3
Copy link
Contributor

ConnorJC3 commented Jul 30, 2025

Nitpick: Can we add "Fixes #1551" to the PR description?

Also this isn't a breaking change, that should be removed from the changelog entry.

@AndrewSirenko AndrewSirenko force-pushed the readiness branch 2 times, most recently from 77e6ea6 to 3d5f467 Compare July 30, 2025 20:45
@AndrewSirenko
Copy link
Contributor Author

Also this isn't a breaking change, that should be removed from the changelog entry.

Can we switch to warning during changelog? I fear some customers with misconfigured drivers will be very concerned for a second after upgrading 😓

Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 31, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 11, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from torredil. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2025
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 11, 2025
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 19, 2025
@ElijahQuinones
Copy link
Member

/hold

holding for release

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 19, 2025
Copy link
Contributor

@ConnorJC3 ConnorJC3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly lgtm, one question: Have we tested what this looks like for a timeout scenario? Do we need to raise this value to avoid the dry run cutting out?

https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/templates/controller.yaml#L187

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 20, 2025
@AndrewSirenko
Copy link
Contributor Author

Have we tested what this looks like for a timeout scenario? Do we need to raise this value to avoid the dry run cutting out?

Have tested for timeout scenario (purposefull broke my policy AND also tested breaking networking)

However good callout, I'll raise timeout to 10s for live and readiness, and then raise liveness failure threshold to 10 (so pod becomes not ready first, THEN starts restarting)

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2025
@AndrewSirenko
Copy link
Contributor Author

/hold

For final manual test

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSI controller Health Check doesn't properly validate controller health.
5 participants