Add Readiness/Liveness EC2 API call probe to Controller Service #2590

AndrewSirenko · 2025-07-30T17:45:50Z

What type of PR is this?

/kind feature

What is this PR about? / Why do we need it?

Lets customers see when their EBS CSI Driver ebs-csi-controller pods are not actually ready because of an auth or networking error.

This should help new customers troubleshoot, and helps rule out networking and auth issues as the root cause of user problems.

This should also work for other container orchestrators (COs) because the check is done through the CSI Probe RPC.
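As a rough illustration of the idea (not the code in this PR), the sketch below shows a Probe-style handler that reports ready only when a cloud health check succeeds within a timeout. The names `healthChecker`, `CheckEC2Access`, `probeResponse`, `fakeCloud`, and `probeWith` are hypothetical stand-ins so the example runs offline with only the standard library. In the real driver the check would be an EC2 API call, and the handler would return the CSI `ProbeResponse` type. Whether a failed check is reported as `Ready: false`, as an error, or both is an assumption here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// healthChecker is a hypothetical stand-in for the driver's cloud layer.
// A real implementation would issue an EC2 API call here.
type healthChecker interface {
	CheckEC2Access(ctx context.Context) error
}

// probeResponse mimics only the ready flag of a CSI ProbeResponse.
type probeResponse struct {
	Ready bool
}

// probe reports ready only if the cloud check succeeds before the timeout.
func probe(ctx context.Context, hc healthChecker, timeout time.Duration) (probeResponse, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := hc.CheckEC2Access(ctx); err != nil {
		return probeResponse{Ready: false}, fmt.Errorf("EC2 health check failed: %w", err)
	}
	return probeResponse{Ready: true}, nil
}

// fakeCloud returns a fixed result to simulate auth or networking state.
type fakeCloud struct{ err error }

func (f fakeCloud) CheckEC2Access(_ context.Context) error { return f.err }

// errUnauthorized simulates a missing IAM permission (illustrative only).
var errUnauthorized = errors.New("UnauthorizedOperation")

// probeWith runs probe against a fake cloud and returns the ready flag.
func probeWith(err error) bool {
	resp, _ := probe(context.Background(), fakeCloud{err: err}, time.Second)
	return resp.Ready
}

func main() {
	fmt.Println("with IAM role, ready:", probeWith(nil))
	fmt.Println("without IAM role, ready:", probeWith(errUnauthorized))
}
```

Keeping the cloud check behind an interface is one way to unit test the probe path without real AWS credentials.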

Fixes #1551

How was this change tested?

Deployed the driver without the right IAM role; it reported not ready.

Added the IAM role; it became ready.

Removed the role again; the driver eventually became not ready.

Added the role back; the driver became ready again.

Does this PR introduce a user-facing change?

Add Readiness probe via EC2 API call to Controller Service. 

Warning: Ensure that the IAM policy associated with your EBS CSI Driver has permission for ec2:DescribeAvailabilityZones. Clusters with missing IAM roles or networking issues may see ebs-csi-controller pod restarts.

pkg/driver/identity.go

pkg/cloud/cloud.go

pkg/driver/identity.go

github-actions · 2025-07-30T18:05:58Z

Code Coverage Diff

This PR does not change the code coverage

pkg/cloud/cloud_test.go

AndrewSirenko · 2025-07-30T18:35:36Z

/retest

pkg/cloud/cloud.go

pkg/driver/identity.go

ConnorJC3 · 2025-07-30T19:53:20Z

Nitpick: Can we add "Fixes #1551" to the PR description?

Also this isn't a breaking change, that should be removed from the changelog entry.

AndrewSirenko · 2025-07-30T20:45:47Z

> Also this isn't a breaking change, that should be removed from the changelog entry.

Can we switch to a warning in the changelog? I fear some customers with misconfigured drivers will be very concerned for a second after upgrading 😓

torredil

/lgtm

k8s-ci-robot · 2025-08-11T16:58:17Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from torredil. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ElijahQuinones · 2025-08-19T17:00:48Z

/hold

holding for release

ConnorJC3

Mostly lgtm, one question: Have we tested what this looks like for a timeout scenario? Do we need to raise this value to avoid the dry run cutting out?

https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/templates/controller.yaml#L187

AndrewSirenko · 2025-08-22T16:22:22Z

> Have we tested what this looks like for a timeout scenario? Do we need to raise this value to avoid the dry run cutting out?

I have tested the timeout scenario (purposefully broke my policy AND also tested breaking networking).

Good callout, though. I'll raise the timeout to 10s for both the liveness and readiness probes, and raise the liveness failure threshold to 10 (so the pod becomes not ready first, THEN starts restarting).
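To make the ordering argument concrete, here is a small sketch of the arithmetic. It assumes a probe trips after roughly `periodSeconds × failureThreshold` seconds of continuous failure, ignoring initial delay and per-attempt timeouts. The liveness failure threshold of 10 comes from the comment above. The 10-second period and the readiness threshold of 3 are the Kubernetes defaults and are assumptions about this chart, not values confirmed in this thread.

```go
package main

import "fmt"

// probeConfig holds the subset of Kubernetes probe settings used here.
type probeConfig struct {
	PeriodSeconds    int
	FailureThreshold int
}

// secondsUntilTriggered approximates how long continuous failures take
// to trip a probe: one failed attempt per period, FailureThreshold times.
func secondsUntilTriggered(p probeConfig) int {
	return p.PeriodSeconds * p.FailureThreshold
}

// notReadyBeforeRestart reports whether the readiness probe trips
// strictly before the liveness probe does.
func notReadyBeforeRestart(readiness, liveness probeConfig) bool {
	return secondsUntilTriggered(readiness) < secondsUntilTriggered(liveness)
}

func main() {
	// Assumed values: Kubernetes defaults for period and readiness threshold.
	readiness := probeConfig{PeriodSeconds: 10, FailureThreshold: 3}
	// Failure threshold of 10 is from the comment; the period is assumed.
	liveness := probeConfig{PeriodSeconds: 10, FailureThreshold: 10}

	fmt.Println("readiness trips after ~seconds:", secondsUntilTriggered(readiness))
	fmt.Println("liveness trips after ~seconds:", secondsUntilTriggered(liveness))
	fmt.Println("not ready before restart:", notReadyBeforeRestart(readiness, liveness))
}
```

Under these assumed values the pod is marked not ready well before the container is restarted, which matches the intent described above.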

AndrewSirenko · 2025-08-22T16:29:49Z

/hold

For final manual test

k8s-ci-robot added the release-note and kind/feature labels Jul 30, 2025

k8s-ci-robot requested a review from ConnorJC3 July 30, 2025 17:45

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 30, 2025

k8s-ci-robot requested a review from ElijahQuinones July 30, 2025 17:45

AndrewSirenko commented Jul 30, 2025

pkg/driver/identity.go (outdated, resolved)

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 30, 2025

AndrewSirenko force-pushed the readiness branch from ce7000d to 3b8beba Compare July 30, 2025 17:48

AndrewSirenko marked this pull request as draft July 30, 2025 17:50

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 30, 2025

AndrewSirenko commented Jul 30, 2025

pkg/cloud/cloud.go (resolved)

pkg/driver/identity.go (outdated, resolved)

AndrewSirenko force-pushed the readiness branch from 3b8beba to cde4347 Compare July 30, 2025 18:04

AndrewSirenko marked this pull request as ready for review July 30, 2025 18:04

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 30, 2025

k8s-ci-robot requested a review from torredil July 30, 2025 18:04

torredil reviewed Jul 30, 2025

pkg/cloud/cloud_test.go (outdated, resolved)

ConnorJC3 reviewed Jul 30, 2025

AndrewSirenko force-pushed the readiness branch from cde4347 to b9aa2a5 Compare July 30, 2025 19:13

AndrewSirenko force-pushed the readiness branch 2 times, most recently from 77e6ea6 to 3d5f467 Compare July 30, 2025 20:45

torredil approved these changes Jul 31, 2025

k8s-ci-robot assigned torredil Jul 31, 2025

k8s-ci-robot added the lgtm and needs-rebase labels Jul 31, 2025

AndrewSirenko force-pushed the readiness branch from 3d5f467 to 35d1a57 Compare August 11, 2025 16:58

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 11, 2025

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2025

torredil approved these changes Aug 11, 2025

k8s-ci-robot added the lgtm and needs-rebase labels Aug 11, 2025

AndrewSirenko force-pushed the readiness branch from 35d1a57 to 7f4de30 Compare August 19, 2025 16:49

k8s-ci-robot removed the lgtm and needs-rebase labels Aug 19, 2025

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 19, 2025

ConnorJC3 reviewed Aug 20, 2025

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 20, 2025

Add Readiness/Liveness EC2 API call probe to Controller Service

5dd713b

AndrewSirenko force-pushed the readiness branch from 7f4de30 to 5dd713b Compare August 22, 2025 16:29

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2025

torredil approved these changes Aug 22, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 22, 2025
