Update docs/design.md #2242

AndrewSirenko · 2024-11-25T14:44:15Z

What type of PR is this?

/kind documentation

What is this PR about? / Why do we need it?

This PR updates docs/design.md so that it is no longer out of date. It also provides a high-level overview of both components of the EBS CSI Driver and adds a few sequence diagrams to help visualize the most common driver workflows.

Note: Look at PR in 'rich diff' mode to see new sequence diagrams.

How was this change tested?

n/a

Does this PR introduce a user-facing change?

Update docs/design.md and add high-level overview.

k8s-ci-robot · 2024-11-25T14:44:20Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andrewsirenko. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2024-11-25T14:46:13Z

Code Coverage Diff

This PR does not change the code coverage

docs/design.md

AndrewSirenko · 2024-11-26T16:51:06Z

we use RPC alot in these docs but we actually mean gRPC

I was following the convention laid out in spec/spec.md where I use RPC anytime I mean "magic call that may happen on another computer, but written as if it were a local procedure call", and gRPC is more of an implementation detail (the specific implementation of RPC that CSI Spec relies upon).

But happy to change this if other folks agree.

Perhaps a repository glossary page is in order, like SOCI Snapshotter

torredil

Largely lgtm. Especially the sequence diagrams, this revision significantly elevates the doc.

/lgtm!

torredil · 2024-11-29T16:28:07Z

docs/design.md

+#### Not Implemented
+
+- ListVolumes
+- GetCapacity
+- ControllerGetVolume


Suggest removing this snippet since its duplicating information.

docs/design.md

ConnorJC3 · 2024-12-06T19:40:42Z

docs/design.md


 ### Timeouts

-gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (=CSI driver) can continue processing the call and finish the operation, however it has no means how to inform the client about the result.
+gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (i.e. ebs-plugin) can continue processing the call and finish the operation, however it has no means how to inform the client about the result.


Ditto above, drop this diff.

docs/design.md

ConnorJC3 · 2024-12-06T19:51:48Z

docs/design.md


-Kubernetes will retry failed calls, usually after some exponential backoff. Kubernetes heavily relies on idempotency here - i.e. when the driver finished an operation after the client timed out, the driver will get the same call again and it should return success/error based on success/failure of the previous operation.
+Kubernetes sidecars will retry failed calls after exponential backoff. These sidecars rely on idempotency here - i.e. when ebs-plugin finished an operation after the client timed out, the ebs-plugin will get the same call again, and it should return success/error based on success/failure of the previous operation.


Ditto above, drop the "driver" -> "ebs-plugin" part of this diff.

docs/design.md

ConnorJC3 · 2024-12-06T19:56:30Z

docs/design.md

-Very rarely a node gets too busy and kubelet starves for CPU. It does not unmount a volume when it should and Kubernetes initiates detach of the volume.
+## EBS CSI Driver on Kubernetes
+
+### High Level Summary


This section is dangerously close to documenting CSI and/or CSI sidecar design rather than EBS CSI design. For example, if the information here can be found in the CSI spec's architecture section (https://github.com/container-storage-interface/spec/blob/master/spec.md#architecture) we are better off linking there than making our own version.

Respectfully, I think that this section is worth keeping, but I will gladly suggestions for edits.

We already reference/link the CSI Spec in this section. But I will retitle the section to make it clear I'm talking about EBS volume management on Kubernetes, not just EBS CSI Driver.

The CSI Spec's architecture section lists four different possible architectures for CSI drivers and it is not immediately clear which one applies to the EBS CSI installation. Figure one mentions:

a centralized Controller Plugin is available on the CO master host and the Node Plugin is available on all of the CO Nodes.

For EKS customers, the controller plugin is on SOME node, not the CO master host.

Secondly, this section explicitly explains the difference between ebs-csi-controller and ebs-csi-node Pods, which prevents common mistakes many folks make when ramping up on EBS CSI Driver (e.g. I got the "ebs csi driver logs", but it didn't include any Create/AttachVolume calls, only mount operations).

Most importantly, multiple CSEs/customers/TAMs have explicitly said that this summary is more helpful than linking multiple different sources of Kubernetes/CSI/EBS documentation for understanding and troubleshooting EBS-backed volumes on their cluster.

Discussed offline, totally agree on putting more emphasis on ebs csi DRIVER instead of the kubernetes components. Will do some rework/cutting.

docs/design.md

k8s-triage-robot · 2025-04-03T00:04:18Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2025-05-03T00:46:50Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

AndrewSirenko · 2025-05-16T14:12:51Z

/remove-lifecycle rotten

k8s-triage-robot · 2025-08-14T14:32:52Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Update docs/design.md

c525b5a

k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Nov 25, 2024

k8s-ci-robot requested review from ElijahQuinones and torredil November 25, 2024 14:44

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 25, 2024

AndrewSirenko commented Nov 25, 2024

View reviewed changes

docs/design.md Show resolved Hide resolved

Add EC2 ModifyVolume request coalescing diagram to docs/design.md

0fff4f3

AndrewSirenko commented Nov 25, 2024

View reviewed changes

docs/design.md Show resolved Hide resolved

ElijahQuinones reviewed Nov 26, 2024

View reviewed changes

docs/design.md Show resolved Hide resolved

This comment was marked as resolved.

Sign in to view

fixup! Update docs/design.md

6e844a8

torredil approved these changes Nov 29, 2024

View reviewed changes

k8s-ci-robot assigned torredil Nov 29, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 29, 2024

ElijahQuinones approved these changes Nov 29, 2024

View reviewed changes

k8s-ci-robot assigned ElijahQuinones Nov 29, 2024

ConnorJC3 reviewed Dec 6, 2024

View reviewed changes

AndrewSirenko commented Dec 6, 2024

View reviewed changes

docs/design.md Show resolved Hide resolved

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2025

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 3, 2025

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 16, 2025

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 14, 2025


		Kubernetes will retry failed calls, usually after some exponential backoff. Kubernetes heavily relies on idempotency here - i.e. when the driver finished an operation after the client timed out, the driver will get the same call again and it should return success/error based on success/failure of the previous operation.
		Kubernetes sidecars will retry failed calls after exponential backoff. These sidecars rely on idempotency here - i.e. when ebs-plugin finished an operation after the client timed out, the ebs-plugin will get the same call again, and it should return success/error based on success/failure of the previous operation.

Update docs/design.md #2242

Are you sure you want to change the base?

Update docs/design.md #2242

Conversation

AndrewSirenko commented Nov 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What is this PR about? / Why do we need it?

How was this change tested?

Does this PR introduce a user-facing change?

Uh oh!

k8s-ci-robot commented Nov 25, 2024

Uh oh!

github-actions bot commented Nov 25, 2024

Code Coverage Diff

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as resolved.

AndrewSirenko commented Nov 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

torredil left a comment

Choose a reason for hiding this comment

Uh oh!

torredil Nov 29, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ConnorJC3 Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ConnorJC3 Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ConnorJC3 Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

AndrewSirenko Jan 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AndrewSirenko Jan 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

k8s-triage-robot commented Apr 3, 2025

Uh oh!

k8s-triage-robot commented May 3, 2025

Uh oh!

AndrewSirenko commented May 16, 2025

Uh oh!

k8s-triage-robot commented Aug 14, 2025

Uh oh!

Uh oh!

AndrewSirenko commented Nov 25, 2024 •

edited

Loading

AndrewSirenko commented Nov 26, 2024 •

edited

Loading

AndrewSirenko Jan 2, 2025 •

edited

Loading