Skip to content

Update docs/design.md #2242

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

AndrewSirenko
Copy link
Contributor

@AndrewSirenko AndrewSirenko commented Nov 25, 2024

What type of PR is this?

/kind documentation

What is this PR about? / Why do we need it?

This PR updates docs/design.md so that it is no longer out of date. It also provides a high-level overview of both components of the EBS CSI Driver and adds a few sequence diagrams to help visualize the most common driver workflows.

Note: Look at PR in 'rich diff' mode to see new sequence diagrams.

How was this change tested?

n/a

Does this PR introduce a user-facing change?

Update docs/design.md and add high-level overview. 

@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Nov 25, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andrewsirenko. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 25, 2024
Copy link

Code Coverage Diff

This PR does not change the code coverage

@ElijahQuinones

This comment was marked as resolved.

@AndrewSirenko
Copy link
Contributor Author

AndrewSirenko commented Nov 26, 2024

we use RPC alot in these docs but we actually mean gRPC

I was following the convention laid out in spec/spec.md where I use RPC anytime I mean "magic call that may happen on another computer, but written as if it were a local procedure call", and gRPC is more of an implementation detail (the specific implementation of RPC that CSI Spec relies upon).

But happy to change this if other folks agree.

Perhaps a repository glossary page is in order, like SOCI Snapshotter

Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largely lgtm. Especially the sequence diagrams, this revision significantly elevates the doc.

/lgtm!

Comment on lines +284 to +288
#### Not Implemented

- ListVolumes
- GetCapacity
- ControllerGetVolume
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest removing this snippet since its duplicating information.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 29, 2024

### Timeouts

gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (=CSI driver) can continue processing the call and finish the operation, however it has no means how to inform the client about the result.
gRPC always passes a timeout together with a request. After this timeout, the gRPC client call actually returns. The server (i.e. ebs-plugin) can continue processing the call and finish the operation, however it has no means how to inform the client about the result.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto above, drop this diff.


Kubernetes will retry failed calls, usually after some exponential backoff. Kubernetes heavily relies on idempotency here - i.e. when the driver finished an operation after the client timed out, the driver will get the same call again and it should return success/error based on success/failure of the previous operation.
Kubernetes sidecars will retry failed calls after exponential backoff. These sidecars rely on idempotency here - i.e. when ebs-plugin finished an operation after the client timed out, the ebs-plugin will get the same call again, and it should return success/error based on success/failure of the previous operation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto above, drop the "driver" -> "ebs-plugin" part of this diff.

Very rarely a node gets too busy and kubelet starves for CPU. It does not unmount a volume when it should and Kubernetes initiates detach of the volume.
## EBS CSI Driver on Kubernetes

### High Level Summary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is dangerously close to documenting CSI and/or CSI sidecar design rather than EBS CSI design. For example, if the information here can be found in the CSI spec's architecture section (https://github.com/container-storage-interface/spec/blob/master/spec.md#architecture) we are better off linking there than making our own version.

Copy link
Contributor Author

@AndrewSirenko AndrewSirenko Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Respectfully, I think that this section is worth keeping, but I will gladly suggestions for edits.

We already reference/link the CSI Spec in this section. But I will retitle the section to make it clear I'm talking about EBS volume management on Kubernetes, not just EBS CSI Driver.

The CSI Spec's architecture section lists four different possible architectures for CSI drivers and it is not immediately clear which one applies to the EBS CSI installation. Figure one mentions:

a centralized Controller Plugin is available on the CO master host and the Node Plugin is available on all of the CO Nodes.

For EKS customers, the controller plugin is on SOME node, not the CO master host.

Secondly, this section explicitly explains the difference between ebs-csi-controller and ebs-csi-node Pods, which prevents common mistakes many folks make when ramping up on EBS CSI Driver (e.g. I got the "ebs csi driver logs", but it didn't include any Create/AttachVolume calls, only mount operations).

Most importantly, multiple CSEs/customers/TAMs have explicitly said that this summary is more helpful than linking multiple different sources of Kubernetes/CSI/EBS documentation for understanding and troubleshooting EBS-backed volumes on their cluster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, totally agree on putting more emphasis on ebs csi DRIVER instead of the kubernetes components. Will do some rework/cutting.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2025
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 3, 2025
@AndrewSirenko
Copy link
Contributor Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 16, 2025
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. lgtm "Looks good to me", indicates that a PR is ready to be merged. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants