Commit 7175f3a
ci-penbot-01, nikhilsk
authored and committed
Adding missing DCM Spec for documentation (#877) (#890) (#892)
* Adding missing DCM Spec for documentation
* changes
* changes
* comments
* Adding DCM systemd integration doc to documentation

(cherry picked from commit 300483f0cb7b4fc77c67c7caf2c429d4e261dcc5)

Co-authored-by: nikhilsk <[email protected]>
1 parent 21c8d55 commit 7175f3a

6 files changed: +39 -5 lines changed

docs/dcm/device-config-manager.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ The Device Config Manager can be enabled by setting the `spec/configManager/enab

 ```yaml
   configManager:
-    # To enable/disable the metrics exporter, enable to partition
+    # To enable/disable the config manager, enable to partition
     enable: True

     # image for the device-config-manager container
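
For orientation, here is a minimal sketch of a DeviceConfig that turns the config manager on. The `configManager` fields and the `selector` label are taken from the hunks in this commit. The `apiVersion`, `kind`, `metadata.name`, and `metadata.namespace` values are assumptions for illustration and should be checked against the CRD installed in the cluster.

```yaml
# Sketch only: apiVersion, kind, name, and namespace below are assumptions.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: example-deviceconfig        # hypothetical name
  namespace: kube-amd-gpu           # assumed operator namespace
spec:
  configManager:
    # To enable/disable the config manager
    enable: True
    # image for the device-config-manager container
    image: "rocm/device-config-manager:v1.3.1"
    imagePullPolicy: IfNotPresent   # Always, IfNotPresent, or Never
    config:
      # ConfigMap that stores the profile config info
      name: "config-manager-config"
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
```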

docs/fulldeviceconfig.rst

Lines changed: 25 additions & 2 deletions
@@ -139,7 +139,7 @@ Below is an example of a full DeviceConfig CR that can be used to install the AM
     upgradePolicy:
       #(Optional) If no UpgradePolicy is mentioned for any of the components but their image is changed, the daemonset will
       # get upgraded according to the defaults, which is `upgradeStrategy` set to `RollingUpdate` and `maxUnavailable` set to 1.
-      upgradeStrategy: RollingUpdate, # (Optional) Can be either `RollingUpdate` or `OnDelete`
+      upgradeStrategy: "RollingUpdate" # (Optional) Can be either `RollingUpdate` or `OnDelete`
       maxUnavailable: 1 # (Optional) Number of pods that can be unavailable during the upgrade process. 1 is the default value
   ## AMD GPU Metrics Exporter Configuration ##
   metricsExporter:
@@ -156,7 +156,7 @@ Below is an example of a full DeviceConfig CR that can be used to install the AM
     upgradePolicy:
       #(Optional) If no UpgradePolicy is mentioned for any of the components but their image is changed, the daemonset will
       # get upgraded according to the defaults, which is `upgradeStrategy` set to `RollingUpdate` and `maxUnavailable` set to 1.
-      upgradeStrategy: RollingUpdate, # (Optional) Can be either `RollingUpdate` or `OnDelete`
+      upgradeStrategy: "RollingUpdate" # (Optional) Can be either `RollingUpdate` or `OnDelete`
       maxUnavailable: 1 # (Optional) Number of pods that can be unavailable during the upgrade process. 1 is the default value
     # If specifying a node selector here, the metrics exporter will only be deployed on nodes that match the selector
     # See Item #6 on https://instinct.docs.amd.com/projects/gpu-operator/en/latest/knownlimitations.html for example usage
@@ -224,6 +224,29 @@ Below is an example of a full DeviceConfig CR that can be used to install the AM
     selector:
       feature.node.kubernetes.io/amd-gpu: "true" # You must include this again as this selector will overwrite the global selector
       amd.com/device-test-runner: "true" # Helpful for when you want to disable the test runner on specific nodes
+  configManager:
+    enable: False # False by Default. Set to True to enable the config manager
+    image: "rocm/device-config-manager:v1.3.1" # image for the device-config-manager container
+    imagePullPolicy: IfNotPresent # image pull policy for config manager. Accepted values are Always, IfNotPresent, Never
+    config: # specify configmap name which stores profile config info
+      name: "config-manager-config"
+    upgradePolicy:
+      #(Optional) If no UpgradePolicy is mentioned for any of the components but their image is changed, the daemonset will
+      # get upgraded according to the defaults, which is `upgradeStrategy` set to `RollingUpdate` and `maxUnavailable` set to 1.
+      upgradeStrategy: "RollingUpdate" # (Optional) Can be either `RollingUpdate` or `OnDelete`
+      maxUnavailable: 1 # (Optional) Number of pods that can be unavailable during the upgrade process. 1 is the default value
+    # DCM pod deployed either as a standalone pod or through the GPU operator will have
+    # a toleration attached to it. User can specify additional tolerations if required
+    # key: amd-dcm , value: up , Operator: Equal, effect: NoExecute
+    # OPTIONAL
+    # toleration field for dcm pod to bypass nodes with specific taints
+    configManagerTolerations:
+      - key: "key1"
+        operator: "Equal"
+        value: "value1"
+        effect: "NoExecute"
+    selector: # (Optional)
+      feature.node.kubernetes.io/amd-gpu: "true" # You can include this if you wish to overwrite the global selector
   selector:
     # Specify the nodes to be managed by this DeviceConfig Custom Resource. This will be applied to all components unless a selector
     # is specified in the component configuration. The node labeller will automatically find nodes with AMD GPUs and apply the label
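
The `configManagerTolerations` entry added above only has an effect when a node carries a matching taint. The sketch below pairs the two sides. It relies on standard Kubernetes semantics: a toleration with `operator: "Equal"` matches a taint when key, value, and effect are all equal. The node name is hypothetical, and the second document is a fragment of a DeviceConfig `spec`, not a complete manifest.

```yaml
# Node side: a NoExecute taint evicts pods that lack a matching toleration.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1                  # hypothetical node name
spec:
  taints:
    - key: "key1"
      value: "value1"
      effect: "NoExecute"
---
# DeviceConfig side (fragment of spec): an additional toleration so the
# DCM pod can stay on the tainted node. Values mirror the diff above.
configManager:
  configManagerTolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoExecute"
```

According to the comment in the diff, the DCM pod already carries a toleration for `amd-dcm=up:NoExecute`, so entries listed here would be in addition to that one.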

docs/overview.md

Lines changed: 11 additions & 0 deletions
@@ -85,3 +85,14 @@ The Test Runner offers hardware validation, diagnostics and benchmarking capabil
 - Support manually triggered or scheduled test execution within the Kubernetes cluster.
 - Support executing tests as init containers within the GPU workload pod.
 - Report test results as Kubernetes events.
+
+### Device Config Manager
+
+The [Device Config Manager](https://github.com/ROCm/device-config-manager) is used to handle AMD GPU Devices' configuration
+
+- DCM will be handling the GPU partitioning configurations
+- Different partition types supported are:
+  - Memory partitions (NPS1, NPS2, NPS4)
+  - Compute partitions (SPX, DPX, QPX, CPX)
+- Supports Systemd integration to start/stop service files
+- Report partition results as Kubernetes events.

example/configManager/deviceconfigs_example.yaml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ metadata:
 spec:

   configManager:
-    # To enable/disable the metrics exporter, enable to partition
+    # To enable/disable the config manager, enable to partition
     enable: True

     # image for the device-config-manager container

example/deviceconfig_example.yaml

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ spec:
     #- name: aws-secret

   configManager:
-    # To enable/disable the metrics exporter, enable to partition
+    # To enable/disable the config manager, enable to partition
     enable: True
     # image for the device-config-manager container
     image: rocm/device-config-manager:v1.3.1
