Commit 7b59f48

feat: Add MachineHealthCheck example template (#175)
1 parent 6a30ad7 commit 7b59f48

3 files changed: +307 -0 lines changed


docs/src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@
 - [Provision a management cluster with OKE](./gs/mgmt/mgmt-oke.md)
 - [Install Cluster API for Oracle Cloud Infrastructure](./gs/install-cluster-api.md)
 - [Create Workload Cluster](./gs/create-workload-cluster.md)
+- [MachineHealthChecks](./gs/create-mhc-workload-cluster.md)
 - [Create GPU Workload Cluster](./gs/create-gpu-workload-cluster.md)
 - [Create Workload Templates](./gs/create-workload-templates.md)
 - [Using externally managed infrastructure](./gs/externally-managed-cluster-infrastructure.md)
Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
# Create a workload cluster with MachineHealthChecks (MHC)

To better understand MachineHealthChecks, please read over [the Cluster-API book][mhc]
and make sure to read the [limitations][mhc-limitations] section.

## Create a new workload cluster with MHC

In the project's code repository we provide an [example template][mhc-template] that sets up two MachineHealthChecks
at workload cluster creation time. The example sets up two MHCs to allow differing remediation values:

- `control-plane-unhealthy-5m` sets up a health check for the control plane machines
- `md-unhealthy-5m` sets up a health check for the workload machines

> NOTE: As a part of the example template, the MHCs will start remediating nodes that are `not ready` after 10 minutes.
In order to prevent this side effect, make sure to [install your CNI][install-a-cni-provider] once the API is available.
This will move the machines into a `Ready` state.

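The flow described in this section can be sketched as two small shell helpers. This is a minimal sketch, not the project's official procedure: the function names, the default template path, the `<cluster-name>.kubeconfig` file name, and the 600-second wait are all assumptions chosen for illustration. It also assumes the variables the template expects (for example `OCI_COMPARTMENT_ID`, `OCI_IMAGE_ID`, `OCI_SSH_KEY`, `NAMESPACE`, and `KUBERNETES_VERSION`) are already exported.

```shell
#!/usr/bin/env bash
# Sketch only: function names, template path, and timeout are assumptions.

# Render the example template and create the workload cluster objects
# in the management cluster.
create_mhc_cluster() {
  local cluster_name="$1"
  local template="${2:-templates/cluster-template-healcheck.yaml}"
  clusterctl generate cluster "${cluster_name}" --from "${template}" | kubectl apply -f -
}

# Fetch the workload cluster kubeconfig, then block until every node reports Ready.
# Run this after the CNI has been installed in the workload cluster.
wait_for_ready_nodes() {
  local cluster_name="$1"
  local kubeconfig="${cluster_name}.kubeconfig"
  clusterctl get kubeconfig "${cluster_name}" > "${kubeconfig}"
  kubectl --kubeconfig "${kubeconfig}" wait --for=condition=Ready nodes --all --timeout=600s
}
```

A possible sequence is `create_mhc_cluster my-cluster`, then installing the CNI once the API server answers, then `wait_for_ready_nodes my-cluster` to confirm the nodes became `Ready` before the MHC timeouts expire.
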
## Add MHC to existing workload cluster

Another approach is to install the MHC after the cluster is up and healthy (a Day-2 operation). This can prevent
machine remediation while setting up the cluster.

### Add control-plane MHC

We need to add the `controlplane.remediation` label to the `KubeadmControlPlane`.

Create a file named `control-plane-patch.yaml` that has this content:
```yaml
spec:
  machineTemplate:
    metadata:
      labels:
        controlplane.remediation: ""
```

Then run `kubectl patch KubeadmControlPlane <your-cluster-name>-control-plane --patch-file control-plane-patch.yaml --type=merge`.

Then add the new label to any existing control-plane node(s) with
`kubectl label node <control-plane-name> controlplane.remediation=""`. This will prevent the `KubeadmControlPlane` from provisioning
new nodes once the MHC is deployed.

Create a file named `control-plane-mhc.yaml` that has this content:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: "<your-cluster-name>-control-plane-unhealthy-5m"
spec:
  clusterName: "<your-cluster-name>"
  maxUnhealthy: 100%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      controlplane.remediation: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```

Then run `kubectl apply -f control-plane-mhc.yaml`.

Then run `kubectl get machinehealthchecks` to check that your MachineHealthCheck sees the expected machines.

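For a closer look than the default table output, the check can be sketched as two small helpers. This is a hedged example: the function names are hypothetical, and it assumes the `status.expectedMachines` and `status.currentHealthy` fields of the `v1beta1` MachineHealthCheck and a `kubectl` context pointing at the management cluster.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print how many machines the MachineHealthCheck targets and how many are healthy.
mhc_status() {
  local mhc_name="$1"
  kubectl get machinehealthcheck "${mhc_name}" \
    -o jsonpath='{.status.expectedMachines}{" expected, "}{.status.currentHealthy}{" healthy"}{"\n"}'
}

# List the Machines that carry a given label key, e.g. controlplane.remediation.
machines_with_label() {
  local label_key="$1"
  kubectl get machines -l "${label_key}" -o name
}
```

For example, `mhc_status <your-cluster-name>-control-plane-unhealthy-5m` followed by `machines_with_label controlplane.remediation` shows whether the number of expected machines matches the Machines carrying the label.
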
### Add machine MHC

We need to add the `machine.remediation` label to the `MachineDeployment`.

Create a file named `machine-patch.yaml` that has this content:
```yaml
spec:
  template:
    metadata:
      labels:
        machine.remediation: ""
```

Then run `kubectl patch MachineDeployment <your-cluster-name>-md-0 --patch-file machine-patch.yaml --type=merge`.

Then add the new label to any existing worker node(s) with
`kubectl label node <machine-name> machine.remediation=""`. This will prevent the `MachineDeployment` from provisioning
new nodes once the MHC is deployed.

Create a file named `machine-mhc.yaml` that has this content:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: "<your-cluster-name>-md-unhealthy-5m"
spec:
  clusterName: "<your-cluster-name>"
  maxUnhealthy: 100%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      machine.remediation: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```

Then run `kubectl apply -f machine-mhc.yaml`.

Then run `kubectl get machinehealthchecks` to check that your MachineHealthCheck sees the expected machines.

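The control-plane and machine manifests differ only in their name suffix and selector label, so both could be rendered from one helper. The sketch below is an illustration, not part of the project: `render_mhc` is a hypothetical function, and the timeout values are copied from the manifests in this guide.

```shell
#!/usr/bin/env bash
# Sketch only: render_mhc is a hypothetical helper, not part of the project.

# Print a MachineHealthCheck manifest for the given cluster, name suffix, and label key.
render_mhc() {
  local cluster_name="$1" name_suffix="$2" label_key="$3"
  cat <<EOF
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: "${cluster_name}-${name_suffix}"
spec:
  clusterName: "${cluster_name}"
  maxUnhealthy: 100%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      ${label_key}: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
EOF
}
```

For example, `render_mhc my-cluster md-unhealthy-5m machine.remediation | kubectl apply -f -` would create the machine MHC, and swapping in `control-plane-unhealthy-5m` and `controlplane.remediation` would create the control-plane one.
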
[install-a-cni-provider]: ../gs/create-workload-cluster.md#install-a-cni-provider
[mhc]: https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html
[mhc-limitations]: https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#limitations-and-caveats-of-a-machinehealthcheck
[mhc-template]: https://github.com/oracle/cluster-api-provider-oci/blob/main/templates/cluster-template-healcheck.yaml
Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: "${CLUSTER_NAME}"
  name: "${CLUSTER_NAME}"
  namespace: "${NAMESPACE}"
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - ${POD_CIDR:="192.168.0.0/16"}
    serviceDomain: ${SERVICE_DOMAIN:="cluster.local"}
    services:
      cidrBlocks:
        - ${SERVICE_CIDR:="10.128.0.0/12"}
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: OCICluster
    name: "${CLUSTER_NAME}"
    namespace: "${NAMESPACE}"
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: "${CLUSTER_NAME}-control-plane"
    namespace: "${NAMESPACE}"
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OCICluster
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: "${CLUSTER_NAME}"
  name: "${CLUSTER_NAME}"
spec:
  compartmentId: "${OCI_COMPARTMENT_ID}"
---
kind: KubeadmControlPlane
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
metadata:
  name: "${CLUSTER_NAME}-control-plane"
  namespace: "${NAMESPACE}"
spec:
  version: "${KUBERNETES_VERSION}"
  replicas: ${CONTROL_PLANE_MACHINE_COUNT}
  machineTemplate:
    metadata:
      labels:
        controlplane.remediation: ""
    infrastructureRef:
      kind: OCIMachineTemplate
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      name: "${CLUSTER_NAME}-control-plane"
      namespace: "${NAMESPACE}"
  kubeadmConfigSpec:
    clusterConfiguration:
      kubernetesVersion: ${KUBERNETES_VERSION}
      apiServer:
        certSANs: [localhost, 127.0.0.1]
      dns: {}
      etcd: {}
      networking: {}
      scheduler: {}
    initConfiguration:
      nodeRegistration:
        criSocket: /var/run/containerd/containerd.sock
        kubeletExtraArgs:
          cloud-provider: external
          provider-id: oci://{{ ds["id"] }}
    joinConfiguration:
      discovery: {}
      nodeRegistration:
        criSocket: /var/run/containerd/containerd.sock
        kubeletExtraArgs:
          cloud-provider: external
          provider-id: oci://{{ ds["id"] }}
---
kind: OCIMachineTemplate
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
metadata:
  name: "${CLUSTER_NAME}-control-plane"
  # labels:
  #   controlplane.remediation: ""
spec:
  template:
    spec:
      imageId: "${OCI_IMAGE_ID}"
      compartmentId: "${OCI_COMPARTMENT_ID}"
      shape: "${OCI_CONTROL_PLANE_MACHINE_TYPE=VM.Standard.E4.Flex}"
      shapeConfig:
        ocpus: "${OCI_CONTROL_PLANE_MACHINE_TYPE_OCPUS=1}"
      metadata:
        ssh_authorized_keys: "${OCI_SSH_KEY}"
      isPvEncryptionInTransitEnabled: ${OCI_CONTROL_PLANE_PV_TRANSIT_ENCRYPTION=true}
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OCIMachineTemplate
metadata:
  name: "${CLUSTER_NAME}-md-0"
  # labels:
  #   machine.remediation: ""
spec:
  template:
    spec:
      imageId: "${OCI_IMAGE_ID}"
      compartmentId: "${OCI_COMPARTMENT_ID}"
      shape: "${OCI_NODE_MACHINE_TYPE=VM.Standard.E4.Flex}"
      shapeConfig:
        ocpus: "${OCI_NODE_MACHINE_TYPE_OCPUS=1}"
      metadata:
        ssh_authorized_keys: "${OCI_SSH_KEY}"
      isPvEncryptionInTransitEnabled: ${OCI_NODE_PV_TRANSIT_ENCRYPTION=true}
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: "${CLUSTER_NAME}-md-0"
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            cloud-provider: external
            provider-id: oci://{{ ds["id"] }}
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: "${CLUSTER_NAME}-md-0"
  # labels:
  #   machine.remediation: ""
spec:
  clusterName: "${CLUSTER_NAME}"
  replicas: ${NODE_MACHINE_COUNT}
  selector:
    matchLabels:
  template:
    metadata:
      labels:
        machine.remediation: ""
    spec:
      clusterName: "${CLUSTER_NAME}"
      version: "${KUBERNETES_VERSION}"
      bootstrap:
        configRef:
          name: "${CLUSTER_NAME}-md-0"
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
      infrastructureRef:
        name: "${CLUSTER_NAME}-md-0"
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: OCIMachineTemplate
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: "${CLUSTER_NAME}-control-plane-unhealthy-5m"
spec:
  clusterName: "${CLUSTER_NAME}"
  maxUnhealthy: 100%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      controlplane.remediation: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: "${CLUSTER_NAME}-md-unhealthy-5m"
spec:
  clusterName: "${CLUSTER_NAME}"
  maxUnhealthy: 100%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      machine.remediation: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
