Conversation

@jayesh-srivastava
Member

What this PR does / why we need it:
When a Kubernetes upgrade is performed on a managed cluster, the new nodes come up with new UIDs. However, the MachinePool controller has an early-return condition that only validates the count of NodeRefs and doesn't check whether the stored UIDs are still valid. As a result, MachinePools retain stale NodeRef UIDs after an upgrade, causing UID mismatches that persist until manual intervention.
This PR adds UID validation logic before the early-return condition:

  • A new name-to-node map, nodeNameMap, is created.
  • We iterate over mp.Status.NodeRefs and, using the above map, look up each Node.
  • If the Node doesn't exist, or the fetched Node's UID doesn't match the UID in the NodeRef, we break out of the loop and continue with further reconciliation, which sets the correct nodeRef in the MachinePool (see the sketch after this list).
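A minimal sketch of that check, extracted as a helper for readability. The function shape and names, and the assumption that nodeRefMap is keyed by provider ID with *corev1.Node values, are inferred from the description above and the snippets quoted in the review below; this is a sketch, not the PR's exact code:

import corev1 "k8s.io/api/core/v1"

// validNodeRefUIDs reports whether every stored NodeRef still points at a
// live Node with a matching UID.
func validNodeRefUIDs(nodeRefs []corev1.ObjectReference, nodeRefMap map[string]*corev1.Node) bool {
    // Build a name-to-node map for O(1) lookup by NodeRef name.
    nodeNameMap := make(map[string]*corev1.Node, len(nodeRefMap))
    for _, node := range nodeRefMap {
        nodeNameMap[node.Name] = node
    }
    for _, nodeRef := range nodeRefs {
        foundNode, exists := nodeNameMap[nodeRef.Name]
        // A missing Node, or the same name with a new UID (the node was
        // recreated during the upgrade), means the stored NodeRef is stale.
        if !exists || foundNode.UID != nodeRef.UID {
            return false
        }
    }
    return true
}

When this returns false, the controller skips the early return and runs full reconciliation, which rewrites Status.NodeRefs.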

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #12388

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label labels Jun 24, 2025
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/machine Issues or PRs related to machine lifecycle management labels Jun 24, 2025
@k8s-ci-robot
Contributor

@jayesh-srivastava: The label(s) area/pool cannot be applied, because the repository doesn't have them.

In response to this:

/area machine pool

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-area PR is missing an area label label Jun 24, 2025
@jayesh-srivastava
Member Author

/area machinepool

@k8s-ci-robot k8s-ci-robot added the area/machinepool Issues or PRs related to machinepools label Jun 24, 2025
Contributor

@mboersma mboersma left a comment


It seems reasonable to add this additional verification in general, but I wonder if this problem has been seen in providers other than CAPZ.

// Validate that the UIDs in NodeRefs are still valid
if s.nodeRefMap != nil {
    // Create a name-to-node mapping for efficient lookup
    nodeNameMap := make(map[string]*corev1.Node)
Contributor


Suggested change
nodeNameMap := make(map[string]*corev1.Node)
nodeNameMap := make(map[string]*corev1.Node, len(s.nodeRefMap))
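
Pre-sizing the map with len(s.nodeRefMap) lets Go allocate the buckets once up front instead of growing the map as entries are inserted, since the final size (one entry per tracked node) is already known.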

Member Author


I have addressed this suggestion.

@sbueringer
Member

It seems reasonable to add this additional verification in general, but I wonder if this problem has been seen in providers other than CAPZ.

cc @richardcase @justinsb

@MadJlzz

MadJlzz commented Jul 2, 2025

Is it possible that, because of such a mismatch, after scaling the replica count down from 3 to 2 with kubectl, CAPI decides to drain and delete ALL of the nodes in the machine pool? This happened to me this morning. I am using the following versions:

> kg bootstrapproviders,controlplaneproviders,coreproviders,infrastructureproviders -A
NAMESPACE                  NAME                                                  INSTALLEDVERSION   READY
kubeadm-bootstrap-system   bootstrapprovider.operator.cluster.x-k8s.io/kubeadm   v1.9.5             True

NAMESPACE                      NAME                                                     INSTALLEDVERSION   READY
kubeadm-control-plane-system   controlplaneprovider.operator.cluster.x-k8s.io/kubeadm   v1.9.5             True

NAMESPACE     NAME                                                 INSTALLEDVERSION   READY
capi-system   coreprovider.operator.cluster.x-k8s.io/cluster-api   v1.9.5             True

NAMESPACE     NAME                                                     INSTALLEDVERSION   READY
capz-system   infrastructureprovider.operator.cluster.x-k8s.io/azure   v1.17.4            True

@AndiDog
Contributor

AndiDog commented Jul 15, 2025

It seems reasonable to add this additional verification in general, but I wonder if this problem has been seen in providers other than CAPZ.

cc @richardcase @justinsb

I haven't seen this with CAPA on recently upgraded EC2-only (non-EKS) clusters. But worker nodes typically roll on a Kubernetes upgrade due to the changed machine image, and I guess this wouldn't be an issue then, since no old nodes are expected to remain. I don't know why existing Node objects would be recreated with a new UID under CAPZ. Any source/issue for that?

Member

@richardcase richardcase left a comment


Looks good to me @jayesh-srivastava. Just the one comment/suggestion.

foundNode, exists := nodeNameMap[nodeRef.Name]

// If node not found or UID doesn't match, mark as invalid
if !exists || foundNode.UID != nodeRef.UID {
Member


It might be good to have a higher verbosity log entry if the node names are the same but the UIDs are different.
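
A hedged sketch of what such a log entry could look like; the helper name, logger plumbing, and verbosity level are illustrative assumptions, not the merged code:

import (
    "github.com/go-logr/logr"
    corev1 "k8s.io/api/core/v1"
)

// logUIDMismatch logs at higher verbosity when a Node with the expected
// name exists but carries a different UID, i.e. it was recreated.
func logUIDMismatch(log logr.Logger, nodeRef corev1.ObjectReference, foundNode *corev1.Node) {
    log.V(4).Info("NodeRef UID mismatch, NodeRefs will be recomputed",
        "node", nodeRef.Name,
        "nodeRefUID", nodeRef.UID,
        "nodeUID", foundNode.UID)
}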

Member Author


Addressed.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 21, 2025
Contributor

@AndiDog AndiDog left a comment


LGTM

@sbueringer
Member

/assign @richardcase @mboersma

lgtm from your side?

@richardcase
Member

From my side:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: c7fa774464cd9a1993182fdfbafcc15966e27bec

@sbueringer sbueringer added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Nov 12, 2025
@sbueringer
Member

/test pull-cluster-api-e2e-main-gke

@sbueringer
Member

Thx!

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 12, 2025
@k8s-ci-robot k8s-ci-robot merged commit 84e2bb6 into kubernetes-sigs:main Nov 12, 2025
21 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.12 milestone Nov 12, 2025
alicefr pushed a commit to alicefr/cluster-api that referenced this pull request Nov 14, 2025
…sigs#12392)

* Fix MachinePool nodeRef UID mismatch after K8s upgrade

* Add len when creating nodeNameMap

* Add logs