@bytetwin bytetwin commented Aug 28, 2025

For large clusters with dynamic nodes, frequent node addition and removal forces the controller to reconfigure its topology each time. This causes the controller to freeze and become unable to handle other signals.

This PR disables the tree topology plugin when the cluster has more than 50 dynamic nodes. This may result in suboptimal node usage and job scheduling, but it prevents Slurm from becoming unresponsive.
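
For reviewers, the gating rule can be sketched roughly as below. This is only an illustration of the threshold check, not the code in this PR: the function name `topology_plugin_line` and the constant `DYNAMIC_NODE_THRESHOLD` are hypothetical, and the sketch assumes the plugin is enabled through a `TopologyPlugin=topology/tree` line in `slurm.conf`.

```python
from typing import Optional

# Threshold taken from the PR description ("above 50 dynamic nodes").
# The constant name is hypothetical, chosen for this sketch only.
DYNAMIC_NODE_THRESHOLD = 50


def topology_plugin_line(dynamic_node_count: int) -> Optional[str]:
    """Return the slurm.conf line that enables the tree topology plugin,
    or None when the cluster has too many dynamic nodes.

    Returning None means "emit no TopologyPlugin line", which leaves
    Slurm on its default topology behavior.
    """
    if dynamic_node_count > DYNAMIC_NODE_THRESHOLD:
        # Large dynamic clusters: skip the tree plugin to avoid repeated
        # topology reconfiguration on every node add/remove.
        return None
    return "TopologyPlugin=topology/tree"
```

With this rule, a cluster of exactly 50 dynamic nodes still gets the tree plugin, and 51 or more does not.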

@bytetwin bytetwin changed the base branch from main to develop August 28, 2025 17:48