Collaborator

@ACW101 ACW101 commented Sep 4, 2025

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@ACW101 ACW101 requested review from a team and samskillman as code owners September 4, 2025 05:41
Collaborator Author

ACW101 commented Sep 4, 2025

This PR uses the existing NFS server on the controller to distribute the Slurm key to nodesets running on GKE. If this is accepted, we can close #4562, as it will no longer be needed.

@nick-stroud nick-stroud added the release-improvements label (Added to release notes under the "Improvements" heading.) Sep 9, 2025
@nick-stroud
Collaborator

/gcbrun

Collaborator

@nick-stroud nick-stroud left a comment

Only partial review. I will try to send more by 6pm PT.

@ACW101 ACW101 force-pushed the blueprint branch 5 times, most recently from 9ab8f60 to 9ebb0fe September 12, 2025 17:56
@ACW101 ACW101 requested a review from nick-stroud September 18, 2025 21:04
Collaborator Author

ACW101 commented Sep 24, 2025

/gcbrun

Contributor

pawloch00 commented Sep 25, 2025

I deployed the blueprint from this PR. After a few minutes, the GKE-based nodes went into the DOWN state. The slurmd log shows:

CPUs=8 Boards=1 Sockets=1 Cores=8 Threads=1 Memory=64309 TmpDisk=96515 Uptime=67206 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2025-09-25T08:43:49.012] error: Security violation, ping RPC from uid 981
[2025-09-25T08:43:49.012] error: Do you have SlurmUser configured as uid 981?

If nodes are going down for this reason, then we need our own build of the slurmd container with SlurmUser=401.

Collaborator Author

ACW101 commented Sep 25, 2025

This was caused by a UID mismatch in the public Slinky image. A custom-built image with the correct UID is required to fix this. I will open a follow-up PR with instructions for building this image.
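
For anyone triaging the same symptom, a minimal sketch for spotting the mismatch in a slurmd log follows. It assumes the log lines have the same shape as the ones reported earlier in this thread; the function name `rejected_rpc_uids` and the `SAMPLE` constant are hypothetical and not part of the Toolkit or Slurm. The UIDs it returns can then be compared by hand against the UID of the Slurm user inside the slurmd container.

```python
import re

# Matches slurmd lines of the form seen in this thread, e.g.
# "[2025-09-25T08:43:49.012] error: Security violation, ping RPC from uid 981"
VIOLATION_RE = re.compile(r"error: Security violation, \w+ RPC from uid (\d+)")


def rejected_rpc_uids(log_text):
    """Return the sorted, de-duplicated sender UIDs that slurmd rejected.

    Hypothetical helper: it only extracts UIDs from the log text and does
    not query Slurm or the container in any way.
    """
    uids = set()
    for line in log_text.splitlines():
        match = VIOLATION_RE.search(line)
        if match:
            uids.add(int(match.group(1)))
    return sorted(uids)


# Sample copied from the log excerpt posted in this thread.
SAMPLE = """\
[2025-09-25T08:43:49.012] error: Security violation, ping RPC from uid 981
[2025-09-25T08:43:49.012] error: Do you have SlurmUser configured as uid 981?
"""

if __name__ == "__main__":
    print(rejected_rpc_uids(SAMPLE))
```

If the returned UID differs from the UID of the Slurm user in the slurmd image, the nodes are likely hitting the mismatch described above.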

pawloch00 previously approved these changes Sep 26, 2025
Contributor

pawloch00 commented Sep 30, 2025

From time to time the deployment fails: pods get stuck in the init stage with the message:

MountVolume.MountDevice failed for volume "slurm-key-pv" : rpc error: code = DeadlineExceeded desc = context deadline exceeded 

Also, ./gcluster deploy sometimes has to be run twice, because the first run fails with the error message below:

for: "/tmp/608228131kubectl_manifest.yaml": error when patching "/tmp/608228131kubectl_manifest.yaml": PersistentVolume "slurm-key-pv" is invalid: spec.persistentvolumesource: Forbidden: spec.persistentvolumesource is immutable after creation
  core.PersistentVolumeSource{

@nick-stroud nick-stroud self-assigned this Oct 6, 2025
samskillman previously approved these changes Oct 6, 2025
samskillman previously approved these changes Oct 6, 2025
Collaborator Author

ACW101 commented Oct 7, 2025

/gcbrun

nick-stroud previously approved these changes Oct 7, 2025
Collaborator Author

ACW101 commented Oct 8, 2025

/gcbrun

@ACW101 ACW101 merged commit 1f28255 into GoogleCloudPlatform:develop Oct 9, 2025
11 of 70 checks passed