Add slurm-gke blueprint #4607
This PR includes using the existing NFS server on the controller to distribute the Slurm key to nodesets running on GKE. If this is accepted, we can close #4562, as it is no longer needed.
/gcbrun
Only partial review. I will try to send more by 6pm PT.
community/modules/compute/gke-nodeset/templates/nodeset-general.yaml.tftpl (four resolved review threads, two of them outdated)
9ab8f60 to 9ebb0fe
/gcbrun
I deployed the blueprint from this PR. After a few minutes, the GKE-based nodes went into the DOWN state. The slurmd log shows:

If nodes go down for this reason, then we need our own build of the slurmd container with SlurmUser=401.
This was caused by a UID mismatch in the public Slinky image. A custom-built image with the correct UID is required to fix this. I will open a follow-up PR with instructions for building this image.
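For anyone who wants to confirm the mismatch before rebuilding, here is a minimal sketch of the check, not the project's own tooling. It parses passwd-format text (for example, the contents of `/etc/passwd` copied out of the container) and compares one account's UID against the 401 from `SlurmUser=401`. The account name `slurm` and the sample passwd lines are assumptions for illustration; the real image may use a different account name.

```python
# Expected UID, taken from the SlurmUser=401 value discussed in this thread.
EXPECTED_SLURM_UID = 401


def uid_for_user(passwd_text, username):
    """Return the numeric UID of `username` in passwd-format text, or None."""
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # passwd format: name:password:UID:GID:GECOS:home:shell
        fields = line.split(":")
        if len(fields) >= 3 and fields[0] == username:
            return int(fields[2])
    return None


def uid_matches(passwd_text, username="slurm", expected_uid=EXPECTED_SLURM_UID):
    """True if `username` exists and has the expected UID.

    The default account name "slurm" is an assumption, not confirmed here.
    """
    return uid_for_user(passwd_text, username) == expected_uid


if __name__ == "__main__":
    # Hypothetical passwd contents, for illustration only.
    sample = (
        "root:x:0:0:root:/root:/bin/bash\n"
        "slurm:x:999:999::/home/slurm:/bin/false\n"
    )
    print("UID matches:", uid_matches(sample))
```

If the check reports a mismatch, that is consistent with the diagnosis above and points to the custom-built image as the fix.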
From time to time the deployment fails: pods are stuck in the init stage with the message:

Also, ./gcluster deploy sometimes has to be run twice, because the first run fails with the error message below:
066b753
/gcbrun
/gcbrun
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.