Closed
@@ -323,6 +323,7 @@ limitations under the License.
| <a name="input_controller_startup_scripts_timeout"></a> [controller\_startup\_scripts\_timeout](#input\_controller\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in controller\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_state_disk"></a> [controller\_state\_disk](#input\_controller\_state\_disk) | A disk that will be attached to the controller instance template to save state of slurm. The disk is created and used by default.<br/> To disable this feature, set this variable to null.<br/><br/> NOTE: This will not save the contents at /opt/apps and /home. To preserve those, they must be saved externally. | <pre>object({<br/> type = string<br/> size = number<br/> })</pre> | <pre>{<br/> "size": 50,<br/> "type": "pd-ssd"<br/>}</pre> | no |
| <a name="input_create_bucket"></a> [create\_bucket](#input\_create\_bucket) | Create GCS bucket instead of using an existing one. | `bool` | `true` | no |
+| <a name="input_default_auth_key"></a> [default\_auth\_key](#input\_default\_auth\_key) | Default auth key value ex. slurm.key | `string` | `""` | no |
| <a name="input_deployment_name"></a> [deployment\_name](#input\_deployment\_name) | Name of the deployment. | `string` | n/a | yes |
| <a name="input_disable_controller_public_ips"></a> [disable\_controller\_public\_ips](#input\_disable\_controller\_public\_ips) | DEPRECATED: Use `enable_controller_public_ips` instead. | `bool` | `null` | no |
| <a name="input_disable_default_mounts"></a> [disable\_default\_mounts](#input\_disable\_default\_mounts) | DEPRECATED: Use `enable_default_mounts` instead. | `bool` | `null` | no |
@@ -73,6 +73,7 @@ No modules.
| <a name="input_controller_startup_scripts"></a> [controller\_startup\_scripts](#input\_controller\_startup\_scripts) | List of scripts to be run on controller VM startup. | <pre>list(object({<br/> filename = string<br/> content = string<br/> }))</pre> | `[]` | no |
| <a name="input_controller_startup_scripts_timeout"></a> [controller\_startup\_scripts\_timeout](#input\_controller\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in controller\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_state_disk"></a> [controller\_state\_disk](#input\_controller\_state\_disk) | A disk that will be attached to the controller instance template to save state of slurm. The disk is created and used by default.<br/> To disable this feature, set this variable to null.<br/><br/> NOTE: This will not save the contents at /opt/apps and /home. To preserve those, they must be saved externally. | <pre>object({<br/> device_name = string<br/> })</pre> | <pre>{<br/> "device_name": null<br/>}</pre> | no |
+| <a name="input_default_auth_key"></a> [default\_auth\_key](#input\_default\_auth\_key) | Default auth key value ex. slurm.key | `string` | `""` | no |
| <a name="input_disable_default_mounts"></a> [disable\_default\_mounts](#input\_disable\_default\_mounts) | Disable default global network storage from the controller<br/>- /home<br/>- /apps | `bool` | `false` | no |
| <a name="input_enable_bigquery_load"></a> [enable\_bigquery\_load](#input\_enable\_bigquery\_load) | Enables loading of cluster job usage into big query.<br/><br/>NOTE: Requires Google Bigquery API. | `bool` | `false` | no |
| <a name="input_enable_chs_gpu_health_check_epilog"></a> [enable\_chs\_gpu\_health\_check\_epilog](#input\_enable\_chs\_gpu\_health\_check\_epilog) | Enable a Cluster Health Scanner (CHS) GPU health check that slurmd executes as an epilog script after completing a job step from a new job allocation.<br/>Compute nodes that fail GPU health check during epilog will be marked as drained. Find more details at:<br/>https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/docs/CHS-Slurm.md | `bool` | `false` | no |
@@ -59,6 +59,7 @@ locals {
# timeouts
controller_startup_scripts_timeout = var.controller_startup_scripts_timeout
compute_startup_scripts_timeout = var.compute_startup_scripts_timeout
+default_auth_key = var.default_auth_key

munge_mount = local.munge_mount
slurm_key_mount = var.slurm_key_mount
@@ -18,6 +18,7 @@
import argparse
import logging
import os
+import secrets
import shutil
import subprocess
import stat
@@ -216,8 +217,11 @@ def setup_jwt_key():
util.chown_slurm(jwt_key, mode=0o400)


-def _generate_key(p: Path) -> None:
-    run(f"dd if=/dev/random of={p} bs=1024 count=1")
+def _generate_key(p: Path, default_value: str = "") -> None:
+    if default_value != "":
+        p.write_text(default_value)
+    else:
+        p.write_bytes(secrets.token_bytes(1024))


def setup_key(lkp: util.Lookup) -> None:
@@ -234,7 +238,7 @@ def setup_key(lkp: util.Lookup) -> None:
# Copy key from persistent state disk
persist = slurmdirs.state / file_name
if not persist.exists():
-_generate_key(persist)
+_generate_key(persist, lkp.cfg.default_auth_key)

shutil.copyfile(persist, dst)
if lkp.cfg.enable_slurm_auth:
@@ -247,7 +251,7 @@ def setup_key(lkp: util.Lookup) -> None:
if dst.exists():
log.info("key already exists. Skipping key generation.")
else:
-_generate_key(dst)
+_generate_key(dst, lkp.cfg.default_auth_key)
if lkp.cfg.enable_slurm_auth:
util.chown_slurm(dst, mode=0o400)
else:
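
The key-generation change in the hunks above can be tried in isolation with the minimal sketch below. It is not the module's code: `generate_key` and `ensure_key` are hypothetical stand-ins for `_generate_key` and for the persist-then-copy branch of `setup_key`, and ownership and permission handling (`util.chown_slurm`) is deliberately left out.

```python
import secrets
import shutil
import tempfile
from pathlib import Path

KEY_SIZE_BYTES = 1024  # same size as the random key written in the diff


def generate_key(path: Path, default_value: str = "") -> None:
    """Write default_value verbatim if it is set, otherwise random bytes."""
    if default_value != "":
        path.write_text(default_value)
    else:
        path.write_bytes(secrets.token_bytes(KEY_SIZE_BYTES))


def ensure_key(persist: Path, dst: Path, default_value: str = "") -> None:
    """Generate the persisted key only when it is missing, then copy it to dst."""
    if not persist.exists():
        generate_key(persist, default_value)
    shutil.copyfile(persist, dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        ensure_key(root / "persist.key", root / "dst.key")
        print(len((root / "dst.key").read_bytes()))  # 1024
```

Because the persisted file is only written when it does not exist, a `default_auth_key` supplied after a key has already been generated would not replace that key in this sketch.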
@@ -501,3 +501,9 @@ variable "controller_network_attachment" {
type = string
default = null
}

+variable "default_auth_key" {
+  description = "Default auth key value ex. slurm.key"
+  type        = string
+  default     = ""
+}
@@ -156,6 +156,7 @@ module "slurm_files" {
extra_logging_flags = var.extra_logging_flags

enable_slurm_auth = var.enable_slurm_auth
+default_auth_key = var.default_auth_key

enable_bigquery_load = var.enable_bigquery_load
enable_external_prolog_epilog = var.enable_external_prolog_epilog
@@ -453,6 +453,12 @@ EOD
default = false
}

+variable "default_auth_key" {
+  description = "Default auth key value ex. slurm.key"
+  type        = string
+  default     = ""
+}

variable "cloud_parameters" {
description = "cloud.conf options. Defaults inherited from [Slurm GCP repo](https://github.com/GoogleCloudPlatform/slurm-gcp/blob/master/terraform/slurm_cluster/modules/slurm_files/README_TF.md#input_cloud_parameters)"
type = object({