This guide covers important details and examples for accessing and using Vector's Killarney cluster.
- [Tiers](#tiers)
- [Automatic Restarts](#automatic-restarts)
- [Checkpoints](#checkpoints)
- [Support](#support)

# Logging onto Killarney

The Alliance documentation provides some general information about the Killarney cluster: [https://docs.alliancecan.ca/wiki/Killarney](https://docs.alliancecan.ca/wiki/Killarney)

## Getting an Account

To log into Killarney, the first thing you need is an account. Please read the Alliance's page [Apply for a CCDB account](https://www.alliancecan.ca/en/our-services/advanced-research-computing/account-management/apply-account) for all the steps needed here.
## Public Key Setup

For SSH access, you **must** add a public key to your Alliance Canada account. The Alliance provides full instructions for this at [https://docs.alliancecan.ca/wiki/SSH_Keys](https://docs.alliancecan.ca/wiki/SSH_Keys).

The following steps are a distilled version of these instructions for macOS and Linux.

On the personal computer you'll be connecting from, generate an SSH key pair with the following command. When prompted, use the default file name and leave the passphrase empty.
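The command itself is elided from this excerpt; as a minimal sketch using the standard `ssh-keygen` tool (the key type shown is an assumption):

```
# Generate a key pair; press Enter for the default file name and leave the passphrase empty
ssh-keygen -t ed25519

# Print the public key, which you will then add to your Alliance (CCDB) account
cat ~/.ssh/id_ed25519.pub
```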
In addition to your home directory, you have a minimum of 250 GB of additional scratch space (up to 2 TB, depending on your user level) available in the following location: `/scratch/$USER`, or simply `$SCRATCH`.
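For example, to confirm where your scratch space is and how much of it you are using (a small sketch with standard shell commands):

```
echo $SCRATCH     # prints /scratch/<your username>
du -sh $SCRATCH   # total size of everything in your scratch space
```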
**⚠️ Unlike your home directory, this scratch space is temporary. It will get automatically purged of files that have not been accessed in 60 days.**

A detailed description of the scratch purging policy is available on the Alliance Canada website: [https://docs.alliancecan.ca/wiki/Scratch_purging_policy](https://docs.alliancecan.ca/wiki/Scratch_purging_policy)
## Shared projects
For collaborative projects where many people need access to the same files, you need a shared project space. These are stored at `/project`.

To set up a shared project space, send a request to [[email protected]](mailto:[email protected]). Describe what the project is about, which users need access, how much disk space you need, and an end date when it can be removed.
## Shared datasets
To reduce the storage footprint for each user, we've made various commonly-used datasets like MIMIC and IMAGENET available for everyone to use. These are stored at `/datasets`.

Instead of copying these datasets into your home directory, you can create a symlink via:
```
ln -s /datasets/PATH_TO_DATASET ~/PATH_OF_LINK  # the link can live somewhere in your home directory so that PyTorch/TF can pick up the already-downloaded dataset
```
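As a concrete illustration (the dataset directory name below is hypothetical; list `/datasets` to see what is actually available):

```
ls /datasets                                # see which datasets are available
mkdir -p ~/data
ln -s /datasets/imagenet ~/data/imagenet    # "imagenet" is a hypothetical directory name
```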
For a list of available datasets, please see [Current Datasets](https://support.vectorinstitute.ai/CurrentDatasets).

## Shared model weights

Similar to datasets, model weights are typically very large and can be shared among many users. We've made various common model weights such as Llama3, Mixtral and Stable Diffusion available at `/model-weights`.
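For example, you can link a set of weights into your home directory in the same way as a dataset (the directory name below is hypothetical; list `/model-weights` to see what is actually available):

```
ls /model-weights                                 # see which model weights are available
mkdir -p ~/models
ln -s /model-weights/Llama-3-8B ~/models/llama3   # "Llama-3-8B" is a hypothetical directory name
```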
## Training checkpoints
Unlike the legacy Bon Echo (Vaughan) cluster, there is no dedicated checkpoint space in the Killarney cluster. Now that the `$SCRATCH` space has been greatly expanded, please use this for any training checkpoints.
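For example, a minimal sketch of keeping checkpoints in scratch (the directory and flag names are hypothetical):

```
mkdir -p $SCRATCH/checkpoints/my_run
# then point your training script at it, e.g.:
# python train.py --checkpoint-dir $SCRATCH/checkpoints/my_run
```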
# Migration from legacy Vaughan (Bon Echo) Cluster
The easiest way to migrate data from the legacy Vaughan (Bon Echo) Cluster to Killarney is by using a file transfer command (likely `rsync` or `scp`) from an SSH session.

Start by connecting via SSH into the legacy Bon Echo (Vaughan) cluster:
```
username@my-desktop:~$ ssh v.vectorinstitute.ai
Password:
Duo two-factor login for username

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-3089
 2. SMS passcodes to XXX-XXX-3089

Passcode or option (1-2): 1
Success. Logging you in...
Welcome to the Vector Institute HPC - Vaughan Cluster

Login nodes are shared among many users and therefore
must not be used to run computationally intensive tasks.
Those should be submitted to the slurm scheduler which
will dispatch them on compute nodes.

For more information, please consult the wiki at
https://support.vectorinstitute.ai/Computing
For issues using this cluster, please contact us at
If you forget your password, please visit our self-
service portal at https://password.vectorinstitute.ai.

Last login: Mon Aug 18 07:28:24 2025 from 184.145.46.175
```
Next, use the `rsync` command to copy files across to the Killarney cluster. In the following example, I'm copying the contents of a folder called `my_projects` to my Killarney home directory.
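The rsync command itself is elided from this excerpt; a minimal sketch (the Killarney hostname and flags are assumptions) is shown below, followed by the Duo prompt and rsync output from the original session.

```
rsync -avz ~/my_projects username@killarney.alliancecan.ca:~/
```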
```
Enter a passcode or select one of the following options:

 1. Duo Push to Phone

Passcode or option (1-1): 1
Success. Logging you in...
sending incremental file list
[...]
```
# Killarney GPU resources
There are two main types of GPU resources on the Killarney cluster: capacity GPUs (NVIDIA L40S) and high-performance GPUs (NVIDIA H100).
Since the cluster has many users and limited resources, we use the Slurm job scheduler.
The Alliance documentation provides lots of general information about submitting jobs using the Slurm job scheduler: [https://docs.alliancecan.ca/wiki/Running_jobs](https://docs.alliancecan.ca/wiki/Running_jobs)

For some example Slurm workloads specific to the Killarney cluster (sbatch files, resource configurations, software environments, etc.), see the [Slurm examples](../slurm-examples) provided in this repo.

## View jobs in a Slurm cluster (squeue)

To view all the jobs currently in the cluster, whether running, pending or failed, use `squeue` ([https://slurm.schedmd.com/squeue.html](https://slurm.schedmd.com/squeue.html)):
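The example output is elided from this excerpt; a couple of common invocations (a minimal sketch):

```
squeue            # all jobs on the cluster
squeue -u $USER   # only your own jobs
```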
Refer to the [squeue manual page](https://slurm.schedmd.com/squeue.html) for a full list of options.
## Submit a new Slurm job (sbatch)

To ask Slurm to run your jobs in the background, so your job keeps running even after you log off, use `sbatch` ([https://slurm.schedmd.com/sbatch.html](https://slurm.schedmd.com/sbatch.html)).

To use sbatch, you need to create a file, specify the configurations within (you can also specify these on the command line), and then run `sbatch my_sbatch_slurm.sh` to get Slurm to schedule it. **Note**: You cannot submit jobs from your home directory. You need to submit them from a scratch or project folder.
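For example (a minimal sketch; the folder name is hypothetical):

```
cd $SCRATCH/my_experiment    # submit from scratch or a project folder, not from $HOME
sbatch my_sbatch_slurm.sh
```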
Example Hello World sbatch file (hello_world.sh):

```
#!/bin/bash
#SBATCH --job-name=hello_world_example
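# The remainder of this example is elided from the excerpt; the directives and
# command below are a hedged sketch of a typical completion, not the guide's
# exact contents. %j is replaced by the job ID (see the note below).
#SBATCH --output=hello_world.%j.out
#SBATCH --error=hello_world.%j.err
#SBATCH --time=00:05:00

echo "Hello World from $(hostname)"
```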
Since Slurm runs your job in the background, it becomes really difficult to see its output while it runs.
Note that the %j in output and error configuration tells Slurm to substitute the job ID where the %j is. So if your job ID is 1234 then your output file will be `hello_world.1234.out` and your error file will be `hello_world.1234.err`.
## Interactive sessions (srun)

If all you want is an interactive session on a GPU node (without the batch job), just use `srun` ([https://slurm.schedmd.com/srun.html](https://slurm.schedmd.com/srun.html)).
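The full example is elided from this excerpt; a minimal sketch of requesting an interactive shell (the exact resource flags are assumptions) is shown below, followed by the tail of the original session.

```
srun --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=1:00:00 --pty bash
```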
```
srun: job 501831 has been allocated resources
username@kn138:/project
```
After you see `$USER@kn###`, you can use this shell session interactively.
The Killarney cluster has both NVIDIA L40S and H100 GPUs available. To request a specific GPU type, use the `--gres=gpu` flag, for example:
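The example itself is elided from this excerpt; a minimal sketch (the gres type names `l40s` and `h100` are assumptions based on the GPU models above) is shown below, followed by the tail of the original session.

```
srun --gres=gpu:l40s:1 --pty bash   # capacity L40S GPU
srun --gres=gpu:h100:1 --pty bash   # high-performance H100 GPU
```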
```
srun: job 581667 has been allocated resources
username@kn178:/scratch$
```
Be careful about choosing the correct number of GPUs and correct time limit. The Slurm scheduler will "bill" you for resources used, so a resource-heavy job will reduce your future priority.

## View cluster resource utilization (sinfo)

To see availability at a more granular scale, use `sinfo` ([https://slurm.schedmd.com/sinfo.html](https://slurm.schedmd.com/sinfo.html)). For example:
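The example output is elided from this excerpt; a minimal sketch (the partition name is taken from the sinfo output shown elsewhere in this guide):

```
sinfo                      # list all partitions and node states
sinfo -p gpubase_l40s_b5   # narrow down to a single partition
```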
The cluster comes with preinstalled software environments called **modules**. These will allow you to access many different versions of Python, VS Code Server, RStudio Server, NodeJS and many others.
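A few typical module commands (a sketch; the exact module names and versions available on Killarney may differ):

```
module avail              # list every available module and version
module load python/3.10   # load a specific Python version (the version number is an assumption)
module list               # show what is currently loaded
```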
## Automatic Restarts
When a job exceeds its time limit, it will get stopped by the Slurm scheduler. For longer running jobs which need more than a few hours, see our [Timeout Requeue](../slurm-examples/timeout-requeue/) example which shows how to automatically requeue your job.
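The linked example is the authoritative reference; as a rough sketch of the general pattern (not the exact contents of that example), a job can catch a warning signal shortly before its time limit and requeue itself:

```
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@60   # send SIGUSR1 to this script 60 seconds before the time limit
#SBATCH --requeue

# On SIGUSR1, put the job back in the queue and exit cleanly
trap 'echo "Time limit approaching, requeueing"; scontrol requeue $SLURM_JOB_ID; exit 0' USR1

srun python train.py &   # train.py is a hypothetical workload, run in the background...
wait                     # ...so the trap can fire while we wait for it
```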
## Checkpoints
In order to avoid losing your work when your job exits, you will need to implement checkpointing.
On the legacy Bon Echo cluster, there was a dedicated checkpoint space in the file system for checkpoints. **⚠️ In Killarney, there is no dedicated checkpoint space.** Users are expected to manage their own checkpoints under their `$SCRATCH` folder.