
Commit dfa241a

Author: Mark Coatsworth (committed)

Big cleanup of introduction-to-vector-compute README

1 parent 213a741, commit dfa241a

File tree: getting-started/introduction-to-vector-compute

1 file changed: +24, -106 lines

getting-started/introduction-to-vector-compute/README.md

Lines changed: 24 additions & 106 deletions
@@ -17,10 +17,9 @@ This guide covers important details and examples for accessing and using Vector'
 - [Shared datasets](#shared-datasets)
 - [Shared model weights](#shared-model-weights)
 - [Training checkpoints](#training-checkpoints)
-- [Migration from legacy Vaughan (Bon Echo) Cluster](#migration-from-legacy-vaughan-bon-echo-cluster)
 - [Killarney GPU resources](#killarney-gpu-resources)
 - [Using Slurm](#using-slurm)
-- [View jobs in the Slurm cluster (squeue)](#view-jobs-in-the-slurm-cluster-squeue)
+- [View jobs in a Slurm cluster (squeue)](#view-jobs-in-a-slurm-cluster-squeue)
 - [Submit a new Slurm job (sbatch)](#submit-a-new-slurm-job-sbatch)
 - [Interactive sessions (srun)](#interactive-sessions-srun)
 - [SSH to sbatch job](#ssh-to-sbatch-job)
@@ -31,28 +30,29 @@ This guide covers important details and examples for accessing and using Vector'
 - [Tiers](#tiers)
 - [Automatic Restarts](#automatic-restarts)
 - [Checkpoints](#checkpoints)
-- [Useful Links and Resources](#useful-links-and-resources)
 - [Support](#support)
 
 # Logging onto Killarney
 
-The Alliance documentation provides lots of general information about accessing the Killarney cluster: [https://docs.alliancecan.ca/wiki/SSH](https://docs.alliancecan.ca/wiki/SSH)
+The Alliance documentation provides some general information about the Killarney cluster: [https://docs.alliancecan.ca/wiki/Killarney](https://docs.alliancecan.ca/wiki/Killarney)
 
 ## Getting an Account
 
-Please read the [user account guide](https://support.vectorinstitute.ai/Killarney?action=AttachFile&do=view&target=User+Guide+to+Killarney+for+Vector+Researchers.pdf) for full information about getting a Killarney account.
+To log into Killarney, the first thing you need is an account. Please read the Alliance's page [Apply for a CCDB account](https://www.alliancecan.ca/en/our-services/advanced-research-computing/account-management/apply-account) for all the steps involved.
 
 ## Public Key Setup
 
-For SSH access, you need to add a public key in your Alliance Canada account.
+For SSH access, you **must** add a public key to your Alliance Canada account. The Alliance provides full instructions for this at [https://docs.alliancecan.ca/wiki/SSH_Keys](https://docs.alliancecan.ca/wiki/SSH_Keys).
 
-On the computer you'll be connecting from, generate a SSH key pair with the following command. When prompted, use the default file name and leave the passphrase empty.
+The following steps are a distilled version of these instructions for macOS and Linux.
+
+On the personal computer you'll be connecting from, generate an SSH key pair with the following command. When prompted, use the default file name and leave the passphrase empty.
 
 ```
 ssh-keygen -t ed25519 -C "[email protected]"
 ```
 
-Output the key into your terminal window:
+Output the public key into your terminal window:
 
 ```
 cat ~/.ssh/id_ed25519.pub
@@ -69,8 +69,7 @@ Next, open the SSH Keys page in your Alliance account: [https://ccdb.alliancecan
 
 ## SSH Access
 
-From a terminal, use the `ssh` command to log onto the cluster via [killarney.alliancecan.ca](killarney.alliancecan.ca):
-
+From a terminal, use the `ssh` command to log onto the cluster via **killarney.alliancecan.ca**:
 
 ```
 username@my-desktop:~$ ssh username@killarney.alliancecan.ca
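
If you connect regularly, an entry in your local `~/.ssh/config` saves retyping the full hostname. A minimal sketch, assuming the default key path from the setup above and `username` as a placeholder CCDB username:

```
# ~/.ssh/config on your personal computer (illustrative values)
Host killarney
    HostName killarney.alliancecan.ca
    User username                  # replace with your CCDB username
    IdentityFile ~/.ssh/id_ed25519
```

With this in place, `ssh killarney` behaves the same as the full command above.
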
@@ -106,12 +105,11 @@ The hostname **killarney.alliancecan.ca** load balances ssh connections across t
 
 # Killarney File System Intro
 
-
 ## Home directories
 
 When you first log onto the Killarney cluster, you will land in your home directory. This can be accessed at: `/home/username` or just `~/`
 
-Home directories have 50 GB of storage space. To check the amount of free space in your home directory, use the` diskusage_report `command:
+Home directories have 50 GB of storage space. To check the amount of free space in your home directory, use the `diskusage_report` command:
 
 
 ```
@@ -123,98 +121,37 @@ username@klogin02:~$ diskusage_report
 
 ## Scratch space
 
-In addition to your home directory, you have a minimum of additional 250 GB scratch space (up to 2 TB, depending on your user level) available in the following location: `/scratch/$USER` or simply` $SCRATCH.`
+In addition to your home directory, you have a minimum of an additional 250 GB of scratch space (up to 2 TB, depending on your user level) available in the following location: `/scratch/$USER` or simply `$SCRATCH`.
 
 **⚠️ Unlike your home directory, this scratch space is temporary. It will get automatically purged of files that have not been accessed in 60 days.**
 
 A detailed description of the scratch purging policy is available on the Alliance Canada website: [https://docs.alliancecan.ca/wiki/Scratch_purging_policy](https://docs.alliancecan.ca/wiki/Scratch_purging_policy)
 
-
 ## Shared projects
 
-For collaborative projects where many people need access to the same files, you need a shared project space. These are generally stored at `/project`
+For collaborative projects where many people need access to the same files, you need a shared project space. These are stored at `/project`.
 
 To set up a shared project space, send a request to [[email protected]](mailto:[email protected]). Describe what the project is about, which users need access, how much disk space you need, and an end date when it can be removed.
 
-
 ## Shared datasets
 
-To reduce the storage footprint for each user, we've made various commonly-used datasets like MIMIC and IMAGENET available for everyone to use. These are generally stored at /datasets
-
-Instead of copying these datasets on your home directory, you can create a symlink via
+To reduce the storage footprint for each user, we've made various commonly-used datasets like MIMIC and IMAGENET available for everyone to use. These are stored at `/datasets`.
 
+Instead of copying these datasets into your home directory, you can create a symlink via:
 
 ```
 ln -s /datasets/PATH_TO_DATASET ~/PATH_OF_LINK # the link can be somewhere in your home directory so that PyTorch/TF can pick up the dataset from these already-downloaded directories
 ```
 
-
-For a list of available datasets please see [Current Datasets](https://support.vectorinstitute.ai/CurrentDatasets)
-
-
 ## Shared model weights
 
-Similar to datasets, model weights are typically very large and can be shared among many users. We've made various common model weights such as Llama3, Mixtral and Stable Diffusion available at /`model-weights`
-
+Similar to datasets, model weights are typically very large and can be shared among many users. We've made various common model weights such as Llama3, Mixtral and Stable Diffusion available at `/model-weights`.
 
 ## Training checkpoints
 
 Unlike the legacy Bon Echo (Vaughan) cluster, there is no dedicated checkpoint space in the Killarney cluster. Now that the `$SCRATCH` space has been greatly expanded, please use this for any training checkpoints.
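
Since checkpoints live in scratch, one simple convention (a sketch; the experiment and folder names are placeholders) is to give each run its own directory under `$SCRATCH` and symlink it from your home directory:

```
# Keep checkpoints in scratch, grouped per experiment
CKPT_DIR=$SCRATCH/checkpoints/my_experiment
mkdir -p "$CKPT_DIR"

# Optional: convenient access from your home directory
ln -s "$CKPT_DIR" ~/my_experiment_checkpoints
```

Keep in mind that the 60-day scratch purge applies to these files as well, so copy anything you want to keep long-term into a project space.
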
 
 
-# Migration from legacy Vaughan (Bon Echo) Cluster
-
-The easiest way to migrate data from the legacy Vaughan (Bon Echo) Cluster to Killarney is by using a file transfer command (likely `rsync` or `scp`) from an SSH session.
-
-Start by connecting via ssh into the legacy Bon Echo (Vaughan) cluster:
-
-
-```
-username@my-desktop:~$ ssh v.vectorinstitute.ai
-Password:
-Duo two-factor login for username
-
-Enter a passcode or select one of the following options:
-
-1. Duo Push to XXX-XXX-3089
-2. SMS passcodes to XXX-XXX-3089
-
-Passcode or option (1-2): 1
-Success. Logging you in...
-Welcome to the Vector Institute HPC - Vaughan Cluster
-
-Login nodes are shared among many users and therefore
-must not be used to run computationally intensive tasks.
-Those should be submitted to the slurm scheduler which
-will dispatch them on compute nodes.
-
-For more information, please consult the wiki at
-https://support.vectorinstitute.ai/Computing
-For issues using this cluster, please contact us at
-
-If you forget your password, please visit our self-
-service portal at https://password.vectorinstitute.ai.
-
-Last login: Mon Aug 18 07:28:24 2025 from 184.145.46.175
-```
-
-Next, use the `rsync` command to copy files across to the Killarney cluster. In the following example, I'm copying the contents of a folder called `my_projects` to my Killarney home directory.
-
-```
-username@v4:~$ cd ~/my_projects
-username@v4:~/my_projects$ rsync -avz * username@killarney.alliancecan.ca:~/my_projects
-Duo two-factor login for username
-
-Enter a passcode or select one of the following options:
-
-1. Duo Push to Phone
-
-Passcode or option (1-1): 1
-Success. Logging you in...
-sending incremental file list
-[...]
-```
-
 # Killarney GPU resources
 
 There are two main types of GPU resources on the Killarney cluster: capacity GPUs (NVIDIA L40S) and high-performance GPUs (NVIDIA H100).
@@ -228,13 +165,11 @@ Since the cluster has many users and limited resources, we use the Slurm job sch
 
 The Alliance documentation provides lots of general information about submitting jobs using the Slurm job scheduler: [https://docs.alliancecan.ca/wiki/Running_jobs](https://docs.alliancecan.ca/wiki/Running_jobs)
 
-For some example Slurm workloads specific to the Killarney cluster (sbatch files, resource configurations, software environments, etc.) see the (../slurm-examples)[slurm-examples] provided in this repo.
+For some example Slurm workloads specific to the Killarney cluster (sbatch files, resource configurations, software environments, etc.) see the [Slurm examples](../slurm-examples) provided in this repo.
 
+## View jobs in a Slurm cluster (squeue)
 
-## View jobs in the Slurm cluster (squeue)
-
-To view all the jobs currently in the cluster, either running, pending or failed, use **squeue**: ([https://slurm.schedmd.com/squeue.html](https://slurm.schedmd.com/squeue.html))
-
+To view all the jobs currently in a cluster, whether running, pending or failed, use `squeue` ([https://slurm.schedmd.com/squeue.html](https://slurm.schedmd.com/squeue.html)):
 
 ```
 username@klogin01:~$ squeue
@@ -263,16 +198,14 @@ username@login01:~$ $ squeue --me
 
 Refer to the [squeue manual page](https://slurm.schedmd.com/squeue.html) for a full list of options.
 
-
 ## Submit a new Slurm job (sbatch)
 
-To ask Slurm to run your jobs in the background so you can have your job running, even after logging off, use sbatch https://slurm.schedmd.com/sbatch.html
+To have Slurm run your job in the background, so it keeps running even after you log off, use `sbatch`: https://slurm.schedmd.com/sbatch.html
 
-To use sbatch, you need to create a file, specify the configurations within (you can also specify these on the command line) and then run `sbatch my_sbatch_slurm.sh` to get Slurm to schedule it.
+To use sbatch, you need to create a file, specify the configuration within it (you can also specify these options on the command line) and then run `sbatch my_sbatch_slurm.sh` to get Slurm to schedule it. **Note**: You cannot submit jobs from your home directory. You need to submit them from a scratch or project folder.
 
 Example Hello World sbatch file (hello_world.sh):
 
-
 ```
 #!/bin/bash
 #SBATCH --job-name=hello_world_example
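
Submitting the script is then a single command. A small sketch of the submit-and-check cycle, run from a scratch folder as the note above requires (the folder name is illustrative):

```
cd $SCRATCH/hello_world_example   # submit from scratch or project space, not your home directory
sbatch hello_world.sh             # prints: Submitted batch job 1234
squeue --me                       # confirm the job is pending or running
cat hello_world.1234.out          # read the output once the job finishes, using the printed job ID
```
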
@@ -302,7 +235,6 @@ Since Slurm runs your job in the background, it becomes really difficult to see
 
 Note that the `%j` in the output and error configuration tells Slurm to substitute the job ID wherever `%j` appears. So if your job ID is 1234 then your output file will be `hello_world.1234.out` and your error file will be `hello_world.1234.err`.
 
-
 ## Interactive sessions (srun)
 
 If all you want is an interactive session on a GPU node (without the batch job), just use `srun` (https://slurm.schedmd.com/srun.html)
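
For instance, a request along the following lines starts an interactive shell on a GPU node; the GPU type and resource amounts are illustrative, not a prescribed configuration:

```
# Ask for one L40S GPU, a few CPU cores and some memory for one hour, then open a shell
srun --gres=gpu:l40s:1 --cpus-per-task=4 --mem=16G --time=1:00:00 --pty bash
```

Once the prompt changes to a `kn###` compute node, you are inside the allocation.
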
@@ -324,8 +256,7 @@ srun: job 501831 has been allocated resources
 username@kn138:/project
 ```
 
-After you see $USER@kn###, you can run your script interactively.
-
+After you see `$USER@kn###`, you can use this shell session interactively.
 
 ## SSH to sbatch job
 
@@ -347,7 +278,6 @@ username@klogin01:~/scratch/imagenet$ srun --pty --overlap --jobid 937373 -w kn0
 username@kn060:~/scratch/imagenet$
 ```
 
-
 ## Accessing specific GPUs
 
 The Killarney cluster has both NVIDIA L40S and H100 GPUs available. To request a specific GPU type, use the `--gres=gpu` flag, for example:
@@ -367,12 +297,12 @@ srun: job 581667 has been allocated resources
 username@kn178:/scratch$
 ```
 
+Be careful to choose the correct number of GPUs and a realistic time limit. The Slurm scheduler will "bill" you for the resources used, so a resource-heavy job will reduce your future priority.
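
In practice, that caution comes down to the resource lines in your sbatch header. A sketch with illustrative sizes (the GPU type name follows the L40S naming used elsewhere in this guide):

```
# Request only what the job actually needs: one L40S GPU, four cores, 32 GB, two hours
#SBATCH --gres=gpu:l40s:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=02:00:00
```
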
 
 ## View cluster resource utilization (sinfo)
 
 To see availability at a more granular scale, use sinfo ([https://slurm.schedmd.com/sinfo.html](https://slurm.schedmd.com/sinfo.html)). For example:
 
-
 ```
 sinfo -N --Format=Partition,CPUsState,GresUsed,Gres
 ```
@@ -400,6 +330,7 @@ gpubase_l40s_b3 32/32/0/64 gpu:l40s:4(IDX:0-3) gpu:l40s:4
 [...]
 ```
 
+
 # Software Environments
 
 The cluster comes with preinstalled software environments called **modules**. These will allow you to access many different versions of Python, VS Code Server, RStudio Server, NodeJS and many others.
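
Working with modules follows the usual `module` workflow. A minimal sketch, with the module name and version as placeholders rather than a guaranteed list for Killarney:

```
module avail python        # list the Python modules installed on the cluster
module load python/3.10    # load one of the listed versions (placeholder version number)
module list                # show what is currently loaded
module purge               # unload everything for a clean slate
```
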
@@ -470,7 +401,7 @@ gpubase_l40s_b5 up 7-00:00:00 17/0/0/17 kn[085-101]
 
 ## Automatic Restarts
 
-All jobs in our Slurm cluster have a time limit, after which they will get stopped. For longer running jobs which need more than a few hours, the [Vaughan Slurm Changes](https://support.vectorinstitute.ai/Computing?action=AttachFile&do=view&target=Vector+Vaughan+HPC+Changes+FAQ+2023.pdf) document describes how to automatically restart these.
+When a job exceeds its time limit, it will get stopped by the Slurm scheduler. For longer-running jobs that need more than a few hours, see our [Timeout Requeue](../slurm-examples/timeout-requeue/) example, which shows how to automatically requeue your job.
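
Automatic restarts and checkpoints usually work together: trap Slurm's end-of-time warning signal, requeue the job, and rely on checkpoints in `$SCRATCH` to resume. A rough sketch of that pattern (the signal timing, paths and training script are assumptions, not the repo's timeout-requeue example):

```
#!/bin/bash
#SBATCH --job-name=long_training
#SBATCH --time=04:00:00
#SBATCH --requeue                  # allow this job to be requeued
#SBATCH --signal=B:USR1@300        # warn the batch script ~5 minutes before the time limit

CKPT_DIR=$SCRATCH/checkpoints/long_training
mkdir -p "$CKPT_DIR"

# On the warning signal, requeue this same job so it restarts with a fresh time limit
trap 'echo "Time limit approaching, requeueing"; scontrol requeue $SLURM_JOB_ID; exit 0' USR1

# Run the workload in the background so the trap can fire, then wait for it
srun python train.py --checkpoint-dir "$CKPT_DIR" &   # train.py is a hypothetical script that saves and resumes from checkpoints
wait
```
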
 
 ## Checkpoints
 
@@ -479,19 +410,6 @@ In order to avoid losing your work when your job exits, you will need to impleme
 On the legacy Bon Echo cluster, there was a dedicated space in the file system for checkpoints. **⚠️ In Killarney, there is no dedicated checkpoint space.** Users are expected to manage their own checkpoints under their `$SCRATCH` folder.
 
 
-# Useful Links and Resources
-
-Computing parent page: https://support.vectorinstitute.ai/Computing
-
-Vaughan valid partition/qos: https://support.vectorinstitute.ai/Vaughan_slurm_changes
-
-Checkpointing: https://support.vectorinstitute.ai/CheckpointExample
-
-Slurm Scheduler: https://support.vectorinstitute.ai/slurm_fairshare
-
-FAQ: https://support.vectorinstitute.ai/FAQ%20about%20the%20cluster
-
-
 # Support
 
 For any cluster issues, please email [email protected].
