This guide covers important details and examples for accessing and using Vector's Killarney cluster.
- [Tiers](#tiers)
- [Automatic Restarts](#automatic-restarts)
- [Checkpoints](#checkpoints)
- [Support](#support)

# Logging onto Killarney

The Alliance documentation provides some general information about the Killarney cluster: [https://docs.alliancecan.ca/wiki/Killarney](https://docs.alliancecan.ca/wiki/Killarney)

## Getting an Account

To log into Killarney, the first thing you need is an account. Please read the Alliance's page [Apply for a CCDB account](https://www.alliancecan.ca/en/our-services/advanced-research-computing/account-management/apply-account) for all the steps needed here.
## Public Key Setup

For SSH access, you **must** add a public key to your Alliance Canada account. The Alliance provides full instructions for this at [https://docs.alliancecan.ca/wiki/SSH_Keys](https://docs.alliancecan.ca/wiki/SSH_Keys).

The following steps are a distilled version of these instructions for macOS and Linux.

On the personal computer you'll be connecting from, generate an SSH key pair with the following command. When prompted, use the default file name and leave the passphrase empty.
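The command itself is elided from this excerpt; as a minimal sketch using the standard `ssh-keygen` tool (the key type shown is an assumption):

```
# Generate a key pair; press Enter for the default file name and leave the passphrase empty
ssh-keygen -t ed25519

# Print the public key, which you will then add to your Alliance (CCDB) account
cat ~/.ssh/id_ed25519.pub
```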
In addition to your home directory, you have a minimum of 250 GB of additional scratch space (up to 2 TB, depending on your user level) available in the following location: `/scratch/$USER`, or simply `$SCRATCH`.
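For example, to confirm where your scratch space is and how much of it you are using (a small sketch with standard shell commands):

```
echo $SCRATCH     # prints /scratch/<your username>
du -sh $SCRATCH   # total size of everything in your scratch space
```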
**⚠️ Unlike your home directory, this scratch space is temporary. It will get automatically purged of files that have not been accessed in 60 days.**

A detailed description of the scratch purging policy is available on the Alliance Canada website: [https://docs.alliancecan.ca/wiki/Scratch_purging_policy](https://docs.alliancecan.ca/wiki/Scratch_purging_policy)
## Shared projects
For collaborative projects where many people need access to the same files, you need a shared project space. These are stored at `/project`.

To set up a shared project space, send a request to [[email protected]](mailto:[email protected]). Describe what the project is about, which users need access, how much disk space you need, and an end date when it can be removed.
## Shared datasets
To reduce the storage footprint for each user, we've made various commonly-used datasets like MIMIC and IMAGENET available for everyone to use. These are stored at `/datasets`.

Instead of copying these datasets into your home directory, you can create a symlink via:
```
ln -s /datasets/PATH_TO_DATASET ~/PATH_OF_LINK  # the link can live somewhere in your home directory so that PyTorch/TF can pick up the already-downloaded dataset
```
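As a concrete illustration (the dataset directory name below is hypothetical; list `/datasets` to see what is actually available):

```
ls /datasets                                # see which datasets are available
mkdir -p ~/data
ln -s /datasets/imagenet ~/data/imagenet    # "imagenet" is a hypothetical directory name
```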
For a list of available datasets, please see [Current Datasets](https://support.vectorinstitute.ai/CurrentDatasets).

## Shared model weights

Similar to datasets, model weights are typically very large and can be shared among many users. We've made various common model weights such as Llama3, Mixtral and Stable Diffusion available at `/model-weights`.
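For example, you can link a set of weights into your home directory in the same way as a dataset (the directory name below is hypothetical; list `/model-weights` to see what is actually available):

```
ls /model-weights                                 # see which model weights are available
mkdir -p ~/models
ln -s /model-weights/Llama-3-8B ~/models/llama3   # "Llama-3-8B" is a hypothetical directory name
```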
## Training checkpoints
Unlike the legacy Bon Echo (Vaughan) cluster, there is no dedicated checkpoint space in the Killarney cluster. Now that the `$SCRATCH` space has been greatly expanded, please use this for any training checkpoints.
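For example, a minimal sketch of keeping checkpoints in scratch (the directory and flag names are hypothetical):

```
mkdir -p $SCRATCH/checkpoints/my_run
# then point your training script at it, e.g.:
# python train.py --checkpoint-dir $SCRATCH/checkpoints/my_run
```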
# Migration from legacy Vaughan (Bon Echo) Cluster
The easiest way to migrate data from the legacy Vaughan (Bon Echo) Cluster to Killarney is by using a file transfer command (likely `rsync` or `scp`) from an SSH session.

Start by connecting via SSH into the legacy Bon Echo (Vaughan) cluster:
```
username@my-desktop:~$ ssh v.vectorinstitute.ai
Password:
Duo two-factor login for username

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-3089
 2. SMS passcodes to XXX-XXX-3089

Passcode or option (1-2): 1
Success. Logging you in...
Welcome to the Vector Institute HPC - Vaughan Cluster

Login nodes are shared among many users and therefore
must not be used to run computationally intensive tasks.
Those should be submitted to the slurm scheduler which
will dispatch them on compute nodes.

For more information, please consult the wiki at
https://support.vectorinstitute.ai/Computing
For issues using this cluster, please contact us at
If you forget your password, please visit our self-
service portal at https://password.vectorinstitute.ai.

Last login: Mon Aug 18 07:28:24 2025 from 184.145.46.175
```
Next, use the `rsync` command to copy files across to the Killarney cluster. In the following example, I'm copying the contents of a folder called `my_projects` to my Killarney home directory.
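The rsync command itself is elided from this excerpt; a minimal sketch (the Killarney hostname and flags are assumptions) is shown below, followed by the Duo prompt and rsync output from the original session.

```
rsync -avz ~/my_projects username@killarney.alliancecan.ca:~/
```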
```
Enter a passcode or select one of the following options:

 1. Duo Push to Phone

Passcode or option (1-1): 1
Success. Logging you in...
sending incremental file list
[...]
```
# Killarney GPU resources
There are two main types of GPU resources on the Killarney cluster: capacity GPUs (NVIDIA L40S) and high-performance GPUs (NVIDIA H100).
Since the cluster has many users and limited resources, we use the Slurm job scheduler.
The Alliance documentation provides lots of general information about submitting jobs using the Slurm job scheduler: [https://docs.alliancecan.ca/wiki/Running_jobs](https://docs.alliancecan.ca/wiki/Running_jobs)

For some example Slurm workloads specific to the Killarney cluster (sbatch files, resource configurations, software environments, etc.), see the [Slurm examples](../slurm-examples) provided in this repo.

## View jobs in a Slurm cluster (squeue)

To view all the jobs currently in the cluster, whether running, pending or failed, use `squeue` ([https://slurm.schedmd.com/squeue.html](https://slurm.schedmd.com/squeue.html)):
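The example output is elided from this excerpt; a couple of common invocations (a minimal sketch):

```
squeue            # all jobs on the cluster
squeue -u $USER   # only your own jobs
```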
Refer to the [squeue manual page](https://slurm.schedmd.com/squeue.html) for a full list of options.
## Submit a new Slurm job (sbatch)

To ask Slurm to run your jobs in the background, so your job keeps running even after you log off, use `sbatch` ([https://slurm.schedmd.com/sbatch.html](https://slurm.schedmd.com/sbatch.html)).

To use sbatch, you need to create a file, specify the configurations within (you can also specify these on the command line), and then run `sbatch my_sbatch_slurm.sh` to get Slurm to schedule it. **Note**: You cannot submit jobs from your home directory. You need to submit them from a scratch or project folder.
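For example (a minimal sketch; the folder name is hypothetical):

```
cd $SCRATCH/my_experiment    # submit from scratch or a project folder, not from $HOME
sbatch my_sbatch_slurm.sh
```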
Example Hello World sbatch file (hello_world.sh):

```
#!/bin/bash
#SBATCH --job-name=hello_world_example
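# The remainder of this example is elided from the excerpt; the directives and
# command below are a hedged sketch of a typical completion, not the guide's
# exact contents. %j is replaced by the job ID (see the note below).
#SBATCH --output=hello_world.%j.out
#SBATCH --error=hello_world.%j.err
#SBATCH --time=00:05:00

echo "Hello World from $(hostname)"
```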
Since Slurm runs your job in the background, it becomes really difficult to see its output while it runs.
Note that the %j in output and error configuration tells Slurm to substitute the job ID where the %j is. So if your job ID is 1234 then your output file will be `hello_world.1234.out` and your error file will be `hello_world.1234.err`.
## Interactive sessions (srun)

If all you want is an interactive session on a GPU node (without the batch job), just use `srun` ([https://slurm.schedmd.com/srun.html](https://slurm.schedmd.com/srun.html)).
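The full example is elided from this excerpt; a minimal sketch of requesting an interactive shell (the exact resource flags are assumptions) is shown below, followed by the tail of the original session.

```
srun --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=1:00:00 --pty bash
```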
```
srun: job 501831 has been allocated resources
username@kn138:/project
```
After you see `$USER@kn###`, you can use this shell session interactively.
The Killarney cluster has both NVIDIA L40S and H100 GPUs available. To request a specific GPU type, use the `--gres=gpu` flag, for example:
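The example itself is elided from this excerpt; a minimal sketch (the gres type names `l40s` and `h100` are assumptions based on the GPU models above) is shown below, followed by the tail of the original session.

```
srun --gres=gpu:l40s:1 --pty bash   # capacity L40S GPU
srun --gres=gpu:h100:1 --pty bash   # high-performance H100 GPU
```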
```
srun: job 581667 has been allocated resources
username@kn178:/scratch$
```
Be careful about choosing the correct number of GPUs and correct time limit. The Slurm scheduler will "bill" you for resources used, so a resource-heavy job will reduce your future priority.

## View cluster resource utilization (sinfo)

To see availability at a more granular scale, use `sinfo` ([https://slurm.schedmd.com/sinfo.html](https://slurm.schedmd.com/sinfo.html)). For example:
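The example output is elided from this excerpt; a minimal sketch (the partition name is taken from the sinfo output shown elsewhere in this guide):

```
sinfo                      # list all partitions and node states
sinfo -p gpubase_l40s_b5   # narrow down to a single partition
```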
The cluster comes with preinstalled software environments called **modules**. These will allow you to access many different versions of Python, VS Code Server, RStudio Server, NodeJS and many others.
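A few typical module commands (a sketch; the exact module names and versions available on Killarney may differ):

```
module avail              # list every available module and version
module load python/3.10   # load a specific Python version (the version number is an assumption)
module list               # show what is currently loaded
```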
## Automatic Restarts
When a job exceeds its time limit, it will get stopped by the Slurm scheduler. For longer running jobs which need more than a few hours, see our [Timeout Requeue](../slurm-examples/timeout-requeue/) example which shows how to automatically requeue your job.
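The linked example is the authoritative reference; as a rough sketch of the general pattern (not the exact contents of that example), a job can catch a warning signal shortly before its time limit and requeue itself:

```
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@60   # send SIGUSR1 to this script 60 seconds before the time limit
#SBATCH --requeue

# On SIGUSR1, put the job back in the queue and exit cleanly
trap 'echo "Time limit approaching, requeueing"; scontrol requeue $SLURM_JOB_ID; exit 0' USR1

srun python train.py &   # train.py is a hypothetical workload, run in the background...
wait                     # ...so the trap can fire while we wait for it
```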
## Checkpoints
In order to avoid losing your work when your job exits, you will need to implement checkpointing.
On the legacy Bon Echo cluster, there was a dedicated checkpoint space in the file system for checkpoints. **⚠️ In Killarney, there is no dedicated checkpoint space.** Users are expected to manage their own checkpoints under their `$SCRATCH` folder.