Commit 2686e18

Merge pull request #11 from VectorInstitute/proofread

Proofread, which turned into a complete fix of the templates, which were not working in their current state.

2 parents dfa241a + 0e03282 · commit 2686e18

File tree

26 files changed: +451 −211 lines

README.md

Lines changed: 1 addition & 1 deletion

@@ -13,7 +13,7 @@ See [getting-started](./getting-started) for documentation on using Vector compu

 ## Templates

-See [templates](./templates) for training templates with Hydra + Submitit.
+See [templates](./templates) for training templates that use Hydra + Submitit to structure experiments.

 - Code lives under: [templates/src](./templates/src)
 - Cluster configs live under: [templates/configs](./templates/configs)

getting-started/introduction-to-vector-compute/README.md

Lines changed: 65 additions & 3 deletions
@@ -66,6 +66,7 @@ ssh-ed25519 AAAA5AA7OZOZ7NRB1acK54bB47h58N6AIEX4zDziR1r0nM41d3NCG0fgCArjUD45pr13

 Next, open the SSH Keys page in your Alliance account: [https://ccdb.alliancecan.ca/ssh_authorized_keys](https://ccdb.alliancecan.ca/ssh_authorized_keys). Paste your key into the SSH Key field, give it a name (typically the host name of the computer where you generated it) and hit Add Key.

+**NOTE:** You may need to wait up to 30 minutes after adding your SSH key before it works for logging in via SSH. Have lunch and come back.
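For reference, generating an Ed25519 key pair non-interactively might look like the following sketch. The key filename is a hypothetical choice, and `-N ""` creates the key without a passphrase; add one if you prefer.

```shell
# Generate an Ed25519 key pair; the path is a hypothetical choice.
ssh-keygen -t ed25519 -N "" -f "$HOME/.ssh/id_ed25519_ccdb" -q
# Print the public half to paste into the CCDB SSH Keys page.
cat "$HOME/.ssh/id_ed25519_ccdb.pub"
```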
 ## SSH Access

@@ -127,6 +128,8 @@ In addition to your home directory, you have a minimum of additional 250 GB scra

 A detailed description of the scratch purging policy is available on the Alliance Canada website: [https://docs.alliancecan.ca/wiki/Scratch_purging_policy](https://docs.alliancecan.ca/wiki/Scratch_purging_policy)

+Your scratch space directory will not exist when you initially log in. To have it set up, send a request to [[email protected]](mailto:[email protected]). Include the name of your PI in the email.
+
 ## Shared projects

 For collaborative projects where many people need access to the same files, you need a shared project space. These are stored at `/project`.
@@ -140,7 +143,7 @@ To reduce the storage footprint for each user, we've made various commonly-used

 Instead of copying these datasets to your home directory, you can create a symlink via:

 ```
-ln -s /dataset/PATH_TO_DATASET ~/PATH_OF_LINK # path of link can be some place in your home directory so that PyTorch/TF can pick up the dataset to these already downloaded directories.
+ln -s /datasets/PATH_TO_DATASET ~/PATH_OF_LINK # the link can live anywhere in your home directory, so PyTorch/TF can pick up the already-downloaded dataset.
 ```

 ## Shared model weights
@@ -152,6 +155,61 @@ Similar to datasets, model weights are typically very large and can be shared am

 Unlike the legacy Bon Echo (Vaughan) cluster, there is no dedicated checkpoint space in the Killarney cluster. Now that the `$SCRATCH` space has been greatly expanded, please use this for any training checkpoints.

+# Migration from legacy Vaughan (Bon Echo) Cluster
+
+**NOTE:** The migration approach detailed here requires that you set up a second SSH key on Killarney. Your public SSH key on the Vaughan cluster will be different from the one on your local machine.
+
+The easiest way to migrate data from the legacy Vaughan (Bon Echo) cluster to Killarney is by using a file transfer command (likely `rsync` or `scp`) from an SSH session.
+
+Start by connecting via SSH into the legacy Bon Echo (Vaughan) cluster:
+
+```
+username@my-desktop:~$ ssh v.vectorinstitute.ai
+Password:
+Duo two-factor login for username
+
+Enter a passcode or select one of the following options:
+
+ 1. Duo Push to XXX-XXX-3089
+ 2. SMS passcodes to XXX-XXX-3089
+
+Passcode or option (1-2): 1
+Success. Logging you in...
+Welcome to the Vector Institute HPC - Vaughan Cluster
+
+Login nodes are shared among many users and therefore
+must not be used to run computationally intensive tasks.
+Those should be submitted to the slurm scheduler which
+will dispatch them on compute nodes.
+
+For more information, please consult the wiki at
+https://support.vectorinstitute.ai/Computing
+For issues using this cluster, please contact us at
+
+If you forget your password, please visit our self-
+service portal at https://password.vectorinstitute.ai.
+
+Last login: Mon Aug 18 07:28:24 2025 from 184.145.46.175
+```
+
+Next, use the `rsync` command to copy files across to the Killarney cluster. In the following example, I'm copying the contents of a folder called `my_projects` to my Killarney home directory.
+
+```
+username@v4:~$ cd ~/my_projects
+username@v4:~/my_projects$ rsync -avz * [email protected]:~/my_projects
+Duo two-factor login for username
+
+Enter a passcode or select one of the following options:
+
+ 1. Duo Push to Phone
+
+Passcode or option (1-1): 1
+Success. Logging you in...
+sending incremental file list
+[...]
+```
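If a large transfer gets interrupted, `rsync` can pick up where it left off on the next run. A sketch (the Killarney login host is elided above, so `KILLARNEY_HOST` here is a placeholder, not the real host name):

```shell
# --partial keeps half-transferred files so a rerun resumes them;
# --progress shows per-file transfer status. KILLARNEY_HOST is a placeholder.
rsync -avz --partial --progress ~/my_projects/ \
  username@KILLARNEY_HOST:~/my_projects/
```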
 # Killarney GPU resources

 There are two main types of GPU resources on the Killarney cluster: capacity GPUs (NVIDIA L40S) and high-performance GPUs (NVIDIA H100).
@@ -330,9 +388,12 @@ gpubase_l40s_b3 32/32/0/64 gpu:l40s:4(IDX:0-3) gpu:l40s:4
 [...]
 ```

+For CPUs, A/I/O/T stands for **A**llocated, **I**dle, **O**ther (e.g. down) and **T**otal. Even if the GPUs on a node are available, you won't be able to use the node if it has no idle CPUs.
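The Idle count can be pulled out of an A/I/O/T field by splitting on `/`; a minimal sketch using the example value from the output above:

```shell
# Extract the Idle CPU count (second field) from an A/I/O/T string.
echo "32/32/0/64" | awk -F/ '{print $2}'   # → 32
```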

 # Software Environments

+## Pre-installed Environments
 The cluster comes with preinstalled software environments called **modules**. These will allow you to access many different versions of Python, VS Code Server, RStudio Server, NodeJS and many others.

 To see the available preinstalled environments, run:
@@ -347,7 +408,8 @@ To use an environment, use `module load`. For example, if you need to use Python
 module load python/3.10.12
 ```

-If there isn't a preinstalled environment for your needs, you can use Poetry or python-venv. Here is a quick example of how to use python venv.
+## Custom Environments
+If there isn't a preinstalled environment for your needs, you can use [uv](https://docs.astral.sh/uv/) or python-venv. For ongoing projects, uv is highly recommended for managing dependencies; for a quick one-off run, python-venv may be easier. Here is a quick example of how to use python venv.

 In the login node run the following:
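The hunk cuts off before the guide's own venv commands. Purely as a generic illustration (not the file's actual content, and with hypothetical paths), a typical python-venv workflow on a login node looks like:

```shell
# Create and activate a virtual environment; the path is hypothetical.
python3 -m venv ~/venvs/quick-test
source ~/venvs/quick-test/bin/activate
pip install --upgrade pip
# ...pip install whatever the job needs, then:
deactivate
```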

@@ -407,7 +469,7 @@ When a job exceeds its time limit, it will get stopped by the Slurm scheduler. F

 In order to avoid losing your work when your job exits, you will need to implement checkpoints: periodic snapshots of your work that you can reload, so you can stop and resume without losing much progress.

-On the legacy Bon Echo cluster, there was a dedicated checkpoint space in the file system for checkpoints. **⚠️ In Killarney, there is no dedicated checkpoint space.** Users are expected to manage their own checkpoints under their `$SCRATCH` folder.
+On the legacy Bon Echo cluster, there was a dedicated checkpoint space in the file system for checkpoints. **⚠️ In Killarney, there is no dedicated checkpoint space.** Users are expected to manage their own checkpoints under their `$SCRATCH` folder. Recall that your scratch folder is not permanent, so you'll want to move any important checkpoints to your home or project folder.
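Since `$SCRATCH` is purged, it's worth periodically syncing the latest checkpoints into permanent storage. A minimal sketch; the run and backup paths are hypothetical:

```shell
# Copy checkpoints out of purgeable scratch into permanent home space.
# "$SCRATCH/my_run" and "~/backups" are hypothetical paths.
mkdir -p ~/backups/my_run/checkpoints
rsync -av "$SCRATCH/my_run/checkpoints/" ~/backups/my_run/checkpoints/
```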

 # Support

pyproject.toml

Lines changed: 4 additions & 1 deletion

@@ -3,9 +3,12 @@ requires = ["setuptools>=65", "wheel"]
 build-backend = "setuptools.build_meta"

 [tool.setuptools.packages.find]
-where = ["templates/src"]
+where = ["templates", "templates/src"] # Include configs and templates as packages
 include = ["*"]

+[tool.setuptools.package-data]
+"configs" = ["**/*.yaml"] # Make sure the configs package includes the yaml configs
+
 [project]
 name = "vec-playbook"
 version = "0.1.0"
