2 changes: 1 addition & 1 deletion README.md
@@ -13,7 +13,7 @@ See [getting-started](./getting-started) for documentation on using Vector compu

## Templates

See [templates](./templates) for training templates with Hydra + Submitit.
See [templates](./templates) for training templates that use Hydra + Submitit to structure experiments.

- Code lives under: [templates/src](./templates/src)
- Cluster configs live under: [templates/configs](./templates/configs)
16 changes: 13 additions & 3 deletions getting-started/introduction-to-vector-compute/README.md
@@ -64,6 +64,7 @@ ssh-ed25519 AAAA5AA7OZOZ7NRB1acK54bB47h58N6AIEX4zDziR1r0nM41d3NCG0fgCArjUD45pr13

Next, open the SSH Keys page in your Alliance account: [https://ccdb.alliancecan.ca/ssh_authorized_keys](https://ccdb.alliancecan.ca/ssh_authorized_keys). Paste your key into the SSH Key field, give it a name (typically the host name of the computer where you generated it) and hit Add Key.

**NOTE:** You may need to wait up to 30 minutes after adding your SSH key before it works when you try to log in via SSH. Have lunch and come back.

## SSH Access

@@ -127,6 +128,7 @@ In addition to your home directory, you have a minimum of additional 250 GB scra

A detailed description of the scratch purging policy is available on the Alliance Canada website: [https://docs.alliancecan.ca/wiki/Scratch_purging_policy](https://docs.alliancecan.ca/wiki/Scratch_purging_policy)

Your scratch space directory will not exist when you initially log in. To have it set up, send a request to [[email protected]](mailto:[email protected]). Include the name of your PI in the email.

## Shared projects

@@ -143,7 +145,7 @@ Instead of copying these datasets on your home directory, you can create a symli


```
ln -s /dataset/PATH_TO_DATASET ~/PATH_OF_LINK # path of link can be some place in your home directory so that PyTorch/TF can pick up the dataset to these already downloaded directories.
ln -s /datasets/PATH_TO_DATASET ~/PATH_OF_LINK # the link can live somewhere in your home directory so that PyTorch/TF can pick up the already-downloaded dataset.
```


@@ -162,6 +164,8 @@ Unlike the legacy Bon Echo (Vaughan) cluster, there is no dedicated checkpoint s

# Migration from legacy Vaughan (Bon Echo) Cluster

**NOTE:** The migration approach detailed here requires that you set up a second SSH key on Killarney. Your public SSH key on the Vaughan cluster will be different from the one on your local machine.

The easiest way to migrate data from the legacy Vaughan (Bon Echo) Cluster to Killarney is by using a file transfer command (likely `rsync` or `scp`) from an SSH session.

Start by connecting via ssh into the legacy Bon Echo (Vaughan) cluster (see the [User Guide to Killarney for Vector Researchers](https://support.vectorinstitute.ai/Killarney?action=AttachFile&do=view&target=User+Guide+to+Killarney+for+Vector+Researchers.pdf)):
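
A minimal sketch of the transfer; the hostnames and paths below are placeholders, not the real login nodes:

```bash
# Hypothetical sketch -- replace the placeholder hostnames and paths with your own
ssh USERNAME@VAUGHAN_LOGIN_HOST                      # log in to the legacy Bon Echo (Vaughan) cluster
rsync -avP ~/my_project/ USERNAME@KILLARNEY_LOGIN_HOST:/scratch/USERNAME/my_project/
```
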
@@ -377,6 +381,8 @@ gpubase_l40s_b3 32/32/0/64 gpu:l40s:4(IDX:0-3) gpu:l40s:4
[...]
```

For CPUs, A/I/O/T stands for **A**llocated, **I**dle, **O**ther (e.g. down) and **T**otal. Even if the GPUs on a node are available, you won't be able to use the node if it has no idle CPUs.
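
One way to check these counts yourself (a sketch; the partition name is taken from the example above) is `sinfo` with the `%C` format field, which prints CPUs in A/I/O/T form:

```bash
# Show node name, CPU state (Allocated/Idle/Other/Total) and GPUs for one partition
sinfo -N -p gpubase_l40s_b3 -o "%N %C %G"
```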

## Jupyter notebooks

To run a Jupyter environment from the cluster, you can request an interactive session and start a Jupyter notebook from there.
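
A rough sketch of the idea; the resource flags, module, and venv names are assumptions, and you may also need `--partition`/`--account` flags:

```bash
# Hypothetical sketch: grab an interactive GPU session, then start Jupyter on the compute node
srun --gres=gpu:1 --mem=16G --time=2:00:00 --pty bash
module load python/3.10.12
source ~/venvs/myproject/bin/activate        # assumes jupyter is installed in this venv
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
```
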
@@ -430,6 +436,7 @@ You will need a VPN connection to access this notebook. Once you are connected t

# Software Environments

## Pre-installed Environments
The cluster comes with preinstalled software environments called **modules**. These will allow you to access many different versions of Python, VS Code Server, RStudio Server, NodeJS and many others.

To see the available preinstalled environments, run:
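
On clusters using Lmod-style modules, this listing command is typically:

```bash
module avail
```
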
@@ -444,7 +451,8 @@ To use an environment, use `module load`. For example, if you need to use Python
module load python/3.10.12
```

If there isn't a preinstalled environment for your needs, you can use Poetry or python-venv. Here is a quick example of how to use python venv.
## Custom Environments
If there isn't a preinstalled environment for your needs, you can use [uv](https://docs.astral.sh/uv/) or python-venv. For ongoing projects, uv is highly recommended for managing dependencies; for a quick one-off run, python-venv might be easier. Here is a quick example of how to use python venv.

In the login node run the following:

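A minimal sketch, assuming an available `python/3.10.12` module and a venv kept in your home directory (adjust names and packages to your project):

```bash
module load python/3.10.12
python -m venv ~/venvs/myproject
source ~/venvs/myproject/bin/activate
pip install --upgrade pip
pip install -r requirements.txt   # or whatever packages you need
```
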
@@ -498,13 +506,15 @@ gpubase_l40s_b5 up 7-00:00:00 17/0/0/17 kn[085-101]

## Automatic Restarts

**NOTE:** There is currently no preemption on the Killarney cluster.
> **Review comment (Contributor Author):** Maybe also worth noting somewhere in this documentation the nuance with all code having to reside in the scratch space. This leads to some weirdness with the uv cache, etc. It might even be better to use scratch as a home directory and use the home directory as a backup for files.


All jobs in our Slurm cluster have a time limit, after which they will be stopped. For longer-running jobs that need more than a few hours, the [Vaughan Slurm Changes](https://support.vectorinstitute.ai/Computing?action=AttachFile&do=view&target=Vector+Vaughan+HPC+Changes+FAQ+2023.pdf) document describes how to restart them automatically.

## Checkpoints

In order to avoid losing your work when your job exits, you will need to implement checkpoints: periodic snapshots of your work that you can load from, so you can stop and resume without losing much progress.

On the legacy Bon Echo cluster, there was a dedicated checkpoint space in the file system for checkpoints. **⚠️ In Killarney, there is no dedicated checkpoint space.** Users are expected to manage their own checkpoints under their `$SCRATCH` folder.
On the legacy Bon Echo cluster, there was a dedicated checkpoint space in the file system for checkpoints. **⚠️ In Killarney, there is no dedicated checkpoint space.** Users are expected to manage their own checkpoints under their `$SCRATCH` folder. Recall that your scratch folder is not permanent, so you'll want to move any important checkpoints to your home or project folder.
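
A minimal checkpointing sketch in PyTorch; the paths, model, and training-loop body are illustrative only:

```python
import os

import torch
from torch import nn, optim

# Keep checkpoints on scratch, per the note above; the subdirectory name is arbitrary
ckpt_path = os.path.join(os.environ.get("SCRATCH", "."), "myrun", "checkpoint.pt")
os.makedirs(os.path.dirname(ckpt_path), exist_ok=True)

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=1e-3)

# Resume from the latest checkpoint if one exists
start_epoch = 0
if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    # ... run one epoch of training here ...
    torch.save(
        {"epoch": epoch, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        ckpt_path,
    )
```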


# Useful Links and Resources
99 changes: 73 additions & 26 deletions templates/README.md
@@ -2,6 +2,36 @@

Templates for ML model training workflows on the Bon Echo and Killarney clusters using Hydra and Submitit.

[Hydra](https://hydra.cc/docs/intro/) is a Python framework for creating configurable experiments that you can change through a config file. One of its main uses is its ability to automatically perform hyperparameter sweeps for model training.

[submitit](https://github.com/facebookincubator/submitit) is a simple Python package that lets you submit Slurm jobs programmatically and automatically access and manipulate the results of those jobs once they are complete. It also handles automatic requeueing of jobs should they be interrupted for some reason.

Hydra conveniently has a Submitit plugin that allows the two to work together. Put simply, with these tools you can automatically queue up a large number of experiments, run dependent experiments sequentially, requeue long-running experiments, and more.

## Local Setup

1) Install [uv](https://docs.astral.sh/uv/getting-started/installation/)
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

2) Clone the vec-playbook repository
```bash
git clone https://github.com/VectorInstitute/vec-playbook.git
```

3) Resolve and install dependencies from `pyproject.toml` into a virtual environment:
```bash
cd path/to/vec-playbook
uv sync # Automatically installs dependencies in vec-playbook/.venv
```

Finally, ensure your working directory (by default your cluster scratch space) exists and that you have access to the resources you're requesting on the cluster.

### UV Tip for Killarney

If you're on Killarney, you'll have to clone the repository into your scratch space: you can't run files stored in your home directory. The uv cache, however, is located in your home directory by default, which is a different filesystem. This breaks uv's default method of hardlinking packages to avoid re-downloading them. You can either move your cache directory onto the same filesystem or use `--link-mode=copy`. Avoid symlink mode, as this can break things.
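
For example, a sketch of either workaround (the scratch path is an assumption):

```bash
# Option A: put the uv cache on the same filesystem as the repository
export UV_CACHE_DIR=/scratch/$USER/.uv-cache
uv sync

# Option B: copy packages into the venv instead of hardlinking across filesystems
uv sync --link-mode=copy
```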

## Layout

```
@@ -17,25 +47,14 @@ templates/
Each template directory is self-contained: it has a `launch.py`, a `train.py`, and a `config.yaml`.
The `configs/` directory defines Slurm presets and shared Hydra + Submitit settings.

Hydra starts from `configs/_global.yaml`, pulls in the appropriate entries from `configs/user.yaml` and `configs/compute/*`, then merges the template's own `config.yaml` before forwarding the resolved configuration to Submitit; CLI overrides (e.g. `compute=killarney/h100_1x`) are applied in that final merge, so every launch script receives a single, fully-specified config that Submitit uses to submit or run locally.

## Local Setup
Hydra starts from `configs/_global.yaml` and pulls in the appropriate entries from `configs/user.yaml` and `configs/compute/*`. The launch script within each template then merges the template's own local `config.yaml` before forwarding the resolved configuration to Submitit; CLI overrides (e.g. `compute=killarney/h100_1x`) are applied in that final merge, so every launch script receives a single, fully-specified config that Submitit uses to submit or run locally.

1) Create and activate a virtual environment:
```bash
uv venv .venv
source .venv/bin/activate
```
The `_global.yaml` config contains the bulk of the autoconfiguration. Placeholders are used to automatically fill in values from other configuration files. `hydra.launcher` arguments largely align with the CLI arguments available for the [sbatch](https://slurm.schedmd.com/sbatch.html) command. See [this](https://hydra.cc/docs/plugins/submitit_launcher/) page for the officially available Hydra Slurm launcher parameters. Note that the majority of the parameters are sourced from the selected `compute` config.
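
As an illustration only (a hypothetical excerpt, not the actual file contents), placeholder interpolation in `_global.yaml` looks roughly like this:

```yaml
hydra:
  launcher:
    account: ${user.slurm.account}          # filled from configs/user.yaml
    partition: ${compute.slurm.partition}   # filled from the selected configs/compute/* preset
    gres: ${compute.slurm.gres}
    timeout_min: ${compute.slurm.timeout_min}
```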

2) Resolve and install dependencies from `pyproject.toml`:
```bash
uv lock
uv sync
```

## Cluster Setup

1) Provide your user Slurm account and any optional parameters in `templates/configs/user.yaml`.
1) Provide your Slurm user account and any optional parameters in `templates/configs/user.yaml`.

```yaml
user:
@@ -44,17 +63,19 @@ user:
# additional_parameters:
# qos: m2 # example Bon Echo QoS
```
**NOTE:** Why is `qos` used as an example of an additional parameter here when it is an official launcher parameter that seems to be sourced from the compute config?

Uncomment and edit `additional_parameters` entries as needed. Use CLI overrides for alternate accounts or QoS when launching jobs, for example `... user.slurm.account=ACCOUNT_B user.slurm.additional_parameters.qos=fast`.
Uncomment and edit `additional_parameters` entries as needed. This field is solely for sbatch arguments not already available in the [Hydra Submitit Slurm Launcher Plugin](https://hydra.cc/docs/plugins/submitit_launcher/). Use CLI overrides for alternate accounts or QoS when launching jobs, for example `... user.slurm.account=ACCOUNT_B user.slurm.additional_parameters.qos=fast`.

2) Pick a compute preset:
2) Pick a compute preset to use in the next section:
- `templates/configs/compute/bon_echo/*` (A40, A100)
- `templates/configs/compute/killarney/*` (L40S, H100)
- Create your own preset under `templates/configs/compute/` if you need different resources (match the YAML shape used in the existing files).

## Running Templates

All launchers follow the same pattern: use `uv run python -m <package>.launch` with Hydra overrides that select compute presets, requeue behaviour, and any template-specific hyperparameters.
All launchers follow the same pattern: use `uv run python -m <package>.launch` with Hydra overrides that select compute presets, requeue behaviour, and any template-specific hyperparameters. uv automatically detects the virtual environment located in `.venv` of your CWD. The templates are automatically loaded as Python modules by `uv`. If you add your own template, you will have to sync the virtual environment using `uv sync`.

### Command Pattern

```bash
@@ -65,11 +86,14 @@ uv run python -m <template_pkg>.launch \
--multirun
```

- `compute=<cluster>/<preset>` chooses the Slurm resources defined under `templates/configs/compute/` (or a custom preset you add).
- `requeue=<on|off>` toggles the Submitit requeue flag described in the checkpointing section.
- `<template_pkg>`: The module path to the template launch script (e.g. `mlp.single`)
- `compute=<cluster>/<preset>`: chooses the Slurm resources defined under `templates/configs/compute/` (or a custom preset you add).
- `requeue=<on|off>`: toggles the Submitit requeue flag described in the checkpointing section.
- Additional Hydra overrides use `key=value` syntax; nested keys follow the YAML structure (e.g., `trainer.learning_rate=5e-4`).
- Use of `--multirun` is required for the launcher to be picked up..
- Prepend `+` to introduce new keys at runtime, like `+trainer.notes=baseline_a`.
- Prepend `+` to introduce new keys (not already present in config) at runtime, like `+trainer.notes=baseline_a`.
- Use of `--multirun` is required for the launcher to be picked up.

[//]: <> (What does "picked up" mean when explaining --multirun flag?)

### Examples (single parameter set)

@@ -87,7 +111,19 @@ uv run python -m llm.text_classification.launch \
--multirun
```

Your output should look something like this:
```
[2025-09-29 11:06:00,546][HYDRA] Submitit 'slurm' sweep output dir : /scratch/$USER/vec_jobs/20250929-110600
[2025-09-29 11:06:00,546][HYDRA] #0 : compute=killarney/l40s_1x
```

[//]: <> (Why does learning_rate need the + prepended if it's already in the local config?)
[//]: <> (Perhaps a little more clarity on this)
[//]: <> (`+trainer.num_epochs=100` override did not work for mlp.single)
[//]: <> (multirun.yaml is long and confusing and still contains placeholders. Is there a way to save the final static config yaml?)

Hydra blocks until the job finishes (or fails). For long or interactive sessions, wrap the command in `tmux`, `screen`, or submit a wrapper script as shown below.

### Practical Patterns for Long Jobs

```bash
@@ -105,11 +141,13 @@ uv run python -m llm.text_classification.launch compute=bon_echo/a40_1x --multir

Hydra sweeps expand comma-separated value lists into Cartesian products and schedule each configuration as a separate Submitit job. Output directories are numbered based on Hydra's sweep index.

[//]: <> (Sweep seems to work, but checkpoints overwrite each other, I'm assuming? Hydra does not create subdirectories in outputs for sweeps.)

```bash
# Sweep learning rate and hidden size for the MLP template
uv run python -m mlp.single.launch \
+trainer.learning_rate=1e-2,1e-3,1e-4 \
+trainer.hidden_dim=64,128,256 \
+trainer.learning_rate=1e-2,1e-3 \
+trainer.hidden_dim=64,128 \
compute=bon_echo/a40_1x \
--multirun

@@ -121,13 +159,23 @@ uv run python -m vlm.image_captioning.launch \
--multirun
```

Your output for a sweep should look something like this:

```
[2025-09-29 11:06:00,546][HYDRA] Submitit 'slurm' sweep output dir : /scratch/$USER/vec_jobs/20250929-110600
[2025-09-29 11:06:00,546][HYDRA] #0 : +trainer.learning_rate=0.01 +trainer.hidden_dim=64 compute=killarney/l40s_1x
[2025-09-29 11:06:00,546][HYDRA] #1 : +trainer.learning_rate=0.01 +trainer.hidden_dim=128 compute=killarney/l40s_1x
[2025-09-29 11:06:00,546][HYDRA] #2 : +trainer.learning_rate=0.001 +trainer.hidden_dim=64 compute=killarney/l40s_1x
[2025-09-29 11:06:00,546][HYDRA] #3 : +trainer.learning_rate=0.001 +trainer.hidden_dim=128 compute=killarney/l40s_1x
```

### Monitoring Jobs

By default, Hydra and Submitit create the working directory at `~/vec_jobs/<timestamp>` (see `configs/_global.yaml`). Override it when needed with flags such as `paths.work_root=/scratch/$USER` or `work_dir=/scratch/$USER/vec_jobs/${experiment_name}`.

```bash
# Check SLURM job status
squeue -u $USER
squeue --me

# Inspect the latest work directory
ls -1t ~/vec_jobs | head
Expand All @@ -137,7 +185,7 @@ tail -f ~/vec_jobs/YYYYMMDD-HHMMSS/.submitit/*/stdout*
```
## Checkpointing & Requeue

Checkpointing lets Submitit resubmit interrupted jobs (preemption, timeout, manual `scontrol requeue`) without restarting from scratch. The templates already subclass `submitit.helpers.Checkpointable`, so they ship with a default `checkpoint()` that returns `DelayedSubmission(self, *args, **kwargs)`. You simply need to persist enough training state to continue where you left off.
Checkpointing lets Submitit resubmit interrupted jobs (preemption, timeout, manual `scontrol requeue`) without restarting from scratch. The templates already subclass `submitit.helpers.Checkpointable`, so they ship with a default `checkpoint()` that returns `DelayedSubmission(self, *args, **kwargs)`. You simply need to persist enough training state to continue where you left off. See [mlp.single.train](src/mlp/single/train.py) for an example of a basic checkpointing implementation.

Submitit’s official [checkpointing guide](https://github.com/facebookincubator/submitit/blob/main/docs/checkpointing.md) covers how the `checkpoint()` hook works under the hood and provides additional patterns (e.g., swapping callables, partial pickling) if you need more control.

@@ -151,7 +199,6 @@ Submitit’s official [checkpointing guide](https://github.com/facebookincubator
3. Ensure your `checkpoint()` method returns a `DelayedSubmission` that recreates the callable with the same arguments. If you need custom behaviour (changing hyperparameters, skipping corrupt steps), instantiate a new callable and pass it to `DelayedSubmission` instead of `self`, as in the sketch after this list.
4. Test the flow by requeueing a running job (`scancel --signal=USR1 <jobid>` or Submitit's `job._interrupt(timeout=True)`) to confirm state is restored as expected.
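
A minimal sketch of that pattern; the class name and config argument are illustrative, while `Checkpointable` and `DelayedSubmission` are real Submitit API:

```python
import submitit


class Trainer(submitit.helpers.Checkpointable):
    """Callable that Submitit can requeue after preemption or timeout."""

    def __call__(self, cfg):
        # Load the newest checkpoint if present, then continue training from there.
        ...

    def checkpoint(self, *args, **kwargs):
        # Resubmit the same callable with the same arguments; pass a new callable
        # instead of `self` if you need custom requeue behaviour.
        return submitit.helpers.DelayedSubmission(self, *args, **kwargs)
```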


## Resources
- Submitit: https://github.com/facebookincubator/submitit
- Hydra Submitit launcher: https://hydra.cc/docs/plugins/submitit_launcher