This repository provides material to use Nanochat as a containerized LLM training test case for benchmarking scalability and performance of different container runtimes, with a primary focus on the Alps research infrastructure at CSCS.
The default container image provides a ready-to-go installation of Nanochat without resorting to a virtual environment, allowing you to skip the installation of dependencies and the build of the Python package itself.
The image is built on top of the NVIDIA NGC PyTorch 25.10 image, which provides an optimized software stack featuring:
- CUDA 13.0
- CUDNN 9.14.0
- PyTorch 2.9.0+145a3a7bda
- NCCL 2.27.7
- AWS OFI NCCL plugin 1.17.1
Note: Since the image provides its own AWS NCCL plugin and the included libfabric 1.22 matches the host version on Alps, it is not necessary to activate the AWS OFI NCCL container hook. The EDFs already define environment variables setting NCCL to use libfabric and the Slingshot network.
It is advised not to change the container working directory from `/workspace/nanochat` (as set in the Containerfile), since the documented commands and scripts rely on that filesystem path to run the Nanochat Python modules.
The image currently uses a custom version of Nanochat, introducing small tweaks to run beyond 32 GPUs as detailed in the "Notes on scaling" section below.
Note: for reference and additional details check the speedrun.sh script in the upstream Nanochat repo.
- Clone this repository
- (Optional) Use the Containerfile to build and customize your own container image if you don't want to use the default one. The build requires the modified `pyproject.toml` file provided in this repo.
- Copy the EDF files from `edf/` into your EDF search path (by default `$HOME/.edf/`). Change the image reference if you don't intend to use the default image.
- Start an interactive container

      $ srun --environment=nanochat --pty bash

- Download enough data shards for training the default `d20` network:

      $ python -m nanochat.dataset -n 250

  Data will be downloaded under `$NANOCHAT_BASE_DIR`, which is set in the EDFs to `${SCRATCH}/nanochat-bench/cache`.
- Download identity conversation data (used in midtraining)

      $ curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl

- Train the tokenizer on ~2B characters of data

      $ python -m scripts.tok_train --max_chars=2000000000

- Evaluate the tokenizer

      $ python -m scripts.tok_eval
Note: Training the tokenizer only once and not including it in the measured runs is an arbitrary choice by this repo. The main point is that tokenizer training is not relevant for evaluating scaling performance, since it happens on a single process.
- Interactively (allows running steps individually, getting immediate feedback in the terminal and seeing what's happening):
  - Obtain an interactive Slurm allocation with `salloc`
  - Follow the commands in the `e2e-interactive.txt` file
- As a batch job: use the `sbatch.sh` script after tweaking it to your preference (e.g. number of nodes).
Note: The sbatch-compare-runtimes.sh script is not intended for general usage, since the use of Podman to run containers at scale on Alps is still experimental.
The Nanochat code is intended to run on a node with 8 H100 GPUs. On Alps this is equivalent to 2 compute nodes, each having 4 GH200 modules.
The complete training for the d20 model takes slightly longer than the 4 hours indicated in the Nanochat documentation.
Among other factors, using 2 separate nodes with 4 GPUs each is less efficient than a single node with 8 GPUs.
Provided that the total batch size (for pre-training and midtraining), the split tokens (for base evaluation), and the target examples per step (for supervised fine-tuning) are adjusted according to the number of GPUs/ranks used, the original Nanochat code runs without problems up to about 50 GPUs.
In the following sections we assume GPUs and ranks are scaled in powers of 2, so the highest viable count with upstream Nanochat is 32. We also assume 1 GPU per rank, hence the two terms are used interchangeably.
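As a concrete sketch of the adjustment described above (assuming the d20 defaults of a device batch size of 32 sequences and a sequence length of 2048 tokens; these values and the helper below are illustrative assumptions, not taken from this repository), the total batch size can be derived from the rank count so that gradient accumulation stays at 1:

```python
# Hypothetical helper, not part of the repo: pick --total_batch_size for a
# given world size, assuming device_batch_size=32 and seq_len=2048 (d20).
DEVICE_BATCH_SIZE = 32   # sequences per GPU per micro-step (assumed default)
SEQ_LEN = 2048           # tokens per sequence (assumed d20 default)

def total_batch_size(world_size: int) -> int:
    """Tokens per optimizer step such that gradient accumulation stays at 1."""
    return DEVICE_BATCH_SIZE * SEQ_LEN * world_size

for ranks in (8, 16, 32, 64, 128):
    print(f"{ranks:4d} ranks -> --total_batch_size={total_batch_size(ranks)}")
```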
In the original Nanochat code, a single data shard is used for validation during pre-training, with each rank accessing one row group of data. Since the data shard files are similar in size and contain slightly over 50 row groups, with 64 ranks some of them get no data to process, causing a stall during an all-reduce operation and eventually resulting in the NCCL watchdog timer killing the process.
The customized version of Nanochat used in the Containerfile changes the way ranks access data shards during training and evaluation, introducing a mechanism that uses multiple files for validation (instead of just one) and "strides" contiguously through row groups across file boundaries, thus supporting arbitrary numbers of ranks.
The number of files to use during pre-training validation is controlled through the `NANOCHAT_VAL_FILES_K` environment variable. 4 files yield roughly 210 row groups and are thus sufficient to run up to 128 ranks.
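To illustrate the striding idea, the row groups of the selected validation files can be flattened into one sequence and assigned to ranks round-robin. The sketch below is purely conceptual (the `row_groups_for_rank` helper and the use of pyarrow on Parquet shards are assumptions, not the actual patch shipped in the image):

```python
# Conceptual sketch: flatten the row groups of several shards and assign them
# to ranks round-robin, so any world size up to the total row-group count
# gets data to process.
import pyarrow.parquet as pq

def row_groups_for_rank(files, rank, world_size):
    """Yield the row-group tables assigned to this rank."""
    flat = []
    for path in files:
        n = pq.ParquetFile(path).num_row_groups
        flat.extend((path, g) for g in range(n))
    # Stride contiguously across file boundaries: rank, rank+W, rank+2W, ...
    for path, group in flat[rank::world_size]:
        yield pq.ParquetFile(path).read_row_group(group)
```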
Since more files are used for validation, a slightly higher number of files has to be downloaded beforehand to preserve the recommended amount of training data. This is why the setup section above indicates downloading 250 shards instead of the 240 documented in the upstream Nanochat repository.
Running pre-training on 128 GPUs of Alps (i.e. 32 nodes) resulted in several GPUs going out-of-memory (OOM).
At the time of writing, no other solution was found besides reducing the device batch size to 16 (from the default of 32). To do so, the CLI option `--device_batch_size=16` must be passed to the `scripts.base_train` script, e.g. in the case of interactive execution:

    srun --ntasks-per-node=1 --gpus-per-task=4 --cpus-per-task=288 --environment=nanochat torchrun --nproc_per_node=4 --nnodes=${SLURM_JOB_NUM_NODES} --rdzv_endpoint $MASTER_ADDRESS:$MASTER_PORT --rdzv_backend c10d -m scripts.base_train -- --depth=20 --device_batch_size=16 --total_batch_size=${TOTAL_BATCH_SIZE} --run=dummy
Leaving the total batch size unchanged results in the same number of pre-training iterations with 2 gradient accumulation steps per iteration, which is less efficient but still scales decently.
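The arithmetic behind that statement, shown here as an illustrative sketch (the sequence length and the total batch size sized for 128 ranks at the default device batch size are assumptions, consistent with the sketch earlier in this section):

```python
# Illustrative arithmetic (assumed values): halving the device batch size
# while keeping the total batch size fixed doubles the gradient accumulation
# steps, so the number of pre-training iterations stays the same.
SEQ_LEN = 2048
WORLD_SIZE = 128
TOTAL_BATCH_SIZE = 32 * SEQ_LEN * WORLD_SIZE   # tokens per optimizer step

for device_batch_size in (32, 16):
    tokens_per_micro_step = device_batch_size * SEQ_LEN * WORLD_SIZE
    grad_accum = TOTAL_BATCH_SIZE // tokens_per_micro_step
    print(f"device_batch_size={device_batch_size} -> {grad_accum} grad accumulation step(s)")
```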
In an attempt to reduce GPU memory load and address the previous point, the custom Nanochat code used in the Containerfile allows restricting the choice of PyTorch implementations for Scaled Dot-Product Attention (SDPA) to the more optimized options (FlashAttention and Memory-Efficient Attention), excluding the "math" implementation.
This restriction is activated by setting the NANOCHAT_FORCE_FLASH_SDPA=1 environment variable and is only possible with recent PyTorch releases.
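For reference, a restriction of this kind can be expressed with the SDPA backend selection API available in recent PyTorch releases (`torch.nn.attention.sdpa_kernel`, PyTorch >= 2.3). The snippet below is a sketch of how the environment variable might gate it; the `attention` wrapper and the wiring are assumptions, not the code shipped in the image:

```python
# Sketch (assumed wiring): restrict SDPA to FlashAttention and
# Memory-Efficient Attention when NANOCHAT_FORCE_FLASH_SDPA=1, so PyTorch
# raises an error instead of silently falling back to the "math" backend.
import os
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3

def attention(q, k, v):
    if os.environ.get("NANOCHAT_FORCE_FLASH_SDPA") == "1":
        with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
            return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```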