This repository provides material to use Nanochat as a containerized LLM training test case for benchmarking scalability and performance of different container runtimes, with a primary focus on the Alps research infrastructure at CSCS.
The default container image provides a ready-to-go installation of Nanochat without resorting to a virtual environment, allowing you to skip the installation of dependencies and the build of the Python package itself.
The image is built on top of the NVIDIA NGC PyTorch 25.10 image, which provides an optimized software stack featuring:
- CUDA 13.0
- CUDNN 9.14.0
- PyTorch 2.9.0+145a3a7bda
- NCCL 2.27.7
- AWS OFI NCCL plugin 1.17.1
Note: Since the image provides its own AWS NCCL plugin and the included libfabric 1.22 matches the host version on Alps, it is not necessary to activate the AWS OFI NCCL container hook. The EDFs already define environment variables setting NCCL to use libfabric and the Slingshot network.
It is advised not to change the container working directory from `/workspace/nanochat` (as set in the Containerfile), since the documented commands and scripts rely on that filesystem path to run the Nanochat Python modules.
The image currently uses a custom version of Nanochat, introducing small tweaks to run beyond 32 GPUs as detailed in the "Notes on scaling" section below.
Note: for reference and additional details check the speedrun.sh script in the upstream Nanochat repo.
- Clone this repository
- (Optional) Use the Containerfile to build and customize your own container image if you don't want to use the default one. The build requires the modified `pyproject.toml` file provided in this repo.
- Copy the EDF files from `edf/` into your EDF search path (by default `$HOME/.edf/`). Change the image reference if you don't intend to use the default image.
- Start an interactive container

      $ srun --environment=nanochat --pty bash

- Download enough data shards for training the default `d20` network:

      $ python -m nanochat.dataset -n 250

  Data will be downloaded under `$NANOCHAT_BASE_DIR`, which is set in the EDFs to `${SCRATCH}/nanochat-bench/cache`.
- Download identity conversation data (used in midtraining)

      $ curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl

- Train the tokenizer on ~2B characters of data

      $ python -m scripts.tok_train --max_chars=2000000000

- Evaluate the tokenizer

      $ python -m scripts.tok_eval
Note: Training the tokenizer only once and not including it in the measured runs is an arbitrary choice by this repo. The main point is that tokenizer training is not relevant for evaluating scaling performance, since it happens on a single process.
- Interactively (allows running steps individually, getting immediate feedback in the terminal and seeing what's happening):
  - Obtain an interactive Slurm allocation with `salloc`
  - Follow the commands in the `e2e-interactive.txt` file
- As a batch job: use the `sbatch.sh` script after tweaking it to your preference (e.g. number of nodes).
Note: The sbatch-compare-runtimes.sh script is not intended for general usage, since the use of Podman to run containers at scale on Alps is still experimental.
The Nanochat code is intended to run on a node with 8 H100 GPUs. On Alps this is equivalent to 2 compute nodes, each having 4 GH200 modules.
The complete training for the d20 model takes slightly longer than the 4 hours indicated in the Nanochat documentation.
Among other factors, using 2 separate nodes with 4 GPUs each is less efficient than a single node with 8 GPUs.
Provided that the total batch size (for pre-training and midtraining), the split tokens (for base evaluation), and the target examples per step (for supervised fine-tuning) are adjusted according to the number of GPUs/ranks used, the original Nanochat code runs without problems up to about 50 GPUs.
In the following sections we assume GPUs and ranks are scaled in powers of 2, so the highest viable count with upstream Nanochat is 32. We also assume 1 GPU per rank, hence the two terms are used interchangeably.
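As a concrete sketch of the adjustment described above (assuming the d20 defaults of a device batch size of 32 sequences and a sequence length of 2048 tokens; these values and the helper below are illustrative assumptions, not taken from this repository), the total batch size can be derived from the rank count so that gradient accumulation stays at 1:

```python
# Hypothetical helper, not part of the repo: pick --total_batch_size for a
# given world size, assuming device_batch_size=32 and seq_len=2048 (d20).
DEVICE_BATCH_SIZE = 32   # sequences per GPU per micro-step (assumed default)
SEQ_LEN = 2048           # tokens per sequence (assumed d20 default)

def total_batch_size(world_size: int) -> int:
    """Tokens per optimizer step such that gradient accumulation stays at 1."""
    return DEVICE_BATCH_SIZE * SEQ_LEN * world_size

for ranks in (8, 16, 32, 64, 128):
    print(f"{ranks:4d} ranks -> --total_batch_size={total_batch_size(ranks)}")
```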
In the original Nanochat code, a single data shard is used for validation during pre-training, with each rank accessing one row group of data. Since the data shard files are similar in size and contain slightly over 50 row groups, with 64 ranks some of them get no data to process, causing a stall during an all-reduce operation and eventually resulting in the NCCL watchdog timer killing the process.
The customized version of Nanochat used in the Containerfile changes the way ranks access data shards during training and evaluation, introducing a mechanism that uses multiple files for validation (instead of just one) and "strides" contiguously through row groups across file boundaries, thus supporting arbitrary numbers of ranks.
The number of files to use during pre-training validation is controlled through the `NANOCHAT_VAL_FILES_K` environment variable. 4 files yield roughly 210 row groups and are thus sufficient to run up to 128 ranks.
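To illustrate the striding idea, the row groups of the selected validation files can be flattened into one sequence and assigned to ranks round-robin. The sketch below is purely conceptual (the `row_groups_for_rank` helper and the use of pyarrow on Parquet shards are assumptions, not the actual patch shipped in the image):

```python
# Conceptual sketch: flatten the row groups of several shards and assign them
# to ranks round-robin, so any world size up to the total row-group count
# gets data to process.
import pyarrow.parquet as pq

def row_groups_for_rank(files, rank, world_size):
    """Yield the row-group tables assigned to this rank."""
    flat = []
    for path in files:
        n = pq.ParquetFile(path).num_row_groups
        flat.extend((path, g) for g in range(n))
    # Stride contiguously across file boundaries: rank, rank+W, rank+2W, ...
    for path, group in flat[rank::world_size]:
        yield pq.ParquetFile(path).read_row_group(group)
```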
Since more files are used for validation, a slightly higher number of files has to be downloaded beforehand to preserve the recommended amount of training data. This is why the setup section above indicates downloading 250 shards instead of the 240 documented in the upstream Nanochat repository.
Running pre-training on 128 GPUs of Alps (i.e. 32 nodes) resulted in several GPUs going out-of-memory (OOM).
At the time of writing, no other solution was found besides reducing the device batch size to 16 (from the default of 32). To do so, the CLI option `--device_batch_size=16` must be passed to the `scripts.base_train` script, e.g. in the case of interactive execution:

    srun --ntasks-per-node=1 --gpus-per-task=4 --cpus-per-task=288 --environment=nanochat torchrun --nproc_per_node=4 --nnodes=${SLURM_JOB_NUM_NODES} --rdzv_endpoint $MASTER_ADDRESS:$MASTER_PORT --rdzv_backend c10d -m scripts.base_train -- --depth=20 --device_batch_size=16 --total_batch_size=${TOTAL_BATCH_SIZE} --run=dummy
Leaving the total batch size unchanged results in the same number of pre-training iterations with 2 gradient accumulation steps per iteration, which is less efficient but still scales decently.
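The arithmetic behind that statement, shown here as an illustrative sketch (the sequence length and the total batch size sized for 128 ranks at the default device batch size are assumptions, consistent with the sketch earlier in this section):

```python
# Illustrative arithmetic (assumed values): halving the device batch size
# while keeping the total batch size fixed doubles the gradient accumulation
# steps, so the number of pre-training iterations stays the same.
SEQ_LEN = 2048
WORLD_SIZE = 128
TOTAL_BATCH_SIZE = 32 * SEQ_LEN * WORLD_SIZE   # tokens per optimizer step

for device_batch_size in (32, 16):
    tokens_per_micro_step = device_batch_size * SEQ_LEN * WORLD_SIZE
    grad_accum = TOTAL_BATCH_SIZE // tokens_per_micro_step
    print(f"device_batch_size={device_batch_size} -> {grad_accum} grad accumulation step(s)")
```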
In an attempt to reduce GPU memory load and address the previous point, the custom Nanochat code used in the Containerfile allows restricting the choice of PyTorch implementations for Scaled Dot-Product Attention (SDPA) to the more optimized options (FlashAttention and Memory-Efficient Attention), excluding the "math" implementation.
This restriction is activated by setting the NANOCHAT_FORCE_FLASH_SDPA=1 environment variable and is only possible with recent PyTorch releases.
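For reference, a restriction of this kind can be expressed with the SDPA backend selection API available in recent PyTorch releases (`torch.nn.attention.sdpa_kernel`, PyTorch >= 2.3). The snippet below is a sketch of how the environment variable might gate it; the `attention` wrapper and the wiring are assumptions, not the code shipped in the image:

```python
# Sketch (assumed wiring): restrict SDPA to FlashAttention and
# Memory-Efficient Attention when NANOCHAT_FORCE_FLASH_SDPA=1, so PyTorch
# raises an error instead of silently falling back to the "math" backend.
import os
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3

def attention(q, k, v):
    if os.environ.get("NANOCHAT_FORCE_FLASH_SDPA") == "1":
        with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
            return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```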