The main purpose of this project was for me to better grasp all the details in the end-to-end process of training a system such as GPT-2. However, I hope this project can also be of use to others, if only by providing an example implementation of this end-to-end process.
- `train.py` contains the entry point for training this GPT-2-like model on a subset of FineWeb-Edu, across all GPUs of the machine.
- `inference.ipynb` contains code for loading and streaming from the model trained in `train.py` (or the model I have pre-trained for you).
- `outputs/tokenizer` and `outputs/model.pt` contain the result of my training run on a 1024-length sequence-packed encoding of the sample-10BT subset of FineWeb-Edu, for just under 30k updates (1.5 epochs) at a batch size of 512.
- `utilities` contains our model and optimizer architectures, as well as all the helper code we use for loading and preprocessing data and performing our training runs.
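As a taste of the inference side, here is a minimal sketch of loading the shipped artifacts. The file layout and calls below are illustrative assumptions, not the repo's actual API; see `inference.ipynb` and `utilities` for the real thing.

```python
# Illustrative sketch only: the real loading and sampling code lives in
# inference.ipynb; the paths and formats below are assumptions.
import torch
from tokenizers import Tokenizer  # HuggingFace tokenizers library

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumes outputs/tokenizer contains a tokenizers-format tokenizer.json.
tokenizer = Tokenizer.from_file("outputs/tokenizer/tokenizer.json")

# Assumes outputs/model.pt was saved with torch.save; it may hold the full
# module or just a state_dict, in which case you would first build the model
# from the classes in `utilities` and call load_state_dict on it.
checkpoint = torch.load("outputs/model.pt", map_location=device)

prompt_ids = tokenizer.encode("The scientific method is").ids
print(prompt_ids)  # token ids to feed into the model's sampling loop
```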
- pixi. Installation guide: `curl -fsSL https://pixi.sh/install.sh | sh`
- git lfs. Installation guide:
  - Linux: `sudo apt-get install git-lfs ; git lfs install`
  - macOS: `brew install git-lfs ; git lfs install`
- `git clone https://github.com/JaHeRoth/gpt2-from-scratch.git`
- `cd gpt2-from-scratch`
- `pixi install`
- `pixi shell`
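Once inside the pixi shell, a quick sanity check (mine, not part of the repo) that the environment resolved a CUDA-enabled PyTorch build and can see the GPUs `train.py` expects:

```python
# Run inside `pixi shell` to confirm PyTorch and GPU visibility.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())
```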
- Exact network architectures of GPT and GPT-2 (down to the level of every individual `nn.Parameter`)
- Inner workings of the AdamW optimizer
- LLM sampling tricks, including implementing temperature and nucleus sampling (sketched after this list)
- Sequence packing (also sketched after this list)
- HuggingFace tokenizers, datasets and Hub
- The PyTorch stack
- GPU tricks (kernel fusion through `torch.compile`, optimizing tensor sizes)
- Mixed-precision training (a bare-bones example follows this list)
- Distributed training with `DistributedDataParallel`
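For instance, temperature and nucleus (top-p) sampling boil down to a few lines. This is a generic sketch, not necessarily identical to the implementation in `utilities`:

```python
# Generic sketch of temperature + nucleus (top-p) sampling from next-token
# logits; not necessarily the exact code used in this repo.
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.95) -> int:
    # Temperature rescales the logits: <1 sharpens, >1 flattens the distribution.
    probs = torch.softmax(logits / temperature, dim=-1)

    # Nucleus sampling: keep the smallest set of tokens whose cumulative
    # probability exceeds top_p, then renormalize and sample from that set.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # always keeps at least the top token
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    filtered = filtered / filtered.sum()

    choice = torch.multinomial(filtered, num_samples=1)
    return sorted_idx[choice].item()
```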
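Sequence packing just means concatenating all tokenized documents, separated by an end-of-text token, and slicing the resulting stream into fixed-length training examples. A minimal sketch, using the 1024 context length of the shipped checkpoint (the real preprocessing lives in `utilities`; the token and EOS ids here are made up):

```python
# Minimal sketch of sequence packing: concatenate tokenized documents with an
# EOS separator and cut the stream into fixed-length blocks.
from typing import Iterable

def pack_sequences(docs: Iterable[list[int]], block_size: int = 1024, eos_id: int = 0) -> list[list[int]]:
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # document boundary marker
    # Drop the trailing remainder that does not fill a whole block.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Example: three short "documents" packed into blocks of 8 tokens.
blocks = pack_sequences([[5, 6, 7], [8, 9], [10, 11, 12, 13]], block_size=8)
print(blocks)  # [[5, 6, 7, 0, 8, 9, 0, 10]]
```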
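And mixed-precision training in PyTorch is mostly a matter of wrapping the forward pass in an autocast context. A bare-bones sketch using bfloat16 (which needs no gradient scaling); whether this matches the exact setup in `train.py` is an assumption, and the model, data, and hyperparameters below are placeholders:

```python
# Bare-bones mixed-precision training step with torch.autocast and bfloat16.
# Model, data, and hyperparameters are placeholders, not the ones in train.py.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()          # stand-in for the GPT-2 model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

# Forward pass runs in bfloat16 where safe; parameters stay in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)

loss.backward()   # gradients are float32, so no GradScaler is needed with bf16
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```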
Possible future extensions:
- Optimizations found in newer models: MoE, MLA, RoPE, YaRN, SwiGLU, QK-Norm
- SFT, to turn the model into a chatbot
- Reinforcement learning, and allowing a "thinking" stage (although I doubt the model is smart enough to benefit from chain-of-thought)
- Attention Is All You Need (Transformer paper)
- Generating Wikipedia by Summarizing Long Sequences (Decoder-only transformer paper)
- Improving Language Understanding by Generative Pre-Training (GPT paper)
- Language Models are Unsupervised Multitask Learners (GPT-2 paper)
- Gaussian Error Linear Units (GELU paper)
- Decoupled Weight Decay Regularization (AdamW paper)
- Using the Output Embedding to Improve Language Models (Weight tying paper)
- The Curious Case of Neural Text Degeneration (Nucleus sampling paper)
- Andrej Karpathy’s Lectures and Tutorials