
Commit 3745a4c

Sagemaker Hyperpod Recipes Release 1.4.0 (#43)
1 parent 6a633c5 commit 3745a4c

94 files changed: 3,914 additions and 5 deletions

Some content is hidden: large commits have some content hidden by default.

.coveragerc

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 # Exclude submodule directory from coverage
 omit =
     launcher/nemo/nemo_framework_launcher/*
+    launcher/nova/constants/*
     template/*

 [report]

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -25,3 +25,4 @@ coverage_html_report/

 # Playground area
 mypg/
+.idea/

README.md

Lines changed: 42 additions & 4 deletions
@@ -14,9 +14,9 @@ Amazon SageMaker HyperPod recipes include built-in support for:
 - Automated distributed checkpointing
 - Distributed optimizer
 - Accelerators: NVIDIA H100 (ml.p5), NVIDIA A100 (ml.p4), and AWS Trainium (ml.trn1)
-- Fine-tuning: Full, QLoRA, LoRA, DPO
+- Fine-tuning: Full, QLoRA, LoRA, DPO, PPO
 - AWS Instances: ml.p5.48xlarge, ml.p4d.24xlarge, and ml.trn1.32xlarge instance families
-- Supported Models: DeepSeek R1, DeepSeek R1 Distill Llama, DeepSeek R1 Distill Qwen, Llama, Mistral, Mixtral models
+- Supported Models: DeepSeek R1, DeepSeek R1 Distill Llama, DeepSeek R1 Distill Qwen, Llama, Mistral, Mixtral models, Nova Micro, Nova Lite, Nova Pro.
 - Model Evaluation: [Tensorboard](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.tensorboard.html#module-lightning.pytorch.loggers.tensorboard), [MLflow](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.mlflow.html), [Wandb](https://lightning.ai/docs/pytorch/stable/extensions/generated/lightning.pytorch.loggers.WandbLogger.html) - feel free to add any key word arguments to the Logger classes by using their associated kwargs config

 ###### ***Note: For DeepSeek R1 671b customers must ensure that their model repository contains weights of type bf16. DeepSeek's [HuggingFace repository](https://huggingface.co/deepseek-ai/DeepSeek-R1) contains the model in dtype fp8 by default. In order to convert a model repository from fp8 to bf16 we recommend using [this script](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo/blob/main/src/hyperpod_nemo_adapter/scripts/fp8_cast_bf16.py) and pointing your recipe to the output directory.
@@ -60,18 +60,36 @@ List of specific pre-training recipes used by the launch scripts.
 | Hugging Face | Mixtral | 7b | 16384 | 32 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/training/mixtral/hf_mixtral_8x7b_seq16k_gpu_p5x32_pretrain.yaml) | [link](launcher_scripts/mixtral/run_hf_mixtral_8x7b_seq16k_gpu_p5x32_pretrain.sh) |
 | Hugging Face | Mixtral | 7b | 8192 | 16 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/training/mixtral/hf_mixtral_8x7b_seq8k_gpu_p5x16_pretrain.yaml) | [link](launcher_scripts/mixtral/run_hf_mixtral_8x7b_seq8k_gpu_p5x16_pretrain.sh) |
 | Hugging Face | Mixtral | 7b | 8192 | 32 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/training/mixtral/hf_mixtral_8x7b_seq8k_gpu_p5x32_pretrain.yaml) | [link](launcher_scripts/mixtral/run_hf_mixtral_8x7b_seq8k_gpu_p5x32_pretrain.sh) |
+| Amazon | Nova Micro | - | 8192 | 8 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/training/nova/nova_micro_p5x8_gpu_pretrain.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5x8_gpu_pretrain.sh) |
+| Amazon | Nova Lite | - | 8192 | 16 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/training/nova/nova_lite_p5x16_gpu_pretrain.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5x16_gpu_pretrain.sh) |
+| Amazon | Nova Pro | - | 8192 | 24 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/training/nova/nova_pro_p5x24_gpu_pretrain.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5x24_gpu_pretrain.sh) |


 ### Fine-Tuning

 List of specific fine-tuning recipes used by the launch scripts.
-All model sources are from Hugging Face.

 | Model | Method | Size | Sequence length | Nodes | Instance | Accelerator | Recipe | Script |
 | --------- | ------ | ---- | ----------------| ----- | -------------- | ----------- | ------ | ------ |
 | LLama 4 Scout | LoRA (multi-modal) | 17B 16E (109B) | 8192 | 2 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/llama/hf_llama4_17b_16e_seq8k_gpu_lora_multimodal_finetuning.yaml) | [link](launcher_scripts/llama/run_hf_llama4_17b_16e_seq8k_gpu_lora_multimodal_finetuning.sh) |
 | LLama 4 Scout | LoRA (multi-modal) | 17B 16E (109B) | 4096 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/llama/hf_llama4_17b_16e_seq4k_gpu_lora_multimodal_finetuning.yaml) | [link](launcher_scripts/llama/run_hf_llama4_17b_16e_seq4k_gpu_lora_multimodal_finetuning.sh) |
 | LLama 4 Scout | LoRA (text-only) | 17B 16E (109B) | 4096 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/llama/hf_llama4_17b_16e_seq4k_gpu_lora_text_to_text.yaml) | [link](launcher_scripts/llama/run_hf_llama4_17b_16e_seq4k_gpu_lora_text_to_text.sh) |
+| Nova Micro | Supervised Fine-Tuning (LoRA) | - | 65536 | 2 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_micro_p5_gpu_lora_sft.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_gpu_lora_sft.sh) |
+| Nova Micro | Supervised Fine-Tuning (Full) | - | 65536 | 2 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_micro_p5_gpu_sft.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_gpu_sft.sh) |
+| Nova Micro | Direct Preference Optimization (Full) | - | 32768 | 2 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_micro_p5_gpu_dpo.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_gpu_dpo.sh) |
+| Nova Micro | Direct Preference Optimization (LoRA) | - | 32768 | 2 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_micro_p5_gpu_lora_dpo.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_gpu_lora_dpo.sh) |
+| Nova Micro | Rewards Based Reinforcement Learning (PPO) | - | 8192 | 5 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_micro_p5_gpu_ppo.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_gpu_ppo.sh) |
+| Nova Lite | Supervised Fine-Tuning (LoRA) | - | 32768 | 4 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_lite_p5_gpu_lora_sft.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_gpu_lora_sft.sh) |
+| Nova Lite | Supervised Fine-Tuning (Full) | - | 65536 | 4 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_lite_p5_gpu_sft.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_gpu_sft.sh) |
+| Nova Lite | Direct Preference Optimization (Full) | - | 32768 | 4 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_lite_p5_gpu_dpo.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_gpu_dpo.sh) |
+| Nova Lite | Direct Preference Optimization (LoRA) | - | 32768 | 4 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_lite_p5_gpu_lora_dpo.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_gpu_lora_dpo.sh) |
+| Nova Lite | Rewards Based Reinforcement Learning (PPO) | - | 8192 | 6 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_lite_p5_gpu_ppo.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_gpu_ppo.sh) |
+| Nova Pro | Supervised Fine-Tuning (LoRA) | - | 65536 | 6 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_pro_p5_gpu_lora_sft.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_gpu_lora_sft.sh) |
+| Nova Pro | Supervised Fine-Tuning (Full) | - | 65536 | 6 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_pro_p5_gpu_sft.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_gpu_sft.sh) |
+| Nova Pro | Direct Preference Optimization (Full) | - | 32768 | 6 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_pro_p5_gpu_dpo.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_gpu_dpo.sh) |
+| Nova Pro | Direct Preference Optimization (LoRA) | - | 32768 | 6 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_pro_p5_gpu_lora_dpo.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_gpu_lora_dpo.sh) |
+| Nova Pro | Rewards Based Reinforcement Learning (PPO) | - | 8192 | 8 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/nova/nova_pro_p5_gpu_ppo.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_gpu_ppo.sh) |
+| Nova Pro | Model Distillation for Post-Training | - | - | 1 | ml.r5.24xlarge | - | [link](recipes_collection/recipes/fine-tuning/nova/nova_pro_r5_cpu_distill.yaml) | [link](launcher_scripts/nova/run_nova_pro_r5_cpu_distill.sh) |
 | DeepSeek R1 | QLoRA | 671b | 8192 | 2 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_qlora.yaml) | [link](launcher_scripts/deepseek/run_hf_deepseek_r1_671b_seq8k_gpu_qlora.sh) |
 | DeepSeek R1 | LoRA | 671b | 8192 | 5 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_lora.yaml) | [link](launcher_scripts/deepseek/run_hf_deepseek_r1_671b_seq8k_gpu_lora.sh) |
 | DeepSeek R1 Distill Llama 3 | SFT | 8b | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_llama_8b_seq8k_gpu_fine_tuning.yaml) | [link](launcher_scripts/deepseek/run_hf_deepseek_r1_llama_8b_seq8k_gpu_fine_tuning.sh) |
@@ -123,6 +141,24 @@ All model sources are from Hugging Face.
 | Llama 3 | SFT | 8b | 8192 | 1 | ml.trn1.32xlarge | TRN | [link](recipes_collection/recipes/fine-tuning/llama/hf_llama3_8b_seq8k_trn1_fine_tuning.yaml) | [link](launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1_fine_tuning.sh) |


+### Evaluation
+
+List of specific evaluation recipes used by the launch scripts.
+
+| Model | Method | Size | Sequence length | Nodes | Instance | Accelerator | Recipe | Script |
+| --------- | ------ | ---- | ----------------| ----- | -------------- | ----------- | ------ | ------ |
+| Nova Micro | General Text Benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_micro_p5_48xl_general_text_benchmark_eval.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_48xl_general_text_benchmark_eval.sh) |
+| Nova Micro | Bring your own dataset (gen_qa) benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_micro_p5_48xl_bring_your_own_dataset_eval.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_48xl_bring_your_own_dataset_eval.sh) |
+| Nova Micro | Nova LLM as a Judge Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_micro_p5_48xl_llm_judge_eval.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_48xl_llm_judge_eval.sh) |
+| Nova Lite | General Text Benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_lite_p5_48xl_general_text_benchmark_eval.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_48xl_general_text_benchmark_eval.sh) |
+| Nova Lite | Bring your own dataset (gen_qa) benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_lite_p5_48xl_bring_your_own_dataset_eval.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_48xl_bring_your_own_dataset_eval.sh) |
+| Nova Lite | Nova LLM as a Judge Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_lite_p5_48xl_llm_judge_eval.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_48xl_llm_judge_eval.sh) |
+| Nova Lite | Multi-Modal Benchmarks | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_lite_p5_48xl_general_multi_modal_benchmark_eval.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_48xl_general_multi_modal_benchmark_eval.sh) |
+| Nova Pro | General Text Benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_pro_p5_48xl_general_text_benchmark_eval.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_48xl_general_text_benchmark_eval.sh) |
+| Nova Pro | Bring your own dataset (gen_qa) benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_pro_p5_48xl_bring_your_own_dataset_eval.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_48xl_bring_your_own_dataset_eval.sh) |
+| Nova Pro | Nova LLM as a Judge Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_pro_p5_48xl_llm_judge_eval.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_48xl_llm_judge_eval.sh) |
+| Nova Pro | Multi-Modal Benchmarks | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_pro_p5_48xl_general_multi_modal_benchmark_eval.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_48xl_general_multi_modal_benchmark_eval.sh) |
+
 ## Installation

 Amazon SageMaker HyperPod recipes should be installed on the head node of your HyperPod cluster or on your local machine with a virtual python environment.
@@ -143,7 +179,7 @@ which includes popular publicly-available models like Llama or Mistral. Based on
 needs, you might need to modify the parameters defined in the recipes for
 pre-training or fine-tuning. Once your configurations are setup, you can run training on SageMaker
 HyperPod (with Slurm or Amazon EKS) for workload orchestration. Alternatively, you can run the recipe on
-SageMaker training jobs using the Amazon SageMaker Python SDK.
+SageMaker training jobs using the Amazon SageMaker Python SDK. Note that Amazon Nova model recipes are only compatible with SageMaker HyperPod with Amazon EKS and SageMaker training jobs.

 ### Running a recipe via a Slurm job on a SageMaker HyperPod cluster

@@ -220,6 +256,7 @@ hyperpod start-job --recipe training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrai
 "cluster_type": "k8s"
 }'
 ```
+To run Amazon Nova recipe on SageMaker HyperPod clusters orchestrated by Amazon EKS, you will need to create a Restricted Instance Group in your cluster. Refer to the following documentation to [learn more](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-cluster.html).

 ### Running a recipe on SageMaker training jobs

@@ -300,6 +337,7 @@ Running the above code creates a `PyTorch` estimator object with the specified t
 and then trains the model using the `fit()` method. The new `training_recipe` parameter enables you
 to specify the recipe you want to use.

+To learn more about running Amazon Nova recipe on SageMaker training job, refer to [this documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-training-job.html).

 ## Troubleshooting
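
For reference, the `training_recipe` flow that this README hunk extends can be sketched with the SageMaker Python SDK roughly as follows. This is an illustrative sketch, not part of the commit: the IAM role, S3 output path, and override values are hypothetical placeholders, and the recipe path is the Llama pre-training recipe already referenced earlier in this README. It assumes a SageMaker Python SDK version that supports `training_recipe`.

```python
# Illustrative sketch (not part of this commit): launching a recipe on a
# SageMaker training job via the SDK's `training_recipe` parameter.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    # Placeholder execution role and output bucket -- substitute your own.
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    output_path="s3://my-bucket/recipe-runs",
    instance_type="ml.p5.48xlarge",
    # Recipe path as listed in the pre-training table of this README.
    training_recipe="training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain",
    # Optional per-run overrides of recipe keys (hypothetical values).
    recipe_overrides={"trainer": {"max_steps": 50}},
)
estimator.fit()  # builds the training job from the recipe and starts it
```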

launcher/nemo/stages.py

Lines changed: 1 addition & 1 deletion
@@ -13,9 +13,9 @@
 # Portions taken from https://github.com/NVIDIA/NeMo-Framework-Launcher, Copyright Nvidia Corporation


-from ast import literal_eval
 import logging
 import shutil
+from ast import literal_eval
 from pathlib import Path
 from typing import Dict, List

launcher/nova/__init__.py

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"). You
+# may not use this file except in compliance with the License. A copy of
+# the License is located at
+#
+# http://aws.amazon.com/apache2.0/
+#
+# or in the "license" file accompanying this file. This file is
+# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
+# ANY KIND, either express or implied. See the License for the specific
+# language governing permissions and limitations under the License.

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+INIT_CONTAINER_REGION_ACCOUNT_MAP = {"us-east-1": "708977205387"}
+INIT_CONTAINER_IMAGE_URI = "{account_id}.dkr.ecr.{region}.amazonaws.com/init-container-repo:latest"
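
The two constants above are format-string templates keyed by region. As a minimal illustrative sketch (not part of the commit), a concrete image URI could be resolved from them as shown below; the launcher code that actually performs this lookup is not visible in this diff.

```python
# Illustrative only: resolve the init-container image URI from the templates above.
region = "us-east-1"
account_id = INIT_CONTAINER_REGION_ACCOUNT_MAP[region]  # "708977205387"
image_uri = INIT_CONTAINER_IMAGE_URI.format(account_id=account_id, region=region)
# -> "708977205387.dkr.ecr.us-east-1.amazonaws.com/init-container-repo:latest"
```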
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+from enum import Enum
+
+ACTOR_GENERATION_REGION_ACCOUNT_MAP = {"us-east-1": "708977205387"}
+
+
+class JobType(Enum):
+    REWARD_MODEL = "rm"
+    CRITIC_MODEL = "cm"
+    ANCHOR_MODEL = "am"
+    ACTOR_GENERATION = "ag"
+    ACTOR_TRAIN = "at"
+
+
+JOB_TYPE_DICT = {
+    JobType.REWARD_MODEL: "ppo_reward",
+    JobType.CRITIC_MODEL: "ppo_critic",
+    JobType.ANCHOR_MODEL: "ppo_anchor",
+    JobType.ACTOR_GENERATION: "ppo_actor_generation",
+    JobType.ACTOR_TRAIN: "ppo_actor_train",
+}
+JOB_TASK_TYPE_DICT = {
+    JobType.REWARD_MODEL: "ppo_rm",
+    JobType.CRITIC_MODEL: "ppo_cm",
+    JobType.ANCHOR_MODEL: "ppo_anchor",
+    JobType.ACTOR_GENERATION: "ppo_actor_gen",
+    JobType.ACTOR_TRAIN: "ppo_actor_train",
+}
+KEYS_TO_REMOVE = ["actor_train_replicas", "rm_replicas", "cm_replicas", "am_replicas"]
+ACTOR_GENERATION_CONTAINER_IMAGE = "{account_id}.dkr.ecr.{region}.amazonaws.com/nova-fine-tune-repo:SMHP-PPO-TRT-latest"
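
These constants describe the five cooperating components of a Nova PPO run (reward model, critic, anchor, actor generation, and actor training). As an illustrative sketch only (not part of the commit), the mappings could be used as shown below to derive per-component identifiers; the helper function and base name are hypothetical.

```python
# Illustrative only: derive per-component names for a PPO run from JOB_TYPE_DICT.
def ppo_component_names(base_name: str) -> dict:
    """Return {JobType: '<base_name>-<suffix>'} for every PPO component."""
    return {job: f"{base_name}-{suffix}" for job, suffix in JOB_TYPE_DICT.items()}

names = ppo_component_names("nova-ppo-run")
# e.g. names[JobType.ACTOR_TRAIN] == "nova-ppo-run-ppo_actor_train"
```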
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+apiVersion: v2
+appVersion: "1.0"
+description: Sagemaker Model Training
+name: sagemaker-training
+version: 1.0.0
