
Commit a9ba8bf

Merge pull request #50 from aws/release-1.5.1
Release 1.5.1
2 parents 4a511a8 + 0413a71 commit a9ba8bf

6 files changed: +382 additions, -1 deletion


README.md

Lines changed: 4 additions & 1 deletion
@@ -19,7 +19,8 @@ Amazon SageMaker HyperPod recipes include built-in support for:
 - Supported Models: DeepSeek R1, DeepSeek R1 Distill Llama, DeepSeek R1 Distill Qwen, Llama, Mistral, Mixtral models, Nova Micro, Nova Lite, Nova Pro.
 - Model Evaluation: [Tensorboard](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.tensorboard.html#module-lightning.pytorch.loggers.tensorboard), [MLflow](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.mlflow.html), [Wandb](https://lightning.ai/docs/pytorch/stable/extensions/generated/lightning.pytorch.loggers.WandbLogger.html) - feel free to add any keyword arguments to the Logger classes by using their associated kwargs config

-###### ***Note: For DeepSeek R1 671b customers must ensure that their model repository contains weights of type bf16. DeepSeek's [HuggingFace repository](https://huggingface.co/deepseek-ai/DeepSeek-R1) contains the model in dtype fp8 by default. In order to convert a model repository from fp8 to bf16 we recommend using [this script](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo/blob/main/src/hyperpod_nemo_adapter/scripts/fp8_cast_bf16.py) and pointing your recipe to the output directory.
+###### ***Note: DeepSeek R1 671b customers must ensure that their model repository contains weights of type bf16. DeepSeek's [HuggingFace repository](https://huggingface.co/deepseek-ai/DeepSeek-R1) contains the model in dtype fp8 by default. In order to convert a model repository from fp8 to bf16 we recommend using [this script](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo/blob/main/src/hyperpod_nemo_adapter/scripts/fp8_cast_bf16.py) and pointing your recipe to the output directory.
+###### ***Note: GPT OSS customers are recommended to use the gpt-oss-patch image `658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:sm-pytorch_gpt_oss_patch_pt-2.7_cuda12.8` to support vllm-flash-attn3 and run the recipe as written. Per-device batch sizes > 1 are not currently supported.

 ## Model Support

@@ -116,6 +117,8 @@ Nova Pro | Model Distillation for Post-Training | - | - | 1 | ml
 | DeepSeek R1 Distill Qwen 2 | LoRA | 32b | 8192 | 2 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_32b_seq8k_gpu_lora.yaml) | [link](launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_32b_seq8k_gpu_lora.sh) |
 | DeepSeek R1 Distill Qwen 2 | SFT | 32b | 16384 | 6 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_32b_seq16k_gpu_fine_tuning.yaml) | [link](launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_32b_seq16k_gpu_fine_tuning.sh) |
 | DeepSeek R1 Distill Qwen 2 | LoRA | 32b | 16384 | 2 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_32b_seq16k_gpu_lora.yaml) | [link](launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_32b_seq16k_gpu_lora.sh) |
+| GPT OSS | LoRA | 20b | 16384 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/gpt_oss/hf_gpt_oss_20b_seq16k_gpu_lora.yaml) | [link](launcher_scripts/gpt_oss/run_hf_gpt_oss_20b_seq16k_gpu_lora.sh) |
+| GPT OSS | LoRA | 120b | 4096 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/gpt_oss/hf_gpt_oss_120b_seq4k_gpu_lora.yaml) | [link](launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh) |
 | Llama 3.1 | QLoRA | 405b | 131072 | 2 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/llama/hf_llama3_405b_seq128k_gpu_qlora.yaml) | [link](launcher_scripts/llama/run_hf_llama3_405b_seq128k_gpu_qlora.sh) |
 | Llama 3.1 | QLoRA | 405b | 32768 | 2 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/llama/hf_llama3_405b_seq32k_gpu_qlora.yaml) | [link](launcher_scripts/llama/run_hf_llama3_405b_seq32k_gpu_qlora.sh) |
 | Llama 3.1 | LoRA | 405b | 16384 | 6 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/fine-tuning/llama/hf_llama3_405b_seq16k_gpu_lora.yaml) | [link](launcher_scripts/llama/run_hf_llama3_405b_seq16k_gpu_lora.sh) |
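For reference, the fp8-to-bf16 conversion called out in the note above usually amounts to running the conversion script against a local copy of the DeepSeek R1 repository and then pointing the recipe at the converted output. A minimal sketch follows; the --input-fp8-hf-path/--output-bf16-hf-path flags mirror DeepSeek's upstream fp8_cast_bf16.py and the /fsx paths are placeholders, so treat both as assumptions rather than values taken from this commit:

# Assumed flag names and placeholder paths; adjust to your environment.
git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
python3 sagemaker-hyperpod-training-adapter-for-nemo/src/hyperpod_nemo_adapter/scripts/fp8_cast_bf16.py \
    --input-fp8-hf-path /fsx/models/DeepSeek-R1 \
    --output-bf16-hf-path /fsx/models/DeepSeek-R1-bf16

# Then point the recipe at the converted weights when launching:
export HF_MODEL_NAME_OR_PATH=/fsx/models/DeepSeek-R1-bf16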

launcher/nemo/stages.py

Lines changed: 10 additions & 0 deletions
@@ -303,6 +303,9 @@ def _make_launch_docker_container_text(self):
         if OmegaConf.select(self.cfg, "recipes.model.model_type", default=None) == "llama_v4":
             transformers_upgrade_cmd = "pip install transformers==4.51.3"
             post_launch_commands.append(transformers_upgrade_cmd)
+        if OmegaConf.select(self.cfg, "recipes.model.model_type", default=None) == "gpt_oss":
+            transformers_upgrade_cmd = "pip install transformers==4.55.0"
+            post_launch_commands.append(transformers_upgrade_cmd)

         launch_docker_container_text.append(f' "{image}" sleep infinity')
         launch_docker_container_text.append("")
@@ -429,6 +432,10 @@ def _make_train_script_text(self, stage_cfg_path=None, port=41000) -> str:
             transformers_upgrade_cmd = "pip install transformers==4.51.3"
             script_text.append("")
             script_text.append(transformers_upgrade_cmd)
+        if OmegaConf.select(self.cfg, "recipes.model.model_type", default=None) == "gpt_oss":
+            transformers_upgrade_cmd = "pip install transformers==4.55.0"
+            script_text.append("")
+            script_text.append(transformers_upgrade_cmd)

         script_text.append("")
         script_text.append(self._make_custom_call_string(stage_cfg_path))
@@ -768,6 +775,9 @@ def update_stage_specific_k8s_values(self, values_template):
         if OmegaConf.select(self.cfg, "recipes.model.model_type", default=None) == "llama_v4":
             transformers_upgrade_cmd = "pip install transformers==4.51.3"
             values_template.trainingConfig.pre_script.append(transformers_upgrade_cmd)
+        if OmegaConf.select(self.cfg, "recipes.model.model_type", default=None) == "gpt_oss":
+            transformers_upgrade_cmd = "pip install transformers==4.55.0"
+            values_template.trainingConfig.pre_script.append(transformers_upgrade_cmd)

         return values_template
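In practice, the new branches mean that any recipe with recipes.model.model_type set to gpt_oss gets a transformers pin appended to the container's post-launch commands, the generated train script, and the Kubernetes pre_script. The effect inside the training container is roughly the following; the version-check line is only an illustrative sanity check, not something the launcher emits:

# Step appended by the launcher for gpt_oss recipes:
pip install transformers==4.55.0
# Optional sanity check (illustrative only):
python3 -c "import transformers; print(transformers.__version__)"  # expect 4.55.0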

launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh

Lines changed: 28 additions & 0 deletions

@@ -0,0 +1,28 @@
#!/bin/bash

# Original Copyright (c), NVIDIA CORPORATION. Modifications © Amazon.com

# Users should set up their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

HF_MODEL_NAME_OR_PATH="${HF_MODEL_NAME_OR_PATH}" # HuggingFace pretrained model name or path
HF_ACCESS_TOKEN="${HF_ACCESS_TOKEN}" # Optional HuggingFace access token

TRAIN_DIR="${TRAIN_DIR}" # Location of training dataset
VAL_DIR="${VAL_DIR}" # Location of validation dataset

EXP_DIR="${EXP_DIR}" # Location to save experiment info including logging, checkpoints, etc.


HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=fine-tuning/gpt_oss/hf_gpt_oss_120b_seq4k_gpu_lora \
    container="658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:sm-pytorch_gpt_oss_patch_pt-2.7_cuda12.8" \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-gpt-oss-120b-lora" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.trainer.num_nodes=1 \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    recipes.model.hf_access_token="$HF_ACCESS_TOKEN" \
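HF_MODEL_NAME_OR_PATH can point either at a Hub id or at a local directory. If you prefer to pre-download the checkpoint before launching, something like the sketch below works; the Hub id openai/gpt-oss-120b and the /fsx paths are assumptions, not values taken from this commit:

# Pre-fetch the checkpoint and point the launcher at the local copy (sketch).
pip install -U "huggingface_hub[cli]"
huggingface-cli download openai/gpt-oss-120b --local-dir /fsx/models/gpt-oss-120b
export HF_MODEL_NAME_OR_PATH=/fsx/models/gpt-oss-120b
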
launcher_scripts/gpt_oss/run_hf_gpt_oss_20b_seq16k_gpu_lora.sh

Lines changed: 28 additions & 0 deletions

@@ -0,0 +1,28 @@
#!/bin/bash

# Original Copyright (c), NVIDIA CORPORATION. Modifications © Amazon.com

# Users should set up their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

HF_MODEL_NAME_OR_PATH="${HF_MODEL_NAME_OR_PATH}" # HuggingFace pretrained model name or path
HF_ACCESS_TOKEN="${HF_ACCESS_TOKEN}" # Optional HuggingFace access token

TRAIN_DIR="${TRAIN_DIR}" # Location of training dataset
VAL_DIR="${VAL_DIR}" # Location of validation dataset

EXP_DIR="${EXP_DIR}" # Location to save experiment info including logging, checkpoints, etc.


HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=fine-tuning/gpt_oss/hf_gpt_oss_20b_seq16k_gpu_lora \
    container="658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:sm-pytorch_gpt_oss_patch_pt-2.7_cuda12.8" \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-gpt-oss-20b-lora" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.trainer.num_nodes=1 \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    recipes.model.hf_access_token="$HF_ACCESS_TOKEN" \
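Both launcher scripts read their inputs from environment variables, so a typical invocation sets those first and then runs the script; the values below are placeholders (and the Hub id is an assumption), not values from this commit. Remember to set your cluster type in recipes_collection/config.yaml beforehand:

# Example invocation (placeholder paths; adjust to your cluster).
export HF_MODEL_NAME_OR_PATH=openai/gpt-oss-20b
export HF_ACCESS_TOKEN=hf_xxxxxxxxxxxxxxxx   # optional
export TRAIN_DIR=/fsx/data/gpt_oss/train
export VAL_DIR=/fsx/data/gpt_oss/val
export EXP_DIR=/fsx/experiments/gpt-oss-20b-lora
bash launcher_scripts/gpt_oss/run_hf_gpt_oss_20b_seq16k_gpu_lora.sh
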
recipes_collection/recipes/fine-tuning/gpt_oss/hf_gpt_oss_120b_seq4k_gpu_lora.yaml

Lines changed: 156 additions & 0 deletions

@@ -0,0 +1,156 @@
# Original Copyright (c), NVIDIA CORPORATION. Modifications © Amazon.com

# Basic run information configs
run:
  name: gpt-oss-120b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
  model_type: hf # huggingface for our recipes

# Basic pytorch lightning trainer config
trainer:
  devices: 8
  num_nodes: 1
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 1
  val_check_interval: 1
  limit_val_batches: 0 # Number of batches per each validation run, set to 0 to disable validation.

# Basic pytorch lightning experiment config
# Config for checkpoint/tensorboard etc
exp_manager:
  exp_dir: null
  name: experiment
  # experiment loggers
  create_tensorboard_logger: False
  summary_writer_kwargs: {"save_dir": "${recipes.exp_manager.exp_dir}/tensorboard"}
  create_mlflow_logger: False
  mlflow_logger_kwargs: {"tracking_uri": "${recipes.exp_manager.exp_dir}/mlflow"}
  create_wandb_logger: False
  wandb_logger_kwargs: {"save_dir": "${recipes.exp_manager.exp_dir}"} # wandb creates a wandb folder by default
  create_checkpoint_callback: True
  # Configs to save checkpoint with a fixed interval
  # Note: These configs will not work with auto checkpoint mode
  checkpoint_callback_params:
    # Set save_top_k = 0 to disable sharded checkpointing
    save_top_k: 0
    every_n_train_steps: 10
    monitor: "step"
    mode: "max"
    save_last: False
  checkpoint_dir: ${recipes.exp_manager.exp_dir}/checkpoints/
  resume_from_checkpoint: null
  # Enable auto_checkpoint to automatically calculate the checkpoint interval and resume from checkpoint
  auto_checkpoint:
    enabled: False
  export_full_model:
    # Set every_n_train_steps = 0 to disable full checkpointing
    every_n_train_steps: 0
    save_last: True

################# Predefined configs ##########################
use_smp_model: False # Enable sagemaker model parallelism
distributed_backend: nccl

# Model training configs
model:
  model_type: gpt_oss
  # Base configs
  train_batch_size: 1 # Batch sizes > 1 are not currently supported
  val_batch_size: 1
  seed: 12345
  grad_clip: 1.0
  log_reduced_training_loss: True

  # Memory saving / distributed training configs
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  context_parallel_degree: 1
  moe: False
  activation_checkpointing: True
  activation_loading_horizon: 2
  delayed_param: True
  offload_activations: False

  # FSDP Configs
  sharding_strategy: hybrid_shard
  forward_prefetch: True
  shard_degree: 8
  backward_fetch_policy: backward_pre
  auto_wrap_policy: transformer_auto_wrap_policy
  limit_all_gathers: true
  use_orig_param: False

  # FP8 config
  fp8: False
  fp8_amax_history_len: 1024
  fp8_amax_compute_algo: max

  # Model architecture
  max_context_width: 4096
  max_position_embeddings: ${.max_context_width} # 131072
  num_hidden_layers: 36
  hidden_size: 2880
  num_attention_heads: 64
  intermediate_size: 2880
  initializer_range: 0.02
  layernorm_epsilon: 1e-5
  vocab_size: 201088
  num_key_value_heads: 8
  rms_norm_eps: 1e-05
  use_flash_attention: False # Use the gpt-oss-patch container for kernels-community/vllm-flash-attn3
  sliding_window: 128
  use_sliding_window: True
  num_experts_per_tok: 4
  num_local_experts: 128
  moe_load_balancing: 'sinkhorn'
  global_token_shuffle: True
  moe_all_to_all_dispatcher: False
  rope_theta: 150000.0
  tie_word_embeddings: False

  # Finetuning config
  do_finetune: True
  # The path to resume from, needs to be HF compatible
  hf_model_name_or_path: null
  hf_access_token: null
  # PEFT config
  peft:
    peft_type: lora
    rank: 16
    alpha: 32
    dropout: 0.1
    target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]

  precision: ${recipes.trainer.precision}
  ################# End of Predefined configs ##########################

  # Learning rate and optimizer configs
  lr_decay_iters: ${recipes.trainer.max_steps}
  # Optimizer
  optim:
    name: adamw
    lr: 2e-4
    weight_decay: 0.01
    betas:
      - 0.9
      - 0.95
    sched:
      name: CosineAnnealing
      warmup_steps: 0
      constant_steps: 0
      min_lr: 2e-6

  # Data configs
  data:
    train_dir: null
    val_dir: null
    dataset_type: hf
    use_synthetic_data: False

  # Profiling configs
  # Viztracer profiling options
  viztracer:
    enabled: false
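Any of the recipe keys above can also be overridden from the launcher command line with the same Hydra override syntax the run scripts already use; since the scripts end with a trailing backslash, extra overrides can simply be appended. The lines below are an illustrative sketch of overrides added to the python3 main.py command in run_hf_gpt_oss_120b_seq4k_gpu_lora.sh; the values are examples, not tuned recommendations:

    recipes.trainer.max_steps=100 \
    recipes.model.peft.rank=32 \
    recipes.model.peft.alpha=64 \
    recipes.exp_manager.create_tensorboard_logger=True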
