- Supported Models: DeepSeek R1, DeepSeek R1 Distill Llama, DeepSeek R1 Distill Qwen, Llama, Mistral, and Mixtral models, as well as Nova Micro, Nova Lite, and Nova Pro.
- Model Evaluation: [TensorBoard](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.tensorboard.html#module-lightning.pytorch.loggers.tensorboard), [MLflow](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.mlflow.html), [Wandb](https://lightning.ai/docs/pytorch/stable/extensions/generated/lightning.pytorch.loggers.WandbLogger.html) - feel free to add any keyword arguments to the Logger classes by using their associated kwargs config.
###### ***Note:*** For DeepSeek R1 671B, customers must ensure that their model repository contains weights of type bf16. DeepSeek's [HuggingFace repository](https://huggingface.co/deepseek-ai/DeepSeek-R1) contains the model in dtype fp8 by default. To convert a model repository from fp8 to bf16, we recommend using [this script](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo/blob/main/src/hyperpod_nemo_adapter/scripts/fp8_cast_bf16.py) and pointing your recipe to the output directory.
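
For example, a conversion run could look like the following sketch. The CLI flag names and the local model paths are assumptions, so verify them against the script's argument parser before use:

```python
# Convert a DeepSeek R1 fp8 checkpoint to bf16 so the recipe can consume it.
# Flag names and paths below are assumptions; verify against fp8_cast_bf16.py.
import subprocess

subprocess.run(
    [
        "python",
        "src/hyperpod_nemo_adapter/scripts/fp8_cast_bf16.py",
        "--input-fp8-hf-path", "/fsx/models/DeepSeek-R1",         # fp8 weights downloaded from Hugging Face
        "--output-bf16-hf-path", "/fsx/models/DeepSeek-R1-bf16",  # bf16 output; point your recipe here
    ],
    check=True,
)
```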
List of specific evaluation recipes used by the launch scripts.

Model | Task | Size | Sequence length | Nodes | Instance | Accelerator | Recipe | Script |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
Nova Micro | General Text Benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_micro_p5_48xl_general_text_benchmark_eval.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_48xl_general_text_benchmark_eval.sh) |
Nova Micro | Bring your own dataset (gen_qa) benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_micro_p5_48xl_bring_your_own_dataset_eval.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_48xl_bring_your_own_dataset_eval.sh) |
Nova Micro | Nova LLM as a Judge Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_micro_p5_48xl_llm_judge_eval.yaml) | [link](launcher_scripts/nova/run_nova_micro_p5_48xl_llm_judge_eval.sh) |
Nova Lite | General Text Benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_lite_p5_48xl_general_text_benchmark_eval.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_48xl_general_text_benchmark_eval.sh) |
Nova Lite | Bring your own dataset (gen_qa) benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_lite_p5_48xl_bring_your_own_dataset_eval.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_48xl_bring_your_own_dataset_eval.sh) |
Nova Lite | Nova LLM as a Judge Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_lite_p5_48xl_llm_judge_eval.yaml) | [link](launcher_scripts/nova/run_nova_lite_p5_48xl_llm_judge_eval.sh) |
Nova Pro | General Text Benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_pro_p5_48xl_general_text_benchmark_eval.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_48xl_general_text_benchmark_eval.sh) |
Nova Pro | Bring your own dataset (gen_qa) benchmark Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_pro_p5_48xl_bring_your_own_dataset_eval.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_48xl_bring_your_own_dataset_eval.sh) |
Nova Pro | Nova LLM as a Judge Recipe | - | 8192 | 1 | ml.p5.48xlarge | GPU H100 | [link](recipes_collection/recipes/evaluation/nova/nova_pro_p5_48xl_llm_judge_eval.yaml) | [link](launcher_scripts/nova/run_nova_pro_p5_48xl_llm_judge_eval.sh) |
Amazon SageMaker HyperPod recipes should be installed on the head node of your HyperPod cluster or on your local machine in a Python virtual environment.
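
A minimal setup sketch follows; the repository URL and requirements file layout are assumptions, and on a cluster head node you would typically run the equivalent `git`/`pip` commands directly:

```python
# Clone the recipes repository and install its dependencies into a virtual
# environment. Repository URL and file layout are assumptions; adapt as needed.
import subprocess
import sys

repo = "https://github.com/aws/sagemaker-hyperpod-recipes.git"
subprocess.run(["git", "clone", "--recursive", repo], check=True)

# Create an isolated virtual environment for the launcher and its requirements.
subprocess.run([sys.executable, "-m", "venv", "sagemaker-hyperpod-recipes/.venv"], check=True)
subprocess.run(
    ["sagemaker-hyperpod-recipes/.venv/bin/pip", "install", "-r",
     "sagemaker-hyperpod-recipes/requirements.txt"],
    check=True,
)
```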
The recipes collection includes popular publicly available models like Llama or Mistral. Based on your needs, you might need to modify the parameters defined in the recipes for pre-training or fine-tuning. Once your configurations are set up, you can run training on SageMaker HyperPod (with Slurm or Amazon EKS) for workload orchestration. Alternatively, you can run a recipe on SageMaker training jobs using the Amazon SageMaker Python SDK. Note that Amazon Nova model recipes are only compatible with SageMaker HyperPod with Amazon EKS and with SageMaker training jobs.

### Running a recipe via a Slurm job on a SageMaker HyperPod cluster
To run an Amazon Nova recipe on SageMaker HyperPod clusters orchestrated by Amazon EKS, you will need to create a Restricted Instance Group in your cluster. Refer to [this documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-cluster.html) to learn more.
### Running a recipe on SageMaker training jobs
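
As a minimal sketch of such a setup (the recipe name, node count, and input channel locations are assumptions to replace with your own), an estimator using the `training_recipe` parameter can look like this:

```python
# A minimal sketch of launching a recipe as a SageMaker training job via the
# Python SDK. The recipe name, instance settings, and input channel locations
# are assumptions -- substitute values for your account and chosen recipe.
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  # or pass an explicit IAM role ARN

estimator = PyTorch(
    base_job_name="recipe-training",
    role=role,
    instance_type="ml.p5.48xlarge",
    instance_count=16,  # match the node count your chosen recipe expects
    sagemaker_session=sagemaker_session,
    # Example recipe name; pick one from the recipes collection referenced above.
    training_recipe="training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain",
)

# Point the channels at your own S3 (or FSx for Lustre) locations.
estimator.fit(
    inputs={"train": "s3://<bucket>/train", "val": "s3://<bucket>/val"},
    wait=True,
)
```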
Running the above code creates a `PyTorch` estimator object with the specified training recipe and then trains the model using the `fit()` method. The new `training_recipe` parameter enables you to specify the recipe you want to use.

To learn more about running Amazon Nova recipes on SageMaker training jobs, refer to [this documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-training-job.html).