Releases: aws/sagemaker-hyperpod-cli

init experience Launch

24 Sep 19:51

Features

  • Init Experience
    • Init, Validate, and Create JumpStart endpoint, Custom endpoint, and PyTorch Training Job with local configuration
  • Cluster management
    • Bug fixes for cluster creation

Bug fixes

10 Sep 19:23
162fb79

Features

  • Fix for production canary failures caused by bad training job template.
  • New version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes.

Bug Fixes

28 Aug 00:14
5a346e8

  • Bug Fixes in cluster creation

v3.2.0 - Cluster Management and Init Experience

26 Aug 22:56
12730ca

Features

Cluster management

    Creation of cluster stack
    Describing and listing a cluster stack
    Updating a cluster

Init Experience

    Init, Validate, Create with local configurations

v3.1.0: Release tg (#209)

14 Aug 20:35
0fd2bef

v3.1.0 (2025-08-13)

Features

  • Task Governance feature for training jobs.

v.3.0.2

01 Aug 18:16
36fac66

Features

  • Update volume flag to support hostPath and PVC
  • Add an option to disable the deployment of KubeFlow TrainingOperator
  • Enable telemetry for CLI

v3.0.0

10 Jul 16:27
95096e8

Includes changes for

  • Training
  • Inference
  • Observability

New recipes support for DeepSeek's family of distilled R1 models

01 Feb 01:45
a15395f

What's Changed

New recipes

  • Added support for DeepSeek's family of distilled R1 models. Users can now fine-tune various sizes of DeepSeek-R1-Distill-Llama and DeepSeek-R1-Distill-Qwen using SFT and PEFT (LoRA/QLoRA).

SageMaker V2.0.0

04 Dec 16:14
bb25aed

The HyperPod CLI now supports HyperPod recipes.

The HyperPod recipes enable customers to get started training and fine-tuning popular publicly available foundation models like Llama 3.1 405B in minutes. Learn more: https://github.com/aws/sagemaker-hyperpod-recipes

Introducing job scheduling integration with SageMaker managed quota allocation policies

Learn more: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html

  1. New default scheduler type, “SageMaker”:
    • Autofills the command with an accessible SageMaker managed namespace
    • Autofills the command with the SageMaker managed queue name
    • Validates priority and namespace, and provides valid options if invalid values are detected
  2. Auto-discovery of the namespace across all HyperPod commands:
    • The namespace is resolved in the following order, from highest to lowest priority: the namespace provided as a CLI parameter; the namespace provided when connecting to the cluster; a namespace the system discovers automatically, identifying and configuring where SageMaker resources should operate without manual intervention.
  3. Get available clusters and total accelerator quota allocation per namespace:
    • Users can specify the namespace when invoking get-clusters; the HyperPod CLI then reads the corresponding cluster queue and displays the available/total accelerators allocated to the queue
  4. List jobs with priority:
    • List jobs now includes an extra attribute in each job summary showing the WorkloadPriorityClass specified for that job
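
The namespace precedence described in the auto-discovery item can be sketched as a small shell function. This is only an illustration of the stated ordering, not the CLI's actual implementation: the function name `resolve_namespace`, its three positional arguments, and the example namespace `hyperpod-ns-team-a` are all hypothetical.

```shell
# Hypothetical helper illustrating the precedence only; not part of the HyperPod CLI.
# $1: namespace passed as a CLI parameter (may be empty)
# $2: namespace given when connecting to the cluster (may be empty)
# $3: namespace found by auto-discovery (may be empty)
resolve_namespace() {
  cli_ns="$1"
  connect_ns="$2"
  discovered_ns="$3"
  if [ -n "$cli_ns" ]; then
    # Highest priority: explicit CLI parameter
    printf '%s\n' "$cli_ns"
  elif [ -n "$connect_ns" ]; then
    # Next: namespace chosen at connect time
    printf '%s\n' "$connect_ns"
  else
    # Fallback: whatever auto-discovery returned
    printf '%s\n' "$discovered_ns"
  fi
}

# No CLI parameter, but "default" was given at connect time, so it wins
# over the (made-up) auto-discovered namespace.
resolve_namespace "" "default" "hyperpod-ns-team-a"   # prints: default
```

Under this reading, connecting with an explicit namespace is enough to keep auto-discovery from ever being consulted, which matches the upgrade guidance in the note below.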

Important note:

In version 1.0, if the user did not explicitly specify a namespace parameter when running commands (e.g., submitting a job), the CLI automatically mapped the Kubernetes namespace to default. Starting with the 2.0 release, if no namespace parameter is specified, the HyperPod CLI auto-discovers the namespace the user has access to. To replicate the 1.0 behavior in 2.0, specify the default namespace when connecting to the cluster, which prevents the HyperPod CLI from auto-discovering. When submitting jobs, also override the default scheduler type by adding --scheduler-type Kueue in order to use Kueue. If you don't want to use a scheduler at all, set --scheduler-type None.

  1. Example of explicitly connecting to the cluster using the default namespace:
hyperpod connect-cluster --namespace default
  2. Example of using Kueue in version 2.0:
hyperpod start-job \
  --job-name my-training-job \
  --scheduler-type Kueue \
  --image my-docker-image:latest \
  --volume /data:/mnt/data
  3. Example of not using a scheduler in version 2.0:
hyperpod start-job \
  --job-name my-training-job \
  --scheduler-type None \
  --image my-docker-image:latest \
  --volume /data:/mnt/data

Helm Chart Changes

  1. Enhanced Helm chart support for team-level role association

SageMaker HyperPod CLI v1.0.0

10 Sep 00:05
f365f57

SageMaker HyperPod CLI is a command line tool that helps create and manage training jobs on the SageMaker HyperPod clusters orchestrated by Amazon EKS.

Data scientists can train foundation models using the EKS cluster set as the orchestrator for the SageMaker HyperPod cluster. Scientists use the SageMaker HyperPod CLI to find available SageMaker HyperPod clusters, submit training jobs (Pods), and manage their workloads. The SageMaker HyperPod CLI enables job submission using a training job schema file, and provides capabilities for job listing, description, cancellation, and execution. Scientists can use Kubeflow Training Operator, Kueue (a Kubernetes tool for job queuing), and SageMaker-managed MLflow to manage ML experiments and training runs.