@ananthsub ananthsub commented Sep 27, 2025

Part of #720

Support sample-based training, following the existing support in Megatron-LM. This PR makes the following changes:

  1. Include train_samples in the TrainingConfig
  2. Add sample-based fields to the LR Scheduler Config
  3. Validate mutual exclusivity between samples vs iters: users should fully specify either iteration-based training or sample-based training
  4. Update data loader logic to account for calculating samples
  5. Add a sample-based optimizer utility for distributed Adam with cosine annealing to match the utility offered for iteration-based training
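
As a rough illustration of items 3 to 5, the sketch below shows one way the exclusivity check, the sample-to-iteration conversion, and a sample-indexed cosine schedule could look. It is not the code from this PR: `SketchTrainingConfig`, `total_iters`, and `cosine_lr` are hypothetical names, and the field names `train_iters` and `global_batch_size` as well as the `warmup_samples` and `decay_samples` parameters are assumptions modeled on Megatron-LM conventions. Only `train_samples` is named in the description above. The sketch also assumes a constant global batch size with no batch-size ramp-up.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchTrainingConfig:
    """Hypothetical stand-in for a training config; not the real TrainingConfig."""

    global_batch_size: int
    train_iters: Optional[int] = None
    train_samples: Optional[int] = None

    def validate(self) -> None:
        # Exactly one of the two training-length modes must be specified.
        if (self.train_iters is None) == (self.train_samples is None):
            raise ValueError("Specify exactly one of train_iters or train_samples.")

    def total_iters(self) -> int:
        """Number of optimizer steps implied by the config."""
        self.validate()
        if self.train_iters is not None:
            return self.train_iters
        # Assumption in this sketch: constant global batch size, and a
        # trailing partial batch is dropped.
        return self.train_samples // self.global_batch_size


def cosine_lr(
    consumed_samples: int,
    max_lr: float,
    min_lr: float,
    warmup_samples: int,
    decay_samples: int,
) -> float:
    """Cosine annealing indexed by consumed samples rather than iterations."""
    if warmup_samples > 0 and consumed_samples < warmup_samples:
        # Linear warmup from 0 to max_lr.
        return max_lr * consumed_samples / warmup_samples
    if consumed_samples >= decay_samples:
        # Past the decay horizon the rate stays at its floor.
        return min_lr
    progress = (consumed_samples - warmup_samples) / (decay_samples - warmup_samples)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With these assumptions, a run configured with `train_samples=1_000_000` and `global_batch_size=256` takes 3906 optimizer steps, and the learning rate at any point depends only on how many samples have been consumed, so the schedule would not shift if the batch size were changed between runs.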

TODO: update docs

@ananthsub ananthsub requested a review from maanug-nv September 27, 2025 03:59
copy-pr-bot bot commented Sep 27, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ananthsub
Contributor Author

/ok to test b11340b

@ananthsub
Contributor Author

/ok to test bb40ad7

Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
@ananthsub ananthsub requested a review from maanug-nv October 6, 2025 20:21
@ananthsub
Contributor Author

/ok to test 4905b94

Contributor

@maanug-nv maanug-nv left a comment

this is great, thanks!

@ananthsub ananthsub enabled auto-merge (squash) October 6, 2025 20:59
@ananthsub
Contributor Author

/ok to test 2f1761c

@ananthsub ananthsub merged commit 1fa5fa1 into NVIDIA-NeMo:main Oct 7, 2025
43 of 46 checks passed
@ananthsub ananthsub deleted the train-samples branch October 7, 2025 15:32
paul-gibbons pushed a commit to paul-gibbons/Megatron-Bridge that referenced this pull request Oct 29, 2025
* support sample based training

Signed-off-by: Ananth Subramaniam <[email protected]>

* updates

Signed-off-by: Ananth Subramaniam <[email protected]>

* cleanup

Signed-off-by: Ananth Subramaniam <[email protected]>

* address feedback

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Paul Gibbons <[email protected]>
nv-mollys pushed a commit that referenced this pull request Oct 31, 2025
* support sample based training

Signed-off-by: Ananth Subramaniam <[email protected]>

* updates

Signed-off-by: Ananth Subramaniam <[email protected]>

* cleanup

Signed-off-by: Ananth Subramaniam <[email protected]>

* address feedback

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: mollys <[email protected]>