@ananthsub
Contributor

fix #794

@ananthsub ananthsub requested a review from yaoyu-33 September 26, 2025 23:02
@copy-pr-bot

copy-pr-bot bot commented Sep 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ananthsub ananthsub changed the title from "cleanup process group at end of train" to "Conditionally destroy process group at end of train" Sep 26, 2025
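
The diff itself is not shown in this conversation, so the following is only a rough sketch of what "conditionally destroy" could mean, inferred from the PR title and the commit messages ("cleanup process group at end of train", "barrier", "self-contained within pretrain"). The idea illustrated: record whether the global process group already existed before training started, and tear it down afterwards only if it did not. The function name `run_with_conditional_cleanup`, the `dist` parameter, and the `FakeDist` stand-in are all hypothetical and are not taken from Megatron-Bridge; the only real API assumed is `torch.distributed.is_initialized`, `barrier`, and `destroy_process_group`.

```python
def run_with_conditional_cleanup(train_fn, dist=None):
    """Run train_fn, then destroy the process group only if we own it.

    Hypothetical helper: `dist` defaults to torch.distributed but can be
    any object exposing is_initialized / barrier / destroy_process_group.
    """
    if dist is None:
        import torch.distributed as dist  # imported lazily; needs PyTorch

    # If the group exists before training, the caller created it and
    # remains responsible for tearing it down.
    caller_owns_group = dist.is_initialized()

    result = train_fn()

    if not caller_owns_group and dist.is_initialized():
        dist.barrier()  # wait for all ranks before teardown
        dist.destroy_process_group()
    return result


class FakeDist:
    """Stand-in for torch.distributed so the sketch runs without GPUs."""

    def __init__(self, initialized=False):
        self.initialized = initialized
        self.calls = []

    def is_initialized(self):
        return self.initialized

    def init_process_group(self):
        self.initialized = True

    def barrier(self):
        self.calls.append("barrier")

    def destroy_process_group(self):
        self.calls.append("destroy")
        self.initialized = False


# Case 1: training initializes the group itself, so it is cleaned up.
own = FakeDist(initialized=False)
run_with_conditional_cleanup(own.init_process_group, dist=own)

# Case 2: the caller initialized the group beforehand, so it is left alone.
external = FakeDist(initialized=True)
run_with_conditional_cleanup(lambda: None, dist=external)
```

Passing `dist` explicitly is purely a convenience for trying the logic without a distributed launch; in a real job the default `torch.distributed` module would be used.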
@ananthsub
Contributor Author

/ok to test 10b69ee

@ananthsub
Contributor Author

/ok to test a0512a2

@ananthsub
Contributor Author

/ok to test 3aac03f

Signed-off-by: Ananth Subramaniam <[email protected]>
@ananthsub
Contributor Author

/ok to test e2f2a60

@ananthsub ananthsub removed the request for review from yaoyu-33 October 10, 2025 20:30
@ananthsub ananthsub merged commit 2835f7d into NVIDIA-NeMo:main Oct 13, 2025
44 of 46 checks passed
@ananthsub ananthsub deleted the cond-destroy-pg branch October 13, 2025 18:00
paul-gibbons pushed a commit to paul-gibbons/Megatron-Bridge that referenced this pull request Oct 29, 2025
* cleanup process group at end of train

Signed-off-by: Ananth Subramaniam <[email protected]>

* barrier

Signed-off-by: Ananth Subramaniam <[email protected]>

* self-contained within pretrain

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Paul Gibbons <[email protected]>
nv-mollys pushed a commit that referenced this pull request Oct 31, 2025
* cleanup process group at end of train

Signed-off-by: Ananth Subramaniam <[email protected]>

* barrier

Signed-off-by: Ananth Subramaniam <[email protected]>

* self-contained within pretrain

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: mollys <[email protected]>
Development

Successfully merging this pull request may close these issues.

Automatic global process group cleanup post training