Closed
Labels: Performance (related to improving performance), bug (something isn't working), mcore, research (tag for research team's issues)
Description
Describe the bug
SFT jobs with the Megatron backend run much slower than expected.
We have identified two issues so far:
- Each step is much slower than with the corresponding configuration in nemo-aligner. This appears to be a problem with data-parallel (DP) communication.
- The GPUs sit idle between steps. These gaps with no GPU activity account for 25% to 50% of total job time.
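As a rough illustration of how the idle share in the second item could be quantified, here is a minimal sketch. It assumes per-step start and end timestamps (in seconds) are available, for example from training logs or an nsys timeline, and that steps do not overlap. The function name `idle_fraction` and the sample timestamps are hypothetical and are not taken from the job in this issue.

```python
from typing import List, Tuple


def idle_fraction(step_intervals: List[Tuple[float, float]]) -> float:
    """Return the share of wall-clock time spent outside any training step.

    step_intervals holds (start, end) timestamps in seconds, one pair per
    step. Assumes the intervals do not overlap (hypothetical input format).
    """
    if not step_intervals:
        raise ValueError("need at least one step interval")
    intervals = sorted(step_intervals)
    # Wall-clock span from the start of the first step to the end of the last.
    total = intervals[-1][1] - intervals[0][0]
    if total <= 0:
        raise ValueError("intervals must cover a positive time span")
    # Time actually spent inside steps.
    busy = sum(end - start for start, end in intervals)
    return (total - busy) / total


# Made-up example: four 3-second steps separated by 1-second gaps.
example = [(0.0, 3.0), (4.0, 7.0), (8.0, 11.0), (12.0, 15.0)]
print(f"idle fraction: {idle_fraction(example):.2f}")
```

Comparing this number across configurations (for example 2 nodes vs. 16 nodes) may help separate the between-step gaps from the slowdown inside each step.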
Steps/Code to reproduce bug
We see this in multiple SFT jobs. The nsys profile above was captured with the following configuration:
1.5B model, tp=1, cp=16, 2 nodes, 128k context length, sequence packing enabled, Megatron backend.
We also see this with 16 nodes and a 48k context length.
We can provide the full repro script internally.