Performance issues for SFT with megatron backend #917

@Kipok

Describe the bug

SFT jobs with the megatron backend run much slower than expected.

Two issues we've identified so far:

  1. The step itself is much slower than the corresponding configuration in nemo-aligner. There seems to be some problem with DP comms.
  2. The GPUs are idle between steps. This idle time, with no GPU activity, accounts for 25% to 50% of total job time.
[Image: nsys profile]
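
For anyone who wants to put a number on the between-step idle time in their own runs, here is a minimal sketch. It assumes you already have per-step (start, end) wall-clock timestamps, for example from training logs or an exported trace. The helper name `idle_fraction` and the sample timestamps are hypothetical and are not taken from the repro.

```python
from typing import List, Tuple


def idle_fraction(steps: List[Tuple[float, float]]) -> float:
    """Return the share of wall-clock time spent outside any step.

    `steps` holds (start, end) times in seconds, one pair per training step.
    Assumes the intervals do not overlap.
    """
    if not steps:
        raise ValueError("need at least one step interval")
    ordered = sorted(steps)
    # Total span runs from the first step's start to the last step's end.
    total = ordered[-1][1] - ordered[0][0]
    if total <= 0:
        raise ValueError("step intervals must cover a positive time span")
    # Time spent inside steps; everything else counts as idle gaps.
    busy = sum(end - start for start, end in ordered)
    return (total - busy) / total


# Made-up timings: four 3-second steps separated by 2-second gaps.
example_steps = [(0.0, 3.0), (5.0, 8.0), (10.0, 13.0), (15.0, 18.0)]
print(f"idle fraction: {idle_fraction(example_steps):.2%}")
```

This only measures gaps between steps (issue 2 above); it says nothing about why the step itself is slow.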

Steps/Code to reproduce bug

We see this in multiple SFT jobs, but the nsys profile above comes from the following configuration:

1.5B model with tp=1, cp=16, 2 nodes, 128k context length, and sequence packing enabled, using the megatron backend.

We also see this with 16 nodes and 48k context length.

We can provide a full repro script internally.

Labels

Performance, bug, mcore, research
