[Qwen3] Add 32b training configs #1690
Conversation
Curious why you picked FSDP=8, TP=4 when FSDP=16 seems to give better MFU?
```toml
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1
fsdp_reshard_after_forward = "default"  # default / never / always
tensor_parallel_degree = 4
```
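For context, a pure-FSDP layout like the "FSDP=16" setting mentioned in the question would only change these knobs. A minimal sketch, with illustrative degree values that are assumptions rather than settings from this PR (note that `data_parallel_shard_degree = -1` already means "shard across all remaining GPUs"):

```toml
# Hypothetical FSDP-only layout for comparison; the explicit degrees are
# illustrative assumptions, not settings taken from this PR.
data_parallel_replicate_degree = 1
data_parallel_shard_degree = 16         # shard parameters across 16 GPUs (FSDP=16)
fsdp_reshard_after_forward = "default"  # default / never / always
tensor_parallel_degree = 1              # no tensor parallelism
```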
Need to mention the total number of GPUs, similar to https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/train_configs/llama3_8b.toml#L1
```toml
[optimizer]
name = "AdamW"
lr = 3e-4
```
I'm not sure how far we'd like to go, but ideally a recommended config should have its lr and batch size verified by some loss-convergence runs. So it's about both perf and reasonable convergence.
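For concreteness, a sketch of the knobs such a convergence check would pin down, in the same style as the excerpt above. Only `lr = 3e-4` comes from this PR; the `[training]` key names and values are assumptions for illustration:

```toml
[optimizer]
name = "AdamW"
lr = 3e-4              # from this PR; to be validated by a convergence run

[training]
# Key names and values below are illustrative assumptions.
local_batch_size = 8   # per-GPU batch size; 8 GPUs x 8 = global batch size 64
seq_len = 4096
```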
For the learning rate, I checked and the official tech report doesn't mention a recommended learning rate. I will kick off a training job to verify convergence as well.
I checked the overall wall-clock time for each step; the total time for each profiler step is:
Comparing these 3 settings, I chose setting 2 because it runs faster overall with the same number of data samples, with acceptable MFU. But I would also like to let users choose whether they want better MFU/TFLOPS or the shortest total training time.
@wwwjn I think, naively, the better the MFU, the faster you train?
Agree: better TFLOPS (assuming the total FLOPs for a model are the same) means faster training. That would be simpler, so we will target optimizing TFLOPS/MFU (gbs 64 seems ok). Here's the FSDP=16 profiler: good comm/comp overlap, but CPU bound.
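For reference, the standard back-of-the-envelope relation behind this point (not from this thread), with $N$ the parameter count and roughly $6N$ training FLOPs per token: at a fixed token budget and GPU count, higher MFU directly translates into shorter wall-clock training time.

$$
\text{MFU} \approx \frac{\text{tokens/s} \times 6N}{\text{num GPUs} \times \text{peak FLOPs per GPU}}
$$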
```diff
@@ -0,0 +1,64 @@
+# NOTE: this toml config is a preset for 8 H100 GPUs.
```
FSDP=8 sounds good enough, although not all H100s have 96GB memory (we are using something like https://github.com/pytorch/torchtitan/blob/main/benchmarks/llama3_h100_202412_torchtitan.md?plain=1#L13) -- some of them only have 80GB. We can add that to the note here.
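One possible wording for the amended note, as a sketch (the 80GB mitigation hints are assumptions, not from this thread):

```toml
# NOTE: this toml config is a preset for 8 H100 GPUs with 96GB memory each.
# On 80GB H100s it may not fit as-is; consider more aggressive activation
# checkpointing or a larger GPU count.
```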
As titled.
Recommended training config with the highest TPS and MFU:

When `torch.compile` is enabled, FSDP=8 gives the following results:

Some other recommended training configs (no compile applied for the following benchmarking):
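For reference, a sketch of how `torch.compile` would be toggled between the compiled and non-compiled runs above. The section and key names are assumptions about torchtitan's config schema, which has changed over time, so verify against the current schema before copying:

```toml
# Illustrative only -- verify against the current torchtitan config schema.
[compile]
enable = true
```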