
Commit cac49dc

Malay/qwen3 2509 (#729)

* qwen3 perf configs
* exp name
* dependencies
* qwen3 235b a22b
* fsdp cfgs
* recompute cfgs
* moe router fusion
* 405b cfg
* mbs
* no CG dsv3, qwen3 235b b200
* dsv3 layout
* fsdp args
* llama3 fsdp args
* moe token drop
* fsdp double buffer
* vp 2 dsv3
* avg in collective, grad accum fusion
* avg in collective
* cleanup
* allow grad accum fusion
* expandable segments 405b gb200 fsdp
* fix ut

Signed-off-by: Malay Nagda <[email protected]>
Signed-off-by: malay-nagda <[email protected]>
Signed-off-by: gkollu <[email protected]>
Co-authored-by: gautham-kollu <[email protected]>
1 parent 6941a93 commit cac49dc

File tree

12 files changed: +365 −83 lines


scripts/performance/configs/deepseek/deepseek_v3_llm_pretrain.yaml

Lines changed: 4 additions & 7 deletions
@@ -45,9 +45,6 @@ ConfigContainer:
 
   mixed_precision:
     grad_reduce_in_fp32: false
-
-  comm_overlap:
-    overlap_grad_reduce: false # TODO: enable when 1F1B_A2A is fixed in Megatron-LM
 
   profiling:
     # For optional fields in the config, specify the target to instantiate the object.
@@ -74,7 +71,7 @@ perf_matrix:
         vp: 4
         ep: 64
         etp: 1
-        fsdp: false
+        use_megatron_fsdp: false
       bf16:
       fp8_cs:
       fp8_ss:
@@ -91,7 +88,7 @@ perf_matrix:
         vp: null
         ep: 8
         etp: 1
-        fsdp: false
+        use_megatron_fsdp: false
       bf16:
         cuda_graphs: false
       fp8_cs:
@@ -111,8 +108,8 @@ perf_matrix:
         vp: 8
         ep: 64
         etp: 1
-        fsdp: false
-        cuda_graphs: true
+        use_megatron_fsdp: false
+        cuda_graphs: false
       bf16:
       fp8_cs:
       fp8_mx:
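
Across these configs, the per-precision `fsdp` flag is renamed to `use_megatron_fsdp`. Each `perf_matrix` is keyed by platform → GPU count → precision, with a `common` block that the precision blocks override. A minimal Python sketch of how such an entry could be resolved — a hypothetical helper for illustration, not the actual loader in `scripts/performance`:

```python
# Hypothetical resolver for the perf_matrix layout shown above; the real
# loader in scripts/performance may be organized differently.
import yaml

def resolve_perf_entry(cfg_path: str, platform: str, num_gpus: int, precision: str) -> dict:
    """Merge a perf_matrix `common` block with one precision-specific block."""
    with open(cfg_path) as f:
        doc = yaml.safe_load(f)
    bucket = doc["perf_matrix"][platform][f"num_gpus_{num_gpus}"]
    merged = dict(bucket["common"])
    # Empty precision blocks (e.g. a bare `bf16:`) parse as None, hence `or {}`.
    merged.update(bucket.get(precision) or {})
    return merged
```

With the rename, a resolved entry carries `use_megatron_fsdp` rather than `fsdp`, so downstream code can map it directly onto the DDP configuration.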

scripts/performance/configs/llama3/llama3_70b_llm_pretrain.yaml

Lines changed: 21 additions & 21 deletions
@@ -19,9 +19,7 @@
 ConfigContainer:
   model:
     cross_entropy_fusion_impl: "te"
-    enable_cuda_graph: false
-    use_te_rng_tracker: false
-    use_transformer_engine_op_fuser: true
+    use_transformer_engine_op_fuser: false
 
   train:
     train_iters: 25
@@ -40,8 +38,6 @@ ConfigContainer:
   ddp:
     check_for_nan_in_grad: false
     check_for_large_grads: false
-    use_megatron_fsdp: true
-    use_torch_fsdp2: false
 
   mixed_precision:
     grad_reduce_in_fp32: false
@@ -69,7 +65,7 @@ perf_matrix:
         vp: 5
         ep: 1
         etp: null
-        fsdp: false
+        use_megatron_fsdp: false
       bf16:
         pp: 4
         cp: 2
@@ -93,60 +89,64 @@ perf_matrix:
         pp: 4
         cp: 2
         vp: 5
-        fsdp: false
+        use_megatron_fsdp: false
         cuda_graphs: true
       fp8_ds:
         tp: 1
         pp: 1
         cp: 1
-        vp: 1
-        fsdp: true
+        vp: null
+        use_megatron_fsdp: true
         cuda_graphs: false
         recompute_num_layers: 5
       fp8_cs:
         tp: 1
         pp: 1
         cp: 1
-        vp: 1
-        fsdp: true
+        vp: null
+        use_megatron_fsdp: true
         cuda_graphs: false
         recompute_num_layers: 5
       fp8_mx:
         tp: 2
         pp: 4
         cp: 1
         vp: 5
-        fsdp: false
+        use_megatron_fsdp: false
         cuda_graphs: false
   gb200:
     num_gpus_64:
       common:
         num_gpus_per_node: 4
         seq_length: 8192
-        mbs: 1
         gbs: 128
         cuda_graphs: false
         cp: 1
         ep: 1
         etp: null
       bf16:
+        mbs: 1
         tp: 1
         pp: 1
-        vp: 1
-        fsdp: true
-        recompute_num_layers: 20
+        vp: null
+        use_megatron_fsdp: true
+        cpu_offloading_num_layers: 20
       fp8_ds:
+        mbs: 1
         tp: 1
         pp: 1
-        vp: 1
-        fsdp: true
+        vp: null
+        use_megatron_fsdp: true
       fp8_cs:
+        mbs: 2
         tp: 1
         pp: 1
-        vp: 1
-        fsdp: true
+        vp: null
+        use_megatron_fsdp: true
+        cpu_offloading_num_layers: 40
       fp8_mx:
+        mbs: 1
         tp: 2
         pp: 4
         vp: 5
-        fsdp: false
+        use_megatron_fsdp: false
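
With `use_megatron_fsdp`/`use_torch_fsdp2` dropped from the global `ddp:` block, the FSDP choice now travels with each `perf_matrix` entry. A sketch of how a resolved entry might be pushed onto the training config — the dataclass names below are illustrative stand-ins, not the actual `megatron.bridge` types:

```python
# Illustrative only: field names mirror the YAML above, but the concrete
# config classes in megatron.bridge may differ.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DDPConfig:
    check_for_nan_in_grad: bool = False
    check_for_large_grads: bool = False
    use_megatron_fsdp: bool = False  # now chosen per perf_matrix entry

@dataclass
class TrainConfig:
    ddp: DDPConfig = field(default_factory=DDPConfig)
    tp: int = 1
    pp: int = 1
    cp: int = 1
    vp: Optional[int] = None

def apply_perf_entry(cfg: TrainConfig, entry: dict) -> TrainConfig:
    """Copy parallelism and FSDP settings from a resolved perf_matrix entry."""
    cfg.ddp.use_megatron_fsdp = bool(entry.get("use_megatron_fsdp", False))
    for key in ("tp", "pp", "cp", "vp"):
        if key in entry:
            setattr(cfg, key, entry[key])
    return cfg
```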

scripts/performance/configs/llama3/llama3_8b_llm_pretrain.yaml

Lines changed: 6 additions & 6 deletions
@@ -15,7 +15,7 @@
 ConfigContainer:
   model:
     cross_entropy_fusion_impl: "te"
-    use_transformer_engine_op_fuser: true
+    use_transformer_engine_op_fuser: false
 
   train:
     train_iters: 25
@@ -64,15 +64,15 @@ perf_matrix:
         etp: null
       bf16:
         cp: 2
-        fsdp: false
+        use_megatron_fsdp: false
       fp8_ds:
         cp: 1
-        fsdp: true
+        use_megatron_fsdp: true
         keep_fp8_transpose_cache_when_using_custom_fsdp: true
         nccl_ub: true
       fp8_cs:
         cp: 1
-        fsdp: true
+        use_megatron_fsdp: true
         keep_fp8_transpose_cache_when_using_custom_fsdp: true
         nccl_ub: true
   b200:
@@ -89,7 +89,7 @@ perf_matrix:
         vp: null
         ep: 1
         etp: null
-        fsdp: false
+        use_megatron_fsdp: false
       bf16:
       fp8_ds:
       fp8_cs:
@@ -108,7 +108,7 @@ perf_matrix:
         vp: null
         ep: 1
         etp: null
-        fsdp: false
+        use_megatron_fsdp: false
       bf16:
       fp8_ds:
       fp8_cs:
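
Because the key is renamed rather than aliased, a local YAML still carrying the old `fsdp` key would silently stop matching. A tiny backward-compat shim — purely hypothetical, this commit adds no such shim — would look like:

```python
# Hypothetical compatibility shim for YAMLs still using the old `fsdp` key.
def normalize_entry(entry: dict) -> dict:
    entry = dict(entry)  # don't mutate the caller's dict
    if "fsdp" in entry and "use_megatron_fsdp" not in entry:
        entry["use_megatron_fsdp"] = entry.pop("fsdp")
    return entry

assert normalize_entry({"fsdp": True}) == {"use_megatron_fsdp": True}
```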

scripts/performance/configs/llama3/llama31_405b_llm_pretrain.yaml renamed to scripts/performance/configs/llama31/llama31_405b_llm_pretrain.yaml

Lines changed: 10 additions & 13 deletions
@@ -19,9 +19,8 @@
 ConfigContainer:
   model:
     cross_entropy_fusion_impl: "te"
-    enable_cuda_graph: false
-    use_te_rng_tracker: false
-    use_transformer_engine_op_fuser: true
+    use_transformer_engine_op_fuser: false
+    seq_length: 8192
 
   train:
     train_iters: 25
@@ -40,8 +39,6 @@ ConfigContainer:
   ddp:
     check_for_nan_in_grad: false
     check_for_large_grads: false
-    use_megatron_fsdp: true
-    use_torch_fsdp2: false
 
   mixed_precision:
     grad_reduce_in_fp32: false
@@ -71,7 +68,7 @@ perf_matrix:
         vp: 8
         ep: 1
         etp: null
-        fsdp: false
+        use_megatron_fsdp: false
       bf16:
       fp8_ds:
       fp8_cs:
@@ -89,7 +86,7 @@ perf_matrix:
         vp: 8
         ep: 1
         etp: null
-        fsdp: false
+        use_megatron_fsdp: false
       bf16:
       fp8_ds:
       fp8_cs:
@@ -109,24 +106,24 @@ perf_matrix:
         pp: 8
         cp: 2
         vp: 8
-        fsdp: false
+        use_megatron_fsdp: false
       fp8_ds:
         tp: 2
         pp: 1
         cp: 1
-        vp: 1
-        fsdp: true
+        vp: null
+        use_megatron_fsdp: true
         cpu_offloading_num_layers: 95
       fp8_cs:
         tp: 2
         pp: 1
         cp: 1
-        vp: 1
-        fsdp: true
+        vp: null
+        use_megatron_fsdp: true
         cpu_offloading_num_layers: 95
       fp8_mx:
         tp: 4
         pp: 8
         cp: 2
         vp: 8
-        fsdp: false
+        use_megatron_fsdp: false
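
The fp8_ds/fp8_cs entries set `cpu_offloading_num_layers: 95`. Assuming the published 126-layer depth of Llama 3.1 405B (an assumption; the diff itself does not state the layer count), that offloads roughly three quarters of the transformer stack:

```python
# Back-of-envelope check of the CPU-offload fraction. The 126-layer depth is
# the published Llama 3.1 405B figure, assumed here rather than read from
# this diff.
num_layers = 126
cpu_offloading_num_layers = 95  # from the fp8_ds / fp8_cs entries above

print(f"{cpu_offloading_num_layers / num_layers:.1%} of layers offloaded")  # 75.4%
```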
Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+ConfigContainer:
+  model:
+    cross_entropy_fusion_impl: "te"
+    bias_activation_fusion: true
+    recompute_granularity: null
+    recompute_method: null
+    recompute_num_layers: null
+    moe_router_fusion: true
+
+  train:
+    train_iters: 25
+    eval_iters: 0
+
+  rerun_state_machine:
+    check_for_nan_in_loss: false
+
+  checkpoint:
+    # Directory to save to. If null, no checkpoint will be saved.
+    save: null
+
+  logger:
+    log_interval: 1
+
+  ddp:
+    check_for_nan_in_grad: false
+    check_for_large_grads: false
+
+  mixed_precision:
+    grad_reduce_in_fp32: false
+
+  profiling:
+    # For optional fields in the config, specify the target to instantiate the object.
+    _target_: megatron.bridge.training.config.ProfilingConfig
+    use_nsys_profiler: true
+    profile_step_start: 5
+    profile_step_end: 6
+    profile_ranks: [0]
+    record_shapes: false
+    use_pytorch_profiler: false
+
+perf_matrix:
+  h100:
+    num_gpus_256:
+      common:
+        num_gpus_per_node: 8
+        seq_length: 4096
+        mbs: 1
+        gbs: 2048
+        cuda_graphs: false
+        tp: 2
+        pp: 8
+        cp: 1
+        vp: 4
+        ep: 32
+        etp: 1
+        fsdp: false
+      bf16:
+      fp8_cs:
+  gb200:
+    num_gpus_64:
+      common:
+        num_gpus_per_node: 4
+        seq_length: 4096
+        mbs: 1
+        gbs: 1024
+        cuda_graphs: true
+        tp: 2
+        pp: 1
+        cp: 1
+        vp: null
+        ep: 64
+        etp: 1
+        fsdp: false
+      bf16:
+      fp8_cs:
+      fp8_mx:
+  b200:
+    num_gpus_64:
+      common:
+        num_gpus_per_node: 4
+        seq_length: 4096
+        mbs: 1
+        gbs: 1024
+        cuda_graphs: false
+        tp: 1
+        pp: 8
+        cp: 1
+        vp: 12
+        ep: 8
+        etp: 1
+        fsdp: false
+      bf16:
+      fp8_cs:
+      fp8_mx:
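
The h100 `num_gpus_256` entry pins `mbs`/`gbs` together with the parallelism layout, which fixes the gradient-accumulation depth. Using the standard Megatron relation dp = num_gpus / (tp × pp × cp) — expert parallelism shards the MoE experts inside the data-parallel group, so `ep` does not enter this ratio — a quick sanity check:

```python
# Sanity arithmetic for the h100 / num_gpus_256 entry above.
num_gpus, tp, pp, cp = 256, 2, 8, 1
mbs, gbs = 1, 2048

assert num_gpus % (tp * pp * cp) == 0
dp = num_gpus // (tp * pp * cp)      # 256 / 16 = 16 data-parallel replicas

assert gbs % (mbs * dp) == 0
grad_accum = gbs // (mbs * dp)       # 2048 / 16 = 128 micro-batches per step

print(dp, grad_accum)                # 16 128
```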
