
Conversation

@danielvegamyhre (Contributor) commented Sep 13, 2025

Stacked PRs:


[moe training] organize bench dir by quant type; rename bench scripts to match kernel names

Benchmarks for Llama4 16e, DSV3 236b, DSV3 671b with various experts per device (i.e., different EP degrees)

Llama4 16e shapes with 1-8 experts per device

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(16384, 8192, 5120, 1)   MoEScalingType.MXFP8           4403.49              2992.1    1.472x                         1236.1            823.344  1.501x
(16384, 8192, 5120, 2)   MoEScalingType.MXFP8           4225.02              3198      1.321x                         1180.58           795.648  1.484x
(16384, 8192, 5120, 4)   MoEScalingType.MXFP8           4100.06              3540.16   1.158x                         1150.82           918.576  1.253x
(16384, 8192, 5120, 8)   MoEScalingType.MXFP8           4246.43              3752.99   1.131x                         1192.67          1029.31   1.159x
(128000, 8192, 5120, 1)  MoEScalingType.MXFP8          42062.4              24263.6    1.734x                        10494             5960.54   1.761x
(128000, 8192, 5120, 2)  MoEScalingType.MXFP8          47400.4              26359.8    1.798x                        22849.5           7594.02   3.009x
(128000, 8192, 5120, 4)  MoEScalingType.MXFP8          36716.6              25263.1    1.453x                        10177             6054.02   1.681x
(128000, 8192, 5120, 8)  MoEScalingType.MXFP8          35439.1              27204.2    1.303x                        10147.6           5802.88   1.749x

DSV3 671b shapes with 1-8 experts per device

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(16384, 2048, 7168, 1)   MoEScalingType.MXFP8           1628.3               1533.95   1.062x                          455.536          419.872  1.085x
(16384, 2048, 7168, 2)   MoEScalingType.MXFP8           1456.1               1496.16   0.973x                          457.536          415.856  1.1x
(16384, 2048, 7168, 4)   MoEScalingType.MXFP8           1469.47              1544.19   0.952x                          443.296          414.72   1.069x
(16384, 2048, 7168, 8)   MoEScalingType.MXFP8           1463.71              1607.62   0.91x                           415.744          466.96   0.89x
(128000, 2048, 7168, 1)  MoEScalingType.MXFP8          16563.4              11004.9    1.505x                         3560.05          2937.95   1.212x
(128000, 2048, 7168, 2)  MoEScalingType.MXFP8          12942.6              10989.6    1.178x                         3772.45          4152.24   0.909x
(128000, 2048, 7168, 4)  MoEScalingType.MXFP8          14366.6              10668.2    1.347x                         3601.44          2892.9    1.245x
(128000, 2048, 7168, 8)  MoEScalingType.MXFP8          16338.6              10000.5    1.634x                         4963.86          2536.98   1.957x

DSV3 236b shapes with 1-8 experts per device

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(16640, 5120, 1536, 1)  MoEScalingType.MXFP8            892.016             1062.02   0.84x                           234.496          220.32   1.064x
(16640, 5120, 1536, 2)  MoEScalingType.MXFP8            872.432             1008.64   0.865x                          199.712          219.968  0.908x
(16640, 5120, 1536, 4)  MoEScalingType.MXFP8            872.416              967.328  0.902x                          202.576          211.968  0.956x
(16640, 5120, 1536, 8)  MoEScalingType.MXFP8            989.312             1082.56   0.914x                          216.128          240.672  0.898x
(128000, 1536, 5120, 1)  MoEScalingType.MXFP8           7406.69              7634.85   0.97x                          2081.28          1985.12   1.048x
(128000, 1536, 5120, 2)  MoEScalingType.MXFP8           7974.45              7383.84   1.08x                          2732.9           2291.68   1.193x
(128000, 1536, 5120, 4)  MoEScalingType.MXFP8           8261.63              6750.66   1.224x                         1984.74          1701.5    1.166x
(128000, 1536, 5120, 8)  MoEScalingType.MXFP8           8549.94              6553.6    1.305x                         1846.67          1757.22   1.051x
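
For context on how numbers like these are typically produced, below is a minimal sketch of the CUDA-event timing pattern such bench scripts generally follow. `run_moe_layer` is a hypothetical placeholder for whichever path is being measured (bf16 baseline or MXFP8 scaled grouped GEMM); it is not an actual function from this repo.

```python
import torch


def benchmark_fwd_bwd_us(run_moe_layer, *args, warmup: int = 3, iters: int = 10) -> float:
    """Average forward+backward time of `run_moe_layer(*args)` in microseconds.

    `run_moe_layer` is a placeholder for the code under test (bf16 baseline or
    MXFP8 scaled variant) and must return a tensor that requires grad.
    """
    # Warmup to exclude compilation / autotuning / cache-miss effects.
    for _ in range(warmup):
        run_moe_layer(*args).sum().backward()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        run_moe_layer(*args).sum().backward()
    end.record()
    torch.cuda.synchronize()

    return start.elapsed_time(end) / iters * 1e3  # elapsed_time is in ms -> us
```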

Important caveat

The current bench script produces directionally consistent results, but the absolute speedup factor varies from run to run. Example:

Run 1:

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(128000, 1536, 5120, 8)  MoEScalingType.MXFP8            7174.21              6401.42  1.121x                          2002.75          3079.17  0.65x
(128000, 2048, 7168, 8)  MoEScalingType.MXFP8           12775.2              10721.2   1.192x                          3690.34          2919.5   1.264x

Run 2:

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(128000, 1536, 5120, 8)  MoEScalingType.MXFP8            6497.25               6114.9  1.063x                          2140.29          1571.84  1.362x
(128000, 2048, 7168, 8)  MoEScalingType.MXFP8           15675.5               10228.9  1.532x                          3618.43          2765.31  1.309x
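
One option for tightening this spread (a sketch only, not what the current script does) is to let `torch.utils.benchmark.Timer` choose the iteration count and report the median over blocked measurements; `fn` below is a hypothetical callable wrapping the fwd or fwd+bwd step being timed.

```python
import torch.utils.benchmark as benchmark


def median_time_us(fn, *args, min_run_time: float = 2.0) -> float:
    """Median wall time of `fn(*args)` in microseconds.

    `fn` is a hypothetical callable (e.g. one fwd or fwd+bwd step);
    blocked_autorange keeps re-measuring until `min_run_time` seconds of
    samples are collected, which damps scheduler and clock noise.
    """
    timer = benchmark.Timer(stmt="fn(*args)", globals={"fn": fn, "args": args})
    return timer.blocked_autorange(min_run_time=min_run_time).median * 1e6  # s -> us
```

Locking the GPU clocks beforehand (e.g. with `nvidia-smi --lock-gpu-clocks`) can also reduce variance, though nothing here implies the numbers above were collected that way.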

@danielvegamyhre changed the title to "[moe training] add benchmarks for dsv3 236b, 671b shapes; reorganize benchmarks dir" Sep 13, 2025
@drisspg (Contributor) commented Sep 17, 2025

I think we should try to base these sizes (especially for training) on known recipes. I'm not sure whether any training details are public for DSV3 regarding global batch size + EP degree, but from titan: https://github.com/pytorch/torchtitan/blob/be2c83df4869d88ef7b7b3b3a7ff0781d3a29ba3/torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml#L38-L39

it looks like local tokens are 16384, with an EP degree of 1? I'm actually not sure if this means 1 expert per device or 128.
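
As a rough sanity check on where that 16384 comes from: tokens per device per step is local_batch_size × seq_len under the usual convention. The concrete values below are illustrative assumptions, not numbers read from the linked toml.

```python
# Illustrative assumptions only; the real values live in the linked
# deepseek_v3_671b.toml, which is not copied here.
local_batch_size = 4   # sequences per device per step (assumed)
seq_len = 4096         # tokens per sequence (assumed)

tokens_per_device = local_batch_size * seq_len
print(tokens_per_device)  # 16384 -> matches the M=16384 rows above
```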

@danielvegamyhre (Contributor, Author)

I think we should try to base these sizes (especially for training) on known recipes. I'm not sure whether any training details are public for DSV3 regarding global batch size + EP degree, but from titan: https://github.com/pytorch/torchtitan/blob/be2c83df4869d88ef7b7b3b3a7ff0781d3a29ba3/torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml#L38-L39

it looks like local tokens are 16384, with an EP degree of 1? I'm actually not sure if this means 1 expert per device or 128.

Makes sense, I agree. Although I don't think the default parallelism config for DSV3 671b looks right: it's configured to use FSDP2 only, with no EP and no TP, for the largest model. I'm pretty sure this will OOM; with 256 experts, some EP + TP is required. Anyway, I will look into it and make sure we have sensible benchmarks.

@drisspg (Contributor) commented Sep 17, 2025

@danielvegamyhre agreed, the config was interesting. cc @tianyu-l

If Gemini (with access to the DSV3 paper) is to be believed:

DeepSeek-V3 was trained with a specific setup to optimize for both performance and cost-effectiveness. Here are the details on its batch sizes, sequence length, and expert parallelism (EP) degree.

Batch Sizes and Sequence Length

Batch Size: The training process used a dynamically increasing batch size. It started with 3,072 tokens and ramped up to a maximum of 15,360 tokens over the course of the first 469 billion tokens of training.

Sequence Length: The model was initially trained with a maximum context window of 4,096 tokens. This was later extended in a two-stage process: first to 32,000 tokens, and finally to a maximum of 128,000 tokens. This long context capability is one of the model's key features.

Parallelism Strategy

DeepSeek-V3's training framework uses a combination of parallelism strategies to manage its massive size and distribute the workload across 2,048 NVIDIA H800 GPUs.

Expert Parallelism (EP): The MoE architecture is sharded across multiple devices using Expert Parallelism. Specifically, DeepSeek-V3 uses 64-way EP across 8 nodes. The model has 256 routed experts and 1 shared expert per layer. For each token, the router selects the top 8 routed experts. The architecture includes a node-limited routing mechanism, which ensures that each token is sent to a maximum of 4 nodes, thus optimizing cross-node communication.
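
Taking those quoted figures at face value, a quick back-of-the-envelope calculation maps them onto the G values in the tables above; this is just arithmetic on the Gemini summary, not a verified claim about the actual training setup.

```python
# Figures quoted from the Gemini summary above; treat them as unverified.
num_routed_experts = 256
ep_degree = 64          # "64-way EP across 8 nodes"

local_experts = num_routed_experts // ep_degree
print(local_experts)    # 4 -> the G=4 rows above are closest to this setup
```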

@tianyu-l

The config in torchtitan is not tuned at all (as you can tell, it's naive all-FSDP). We still have some obvious perf bugs to fix:
pytorch/torchtitan#1624
pytorch/torchtitan#1631

@danielvegamyhre merged commit f75b251 into main Sep 17, 2025 (4 checks passed)