
Conversation

@danielvegamyhre (Contributor) commented Sep 13, 2025

Stacked PRs:


[moe training] organize bench dir by quant type; rename bench scripts to match kernel names

Benchmarks for Llama4 16e, DSV3 236b, DSV3 671b with various experts per device (i.e., different EP degrees)

Llama4 16e shapes with 1-8 experts per device

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(16384, 8192, 5120, 1)   MoEScalingType.MXFP8           4403.49              2992.1    1.472x                         1236.1            823.344  1.501x
(16384, 8192, 5120, 2)   MoEScalingType.MXFP8           4225.02              3198      1.321x                         1180.58           795.648  1.484x
(16384, 8192, 5120, 4)   MoEScalingType.MXFP8           4100.06              3540.16   1.158x                         1150.82           918.576  1.253x
(16384, 8192, 5120, 8)   MoEScalingType.MXFP8           4246.43              3752.99   1.131x                         1192.67          1029.31   1.159x
(128000, 8192, 5120, 1)  MoEScalingType.MXFP8          42062.4              24263.6    1.734x                        10494             5960.54   1.761x
(128000, 8192, 5120, 2)  MoEScalingType.MXFP8          47400.4              26359.8    1.798x                        22849.5           7594.02   3.009x
(128000, 8192, 5120, 4)  MoEScalingType.MXFP8          36716.6              25263.1    1.453x                        10177             6054.02   1.681x
(128000, 8192, 5120, 8)  MoEScalingType.MXFP8          35439.1              27204.2    1.303x                        10147.6           5802.88   1.749x

DSV3 671b shapes with 1-8 experts per device

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(16384, 2048, 7168, 1)   MoEScalingType.MXFP8           1628.3               1533.95   1.062x                          455.536          419.872  1.085x
(16384, 2048, 7168, 2)   MoEScalingType.MXFP8           1456.1               1496.16   0.973x                          457.536          415.856  1.1x
(16384, 2048, 7168, 4)   MoEScalingType.MXFP8           1469.47              1544.19   0.952x                          443.296          414.72   1.069x
(16384, 2048, 7168, 8)   MoEScalingType.MXFP8           1463.71              1607.62   0.91x                           415.744          466.96   0.89x
(128000, 2048, 7168, 1)  MoEScalingType.MXFP8          16563.4              11004.9    1.505x                         3560.05          2937.95   1.212x
(128000, 2048, 7168, 2)  MoEScalingType.MXFP8          12942.6              10989.6    1.178x                         3772.45          4152.24   0.909x
(128000, 2048, 7168, 4)  MoEScalingType.MXFP8          14366.6              10668.2    1.347x                         3601.44          2892.9    1.245x
(128000, 2048, 7168, 8)  MoEScalingType.MXFP8          16338.6              10000.5    1.634x                         4963.86          2536.98   1.957x

DSV3 236b shapes with 1-8 experts per device

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(16640, 5120, 1536, 1)  MoEScalingType.MXFP8            892.016             1062.02   0.84x                           234.496          220.32   1.064x
(16640, 5120, 1536, 2)  MoEScalingType.MXFP8            872.432             1008.64   0.865x                          199.712          219.968  0.908x
(16640, 5120, 1536, 4)  MoEScalingType.MXFP8            872.416              967.328  0.902x                          202.576          211.968  0.956x
(16640, 5120, 1536, 8)  MoEScalingType.MXFP8            989.312             1082.56   0.914x                          216.128          240.672  0.898x
(128000, 1536, 5120, 1)  MoEScalingType.MXFP8           7406.69              7634.85   0.97x                          2081.28          1985.12   1.048x
(128000, 1536, 5120, 2)  MoEScalingType.MXFP8           7974.45              7383.84   1.08x                          2732.9           2291.68   1.193x
(128000, 1536, 5120, 4)  MoEScalingType.MXFP8           8261.63              6750.66   1.224x                         1984.74          1701.5    1.166x
(128000, 1536, 5120, 8)  MoEScalingType.MXFP8           8549.94              6553.6    1.305x                         1846.67          1757.22   1.051x
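
For context on how numbers like these are typically produced, below is a minimal sketch of the CUDA-event timing pattern such bench scripts generally follow. `run_moe_layer` is a hypothetical placeholder for whichever path is being measured (bf16 baseline or MXFP8 scaled grouped GEMM); it is not an actual function from this repo.

```python
import torch


def benchmark_fwd_bwd_us(run_moe_layer, *args, warmup: int = 3, iters: int = 10) -> float:
    """Average forward+backward time of `run_moe_layer(*args)` in microseconds.

    `run_moe_layer` is a placeholder for the code under test (bf16 baseline or
    MXFP8 scaled variant) and must return a tensor that requires grad.
    """
    # Warmup to exclude compilation / autotuning / cache-miss effects.
    for _ in range(warmup):
        run_moe_layer(*args).sum().backward()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        run_moe_layer(*args).sum().backward()
    end.record()
    torch.cuda.synchronize()

    return start.elapsed_time(end) / iters * 1e3  # elapsed_time is in ms -> us
```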

Important caveat

The current bench script produces directionally consistent results, but the absolute speedup factor varies from run to run. Example:

Run 1:

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(128000, 1536, 5120, 8)  MoEScalingType.MXFP8            7174.21              6401.42  1.121x                          2002.75          3079.17  0.65x
(128000, 2048, 7168, 8)  MoEScalingType.MXFP8           12775.2              10721.2   1.192x                          3690.34          2919.5   1.264x

Run 2:

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(128000, 1536, 5120, 8)  MoEScalingType.MXFP8            6497.25               6114.9  1.063x                          2140.29          1571.84  1.362x
(128000, 2048, 7168, 8)  MoEScalingType.MXFP8           15675.5               10228.9  1.532x                          3618.43          2765.31  1.309x
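
One option for tightening this spread (a sketch only, not what the current script does) is to let `torch.utils.benchmark.Timer` choose the iteration count and report the median over blocked measurements; `fn` below is a hypothetical callable wrapping the fwd or fwd+bwd step being timed.

```python
import torch.utils.benchmark as benchmark


def median_time_us(fn, *args, min_run_time: float = 2.0) -> float:
    """Median wall time of `fn(*args)` in microseconds.

    `fn` is a hypothetical callable (e.g. one fwd or fwd+bwd step);
    blocked_autorange keeps re-measuring until `min_run_time` seconds of
    samples are collected, which damps scheduler and clock noise.
    """
    timer = benchmark.Timer(stmt="fn(*args)", globals={"fn": fn, "args": args})
    return timer.blocked_autorange(min_run_time=min_run_time).median * 1e6  # s -> us
```

Locking the GPU clocks beforehand (e.g. with `nvidia-smi --lock-gpu-clocks`) can also reduce variance, though nothing here implies the numbers above were collected that way.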

@danielvegamyhre changed the title to "[moe training] add benchmarks for dsv3 236b, 671b shapes; reorganize benchmarks dir" Sep 13, 2025
@drisspg (Contributor) commented Sep 17, 2025

I think we should try to base these sizes (especially for training) on known recipes. I'm not sure whether any training details are public for DSV3 regarding global batch size + EP degree, but from titan: https://github.com/pytorch/torchtitan/blob/be2c83df4869d88ef7b7b3b3a7ff0781d3a29ba3/torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml#L38-L39

it looks like local tokens are 16384, with an EP degree of 1? I'm actually not sure if this means 1 expert per device or 128.
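
As a rough sanity check on where that 16384 comes from: tokens per device per step is local_batch_size × seq_len under the usual convention. The concrete values below are illustrative assumptions, not numbers read from the linked toml.

```python
# Illustrative assumptions only; the real values live in the linked
# deepseek_v3_671b.toml, which is not copied here.
local_batch_size = 4   # sequences per device per step (assumed)
seq_len = 4096         # tokens per sequence (assumed)

tokens_per_device = local_batch_size * seq_len
print(tokens_per_device)  # 16384 -> matches the M=16384 rows above
```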

@danielvegamyhre (Contributor, Author)

I think we should try to base these sizes (especially for training) on known recipes. I'm not sure whether any training details are public for DSV3 regarding global batch size + EP degree, but from titan: https://github.com/pytorch/torchtitan/blob/be2c83df4869d88ef7b7b3b3a7ff0781d3a29ba3/torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml#L38-L39

it looks like local tokens are 16384, with an EP degree of 1? I'm actually not sure if this means 1 expert per device or 128.

Makes sense, I agree. Although I don't think the default parallelism config for DSV3 671b looks right: it's configured to use FSDP2 only, with no EP and no TP, for the largest model. I'm pretty sure this will OOM; with 256 experts, some EP + TP is required. Anyway, I will look into it and make sure we have sensible benchmarks.

@drisspg (Contributor) commented Sep 17, 2025

@danielvegamyhre agreed, the config was interesting. cc @tianyu-l

If Gemini (with access to the DSV3 paper) is to be believed:

DeepSeek-V3 was trained with a specific setup to optimize for both performance and cost-effectiveness. Here are the details on its batch sizes, sequence length, and expert parallelism (EP) degree.

Batch Sizes and Sequence Length

Batch Size: The training process used a dynamically increasing batch size. It started with 3,072 tokens and ramped up to a maximum of 15,360 tokens over the course of the first 469 billion tokens of training.

Sequence Length: The model was initially trained with a maximum context window of 4,096 tokens. This was later extended in a two-stage process: first to 32,000 tokens, and finally to a maximum of 128,000 tokens. This long context capability is one of the model's key features.

Parallelism Strategy

DeepSeek-V3's training framework uses a combination of parallelism strategies to manage its massive size and distribute the workload across 2,048 NVIDIA H800 GPUs.

Expert Parallelism (EP): The MoE architecture is sharded across multiple devices using Expert Parallelism. Specifically, DeepSeek-V3 uses 64-way EP across 8 nodes. The model has 256 routed experts and 1 shared expert per layer. For each token, the router selects the top 8 routed experts. The architecture includes a node-limited routing mechanism, which ensures that each token is sent to a maximum of 4 nodes, thus optimizing cross-node communication.
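
Taking those quoted figures at face value, a quick back-of-the-envelope calculation maps them onto the G values in the tables above; this is just arithmetic on the Gemini summary, not a verified claim about the actual training setup.

```python
# Figures quoted from the Gemini summary above; treat them as unverified.
num_routed_experts = 256
ep_degree = 64          # "64-way EP across 8 nodes"

local_experts = num_routed_experts // ep_degree
print(local_experts)    # 4 -> the G=4 rows above are closest to this setup
```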

@tianyu-l

The config in torchtitan is not tuned at all (as you can tell, it's naive all-FSDP). We still have some obvious perf bugs to fix:
pytorch/torchtitan#1624
pytorch/torchtitan#1631

@danielvegamyhre merged commit f75b251 into main Sep 17, 2025 (4 checks passed)