
Conversation

wwwjn
Contributor

@wwwjn wwwjn commented Sep 5, 2025

As the Qwen3 dense and MoE models share a lot of common parts (e.g., Attention), I added the MoE module on top of the Qwen3 dense model.

Initial verification with FSDP=8, EP=2

@meta-cla meta-cla bot added the CLA Signed label Sep 5, 2025
@wwwjn wwwjn changed the title from "[Qwen3] Qwen3 MoE support" to "[Qwen3] Qwen3 MoE initial support" Sep 5, 2025
@jthomy

jthomy commented Sep 5, 2025

Great to see!
I was about to open a PR with a Qwen3 MoE implementation of my own, but I'm happy to see it here as well (it is similar, except that it uses the complex-number RoPE implementation and has a state dict adapter from/to HF).
Looking over the code, I see that the MFU implementation still needs to take into account the sparse activations.
Are you interested in adding sequence packing support as well?
If not, I also have an implementation for flash attention with sequence packing using the flash_attn_varlen_func and could open a PR on it, but I am not sure if the torchtitan repo wants to stick with flex attention only.

@tianyu-l
Contributor

tianyu-l commented Sep 5, 2025

@jthomy
IIUC these are two separate questions

Looking over the code, I see that the MFU implementation still needs to take into account the sparse activations.
Are you interested in adding sequence packing support as well?

We do have sequence packing / document masking support in torchtitan using FlexAttention, which can (and should) be added to Qwen MoE. cc @wwwjn
The MFU computation is more about following convention, lol.
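
For anyone curious what document masking with FlexAttention looks like, here is a minimal sketch (not torchtitan's actual implementation; the shapes and the document_ids tensor are illustrative, and it assumes PyTorch >= 2.5 with a CUDA device):

```python
# Minimal sketch of causal attention with document masking via FlexAttention.
# `document_ids` marks which packed document each token position belongs to.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, S, D = 1, 8, 256, 64
q = k = v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)

# Example packing: two documents of 128 tokens each.
document_ids = torch.arange(S, device="cuda") // 128

def document_causal_mask(b, h, q_idx, kv_idx):
    # Causal within a document; never attend across document boundaries.
    return (q_idx >= kv_idx) & (document_ids[q_idx] == document_ids[kv_idx])

block_mask = create_block_mask(document_causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)
```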

If not, I also have an implementation for flash attention with sequence packing using the flash_attn_varlen_func and could open a PR on it, but I am not sure if the torchtitan repo wants to stick with flex attention only.

There used to be discussions on this. IIRC the worry was around supporting flash_attn_varlen_func with CP? Even if we support flash_attn_varlen_func, IMO that should go into pytorch SDPA instead of into torchtitan directly.
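
For context, a rough sketch of what sequence packing with flash_attn_varlen_func looks like (this is the flash-attn package's API, not anything in torchtitan; shapes and lengths are illustrative):

```python
# Rough sketch of packed-sequence attention with flash_attn_varlen_func.
# Requires the flash-attn package and a CUDA device; values are illustrative.
import torch
from flash_attn import flash_attn_varlen_func

nheads, headdim = 8, 64
seqlens = [128, 96, 32]                 # three documents packed into one batch
total = sum(seqlens)

q = torch.randn(total, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Cumulative sequence lengths delimit the documents inside the packed batch.
cu_seqlens = torch.tensor([0, 128, 224, 256], device="cuda", dtype=torch.int32)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    causal=True,  # causal within each document
)
```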

cc @fegin @drisspg

@vwxyzjn

vwxyzjn commented Sep 6, 2025

@jthomy I'm quite curious about your implementation as well, especially the HF state dict adapter. Would you mind sharing your branch?

@jthomy

jthomy commented Sep 8, 2025

@vwxyzjn sure, I made a draft pull request here, feel free to have a look: #1688
Let me know if there's anything I can help with.

@fegin
Contributor

fegin commented Sep 9, 2025

@jthomy FlexAttention has a document masking implementation. The main blocker now is composability with SAC, for which we are working on a workaround. As for the SDPA version, @drisspg has prototyped one, pytorch/pytorch#162326. There may also be changes to how we provide these API calls due to the composability issues when enabling CP with FlexAttention. I can provide more detail later this week once I try out the proposal.

- Supports FSDP/HSDP, TP, DDP, EP.
- Supports AC, torch.compile.
- MoE models use Token Choice routing with an auxiliary-loss-free load balancing algorithm (see the sketch after this list).
- [WIP] CP is not currently supported because we use a different RoPE embedding.
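
For readers unfamiliar with the term, here is a minimal sketch of token-choice top-k routing with auxiliary-loss-free load balancing (a per-expert bias used only for expert selection and nudged toward balanced load each step). This is illustrative only, not torchtitan's router; the names and hyperparameters are made up:

```python
# Minimal sketch of token-choice top-k routing with auxiliary-loss-free
# load balancing. Illustrative only; not torchtitan's actual MoE router.
import torch

num_experts, top_k, dim, bias_lr = 8, 2, 16, 1e-3
gate = torch.nn.Linear(dim, num_experts, bias=False)
expert_bias = torch.zeros(num_experts)  # adjusted outside autograd

def route(x):  # x: [num_tokens, dim]
    scores = torch.sigmoid(gate(x))                  # [tokens, experts]
    # The bias influences which experts are selected, not the combine weights.
    _, expert_idx = (scores + expert_bias).topk(top_k, dim=-1)
    weights = scores.gather(-1, expert_idx)          # [tokens, top_k]

    # Auxiliary-loss-free balancing: lower the bias of overloaded experts and
    # raise it for underloaded ones, instead of adding a load-balancing loss.
    load = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
    with torch.no_grad():
        expert_bias.add_(bias_lr * torch.sign(load.mean() - load))
    return expert_idx, weights

idx, w = route(torch.randn(32, dim))
```
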
Contributor

What's our status on this? Why does it work on dense but not MoE? I thought RoPE only matters in the Attention layer.

When you say we used a different RoPE, doesn't that mean we could've switched to the (alternative but also correct) complex-number-based RoPE (e.g. #1688) and CP would work automatically?

Contributor

"CP is not supported" is correct, but is it due to Flex?

Contributor

@fegin oh is Flex only enabled for Qwen MoE but not Qwen dense? Either way we should update both to be consistent.

Contributor Author

I wrote this because of a minor issue that hasn't been addressed: in Qwen, we use cos/sin RoPE embeddings, so there's no freqs_cis field, while CP explicitly adds freqs_cis here:

cp_buffers=[inputs, labels] + [m.freqs_cis for m in model_parts],

This is a known issue but I plan to address it later.

For Flex support - Qwen3 MoE only has one test model here, and I'm testing a bigger model. I will update the bigger MoE model to use FlexAttention, and update the README to state the CP issue with FlexAttention.
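
For reference, a minimal sketch of the cos/sin ("rotate-half") style of RoPE mentioned above, as opposed to a complex-valued freqs_cis buffer; the names and shapes are illustrative, not the code in this PR:

```python
# Minimal sketch of cos/sin ("rotate-half") RoPE. Illustrative only.
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: [batch, heads, seq, head_dim]; cos, sin: [seq, head_dim]
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

head_dim, seq = 64, 16
theta = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
angles = torch.outer(torch.arange(seq).float(), theta)   # [seq, head_dim // 2]
cos = torch.cat((angles.cos(), angles.cos()), dim=-1)    # [seq, head_dim]
sin = torch.cat((angles.sin(), angles.sin()), dim=-1)

q = torch.randn(1, 8, seq, head_dim)
k = torch.randn(1, 8, seq, head_dim)
q_rot, k_rot = apply_rope(q, k, cos, sin)
```

The relevant point for CP is that with this style there is no freqs_cis module buffer to pass in cp_buffers; either the cos/sin tables need to be exposed to CP, or the model switched to a freqs_cis-based implementation.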

Contributor

What's your plan on

This is a known issue but I plan to address it later.

I think I don't mind switching to freqs_cis based RoPE, if that makes more sense to you.

Do we know whether Qwen is using Flash attention or Mem Efficient attention under SDPA? @fegin

Since (1) Flex + CP is WIP and (2) SDPA may not be the bottleneck, personally for Qwen I think it's OK to use SDPA until Flex + CP is ready. But I don't have a strong opinion on this.

Contributor

@fegin oh is Flex only enabled for Qwen MoE but not Qwen dense? Either way we should update both to be consistent.

The whole model can only use one type of attention; that is the design, unless the Qwen implementation explicitly allows users to configure it. But I would not suggest that approach.

Since Qwen is not using Flex, I think CP would work once the freqs_cis issue @wwwjn mentioned is fixed.

@wwwjn wwwjn requested a review from tianyu-l September 11, 2025 02:59
- MoE alternatives
## To be added
- MoE model
- `StateDictAdapter` support for MoE model
Contributor

this is the next step, right?

Contributor Author

Yes, this is on my plate

Contributor

@tianyu-l tianyu-l left a comment

LGTM!

@jthomy

jthomy commented Sep 11, 2025

Did you verify that this model is the same as HF Qwen3?
E.g., I see that no scaling factor is passed to the attention, whereas HF uses self.scaling = self.head_dim**-0.5, if I am not mistaken.

@wwwjn
Contributor Author

wwwjn commented Sep 11, 2025

Did you verify that this model is the same as HF Qwen3? E.g., I see that no scaling factor is passed to the attention, whereas HF uses self.scaling = self.head_dim**-0.5, if I am not mistaken.

Thanks for pointing that out! For the dense model, yes, we did some checks on end-to-end forward results, e.g. the description in #1590. But we haven't done finer-granularity checks on intermediate results. I saw you have great test scripts in your PR and I'm thinking of leveraging those! For MoE, numerical verification is in progress.

I agree the Attention part is not the same as Qwen3 - nice catch! I will create a fix for that.
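
If the fix ends up being to pass the scale explicitly, here is a minimal sketch of what that looks like with SDPA (illustrative, not the actual patch):

```python
# Sketch of passing an explicit attention scale to SDPA, mirroring HF's
# self.scaling = self.head_dim**-0.5. Illustrative; not the actual fix.
import torch
import torch.nn.functional as F

batch, n_heads, seq, head_dim = 1, 8, 16, 128
q = torch.randn(batch, n_heads, seq, head_dim)
k = torch.randn(batch, n_heads, seq, head_dim)
v = torch.randn(batch, n_heads, seq, head_dim)

out = F.scaled_dot_product_attention(
    q, k, v,
    is_causal=True,
    scale=head_dim**-0.5,  # explicit rather than relying on the default
)
```

Note that SDPA already defaults to 1/sqrt of the query's last dimension, so an explicit scale only changes behavior when the model's intended scaling differs from that default.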

@jthomy

jthomy commented Sep 11, 2025

Nice, thank you! Yes, feel free to use any of my code.

@wwwjn wwwjn merged commit bd3850b into pytorch:main Sep 11, 2025
8 checks passed