[Qwen3] Qwen3 MoE initial support #1685
Conversation
Great to see!
@jthomy
We do have sequence packing / document masking support in torchtitan using FlexAttention, which can (and should) be added to Qwen MoE. cc @wwwjn
There used to be discussions on this. IIRC the worry was around supporting …
@jthomy I'm quite curious about your implementation as well, especially the HF state dict adapter. Would you mind sharing your branch?
@jthomy FlexAttention has a document masking implementation. The main blocker now is composability with SAC, for which we are working on a workaround. As for an SDPA version, @drisspg has prototyped one: pytorch/pytorch#162326. There may also be changes to how we provide these API calls due to the composability issues when enabling CP with FlexAttention. I can provide more detail later this week once I try out the proposal.
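For reference, document masking in FlexAttention is expressed as a `mask_mod` that only lets a query attend within its own packed document. Below is a minimal sketch, not torchtitan's actual implementation; the shapes, the `doc_ids` tensor, and the fixed 64-token documents are made-up assumptions.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, S, D = 1, 8, 256, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)

# Example packing: four documents of 64 tokens each in one sequence.
doc_ids = torch.arange(S, device="cuda") // 64

def doc_causal_mask(b, h, q_idx, kv_idx):
    # Attend only within the same document, and only to earlier (or same) positions.
    return (doc_ids[q_idx] == doc_ids[kv_idx]) & (q_idx >= kv_idx)

block_mask = create_block_mask(doc_causal_mask, B, H, S, S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
```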
- Supports FSDP/HSDP, TP, DDP, EP.
- Supports AC, torch.compile.
- MoE models use Token Choice routing with an auxiliary-loss-free load-balancing algorithm (see the sketch below).
- [WIP] CP is not supported currently because we use a different RoPE embedding.
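For context on the routing item above: auxiliary-loss-free load balancing keeps a per-expert bias that is added to the router scores only for top-k selection and is nudged after each step based on expert load, so no balancing loss term is needed. A minimal sketch with hypothetical names and an arbitrary update rate, not torchtitan's actual router:

```python
import torch

def topk_with_bias(router_logits, expert_bias, top_k=2, bias_update_rate=1e-3):
    """Token Choice top-k routing with auxiliary-loss-free load balancing.

    `expert_bias` only influences which experts are picked; routing weights
    still come from the unbiased scores, so the bias steers load without
    adding a balancing loss term.
    """
    scores = router_logits.softmax(dim=-1)                   # (num_tokens, num_experts)
    _, expert_idx = (scores + expert_bias).topk(top_k, dim=-1)
    routing_weights = scores.gather(-1, expert_idx)           # weights from unbiased scores

    # Count tokens routed to each expert in this step.
    tokens_per_expert = torch.zeros_like(expert_bias).scatter_add_(
        0,
        expert_idx.flatten(),
        torch.ones(expert_idx.numel(), device=scores.device, dtype=expert_bias.dtype),
    )
    # Push down over-loaded experts, push up under-loaded ones.
    new_bias = expert_bias - bias_update_rate * torch.sign(
        tokens_per_expert - tokens_per_expert.mean()
    )
    return expert_idx, routing_weights, new_bias
```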
What's our status on this? Why does it work on dense but not MoE? I thought RoPE only matters in the attention layer.
When you say we used a different RoPE, doesn't that mean we could've switched to the (alternative but also correct) complex-number-based RoPE (e.g. #1688) and CP would automatically work?
CP is not supported
is correct, but is it due to Flex?
@fegin oh is Flex only enabled for Qwen MoE but not Qwen dense? Either way we should update both to be consistent.
I wrote this because of a minor issue that hasn't been addressed: in Qwen, we use cos/sin RoPE embeddings, so there's no `freqs_cis`
field, while CP explicitly adds `freqs_cis` here:
torchtitan/torchtitan/train.py
Line 425 in 71dea16
cp_buffers=[inputs, labels] + [m.freqs_cis for m in model_parts],
For Flex support: the Qwen3 MoE experiment only has one test model here, and I'm testing a bigger-size model. I will update the bigger MoE model to use FlexAttention, and update the README to state the CP issue with FlexAttention.
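To illustrate the difference discussed here (simplified, hypothetical code, not the actual torchtitan modules): complex-number RoPE produces a single `freqs_cis` buffer that `cp_buffers` can collect and shard along the sequence dimension, while the cos/sin formulation produces two separate buffers and no `freqs_cis` attribute for CP to pick up.

```python
import torch

def precompute_freqs_cis(dim: int, max_seq_len: int, theta: float = 1e6):
    # Complex-number RoPE: a single buffer, typically registered as `model.freqs_cis`.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len).float()
    return torch.polar(torch.ones(max_seq_len, dim // 2), torch.outer(t, freqs))

def precompute_cos_sin(dim: int, max_seq_len: int, theta: float = 1e6):
    # cos/sin RoPE (the current Qwen3 code path): two real-valued buffers,
    # so there is no `freqs_cis` attribute to append to `cp_buffers`.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(max_seq_len).float(), freqs)
    return angles.cos(), angles.sin()
```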
What's your plan on:

This is a known issue but I plan to address it later.

I think I don't mind switching to `freqs_cis`-based RoPE, if that makes more sense to you.
Do we know whether Qwen is using Flash Attention or Memory-Efficient Attention under SDPA? @fegin
Since (1) Flex + CP is WIP and (2) SDPA may not be the bottleneck, personally for Qwen I think it's OK to use SDPA until Flex + CP is ready. But I don't have a strong opinion on this.
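One way to answer the SDPA-backend question is to constrain SDPA to a single backend and see whether the call still runs; it raises an error if that backend cannot serve the inputs. A minimal sketch using PyTorch's public `sdpa_kernel` context manager; the shapes and dtype here are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# Restrict SDPA to the Flash Attention backend only.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Restrict SDPA to the Memory-Efficient backend only.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out_mem_eff = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```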
@fegin oh is Flex only enabled for Qwen MoE but not Qwen dense? Either way we should update both to be consistent.
The whole model can only use one type of attention; that is the design, unless the Qwen implementation explicitly allows users to configure it, but I would not suggest that approach.
Since Qwen is not using Flex, I think CP would work once the `freqs_cis` issue @wwwjn mentioned is fixed.
torchtitan/experiments/qwen3/train_configs/qwen3_moe_debug.toml (outdated; review comments resolved)
- MoE alternatives

## To be added
- MoE model
- `StateDictAdapter` support for MoE model
This is the next step, right?
Yes, this is on my plate
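For reference, most of the MoE `StateDictAdapter` work is remapping per-expert HF checkpoint keys into torchtitan's grouped expert weights. The sketch below is hypothetical: the class name, key patterns, target key format, and stacking layout are assumptions for illustration, not the real torchtitan interface.

```python
import re
import torch

class Qwen3MoEStateDictAdapterSketch:
    # e.g. "model.layers.3.mlp.experts.17.gate_proj.weight" (assumed HF layout)
    _HF_EXPERT = re.compile(
        r"model\.layers\.(\d+)\.mlp\.experts\.(\d+)\.(gate_proj|up_proj|down_proj)\.weight"
    )

    def from_hf(self, hf_state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        out, experts = {}, {}
        for key, tensor in hf_state_dict.items():
            m = self._HF_EXPERT.match(key)
            if m is None:
                out[key] = tensor  # non-MoE weights pass through (renaming elided)
                continue
            layer, expert, proj = int(m.group(1)), int(m.group(2)), m.group(3)
            experts.setdefault((layer, proj), {})[expert] = tensor
        # Stack per-expert weights into one (num_experts, out_dim, in_dim) tensor per projection.
        for (layer, proj), weights in experts.items():
            stacked = torch.stack([weights[i] for i in sorted(weights)])
            out[f"layers.{layer}.moe.experts.{proj}"] = stacked
        return out
```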
LGTM!
Did you verify that this model is the same as HF Qwen3?
Thanks for pointing this out! For the dense model, yes, we did some checks on end-to-end forward results, e.g. the description in #1590, but we haven't done finer-granularity checks on intermediate results. I saw you have a great test script in your PR and I'm thinking of leveraging that! For MoE, the numerical verification is in progress. I agree the Attention part is not the same as Qwen3's - nice catch! I will create a fix for that.
Nice, thank you! Yes, feel free to use any of my code.
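As a rough illustration of the end-to-end forward check mentioned above (a hypothetical helper: the model name, tolerance, and the assumption that the torchtitan model returns logits directly are not from either PR):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def compare_logits(titan_model, hf_name="Qwen/Qwen3-0.6B", prompt="Hello, world", atol=1e-2):
    # Run the same prompt through the HF reference model and the torchtitan model,
    # then compare final logits.
    tokenizer = AutoTokenizer.from_pretrained(hf_name)
    hf_model = AutoModelForCausalLM.from_pretrained(hf_name, torch_dtype=torch.float32)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    hf_logits = hf_model(input_ids).logits
    titan_logits = titan_model(input_ids)  # assumes the torchtitan model returns logits

    max_diff = (hf_logits - titan_logits).abs().max().item()
    print(f"max abs diff: {max_diff:.6f}")
    return torch.allclose(hf_logits, titan_logits, atol=atol)
```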
As the Qwen3 dense and MoE models share a lot of common parts (e.g., Attention), I added the MoE module on top of the Qwen3 dense model.
Initial verification with FSDP=8, EP=2
