[mxfp8 moe training] add torchao MXFP8 MoE training integration; bump version guard #1701
base: main
Conversation
Force-pushed from d416130 to 960840d, then from 960840d to 90582f1 (…ump version guard).
```diff
@@ -39,8 +40,8 @@ def __init__(self, job_config: JobConfig, parallel_dims: ParallelDims):
         )
         torchao_version = version("torchao")

-        # Last torchao release was 0.12.0, so nightly build starts with 0.13.0+git...
-        is_nightly_build = torchao_version.startswith("0.13.0")
+        # Last torchao release was 0.13.0, so nightly build starts with 0.13.0+git...
```
nit: update this comment so it isn't tied to a specific version
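A minimal sketch of a version-agnostic guard, assuming it is acceptable to detect nightly builds from the local version segment rather than a hard-coded release prefix (this is an illustration, not the PR's actual change):

```python
from importlib.metadata import version

torchao_version = version("torchao")
# Nightly/dev builds carry a local version segment such as "0.13.0+gitabc1234",
# so checking for "+git" avoids hard-coding whichever release number comes next.
is_nightly_build = "+git" in torchao_version
```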
```python
logger.info(
    f"Setting token group alignment size to {self.mxfp8_token_group_alignment_size}"
)
set_token_group_alignment_size_m(self.mxfp8_token_group_alignment_size)
```
If we're making this alignment a variable, maybe update the name so it's not mxfp8-specific, if your intention is that when we add nvfp4/mxfp4 we just bump this value.
```python
# TODO: add warning in torchao when this happens, or find a better way to avoid this.
if self.moe_fqns:
    self._convert_moe_layers(model)
```
it feels like the converter registered for `self.config` should handle this case specifically
As in having separate converters registered for mxfp8 MoE, fp8 MoE, etc.? If so, I kind of agree: the converters are becoming a bit of a mess, and separate converters would also let users convert just dense or just MoE (or both), rather than the current state where dense must be converted in order to convert MoE.
Yeah, that is one way. Another thing I was thinking is that https://github.com/pytorch/ao/blob/93030e750186ace1c1c2ee7a849e2818a9f0ffde/torchao/prototype/moe_training/conversion_utils.py#L50 should be able to gracefully handle the case where the module has already been converted.
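A minimal sketch of such a guard, assuming the converted expert weights show up as torchao's `ScaledGroupedMMTensor` subclass (the helper name is hypothetical, not torchao's actual API):

```python
import torch.nn as nn

def _already_converted(module: nn.Module) -> bool:
    # Treat a module as converted if any of its own parameters were already
    # swapped to the grouped-GEMM tensor subclass, making re-conversion a no-op.
    return any(
        type(param.data).__name__ == "ScaledGroupedMMTensor"
        for param in module.parameters(recurse=False)
    )
```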
In the log `Swapped w1.weight to ScaledGroupedMMTensor`, it's hard to tell which `w1` is converted. Can we include the full FQN?
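A minimal sketch of how the swap log could carry the full FQN, assuming the converter iterates `model.named_modules()`; `should_convert` is a hypothetical stand-in for the existing filter:

```python
for module_fqn, module in model.named_modules():
    if not should_convert(module, module_fqn):  # hypothetical filter
        continue
    for param_name, _ in module.named_parameters(recurse=False):
        # Log the parent module's full FQN plus the parameter name.
        logger.info(f"Swapped {module_fqn}.{param_name} to ScaledGroupedMMTensor")
```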
> (eager only currently - compile support in progress)

Does this mean we shouldn't care too much about performance with this PR? How about numerics?
I'll try to find some time to look into #1651
```diff
@@ -30,6 +30,7 @@ class MXConverter(ModelConverter):
     enabled: bool
     filter_fqns: List[str]
     mx_config: Any  # MXLinearConfig type when imported
+    token_group_alignment_size = 32
```
Maybe make this a MACRO if it's not really a "variable". E.g. you can keep a central map in `quantization/__init__.py` for both fp8 and mx.
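A minimal sketch of such a central map, assuming it lives in `quantization/__init__.py`; the constant name and the fp8 value are illustrative, and only the mxfp8 value of 32 comes from this PR:

```python
# Central token-group alignment constants shared by the fp8 and mx converters,
# so per-recipe values live in one place instead of per-converter attributes.
TOKEN_GROUP_ALIGNMENT_SIZE_M = {
    "fp8": 16,    # illustrative value, not taken from this PR
    "mxfp8": 32,  # mxfp8 scaling blocks are 32 elements wide
}
```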
Please revert the changes in this file, as I believe they were accidentally submitted.
```toml
[mx]
filter_fqns = ["output", "router.gate"]
moe_fqns_prototype = ["experts,shared_expert"]
```
`shared_expert` is now made up of 3 `nn.Linear` instead of `GroupedExperts`, so maybe you don't want to include them? In your paste they are converted to `MXLinear`.
oh, i didn't know that this had changed. will update accordingly
```python
def moe_module_filter_fn(mod: nn.Module, cur_fqn: str) -> bool:
    for target_fqn in self.moe_fqns:
        if target_fqn in cur_fqn:
            return True
    return False
```
Do you think we can put this in `quantization/utils.py`, together with the similar thing in the fp8 file?
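A minimal sketch of a shared helper that could live in `quantization/utils.py` and be reused by both the fp8 and mx converters; the function name is hypothetical:

```python
from typing import Callable, List

import torch.nn as nn


def build_moe_module_filter(target_fqns: List[str]) -> Callable[[nn.Module, str], bool]:
    """Return a filter matching modules whose FQN contains any of the target substrings."""

    def _filter(mod: nn.Module, cur_fqn: str) -> bool:
        return any(target_fqn in cur_fqn for target_fqn in target_fqns)

    return _filter
```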
WIP
Summary
- Exclude `output,router.gate` from conversion (fixes MXFP8 error for Llama4 from MXLinear #1703)
- Convert MoE layers matching `experts,shared_expert`
Test plan
Llama4 debug model config:
Limitations