[mxfp8 moe training] add compile support #2990
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2990
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 10 Pending as of commit d4a26bd with merge base 66384a9.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Possible wrap_triton bug:
Command:
Error:
cc @zou3519 any idea what could be going on here? This is with torch built from source on 9/10.
Update: disabling dynamic shapes resolved the error.

Disabling dynamic shapes worked as a workaround.
Only forward is getting compiled for some reason, not backward: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp7B4CM7/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
This is with torch built from source with CUDA 12.9 at 53b8bdb97774114ca02948fed47f2fd49996c564 (Sept 12th).
Do you have a function in C++ that calls Tensor.sizes()? The main fix for that error message is to change Tensor.sizes() in C++ to Tensor.sym_sizes().
lambda x: ScaledGroupedMMTensor(x, scaling_type),
out,
)
# Only wrap tensor outputs, prevent double wrapping
What would cause double wrapping to have happened previously? I would expect that as long as args_unwrapped above only consists of plain tensors, then out here should be only plain tensors as well.
Tried to repro the original issue to answer this, but now, using the latest torch nightly build, it doesn't repro anymore... good news, I guess.
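For context, here is a minimal sketch of the "only wrap plain tensors" behavior being discussed. The wrap_outputs helper name is hypothetical, and the import path for ScaledGroupedMMTensor is assumed from torchao's moe_training prototype; the point is just that values which are already subclass instances (or not tensors at all) pass through untouched, so double wrapping cannot occur:

```python
import torch
from torch.utils import _pytree as pytree

# Assumed import path, for illustration only; the class comes from the diff above.
from torchao.prototype.moe_training.tensor import ScaledGroupedMMTensor


def wrap_outputs(out, scaling_type):
    # Hypothetical helper: wrap only plain torch.Tensor outputs back into the
    # subclass; anything already wrapped (or not a tensor) is left as-is.
    def maybe_wrap(x):
        if isinstance(x, torch.Tensor) and not isinstance(x, ScaledGroupedMMTensor):
            return ScaledGroupedMMTensor(x, scaling_type)
        return x

    return pytree.tree_map(maybe_wrap, out)
```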
torch.ops.aten.view.default,
torch.ops.aten.as_strided.default,
torch.ops.aten._to_copy.default,  # for *.to(dtype)
It seems a bit strange to me that _to_copy should always preserve the subclass. For example, if I do aten._to_copy(mxfp8_tensor, dtype=torch.bfloat16), I would expect the output to not be a subclass, right?
Put another way, I'd imagine that you need to carefully implement _to_copy to only preserve the subclass if we are not doing a cross-dtype copy to a higher precision.
> For example, if I do aten._to_copy(mxfp8_tensor, dtype=torch.bfloat16), I would expect the output to not be a subclass, right?

It would seem that way, but we actually do want to preserve the subclass in this case, because of how torchtitan integrates with grouped_mm here.
Basically, the kernels that torch._grouped_mm and torch._scaled_grouped_mm dispatch to only accept bf16 as the input/output dtype, yet torchtitan wants to support training in full fp32 precision, so they do this cast here.
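As a rough illustration of the behavior under discussion (greatly simplified, not the actual torchao implementation), a wrapper subclass can special-case aten._to_copy in __torch_dispatch__ so that a dtype cast such as x.to(torch.bfloat16) still returns the subclass:

```python
import torch
from torch.utils import _pytree as pytree


class GroupedMMWrapper(torch.Tensor):
    """Hypothetical, minimal wrapper subclass used only for illustration."""

    @staticmethod
    def __new__(cls, data: torch.Tensor):
        return torch.Tensor._make_wrapper_subclass(
            cls, data.shape, dtype=data.dtype, device=data.device
        )

    def __init__(self, data: torch.Tensor):
        self._data = data

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Unwrap to plain tensors, run the op, then decide whether to re-wrap.
        args = pytree.tree_map_only(GroupedMMWrapper, lambda t: t._data, args)
        kwargs = pytree.tree_map_only(GroupedMMWrapper, lambda t: t._data, kwargs)
        out = func(*args, **kwargs)
        if func == torch.ops.aten._to_copy.default:
            # Preserve the subclass across dtype casts (e.g. fp32 -> bf16),
            # mirroring the torchtitan/grouped_mm interaction described above.
            return pytree.tree_map_only(torch.Tensor, GroupedMMWrapper, out)
        return out
```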
torch._dynamo.config.cache_size_limit = 1000

# Workaround for https://github.com/pytorch/ao/pull/2990#issuecomment-3285762681
torch._dynamo.config.automatic_dynamic_shapes = False
Is this needed for fast e2e perf? If yes, add it to README.md?
Oh, we can remove this workaround now; it seems like with the latest nightly the issue is resolved. Updated.
So actually yes: while the bug/crash no longer occurs with dynamic shapes, perf is worse with dynamic shapes: https://www.internalfb.com/phabricator/paste/view/P1949814669
So I think we should recommend disabling them. I will make a note of this to put in the README.
No, we don't need to recommend it in the README. Benchmarks artificially hit automatic dynamic shapes because we are sweeping certain dims. In practice the dims being swept won't be dynamic, or if they are, it's up to users to decide how they want to handle this: dynamic=False with more recompiles, or the default behavior.
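For reference, the two options mentioned above look roughly like this (illustrative only; my_moe_forward is a placeholder, not code from this PR):

```python
import torch


def my_moe_forward(x: torch.Tensor) -> torch.Tensor:
    # Placeholder standing in for the real MoE forward pass.
    return x * 2


# Option 1: globally disable automatic dynamic shapes, so swept dims always
# recompile with static shapes (the workaround discussed above).
torch._dynamo.config.automatic_dynamic_shapes = False

# Option 2: request static shapes for a single compiled function; each new
# input shape triggers a recompile instead of a dynamic-shape graph.
compiled_fn = torch.compile(my_moe_forward, dynamic=False)
out = compiled_fn(torch.randn(16, 32))
```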
# Final offset is the total number of rows in the tensor
- padded_rows = output_group_start_offsets[-1]
+ padded_rows = rows + num_groups * 128
Can you write a note: padding to max_possible: ...
and then link to that note throughout the codebase? This is a pretty common pattern.
# output_group_start_offsets always starts with 0 and ends with the total number of cols
- padded_cols = output_group_start_offsets[-1]
+ padded_cols = cols + num_groups * 4
output = scales_tensor.new_empty((padded_rows, padded_cols))
ditto
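To make the "padding to max possible" pattern concrete, here is a hedged sketch of the kind of note being requested. The helper name is hypothetical, and the example sizes (M=16640, K=5120, block_size=32, 4 groups) are taken from the benchmark shapes elsewhere in this PR; the 128/4 constants mirror the diffs above:

```python
import torch


def upper_bound_padded_scale_dims(rows: int, cols: int, num_groups: int) -> tuple[int, int]:
    # Note: padding to max possible size. Each group is padded independently
    # (rows up to a multiple of 128, scale cols up to a multiple of 4, per the
    # constants in the diffs above), and the exact per-group sizes live on the
    # GPU. Reading them (e.g. output_group_start_offsets[-1], which does a
    # .item() under the hood) would force a d2h sync, so we instead allocate
    # for the worst case: each group adds at most one extra block of padding.
    padded_rows = rows + num_groups * 128
    padded_cols = cols + num_groups * 4
    return padded_rows, padded_cols


# Illustrative numbers: 16640 rows of scales, K=5120 with block_size=32 gives
# 5120 // 32 = 160 scale columns, split across 4 groups.
padded_rows, padded_cols = upper_bound_padded_scale_dims(rows=16640, cols=160, num_groups=4)
output = torch.empty((padded_rows, padded_cols), dtype=torch.uint8)  # raw e8m0 bytes, purely for illustration
```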
assert x_scales.ndim == 2, "x_scales must be 2D"
assert block_size == 32, "Only block_size=32 is supported for now"
blocked_scales_list = []
M, total_K = x_scales.shape
Note for reviewers: the changes to the torch reference implementation torch_to_blocked_2d_K_groups (to account for upper-bound based padding, i.e. total_padded_K = cols + num_groups * 4) are not working properly yet. The reference does not match the triton kernel in unit tests, but I am sure the triton kernel is correct because the e2e training tests validate correct gradient numerics. As further evidence, if I replace the triton kernel with this torch impl in the training code, the gradient numerics are garbage.
I need to figure out the correct way to represent "row of blocks"-major PER GROUP in torch-native code before landing this.
Stacked PRs:
[mxfp8 moe training] add compile support
padded_rows = output_scale_group_offsets[-1]: by looking at the trace I found this was doing a .item() under the hood and causing a d2h sync, so now I'm instead just using the upper bound of the padding needed via padded_rows = rows + num_groups * 128, avoiding the d2h sync.

Test plan
pytest test/prototype/moe_training/test_training.py

Microbenchmarks with compile
Perf analysis for M=16640, G=4, N=8192, K=5120
Looking at the trace to see why it's slower, here are my initial takeaways:
Mxfp8 grouped GEMMs are only 1.57x faster than bf16 grouped GEMMs, but should be ~2.1x faster (mxfp8 avg = ~700us, bf16 avg = ~1.1ms). This contradicts the grouped GEMM microbenchmarking we did, which indicates that for shape M=16640, G=4, K=5120, N=8192 we should be getting a ~2.1x speedup, yet we are only getting ~1.57x. We need to determine what is going on here. cc @cthi @slayton58 who may be able to help with this.
Blocked scale swizzling kernels look okay: 2 of the 3 handwritten triton kernels for converting scales to blocked swizzled format are very fast. One may need more optimization (TBD):
- triton_scale_swizzle_M_groups = 18us avg
- triton_scale_swizzle_K_groups = 14us avg
- triton_scale_swizzle_per_group_3d = ~250us avg (not surprising it's longer, since the 3d tensor is much more data than the 2d activations; will get some mem bw benchmarks on these kernels though)

Mxfp8 dim1 cast CUDA kernel looks good: ~100us average, used on 2d RHS operands. This kernel achieves ~5300gbps mem bw, around 66% of the peak 8TB/s bandwidth, so it can potentially still be improved.
Torch inductor codegen kernels are pretty slow (worst offender is ~1.2ms, longer than the grouped GEMM itself). This is probably largely due to the stray .contiguous() calls I was forced to do by to_mx API limitations, but I'm guessing it's also that inductor codegen is slow, as it has been historically for various cases in fp8 rowwise, fp8 blockwise, and mxfp8. So we should try to get the mxfp8 dim1 CUDA kernel working for 3d tensors, so we get the double win of avoiding the .contiguous() calls and using the faster kernel, which achieves higher mem bw utilization than torch.compile / triton. I can do this next, or perhaps @slayton58 may be interested in this. (Perhaps we can just reshape 3d->2d, use the existing kernel, then reshape back to 3d? Need to think about this.)
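To make the parenthetical idea above concrete, here is a speculative sketch of the reshape approach. cast_2d_kernel is a stand-in for the existing 2d mxfp8 dim1 cast CUDA kernel, and the open question from the comment above (whether per-expert scaling-group boundaries survive the flatten/unflatten) is deliberately left as a TODO:

```python
import torch


def mxfp8_dim1_cast_3d_via_2d(w: torch.Tensor, cast_2d_kernel):
    # Speculative: flatten the expert dim so the existing 2d dim1 cast kernel
    # can be reused, then restore the 3d shape.
    E, N, K = w.shape
    w_2d = w.reshape(E * N, K)  # no copy if w is already contiguous
    data_2d, scales_2d = cast_2d_kernel(w_2d)
    data_3d = data_2d.reshape(E, N, K)
    # TODO: verify that per-expert scaling-group boundaries are preserved by
    # this flatten/unflatten; the scales in particular may need per-expert
    # reshaping ("need to think about this" per the comment above).
    return data_3d, scales_2d
```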