Commit 2a7a148

[MoE][compile][full ac] weave torch.compile around the FSDP(GroupedExperts) graph break (#1895)
Stacked PRs:
* __->__ #1895

---

This PR changes how we compile MoE layers to work around Compile + AC limitations. When you apply `AC(Compile(block))` or `Compile(AC(block))` and there is a graph break in `block`, the entire block falls back to eager. For llama3, we worked around this problem by addressing all graph breaks. With MoE models, particularly dp2ep, we need to wrap `block.moe.experts` in FSDP, which means we hit graph breaks when tracing `block.moe.experts.__call__`, and so whenever AC was enabled the entire MoE block fell back to eager: https://gist.github.com/xmfan/50f4de1e89d789cd63a21aca9e600132 (note that in the tlparse, graph 0/1 is empty; it corresponds to the block containing the MoE).

The workaround in this PR is to avoid tracing `block.moe.experts.__call__` altogether, by individually wrapping torch.compile around submodules of TransformerBlock. Note that this leaves some perf on the table, since it may exclude some ops in TransformerBlock.forward and MoE.forward. This is an API limitation: we have no way to capture those ops while keeping the compile wrapper decoupled from the model code. The workaround will no longer be necessary once either:
- we can do Compile + AC with graph breaks, or
- we remove the FSDP graph break.

This change introduces a small regression to the non-AC configuration. You can see a small perf dip from [before this PR](https://gist.github.com/xmfan/0b32e95980d263cf3f62869fa4d85921) to [after this PR](https://gist.github.com/xmfan/11561b5406b3f92ecd08da94bc5ee4e3). Given that AC is a necessity for running non-toy configurations of these models, I chose to stick with this implementation to make comparisons easier.

Validated on the DSv3 debug model:
- dp2ep, no AC, no compile: https://gist.github.com/xmfan/927f354158ad36f4c5c1ffedde4e4ebe
- dp2ep, no AC, compile: https://gist.github.com/xmfan/11561b5406b3f92ecd08da94bc5ee4e3
  - before this PR (compile w/ nested graph break): https://gist.github.com/xmfan/0b32e95980d263cf3f62869fa4d85921
- dp2ep, full AC, compile: https://gist.github.com/xmfan/6ed5b48aa51ce0ac2b6bfceb86a0c482
  - before this PR (whole MoE block in eager): https://gist.github.com/xmfan/50f4de1e89d789cd63a21aca9e600132
- dp2ep, full AC, no compile: https://gist.github.com/xmfan/2308355c2aa4814fe3d12243445555fa
- dp2ep, pp, full AC, compile: https://gist.github.com/xmfan/5a1ac23f00abdf93dbcc1539f552e840
- dp2ep, pp, full AC, no compile: https://gist.github.com/xmfan/302cda7191e53ffad5c4dc1e4b8f02de
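To make the per-submodule wrapping described above concrete, here is a minimal, self-contained sketch of the pattern (editor's illustration, not the torchtitan code; the real implementation is in the diff below). The module names `ToyBlock`, `ToyMoE`, and `ToyExperts` are made up: instead of compiling the whole block, which would trace the hook-carrying `experts.__call__`, each child module is compiled separately and the problematic child is skipped.

```python
# Minimal sketch of "compile the children, skip the hook-carrying module".
# All class names here are hypothetical stand-ins, not torchtitan classes.
import torch
import torch.nn as nn


class ToyExperts(nn.Module):
    # Stand-in for FSDP(GroupedExperts): its __call__ carries hooks we do not
    # want inside the compiled region, so we leave it eager.
    def forward(self, x):
        return x * 2.0


class ToyMoE(nn.Module):
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(8, 8)
        self.experts = ToyExperts()

    def forward(self, x):
        return self.experts(self.router(x))


class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(8, 8)
        self.moe = ToyMoE()

    def forward(self, x):
        return self.moe(self.attn(x))


block = ToyBlock()

# Compile each child of the block individually. For the MoE child, compile its
# own children and skip "experts" so its __call__ is never traced.
for name, child in block.named_children():
    if isinstance(child, ToyMoE):
        for sub_name, sub in child.named_children():
            if sub_name == "experts":
                continue  # keep the hook-carrying module out of the compiled region
            setattr(child, sub_name, torch.compile(sub, fullgraph=True))
    else:
        setattr(block, name, torch.compile(child, fullgraph=True))

out = block(torch.randn(2, 8))
print(out.shape)
```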
1 parent 3e084f4 commit 2a7a148

File tree: 1 file changed (+65, −9)

torchtitan/models/llama4/infra/parallelize.py

Lines changed: 65 additions & 9 deletions
@@ -6,6 +6,9 @@
 
 import torch
 import torch.nn as nn
+from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
+    CheckpointWrapper,
+)
 from torch.distributed.device_mesh import DeviceMesh
 from torch.distributed.fsdp import CPUOffloadPolicy, fully_shard, MixedPrecisionPolicy
 from torch.distributed.tensor import Partial, Replicate, Shard
@@ -30,6 +33,7 @@
 )
 from torchtitan.distributed.tensor_parallel import maybe_enable_async_tp
 from torchtitan.models.llama3.infra.parallelize import apply_ddp
+from torchtitan.models.moe import moe as moe_module
 from torchtitan.tools.logging import logger
 
 
@@ -509,17 +513,69 @@ def apply_compile(model: nn.Module, compile_config: CompileConfig):
     """
     # NOTE: This flag is needed for torch.compile to avoid graph breaking on dynamic shapes in token-choice MoE
     # but it is experimental.
-    # torch._dynamo.config.capture_scalar_outputs = True
+    torch._dynamo.config.capture_scalar_outputs = True
     for layer_id, transformer_block in model.layers.named_children():
-        # TODO: remove when torch.compile supports fullgraph=True for MoE
-        fullgraph = True
         if transformer_block.moe_enabled:
-            fullgraph = False
-        transformer_block = torch.compile(
-            transformer_block,
-            backend=compile_config.backend,
-            fullgraph=fullgraph,
-        )
+            # If it is a MoE layer, FSDP(GroupedExperts) will cause a graph break
+            # So we must weave compile wrappers around those FSDP hooks to
+            # prevent AC from falling back the whole graph to eager.
+            # TODO: Fix Compile(AC(graph break))
+
+            if isinstance(transformer_block, CheckpointWrapper):
+                # TODO: Make CheckpointWrapper a transparent wrapper
+                # unwrap so that .named_children() works
+                block = transformer_block._checkpoint_wrapped_module
+            else:
+                block = transformer_block
+
+            for attr_name, submod in block.named_children():
+                assert getattr(block, attr_name) == getattr(
+                    transformer_block, attr_name
+                )
+
+                if isinstance(submod, moe_module.MoE):
+                    # avoid graph breaking on the GroupedExperts' FSDP hooks
+                    # by wrapping each submod's forward instead of their __call__
+                    moe = submod
+                    for attr_name, submod in moe.named_children():
+                        if attr_name == "experts":
+                            # NOTE: We don't compile token dispatch and token combine due to an issue on B200:
+                            # https://github.com/pytorch/torchtitan/issues/1940
+                            continue
+                        setattr(
+                            moe,
+                            attr_name,
+                            torch.compile(
+                                submod, backend=compile_config.backend, fullgraph=True
+                            ),
+                        )
+                else:
+                    setattr(
+                        block,
+                        attr_name,
+                        torch.compile(
+                            submod, backend=compile_config.backend, fullgraph=True
+                        ),
+                    )
+
+        else:
+            # If it's not a MoE layer, there is no FSDP(GroupedExperts)
+            # So we can compile the whole block
+            transformer_block = torch.compile(
+                transformer_block,
+                backend=compile_config.backend,
+                fullgraph=True,
+            )
+
         model.layers.register_module(layer_id, transformer_block)
 
+    moe_module._run_experts_grouped_mm = torch.compile(
+        moe_module._run_experts_grouped_mm,
+        backend=compile_config.backend,
+        fullgraph=True,
+    )
+
+    # NOTE: We don't compile for loop code path due to an issue with unbacked symints:
+    # https://github.com/pytorch/pytorch/issues/166460
+
     logger.info("Compiling each TransformerBlock with torch.compile")
