
@xmfan (Member) commented Nov 4, 2025

FIXES #1935

Stacked PRs:

tlparse: https://fburl.com/sqxd6c0w


Work around the AC HOP mutation issue when tracing token dispatch

TORCH_COMPILE_FORCE_DISABLE_CACHES=1 HF_TOKEN=<token> HF_HUB_DISABLE_XET=1 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" with-proxy ./run_train.sh --model.name simple_fsdp.deepseek_v3

This is a problem for SimpleFSDP, where we want to fullgraph the entire model: these "mutations" cause graph breaks.

It is less of a problem outside SimpleFSDP, because we don't currently compile token dispatch there.
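To illustrate the kind of change under discussion, here is a minimal pure-Python sketch, not the code in this PR. It assumes the "mutation" is routing metadata being stored as attributes during token dispatch, and that the workaround is to return that metadata and pass it along explicitly. The class `StatefulDispatcher` and the functions `dispatch` and `combine` are hypothetical names; only `permuted_indices` and `input_splits` are taken from the diff excerpt in this conversation.

```python
class StatefulDispatcher:
    """Hypothetical dispatcher that stashes routing metadata on itself."""

    def dispatch(self, tokens, expert_ids, num_experts):
        # Side effects: attribute mutation during dispatch (assumed to be
        # the pattern that a traced, checkpointed region cannot tolerate).
        self.input_splits = [expert_ids.count(e) for e in range(num_experts)]
        self.permuted_indices = sorted(
            range(len(tokens)), key=lambda i: expert_ids[i]
        )
        return [tokens[i] for i in self.permuted_indices]

    def combine(self, routed):
        # Reads the state written by dispatch() to undo the permutation.
        out = [None] * len(routed)
        for pos, i in enumerate(self.permuted_indices):
            out[i] = routed[pos]
        return out


def dispatch(tokens, expert_ids, num_experts):
    """Functional variant: metadata is returned, nothing is mutated."""
    input_splits = [expert_ids.count(e) for e in range(num_experts)]
    permuted_indices = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    routed = [tokens[i] for i in permuted_indices]
    return routed, permuted_indices, input_splits


def combine(routed, permuted_indices):
    """Undo the permutation using metadata passed in by the caller."""
    out = [None] * len(routed)
    for pos, i in enumerate(permuted_indices):
        out[i] = routed[pos]
    return out


tokens = ["a", "b", "c", "d"]
expert_ids = [1, 0, 1, 0]
routed, permuted_indices, input_splits = dispatch(tokens, expert_ids, 2)
restored = combine(routed, permuted_indices)
```

The functional variant has no hidden state, but the caller now has to carry `permuted_indices` and `input_splits` between the two calls, which is the trade-off this sketch is meant to show.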

@meta-cla meta-cla bot added the CLA Signed label Nov 4, 2025
@xmfan xmfan marked this pull request as ready for review November 5, 2025 04:03
@xmfan xmfan requested a review from ruisizhang123 November 5, 2025 04:03
Comment on lines +146 to +149
input_shape,
permuted_indices,
input_splits,
output_splits,
Contributor


These shouldn't be exposed to single-device model code. Plus, I don't think it will work if EP is not used.

If it's getting too hard, maybe we should use local_map / to_local to re-implement MoE.


Development

Successfully merging this pull request may close these issues.

SimpleFSDP AC HOP mutation issue when tracing token dispatch
