[moe training] add benchmarks for dsv3 236b, 671b shapes; reorganize benchmarks dir #2999
Conversation
I think we should try to base these sizes (especially for training) on known recipes. I'm not sure if any training details are out for DSV3 regarding global batch size + EP degree. It looks like local tokens are 16384, with an EP degree of 1? I'm actually not sure if this means 1 expert per device or 128.
Makes sense, I agree, although I don't think the …
@danielvegamyhre agreed, the config was interesting. cc @tianyu-l. If Gemini with access to the DSV3 paper is to be believed: …
The config in torchtitan is not tuned at all (as you can tell, it's naive all-FSDP). We still have some obvious perf bugs to fix.
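To make the EP-degree arithmetic in this thread concrete, here is a minimal sketch (`local_experts` is a hypothetical helper, not from this PR; it assumes DeepSeek-V3's reported 256 routed experts per MoE layer):

```python
# Hypothetical helper illustrating the EP-degree arithmetic discussed above.
# Assumption: DeepSeek-V3 reports 256 routed experts per MoE layer.
def local_experts(num_routed_experts: int, ep_degree: int) -> int:
    assert num_routed_experts % ep_degree == 0, "EP degree must divide expert count"
    return num_routed_experts // ep_degree

# With ep_degree=1 there is no expert parallelism: each device's MoE layer
# sees all routed experts (weight sharding, if any, comes from FSDP instead).
print(local_experts(256, 1))   # 256 experts per device
print(local_experts(256, 32))  # 8 experts per device
```

On that reading, ep_degree=1 would mean every routed expert is local to the device, rather than 1 expert per device.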
Stacked PRs:
- [moe training] organize bench dir by quant type; rename bench scripts to match kernel names

Benchmarks for Llama4 16e, DSV3 236b, and DSV3 671b with various experts per device (i.e., different EP degrees); a hypothetical sweep over these shapes is sketched after this list:
- Llama4 16e shapes with 1-8 experts per device
- DSV3 671b shapes with 1-8 experts per device
- DSV3 236b shapes with 1-8 experts per device
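As a sketch of the sweep these bullets describe (illustrative only, not the PR's actual bench script; the `(dim, hidden_dim)` values are my recollection of the published model configs and should be double-checked):

```python
# Hypothetical sweep over the benchmarked shapes. The (dim, hidden_dim)
# pairs below are assumptions based on the published model configs, not
# values taken from this PR.
model_configs = {
    "llama4_16e": (5120, 8192),  # (hidden size, expert FFN size)
    "dsv3_236b": (5120, 1536),
    "dsv3_671b": (7168, 2048),
}
local_tokens = 16384  # per-device token count discussed in the thread above

for name, (dim, hidden_dim) in model_configs.items():
    # Assuming a power-of-two sweep over the "1-8 experts per device" range.
    for experts_per_device in (1, 2, 4, 8):
        # Grouped GEMM problem: experts_per_device groups over local_tokens rows.
        print(f"{name}: E={experts_per_device}, M={local_tokens}, "
              f"K={dim}, N={hidden_dim}")
```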
Important caveat

The current bench script produces directionally consistent results, but the absolute speedup factor varies between runs. Example:

Run 1:

Run 2:
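One way to damp this run-to-run variance in the reported speedup is to take the median over several independent timing trials. A minimal sketch using `torch.utils.benchmark` (this is not the PR's bench script; `median_time_us` is a hypothetical helper):

```python
import statistics

import torch.utils.benchmark as benchmark


def median_time_us(fn, *args, trials: int = 5) -> float:
    """Median wall time in microseconds over `trials` independent measurements."""
    timer = benchmark.Timer(stmt="fn(*args)", globals={"fn": fn, "args": args})
    # blocked_autorange() handles warmup and picks the iteration count itself
    # (and synchronizes CUDA); repeating it and taking the median of the
    # per-trial medians damps run-to-run noise in the final speedup number.
    return statistics.median(
        timer.blocked_autorange().median * 1e6 for _ in range(trials)
    )


# Usage: speedup = median_time_us(bf16_fn) / median_time_us(quantized_fn)
```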