Conversation

@xuzhao9 (Contributor) commented Nov 1, 2025

We are adding kernels on AMD in #604

aiter seems suspiciously fast: roughly 10x faster than the H100 baselines at the longer sequence lengths. Need to look deeper into this...
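
One way to start digging (a hedged sketch, not something the benchmark currently does): check that the aiter kernel is actually computing full attention on these shapes by comparing its forward output against PyTorch SDPA. The flash_attn_func import below assumes aiter exposes a flash-attn-style entry point taking (B, S, H, D) tensors; the actual symbol tritonbench wraps may differ.

# Hedged sanity check: compare aiter's flash-attention forward output
# against PyTorch's scaled_dot_product_attention on the largest benchmark shape.
# Assumption: aiter exposes a flash-attn-style flash_attn_func; adjust to the
# entry point the benchmark actually calls.
import torch
from aiter import flash_attn_func  # assumed import path

B, H, S, D = 4, 48, 8192, 64
q, k, v = (torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

out = flash_attn_func(q, k, v, causal=False)

# SDPA expects (B, H, S, D), so transpose into and out of that layout.
ref = torch.nn.functional.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

print((out - ref).abs().max())  # should be on the order of bf16 rounding error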

Test plan (on MI300):

$ python run.py --op flash_attention --only flex_attention,triton_tutorial_flash_v2,aiter 

  (Batch, Heads, SeqLen, SeqLen_KV, Dhead)    flex_attention-latency    triton_tutorial_flash_v2-latency       aiter-latency
------------------------------------------  ------------------------  ----------------------------------  ------------------
                     (4, 48, 128, 128, 64)        0.078788 (±70.82%)                 0.024714 (±254.78%)  0.072060 (±67.70%)
                     (4, 48, 256, 256, 64)        0.085598 (±52.27%)                   0.028199 (±4.26%)  0.073061 (±67.38%)
                     (4, 48, 512, 512, 64)         0.126614 (±2.02%)                   0.066972 (±6.52%)   0.048226 (±8.14%)
                   (4, 48, 1024, 1024, 64)         0.417134 (±1.97%)                   0.210209 (±4.61%)   0.087360 (±3.26%)
                   (4, 48, 2048, 2048, 64)         1.316211 (±3.77%)                   0.630747 (±2.90%)   0.144759 (±3.87%)
                   (4, 48, 4096, 4096, 64)         5.113835 (±1.29%)                   2.290631 (±3.78%)   0.278904 (±4.06%)
                   (4, 48, 8192, 8192, 64)        19.780565 (±1.54%)                   8.985677 (±0.74%)   0.614846 (±3.44%)
                                   average
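
A back-of-envelope check on the largest shape supports the suspicion (assuming the latency columns are milliseconds and a non-causal forward pass, i.e. 4*B*H*Sq*Skv*D FLOPs):

# Implied throughput of the aiter result at (4, 48, 8192, 8192, 64).
# Assumes a non-causal forward pass: two matmuls, 2 FLOPs per MAC.
def attn_fwd_tflops(B, H, Sq, Skv, D, latency_ms):
    flops = 4 * B * H * Sq * Skv * D
    return flops / (latency_ms * 1e-3) / 1e12

print(attn_fwd_tflops(4, 48, 8192, 8192, 64, 0.614846))  # ~5365 TFLOPS

That is roughly 5.4 PFLOP/s, several times the MI300X dense BF16 peak (about 1.3 PFLOP/s per the published specs); even a causal mask, which halves the FLOP count, would not close the gap. So the aiter number here likely reflects a measurement or configuration issue rather than a genuinely faster kernel.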

In comparison, H100:

  (Batch, Heads, SeqLen, SeqLen_KV, Dhead)    flex_attention-latency    triton_tutorial_flash_v2-latency    triton_tutorial_flash_v2_tma-latency    cudnn-91002-latency
------------------------------------------  ------------------------  ----------------------------------  --------------------------------------  ---------------------
                     (4, 48, 128, 128, 64)         0.027424 (±6.07%)                   0.014112 (±4.31%)                       0.017888 (±3.40%)      0.019584 (±4.25%)
                     (4, 48, 256, 256, 64)         0.038368 (±6.01%)                   0.023648 (±2.98%)                       0.031232 (±2.97%)      0.024128 (±3.98%)
                     (4, 48, 512, 512, 64)         0.065472 (±3.13%)                   0.048160 (±1.59%)                       0.062976 (±1.32%)      0.042592 (±2.10%)
                   (4, 48, 1024, 1024, 64)         0.166912 (±1.38%)                   0.146848 (±0.63%)                       0.170944 (±0.66%)      0.125120 (±0.72%)
                   (4, 48, 2048, 2048, 64)         0.564256 (±0.59%)                   0.530784 (±0.40%)                       0.564352 (±1.28%)      0.440384 (±0.97%)
                   (4, 48, 4096, 4096, 64)         2.111392 (±0.81%)                   2.040704 (±0.98%)                       2.135584 (±2.41%)      1.676480 (±0.64%)
                   (4, 48, 8192, 8192, 64)         8.297536 (±0.14%)                   7.925888 (±2.41%)                       8.023680 (±0.31%)      6.624832 (±1.55%)
                                   average
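
The same arithmetic on the H100 numbers lands inside the roofline: triton_tutorial_flash_v2 at 8192 (about 7.93 ms) implies roughly 416 TFLOPS, comfortably below the H100 SXM dense BF16 peak of roughly 990 TFLOPS. The H100 baselines therefore look plausible, and the discrepancy appears specific to the aiter measurement on MI300.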

@xuzhao9 xuzhao9 temporarily deployed to docker-s3-upload November 1, 2025 04:20 — with GitHub Actions Inactive
@xuzhao9 xuzhao9 requested review from robieta and removed request for robieta November 1, 2025 04:24
@xuzhao9 xuzhao9 changed the title from "[aiter][flash_attention] update aiter and add attn" to "[WIP][aiter][flash_attention] update aiter and add attn" Nov 1, 2025
@xuzhao9 xuzhao9 requested a review from robieta November 1, 2025 04:31