[WIP][FA][Blackwell] Implementation with explicit data partitioning #384
base: main
Conversation
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
LGTM! If you could clarify my understanding of the kernel that would be great.
desc_q = TensorDescriptor(
    q,
    shape=[y_dim, HEAD_DIM_K],
    strides=[HEAD_DIM_K, 1],
This assumes that q, k, and v are all contiguous and not transposed. Can we add an assert to enforce this requirement?
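A minimal sketch of such a check, assuming it lives in the host-side wrapper right before the descriptors are built (the placement and message strings are illustrative, not this patch):

    # Hedged sketch: reject transposed / non-contiguous inputs before building TMA descriptors.
    for name, t in (("q", q), ("k", k), ("v", v)):
        assert t.is_contiguous(), f"{name} must be contiguous for the TMA descriptor path"
        assert t.stride(-1) == 1, f"{name} must have unit stride in the head dimension"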
    strides=[HEAD_DIM_K, 1],
    block_shape=dummy_block,
)
else:
Just error on the else case? The kernel requires TMA
This is to support on-device TMA. I will need to call _maybe_make_tensor_desc.
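For context, a hedged sketch of what the on-device path could look like; tl.make_tensor_descriptor is the device-side TMA descriptor API in recent Triton, and the helper's signature below follows the upstream FA tutorial rather than this patch:

    # Hedged sketch (signature assumed from the upstream tutorial, not this PR):
    @triton.jit
    def _maybe_make_tensor_desc(desc_or_ptr, shape, strides, block_shape):
        if isinstance(desc_or_ptr, tl.tensor_descriptor):
            return desc_or_ptr  # host already built a TensorDescriptor
        # otherwise build the descriptor on device (on-device TMA)
        return tl.make_tensor_descriptor(desc_or_ptr, shape, strides, block_shape)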
    OUTER_LOOP: tl.constexpr,
    dtype: tl.constexpr,
):
    n_tile_num = tl.cdiv(N_CTX, BLOCK_M)
Is this correct? It stands out to me as a possible typo, perhaps just in the variable name.
Yeah, we should rename n_tile_num to num_pid_m.
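For reference, a hedged sketch of the persistent scheduling this value feeds into, following the upstream FA tutorial pattern (Z, H, and the prologue names other than those visible in the diff are assumptions, not this patch):

    # Hedged sketch: num_pid_m is the per-(batch, head) count of M tiles,
    # which is why the rename from n_tile_num reads more clearly.
    num_pid_m = tl.cdiv(N_CTX, BLOCK_M)
    prog_id = tl.program_id(0)
    num_progs = tl.num_programs(0)
    total_tiles = num_pid_m * Z * H          # Z = batch size, H = number of heads
    tiles_per_sm = total_tiles // num_progs
    if prog_id < total_tiles % num_progs:
        tiles_per_sm += 1
    tile_idx = prog_id                       # each program then strides by num_progs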
    tile_idx = prog_id
    # inner loop warpspec vs. outer loop warpspec
    for _ in tl.range(0, tiles_per_sm, warp_specialize=warp_specialize and OUTER_LOOP):
For my understanding, why don't we need FLATTEN=True here? Is it not viable with FA because it's too complicated, and we actually need more complex AutoWS in the compiler?
FLATTEN only works for very simple cases such as GEMM. For FA, FLATTEN doesn't work and we will need to handle nested control flow. In OSS, NVIDIA is driving that work.
Thanks for clarifying/confirming my understanding.
    offs_m1 = start_m * BLOCK_M + tl.arange(BLOCK_M//2, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)

    m_i0 = tl.zeros([BLOCK_M//2], dtype=tl.float32) - float("inf")
To clarify, is this the explicit data partitioning that enables subtiling + ping-pong?
This is to set up the outputs for data partitioning; we will call _attn_fwd_inner_oss_dp twice, each call working on one half of the full block size.
We will have one data partition working on q0 and the other working on q1, where q0 + q1 is the original q.
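A hedged sketch of that setup, following the naming visible in the snippet above (the accumulator names, HEAD_DIM, and the exact call sites are assumptions):

    # Hedged sketch: split the M block into two halves so the two data partitions
    # (q0 and q1) can proceed independently and ping-pong.
    offs_m0 = start_m * BLOCK_M + tl.arange(0, BLOCK_M // 2)        # rows handled by partition 0 (q0)
    offs_m1 = start_m * BLOCK_M + tl.arange(BLOCK_M // 2, BLOCK_M)  # rows handled by partition 1 (q1)
    m_i0 = tl.zeros([BLOCK_M // 2], dtype=tl.float32) - float("inf")
    m_i1 = tl.zeros([BLOCK_M // 2], dtype=tl.float32) - float("inf")
    l_i0 = tl.zeros([BLOCK_M // 2], dtype=tl.float32) + 1.0
    l_i1 = tl.zeros([BLOCK_M // 2], dtype=tl.float32) + 1.0
    acc0 = tl.zeros([BLOCK_M // 2, HEAD_DIM], dtype=tl.float32)
    acc1 = tl.zeros([BLOCK_M // 2, HEAD_DIM], dtype=tl.float32)
    # _attn_fwd_inner_oss_dp is then invoked twice, once per (acc, l_i, m_i, offs_m) set.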
    BN: tl.constexpr = acc.shape[1]

    acc0, acc1 = acc.reshape([BM, 2, BN//2]).permute(0, 2, 1).split()
    acc0 = acc0 * alpha[:, None]
This is another form of partitioning/subtiling?
Yes, this is subtiling for correction.
FA tutorial has this too: https://github.com/triton-lang/triton/blob/main/python/tutorials/06-fused-attention.py#L88
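The tutorial pattern being referenced, as a hedged sketch (the rejoin at the end is how the tutorial stitches the halves back together; the exact surrounding context in this patch may differ):

    # Hedged sketch of subtiling the correction step: split acc along the head
    # dimension, rescale each half by alpha, then join the halves back.
    BM: tl.constexpr = acc.shape[0]
    BN: tl.constexpr = acc.shape[1]
    acc0, acc1 = acc.reshape([BM, 2, BN // 2]).permute(0, 2, 1).split()
    acc0 = acc0 * alpha[:, None]
    acc1 = acc1 * alpha[:, None]
    acc = tl.join(acc0, acc1).permute(0, 2, 1).reshape([BM, BN])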
python run.py --op blackwell_attentions --only triton_tutorial_flash_dp_blackwell, --seq-len 1024 --batch 1152 --n-heads 4 --d-head 128