Conversation

@QiZhangNV (Contributor) commented Nov 7, 2025

Description

Introduces a CUTLASS-based grouped GEMM implementation that reads m_splits directly on the device.

This optimization removes the need for device-to-host data transfers and synchronization in MCore, while allowing the number of quantization kernels to be reduced to one.

The kernel is fully compatible with CUDA Graphs.

Key points:
• Does not break the existing API: the operator now accepts m_splits as either a torch.Tensor (on CPU or GPU) or a Python list (see the usage sketch after this list).
• Reduces CPU overhead, especially for large expert counts, by using a single quantization kernel instead of one per GEMM.
• Currently supports only MXFP8.
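
A minimal usage sketch of the two call styles (hedged: the constructor arguments, the MXFP8BlockScaling recipe name, and the m_splits dtype below are assumptions based on current Transformer Engine releases, not something this PR defines):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Sketch only: assumes te.GroupedLinear(num_gemms, in_features, out_features, ...)
# and an MXFP8 recipe; the device-initiated path additionally needs Blackwell
# hardware per the description above.
layer = te.GroupedLinear(4, 1024, 4096, params_dtype=torch.bfloat16)
inp = torch.randn(4 * 256, 1024, device="cuda", dtype=torch.bfloat16)
m_splits = [256, 256, 256, 256]

with te.fp8_autocast(enabled=True, fp8_recipe=recipe.MXFP8BlockScaling()):
    out = layer(inp, m_splits)                                    # legacy List[int] path
    out = layer(inp, torch.tensor(m_splits, dtype=torch.int32,    # device-tensor path,
                                  device="cuda"))                 # no D2H transfer/sync
```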

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change m_splits from List[int] to torch.Tensor; List[int] is still accepted and is converted to a tensor internally
  • Add te_general_device_initiated_grouped_gemm

Unit Test

pytest -v -s tests/pytorch/test_numerics.py::test_grouped_linear_accuracy_cutlass_device

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Code clean & Add Check & Fix Arch

Change cutlass submodule to https

Support BF16

Fix assertion

Fix when all local experts have no tokens

Fix

Support save_original_input for cutlass backend

Fix remove cudaMallocAsync & modify CUTLASS config

Pass nullptr if C is not needed

Tune kernel Performance

Add dtype check for m_split

Optimize setGroupedGemmWgradArguments when fuse_wgrad_accumulation=false

Support partial wgrad accumulate when using cutlass backend

use torch.empty() instead of torch.zeros for wgrad_list

Fix IMA when CUDA graph is enabled

Use arg wgrad_accumulation_mask to handle partial wgrad accumulate

Use bitmap for partial wgrad accumulate to avoid cudaMemcpyAsync

Allow m_splits to be List, convert to torch tensor

Use pinned memory instead of pageable memory

Refactor and add dispatcher
@greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR introduces device-initiated grouped GEMM support that eliminates CPU-GPU synchronization overhead for MoE (Mixture of Experts) workloads by reading m_splits directly on the device.

Key Changes:

  • Modified m_splits parameter from List[int] to torch.Tensor, maintaining backward compatibility by auto-converting lists
  • Added new CUTLASS-based kernel path (nvte_device_cutlass_grouped_gemm and nvte_device_cutlass_grouped_gemm_wgrad) for Blackwell GPUs (SM 10.0)
  • Implemented device-side argument preparation that reads m_splits tensor on GPU, avoiding D2H transfer
  • Added support for partial weight gradient accumulation via wgrad_accumulation_mask parameter
  • Uses pinned host memory buffer for CUDA Graph compatibility when transferring weight/scale factor addresses

Critical Issues Found:

  1. Race condition in global buffer index (gemm.cpp:563): The pinned_host_buffer_index global variable lacks thread safety and never resets, causing buffer overflow after multiple calls
  2. Multiple typos: Variable name m_splits_on_devie should be m_splits_on_device in several locations
  3. Complex lambda expression (grouped_linear.py:265-267): Nested immediately-invoked lambda makes code maintenance difficult

Limitations:

  • Device-initiated path only supports MXFP8 format on Blackwell GPUs
  • No bias support when m_splits is on device
  • Requires m dimension alignment to 128 for MXFP8 (increased from 32)

Confidence Score: 2/5

  • This PR has critical concurrency bugs that will cause failures in production
  • The global pinned_host_buffer_index variable creates a race condition and memory corruption risk. Without proper synchronization or reset mechanism, the buffer index grows unbounded and will overflow workspace memory. Additionally, multiple typos in variable names indicate insufficient review/testing
  • transformer_engine/pytorch/csrc/extensions/gemm.cpp requires immediate attention to fix the global buffer index race condition before merging

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| transformer_engine/pytorch/module/grouped_linear.py | 3/5 | Changed m_splits from List[int] to torch.Tensor, added device-initiated path with m_splits_on_device flag, complex lambda expression for conditional wgrad accumulation |
| transformer_engine/pytorch/csrc/extensions/gemm.cpp | 2/5 | Implements te_general_device_initiated_grouped_gemm with global pinned_host_buffer_index for CUDA Graph support, manages H2D copies for weight/SF addresses |
| transformer_engine/common/gemm/cutlass_device_grouped_gemm.cu | 4/5 | New CUTLASS kernel implementation for device-initiated grouped GEMM, supports fprop/dgrad/wgrad with MXFP8, includes WgradAccumulatePolicy for partial accumulation |

Sequence Diagram

sequenceDiagram
    participant User
    participant GroupedLinear
    participant _GroupedLinear
    participant gemm.py
    participant gemm.cpp
    participant CUTLASS_CUDA
    
    User->>GroupedLinear: forward(inp, m_splits)
    Note over GroupedLinear: Convert m_splits to tensor if list
    GroupedLinear->>_GroupedLinear: forward()
    
    alt m_splits on device
        Note over _GroupedLinear: Single quantize (no split)
        _GroupedLinear->>_GroupedLinear: tex.split_quantize(inp, [total_size], quantizers[:1])
    else m_splits on CPU
        Note over _GroupedLinear: Split quantize per expert
        _GroupedLinear->>_GroupedLinear: tex.split_quantize(inp, m_splits.tolist(), quantizers)
    end
    
    _GroupedLinear->>gemm.py: general_grouped_gemm(A, B, out, m_splits, m_splits_on_device)
    
    alt m_splits_on_device
        gemm.py->>gemm.cpp: te_general_device_initiated_grouped_gemm()
        gemm.cpp->>gemm.cpp: Prepare B/SF address arrays in pinned memory
        gemm.cpp->>gemm.cpp: H2D copy (async, non-blocking)
        gemm.cpp->>CUTLASS_CUDA: nvte_device_cutlass_grouped_gemm()
        CUTLASS_CUDA->>CUTLASS_CUDA: setGroupedGemmArguments kernel (reads m_splits on device)
        CUTLASS_CUDA->>CUTLASS_CUDA: CUTLASS grouped GEMM kernel
    else m_splits on CPU
        gemm.py->>gemm.cpp: te_general_grouped_gemm()
        gemm.cpp->>gemm.cpp: cuBLAS multi-stream grouped GEMM
    end
    
    gemm.cpp-->>gemm.py: result
    gemm.py-->>_GroupedLinear: out
    _GroupedLinear-->>GroupedLinear: out
    GroupedLinear-->>User: result

11 files reviewed, 8 comments


// reuse and overwrite the same buffer.
// A global variable is used because the function doesn't know how many instances there are and which
// instance is calling.
int pinned_host_buffer_index = 0;
Contributor (greptile-apps bot) commented:

logic: global variable without synchronization creates race condition

The global pinned_host_buffer_index is not thread-safe. If multiple operators call this function concurrently (even from different streams), they will increment the same global variable, leading to:

  1. Buffer corruption as operators overwrite each other's data
  2. Index overflow beyond workspace bounds

Even with CUDA Graph capture being single-threaded per graph, eager mode or multiple graphs could cause issues.

Suggested change:
```cpp
// Use atomic operations or per-instance tracking
std::atomic<int> pinned_host_buffer_index = 0;
```

Consider using thread-local storage or passing buffer offset through the call chain instead.

Comment on lines 244 to +267
```diff
 ctx.main_grad_funcs = [
-    lambda j=i: weights[j].main_grad for i in range(num_gemms)
+    (lambda idx=i: (lambda: weights[idx].main_grad) if (wgrad_accumulation_mask is None or wgrad_accumulation_mask[idx]) else (lambda: None))()
+    for i in range(num_gemms)
```
Contributor (greptile-apps bot) commented:

style: overly complex lambda creates hard-to-debug closure

This nested lambda with conditional execution (lambda idx=i: (lambda: ...) if condition else (lambda: None))() is executed immediately and difficult to understand. The inner lambdas capture weights[idx] but the outer lambda is called immediately with ().

Suggested change:
```python
def make_grad_func(idx):
    if wgrad_accumulation_mask is None or wgrad_accumulation_mask[idx]:
        return lambda: weights[idx].main_grad
    else:
        return lambda: None

ctx.main_grad_funcs = [make_grad_func(i) for i in range(num_gemms)]
```

# Device buffer for cutlass arguments and kernel
torch.empty(get_cutlass_grouped_gemm_workspace_size_bytes(), dtype=torch.uint8, device="cuda"),
# TODO: Only allocate pinned buffer when cuda graph is enabled
# Host pinned buffer for the source of H2D copy of cutlass arguments
Contributor (greptile-apps bot) commented:

style: pinned memory buffer allocated unconditionally wastes memory

The comment says "TODO: Only allocate pinned buffer when cuda graph is enabled" but the buffer is always allocated. Pinned memory is a limited resource (typically capped at a fraction of system RAM).

Consider lazy allocation or checking torch.cuda.is_current_stream_capturing() state.

Contributor Author (@QiZhangNV) replied:

The device buffer will always be used. The pinned buffer cannot be lazily allocated because CUDA graph capture does not support .pin_memory().

Comment on lines 628 to 634
NVTE_CHECK(pinned_host_buffer_index + num_gemms * 2 <= workspace[1].size(0),
"Pinned host buffer out of bounds, please increase the capacity by setting "
"NVTE_CUTLASS_HOST_PINNED_U64_CAPACITY. "
"Current buffer size: ",
workspace[1].size(0));
inputB_and_SF_addrs = workspace[1].narrow(0, pinned_host_buffer_index, num_gemms * 2);
pinned_host_buffer_index += num_gemms * 2;
Contributor (greptile-apps bot) commented:

logic: buffer index never resets causing accumulation across calls

The pinned_host_buffer_index increments on every call but is never reset to 0. After multiple forward/backward passes, it will:

  1. Exceed workspace bounds
  2. Trigger the error check on line 628

For CUDA Graph, the index should reset at graph capture start or be managed per-graph instance.

Contributor Author (@QiZhangNV) replied:

CUDA Graph capture records only one iteration; after capture, the graph is replayed each step, so those lines (including the pinned_host_buffer_index increment) are not reached again outside capture.
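
For reference, a minimal PyTorch sketch of this capture-once/replay-many behavior (standard torch.cuda.CUDAGraph usage, not code from this PR): host-side Python inside the capture block runs only at capture time, while replay() re-launches just the recorded GPU work.

```python
import torch

host_counter = 0
static_x = torch.randn(64, device="cuda")
static_y = torch.empty_like(static_x)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    host_counter += 1                         # host-side code: runs once, during capture
    torch.add(static_x, 1.0, out=static_y)    # GPU work: recorded into the graph

for _ in range(10):
    g.replay()                                # replays the recorded GPU work only

print(host_counter)  # 1 -- the host-side increment is not reached again, as described above
```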

```diff
-if not self.fuse_wgrad_accumulation:
-    for i in range(self.num_gemms):
+for i in range(self.num_gemms):
+    if not self.fuse_wgrad_accumulation or (self.wgrad_accumulation_mask is not None and not self.wgrad_accumulation_mask[i]):
```
Collaborator commented:

what is wgrad_accumulation_mask?

Contributor Author (@QiZhangNV) replied:

It is for MoE Echo, which is designed to avoid dropping tokens when a fixed buffer size is used with CUDA graphs. When MoE Echo is enabled, each GroupedLinear layer contains both home experts and echo experts. Echo experts correspond to weights from other ranks, whose main_grad buffers are not available on the current rank, so their wgrad must be excluded from accumulation; this is what wgrad_accumulation_mask controls.
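
A small illustrative sketch of how such a mask gates per-expert accumulation (hypothetical standalone code mirroring the make_grad_func suggestion earlier in this thread, not the PR's actual implementation):

```python
# Two "home" experts whose main_grad lives on this rank, two "echo" experts
# whose gradients must not be accumulated here.
wgrad_accumulation_mask = [True, True, False, False]

def main_grad_or_none(weights, idx, mask):
    """Return the accumulation buffer for expert idx, or None for echo experts."""
    if mask is None or mask[idx]:
        return weights[idx].main_grad
    return None
```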


assert fp8 and FP8GlobalStateManager.get_fp8_recipe().mxfp8(), "Only MXFP8 is supported when m_splits is on devie"
# Cannot split because the m_splits is not available on host.
inputmats = tex.split_quantize(inp_view, [inp_view.size(0)], input_quantizers[:1])
Collaborator commented:

I think we can support this dispatch in C++ split quantize. I made the dispatching for NVFP4 here: #2351

Collaborator @timmoon10 commented Nov 10, 2025:

Can we just call input_quantizers[0](inp_view)? If we're not doing a split, then no need to deal with the extra overhead of split_quantize. In general, I don't see much value in forcing the device-initiated and host-initiated impls to use the same internal APIs.

Collaborator @zhongbozhu commented Nov 10, 2025:

Sounds good. We should either call input_quantizers[0](inp_view), or, if we can tweak the MXFP8 quantize kernel to respect the 128x4 scaling-factor padding (so that we don't need 128-aligned padding), keep using split_quantize.

Contributor Author @QiZhangNV commented Nov 11, 2025:

Thanks for the comment. I'll go with input_quantizers[0](inp_view) for now. I also noticed you looped in Oleg to discuss making the MXFP8 quantize kernel respect the 128×4 scaling-factor padding; we can wait for his suggestions.

Contributor Author @QiZhangNV commented Nov 13, 2025:

Hi @timmoon10, after discussing with zhongbozhu, we'd like to introduce a new API called batch_quantize_device. This API would support quantization not only for mxfp8 but also for other narrow-precision types such as nvfp4 in device-initiated mode. For mxfp8, its behavior would be similar to calling input_quantizers[0](inp_view). For nvfp4, in addition to the quantized outputs, it would also return the necessary information, such as amax tensors and the prefix sum of the m_splits tensor. What do you think?
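
A hypothetical sketch of what that API could look like (names, arguments, and return values are illustrative only; nothing here is final or part of the current PR):

```python
def batch_quantize_device(inp, m_splits, quantizers, recipe):
    """Quantize the whole activation buffer with one kernel when m_splits is on device.

    For MXFP8 this would behave like quantizers[0](inp). For NVFP4 it would also
    return per-group amax tensors and the prefix sum of the m_splits tensor, so
    downstream kernels can locate each group's data without a D2H copy.
    """
    raise NotImplementedError  # proposal only
```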

```diff
 split_size = 16
 if recipe.mxfp8() or recipe.nvfp4():
-    split_size = 32
+    split_size = 128
```
Collaborator commented:

cc @Oleg-Goncharov Can we modify the MXFP8 quantize kernel to respect the 128x4 padding of MXFP8 scaling factors? Or do you think it's better to do a larger padding here?

```diff
 ctx,
 inp: torch.Tensor,
-m_splits: List[int],
+m_splits: torch.Tensor,
```
Collaborator commented:

Do we have any PRs to MCore that will support this API change? It's also a bit of a headache that MCore might have to check TE versions and dispatch for compatibility.

Collaborator replied:

I agree backward compatibility is important. How about adding logic like:

if not isinstance(m_splits, torch.Tensor):
    m_splits = torch.tensor(m_splits, dtype=torch.int, device="cpu")

This could be refined further so we don't keep going back-and-forth between List[int] and torch.Tensor.

Contributor Author (@QiZhangNV) replied:

We already have this check in GroupedLinear forward(): https://github.com/QiZhangNV/TransformerEngine/blob/cutlass_device_grouped_gemm/transformer_engine/pytorch/module/grouped_linear.py#L832-L833
Is it still needed in _GroupedLinear forward()?

return 4_194_304

def get_cutlass_grouped_gemm_workspace_size_bytes() -> None:
return 16_777_216
Collaborator commented:

Can you explain this size requirement for cutlass grouped gemm?

Contributor Author (@QiZhangNV) replied:

We just allocate a sufficiently large workspace so the CUTLASS grouped GEMM kernel can initialize and run safely; the exact size depends on the kernel config.

Collaborator replied:

It's interesting that it only requires 16 MB, while we already use 32 MB for a single GEMM with cuBLAS.

Contributor Author @QiZhangNV commented Nov 13, 2025:

I think 32 MB is just a safe upper bound for the workspace allocation in cuBLAS (see https://docs.nvidia.com/cuda/cublas/index.html#cublassetworkspace); actual usage is typically much lower. For CUTLASS, 16 MB is sufficient in my experience, and I've added an assert to catch an insufficient workspace.
Since you're concerned about this, would it be better to allow setting this value via an environment variable?
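
A sketch of the environment-variable override floated here (the variable name NVTE_CUTLASS_GROUPED_GEMM_WORKSPACE_BYTES is hypothetical, chosen only for illustration):

```python
import os

def get_cutlass_grouped_gemm_workspace_size_bytes() -> int:
    # Default to 16 MiB; allow raising it via the env var if the workspace assert trips.
    return int(os.getenv("NVTE_CUTLASS_GROUPED_GEMM_WORKSPACE_BYTES", 16_777_216))
```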

@greptile-apps bot commented Nov 11, 2025

Greptile Overview

Greptile Summary

Adds device-initiated grouped GEMM that reads m_splits directly on the GPU, eliminating CPU-GPU synchronization overhead for MoE workloads. The CUTLASS implementation is solid and the feature provides measurable performance benefits by reducing quantization kernels from N to 1.

Key Changes:

  • Changes m_splits parameter from List[int] to torch.Tensor (maintains backward compatibility via .tolist())
  • Adds nvte_device_cutlass_grouped_gemm and nvte_device_cutlass_grouped_gemm_wgrad C++ APIs
  • Implements CUTLASS kernel setGroupedGemmArguments that processes m_splits on-device
  • Currently supports MXFP8 only (Blackwell architecture)

Critical Issues Found:

  • Global pinned_host_buffer_index in gemm.cpp:563 never resets, causing buffer overflow after multiple forward/backward passes
  • No thread-safety mechanisms despite comment claiming "strictly single-threaded" execution
  • Pinned memory always allocated even when CUDA graphs not enabled (TODO acknowledged at base.py:116)

Minor Issues:

  • Multiple typos in assertion messages: "devie" instead of "device" (grouped_linear.py:138, 303, 363)

Confidence Score: 2/5

  • Critical buffer management bug will cause crashes after multiple iterations
  • The pinned_host_buffer_index accumulates indefinitely without reset mechanism, guaranteeing buffer overflow. While feature implementation is sound, this memory management bug blocks production use
  • transformer_engine/pytorch/csrc/extensions/gemm.cpp requires immediate fix for buffer index reset logic

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| transformer_engine/pytorch/csrc/extensions/gemm.cpp | 2/5 | Adds device-initiated grouped GEMM with critical race condition in global pinned_host_buffer_index - never resets, causing buffer overflow after multiple calls |
| transformer_engine/common/gemm/cutlass_device_grouped_gemm.cu | 4/5 | Implements CUTLASS kernel that reads m_splits directly on device - reduces synchronization overhead, well-structured implementation |
| transformer_engine/pytorch/module/grouped_linear.py | 3/5 | Updates API to accept m_splits as tensor instead of list, maintains backward compatibility but has typos in assertion messages |
| transformer_engine/pytorch/module/base.py | 3/5 | Adds global pinned memory workspace - always allocated even when not needed (TODO comment acknowledges this waste) |

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant GL as GroupedLinear
    participant GGG as general_grouped_gemm
    participant CPP as gemm.cpp
    participant CUDA as CUTLASS Kernel
    participant Device as GPU Device

    User->>GL: forward(inp, m_splits=tensor)
    GL->>GL: Check m_splits.is_cuda
    alt m_splits on device
        GL->>GL: Single quantize (no split)
        GL->>GGG: Call with m_splits_on_device=True
        GGG->>CPP: te_general_device_initiated_grouped_gemm
        CPP->>CPP: Allocate pinned buffer (index++)
        CPP->>Device: H2D copy addresses
        CPP->>CUDA: nvte_device_cutlass_grouped_gemm
        CUDA->>CUDA: setGroupedGemmArguments kernel
        Note over CUDA: Read m_splits from device memory
        CUDA->>CUDA: Execute grouped GEMM
        CUDA-->>CPP: Return result
    else m_splits on host
        GL->>GL: Split & quantize per expert
        GL->>GGG: Call with m_splits_on_device=False
        GGG->>CPP: Multi-stream cuBLAS path
    end
    CPP-->>User: Output tensor

@greptile-apps bot left three more reviews, each noting "11 files reviewed, no comments".

assert TE_DType[out[0].dtype] == out_dtype, "Output dtype mismatch: out[0].dtype=" + \
str(out[0].dtype) + ", out_dtype=" + str(out_dtype)
bias = tex.te_general_device_initiated_grouped_gemm(
A,
Collaborator commented:

Is it true that this A (maybe B), despite its type being a list of inputs, is actually a list with a single element in it? Would it make more sense to still calculate the pointer offsets on device in the split_quantize API, so that the inputs to general_grouped_gemm are still two lists of the same length, already split? cc @timmoon10

Contributor Author (@QiZhangNV) replied:

Yes, the activations are a single-element list. It would be good if split_quantize could handle the 'split' logic. However, can it work well when m_splits is not available on the host? How would you do that?

@huanghua1994 self-requested a review November 12, 2025 21:27
Comment on lines +242 to +243
* \param[in] A The list of A matrices.
* \param[in] B_and_SF_addrs The list of B and SF matrices addresses.
Collaborator commented:

Suggested change:
```
 * \param[in] A_and_SF_addrs The list of A and SF matrices addresses.
 * \param[in] B The list of B matrices.
```
Match the API signature.

* \param[in] stream CUDA stream to wait on.
*/

void nvte_device_cutlass_grouped_gemm_wgrad(const NVTETensor* A, const NVTETensor* B, void** D,
Collaborator commented:

My understanding is that wgrad and dgrad have different transpose targets, so this API should be for wgrad only?
