[CUDA] W8A8-STATIC GLM #70
base: official-v0.10.2rc1
Conversation
Signed-off-by: yiliu30 <[email protected]>
Code Review
This pull request introduces support for QDQ (Quantize-Dequantize) for MoE (Mixture of Experts) layers, enabling new quantization capabilities. My review has identified several critical issues that need to be addressed. These include leftover debugging code like breakpoint() and pdb calls, a hardcoded mechanism to override model configuration via an environment variable, and a duplicated function definition. Additionally, there are some high-severity code quality issues, such as unused functions and misplaced imports. Please address these findings to ensure the code is production-ready.
(Two review comments on vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py were marked as outdated/resolved.)
```python
def fp8_qdq__(x, scale, eps=torch.finfo(torch.bfloat16).eps):
    # return x
    # scale = scale.to(x.dtype)
    # eps = torch.finfo(scale.dtype).eps
    eps = torch.finfo(torch.bfloat16).eps
    safe_scale = torch.maximum(
        scale.data, torch.tensor(eps, device=scale.device, dtype=scale.dtype)
    )
    # scale.copy_(new_val)

    # safe_scale = scale.to(x.dtype)

    # Compute quantized values
    qx = x / safe_scale

    # Clamp values to FP8 E4M3 representable range (approximately ±448)
    # max_fp8_value = 448.0
    # qx = torch.clamp(qx, -max_fp8_value, max_fp8_value)

    # Convert to FP8
    qx_fp8 = qx.to(torch.float8_e4m3fn)

    # Convert back and rescale
    dq_x = qx_fp8.to(x.dtype) * safe_scale

    return dq_x


def fp8_qdq_(x, scale):
    # return x
    scale = scale.to(x.dtype)
    qx = x / scale
    qx_fp8 = qx.to(torch.float8_e4m3fn)
    dq_x = qx_fp8.to(x.dtype) * scale
    return dq_x.to(x.dtype)
```
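For reviewers less familiar with the pattern: these helpers implement fake quantization (quantize to FP8 E4M3, then immediately dequantize), so W8A8 behavior can be emulated in bf16 on hardware without native FP8 GEMMs. A minimal usage sketch, assuming the `fp8_qdq_` helper above is importable; the scale derivation below is illustrative and not taken from this PR:

```python
import torch

x = torch.randn(4, 64, dtype=torch.bfloat16)      # activation
w = torch.randn(128, 64, dtype=torch.bfloat16)    # weight
# Per-tensor static scales (illustrative): map the max magnitude to the FP8 max.
x_scale = x.abs().amax() / torch.finfo(torch.float8_e4m3fn).max
w_scale = w.abs().amax() / torch.finfo(torch.float8_e4m3fn).max

dq_x = fp8_qdq_(x, x_scale)   # quantize-dequantize the activation
dq_w = fp8_qdq_(w, w_scale)   # quantize-dequantize the weight
out = dq_x @ dq_w.t()         # emulated W8A8 GEMM, computed in bf16
```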
Signed-off-by: yiliu30 <[email protected]>
/gemini review
Code Review
This pull request introduces support for Quantize-Dequantize (QDQ) operations in Mixture-of-Experts (MoE) layers, specifically for W8A8 quantization. The changes include new environment variables to control this functionality, utility functions for QDQ, and modifications to weight loading and forward pass logic.
My review has identified several critical issues that need to be addressed. There is significant code duplication, particularly of helper and debugging functions across multiple files. The pull request also contains leftover debugging artifacts, such as Pdb calls and commented-out code, which should be removed. Most critically, there is a testing hack that modifies the model architecture via an environment variable, and an incomplete implementation in the MoE forward pass marked with a FIXME. These issues must be resolved to ensure code quality, maintainability, and correctness.
(Two more review comments on vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py were marked as resolved.)
Signed-off-by: yiliu30 <[email protected]>
/gemini review
Code Review
This pull request adds support for Quantize-Dequantize (QDQ) operations in Mixture of Experts (MoE) layers, likely to enable FP8 quantization on hardware that lacks native support. The changes include new environment variables to control this functionality and the corresponding implementation logic in quantization and model execution layers. My review has identified several critical and high-severity issues, primarily related to leftover debugging code, code duplication, and style violations that should be addressed to ensure code quality and maintainability.
```python
import sys
import pdb


class ForkedPdb(pdb.Pdb):
    """A Pdb subclass that may be used
    from a forked multiprocessing child
    """
    def interaction(self, *args, **kwargs):
        _stdin = sys.stdin
        try:
            sys.stdin = open('/dev/stdin')
            pdb.Pdb.interaction(self, *args, **kwargs)
        finally:
            sys.stdin = _stdin
```
The ForkedPdb class and its related imports appear to be debugging tools. Including pdb or any debugging-related code in production is highly discouraged as it can introduce security vulnerabilities and unexpected behavior. This should be removed before merging. I've also noticed other commented-out debugging calls like breakpoint() and ForkedPdb().set_trace() in this file and others, which should also be removed.
```python
def update_tensor_shape(dst_tensor, src_tensor):
    # if the number of elements is the same, we can reshape
    if dst_tensor.numel() == src_tensor.numel() and dst_tensor.shape != src_tensor.shape:
        logger.warning_once(f"Reshaping tensor from {src_tensor.shape} to {dst_tensor.shape}")
        return src_tensor.reshape(dst_tensor.shape)
    return src_tensor
```
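As a quick illustration of the behavior under review (assuming the function above and its module-level `logger` are importable), the reshape only triggers when element counts match but shapes differ:

```python
import torch

dst = torch.empty(4, 6)
src = torch.arange(24.0).reshape(2, 12)
out = update_tensor_shape(dst, src)             # warns once, reshapes to (4, 6)
assert out.shape == dst.shape

same = torch.zeros(4, 6)
assert update_tensor_shape(dst, same) is same   # shapes already match: returned as-is
```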
```python
    intermediate_size_per_partition_x2 // 2
)
# FIXME: Handle mask
import torch.nn.functional as F
```
```python
import os
config.num_hidden_layers = int(os.getenv("VLLM_NUM_LAYERS", config.num_hidden_layers))
```
Using an environment variable VLLM_NUM_LAYERS to override the number of hidden layers seems like a debugging or testing feature. Such features can lead to unexpected behavior in production if the environment variable is set unintentionally. It is better to control this via explicit configuration or command-line arguments if needed for testing, and to avoid leaving it in the main codebase. Also, the `import os` statement should be at the top of the file.
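If truncating the model is only needed for local smoke tests, a sketch of an explicit alternative (assuming a recent vLLM where the `hf_overrides` engine argument is available; the model id below is just an example):

```python
from vllm import LLM

# Explicit, per-run override of the HF config instead of a hidden
# environment variable; two layers is enough for a quick smoke test.
llm = LLM(
    model="zai-org/GLM-4.5-Air",              # example model id
    hf_overrides={"num_hidden_layers": 2},
)
```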
```python
# if name not in params_dict:
#     continue
```
Code Review
This pull request introduces a Quantize-Dequantize (QDQ) path for W8A8 Mixture of Experts (MoE) models, likely to enable FP8 emulation on older hardware. While the changes add valuable functionality, the current implementation includes several issues that need to be addressed before merging. There is a significant amount of debugging code, including pdb traces, commented-out checks, and experimental functions. A critical issue is the runtime modification of the model's layer count via an environment variable, which is unsafe for production. Additionally, there is code duplication and local imports that violate best practices. I've left specific comments on these points. Refactoring to remove debugging artifacts, centralize duplicated code, and clean up the experimental MoE path will greatly improve the quality and maintainability of this contribution.
```python
# if loaded_weight.shape != expert_data.shape:
#     ForkedPdb().set_trace()
```
```python
# breakpoint()
# if expert_data.numel() != loaded_weight.numel():
#     ForkedPdb().set_trace()
```
```python
if envs.VLLM_W8A8_STATIC_MOE:

    num_experts, intermediate_size_per_partition_x2, _ = (
        layer.w13_weight.shape
    )
    intermediate_size_per_partition = (
        intermediate_size_per_partition_x2 // 2
    )
    # FIXME: Handle mask
    import torch.nn.functional as F
    act_fn = F.silu
    num_all_tokens, hidden_dim = x.shape
    num_experts = layer.local_num_experts
    total_num_experts = router_logits.size(-1)
    experts_mask = torch.zeros(
        (x.size(0), total_num_experts), dtype=x.dtype, device=x.device
    )
    topk_ids = topk_ids.to(torch.int64)
    topk_weights = topk_weights.to(x.dtype)
    experts_mask.scatter_(-1, topk_ids, topk_weights)
    experts_mask = experts_mask.transpose(0, 1)

    mask_weights = torch.zeros(
        (num_all_tokens, total_num_experts),
        dtype=x.dtype,
        device=x.device,
    )
    mask_weights.scatter_(-1, topk_ids, 1)
    mask_weights = mask_weights.transpose(0, 1)
    # Note: ep_size equal tp_size
    from vllm.distributed import get_tensor_model_parallel_rank, get_ep_group
    if expert_map is None:
        ep_rank = 0
    else:
        ep_group = get_ep_group()
        ep_rank = ep_group.rank
    ep_shift = ep_rank * num_experts
    for expert_index in range(num_experts):
        mask_weight = mask_weights[expert_index + ep_shift].unsqueeze(1)
        current_state_static = x * mask_weight

        local_w13 = layer.w13_weight[expert_index]
        local_w13_scale = layer.w13_weight_scale[expert_index]
        # local_w13_global_scale = layer.w13_weight_global_scale[expert_index]
        local_w13_input_scale = layer.w13_input_scale[expert_index]
        local_w2 = layer.w2_weight[expert_index]
        local_w2_scale = layer.w2_weight_scale[expert_index]
        # local_w2_global_scale = layer.w2_weight_global_scale[expert_index]
        local_w2_input_scale = layer.w2_input_scale[expert_index]

        local_w1 = local_w13[:intermediate_size_per_partition, ...]
        local_w1_scale = local_w13_scale[
            :intermediate_size_per_partition, ...
        ]
        # local_w1_global_scale = local_w13_scale[0]
        local_w1_input_scale = local_w13_input_scale

        local_w3 = local_w13[intermediate_size_per_partition:, ...]
        local_w3_scale = local_w13_scale[
            intermediate_size_per_partition:, ...
        ]
        # local_w3_global_scale = local_w13_global_scale[1]
        local_w3_input_scale = local_w13_input_scale

        # from vllm.model_executor.layers.quantization.utils.nvfp4_emulation_utils import (
        #     run_nvfp4_emulations_no_swizzle,
        # )

        # local_w13_input_global_scale_max = local_w13_input_global_scale.max()

        local_w1_out = qdq_fp8_gemm(
            x=current_state_static,
            x_scale=local_w1_input_scale,
            qweight=local_w1,
            w_scale=local_w1_scale,
        )
        local_w3_out = qdq_fp8_gemm(
            x=current_state_static,
            x_scale=local_w3_input_scale,
            qweight=local_w3,
            w_scale=local_w3_scale,
        )
        w13_out = act_fn(local_w1_out) * local_w3_out

        local_w2_out = qdq_fp8_gemm(
            x=w13_out,
            x_scale=local_w2_input_scale,
            qweight=local_w2,
            w_scale=local_w2_scale,
        )

        padded_weight = experts_mask[expert_index + ep_shift].unsqueeze(1)
        local_w2_out = local_w2_out * padded_weight
        if expert_index == 0:
            final_hidden_states = local_w2_out
        else:
            final_hidden_states += local_w2_out
    return final_hidden_states
```
This new forward path for static MoE is very large and complex. It contains a FIXME comment, local imports, and commented-out debugging code. This indicates the code is experimental and not ready for production. Please consider the following refactoring (see the sketch after this list):
- Move the entire logic into a separate, well-documented function or method.
- Move all imports (`torch.nn.functional`, `vllm.distributed`) to the top of the file.
- Remove all commented-out code and debugging artifacts.
- Address the `FIXME` regarding mask handling.
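One possible shape for that extraction, as a sketch only: `qdq_fp8_gemm` and the `layer.*` attributes come from this PR, while the helper name and the masking details here are hypothetical and would still need to cover the expert-parallel rank shift.

```python
import torch
import torch.nn.functional as F


def _w8a8_static_moe_forward(layer, x, topk_ids, topk_weights, intermediate_size):
    """Hypothetical helper: dense per-expert loop with QDQ GEMMs."""
    num_experts = layer.w13_weight.shape[0]
    out = torch.zeros_like(x)
    for e in range(num_experts):
        routed = topk_ids == e                                 # [tokens, top_k]
        token_mask = routed.any(-1, keepdim=True).to(x.dtype)  # 0/1 per token
        gate = torch.where(routed, topk_weights, torch.zeros_like(topk_weights))
        gate = gate.sum(-1, keepdim=True).to(x.dtype)          # routing weight
        xe = x * token_mask                                    # zero out unrouted tokens
        w1_out = qdq_fp8_gemm(x=xe, x_scale=layer.w13_input_scale[e],
                              qweight=layer.w13_weight[e][:intermediate_size],
                              w_scale=layer.w13_weight_scale[e][:intermediate_size])
        w3_out = qdq_fp8_gemm(x=xe, x_scale=layer.w13_input_scale[e],
                              qweight=layer.w13_weight[e][intermediate_size:],
                              w_scale=layer.w13_weight_scale[e][intermediate_size:])
        h = F.silu(w1_out) * w3_out
        w2_out = qdq_fp8_gemm(x=h, x_scale=layer.w2_input_scale[e],
                              qweight=layer.w2_weight[e],
                              w_scale=layer.w2_weight_scale[e])
        out += gate * w2_out                                   # apply routing weight
    return out
```

With a helper of this shape, the top-level imports and the removal of the debug comments fall out naturally, and the FIXME around mask handling has a single obvious home.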
```python
import os
config.num_hidden_layers = int(os.getenv("VLLM_NUM_LAYERS", config.num_hidden_layers))
```
Overriding config.num_hidden_layers from an environment variable (VLLM_NUM_LAYERS) is a dangerous pattern for production code. It makes model behavior dependent on environment settings, which can be difficult to track and debug. This appears to be a debugging or testing utility and should be removed. If this functionality is required, it should be exposed through a more explicit configuration option.
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)