
Conversation

@yiliu30
Owner

@yiliu30 yiliu30 commented Sep 10, 2025

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Signed-off-by: yiliu30 <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for QDQ (Quantize-Dequantize) for MoE (Mixture of Experts) layers, enabling new quantization capabilities. My review has identified several critical issues that need to be addressed. These include leftover debugging code like breakpoint() and pdb calls, a hardcoded mechanism to override model configuration via an environment variable, and a duplicated function definition. Additionally, there are some high-severity code quality issues, such as unused functions and misplaced imports. Please address these findings to ensure the code is production-ready.

Comment on lines +45 to +80
def fp8_qdq__(x, scale, eps=torch.finfo(torch.bfloat16).eps):
    # return x
    # scale = scale.to(x.dtype)
    # eps = torch.finfo(scale.dtype).eps
    eps = torch.finfo(torch.bfloat16).eps
    safe_scale = torch.maximum(
        scale.data, torch.tensor(eps, device=scale.device, dtype=scale.dtype)
    )
    # scale.copy_(new_val)

    # safe_scale = scale.to(x.dtype)

    # Compute quantized values
    qx = x / safe_scale

    # Clamp values to FP8 E4M3 representable range (approximately ±448)
    # max_fp8_value = 448.0
    # qx = torch.clamp(qx, -max_fp8_value, max_fp8_value)

    # Convert to FP8
    qx_fp8 = qx.to(torch.float8_e4m3fn)

    # Convert back and rescale
    dq_x = qx_fp8.to(x.dtype) * safe_scale

    return dq_x


def fp8_qdq_(x, scale):
    # return x
    scale = scale.to(x.dtype)
    qx = x / scale
    qx_fp8 = qx.to(torch.float8_e4m3fn)
    dq_x = qx_fp8.to(x.dtype) * scale
    return dq_x.to(x.dtype)


high

The functions fp8_qdq__ and fp8_qdq_ appear to be unused development artifacts. They should be removed to keep the codebase clean.
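
If a standalone per-tensor QDQ helper is still needed elsewhere, a single cleaned-up version along the lines below could replace both. This is only a sketch based on the quoted code; the name fp8_qdq and the bfloat16-eps guard on the scale are assumptions, not something this PR defines.

import torch


def fp8_qdq(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Guard against zero or denormal scales before dividing.
    eps = torch.finfo(torch.bfloat16).eps
    safe_scale = torch.clamp(scale.to(x.dtype), min=eps)
    # Quantize to FP8 E4M3, then immediately dequantize back to x's dtype.
    qx_fp8 = (x / safe_scale).to(torch.float8_e4m3fn)
    return qx_fp8.to(x.dtype) * safe_scale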

Signed-off-by: yiliu30 <[email protected]>
@yiliu30
Owner Author

yiliu30 commented Sep 10, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for Quantize-Dequantize (QDQ) operations in Mixture-of-Experts (MoE) layers, specifically for W8A8 quantization. The changes include new environment variables to control this functionality, utility functions for QDQ, and modifications to weight loading and forward pass logic.

My review has identified several critical issues that need to be addressed. There is significant code duplication, particularly of helper and debugging functions across multiple files. The pull request also contains leftover debugging artifacts, such as Pdb calls and commented-out code, which should be removed. Most critically, there is a testing hack that modifies the model architecture via an environment variable, and an incomplete implementation in the MoE forward pass marked with a FIXME. These issues must be resolved to ensure code quality, maintainability, and correctness.

Signed-off-by: yiliu30 <[email protected]>
@yiliu30
Owner Author

yiliu30 commented Sep 10, 2025

/gemini review

1 similar comment
@yiliu30
Owner Author

yiliu30 commented Sep 10, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Quantize-Dequantize (QDQ) operations in Mixture of Experts (MoE) layers, likely to enable FP8 quantization on hardware that lacks native support. The changes include new environment variables to control this functionality and the corresponding implementation logic in quantization and model execution layers. My review has identified several critical and high-severity issues, primarily related to leftover debugging code, code duplication, and style violations that should be addressed to ensure code quality and maintainability.

Comment on lines +738 to +753
import sys
import pdb

class ForkedPdb(pdb.Pdb):
    """A Pdb subclass that may be used
    from a forked multiprocessing child
    """
    def interaction(self, *args, **kwargs):
        _stdin = sys.stdin
        try:
            sys.stdin = open('/dev/stdin')
            pdb.Pdb.interaction(self, *args, **kwargs)
        finally:
            sys.stdin = _stdin


critical

The ForkedPdb class and its related imports appear to be debugging tools. Including pdb or any debugging-related code in production is highly discouraged as it can introduce security vulnerabilities and unexpected behavior. This should be removed before merging. I've also noticed other commented-out debugging calls like breakpoint() and ForkedPdb().set_trace() in this file and others, which should also be removed.

Comment on lines +25 to +30
def update_tensor_shape(dst_tensor, src_tensor):
    # if the number of elements is the same, we can reshape
    if dst_tensor.numel() == src_tensor.numel() and dst_tensor.shape != src_tensor.shape:
        logger.warning_once(f"Reshaping tensor from {src_tensor.shape} to {dst_tensor.shape}")
        return src_tensor.reshape(dst_tensor.shape)
    return src_tensor


high

The function update_tensor_shape is also defined in vllm/model_executor/layers/fused_moe/layer.py. To avoid code duplication and improve maintainability, this function should be defined in a single, shared utility file and imported where needed.
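
For example, the helper could live in one shared module and be imported from both call sites. A minimal sketch, assuming a hypothetical location such as vllm/model_executor/utils.py (any existing shared utility module would do):

# vllm/model_executor/utils.py -- hypothetical home for the shared helper
import torch

from vllm.logger import init_logger

logger = init_logger(__name__)


def update_tensor_shape(dst_tensor: torch.Tensor,
                        src_tensor: torch.Tensor) -> torch.Tensor:
    # Reshape only when the element counts match but the shapes differ.
    if (dst_tensor.numel() == src_tensor.numel()
            and dst_tensor.shape != src_tensor.shape):
        logger.warning_once(
            f"Reshaping tensor from {src_tensor.shape} to {dst_tensor.shape}")
        return src_tensor.reshape(dst_tensor.shape)
    return src_tensor

Both vllm/model_executor/layers/fused_moe/layer.py and this file would then import it with "from vllm.model_executor.utils import update_tensor_shape".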

        intermediate_size_per_partition_x2 // 2
    )
    # FIXME: Handle mask
    import torch.nn.functional as F


high

Imports such as torch.nn.functional (here) and from vllm.distributed (line 881) should be placed at the top of the file, not inside a function. This follows standard Python style guides (PEP 8) and avoids repeated import overhead and potential circular import issues.
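
Concretely, the two function-local imports from this hunk could simply move next to the module's existing top-level imports (illustrative only):

# At the top of the file, with the other imports:
import torch.nn.functional as F

from vllm.distributed import get_ep_group

(get_tensor_model_parallel_rank is also imported locally in the hunk but appears unused there, so it could be dropped rather than moved.)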

Comment on lines +404 to +405
import os
config.num_hidden_layers = int(os.getenv("VLLM_NUM_LAYERS", config.num_hidden_layers))


high

Using an environment variable VLLM_NUM_LAYERS to override the number of hidden layers seems like a debugging or testing feature. Such features can lead to unexpected behavior in production if the environment variable is set unintentionally. It's better to control this via explicit configuration or command-line arguments if needed for testing, and avoid leaving it in the main codebase. Also, the import os should be at the top of the file.
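
If a truncated model is only needed for quick local testing, the same effect can usually be achieved without patching vLLM, e.g. through the hf_overrides hook. A sketch, assuming a vLLM version where LLM(..., hf_overrides=...) is available; the model path is a placeholder:

from vllm import LLM

# Run only 2 decoder layers for a smoke test instead of setting
# VLLM_NUM_LAYERS and mutating the config inside vLLM.
llm = LLM(
    model="path/to/w8a8-moe-checkpoint",  # placeholder
    hf_overrides={"num_hidden_layers": 2},
)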

Comment on lines +45 to +79
def fp8_qdq__(x, scale, eps=torch.finfo(torch.bfloat16).eps):
    # return x
    # scale = scale.to(x.dtype)
    # eps = torch.finfo(scale.dtype).eps
    eps = torch.finfo(torch.bfloat16).eps
    safe_scale = torch.maximum(
        scale.data, torch.tensor(eps, device=scale.device, dtype=scale.dtype)
    )
    # scale.copy_(new_val)

    # safe_scale = scale.to(x.dtype)

    # Compute quantized values
    qx = x / safe_scale

    # Clamp values to FP8 E4M3 representable range (approximately ±448)
    # max_fp8_value = 448.0
    # qx = torch.clamp(qx, -max_fp8_value, max_fp8_value)

    # Convert to FP8
    qx_fp8 = qx.to(torch.float8_e4m3fn)

    # Convert back and rescale
    dq_x = qx_fp8.to(x.dtype) * safe_scale

    return dq_x


def fp8_qdq_(x, scale):
    # return x
    scale = scale.to(x.dtype)
    qx = x / scale
    qx_fp8 = qx.to(torch.float8_e4m3fn)
    dq_x = qx_fp8.to(x.dtype) * scale
    return dq_x.to(x.dtype)


high

The functions fp8_qdq__ and fp8_qdq_ appear to be unused debugging or alternative implementations, and contain commented-out code. They should be removed to keep the codebase clean and maintainable.

Comment on lines +517 to +518
# if name not in params_dict:
# continue


high

This commented-out line appears multiple times in the load_weights method (lines 540, 575). Dead or commented-out code should be removed to improve readability and maintainability.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a Quantize-Dequantize (QDQ) path for W8A8 Mixture of Experts (MoE) models, likely to enable FP8 emulation on older hardware. While the changes add valuable functionality, the current implementation includes several issues that need to be addressed before merging. There is a significant amount of debugging code, including pdb traces, commented-out checks, and experimental functions. A critical issue is the runtime modification of the model's layer count via an environment variable, which is unsafe for production. Additionally, there is code duplication and local imports that violate best practices. I've left specific comments on these points. Refactoring to remove debugging artifacts, centralize duplicated code, and clean up the experimental MoE path will greatly improve the quality and maintainability of this contribution.

Comment on lines +738 to +753
import sys
import pdb

class ForkedPdb(pdb.Pdb):
    """A Pdb subclass that may be used
    from a forked multiprocessing child
    """
    def interaction(self, *args, **kwargs):
        _stdin = sys.stdin
        try:
            sys.stdin = open('/dev/stdin')
            pdb.Pdb.interaction(self, *args, **kwargs)
        finally:
            sys.stdin = _stdin


critical

This PR introduces debugging utilities like ForkedPdb and local imports (sys, pdb). This kind of code is not suitable for production and should be removed. Please move the imports to the top of the file if they are necessary for non-debugging purposes, and remove the ForkedPdb class entirely.

Comment on lines +1089 to +1090
# if loaded_weight.shape != expert_data.shape:
# ForkedPdb().set_trace()


critical

Commented-out debugging code should be removed before merging.

Comment on lines +1364 to +1366
# breakpoint()
# if expert_data.numel() != loaded_weight.numel():
# ForkedPdb().set_trace()


critical

Commented-out debugging code, including breakpoint() and ForkedPdb, should be removed before merging.

Comment on lines 851 to 951
if envs.VLLM_W8A8_STATIC_MOE:

    num_experts, intermediate_size_per_partition_x2, _ = (
        layer.w13_weight.shape
    )
    intermediate_size_per_partition = (
        intermediate_size_per_partition_x2 // 2
    )
    # FIXME: Handle mask
    import torch.nn.functional as F
    act_fn = F.silu
    num_all_tokens, hidden_dim = x.shape
    num_experts = layer.local_num_experts
    total_num_experts = router_logits.size(-1)
    experts_mask = torch.zeros(
        (x.size(0), total_num_experts), dtype=x.dtype, device=x.device
    )
    topk_ids = topk_ids.to(torch.int64)
    topk_weights = topk_weights.to(x.dtype)
    experts_mask.scatter_(-1, topk_ids, topk_weights)
    experts_mask = experts_mask.transpose(0, 1)

    mask_weights = torch.zeros(
        (num_all_tokens, total_num_experts),
        dtype=x.dtype,
        device=x.device,
    )
    mask_weights.scatter_(-1, topk_ids, 1)
    mask_weights = mask_weights.transpose(0, 1)
    # Note: ep_size equal tp_size
    from vllm.distributed import get_tensor_model_parallel_rank, get_ep_group
    if expert_map is None:
        ep_rank = 0
    else:
        ep_group = get_ep_group()
        ep_rank = ep_group.rank
    ep_shift = ep_rank * num_experts
    for expert_index in range(num_experts):
        mask_weight = mask_weights[expert_index + ep_shift].unsqueeze(1)
        current_state_static = x * mask_weight

        local_w13 = layer.w13_weight[expert_index]
        local_w13_scale = layer.w13_weight_scale[expert_index]
        # local_w13_global_scale = layer.w13_weight_global_scale[expert_index]
        local_w13_input_scale = layer.w13_input_scale[expert_index]
        local_w2 = layer.w2_weight[expert_index]
        local_w2_scale = layer.w2_weight_scale[expert_index]
        # local_w2_global_scale = layer.w2_weight_global_scale[expert_index]
        local_w2_input_scale = layer.w2_input_scale[expert_index]

        local_w1 = local_w13[:intermediate_size_per_partition, ...]
        local_w1_scale = local_w13_scale[
            :intermediate_size_per_partition, ...
        ]
        # local_w1_global_scale = local_w13_scale[0]
        local_w1_input_scale = local_w13_input_scale

        local_w3 = local_w13[intermediate_size_per_partition:, ...]
        local_w3_scale = local_w13_scale[
            intermediate_size_per_partition:, ...
        ]
        # local_w3_global_scale = local_w13_global_scale[1]
        local_w3_input_scale = local_w13_input_scale

        # from vllm.model_executor.layers.quantization.utils.nvfp4_emulation_utils import (
        #     run_nvfp4_emulations_no_swizzle,
        # )

        # local_w13_input_global_scale_max = local_w13_input_global_scale.max()

        local_w1_out = qdq_fp8_gemm(
            x=current_state_static,
            x_scale=local_w1_input_scale,
            qweight=local_w1,
            w_scale=local_w1_scale,
        )
        local_w3_out = qdq_fp8_gemm(
            x=current_state_static,
            x_scale=local_w3_input_scale,
            qweight=local_w3,
            w_scale=local_w3_scale,
        )
        w13_out = act_fn(local_w1_out) * local_w3_out

        local_w2_out = qdq_fp8_gemm(
            x=w13_out,
            x_scale=local_w2_input_scale,
            qweight=local_w2,
            w_scale=local_w2_scale,
        )

        padded_weight = experts_mask[expert_index + ep_shift].unsqueeze(
            1
        )
        local_w2_out = local_w2_out * padded_weight
        if expert_index == 0:
            final_hidden_states = local_w2_out
        else:
            final_hidden_states += local_w2_out
    return final_hidden_states


critical

This new forward path for static MoE is very large and complex. It contains a FIXME comment, local imports, and commented-out debugging code. This indicates the code is experimental and not ready for production. Please consider the following refactoring:

  1. Move the entire logic into a separate, well-documented function or method (a sketch of this shape follows the list below).
  2. Move all imports (torch.nn.functional, vllm.distributed) to the top of the file.
  3. Remove all commented-out code and debugging artifacts.
  4. Address the FIXME regarding mask handling.
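
A possible shape for point 1 is sketched below. This is a self-contained numerics illustration only, not the PR's qdq_fp8_gemm path: weights stay in the activation dtype, scales are treated as per-tensor for brevity, expert parallelism is ignored, and every name here is a placeholder.

import torch
import torch.nn.functional as F


def _fp8_fake_quant(t: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Quantize-dequantize through FP8 E4M3, mirroring fp8_qdq_ above.
    scale = scale.to(t.dtype)
    return (t / scale).to(torch.float8_e4m3fn).to(t.dtype) * scale


def qdq_static_moe_forward(
    x: torch.Tensor,                # [num_tokens, hidden]
    w13: torch.Tensor,              # [num_experts, 2 * inter, hidden]
    w13_scale: torch.Tensor,        # [num_experts] per-tensor weight scales
    w13_input_scale: torch.Tensor,  # [num_experts]
    w2: torch.Tensor,               # [num_experts, hidden, inter]
    w2_scale: torch.Tensor,
    w2_input_scale: torch.Tensor,
    topk_weights: torch.Tensor,     # [num_tokens, top_k]
    topk_ids: torch.Tensor,         # [num_tokens, top_k]
) -> torch.Tensor:
    num_experts, inter_x2, _ = w13.shape
    inter = inter_x2 // 2

    # Dense routing matrix: router weight per (expert, token), 0 if unrouted.
    routing = torch.zeros((x.size(0), num_experts),
                          dtype=x.dtype, device=x.device)
    routing.scatter_(-1, topk_ids.to(torch.int64), topk_weights.to(x.dtype))
    routing = routing.transpose(0, 1)

    out = torch.zeros_like(x)
    for e in range(num_experts):
        w1, w3 = w13[e, :inter], w13[e, inter:]
        xq = _fp8_fake_quant(x, w13_input_scale[e])
        h = F.silu(xq @ _fp8_fake_quant(w1, w13_scale[e]).t()) * (
            xq @ _fp8_fake_quant(w3, w13_scale[e]).t())
        hq = _fp8_fake_quant(h, w2_input_scale[e])
        expert_out = hq @ _fp8_fake_quant(w2[e], w2_scale[e]).t()
        # Accumulate, scaled by the router weight (0 for unrouted tokens).
        out += expert_out * routing[e].unsqueeze(1)
    return out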

Comment on lines +404 to +405
import os
config.num_hidden_layers = int(os.getenv("VLLM_NUM_LAYERS", config.num_hidden_layers))


critical

Overriding config.num_hidden_layers from an environment variable (VLLM_NUM_LAYERS) is a dangerous pattern for production code. It makes model behavior dependent on environment settings, which can be difficult to track and debug. This appears to be a debugging or testing utility and should be removed. If this functionality is required, it should be exposed through a more explicit configuration option.

Comment on lines +45 to +80
def fp8_qdq__(x, scale, eps=torch.finfo(torch.bfloat16).eps):
    # return x
    # scale = scale.to(x.dtype)
    # eps = torch.finfo(scale.dtype).eps
    eps = torch.finfo(torch.bfloat16).eps
    safe_scale = torch.maximum(
        scale.data, torch.tensor(eps, device=scale.device, dtype=scale.dtype)
    )
    # scale.copy_(new_val)

    # safe_scale = scale.to(x.dtype)

    # Compute quantized values
    qx = x / safe_scale

    # Clamp values to FP8 E4M3 representable range (approximately ±448)
    # max_fp8_value = 448.0
    # qx = torch.clamp(qx, -max_fp8_value, max_fp8_value)

    # Convert to FP8
    qx_fp8 = qx.to(torch.float8_e4m3fn)

    # Convert back and rescale
    dq_x = qx_fp8.to(x.dtype) * safe_scale

    return dq_x


def fp8_qdq_(x, scale):
    # return x
    scale = scale.to(x.dtype)
    qx = x / scale
    qx_fp8 = qx.to(torch.float8_e4m3fn)
    dq_x = qx_fp8.to(x.dtype) * scale
    return dq_x.to(x.dtype)


high

The functions fp8_qdq__ and fp8_qdq_ appear to be unused or are alternative implementations left over from development. They should be removed to keep the codebase clean.

Comment on lines +517 to +518
# if name not in params_dict:
# continue


high

This commented-out code appears to be a debugging artifact and should be removed.

Comment on lines +540 to +541
# if name not in params_dict:
# continue


high

This commented-out code appears to be a debugging artifact and should be removed.

Comment on lines +575 to +576
# if name not in params_dict:
# continue


high

This commented-out code appears to be a debugging artifact and should be removed.

Comment on lines +25 to +30
def update_tensor_shape(dst_tensor, src_tensor):
    # if the number of elements is the same, we can reshape
    if dst_tensor.numel() == src_tensor.numel() and dst_tensor.shape != src_tensor.shape:
        logger.warning_once(f"Reshaping tensor from {src_tensor.shape} to {dst_tensor.shape}")
        return src_tensor.reshape(dst_tensor.shape)
    return src_tensor


high

The update_tensor_shape function is also defined in vllm/model_executor/layers/fused_moe/layer.py. To avoid code duplication and improve maintainability, this function should be defined in a single, shared utility file and imported where needed.

Signed-off-by: yiliu30 <[email protected]>
@yiliu30 yiliu30 changed the title from "add qdq moe" to "[CUDA] W8A8-STATIC GLM" on Sep 16, 2025