[CUDA] W8A8-STATIC GLM #70
base: official-v0.10.2rc1
Conversation
Signed-off-by: yiliu30 <[email protected]>
Code Review
This pull request introduces support for QDQ (Quantize-Dequantize) for MoE (Mixture of Experts) layers, enabling new quantization capabilities. My review has identified several critical issues that need to be addressed. These include leftover debugging code like breakpoint() and pdb calls, a hardcoded mechanism to override model configuration via an environment variable, and a duplicated function definition. Additionally, there are some high-severity code quality issues, such as unused functions and misplaced imports. Please address these findings to ensure the code is production-ready.
(Two review comments on vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py were marked as outdated/resolved.)
```python
def fp8_qdq__(x, scale, eps=torch.finfo(torch.bfloat16).eps):
    # return x
    # scale = scale.to(x.dtype)
    # eps = torch.finfo(scale.dtype).eps
    eps = torch.finfo(torch.bfloat16).eps
    safe_scale = torch.maximum(
        scale.data, torch.tensor(eps, device=scale.device, dtype=scale.dtype)
    )
    # scale.copy_(new_val)

    # safe_scale = scale.to(x.dtype)

    # Compute quantized values
    qx = x / safe_scale

    # Clamp values to FP8 E4M3 representable range (approximately ±448)
    # max_fp8_value = 448.0
    # qx = torch.clamp(qx, -max_fp8_value, max_fp8_value)

    # Convert to FP8
    qx_fp8 = qx.to(torch.float8_e4m3fn)

    # Convert back and rescale
    dq_x = qx_fp8.to(x.dtype) * safe_scale

    return dq_x


def fp8_qdq_(x, scale):
    # return x
    scale = scale.to(x.dtype)
    qx = x / scale
    qx_fp8 = qx.to(torch.float8_e4m3fn)
    dq_x = qx_fp8.to(x.dtype) * scale
    return dq_x.to(x.dtype)
```
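For reviewers less familiar with the pattern: these helpers implement fake quantization (quantize to FP8 E4M3, then immediately dequantize), so W8A8 behavior can be emulated in bf16 on hardware without native FP8 GEMMs. A minimal usage sketch, assuming the `fp8_qdq_` helper above is importable; the scale derivation below is illustrative and not taken from this PR:

```python
import torch

x = torch.randn(4, 64, dtype=torch.bfloat16)      # activation
w = torch.randn(128, 64, dtype=torch.bfloat16)    # weight
# Per-tensor static scales (illustrative): map the max magnitude to the FP8 max.
x_scale = x.abs().amax() / torch.finfo(torch.float8_e4m3fn).max
w_scale = w.abs().amax() / torch.finfo(torch.float8_e4m3fn).max

dq_x = fp8_qdq_(x, x_scale)   # quantize-dequantize the activation
dq_w = fp8_qdq_(w, w_scale)   # quantize-dequantize the weight
out = dq_x @ dq_w.t()         # emulated W8A8 GEMM, computed in bf16
```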
Signed-off-by: yiliu30 <[email protected]>
/gemini review
Code Review
This pull request introduces support for Quantize-Dequantize (QDQ) operations in Mixture-of-Experts (MoE) layers, specifically for W8A8 quantization. The changes include new environment variables to control this functionality, utility functions for QDQ, and modifications to weight loading and forward pass logic.
My review has identified several critical issues that need to be addressed. There is significant code duplication, particularly of helper and debugging functions across multiple files. The pull request also contains leftover debugging artifacts, such as Pdb calls and commented-out code, which should be removed. Most critically, there is a testing hack that modifies the model architecture via an environment variable, and an incomplete implementation in the MoE forward pass marked with a FIXME. These issues must be resolved to ensure code quality, maintainability, and correctness.
(Two more review comments on vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py were marked as resolved.)
Signed-off-by: yiliu30 <[email protected]>
/gemini review
Code Review
This pull request adds support for Quantize-Dequantize (QDQ) operations in Mixture of Experts (MoE) layers, likely to enable FP8 quantization on hardware that lacks native support. The changes include new environment variables to control this functionality and the corresponding implementation logic in quantization and model execution layers. My review has identified several critical and high-severity issues, primarily related to leftover debugging code, code duplication, and style violations that should be addressed to ensure code quality and maintainability.
```python
import sys
import pdb


class ForkedPdb(pdb.Pdb):
    """A Pdb subclass that may be used
    from a forked multiprocessing child
    """
    def interaction(self, *args, **kwargs):
        _stdin = sys.stdin
        try:
            sys.stdin = open('/dev/stdin')
            pdb.Pdb.interaction(self, *args, **kwargs)
        finally:
            sys.stdin = _stdin
```
The ForkedPdb class and its related imports appear to be debugging tools. Including pdb or any debugging-related code in production is highly discouraged as it can introduce security vulnerabilities and unexpected behavior. This should be removed before merging. I've also noticed other commented-out debugging calls like breakpoint() and ForkedPdb().set_trace() in this file and others, which should also be removed.
```python
def update_tensor_shape(dst_tensor, src_tensor):
    # if the number of elements is the same, we can reshape
    if dst_tensor.numel() == src_tensor.numel() and dst_tensor.shape != src_tensor.shape:
        logger.warning_once(f"Reshaping tensor from {src_tensor.shape} to {dst_tensor.shape}")
        return src_tensor.reshape(dst_tensor.shape)
    return src_tensor
```
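As a quick illustration of the behavior under review (assuming the function above and its module-level `logger` are importable), the reshape only triggers when element counts match but shapes differ:

```python
import torch

dst = torch.empty(4, 6)
src = torch.arange(24.0).reshape(2, 12)
out = update_tensor_shape(dst, src)             # warns once, reshapes to (4, 6)
assert out.shape == dst.shape

same = torch.zeros(4, 6)
assert update_tensor_shape(dst, same) is same   # shapes already match: returned as-is
```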
```python
    intermediate_size_per_partition_x2 // 2
)
# FIXME: Handle mask
import torch.nn.functional as F
```
```python
import os
config.num_hidden_layers = int(os.getenv("VLLM_NUM_LAYERS", config.num_hidden_layers))
```
Using an environment variable VLLM_NUM_LAYERS to override the number of hidden layers seems like a debugging or testing feature. Such features can lead to unexpected behavior in production if the environment variable is set unintentionally. It is better to control this via explicit configuration or command-line arguments if needed for testing, and to avoid leaving it in the main codebase. Also, the `import os` statement should be at the top of the file.
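If truncating the model is only needed for local smoke tests, a sketch of an explicit alternative (assuming a recent vLLM where the `hf_overrides` engine argument is available; the model id below is just an example):

```python
from vllm import LLM

# Explicit, per-run override of the HF config instead of a hidden
# environment variable; two layers is enough for a quick smoke test.
llm = LLM(
    model="zai-org/GLM-4.5-Air",              # example model id
    hf_overrides={"num_hidden_layers": 2},
)
```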
```python
# if name not in params_dict:
#     continue
```
Code Review
This pull request introduces a Quantize-Dequantize (QDQ) path for W8A8 Mixture of Experts (MoE) models, likely to enable FP8 emulation on older hardware. While the changes add valuable functionality, the current implementation includes several issues that need to be addressed before merging. There is a significant amount of debugging code, including pdb traces, commented-out checks, and experimental functions. A critical issue is the runtime modification of the model's layer count via an environment variable, which is unsafe for production. Additionally, there is code duplication and local imports that violate best practices. I've left specific comments on these points. Refactoring to remove debugging artifacts, centralize duplicated code, and clean up the experimental MoE path will greatly improve the quality and maintainability of this contribution.
```python
# if loaded_weight.shape != expert_data.shape:
#     ForkedPdb().set_trace()
```
```python
# breakpoint()
# if expert_data.numel() != loaded_weight.numel():
#     ForkedPdb().set_trace()
```
```python
if envs.VLLM_W8A8_STATIC_MOE:

    num_experts, intermediate_size_per_partition_x2, _ = (
        layer.w13_weight.shape
    )
    intermediate_size_per_partition = (
        intermediate_size_per_partition_x2 // 2
    )
    # FIXME: Handle mask
    import torch.nn.functional as F
    act_fn = F.silu
    num_all_tokens, hidden_dim = x.shape
    num_experts = layer.local_num_experts
    total_num_experts = router_logits.size(-1)
    experts_mask = torch.zeros(
        (x.size(0), total_num_experts), dtype=x.dtype, device=x.device
    )
    topk_ids = topk_ids.to(torch.int64)
    topk_weights = topk_weights.to(x.dtype)
    experts_mask.scatter_(-1, topk_ids, topk_weights)
    experts_mask = experts_mask.transpose(0, 1)

    mask_weights = torch.zeros(
        (num_all_tokens, total_num_experts),
        dtype=x.dtype,
        device=x.device,
    )
    mask_weights.scatter_(-1, topk_ids, 1)
    mask_weights = mask_weights.transpose(0, 1)
    # Note: ep_size equal tp_size
    from vllm.distributed import get_tensor_model_parallel_rank, get_ep_group
    if expert_map is None:
        ep_rank = 0
    else:
        ep_group = get_ep_group()
        ep_rank = ep_group.rank
    ep_shift = ep_rank * num_experts
    for expert_index in range(num_experts):
        mask_weight = mask_weights[expert_index + ep_shift].unsqueeze(1)
        current_state_static = x * mask_weight

        local_w13 = layer.w13_weight[expert_index]
        local_w13_scale = layer.w13_weight_scale[expert_index]
        # local_w13_global_scale = layer.w13_weight_global_scale[expert_index]
        local_w13_input_scale = layer.w13_input_scale[expert_index]
        local_w2 = layer.w2_weight[expert_index]
        local_w2_scale = layer.w2_weight_scale[expert_index]
        # local_w2_global_scale = layer.w2_weight_global_scale[expert_index]
        local_w2_input_scale = layer.w2_input_scale[expert_index]

        local_w1 = local_w13[:intermediate_size_per_partition, ...]
        local_w1_scale = local_w13_scale[
            :intermediate_size_per_partition, ...
        ]
        # local_w1_global_scale = local_w13_scale[0]
        local_w1_input_scale = local_w13_input_scale

        local_w3 = local_w13[intermediate_size_per_partition:, ...]
        local_w3_scale = local_w13_scale[
            intermediate_size_per_partition:, ...
        ]
        # local_w3_global_scale = local_w13_global_scale[1]
        local_w3_input_scale = local_w13_input_scale

        # from vllm.model_executor.layers.quantization.utils.nvfp4_emulation_utils import (
        #     run_nvfp4_emulations_no_swizzle,
        # )

        # local_w13_input_global_scale_max = local_w13_input_global_scale.max()

        local_w1_out = qdq_fp8_gemm(
            x=current_state_static,
            x_scale=local_w1_input_scale,
            qweight=local_w1,
            w_scale=local_w1_scale,
        )
        local_w3_out = qdq_fp8_gemm(
            x=current_state_static,
            x_scale=local_w3_input_scale,
            qweight=local_w3,
            w_scale=local_w3_scale,
        )
        w13_out = act_fn(local_w1_out) * local_w3_out

        local_w2_out = qdq_fp8_gemm(
            x=w13_out,
            x_scale=local_w2_input_scale,
            qweight=local_w2,
            w_scale=local_w2_scale,
        )

        padded_weight = experts_mask[expert_index + ep_shift].unsqueeze(1)
        local_w2_out = local_w2_out * padded_weight
        if expert_index == 0:
            final_hidden_states = local_w2_out
        else:
            final_hidden_states += local_w2_out
    return final_hidden_states
```
This new forward path for static MoE is very large and complex. It contains a FIXME comment, local imports, and commented-out debugging code. This indicates the code is experimental and not ready for production. Please consider the following refactoring (see the sketch after this list):
- Move the entire logic into a separate, well-documented function or method.
- Move all imports (`torch.nn.functional`, `vllm.distributed`) to the top of the file.
- Remove all commented-out code and debugging artifacts.
- Address the `FIXME` regarding mask handling.
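One possible shape for that extraction, as a sketch only: `qdq_fp8_gemm` and the `layer.*` attributes come from this PR, while the helper name and the masking details here are hypothetical and would still need to cover the expert-parallel rank shift.

```python
import torch
import torch.nn.functional as F


def _w8a8_static_moe_forward(layer, x, topk_ids, topk_weights, intermediate_size):
    """Hypothetical helper: dense per-expert loop with QDQ GEMMs."""
    num_experts = layer.w13_weight.shape[0]
    out = torch.zeros_like(x)
    for e in range(num_experts):
        routed = topk_ids == e                                 # [tokens, top_k]
        token_mask = routed.any(-1, keepdim=True).to(x.dtype)  # 0/1 per token
        gate = torch.where(routed, topk_weights, torch.zeros_like(topk_weights))
        gate = gate.sum(-1, keepdim=True).to(x.dtype)          # routing weight
        xe = x * token_mask                                    # zero out unrouted tokens
        w1_out = qdq_fp8_gemm(x=xe, x_scale=layer.w13_input_scale[e],
                              qweight=layer.w13_weight[e][:intermediate_size],
                              w_scale=layer.w13_weight_scale[e][:intermediate_size])
        w3_out = qdq_fp8_gemm(x=xe, x_scale=layer.w13_input_scale[e],
                              qweight=layer.w13_weight[e][intermediate_size:],
                              w_scale=layer.w13_weight_scale[e][intermediate_size:])
        h = F.silu(w1_out) * w3_out
        w2_out = qdq_fp8_gemm(x=h, x_scale=layer.w2_input_scale[e],
                              qweight=layer.w2_weight[e],
                              w_scale=layer.w2_weight_scale[e])
        out += gate * w2_out                                   # apply routing weight
    return out
```

With a helper of this shape, the top-level imports and the removal of the debug comments fall out naturally, and the FIXME around mask handling has a single obvious home.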
```python
import os
config.num_hidden_layers = int(os.getenv("VLLM_NUM_LAYERS", config.num_hidden_layers))
```
Overriding config.num_hidden_layers from an environment variable (VLLM_NUM_LAYERS) is a dangerous pattern for production code. It makes model behavior dependent on environment settings, which can be difficult to track and debug. This appears to be a debugging or testing utility and should be removed. If this functionality is required, it should be exposed through a more explicit configuration option.
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)