
Conversation


@yiliu30 yiliu30 commented Jul 3, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison or e2e results.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

dsikka and others added 20 commits June 25, 2025 07:18
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: yiliu30 <[email protected]>

Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands vLLM's quantization capabilities by introducing support for MXFP8 and NVFP4, with a particular focus on Mixture-of-Experts (MoE) models and optimizations for Habana Processing Units (HPUs). The changes involve adding new quantization schemes, MoE methods, HPU-specific configurations, and an updated dependency, alongside a new example script to facilitate usage.

Highlights

  • MXFP8 Quantization for MoE: I've introduced a new Microscaling FP8 (MXFP8) quantization scheme specifically for Mixture-of-Experts (MoE) layers. This includes new dedicated MoE methods and comprehensive weight handling for MXFP8, enabling more efficient inference for MoE models.
  • NVFP4 Quantization Support: I've added support for NVIDIA FP4 (NVFP4) quantization. This involves a new quantization scheme and a corresponding MoE method that use CUTLASS kernels directly on compatible CUDA devices and fall back to software emulation for broader platform support.
  • HPU-Specific Optimizations: I've incorporated several optimizations tailored for Habana Processing Units (HPUs). This includes new environment variables such as VLLM_USE_STATIC_MOE_HPU and VLLM_HPU_FORCE_CHANNEL_FP8, designed to improve performance and control behavior on HPU hardware.
  • Dependency Update & Example: I've updated the compressed-tensors library to version 0.10.2 to support the new quantization features. Additionally, I've added a new example script (basic_hpu.py) that demonstrates basic offline inference on HPU, showcasing the setup with the relevant environment variables and model paths; a hedged sketch of such a setup follows below.
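
A minimal, hypothetical sketch of such an HPU offline-inference setup (this is not the actual basic_hpu.py; the model path, sampling settings, and the precise effect of the two environment variables are assumptions based on the description above):

import os

# Assumed HPU toggles introduced by this PR; their exact semantics are
# defined by the PR itself, not by upstream vLLM.
os.environ["VLLM_USE_STATIC_MOE_HPU"] = "1"
os.environ["VLLM_HPU_FORCE_CHANNEL_FP8"] = "1"

from vllm import LLM, SamplingParams

# Placeholder path; any MXFP8/NVFP4-quantized MoE checkpoint would go here.
llm = LLM(model="/path/to/quantized-moe-model")
sampling = SamplingParams(temperature=0.0, max_tokens=32)

for output in llm.generate(["Hello, my name is"], sampling):
    print(output.outputs[0].text)
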
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please report any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for MXFP8 and NVFP4 quantization schemes, particularly for Mixture of Experts (MoE) models on Habana Processing Units (HPU). The changes are extensive, adding new quantization schemes, emulation utilities, and HPU-specific logic.

I've found a few critical issues, including a syntax error and a debug script that should be removed. There are also several instances of leftover debugging code, FIXME comments, and local imports that need to be addressed. Please review the comments for details.

self, weight_quant: BaseModel,
input_quant: BaseModel) -> "CompressedTensorsScheme":

# breakpoint()


high

A breakpoint() call is present in the code. This is a debugging tool and must be removed before merging.

Suggested change:
- # breakpoint()
+ # Detect If Mixed Precision

use_tqdm_on_load: bool,
) -> Generator[Tuple[str, torch.Tensor], None, None]:
"""Iterate over the weights in the model safetensor files."""
# hf_weights_files = hf_weights_files[:5]


high

This commented-out line appears to be for debugging. Please remove it before merging.

intermediate_size_per_partition = (
intermediate_size_per_partition_2x // 2
)
# FIXME: Handle mask


medium

This FIXME comment indicates that mask handling is incomplete. This should be addressed before merging.



def dequant_mx_fp8(weight_fp8, scale_e8m0, block_size):
# FIXME: (Yi) add support for scale_e8m0 in uint8


medium

This FIXME comment indicates that support for uint8 scales is missing. This should be addressed or clarified.
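
For reference, a minimal sketch of what block-wise MX FP8 dequantization could look like, assuming the e8m0 scale is stored as a uint8 biased exponent (scale = 2**(e - 127)) with one scale per block_size elements along the last dimension; this is an illustration of the technique, not the PR's implementation:

import torch

def dequant_mx_fp8_sketch(weight_fp8: torch.Tensor,
                          scale_e8m0: torch.Tensor,
                          block_size: int = 32) -> torch.Tensor:
    # Decode the biased e8m0 exponent (stored as uint8) into a float scale.
    scale = torch.exp2(scale_e8m0.to(torch.float32) - 127)
    # Upcast the fp8 payload and group the last dimension into blocks.
    w = weight_fp8.to(torch.float32)
    *lead, last = w.shape
    w = w.reshape(*lead, last // block_size, block_size)
    # Apply one scale per block, then restore the original shape.
    w = w * scale.reshape(*lead, last // block_size, 1)
    return w.reshape(*lead, last)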

Comment on lines +196 to +201
return self.fp8_linear.apply(input=x,
weight=layer.weight,
weight_scale=layer.weight_scale,
out_dtype=self.out_dtype,
input_scale=layer.input_scale,
bias=bias)


medium

This block of code is commented out and appears to be dead code. It should be removed to improve code clarity.

Comment on lines +813 to +814
if name not in params_dict:
continue


medium

This check if name not in params_dict: continue is repeated multiple times in load_weights. While it prevents crashes, it might hide issues where expected weights are not found in params_dict. Could you add a comment explaining why this is necessary, for example if some weights from the checkpoint are intentionally not used? If this is a temporary workaround, it would be good to note that as well.
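
For illustration only, the kind of documented guard the reviewer is asking for might look like this inside a load_weights() loop (the names, the weight_loader attribute, and the reason given in the comment are placeholders, not the PR's actual code):

for name, loaded_weight in weights:
    # Some checkpoint tensors (e.g. auxiliary quantization metadata that this
    # scheme does not consume) have no matching model parameter; skip them
    # explicitly instead of raising a KeyError.
    if name not in params_dict:
        continue
    param = params_dict[name]
    param.weight_loader(param, loaded_weight)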

yiliu30 and others added 8 commits July 3, 2025 11:15
…emes/__init__.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Add Support for MXFP4 Linear and MOE
---------

Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Ziyue-Intel <[email protected]>
Co-authored-by: He, Xin3 <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
* add nvfp4 support

Signed-off-by: Yi Liu <[email protected]>

* update example for nvfp4

Signed-off-by: Yi Liu <[email protected]>

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
yiliu30 and others added 10 commits August 1, 2025 15:40
* fix global scale

Signed-off-by: Yi Liu <[email protected]>

* reduce memory usage in moe (#59)

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* update transformers version

Signed-off-by: Yi Liu <[email protected]>

* update

Signed-off-by: Yi Liu <[email protected]>

* remove two per_tensor_amax_to_scale

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
* fix global scale

Signed-off-by: Yi Liu <[email protected]>

* reduce memory usage in moe (#59)

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* update transformers version

Signed-off-by: Yi Liu <[email protected]>

* update

Signed-off-by: Yi Liu <[email protected]>

* enable next token task

Signed-off-by: Yi Liu <[email protected]>

* add eval code back

Signed-off-by: Yi Liu <[email protected]>

* fix

Signed-off-by: Yi Liu <[email protected]>

* clean

Signed-off-by: Yi Liu <[email protected]>

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Support static global scale

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
* fix global scale

Signed-off-by: Yi Liu <[email protected]>

* reduce memory usage in moe (#59)

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* update transformers version

Signed-off-by: Yi Liu <[email protected]>

* update

Signed-off-by: Yi Liu <[email protected]>

* enable next token task

Signed-off-by: Yi Liu <[email protected]>

* add eval code back

Signed-off-by: Yi Liu <[email protected]>

* fix

Signed-off-by: Yi Liu <[email protected]>

* clean

Signed-off-by: Yi Liu <[email protected]>

* use gs

Signed-off-by: Yi Liu <[email protected]>

* Update nvfp4_qdq.py

* Apply suggestion from @Copilot

Co-authored-by: Copilot <[email protected]>

* Update nvfp4_qdq.py

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: Copilot <[email protected]>
* start vllm cmd

Signed-off-by: Yi Liu <[email protected]>

* use exisitng torch

Signed-off-by: Yi Liu <[email protected]>

* use datasets 3.6

Signed-off-by: Yi Liu <[email protected]>

* add even rounding for mxfp4

Signed-off-by: Yi Liu <[email protected]>

* Update vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* add ao back

Signed-off-by: Yi Liu <[email protected]>

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* remove torch from requirements

Signed-off-by: Yi Liu <[email protected]>

* update bench code

Signed-off-by: Yi Liu <[email protected]>

* update

Signed-off-by: Yi Liu <[email protected]>

* update model path

Signed-off-by: Yi Liu <[email protected]>

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
* fix qwen

Signed-off-by: Yi Liu <[email protected]>

* fix topk

Signed-off-by: Yi Liu <[email protected]>

---------

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>