Hpu mxfp8 moe #51
base: habana-mai-79ba403d5
Conversation
Summary of Changes
Hello @yiliu30, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly expands vLLM's quantization capabilities by introducing support for MXFP8 and NVFP4, with a particular focus on Mixture-of-Experts (MoE) models and optimizations for Habana Processing Units (HPUs). The changes involve adding new quantization schemes, MoE methods, HPU-specific configurations, and an updated dependency, alongside a new example script to facilitate usage.
Highlights
- MXFP8 Quantization for MoE: I've introduced a new Microscaling FP8 (MXFP8) quantization scheme specifically for Mixture-of-Experts (MoE) layers. This includes new dedicated MoE methods and comprehensive weight handling for MXFP8, enabling more efficient inference for MoE models.
- NVFP4 Quantization Support: I've added support for NVIDIA FP4 (NVFP4) quantization. This involves a new quantization scheme and a corresponding MoE method, with capabilities for both direct Cutlass kernel utilization on compatible CUDA devices and software emulation for broader platform support.
- HPU-Specific Optimizations: I've incorporated several optimizations tailored for Habana Processing Units (HPUs). This includes new environment variables like VLLM_USE_STATIC_MOE_HPU and VLLM_HPU_FORCE_CHANNEL_FP8, designed to improve performance and control behavior on HPU hardware.
- Dependency Update & Example: I've updated the compressed-tensors library to version 0.10.2 to support the new quantization features. Additionally, I've added a new example script (basic_hpu.py) to demonstrate basic offline inference on HPU, showcasing the setup with relevant environment variables and model paths (see the sketch after this list).
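The sketch below illustrates what a basic_hpu.py-style offline run with these settings might look like. It is only an orientation aid: the model path, prompt, sampling settings, and environment-variable values are placeholders, not the actual contents of the example script.

```python
# Hypothetical sketch of an offline HPU inference run using the new env vars.
# The env var names come from this PR; their values and the model path here
# are illustrative placeholders, not the script's real settings.
import os

os.environ.setdefault("VLLM_USE_STATIC_MOE_HPU", "1")      # assumed value
os.environ.setdefault("VLLM_HPU_FORCE_CHANNEL_FP8", "0")   # assumed value

from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/mxfp8-quantized-moe-model")      # placeholder path
sampling = SamplingParams(temperature=0.0, max_tokens=64)

for output in llm.generate(["What is MXFP8 quantization?"], sampling):
    print(output.outputs[0].text)
```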
Code Review
This pull request introduces support for MXFP8 and NVFP4 quantization schemes, particularly for Mixture of Experts (MoE) models on Habana Processing Units (HPU). The changes are extensive, adding new quantization schemes, emulation utilities, and HPU-specific logic.
I've found a few critical issues, including a syntax error and a debug script that should be removed. There are also several instances of leftover debugging code, FIXME comments, and local imports that need to be addressed. Please review the comments for details.
        
          
vllm/model_executor/layers/quantization/compressed_tensors/schemes/__init__.py (outdated, resolved)
```python
        self, weight_quant: BaseModel,
        input_quant: BaseModel) -> "CompressedTensorsScheme":

        # breakpoint()
```
        
          
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py (outdated, resolved)
```python
    use_tqdm_on_load: bool,
) -> Generator[Tuple[str, torch.Tensor], None, None]:
    """Iterate over the weights in the model safetensor files."""
    # hf_weights_files = hf_weights_files[:5]
```
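For orientation, a safetensors weights iterator typically looks something like the sketch below (a simplified assumption, not vLLM's actual loader); the point is that debug slices like the commented-out hf_weights_files[:5] should not remain in the final code.

```python
# Simplified sketch of iterating (name, tensor) pairs from safetensors shards.
# Not vLLM's implementation; shown only to make the reviewed snippet concrete.
from typing import Generator, List, Tuple

import torch
from safetensors import safe_open


def iterate_safetensors_weights(
        files: List[str]) -> Generator[Tuple[str, torch.Tensor], None, None]:
    for path in files:
        with safe_open(path, framework="pt", device="cpu") as f:
            for name in f.keys():
                yield name, f.get_tensor(name)
```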
        
          
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py (outdated, resolved)
```python
    intermediate_size_per_partition = (
        intermediate_size_per_partition_2x // 2
    )
    # FIXME: Handle mask
```
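The halving presumably reflects the common fused gate/up ("w13") layout, where the packed intermediate dimension is twice the per-projection size; a hedged illustration with made-up shapes:

```python
# Illustration only: a fused gate+up ("w13") expert weight packs two
# projections along the intermediate dimension, so each projection's
# intermediate size is half of the packed (2x) size.
import torch

intermediate_size_per_partition_2x = 4096                    # example packed size
intermediate_size_per_partition = intermediate_size_per_partition_2x // 2

w13 = torch.randn(intermediate_size_per_partition_2x, 1024)  # [2 * inter, hidden]
w_gate, w_up = w13.chunk(2, dim=0)                           # each [inter, hidden]
assert w_gate.shape[0] == intermediate_size_per_partition
```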
```python
def dequant_mx_fp8(weight_fp8, scale_e8m0, block_size):
    # FIXME: (Yi) add support for scale_e8m0 in uint8
```
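As a reference point for the FIXME, dequantizing MX-format FP8 with E8M0 scales stored as uint8 biased exponents could look roughly like the sketch below; this is an assumed reading of the intended semantics, not the code in this PR.

```python
# Hedged sketch of MX FP8 dequantization. Assumes scale_e8m0 is a uint8 tensor
# of E8M0 biased exponents, one scale per block of `block_size` elements along
# the last dimension of the weight.
import torch


def dequant_mx_fp8_sketch(weight_fp8: torch.Tensor,
                          scale_e8m0: torch.Tensor,
                          block_size: int) -> torch.Tensor:
    # E8M0 encodes only a biased exponent: scale = 2 ** (code - 127).
    scales = torch.exp2(scale_e8m0.to(torch.float32) - 127.0)
    weight = weight_fp8.to(torch.float32)
    # Apply one scale per block along the last dimension.
    weight = weight.reshape(*weight.shape[:-1], -1, block_size)
    weight = weight * scales.reshape(*scales.shape[:-1], -1, 1)
    return weight.reshape(weight_fp8.shape)
```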
```python
        return self.fp8_linear.apply(input=x,
                                     weight=layer.weight,
                                     weight_scale=layer.weight_scale,
                                     out_dtype=self.out_dtype,
                                     input_scale=layer.input_scale,
                                     bias=bias)
```
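For readers unfamiliar with the fp8_linear helper, the call above is conceptually close to a dequantize-then-GEMM fallback like the sketch below (ignoring the quantized fast path and activation quantization); this is a simplified assumption, not the library's implementation.

```python
# Rough emulation of what an FP8 linear apply computes: dequantize the FP8
# weight with its scale, run a standard matmul, then cast to out_dtype.
# Assumes a per-tensor (or [out, 1] per-channel) weight_scale.
from typing import Optional

import torch


def fp8_linear_emulated(x: torch.Tensor,
                        weight_fp8: torch.Tensor,    # [out, in] in an FP8 dtype
                        weight_scale: torch.Tensor,
                        out_dtype: torch.dtype,
                        bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    w = weight_fp8.to(torch.float32) * weight_scale.to(torch.float32)
    b = bias.to(torch.float32) if bias is not None else None
    y = torch.nn.functional.linear(x.to(torch.float32), w, b)
    return y.to(out_dtype)
```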
```python
            if name not in params_dict:
                continue
```
This check `if name not in params_dict: continue` is repeated multiple times in `load_weights`. While it prevents crashes, it might hide issues where expected weights are not found in `params_dict`. Could you add a comment explaining why this is necessary? For example, if some weights from the checkpoint are intentionally not used. If this is a temporary workaround, it would be good to note that as well.
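One way to make that intent explicit is to log the skipped names instead of continuing silently; a hedged sketch (function and logger names are placeholders, not code from this PR):

```python
# Hypothetical helper: keep the skip behavior, but make skipped checkpoint
# tensors visible so genuinely missing parameters are not hidden.
import logging
from typing import Dict, Iterable, Iterator, Tuple

import torch

logger = logging.getLogger(__name__)


def filter_known_params(
        weights: Iterable[Tuple[str, torch.Tensor]],
        params_dict: Dict[str, torch.nn.Parameter],
) -> Iterator[Tuple[str, torch.Tensor]]:
    for name, tensor in weights:
        if name not in params_dict:
            # e.g. auxiliary or quantization-only tensors intentionally unused
            logger.warning(
                "Checkpoint tensor %s has no matching parameter; skipping.", name)
            continue
        yield name, tensor
```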
…emes/__init__.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Add Support for MXFP4 Linear and MOE

Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Ziyue-Intel <[email protected]>
Co-authored-by: He, Xin3 <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Yi Liu <[email protected]>

* add nvfp4 support
* update example for nvfp4

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* fix global scale
* reduce memory usage in moe (#59)
* update transformers version
* update
* remove two per_tensor_amax_to_scale

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* fix global scale
* reduce memory usage in moe (#59)
* update transformers version
* update
* enable next token task
* add eval code back
* fix
* clean

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

Support static global scale

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* fix global scale
* reduce memory usage in moe (#59)
* update transformers version
* update
* enable next token task
* add eval code back
* fix
* clean
* use gs
* Update nvfp4_qdq.py
* Apply suggestion from @Copilot
* Update nvfp4_qdq.py

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: Copilot <[email protected]>

* start vllm cmd
* use exisitng torch
* use datasets 3.6
* add even rounding for mxfp4
* Update vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py
* add ao back

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* remove torch from requirements
* update bench code
* update
* update model path

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* fix qwen
* fix topk

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>