Hpu mxfp8 moe #51
base: habana-mai-79ba403d5
Conversation
Summary of Changes
Hello @yiliu30, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly expands vLLM's quantization capabilities by introducing support for MXFP8 and NVFP4, with a particular focus on Mixture-of-Experts (MoE) models and optimizations for Habana Processing Units (HPUs). The changes involve adding new quantization schemes, MoE methods, HPU-specific configurations, and an updated dependency, alongside a new example script to facilitate usage.
Highlights
- MXFP8 Quantization for MoE: I've introduced a new Microscaling FP8 (MXFP8) quantization scheme specifically for Mixture-of-Experts (MoE) layers. This includes new dedicated MoE methods and comprehensive weight handling for MXFP8, enabling more efficient inference for MoE models.
- NVFP4 Quantization Support: I've added support for NVIDIA FP4 (NVFP4) quantization. This involves a new quantization scheme and a corresponding MoE method, with capabilities for both direct Cutlass kernel utilization on compatible CUDA devices and software emulation for broader platform support.
- HPU-Specific Optimizations: I've incorporated several optimizations tailored for Habana Processing Units (HPUs). This includes new environment variables like VLLM_USE_STATIC_MOE_HPU and VLLM_HPU_FORCE_CHANNEL_FP8, designed to improve performance and control behavior on HPU hardware.
- Dependency Update & Example: I've updated the compressed-tensors library to version 0.10.2 to support the new quantization features. Additionally, I've added a new example script (basic_hpu.py) to demonstrate basic offline inference on HPU, showcasing the setup with relevant environment variables and model paths (see the sketch after this list).
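The sketch below illustrates what a basic_hpu.py-style offline run with these settings might look like. It is only an orientation aid: the model path, prompt, sampling settings, and environment-variable values are placeholders, not the actual contents of the example script.

```python
# Hypothetical sketch of an offline HPU inference run using the new env vars.
# The env var names come from this PR; their values and the model path here
# are illustrative placeholders, not the script's real settings.
import os

os.environ.setdefault("VLLM_USE_STATIC_MOE_HPU", "1")      # assumed value
os.environ.setdefault("VLLM_HPU_FORCE_CHANNEL_FP8", "0")   # assumed value

from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/mxfp8-quantized-moe-model")      # placeholder path
sampling = SamplingParams(temperature=0.0, max_tokens=64)

for output in llm.generate(["What is MXFP8 quantization?"], sampling):
    print(output.outputs[0].text)
```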
Code Review
This pull request introduces support for MXFP8 and NVFP4 quantization schemes, particularly for Mixture of Experts (MoE) models on Habana Processing Units (HPU). The changes are extensive, adding new quantization schemes, emulation utilities, and HPU-specific logic.
I've found a few critical issues, including a syntax error and a debug script that should be removed. There are also several instances of leftover debugging code, FIXME comments, and local imports that need to be addressed. Please review the comments for details.
        
          
vllm/model_executor/layers/quantization/compressed_tensors/schemes/__init__.py (outdated, resolved)
```python
        self, weight_quant: BaseModel,
        input_quant: BaseModel) -> "CompressedTensorsScheme":

        # breakpoint()
```
        
          
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py (outdated, resolved)
```python
    use_tqdm_on_load: bool,
) -> Generator[Tuple[str, torch.Tensor], None, None]:
    """Iterate over the weights in the model safetensor files."""
    # hf_weights_files = hf_weights_files[:5]
```
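For orientation, a safetensors weights iterator typically looks something like the sketch below (a simplified assumption, not vLLM's actual loader); the point is that debug slices like the commented-out hf_weights_files[:5] should not remain in the final code.

```python
# Simplified sketch of iterating (name, tensor) pairs from safetensors shards.
# Not vLLM's implementation; shown only to make the reviewed snippet concrete.
from typing import Generator, List, Tuple

import torch
from safetensors import safe_open


def iterate_safetensors_weights(
        files: List[str]) -> Generator[Tuple[str, torch.Tensor], None, None]:
    for path in files:
        with safe_open(path, framework="pt", device="cpu") as f:
            for name in f.keys():
                yield name, f.get_tensor(name)
```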
        
          
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py (outdated, resolved)
```python
    intermediate_size_per_partition = (
        intermediate_size_per_partition_2x // 2
    )
    # FIXME: Handle mask
```
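The halving presumably reflects the common fused gate/up ("w13") layout, where the packed intermediate dimension is twice the per-projection size; a hedged illustration with made-up shapes:

```python
# Illustration only: a fused gate+up ("w13") expert weight packs two
# projections along the intermediate dimension, so each projection's
# intermediate size is half of the packed (2x) size.
import torch

intermediate_size_per_partition_2x = 4096                    # example packed size
intermediate_size_per_partition = intermediate_size_per_partition_2x // 2

w13 = torch.randn(intermediate_size_per_partition_2x, 1024)  # [2 * inter, hidden]
w_gate, w_up = w13.chunk(2, dim=0)                           # each [inter, hidden]
assert w_gate.shape[0] == intermediate_size_per_partition
```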
```python
def dequant_mx_fp8(weight_fp8, scale_e8m0, block_size):
    # FIXME: (Yi) add support for scale_e8m0 in uint8
```
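As a reference point for the FIXME, dequantizing MX-format FP8 with E8M0 scales stored as uint8 biased exponents could look roughly like the sketch below; this is an assumed reading of the intended semantics, not the code in this PR.

```python
# Hedged sketch of MX FP8 dequantization. Assumes scale_e8m0 is a uint8 tensor
# of E8M0 biased exponents, one scale per block of `block_size` elements along
# the last dimension of the weight.
import torch


def dequant_mx_fp8_sketch(weight_fp8: torch.Tensor,
                          scale_e8m0: torch.Tensor,
                          block_size: int) -> torch.Tensor:
    # E8M0 encodes only a biased exponent: scale = 2 ** (code - 127).
    scales = torch.exp2(scale_e8m0.to(torch.float32) - 127.0)
    weight = weight_fp8.to(torch.float32)
    # Apply one scale per block along the last dimension.
    weight = weight.reshape(*weight.shape[:-1], -1, block_size)
    weight = weight * scales.reshape(*scales.shape[:-1], -1, 1)
    return weight.reshape(weight_fp8.shape)
```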
```python
        return self.fp8_linear.apply(input=x,
                                     weight=layer.weight,
                                     weight_scale=layer.weight_scale,
                                     out_dtype=self.out_dtype,
                                     input_scale=layer.input_scale,
                                     bias=bias)
```
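For readers unfamiliar with the fp8_linear helper, the call above is conceptually close to a dequantize-then-GEMM fallback like the sketch below (ignoring the quantized fast path and activation quantization); this is a simplified assumption, not the library's implementation.

```python
# Rough emulation of what an FP8 linear apply computes: dequantize the FP8
# weight with its scale, run a standard matmul, then cast to out_dtype.
# Assumes a per-tensor (or [out, 1] per-channel) weight_scale.
from typing import Optional

import torch


def fp8_linear_emulated(x: torch.Tensor,
                        weight_fp8: torch.Tensor,    # [out, in] in an FP8 dtype
                        weight_scale: torch.Tensor,
                        out_dtype: torch.dtype,
                        bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    w = weight_fp8.to(torch.float32) * weight_scale.to(torch.float32)
    b = bias.to(torch.float32) if bias is not None else None
    y = torch.nn.functional.linear(x.to(torch.float32), w, b)
    return y.to(out_dtype)
```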
```python
            if name not in params_dict:
                continue
```
This check `if name not in params_dict: continue` is repeated multiple times in `load_weights`. While it prevents crashes, it might hide issues where expected weights are not found in `params_dict`. Could you add a comment explaining why this is necessary? For example, if some weights from the checkpoint are intentionally not used. If this is a temporary workaround, it would be good to note that as well.
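One way to make that intent explicit is to log the skipped names instead of continuing silently; a hedged sketch (function and logger names are placeholders, not code from this PR):

```python
# Hypothetical helper: keep the skip behavior, but make skipped checkpoint
# tensors visible so genuinely missing parameters are not hidden.
import logging
from typing import Dict, Iterable, Iterator, Tuple

import torch

logger = logging.getLogger(__name__)


def filter_known_params(
        weights: Iterable[Tuple[str, torch.Tensor]],
        params_dict: Dict[str, torch.nn.Parameter],
) -> Iterator[Tuple[str, torch.Tensor]]:
    for name, tensor in weights:
        if name not in params_dict:
            # e.g. auxiliary or quantization-only tensors intentionally unused
            logger.warning(
                "Checkpoint tensor %s has no matching parameter; skipping.", name)
            continue
        yield name, tensor
```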
…emes/__init__.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Add Support for MXFP4 Linear and MOE

Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Ziyue-Intel <[email protected]>
Co-authored-by: He, Xin3 <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Yi Liu <[email protected]>

* add nvfp4 support
* update example for nvfp4

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* fix global scale
* reduce memory usage in moe (#59)
* update transformers version
* update
* remove two per_tensor_amax_to_scale

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* fix global scale
* reduce memory usage in moe (#59)
* update transformers version
* update
* enable next token task
* add eval code back
* fix
* clean

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

Support static global scale

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* fix global scale
* reduce memory usage in moe (#59)
* update transformers version
* update
* enable next token task
* add eval code back
* fix
* clean
* use gs
* Update nvfp4_qdq.py
* Apply suggestion from @Copilot
* Update nvfp4_qdq.py

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: Copilot <[email protected]>

* start vllm cmd
* use exisitng torch
* use datasets 3.6
* add even rounding for mxfp4
* Update vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py
* add ao back

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* remove torch from requirements
* update bench code
* update
* update model path

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

* fix qwen
* fix topk

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>

Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>