[LARCH64 CPU]Provide inference acceleration optimization for Loongson CPU with 4-bit quantized models #26280
(1) `sqnbitgemm_kernel_lasx.cpp`: Acceleration of inference for 4-bit quantized models on the LoongArch64 architecture, utilizing lasx/lsx vector instruction sets;
(2) `sqnbitgemm_kernel_lasx_common.h`: Implementation of auxiliary functions used by `sqnbitgemm_kernel_lasx.cpp`;
(3) `make`: Added compilation options for `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture;
(4) `mlasi.h`: Added interface for calling the operator in `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture;
(5) `platform.cpp`: Added calls to the operators in `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture.
@microsoft-github-policy-service agree
I have reviewed the pull request and left some comments regarding coding style and conventions. Overall, the changes look good. Thank you for the contribution!
From a security perspective, an AI bot has reviewed the code and found several potential vulnerabilities, primarily related to integer overflows when calculating buffer sizes and memory offsets. The project's coding conventions mandate the use of
Here is a summary of the findings, from most to least critical:
Critical Vulnerabilities
High-Risk Vulnerabilities
Medium-Risk Vulnerabilities
Low-Risk/Code Hygiene
Thank you for pointing out the flaws! I'll check and fix them right away.
SafeInt has been added to check for overflow risks, so that the code complies with the latest coding standards. Comments have also been added to explain why the memcpy calls carry no overflow risk: the code pads the data beforehand, so every copy stays within bounds. Finally, bounds checks have been added to the load_float_n_lasx function.
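The overflow concern raised in the review can be illustrated with a minimal sketch in standard C++. It does not use SafeInt itself; `CheckedMul` and `PackedInt4Bytes` are hypothetical helpers written for this example, and the two-values-per-byte packing is an assumption about a typical 4-bit layout, not a statement about this PR's buffers.

```cpp
#include <cstddef>
#include <limits>
#include <optional>

// Hypothetical helper (not part of MLAS or SafeInt): multiplies two sizes and
// returns std::nullopt instead of silently wrapping around on overflow.
inline std::optional<std::size_t> CheckedMul(std::size_t a, std::size_t b) {
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return std::nullopt;  // a * b would not fit in size_t
    }
    return a * b;
}

// Hypothetical example: byte size of a packed 4-bit buffer holding
// rows * cols elements, assuming two elements per byte, rounded up.
inline std::optional<std::size_t> PackedInt4Bytes(std::size_t rows, std::size_t cols) {
    const std::optional<std::size_t> count = CheckedMul(rows, cols);
    if (!count) {
        return std::nullopt;  // propagate the overflow to the caller
    }
    return *count / 2 + *count % 2;  // ceil(count / 2), cannot overflow
}
```

A caller would check the returned optional before allocating or indexing, which is roughly the guarantee a checked integer type provides by throwing or asserting.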
Hello! I have made code modifications according to your suggestions, corrected all risks and errors, and look forward to your review.
Replace static_cast<SafeInt<size_t>> with SafeInt<size_t>.
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Add SIMD alignment checks, implement a custom MlasAlignedAllocator and call MlasGetPreferredBufferAlignment() to resolve vector memory allocation alignment issues in sqnbitgemm_kernel_lasx_common.h.
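As a rough illustration of what an aligned allocator for vector buffers can look like, here is a minimal C++17 sketch. The names `AlignedAllocator` and `IsAligned` are hypothetical and this is not the PR's `MlasAlignedAllocator`; the default of 32 bytes is an assumption chosen to match a 256-bit vector register, whereas the PR queries the alignment through `MlasGetPreferredBufferAlignment()`.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator (name and design are assumptions for illustration):
// hands out storage aligned to `Alignment` bytes via C++17 aligned new.
template <typename T, std::size_t Alignment = 32>
struct AlignedAllocator {
    static_assert((Alignment & (Alignment - 1)) == 0, "Alignment must be a power of two");
    static_assert(Alignment >= alignof(T), "Alignment must cover T's own alignment");

    using value_type = T;

    // Explicit rebind is required because Alignment is a non-type parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        // Reject element counts whose byte size would overflow size_t.
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(Alignment)));
    }

    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Alignment check of the kind a kernel might assert before an aligned vector load.
inline bool IsAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With this sketch, `std::vector<float, AlignedAllocator<float, 32>>` yields a buffer whose `data()` pointer passes `IsAligned(ptr, 32)`, so aligned vector loads on that buffer are safe.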
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Thank you for the approval! I'm glad my contribution was accepted.
Description
This PR adds a 4-bit quantized matrix multiplication operator for the Loongson platform. It has passed the internal test checks of ONNX and has been successfully deployed for actual inference on the Loongson platform. It includes five changes:
(1) `sqnbitgemm_kernel_lasx.cpp`: Acceleration of inference for 4-bit quantized models on the LoongArch64 architecture, utilizing lasx/lsx vector instruction sets;
(2) `sqnbitgemm_kernel_lasx_common.h`: Implementation of auxiliary functions used by `sqnbitgemm_kernel_lasx.cpp`;
(3) `cmake`: Added compilation options for `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture;
(4) `mlasi.h`: Added interface for calling the operator in `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture;
(5) `platform.cpp`: Added calls to the operators in `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture.
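For readers unfamiliar with what such a kernel accelerates, the following scalar reference shows blockwise 4-bit dequantization in plain C++. It is a sketch under stated assumptions, not the PR's implementation: the function name `DequantizeInt4`, the low-nibble-first packing, the fixed zero point of 8, and one float scale per block are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for blockwise 4-bit dequantization (hypothetical helper).
// Assumed layout: each byte holds two unsigned 4-bit values, low nibble first;
// every `block_len` consecutive values share one float scale; zero point is 8.
inline std::vector<float> DequantizeInt4(const std::vector<std::uint8_t>& packed,
                                         const std::vector<float>& scales,
                                         std::size_t block_len) {
    constexpr int kZeroPoint = 8;  // assumed default zero point
    const std::size_t count = packed.size() * 2;
    std::vector<float> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint8_t byte = packed[i / 2];
        // Even indices read the low nibble, odd indices the high nibble.
        const int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        // at() throws std::out_of_range if too few scales were supplied.
        const float scale = scales.at(i / block_len);
        out.push_back(static_cast<float>(q - kZeroPoint) * scale);
    }
    return out;
}
```

A vectorized kernel performs the same unpack, subtract, and scale steps on many values per instruction, which is where the speedup over a scalar loop like this one comes from.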
Motivation and Context
The Loongson platform currently lacks key operators for ONNX quantized model inference.
This PR addresses the poor inference performance of 4-bit quantized models on the Loongson platform. In tests using the Deepseek-R1-1.5B model, our operators increased TPS by more than 7 times, and dequantization of the quantized matrix became up to 3 times faster.
Pictures
Dequantization Acceleration:

In the chart, the vertical axis shows time in milliseconds (ms) and the horizontal axis shows the number of test matrices. The size of each quantized matrix is given as rows × columns, for example 1536 × 256.