[LARCH64 CPU]Provide inference acceleration optimization for Loongson CPU with 4-bit quantized models #26280

movedancer · 2025-10-12T03:45:03Z

Description

This submission is a 4-bit quantized matrix multiplication operator suitable for the Loongson platform. It has passed the internal test checks of ONNX and has been successfully deployed for actual inference on the Loongson platform. It includes five modifications:
(1) sqnbitgemm_kernel_lasx.cpp: Acceleration of inference for 4-bit quantized models on the LoongArch64 architecture, utilizing lasx/lsx vector instruction sets;
(2) sqnbitgemm_kernel_lasx_common.h: Implementation of auxiliary functions used by sqnbitgemm_kernel_lasx.cpp`;
(3) cmake: Added compilation options for sqnbitgemm_kernel_lasx.cpp under the LoongArch64 architecture;
(4) mlasi.h: Added interface for calling the operator in sqnbitgemm_kernel_lasx.cpp under the LoongArch64 architecture;
(5) platform.cpp: Added calls to the operators in sqnbitgemm_kernel_lasx.cpp under the LoongArch64 architecture.

Motivation and Context

Loongson has a critical lack of key operations in ONNX quantized model inference tasks.
The issue of poor inference performance for 4-bit quantized models on the Loongson platform has been addressed. In tests using the Deepseek-R1-1.5B model, our operators have increased TPS by more than 7 times, with the speed of quantization matrix dequantization improving by up to 3 times.

Pictures

Dequantization Acceleration：
In the chart, the vertical axis represents time in milliseconds (ms), the horizontal axis represents the number of test matrices, and the size of the quantized matrix is rows × columns, such as the 1536*256.

(1) `sqnbitgemm_kernel_lasx.cpp`: Acceleration of inference for 4-bit quantized models on the LoongArch64 architecture, utilizing lasx/lsx vector instruction sets; (2) `sqnbitgemm_kernel_lasx_common.h`: Implementation of auxiliary functions used by `sqnbitgemm_kernel_lasx.cpp`; (3) `make`: Added compilation options for `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture; (4) `mlasi.h`: Added interface for calling the operator in `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture; (5) `platform.cpp`: Added calls to the operators in `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture.

movedancer · 2025-10-12T03:57:16Z

@microsoft-github-policy-service agree

snnn

I have reviewed the pull request and left some comments regarding coding style and conventions. Overall, the changes look good. Thank you for the contribution!

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx.cpp

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx_common.h

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx.cpp

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx_common.h

snnn · 2025-10-14T02:58:07Z

From a security perspective, an AI bot has reviewed the code and found several potential vulnerabilities, primarily related to integer overflows when calculating buffer sizes and memory offsets. The project's coding conventions mandate the use of SafeInt for such calculations to prevent these issues, but this practice is not followed in the new code.

Here is a summary of the findings, from most to least critical:

Critical Vulnerabilities

Heap Buffer Overflow in Q8ComputePackBlkSum:
- File: onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx_common.h
- Line: 252
- Issue: The allocation std::vector<float> QuantBScaleBeginCopy(N * BlockCountK); is vulnerable to an integer overflow if N * BlockCountK exceeds SIZE_MAX. This would cause std::vector to allocate a much smaller buffer than required. The following std::copy would then write out of bounds, leading to heap corruption.
- Recommendation: Use SafeInt<size_t> to calculate the size of the vector to prevent overflow.
```
SafeInt<size_t> size = static_cast<SafeInt<size_t>>(N) * BlockCountK;
std::vector<float> QuantBScaleBeginCopy(size.Value());
```
Integer Overflow in Buffer Size Calculation:
- File: onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx.cpp
- Function: QNBitGemmPackQuantBDataSize_Lasx
- Issue: This function calculates the total size required for a packed buffer. The calculations involve multiplications like N * BlockCountK * ... which can overflow. If this function returns a size that is smaller than the actual required size due to an overflow, the caller will allocate an undersized buffer, leading to heap overflows when this buffer is written to by packing functions.
- Recommendation: All size calculations in this function must use SafeInt. For example:
```
SafeInt<size_t> packed_quant_b_data_size = static_cast<SafeInt<size_t>>(N) * BlockCountK * MlasQNBitBlkDataSizeInBytes(BlkBitWidth, BlkLen);
```
  Apply this to all multiplications for PackedQuantBDataSize, ScaleSize, and BlkSumSize.

High-Risk Vulnerabilities

Integer Overflows in Offset Calculations:
- Files: sqnbitgemm_kernel_lasx.cpp, sqnbitgemm_kernel_lasx_common.h
- Issue: Throughout the code, there are numerous calculations for offsets into buffers, such as n * BlockCountK, c * StrideN, BlockCountK * blk_data_size_in_bytes. These are performed using raw size_t arithmetic and are prone to overflow. An attacker who can control tensor shapes in a model could potentially trigger an overflow, causing pointers to be miscalculated, which would lead to out-of-bounds reads or writes.
- Examples:
  - sqnbitgemm_kernel_lasx_common.h:55: const size_t src_blk_offset = n * BlockCountK + k_blk;
  - sqnbitgemm_kernel_lasx_common.h:92: const size_t src_data_offset = n * BlockCountK * BlkDataSize + k_subblk * SubBlkDataSize;
  - sqnbitgemm_kernel_lasx.cpp:600: const size_t b_data_col_stride_in_bytes = BlockCountK * blk_data_size_in_bytes;
- Recommendation: All offset calculations involving multiplications should be performed using SafeInt<size_t> to prevent overflows.

Medium-Risk Vulnerabilities

Unchecked std::memcpy:
- File: onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx.cpp
- Lines: e.g., 260, 366
- Issue: Functions like ComputeDotProducts_BlkLen32Plus_CompFp32_lasx use std::memcpy with a hardcoded size (e.g., 16 or 8 bytes) from a source pointer that is calculated with offsets. While the logic appears correct for valid inputs, there are no explicit checks to ensure that the source buffer is actually large enough to read from. If an integer overflow occurs during the offset calculation, these memcpy operations could read from arbitrary memory locations.
- Recommendation: While using SafeInt for offset calculations will mitigate most of this risk, it is good practice to be mindful of buffer boundaries when dealing with raw pointers.

Low-Risk/Code Hygiene

Fragile Stack Buffer Usage in load_float_n_lasx:
- File: onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx.cpp
- Line: 189
- Issue: The function load_float_n_lasx uses a fixed-size stack buffer buf[8]. The loop for (int i = 0; i < n; ++i) writes to this buffer. While the current callers in this file all ensure n <= 8, this creates a fragile design that could be vulnerable to future changes.
- Recommendation: Add a check like assert(n <= 8); at the beginning of the function to make the precondition explicit and prevent future misuse.

movedancer · 2025-10-14T03:00:56Z

Thank you for pointing out the flaws! I'll check and fix them right away.

SafeInt has been added to check for overflow risks, ensuring that the code complies with the latest coding standards

SafeInt has been added to check for overflow risks, ensuring that the code complies with the latest coding standards. At the same time, comments have been added to indicate that there is no overflow risk when using memcpy, and the code has previously filled the data to ensure boundary security. Finally, boundary checks for the load_float_n_lasx function have been added.

movedancer · 2025-10-14T07:03:53Z

Hello! I have made code modifications according to your suggestions, corrected all risks and errors, and look forward to your review.

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx_common.h

Replace static_cast<SafeInt<size_t>> with SafeInt<size_t>.

snnn · 2025-10-16T22:19:00Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-16T22:19:17Z

Azure Pipelines successfully started running 4 pipeline(s).

Add SIMD alignment checks, implement a custom MlasAlignedAllocator and call MlasGetPreferredBufferAlignment() to resolve vector memory allocation alignment issues in sqnbitgemm_kernel_lasx_common.h.

hariharans29 · 2025-10-17T20:46:16Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-17T20:46:34Z

Azure Pipelines successfully started running 4 pipeline(s).

movedancer · 2025-10-18T02:33:43Z

Thank you for the approval! I'm glad my contribution was accepted.

snnn reviewed Oct 14, 2025

View reviewed changes

movedancer requested a review from snnn October 14, 2025 02:55

movedancer added 2 commits October 14, 2025 14:59

Update sqnbitgemm_kernel_lasx_common.h

b3407de

SafeInt has been added to check for overflow risks, ensuring that the code complies with the latest coding standards

snnn reviewed Oct 14, 2025

View reviewed changes

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx_common.h Outdated Show resolved Hide resolved

movedancer added 2 commits October 15, 2025 11:03

Update sqnbitgemm_kernel_lasx_common.h

4cae30a

Replace static_cast<SafeInt<size_t>> with SafeInt<size_t>.

Update sqnbitgemm_kernel_lasx.cpp

a4c9b71

Replace static_cast<SafeInt<size_t>> with SafeInt<size_t>.

snnn closed this Oct 16, 2025

snnn reopened this Oct 16, 2025

Add MlasAlignedAllocator for aligned memory allocation

0a99f58

Add SIMD alignment checks, implement a custom MlasAlignedAllocator and call MlasGetPreferredBufferAlignment() to resolve vector memory allocation alignment issues in sqnbitgemm_kernel_lasx_common.h.

snnn approved these changes Oct 17, 2025

View reviewed changes

snnn merged commit 7cc28b0 into microsoft:main Oct 17, 2025
90 checks passed

movedancer deleted the add-more-loongson-support branch October 18, 2025 02:35

[LARCH64 CPU]Provide inference acceleration optimization for Loongson CPU with 4-bit quantized models #26280

[LARCH64 CPU]Provide inference acceleration optimization for Loongson CPU with 4-bit quantized models #26280

Uh oh!

Conversation

movedancer commented Oct 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

Pictures

Uh oh!

movedancer commented Oct 12, 2025

Uh oh!

snnn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

snnn commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Critical Vulnerabilities

High-Risk Vulnerabilities

Medium-Risk Vulnerabilities

Low-Risk/Code Hygiene

Uh oh!

movedancer commented Oct 14, 2025

Uh oh!

movedancer commented Oct 14, 2025

Uh oh!

Uh oh!

snnn commented Oct 16, 2025

Uh oh!

azure-pipelines bot commented Oct 16, 2025

Uh oh!

hariharans29 commented Oct 17, 2025

Uh oh!

azure-pipelines bot commented Oct 17, 2025

Uh oh!

Uh oh!

movedancer commented Oct 18, 2025 via email • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

movedancer commented Oct 12, 2025 •

edited

Loading

snnn commented Oct 14, 2025 •

edited

Loading

movedancer commented Oct 18, 2025 via email •

edited

Loading