Conversation

movedancer (Contributor) commented Oct 12, 2025

Description

This submission adds a 4-bit quantized matrix multiplication operator for the Loongson platform. It has passed ONNX Runtime's internal test checks and has been successfully deployed for real inference on the Loongson platform. It includes five modifications:
(1) `sqnbitgemm_kernel_lasx.cpp`: Accelerates inference of 4-bit quantized models on the LoongArch64 architecture using the LASX/LSX vector instruction sets;
(2) `sqnbitgemm_kernel_lasx_common.h`: Implements auxiliary functions used by `sqnbitgemm_kernel_lasx.cpp`;
(3) cmake: Adds compilation options for `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture;
(4) `mlasi.h`: Adds the interface for calling the operator in `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture;
(5) `platform.cpp`: Adds calls to the operators in `sqnbitgemm_kernel_lasx.cpp` under the LoongArch64 architecture.

Motivation and Context

The Loongson platform critically lacks key operators for ONNX quantized-model inference tasks.
This change addresses the poor inference performance of 4-bit quantized models on the Loongson platform. In tests with the Deepseek-R1-1.5B model, the new operators increase TPS by more than 7x, and quantized-matrix dequantization is up to 3x faster.

Pictures

Dequantization Acceleration:
In the chart, the vertical axis is time in milliseconds (ms), the horizontal axis is the number of test matrices, and the quantized matrix size is given as rows × columns (e.g., 1536×256).
[Figure: dequantization acceleration comparison]

movedancer (Contributor, Author)

@microsoft-github-policy-service agree

snnn (Member) left a comment

I have reviewed the pull request and left some comments regarding coding style and conventions. Overall, the changes look good. Thank you for the contribution!

movedancer requested a review from snnn October 14, 2025 02:55
snnn (Member) commented Oct 14, 2025

From a security perspective, an AI bot has reviewed the code and found several potential vulnerabilities, primarily related to integer overflows when calculating buffer sizes and memory offsets. The project's coding conventions mandate the use of SafeInt for such calculations to prevent these issues, but this practice is not followed in the new code.

Here is a summary of the findings, from most to least critical:

Critical Vulnerabilities

  1. Heap Buffer Overflow in Q8ComputePackBlkSum:

    • File: onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx_common.h
    • Line: 252
    • Issue: The allocation std::vector<float> QuantBScaleBeginCopy(N * BlockCountK); is vulnerable to an integer overflow if N * BlockCountK exceeds SIZE_MAX. This would cause std::vector to allocate a much smaller buffer than required. The following std::copy would then write out of bounds, leading to heap corruption.
    • Recommendation: Use SafeInt<size_t> to calculate the size of the vector to prevent overflow.
      SafeInt<size_t> size = static_cast<SafeInt<size_t>>(N) * BlockCountK;
      std::vector<float> QuantBScaleBeginCopy(size.Value());
  2. Integer Overflow in Buffer Size Calculation:

    • File: onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx.cpp
    • Function: QNBitGemmPackQuantBDataSize_Lasx
    • Issue: This function calculates the total size required for a packed buffer. The calculations involve multiplications like N * BlockCountK * ... which can overflow. If this function returns a size that is smaller than the actual required size due to an overflow, the caller will allocate an undersized buffer, leading to heap overflows when this buffer is written to by packing functions.
    • Recommendation: All size calculations in this function must use SafeInt. For example:
      SafeInt<size_t> packed_quant_b_data_size = static_cast<SafeInt<size_t>>(N) * BlockCountK * MlasQNBitBlkDataSizeInBytes(BlkBitWidth, BlkLen);
      Apply this to all multiplications for PackedQuantBDataSize, ScaleSize, and BlkSumSize.

High-Risk Vulnerabilities

  1. Integer Overflows in Offset Calculations:
    • Files: sqnbitgemm_kernel_lasx.cpp, sqnbitgemm_kernel_lasx_common.h
    • Issue: Throughout the code, there are numerous calculations for offsets into buffers, such as n * BlockCountK, c * StrideN, BlockCountK * blk_data_size_in_bytes. These are performed using raw size_t arithmetic and are prone to overflow. An attacker who can control tensor shapes in a model could potentially trigger an overflow, causing pointers to be miscalculated, which would lead to out-of-bounds reads or writes.
    • Examples:
      • sqnbitgemm_kernel_lasx_common.h:55: const size_t src_blk_offset = n * BlockCountK + k_blk;
      • sqnbitgemm_kernel_lasx_common.h:92: const size_t src_data_offset = n * BlockCountK * BlkDataSize + k_subblk * SubBlkDataSize;
      • sqnbitgemm_kernel_lasx.cpp:600: const size_t b_data_col_stride_in_bytes = BlockCountK * blk_data_size_in_bytes;
    • Recommendation: All offset calculations involving multiplications should be performed using SafeInt<size_t> to prevent overflows.
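
As a concrete illustration of this pattern, here is a minimal sketch (the helper function, its parameters, and the surrounding loops are hypothetical; only the offset expression mirrors the first example above, and the SafeInt include path may differ in the repository):

```cpp
// Hypothetical helper for illustration only: copies per-block scales while
// computing the flattened offset with overflow-checked arithmetic.
#include <cstddef>
#include "SafeInt.hpp"  // assumed include; onnxruntime wraps this header in its own

void CopyBlockScalesChecked(const float* src, float* dst,
                            size_t N, size_t BlockCountK) {
    for (size_t n = 0; n < N; ++n) {
        for (size_t k_blk = 0; k_blk < BlockCountK; ++k_blk) {
            // SafeInt throws on overflow instead of silently wrapping, so a
            // corrupted offset can never be used to index into the buffers.
            const size_t src_blk_offset = SafeInt<size_t>(n) * BlockCountK + k_blk;
            dst[src_blk_offset] = src[src_blk_offset];
        }
    }
}
```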

Medium-Risk Vulnerabilities

  1. Unchecked std::memcpy:
    • File: onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx.cpp
    • Lines: e.g., 260, 366
    • Issue: Functions like ComputeDotProducts_BlkLen32Plus_CompFp32_lasx use std::memcpy with a hardcoded size (e.g., 16 or 8 bytes) from a source pointer that is calculated with offsets. While the logic appears correct for valid inputs, there are no explicit checks to ensure that the source buffer is actually large enough to read from. If an integer overflow occurs during the offset calculation, these memcpy operations could read from arbitrary memory locations.
    • Recommendation: While using SafeInt for offset calculations will mitigate most of this risk, it is good practice to be mindful of buffer boundaries when dealing with raw pointers.

Low-Risk/Code Hygiene

  1. Fragile Stack Buffer Usage in load_float_n_lasx:
    • File: onnxruntime/core/mlas/lib/sqnbitgemm_kernel_lasx.cpp
    • Line: 189
    • Issue: The function load_float_n_lasx uses a fixed-size stack buffer buf[8]. The loop for (int i = 0; i < n; ++i) writes to this buffer. While the current callers in this file all ensure n <= 8, this creates a fragile design that could be vulnerable to future changes.
    • Recommendation: Add a check like assert(n <= 8); at the beginning of the function to make the precondition explicit and prevent future misuse.
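
To make the precondition concrete, a minimal sketch of the suggested check (plain C++ standing in for the actual LASX helper; the function name and return type here are illustrative only):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative stand-in for load_float_n_lasx: loads n <= 8 floats into a
// zero-padded 8-lane buffer. The real helper would return an LASX register.
std::array<float, 8> load_float_n_checked(const float* src, int n) {
    // Make the caller's contract explicit so future callers cannot overflow buf.
    assert(n >= 0 && n <= 8 && "load_float_n expects at most 8 elements");
    std::array<float, 8> buf{};  // unused lanes stay zero
    std::memcpy(buf.data(), src, sizeof(float) * static_cast<size_t>(n));
    return buf;
}
```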

movedancer (Contributor, Author)

Thank you for pointing out the flaws! I'll check and fix them right away.

SafeInt has been added to check for overflow risks, bringing the code in line with the latest coding standards. Comments have also been added noting that the memcpy calls carry no overflow risk, since the code pads the data beforehand to guarantee boundary safety. Finally, a boundary check has been added to the load_float_n_lasx function.
movedancer (Contributor, Author)

Hello! I have made code modifications according to your suggestions, corrected all risks and errors, and look forward to your review.

Replace static_cast<SafeInt<size_t>> with SafeInt<size_t>.
snnn closed this Oct 16, 2025
snnn reopened this Oct 16, 2025
snnn (Member) commented Oct 16, 2025

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).

Add SIMD alignment checks, implement a custom MlasAlignedAllocator and call MlasGetPreferredBufferAlignment() to resolve vector memory allocation alignment issues in sqnbitgemm_kernel_lasx_common.h.
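
For reference, a minimal sketch of a C++17 aligned allocator in the spirit of this change (the fixed 32-byte alignment, the class name, and the use of std::aligned_alloc are assumptions made so the example is self-contained; the actual code queries MlasGetPreferredBufferAlignment() at runtime):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Illustrative allocator: hands out storage aligned to the LASX vector width
// so std::vector scratch buffers can be loaded with aligned SIMD instructions.
template <typename T, std::size_t Alignment = 32>
struct AlignedAllocator {
    using value_type = T;

    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        // std::aligned_alloc requires the byte count to be a multiple of the
        // alignment, so round it up. (A production version would also
        // overflow-check n * sizeof(T), as discussed earlier in this thread.)
        const std::size_t bytes = ((n * sizeof(T) + Alignment - 1) / Alignment) * Alignment;
        if (void* p = std::aligned_alloc(Alignment, bytes)) {
            return static_cast<T*>(p);
        }
        throw std::bad_alloc{};
    }

    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Usage: a scratch buffer whose data() pointer is 32-byte aligned.
using AlignedFloatVector = std::vector<float, AlignedAllocator<float>>;
```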
hariharans29 (Member)

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).

snnn merged commit 7cc28b0 into microsoft:main Oct 17, 2025
90 checks passed
movedancer (Contributor, Author) commented Oct 18, 2025 via email

movedancer deleted the add-more-loongson-support branch October 18, 2025 02:35