Quantization format allocation strategy #15350

apaz-cli · 2025-08-15T17:16:00Z


I'm trying to implement an AVX-512 version of the mxfp4-e8m0 dot product.

The biggest performance gain would come from loading four consecutive 32-element q4 blocks with a single load (4 × 16 bytes of packed 4-bit values = 64 bytes, one 512-bit register), and two q8 blocks with a single load (2 × 32 bytes = 64 bytes).

But it looks like I can't do that. When I print out the addresses of the element buffers in each block, they aren't contiguous in memory.
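One plausible explanation, sketched below, is that the blocks form an array of structs in which each block stores its scale right next to its quants. The struct here is a hypothetical mirror of that layout (`sketch_block_mxfp4`, `QK_SKETCH`, `qs_stride` and `qs_gap` are illustrative names, not ggml identifiers); the real definition should be checked in ggml's headers. Under that assumption the element buffers are evenly strided rather than scattered, but the scale byte between them still prevents a single wide load from covering several blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK_SKETCH 32  /* elements per block (assumed) */

/* Hypothetical mirror of an MXFP4 block: one E8M0 scale byte followed by
   32 packed 4-bit values. Not the real ggml definition. */
typedef struct {
    uint8_t e;                  /* shared E8M0 exponent */
    uint8_t qs[QK_SKETCH / 2];  /* 32 x 4-bit = 16 bytes */
} sketch_block_mxfp4;

/* The struct contains only bytes, so no padding is expected. */
static_assert(sizeof(sketch_block_mxfp4) == 17, "unexpected padding");

/* Byte distance from the start of x[ib].qs to the start of x[ib + 1].qs. */
static size_t qs_stride(const sketch_block_mxfp4 *x, size_t ib) {
    return (size_t)((const uint8_t *)x[ib + 1].qs - (const uint8_t *)x[ib].qs);
}

/* Bytes between the end of one block's qs and the start of the next one's. */
static size_t qs_gap(const sketch_block_mxfp4 *x, size_t ib) {
    return qs_stride(x, ib) - sizeof x[ib].qs;
}
```

With this layout, `qs_stride` is 17 bytes and `qs_gap` is 1 byte: a 64-byte load starting at `x[ib].qs` would pick up three scale bytes and miss the tail of the fourth block's quants.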

The AVX2 version, with 256-bit registers, just calls _mm_loadu_si128 twice to load the elements from two blocks and then combines the two into a single register. This, too, could be done with a single load if the element storage of consecutive blocks were contiguous.

// ggml_vec_dot_mxfp4_q8_0() in ggml/src/ggml-cpu/quants.c
const __m128i q4bits_1 = _mm_loadu_si128((const __m128i*)x[ib + 0].qs);
const __m128i q4bits_2 = _mm_loadu_si128((const __m128i*)x[ib + 1].qs);

vs

// These could be one load with a different allocation strategy
const __m256i q4bits_12 = _mm256_loadu_si256((const __m256i*)x[ib].qs);
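To make "a different allocation strategy" concrete, here is a minimal, portable sketch of one option: regrouping four blocks so that their scales come first and their quants sit in one contiguous 64-byte run. All names (`sketch_block`, `sketch_block_x4`, `repack_x4`, `QS_BYTES`) are hypothetical, and the per-block layout is the same assumption as above, not a confirmed ggml definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QS_BYTES 16  /* 32 x 4-bit values per block (assumed) */

/* Hypothetical per-block layout: scale byte, then packed quants. */
typedef struct {
    uint8_t e;
    uint8_t qs[QS_BYTES];
} sketch_block;

/* Hypothetical 4-block group: four scales first, then 64 contiguous
   quant bytes, i.e. one 512-bit register's worth. */
typedef struct {
    uint8_t e[4];
    uint8_t qs[4 * QS_BYTES];
} sketch_block_x4;

/* Copy four consecutive blocks from src into the grouped layout. */
static void repack_x4(sketch_block_x4 *dst, const sketch_block *src) {
    for (int i = 0; i < 4; ++i) {
        dst->e[i] = src[i].e;
        memcpy(dst->qs + i * QS_BYTES, src[i].qs, QS_BYTES);
    }
}
```

After such a repack, a single `_mm512_loadu_si512(dst->qs)` would cover the quants of all four blocks. The trade-off is that the data has to be converted once up front (or stored in this form), and every other kernel reading the same tensor would need to understand the grouped layout.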

I spent about an hour grepping around the codebase trying to figure out where the blocks are allocated, but couldn't find it.

Is this something that I could/should change, and where would I do that?

apaz-cli · 2025-08-15T17:19:23Z


Closing because I'm opening an issue for this instead.
