Replies: 1 comment
Closing because I'm making an issue out of this instead.
I'm trying to implement an avx512 version of the mxfp4-e8m0 dot product.
The biggest performance gain would come from loading four consecutive blocks of 32 q4 elements at once, and two consecutive blocks of q8 elements at once.
But it looks like I can't do that. When I print out the addresses of the buffers in each block containing the elements, they're stored all over the place in memory.
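One layout that would produce this kind of address pattern, sketched here purely as an assumption: if each block is a small struct holding a scale byte followed by its packed elements, then the array of blocks is contiguous as a whole, but the element buffers of neighbouring blocks are separated by that scale byte. The struct and names below (`demo_block`, `QK`, `qs_stride`, `qs_gap`) are hypothetical and not taken from the codebase.

```c
#include <stddef.h>
#include <stdint.h>

#define QK 32 /* elements per block, as described in the post */

/* Hypothetical block layout (an assumption, not the real definition):
 * one E8M0 scale byte, then 32 4-bit elements packed into 16 bytes. */
typedef struct {
    uint8_t e;          /* shared scale for the block */
    uint8_t qs[QK / 2]; /* packed 4-bit elements */
} demo_block;

/* Byte distance between the element buffers of block i and block i + 1. */
static ptrdiff_t qs_stride(const demo_block *blocks, size_t i) {
    return (const uint8_t *)blocks[i + 1].qs - (const uint8_t *)blocks[i].qs;
}

/* Bytes separating the end of block i's elements from the start of
 * block i + 1's elements. Zero would mean a single wide load could
 * cover both blocks. */
static ptrdiff_t qs_gap(const demo_block *blocks, size_t i) {
    return qs_stride(blocks, i) - (ptrdiff_t)sizeof blocks[i].qs;
}
```

With this assumed layout the element buffers sit 17 bytes apart, so they are neither adjacent nor aligned, and a single 512-bit load would pull in the scale bytes along with the elements. Checking `sizeof` on the real block type and the difference between two neighbouring element pointers would confirm whether this is what is happening.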
The avx2 version with 256-bit wide registers just calls `_mm_loadu_si128` twice to load the elements from two blocks, and then combines the two into a single register. This, too, could be done with a single load if the storage inside the blocks were contiguous.
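A minimal sketch of that two-load pattern, for reference. The block struct and the function name `load_two_blocks` are hypothetical, and the way the halves are combined is one common idiom rather than necessarily what the existing avx2 code does. The portable fallback is only there so the snippet compiles without AVX2 enabled.

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical block layout (an assumption): scale byte + 16 packed bytes. */
typedef struct {
    uint8_t e;
    uint8_t qs[16];
} q4_block;

/* Gather the packed elements of b[0] and b[1] into one 32-byte buffer. */
static void load_two_blocks(const q4_block *b, uint8_t out[32]) {
#if defined(__AVX2__)
    /* Two unaligned 128-bit loads, one per block. */
    __m128i lo = _mm_loadu_si128((const __m128i *)b[0].qs);
    __m128i hi = _mm_loadu_si128((const __m128i *)b[1].qs);
    /* Place lo in the low lane and hi in the high lane of a 256-bit register. */
    __m256i v = _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
    _mm256_storeu_si256((__m256i *)out, v);
#else
    /* Scalar fallback with the same result. */
    memcpy(out, b[0].qs, 16);
    memcpy(out + 16, b[1].qs, 16);
#endif
}
```

If the element buffers of adjacent blocks were back to back, the two loads and the lane insert would collapse into a single `_mm256_loadu_si256`, which is the saving described above.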
I spent like an hour grepping around the codebase trying to figure out where the vectors were allocated from, but couldn't find it.
Is this something that I could/should change, and where would I do that?