CUDA: fix MMQ stream-k fixup ne1 indices #17089
Merged
See the discussion starting with ikawrakow/ik_llama.cpp#728 (comment): the use of MMQ MoE optimizations is resulting in increased perplexity on master. The problem is that the wrong indices are being used when determining which dst columns should receive the stream-k fixup. This is a very typical bug that I encounter during development, but unfortunately this is one of the rare cases where the impact is small enough to be overlooked. Generally speaking, the impact will be largest for the combination of small models and large GPUs where the SM count is not a power of 2 (RTX 4090 for example has 128 SMs). Example models:

The bug in question has been on master since #13199, though the impact became larger with #15525 when the upper bound for CUDA blocks was tightened and more stream-k seams ended up in tiles that are not being skipped.
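
To illustrate why the number of CUDA blocks matters, here is a minimal sketch of a stream-k style work split. It is not the MMQ kernel code: the function names and the even integer split are assumptions made for illustration only. The total work (output tiles times k-blocks per tile) is divided evenly among the CUDA blocks, and any boundary that falls strictly inside a tile is a "seam" whose partial sums need a fixup pass on that tile's dst columns.

```python
def stream_k_ranges(num_tiles, k_blocks_per_tile, num_cuda_blocks):
    """Split the flattened work [0, num_tiles * k_blocks_per_tile) evenly.

    Hypothetical helper: returns one half-open (start, stop) range per CUDA block.
    """
    total = num_tiles * k_blocks_per_tile
    return [
        (b * total // num_cuda_blocks, (b + 1) * total // num_cuda_blocks)
        for b in range(num_cuda_blocks)
    ]


def tiles_needing_fixup(num_tiles, k_blocks_per_tile, num_cuda_blocks):
    """Return the sorted indices of tiles that contain a stream-k seam.

    A seam is a boundary between two CUDA blocks that does not coincide
    with a tile boundary, so the tile's result is spread over several blocks.
    """
    seams = set()
    ranges = stream_k_ranges(num_tiles, k_blocks_per_tile, num_cuda_blocks)
    # The stop of the last range is the end of all work, never a seam.
    for _, stop in ranges[:-1]:
        if stop % k_blocks_per_tile != 0:
            seams.add(stop // k_blocks_per_tile)
    return sorted(seams)


# 4 tiles of 8 k-blocks on 4 CUDA blocks: every boundary is a tile boundary.
print(tiles_needing_fixup(4, 8, 4))  # []
# The same work on 3 CUDA blocks: boundaries at 10 and 21 land inside tiles 1 and 2.
print(tiles_needing_fixup(4, 8, 3))  # [1, 2]
```

In this toy model, the fixup has to be applied to exactly the tiles returned by `tiles_needing_fixup`. Applying it with the wrong indices would correct the wrong dst columns, which is the kind of error described above.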