
Conversation

@JohannesGaessler
Collaborator

See the discussion starting with ikawrakow/ik_llama.cpp#728 (comment): the use of the MMQ MoE optimizations is resulting in increased perplexity on master. The problem is that the wrong indices are being used when determining which dst columns should receive the stream-k fixup. This is a very typical bug that I encounter during development, but unfortunately this is one of the rare cases where the impact is small enough to be overlooked. Generally speaking, the impact is largest for the combination of small models and large GPUs whose SM count is not a power of 2 (the RTX 4090, for example, has 128 SMs, which is why it is essentially unaffected in the table below). Example models (an illustrative sketch of the index fix follows the table):

| Model | GPU | PPL master | PPL PR |
| --- | --- | --- | --- |
| GraniteMoe 3b q4_0 | RTX 3090 | 10.5577 | 10.0799 |
| GraniteMoe 3b q4_0 | RTX 4090 | 10.0879 | 10.0877 |
| GraniteMoe 3b q4_0 | RTX 5090 | 15.4910 | 10.0853 |
| Qwen 3 30b q4_0 | RTX 3090 | 9.3418 | 9.2930 |
| Qwen 3 30b q4_0 | RTX 4090 | 9.2928 | 9.2928 |
| Qwen 3 30b q4_0 | RTX 5090 | 9.4052 | 9.2930 |
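
To illustrate the nature of the problem, here is a minimal host-side sketch with invented names (`RowMapping`, `fixup_buf`, `apply_stream_k_fixup` are not the actual ggml-cuda symbols): in an MoE matmul the dst columns handled by a tile are not contiguous, they are gathered through an expert-to-token index, so when the stream-k fixup adds a partial tile back into dst it has to resolve the column through the same indirection. Using the tile-local column index directly applies the correction to the wrong dst columns.

```cpp
// Hedged sketch, not the actual ggml-cuda code: all names here are invented.
// It only illustrates why the stream-k fixup for an MoE tile must resolve dst
// columns through the same expert->token mapping that the main kernel used.
#include <cstddef>
#include <vector>

struct RowMapping {
    int dst_col; // dst column that tile-local column (col0 + j) maps to
};

// Add a partial tile ("fixup") produced by the CUDA block that only computed part of the
// tile's k-range to the final result. dst is column-major with nrows_dst rows per column.
void apply_stream_k_fixup(std::vector<float> & dst, int nrows_dst,
                          const std::vector<float> & fixup_buf, int tile_rows, int tile_cols,
                          int row0, int col0, const std::vector<RowMapping> & mapping) {
    for (int j = 0; j < tile_cols; ++j) {
        // Bug (pre-fix behaviour): const int col_dst = col0 + j;
        // Fix: go through the MoE gather mapping, exactly like the main kernel does:
        const int col_dst = mapping[col0 + j].dst_col;
        for (int i = 0; i < tile_rows; ++i) {
            dst[(size_t) col_dst*nrows_dst + row0 + i] += fixup_buf[(size_t) j*tile_rows + i];
        }
    }
}
```

Because only the few tiles that straddle a stream-k seam go through this path, the wrong indices only corrupt a handful of dst columns, which is why the perplexity regression is small and varies with the GPU.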

The bug in question has been on master since #13199, though the impact became larger with #15525, when the upper bound on the number of CUDA blocks was tightened and more stream-k seams ended up in tiles that are not being skipped.
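
As a rough, hedged illustration of why the SM count matters (the even split below is an assumption for illustration, not the exact scheduling logic): stream-k distributes the k-iterations of all output tiles over the launched CUDA blocks, so a block boundary ("seam") can fall in the middle of a tile, and exactly those tiles need the fixup. How many seams land mid-tile depends on how the total iteration count divides by the number of blocks, which follows the SM count.

```cpp
// Hedged sketch: count how many stream-k seams fall inside a tile (and therefore need a
// fixup) for a given tile count, k-iterations per tile, and number of CUDA blocks.
// The even split via b*total/nblocks is an assumption for illustration only.
#include <cstdint>
#include <cstdio>

int count_mid_tile_seams(int64_t ntiles, int64_t iters_per_tile, int64_t nblocks) {
    const int64_t total = ntiles*iters_per_tile;
    int seams = 0;
    for (int64_t b = 1; b < nblocks; ++b) {
        const int64_t start = b*total / nblocks; // first k-iteration owned by block b
        if (start % iters_per_tile != 0) {
            ++seams; // the seam is inside a tile -> this tile needs a fixup
        }
    }
    return seams;
}

int main() {
    // Example numbers only: with a power-of-2 block count the seams happen to line up with
    // tile boundaries here, with 82 or 170 blocks (RTX 3090 / RTX 5090 SM counts) they do not.
    printf("128 blocks: %d mid-tile seams\n", count_mid_tile_seams(512, 4, 128));
    printf(" 82 blocks: %d mid-tile seams\n", count_mid_tile_seams(512, 4,  82));
    printf("170 blocks: %d mid-tile seams\n", count_mid_tile_seams(512, 4, 170));
    return 0;
}
```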

@JohannesGaessler merged commit e14e842 into ggml-org:master on Nov 8, 2025
56 of 60 checks passed

Labels: ggml, Nvidia GPU
