
Conversation

@JohannesGaessler
Collaborator

See the discussion starting with ikawrakow/ik_llama.cpp#728 (comment): the use of the MMQ MoE optimizations is resulting in increased perplexity on master. The problem is that the wrong indices are being used when determining which dst columns should receive the stream-k fixup. This is a very typical bug that I encounter during development, but unfortunately this is one of the rare cases where the impact is small enough to be overlooked. Generally speaking, the impact is largest for the combination of small models and large GPUs whose SM count is not a power of 2 (the RTX 4090, for example, has 128 SMs, which is why it is essentially unaffected in the table below). Example models (an illustrative sketch of the index fix follows the table):

| Model | GPU | PPL master | PPL PR |
| --- | --- | --- | --- |
| GraniteMoe 3b q4_0 | RTX 3090 | 10.5577 | 10.0799 |
| GraniteMoe 3b q4_0 | RTX 4090 | 10.0879 | 10.0877 |
| GraniteMoe 3b q4_0 | RTX 5090 | 15.4910 | 10.0853 |
| Qwen 3 30b q4_0 | RTX 3090 | 9.3418 | 9.2930 |
| Qwen 3 30b q4_0 | RTX 4090 | 9.2928 | 9.2928 |
| Qwen 3 30b q4_0 | RTX 5090 | 9.4052 | 9.2930 |
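
To illustrate the nature of the problem, here is a minimal host-side sketch with invented names (`RowMapping`, `fixup_buf`, `apply_stream_k_fixup` are not the actual ggml-cuda symbols): in an MoE matmul the dst columns handled by a tile are not contiguous, they are gathered through an expert-to-token index, so when the stream-k fixup adds a partial tile back into dst it has to resolve the column through the same indirection. Using the tile-local column index directly applies the correction to the wrong dst columns.

```cpp
// Hedged sketch, not the actual ggml-cuda code: all names here are invented.
// It only illustrates why the stream-k fixup for an MoE tile must resolve dst
// columns through the same expert->token mapping that the main kernel used.
#include <cstddef>
#include <vector>

struct RowMapping {
    int dst_col; // dst column that tile-local column (col0 + j) maps to
};

// Add a partial tile ("fixup") produced by the CUDA block that only computed part of the
// tile's k-range to the final result. dst is column-major with nrows_dst rows per column.
void apply_stream_k_fixup(std::vector<float> & dst, int nrows_dst,
                          const std::vector<float> & fixup_buf, int tile_rows, int tile_cols,
                          int row0, int col0, const std::vector<RowMapping> & mapping) {
    for (int j = 0; j < tile_cols; ++j) {
        // Bug (pre-fix behaviour): const int col_dst = col0 + j;
        // Fix: go through the MoE gather mapping, exactly like the main kernel does:
        const int col_dst = mapping[col0 + j].dst_col;
        for (int i = 0; i < tile_rows; ++i) {
            dst[(size_t) col_dst*nrows_dst + row0 + i] += fixup_buf[(size_t) j*tile_rows + i];
        }
    }
}
```

Because only the few tiles that straddle a stream-k seam go through this path, the wrong indices only corrupt a handful of dst columns, which is why the perplexity regression is small and varies with the GPU.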

The bug in question has been on master since #13199, though the impact became larger with #15525, when the upper bound on the number of CUDA blocks was tightened and more stream-k seams ended up in tiles that are not being skipped.
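
As a rough, hedged illustration of why the SM count matters (the even split below is an assumption for illustration, not the exact scheduling logic): stream-k distributes the k-iterations of all output tiles over the launched CUDA blocks, so a block boundary ("seam") can fall in the middle of a tile, and exactly those tiles need the fixup. How many seams land mid-tile depends on how the total iteration count divides by the number of blocks, which follows the SM count.

```cpp
// Hedged sketch: count how many stream-k seams fall inside a tile (and therefore need a
// fixup) for a given tile count, k-iterations per tile, and number of CUDA blocks.
// The even split via b*total/nblocks is an assumption for illustration only.
#include <cstdint>
#include <cstdio>

int count_mid_tile_seams(int64_t ntiles, int64_t iters_per_tile, int64_t nblocks) {
    const int64_t total = ntiles*iters_per_tile;
    int seams = 0;
    for (int64_t b = 1; b < nblocks; ++b) {
        const int64_t start = b*total / nblocks; // first k-iteration owned by block b
        if (start % iters_per_tile != 0) {
            ++seams; // the seam is inside a tile -> this tile needs a fixup
        }
    }
    return seams;
}

int main() {
    // Example numbers only: with a power-of-2 block count the seams happen to line up with
    // tile boundaries here, with 82 or 170 blocks (RTX 3090 / RTX 5090 SM counts) they do not.
    printf("128 blocks: %d mid-tile seams\n", count_mid_tile_seams(512, 4, 128));
    printf(" 82 blocks: %d mid-tile seams\n", count_mid_tile_seams(512, 4,  82));
    printf("170 blocks: %d mid-tile seams\n", count_mid_tile_seams(512, 4, 170));
    return 0;
}
```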

@JohannesGaessler merged commit e14e842 into ggml-org:master on Nov 8, 2025
56 of 60 checks passed

Labels: ggml, Nvidia GPU
