CUDA: only use moe_expert_reduce when n_tokens=1 #17032
base: master
Conversation
@slaren is there a way to detect that a buffer might be overridden?
No. I am not sure what you are trying to do, but what you are asking is something that the backend should not be concerned with.
I'm trying to turn off fusion if there is |
That would be a workaround, not an actual solution. We need to find the source of the problem and fix that. I mentioned before that I suspect that
That's not the problem here, at least. I'm thinking it might be something to do with the different sizes of the tensors between the MMQ buffer and the CPU weights. MMQ does some padding internally to avoid boundary checks.
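For context, a minimal sketch of what that padding means for buffer sizes (the tile size and function here are illustrative assumptions, not the real MMQ constants or code): the device-side allocation can end up larger than the CPU weight tensor it mirrors.

```cpp
// Illustration only, not the actual MMQ code: row counts are rounded up to a
// tile multiple so the kernels can skip per-row bounds checks.
#include <cstdint>
#include <cstdio>

static int64_t pad_to_multiple(int64_t n, int64_t mult) {
    return ((n + mult - 1) / mult) * mult;
}

int main() {
    const int64_t n_rows    = 4097;  // hypothetical weight-row count
    const int64_t tile_rows = 64;    // assumed tile height for the example
    // Padded device allocation (4160 rows) is larger than the CPU-side tensor (4097 rows).
    printf("padded rows: %lld\n", (long long) pad_to_multiple(n_rows, tile_rows));
    return 0;
}
```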
Interestingly this bug only manifests when there is
Can you please specify the repro more closely? Does it happen in the prefill phase or the token-generation phase? The default behavior for multi-GPU and split-GPU is that we split the cgraph into multiple subgraphs. This will trigger the consecutive-update check, which effectively disables CUDA graphs from the 2nd/3rd call to the main model onwards (see the sketch after this comment): llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu Lines 3550 to 3555 in 22c8c3c
Do you observe a repro when launching with
This is worrisome.
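For readers following along, here is a rough, self-contained sketch of the consecutive-update check referenced above; the field names and the threshold are illustrative, not the exact ones at the linked lines.

```cpp
// Simplified stand-in for the consecutive-update heuristic: if the graph
// topology keeps changing across evaluations, stop attempting CUDA graph
// capture and fall back to regular kernel launches.
#include <cstdio>

struct cuda_graph_state {
    int  consecutive_updates = 0;
    bool disabled            = false;
};

// Returns whether CUDA graph capture should still be attempted this evaluation.
static bool use_cuda_graph(cuda_graph_state & st, bool topology_changed, int max_updates = 4) {
    if (topology_changed) {
        if (++st.consecutive_updates >= max_updates) {
            st.disabled = true;  // multi-GPU cgraph splits tend to trip this
        }
    } else {
        st.consecutive_updates = 0;
    }
    return !st.disabled;
}

int main() {
    cuda_graph_state st;
    for (int i = 0; i < 6; ++i) {
        printf("eval %d: cuda graphs %s\n", i,
               use_cuda_graph(st, /*topology_changed=*/true) ? "enabled" : "disabled");
    }
    return 0;
}
```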
Yes, I can repro with
Also it goes away with
I would recommend verifying via nsys/printf, but in that case it is not a CUDA graph issue.
Have you inspected the graph after it is split, but before it is fused? Maybe we split around/inside the node pattern we match for in the fusion? Is fusion correctly disabled then?
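One way to do that inspection, sketched under the assumption that the public ggml graph accessors (ggml_graph_n_nodes, ggml_graph_node, ggml_op_name) are available where the scheduler hands the split to the backend:

```cpp
// Hypothetical debugging helper, not existing project tooling: print every
// node of a (sub)graph so it is easy to see whether the fused node pattern
// (e.g. the MoE expert-reduce sequence) straddles a split boundary.
#include "ggml.h"
#include <cstdio>

static void dump_graph_nodes(struct ggml_cgraph * cgraph, const char * tag) {
    const int n_nodes = ggml_graph_n_nodes(cgraph);
    printf("=== %s: %d nodes ===\n", tag, n_nodes);
    for (int i = 0; i < n_nodes; ++i) {
        struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        printf("  [%3d] %-16s %s\n", i, ggml_op_name(node->op), node->name);
    }
}
```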
The problem starts with llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu Lines 4137 to 4143 in 2759ccd
Although I don't have enough expertise in ggml graphs, they looked fine to me. Attaching both a good graph run (2x 4090) and a bad graph run (1x 4090, 1x 5090) with
Yup just spotted
Selectively offloading layer by layer from the back, I see the problem first occurs when offloading a layer between the 4090 and the 5090. EDIT: perhaps unsurprisingly, offloading just that layer also causes the same problem.


When doing `-ot ".ffn_(down)_exps.=CPU"`, this kernel produces garbage output for tokens > 1. It may be related to CUDA graph capture when using `-ot`; I will try to investigate more. For now, this fixes it.