
Conversation

@am17an (Collaborator, Author) commented Nov 5, 2025

When using -ot ".ffn_(down)_exps.=CPU", this kernel produces garbage output for tokens > 1. It may be related to CUDA graph capture when using -ot; I will try to investigate further. For now, this fixes it.

@am17an (Collaborator, Author) commented Nov 5, 2025

@slaren is there a way to detect that a buffer might be overridden? ggml_backend_cuda_split_buffer_type_is_host would be ideal, I guess, but it's not implemented yet.

@slaren (Member) commented Nov 5, 2025

> @slaren is there a way to detect that a buffer might be overridden? ggml_backend_cuda_split_buffer_type_is_host would be ideal, I guess, but it's not implemented yet.

No. I am not sure what you are trying to do, but what you are asking is something that the backend should not be concerned with.

@am17an (Collaborator, Author) commented Nov 5, 2025

> No. I am not sure what you are trying to do, but what you are asking is something that the backend should not be concerned with.

I'm trying to turn off fusion if --ot is involved in any of the fused tensors; we have a similar check for split buffers.
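
Roughly the shape of the check I mean — just a sketch; the helper name is made up, and it assumes ggml_backend_buffer_is_host() from ggml-backend.h is a good enough proxy for a weight that was overridden to the CPU:

#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical helper (not in the tree): bail out of fusion if the node or any
// of its sources lives in a host (CPU) buffer, which is what -ot ...=CPU produces.
static bool fusion_inputs_are_device_resident(const struct ggml_tensor * node) {
    if (node->buffer && ggml_backend_buffer_is_host(node->buffer)) {
        return false;
    }
    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        const struct ggml_tensor * src = node->src[i];
        if (src && src->buffer && ggml_backend_buffer_is_host(src->buffer)) {
            return false;
        }
    }
    return true;
}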

@slaren (Member) commented Nov 5, 2025

That would be a workaround, not an actual solution. We need to find the source of the problem and fix that. I mentioned before that I suspect ggml_node_get_use_count may not work properly when ggml_backend_sched replaces a node; I suggest checking that fusion is not being incorrectly enabled with some combinations of -ot when the intermediate tensors are still needed.
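
Concretely, the invariant fusion relies on is roughly the following (a sketch — ggml_node_get_use_count is real, but treat the other names and the exact signature as approximate):

#include "ggml.h"
#include "ggml-impl.h"

// Sketch: an intermediate node may only be fused away if nothing outside the
// fused group needs its result. If ggml_backend_sched rewrites the graph
// (e.g. because of -ot / offloading), this is the condition that has to be re-checked.
static bool intermediate_can_be_fused(const struct ggml_cgraph * cgraph, int node_idx) {
    const struct ggml_tensor * node = cgraph->nodes[node_idx];

    // consumed exactly once, by the next node in the fused pattern
    if (ggml_node_get_use_count(cgraph, node_idx) != 1) {
        return false;
    }
    // and not requested as a graph output
    if (node->flags & GGML_TENSOR_FLAG_OUTPUT) {
        return false;
    }
    return true;
}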

github-actions bot added the labels "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Nov 5, 2025

@am17an (Collaborator, Author) commented Nov 5, 2025

That's not the problem here, at least. I'm thinking it might be something to do with the different sizes of the tensors between the MMQ buffer and the CPU weights: MMQ does some padding internally to avoid boundary checks.
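
What I mean by the size difference, roughly — a sketch modeled on the CUDA buffer type's alloc-size padding; the constant and the exact placement may not match the current code:

#include "ggml.h"

// Assumed padding constant for this sketch; the real value lives in the CUDA backend.
#define MATRIX_ROW_PADDING 512

// The CUDA/MMQ side over-allocates quantized rows so the kernels can skip boundary
// checks, while a plain CPU buffer holds exactly ggml_nbytes(tensor). A fused kernel
// that mixes the two views could read past the smaller allocation.
static size_t padded_alloc_size(const struct ggml_tensor * tensor) {
    size_t size = ggml_nbytes(tensor);
    const int64_t ne0 = tensor->ne[0];
    if (ggml_is_quantized(tensor->type) && ne0 % MATRIX_ROW_PADDING != 0) {
        size += ggml_row_size(tensor->type, MATRIX_ROW_PADDING - ne0 % MATRIX_ROW_PADDING);
    }
    return size;
}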

@am17an (Collaborator, Author) commented Nov 6, 2025

Interestingly, this bug only manifests when there is:

  • Multi-GPU
  • At least one of the GPUs is a Blackwell

@ORippler (Contributor) commented Nov 6, 2025

@am17an

Can you please specify the repro more precisely? Does it happen in the prefill phase or the token-generation phase? The default behavior for multi-GPU and split-GPU is that we split the cgraph into multiple subgraphs. This will trigger the consecutive-update check, which will effectively disable CUDA graphs from the 2nd/3rd call to the main model onwards:

// Disable CUDA graphs (from the next token) if the use-case is demanding too many consecutive graph updates.
if (use_cuda_graph && cuda_graph_update_required) {
    cuda_ctx->cuda_graph->number_consecutive_updates++;
} else {
    cuda_ctx->cuda_graph->number_consecutive_updates = 0;
}
CUDA graphs should moreover be disabled when the batch size is > 1 (unless the heuristic fails to trigger because the split graph does not contain an addition operation).
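
For reference, the two follow-up checks look roughly like this (a sketch from memory of ggml-cuda.cu — the threshold and the ADD-based batch detection may differ in detail):

// 1) Once the counter above has grown past a small threshold (assumed to be 4 here),
//    CUDA graphs are disabled for good for this context.
if (cuda_ctx->cuda_graph->number_consecutive_updates >= 4) {
    cuda_ctx->cuda_graph->disable_due_to_too_many_updates = true;
}

// 2) Batch-size detection while scanning the nodes of a split: an ADD whose second
//    operand spans more than one row is taken as evidence of batch size > 1, so the
//    capture is skipped. If the split handed to this GPU contains no such ADD, the
//    heuristic never fires.
if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1) {
    use_cuda_graph = false;
}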

Do you observe a repro when launching with GGML_CUDA_DISABLE_GRAPHS=1?

> At least one of the GPUs is a Blackwell

This is worrisome.

@am17an (Collaborator, Author) commented Nov 6, 2025

> Do you observe a repro when launching with GGML_CUDA_DISABLE_GRAPHS=1?

Yes, I can repro with GGML_CUDA_DISABLE_GRAPHS=1. It goes away with GGML_CUDA_DISABLE_FUSION=1, and also when just skipping the moe_expert_reduce kernel. It also goes away without --ot, i.e. fully offloaded, with a Blackwell too.
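
(For context, both kill switches are, as far as I can tell, plain environment-variable checks read once per context — a rough sketch, not verbatim:)

#include <cstdlib>

// Sketch of the assumed wiring: each toggle is just a getenv() check, which is why
// they are convenient for bisecting between graph capture and kernel fusion.
static bool cuda_graphs_disabled_by_env() {
    return std::getenv("GGML_CUDA_DISABLE_GRAPHS") != nullptr;
}

static bool cuda_fusion_disabled_by_env() {
    return std::getenv("GGML_CUDA_DISABLE_FUSION") != nullptr;
}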

@am17an (Collaborator, Author) commented Nov 6, 2025

Also, it goes away with -ub 1, which is technically what this PR is doing as well.

@ORippler (Contributor) commented Nov 6, 2025

> Yes, I can repro with GGML_CUDA_DISABLE_GRAPHS=1

I would recommend verifying via nsys/printf, but in that case it is not a CUDA graph issue.

> It also goes away without --ot, i.e. fully offloaded, with a Blackwell too

Have you inspected the graph after it is split, but before it is fused? Maybe we split around/inside the node pattern we match for in the fusion? Is fusion correctly disabled in that case?

@am17an (Collaborator, Author) commented Nov 6, 2025

The problem starts with -ub 32, which is where we start to offload; until then it's all fine. So it's likely not a CUDA graph thing:

static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    const int min_batch_size = 32;

    return get_op_batch_size(op) >= min_batch_size;

    GGML_UNUSED(dev);
}
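
For context, get_op_batch_size looks roughly like this (reproduced from memory, so treat it as approximate):

// "Batch size" of an op, as seen by the offload heuristic: for matmuls it is the
// number of activation columns, so with -ub >= 32 the weights overridden to the CPU
// start being offloaded back to the GPU for the large matmuls.
static int64_t get_op_batch_size(const ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_GET_ROWS:
            return 0;
        case GGML_OP_MUL_MAT:
            return op->ne[1];
        case GGML_OP_MUL_MAT_ID:
        case GGML_OP_ROPE:
            return op->ne[2];
        default:
            return ggml_nrows(op);
    }
}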

@am17an (Collaborator, Author) commented Nov 6, 2025

> Have you inspected the graph after it is split, but before it is fused? Maybe we split around/inside the node pattern we match for in the fusion? Is fusion correctly disabled in that case?

Although I don't have much expertise in ggml graphs, they looked fine to me. Attaching both a good graph run (2x 4090) and a bad graph run (1x 4090 + 1x 5090), captured with GGML_SCHED_DEBUG=2:

bad.txt
good.txt

@ORippler (Contributor) commented Nov 6, 2025

[image] Red is "bad", green is "good".

Seems to me like in "good" we go CUDA0 -> CPU -> CUDA0, while "bad" is CUDA0 -> CPU -> CUDA1. In the CUDA1 case we cannot fuse; not sure about the CUDA0 case.

@am17an (Collaborator, Author) commented Nov 6, 2025

I think you'll find a CUDA0 -> CPU -> CUDA1 in the second, "good" graph as well; since the 5090 has 32 GB (unlike the 4090's 24 GB), the allocations are a bit different.

[image]

@ORippler (Contributor) commented Nov 6, 2025

Yup, just spotted it.

@am17an (Collaborator, Author) commented Nov 6, 2025

Selectively offloading layer by layer from the back, I see the problem first occurs when offloading a layer between the 4090 and the 5090.

EDIT: perhaps unsurprisingly, offloading just that layer also causes the same problem.
