CUDA: only use moe_expert_reduce when n_tokens=1 #17032
base: master
Conversation
@slaren is there a way to detect that a buffer might be overridden?
No. I am not sure what you are trying to do, but what you are asking is something that the backend should not be concerned with.
I'm trying to turn off fusion if there is |
That would be a workaround, not an actual solution. We need to find the source of the problem and fix that. I mentioned before that I suspect that
That's not the problem here, at least. I'm thinking it might be something to do with the different sizes of the tensors between the MMQ buffer and the CPU weights. MMQ does some padding internally to avoid boundary checks.
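For context, a minimal sketch of what that padding means for buffer sizes (the tile size and function here are illustrative assumptions, not the real MMQ constants or code): the device-side allocation can end up larger than the CPU weight tensor it mirrors.

```cpp
// Illustration only, not the actual MMQ code: row counts are rounded up to a
// tile multiple so the kernels can skip per-row bounds checks.
#include <cstdint>
#include <cstdio>

static int64_t pad_to_multiple(int64_t n, int64_t mult) {
    return ((n + mult - 1) / mult) * mult;
}

int main() {
    const int64_t n_rows    = 4097;  // hypothetical weight-row count
    const int64_t tile_rows = 64;    // assumed tile height for the example
    // Padded device allocation (4160 rows) is larger than the CPU-side tensor (4097 rows).
    printf("padded rows: %lld\n", (long long) pad_to_multiple(n_rows, tile_rows));
    return 0;
}
```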
Interestingly this bug only manifests when there is
Can you please specify the repro more closely? Does it happen in the prefill phase or the token-generation phase? The default behavior for multi-GPU and split-GPU is that we split the cgraph into multiple subgraphs. This will trigger the consecutive-update check, which effectively disables CUDA graphs from the 2nd/3rd call to the main model onwards (see the sketch after this comment): llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu Lines 3550 to 3555 in 22c8c3c
Do you observe a repro when launching with
This is worrisome.
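For readers following along, here is a rough, self-contained sketch of the consecutive-update check referenced above; the field names and the threshold are illustrative, not the exact ones at the linked lines.

```cpp
// Simplified stand-in for the consecutive-update heuristic: if the graph
// topology keeps changing across evaluations, stop attempting CUDA graph
// capture and fall back to regular kernel launches.
#include <cstdio>

struct cuda_graph_state {
    int  consecutive_updates = 0;
    bool disabled            = false;
};

// Returns whether CUDA graph capture should still be attempted this evaluation.
static bool use_cuda_graph(cuda_graph_state & st, bool topology_changed, int max_updates = 4) {
    if (topology_changed) {
        if (++st.consecutive_updates >= max_updates) {
            st.disabled = true;  // multi-GPU cgraph splits tend to trip this
        }
    } else {
        st.consecutive_updates = 0;
    }
    return !st.disabled;
}

int main() {
    cuda_graph_state st;
    for (int i = 0; i < 6; ++i) {
        printf("eval %d: cuda graphs %s\n", i,
               use_cuda_graph(st, /*topology_changed=*/true) ? "enabled" : "disabled");
    }
    return 0;
}
```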
Yes, I can repro with
Also it goes away with
I would recommend verifying via nsys/printf, but in that case it is not a CUDA graph issue.
Have you inspected the graph after it is split, but before it is fused? Maybe we split around/inside the node pattern we match for in the fusion? Is fusion correctly disabled then?
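One way to do that inspection, sketched under the assumption that the public ggml graph accessors (ggml_graph_n_nodes, ggml_graph_node, ggml_op_name) are available where the scheduler hands the split to the backend:

```cpp
// Hypothetical debugging helper, not existing project tooling: print every
// node of a (sub)graph so it is easy to see whether the fused node pattern
// (e.g. the MoE expert-reduce sequence) straddles a split boundary.
#include "ggml.h"
#include <cstdio>

static void dump_graph_nodes(struct ggml_cgraph * cgraph, const char * tag) {
    const int n_nodes = ggml_graph_n_nodes(cgraph);
    printf("=== %s: %d nodes ===\n", tag, n_nodes);
    for (int i = 0; i < n_nodes; ++i) {
        struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        printf("  [%3d] %-16s %s\n", i, ggml_op_name(node->op), node->name);
    }
}
```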
The problem starts with llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu Lines 4137 to 4143 in 2759ccd
Although I don't have enough expertise in ggml graphs, they looked fine to me. Attaching both a good graph run (2x 4090) and a bad graph run (1x 4090, 1x 5090) with
Yup just spotted
Selectively offloading layer by layer from the back, I see the problem first occurs when offloading a layer between the 4090 and the 5090. EDIT: perhaps unsurprisingly, offloading just that layer also causes the same problem.


When doing `-ot ".ffn_(down)_exps.=CPU"`, this kernel produces garbage output for tokens > 1. It may be related to CUDA graph capture when using `-ot`; I will try to investigate more. For now, this fixes it.