Conversation

@am17an (Collaborator) commented Nov 4, 2025

Possibly supersedes #16813.

This PR adds support for running concurrent CUDA streams on single-GPU setups.
At the moment this only targets the Q, K, V branch. I feel this is the "correct" approach in case the Q, K, V tensors are of different types or are not in the same place in memory. The downside is that this doesn't come for free and there is some complexity involved; I'm not an expert on the ggml graph, and I feel it could be simplified. The basic fork/join pattern is sketched below.
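To make the idea concrete, here is a minimal sketch of the fork/join pattern with CUDA events (illustrative only; the function and kernel-launch names are hypothetical, not the actual ggml-cuda code): the main stream records a fork event, each branch stream waits on it and runs its part of the Q/K/V branch, and the main stream then waits on per-branch join events before continuing.

```cpp
// Illustrative sketch only - not the actual ggml-cuda implementation.
// Error checking omitted for brevity.
#include <cuda_runtime.h>

void fork_join_example(cudaStream_t main_stream, cudaStream_t branch_streams[3]) {
    cudaEvent_t fork_event;
    cudaEvent_t join_events[3];
    cudaEventCreateWithFlags(&fork_event, cudaEventDisableTiming);

    // Fork: every branch stream waits until the main stream reaches this point.
    cudaEventRecord(fork_event, main_stream);
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(branch_streams[i], fork_event, 0);
        // launch_branch_kernels(branch_streams[i]); // hypothetical: e.g. MUL_MAT + NORM + ROPE for Q/K/V
        cudaEventCreateWithFlags(&join_events[i], cudaEventDisableTiming);
        cudaEventRecord(join_events[i], branch_streams[i]);
    }

    // Join: the main stream resumes only after all branch streams have finished.
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(main_stream, join_events[i], 0);
        cudaEventDestroy(join_events[i]);
    }
    cudaEventDestroy(fork_event);
}
```

With CUDA graphs enabled, this whole fork/join section is captured once and replayed, so the extra event and launch overhead is mostly paid at capture time.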

Currently this is gated behind an environment variable. To enable it, set GGML_CUDA_GRAPH_OPT=1.

The TG performance gain is larger than in the previous PR (2-7%), probably because we parallelize MUL_MAT + NORM + ROPE rather than just MUL_MAT. We still leave some performance on the table because operations are not fused within the parallel streams themselves (e.g. MUL_MAT + BIAS, RMS_NORM + MUL, etc.); I couldn't find a simple enough way to enable fusion there.

Before:

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 172.10 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 164.89 ± 0.07 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 162.47 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 124.67 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 121.77 ± 0.21 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 121.21 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 210.46 ± 0.07 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 207.49 ± 0.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 205.36 ± 0.03 |

After:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 181.60 ± 0.11 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 173.92 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 170.95 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 128.16 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 125.28 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 124.18 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 214.24 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 211.05 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 208.83 ± 0.03 |

TODO:

  • Enable fusion within a stream
  • Add tests?

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 4, 2025
@JohannesGaessler (Collaborator) commented:
Sorry, I wanted to tell you this but I forgot: a long time ago I tried something similar, see #4719. There the performance did not improve; I think the reason was the lack of CUDA graphs to reduce the overhead.

@am17an (Collaborator, Author) commented Nov 4, 2025

Yeah, I think CUDA graphs are essential for this to work (hence this PR only looks at batch_size=1).
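For context, a minimal sketch of why graph capture matters here (hypothetical names, not the backend's actual code): all kernel launches for one token, including a multi-stream section, are captured once into a CUDA graph and then replayed with a single cudaGraphLaunch per token, so the per-kernel launch overhead that would otherwise dominate at batch size 1 is amortized.

```cpp
// Illustrative sketch only: capture a stream's work into a CUDA graph and replay it,
// so per-kernel launch overhead is paid once at capture time. Error checking omitted.
#include <cuda_runtime.h>

void capture_and_replay(cudaStream_t stream, int n_tokens) {
    cudaGraph_t     graph;
    cudaGraphExec_t graph_exec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed);
    // enqueue_decode_kernels(stream); // hypothetical: all kernels for one token, possibly forked across streams
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&graph_exec, graph, 0);

    for (int t = 0; t < n_tokens; ++t) {
        cudaGraphLaunch(graph_exec, stream); // one launch replays the whole captured graph
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
}
```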

@IMbackK (Collaborator) commented Nov 6, 2025

Minimal changes to make this work on hip:

```diff
diff --git a/ggml/src/ggml-cuda/common.cuh b/ggml/src/ggml-cuda/common.cuh
index d3153f430..28bafb84e 100644
--- a/ggml/src/ggml-cuda/common.cuh
+++ b/ggml/src/ggml-cuda/common.cuh
@@ -25,6 +25,7 @@
 #include <cfloat>
 #include <cstdio>
 #include <string>
+#include <unordered_map>
 #include <vector>
 
 #if defined(GGML_USE_HIP)
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index f69b99b2a..f0df4a9a9 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -3514,7 +3514,7 @@ static void evaluate_and_capture_cuda_graph(ggml_backend_cuda_context * cuda_ctx
 
                         for (int i = 1; i <= concurrent_event->n_streams; ++i) {
                             cudaStream_t stream = cuda_ctx->stream(cuda_ctx->device, i);
-                            CUDA_CHECK(cudaStreamWaitEvent(stream, concurrent_event->fork_event));
+                            cudaStreamWaitEvent(stream, concurrent_event->fork_event);
                         }
 
                         is_concurrent_event_active = true;
diff --git a/ggml/src/ggml-cuda/vendors/hip.h b/ggml/src/ggml-cuda/vendors/hip.h
index 890c10364..b7d6edf7f 100644
--- a/ggml/src/ggml-cuda/vendors/hip.h
+++ b/ggml/src/ggml-cuda/vendors/hip.h
@@ -105,7 +105,7 @@
 #define cudaStreamNonBlocking hipStreamNonBlocking
 #define cudaStreamPerThread hipStreamPerThread
 #define cudaStreamSynchronize hipStreamSynchronize
-#define cudaStreamWaitEvent(stream, event, flags) hipStreamWaitEvent(stream, event, flags)
+#define cudaStreamWaitEvent hipStreamWaitEvent
 #define cudaGraphExec_t hipGraphExec_t
 #define cudaGraphNode_t hipGraphNode_t
 #define cudaKernelNodeParams hipKernelNodeParams
```

If used for real, the cudaStreamWaitEvent error needs to be handled, of course.
hipEventCreateWithFlags is also nodiscard, so its result needs to be handled as well.
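For illustration, one possible shape of that handling (a sketch only; it assumes CUDA_CHECK resolves correctly against the HIP error type via vendors/hip.h, and the event-creation call site shown here is hypothetical, not part of the diff above):

```cpp
// Sketch only, not the final fix: restore the status check once the macro mapping is settled.
CUDA_CHECK(cudaStreamWaitEvent(stream, concurrent_event->fork_event));

// hipEventCreateWithFlags is nodiscard, so its result has to be consumed wherever
// the fork event is created (exact call site not shown in the diff above):
CUDA_CHECK(cudaEventCreateWithFlags(&concurrent_event->fork_event, cudaEventDisableTiming));
```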

With -DGGML_HIP_GRAPHS=On and GGML_CUDA_GRAPH_OPT=1 in the environment, this seems performance-neutral on MI100:

| Model | Test | t/s b6955 | t/s pr | Speedup |
| --- | --- | --- | --- | --- |
| llama 13B Q8_0 | tg128 | 29.07 | 29.06 | 1.00 |

@am17an (Collaborator, Author) commented Nov 6, 2025

The almost identical numbers make me think this change is not actually launching the extra streams; I would expect a shift in performance one way or the other.

@IMbackK (Collaborator) commented Nov 6, 2025

Yeah, I'll run a trace on it later.
