Conversation

@am17an (Collaborator) commented Nov 4, 2025

Possibly supersedes #16813.

This PR adds support for running concurrent CUDA streams on single-GPU setups.
At the moment this only targets the Q, K, V branch. I feel this is the "correct" approach in case the Q, K, V tensors are of different types or are not in the same place in memory. The downside is that this doesn't come for free and there is some complexity involved; I'm not an expert on the ggml graph, and I feel it could be simplified. The basic fork/join pattern is sketched below.
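To make the idea concrete, here is a minimal sketch of the fork/join pattern with CUDA events (illustrative only; the function and kernel-launch names are hypothetical, not the actual ggml-cuda code): the main stream records a fork event, each branch stream waits on it and runs its part of the Q/K/V branch, and the main stream then waits on per-branch join events before continuing.

```cpp
// Illustrative sketch only - not the actual ggml-cuda implementation.
// Error checking omitted for brevity.
#include <cuda_runtime.h>

void fork_join_example(cudaStream_t main_stream, cudaStream_t branch_streams[3]) {
    cudaEvent_t fork_event;
    cudaEvent_t join_events[3];
    cudaEventCreateWithFlags(&fork_event, cudaEventDisableTiming);

    // Fork: every branch stream waits until the main stream reaches this point.
    cudaEventRecord(fork_event, main_stream);
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(branch_streams[i], fork_event, 0);
        // launch_branch_kernels(branch_streams[i]); // hypothetical: e.g. MUL_MAT + NORM + ROPE for Q/K/V
        cudaEventCreateWithFlags(&join_events[i], cudaEventDisableTiming);
        cudaEventRecord(join_events[i], branch_streams[i]);
    }

    // Join: the main stream resumes only after all branch streams have finished.
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(main_stream, join_events[i], 0);
        cudaEventDestroy(join_events[i]);
    }
    cudaEventDestroy(fork_event);
}
```

With CUDA graphs enabled, this whole fork/join section is captured once and replayed, so the extra event and launch overhead is mostly paid at capture time.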

Currently this is gated behind an environment variable. To enable it, set GGML_CUDA_GRAPH_OPT=1.

The TG performance gain is larger than in the previous PR (2-7%), probably because we parallelize MUL_MAT + NORM + ROPE rather than just MUL_MAT. We still leave some performance on the table because operations are not fused within the parallel streams themselves (e.g. MUL_MAT + BIAS, RMS_NORM + MUL, etc.); I couldn't find a simple enough way to enable fusion there.

Before:

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 172.10 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 164.89 ± 0.07 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 162.47 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 124.67 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 121.77 ± 0.21 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 121.21 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 210.46 ± 0.07 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 207.49 ± 0.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 205.36 ± 0.03 |

After:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 181.60 ± 0.11 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 173.92 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 170.95 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 128.16 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 125.28 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 124.18 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 214.24 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 211.05 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 208.83 ± 0.03 |

TODO:

  • Enable fusion within a stream
  • Add tests?

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 4, 2025
@JohannesGaessler (Collaborator) commented:
Sorry, I wanted to tell you this but I forgot: a long time ago I tried something similar, see #4719. There the performance did not improve; I think the reason was the lack of CUDA graphs to reduce the overhead.

@am17an (Collaborator, Author) commented Nov 4, 2025

Yeah, I think CUDA graphs are essential for this to work (hence this PR only looks at batch_size=1).
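For context, a minimal sketch of why graph capture matters here (hypothetical names, not the backend's actual code): all kernel launches for one token, including a multi-stream section, are captured once into a CUDA graph and then replayed with a single cudaGraphLaunch per token, so the per-kernel launch overhead that would otherwise dominate at batch size 1 is amortized.

```cpp
// Illustrative sketch only: capture a stream's work into a CUDA graph and replay it,
// so per-kernel launch overhead is paid once at capture time. Error checking omitted.
#include <cuda_runtime.h>

void capture_and_replay(cudaStream_t stream, int n_tokens) {
    cudaGraph_t     graph;
    cudaGraphExec_t graph_exec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed);
    // enqueue_decode_kernels(stream); // hypothetical: all kernels for one token, possibly forked across streams
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&graph_exec, graph, 0);

    for (int t = 0; t < n_tokens; ++t) {
        cudaGraphLaunch(graph_exec, stream); // one launch replays the whole captured graph
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
}
```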

@IMbackK (Collaborator) commented Nov 6, 2025

Minimal changes to make this work on hip:

```diff
diff --git a/ggml/src/ggml-cuda/common.cuh b/ggml/src/ggml-cuda/common.cuh
index d3153f430..28bafb84e 100644
--- a/ggml/src/ggml-cuda/common.cuh
+++ b/ggml/src/ggml-cuda/common.cuh
@@ -25,6 +25,7 @@
 #include <cfloat>
 #include <cstdio>
 #include <string>
+#include <unordered_map>
 #include <vector>
 
 #if defined(GGML_USE_HIP)
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index f69b99b2a..f0df4a9a9 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -3514,7 +3514,7 @@ static void evaluate_and_capture_cuda_graph(ggml_backend_cuda_context * cuda_ctx
 
                         for (int i = 1; i <= concurrent_event->n_streams; ++i) {
                             cudaStream_t stream = cuda_ctx->stream(cuda_ctx->device, i);
-                            CUDA_CHECK(cudaStreamWaitEvent(stream, concurrent_event->fork_event));
+                            cudaStreamWaitEvent(stream, concurrent_event->fork_event);
                         }
 
                         is_concurrent_event_active = true;
diff --git a/ggml/src/ggml-cuda/vendors/hip.h b/ggml/src/ggml-cuda/vendors/hip.h
index 890c10364..b7d6edf7f 100644
--- a/ggml/src/ggml-cuda/vendors/hip.h
+++ b/ggml/src/ggml-cuda/vendors/hip.h
@@ -105,7 +105,7 @@
 #define cudaStreamNonBlocking hipStreamNonBlocking
 #define cudaStreamPerThread hipStreamPerThread
 #define cudaStreamSynchronize hipStreamSynchronize
-#define cudaStreamWaitEvent(stream, event, flags) hipStreamWaitEvent(stream, event, flags)
+#define cudaStreamWaitEvent hipStreamWaitEvent
 #define cudaGraphExec_t hipGraphExec_t
 #define cudaGraphNode_t hipGraphNode_t
 #define cudaKernelNodeParams hipKernelNodeParams
```

If used for real, the cudaStreamWaitEvent error needs to be handled, of course.
hipEventCreateWithFlags is also nodiscard, so its result needs to be handled as well.
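For illustration, one possible shape of that handling (a sketch only; it assumes CUDA_CHECK resolves correctly against the HIP error type via vendors/hip.h, and the event-creation call site shown here is hypothetical, not part of the diff above):

```cpp
// Sketch only, not the final fix: restore the status check once the macro mapping is settled.
CUDA_CHECK(cudaStreamWaitEvent(stream, concurrent_event->fork_event));

// hipEventCreateWithFlags is nodiscard, so its result has to be consumed wherever
// the fork event is created (exact call site not shown in the diff above):
CUDA_CHECK(cudaEventCreateWithFlags(&concurrent_event->fork_event, cudaEventDisableTiming));
```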

With -DGGML_HIP_GRAPHS=On and GGML_CUDA_GRAPH_OPT=1 in the environment, this seems performance-neutral on MI100:

| Model | Test | t/s b6955 | t/s pr | Speedup |
| --- | --- | --- | --- | --- |
| llama 13B Q8_0 | tg128 | 29.07 | 29.06 | 1.00 |

@am17an (Collaborator, Author) commented Nov 6, 2025

The almost identical numbers make me think this change is not actually launching the extra streams; I would expect a shift in performance one way or the other.

@IMbackK (Collaborator) commented Nov 6, 2025

Yeah, I'll run a trace on it later.
