CUDA: add stream-based concurrency #16991
base: master
Conversation
Force-pushed from 1e97a91 to 1c4d8f3
Sorry, I wanted to tell you this earlier but I forgot: a long time ago I tried something similar, see #4719. There the performance did not improve; I think the reason was the lack of CUDA graphs to reduce the overhead.
Yeah, I think CUDA graphs are essential for this to work (hence this PR only looks at batch_size=1).
Force-pushed from 1c4d8f3 to 70a5a01
Minimal changes to make this work on HIP: if used for real, the cudaStreamWaitEvent error needs to be handled, of course, with:
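A minimal sketch of what checked stream-event calls could look like. The macro below is a simplified stand-in written for this example, not ggml's actual error-handling code; `stream` and `event` are illustrative names:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Simplified stand-in for an error-check macro: print the CUDA error
// string with file/line context and abort on any failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(1);                                                      \
        }                                                                 \
    } while (0)

// The unchecked call mentioned above would then become:
//     CUDA_CHECK(cudaStreamWaitEvent(stream, event, 0));
```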
The almost identical numbers make me think that this change is not actually launching the streams. I would expect a shift in performance, either for the worse or the better.
Yeah, I'll run a trace on it later.
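One way to check whether kernels actually land on multiple streams is a timeline trace with Nsight Systems. A sketch, assuming `nsys` is installed and using illustrative `llama-bench` arguments (model path is a placeholder):

```shell
# Capture a trace and inspect the CUDA stream lanes in the timeline;
# if everything runs on one stream, the side streams were never used.
nsys profile -o trace ./llama-bench -m model.gguf
```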
Possibly supersedes #16813.
This PR adds support to run concurrent CUDA streams on single GPU setups.
At the moment this only targets the Q, K, V branch. I feel this is the "correct" approach in case the Q, K, V tensors are of different types or are not in the same place in memory. The downside is that this approach doesn't come for free: there is some complexity involved, and since I'm not an expert on the ggml graph, I feel it could be simplified.
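The fork-join pattern this kind of stream-based concurrency typically relies on can be sketched as follows. Everything here (kernel, function names, two side streams) is illustrative, not the PR's actual implementation: the main stream records a fork event, side streams wait on it, the Q, K, V branches run concurrently, and the main stream joins on per-branch events before the downstream attention work:

```cuda
// Illustrative fork-join with CUDA streams and events; not the PR's code.
#include <cuda_runtime.h>

__global__ void branch_kernel(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i] * 2.0f; // stand-in for MUL_MAT + NORM + ROPE
}

void run_qkv_concurrently(float *q, float *k, float *v, const float *x, int n,
                          cudaStream_t main_stream, cudaStream_t side[2]) {
    cudaEvent_t fork, join[2];
    cudaEventCreateWithFlags(&fork,    cudaEventDisableTiming);
    cudaEventCreateWithFlags(&join[0], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&join[1], cudaEventDisableTiming);

    // Fork: side streams must not start before prior work on the main stream.
    cudaEventRecord(fork, main_stream);
    cudaStreamWaitEvent(side[0], fork, 0);
    cudaStreamWaitEvent(side[1], fork, 0);

    const int threads = 256, blocks = (n + threads - 1) / threads;
    branch_kernel<<<blocks, threads, 0, main_stream>>>(q, x, n); // Q on main
    branch_kernel<<<blocks, threads, 0, side[0]>>>(k, x, n);     // K on side 0
    branch_kernel<<<blocks, threads, 0, side[1]>>>(v, x, n);     // V on side 1

    // Join: the main stream waits for both side branches to finish.
    cudaEventRecord(join[0], side[0]);
    cudaEventRecord(join[1], side[1]);
    cudaStreamWaitEvent(main_stream, join[0], 0);
    cudaStreamWaitEvent(main_stream, join[1], 0);

    cudaEventDestroy(fork);
    cudaEventDestroy(join[0]);
    cudaEventDestroy(join[1]);
}
```

The event-based fork and join are what a CUDA graph capture can record and replay, which is why the launch overhead of the extra streams only pays off once graphs amortize it.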
Currently this is hidden behind an env variable flag. To run you can use GGML_CUDA_GRAPH_OPT=1.
TG performance gain is larger than in the previous PR (2-7% gain), probably because we parallelize MUL_MAT + NORM + ROPE rather than just MUL_MAT. At the moment we leave some performance on the table by not fusing operations within the parallel streams themselves (e.g. MUL_MAT + BIAS, RMS_NORM + MUL, etc.); I couldn't find a simple enough way to enable fusion there.
Before:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
After:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
TODO: