Replies: 4 comments 1 reply
-
In reality, you rarely (or never) get such cases. Usually a batch consists of one or more large prompts (i.e. the initial requests) which are processed sequentially, since one prompt is enough to saturate the GPU compute capacity. After that, during the generation phase, we get batches with 1 token per request - all of the tokens are processed in a single ubatch.
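For illustration, such a generation-phase batch can be assembled roughly like this with the llama.cpp batch API (a minimal sketch; the `next_tokens` / `n_past` bookkeeping is the caller's own state, not part of the library):

```cpp
#include "llama.h"

#include <vector>

// Generation phase: one pending token per request, all submitted in a single batch.
// next_tokens[i] / n_past[i] are illustrative per-request state kept by the caller.
static void decode_step(llama_context * ctx,
                        const std::vector<llama_token> & next_tokens,
                        const std::vector<llama_pos>   & n_past) {
    const int32_t n_reqs = (int32_t) next_tokens.size();

    llama_batch batch = llama_batch_init(n_reqs, /*embd*/ 0, /*n_seq_max*/ 1);

    for (int32_t i = 0; i < n_reqs; ++i) {
        batch.token   [i]    = next_tokens[i];
        batch.pos     [i]    = n_past[i];   // current position of request i
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = i;           // one sequence per request
        batch.logits  [i]    = true;        // logits are needed for every request
    }
    batch.n_tokens = n_reqs;

    // All n_reqs tokens fit into a single ubatch and are processed together.
    llama_decode(ctx, batch);

    llama_batch_free(batch);
}
```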
-
Yes, normally padding is not required, but I’ve encountered a scenario with multiple concurrent requests where each prompt is quite short (usually no more than 200 tokens). In this case, the prefill for multiple requests gets split into several ubatches, but each ubatch itself isn’t very long.
-
Your understanding is correct. When kv_unified = false, sequences with different prefill lengths are forced into separate streams, which can lead to many small microbatches and effectively serialize execution, as you described.
-
You would in principle get better performance if you process the prompts in a batch. The CUDA backend already has an optimization for FlashAttention that skips the fully masked-out tails of sequences. However, in practice you will not get that much of a speedup unless the prompts are very short. To get a feel for this, try running llama-bench for some constant number of tokens while varying the batch size. Also, the more prompt tokens you process in a batch, the less total time you need, but the longer the delivery of new tokens to preexisting requests is interrupted at a time.
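To make "process the prompts in a batch" concrete, here is a rough sketch of packing several prompts into one `llama_batch`; the `prompts` container and the surrounding handling are assumptions for illustration, not part of the library:

```cpp
#include "llama.h"

#include <vector>

// Prefill several prompts in one batch: the tokens of all requests are packed
// together, each tagged with its own sequence id and position. How this is split
// into ubatches afterwards depends on n_ubatch and on whether the KV cache is unified.
static void prefill_batched(llama_context * ctx,
                            const std::vector<std::vector<llama_token>> & prompts) {
    int32_t n_total = 0;
    for (const auto & p : prompts) {
        n_total += (int32_t) p.size();
    }

    llama_batch batch = llama_batch_init(n_total, /*embd*/ 0, /*n_seq_max*/ 1);

    int32_t k = 0;
    for (int32_t s = 0; s < (int32_t) prompts.size(); ++s) {
        for (int32_t j = 0; j < (int32_t) prompts[s].size(); ++j, ++k) {
            batch.token   [k]    = prompts[s][j];
            batch.pos     [k]    = j;
            batch.n_seq_id[k]    = 1;
            batch.seq_id  [k][0] = s;
            // only the last prompt token of each request needs logits
            batch.logits  [k]    = (j == (int32_t) prompts[s].size() - 1);
        }
    }
    batch.n_tokens = n_total;

    llama_decode(ctx, batch);

    llama_batch_free(batch);
}
```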
-
Since the current default for kv_unified is false, different sequences use different streams. This requires that, when splitting into ubatches, all sequences be cut to the same length. If multiple requests are being processed simultaneously and their prefill lengths differ, this can degrade to serial execution. (For example, imagine 10 requests with prefill lengths ranging from 1 to 10; this would result in 10 ubatches, where each sequence in a ubatch has a length of 1.) This can cause performance issues.
To address this, I plan to pad the prefills so that all sequences end up with the same length. After the prefill stage, I will remove the padded data from the kv_cache according to the actual prefill lengths, and then reset pos to the correct position during decoding.
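Concretely, the cleanup step I have in mind looks roughly like the sketch below. The `Request` struct, `n_padded` and the surrounding bookkeeping are just illustrative names, and the KV-cache removal call may be named differently depending on the llama.cpp version (llama_kv_cache_seq_rm / llama_kv_self_seq_rm / llama_memory_seq_rm):

```cpp
#include "llama.h"

#include <vector>

// Illustrative per-request bookkeeping; not part of llama.cpp.
struct Request {
    llama_seq_id seq_id;
    llama_pos    n_real;  // actual prompt length, without padding
};

// After the padded prefill: drop the padded tail of each sequence from the KV cache,
// so that generation can continue from pos == n_real as if no padding ever existed.
static void trim_padding(llama_context * ctx,
                         const std::vector<Request> & requests,
                         llama_pos n_padded) {
    for (const Request & req : requests) {
        // Remove positions [n_real, n_padded) for this sequence.
        // The exact call is version-dependent (llama_kv_cache_seq_rm /
        // llama_kv_self_seq_rm / llama_memory_seq_rm).
        llama_kv_cache_seq_rm(ctx, req.seq_id, req.n_real, n_padded);
    }
    // The next decoded token of each request is then submitted with pos = n_real.
}
```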
Could you please confirm if this approach is correct? Thanks.