Replies: 4 comments 1 reply
-
In reality, you rarely (or never) get such cases. Usually a batch consists of one or more large prompts (i.e. the initial requests) which are processed sequentially, since one prompt is enough to saturate the GPU compute capacity. After that, during the generation phase, we get batches with 1 token per request - all of the tokens are processed in a single ubatch.
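For illustration, such a generation-phase batch can be assembled roughly like this with the llama.cpp batch API (a minimal sketch; the `next_tokens` / `n_past` bookkeeping is the caller's own state, not part of the library):

```cpp
#include "llama.h"

#include <vector>

// Generation phase: one pending token per request, all submitted in a single batch.
// next_tokens[i] / n_past[i] are illustrative per-request state kept by the caller.
static void decode_step(llama_context * ctx,
                        const std::vector<llama_token> & next_tokens,
                        const std::vector<llama_pos>   & n_past) {
    const int32_t n_reqs = (int32_t) next_tokens.size();

    llama_batch batch = llama_batch_init(n_reqs, /*embd*/ 0, /*n_seq_max*/ 1);

    for (int32_t i = 0; i < n_reqs; ++i) {
        batch.token   [i]    = next_tokens[i];
        batch.pos     [i]    = n_past[i];   // current position of request i
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = i;           // one sequence per request
        batch.logits  [i]    = true;        // logits are needed for every request
    }
    batch.n_tokens = n_reqs;

    // All n_reqs tokens fit into a single ubatch and are processed together.
    llama_decode(ctx, batch);

    llama_batch_free(batch);
}
```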
-
Yes, normally padding is not required, but I’ve encountered a scenario with multiple concurrent requests where each prompt is quite short (usually no more than 200 tokens). In this case, the prefill for multiple requests gets split into several ubatches, but each ubatch itself isn’t very long.
-
Your understanding is correct. When kv_unified = false, sequences with different prefill lengths are forced into separate streams, which can lead to many small microbatches and effectively serialize execution, as you described.
-
You would in principle get better performance if you process the prompts in a batch. The CUDA backend already has an optimization for FlashAttention that skips the fully masked-out tails of sequences. However, in practice you will not get that much of a speedup unless the prompts are very short. To get a feel for this, try running llama-bench for some constant number of tokens while varying the batch size. Also, the more prompt tokens you process in a batch, the less total time you need, but the longer the delivery of new tokens to preexisting requests is interrupted at a time.
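To make "process the prompts in a batch" concrete, here is a rough sketch of packing several prompts into one `llama_batch`; the `prompts` container and the surrounding handling are assumptions for illustration, not part of the library:

```cpp
#include "llama.h"

#include <vector>

// Prefill several prompts in one batch: the tokens of all requests are packed
// together, each tagged with its own sequence id and position. How this is split
// into ubatches afterwards depends on n_ubatch and on whether the KV cache is unified.
static void prefill_batched(llama_context * ctx,
                            const std::vector<std::vector<llama_token>> & prompts) {
    int32_t n_total = 0;
    for (const auto & p : prompts) {
        n_total += (int32_t) p.size();
    }

    llama_batch batch = llama_batch_init(n_total, /*embd*/ 0, /*n_seq_max*/ 1);

    int32_t k = 0;
    for (int32_t s = 0; s < (int32_t) prompts.size(); ++s) {
        for (int32_t j = 0; j < (int32_t) prompts[s].size(); ++j, ++k) {
            batch.token   [k]    = prompts[s][j];
            batch.pos     [k]    = j;
            batch.n_seq_id[k]    = 1;
            batch.seq_id  [k][0] = s;
            // only the last prompt token of each request needs logits
            batch.logits  [k]    = (j == (int32_t) prompts[s].size() - 1);
        }
    }
    batch.n_tokens = n_total;

    llama_decode(ctx, batch);

    llama_batch_free(batch);
}
```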
-
Since the current default for kv_unified is false, different sequences use different streams. This requires that, when splitting into ubatches, all sequences be cut to the same length. If multiple requests are being processed simultaneously and their prefill lengths differ, this can degrade to serial execution. (For example, imagine 10 requests with prefill lengths ranging from 1 to 10; this would result in 10 ubatches, where each sequence in a ubatch has a length of 1.) This can cause performance issues.
To address this, I plan to pad the prefills so that all sequences end up with the same length. After the prefill stage, I will remove the padded data from the kv_cache according to the actual prefill lengths, and then reset pos to the correct position during decoding.
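Concretely, the cleanup step I have in mind looks roughly like the sketch below. The `Request` struct, `n_padded` and the surrounding bookkeeping are just illustrative names, and the KV-cache removal call may be named differently depending on the llama.cpp version (llama_kv_cache_seq_rm / llama_kv_self_seq_rm / llama_memory_seq_rm):

```cpp
#include "llama.h"

#include <vector>

// Illustrative per-request bookkeeping; not part of llama.cpp.
struct Request {
    llama_seq_id seq_id;
    llama_pos    n_real;  // actual prompt length, without padding
};

// After the padded prefill: drop the padded tail of each sequence from the KV cache,
// so that generation can continue from pos == n_real as if no padding ever existed.
static void trim_padding(llama_context * ctx,
                         const std::vector<Request> & requests,
                         llama_pos n_padded) {
    for (const Request & req : requests) {
        // Remove positions [n_real, n_padded) for this sequence.
        // The exact call is version-dependent (llama_kv_cache_seq_rm /
        // llama_kv_self_seq_rm / llama_memory_seq_rm).
        llama_kv_cache_seq_rm(ctx, req.seq_id, req.n_real, n_padded);
    }
    // The next decoded token of each request is then submitted with pos = n_real.
}
```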
Could you please confirm if this approach is correct? Thanks.