
Commit 2f00f36

committed
- some blog fixes
- add a ladder of strategies to the conclusion
1 parent 4a7d096 commit 2f00f36

File tree

2 files changed: +29 −16 lines


blog/2025-09-24_kvcache-wins-you-can-see.md

Lines changed: 29 additions & 16 deletions
@@ -126,7 +126,7 @@ This is precisely what llm-d provides (pun intended). It creates a **global view

The global cache view is built upon a continuous stream of [**`KVEvents`**](https://docs.vllm.ai/en/latest/api/vllm/config/kv_events.html) from each vLLM pod, which are processed efficiently by the open-source [**`llm-d-kv-cache-manager`**](https://github.com/llm-d/llm-d-kv-cache-manager) library.

-The `KVEvents` provide a live feed of all physical cache changes across the cluster, firing every time a cache block is created or evicted. This high-throughput stream is then ingested and organized by the llm-d-kv-cache-manager library's components:
+The `KVEvents` provide a live feed of all physical cache changes across the cluster, firing every time a cache block is created or evicted. This stream is then ingested and organized by the llm-d-kv-cache-manager library's components:

1. **`kvevents.Pool`**: This component consumes the high-throughput stream of events. As it digests them, it continuously updates a low-level **KV-Block Index**, which maintains a simple, real-time map of block-hashes to the pod and memory medium (GPU/CPU) on which each block resides.
2. **`kvcache.Index`**: This is the higher-level index used by the scheduler. It uses the underlying KV-Block Index to map logical sequences of tokens (i.e., prefixes) to the pods that hold them. This provides the direct answer to the question, "what percentage of this request's prefix is on the accessible Pods?"
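
To make the two indexes concrete, here is a minimal, self-contained sketch of the mechanism: block hashes are chained so each hash identifies an entire prefix, KV events populate a block-hash to pods map, and a pod's prefix score is the fraction of a prompt's leading blocks it already holds. The types, hashing scheme, and method names are illustrative assumptions, not the actual `llm-d-kv-cache-manager` API.

```go
// Conceptual sketch only: types, hashing scheme, and method names are
// illustrative assumptions, not the llm-d-kv-cache-manager API.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// blockIndex plays the role of the low-level KV-Block Index: a map from a
// chained block-hash to the set of pods currently holding that block.
type blockIndex map[uint64]map[string]bool

// recordBlock is what a "block stored" KVEvent would drive; a "block removed"
// event would delete the entry instead.
func (idx blockIndex) recordBlock(hash uint64, pod string) {
	if idx[hash] == nil {
		idx[hash] = map[string]bool{}
	}
	idx[hash][pod] = true
}

// chainedHashes hashes the prompt block by block, folding in the previous
// block's hash so that each hash identifies the entire prefix ending there.
func chainedHashes(tokens []int, blockSize int) []uint64 {
	var hashes []uint64
	var prev uint64
	for i := 0; i+blockSize <= len(tokens); i += blockSize {
		h := sha256.New()
		binary.Write(h, binary.LittleEndian, prev)
		for _, t := range tokens[i : i+blockSize] {
			binary.Write(h, binary.LittleEndian, int64(t))
		}
		prev = binary.LittleEndian.Uint64(h.Sum(nil)[:8])
		hashes = append(hashes, prev)
	}
	return hashes
}

// prefixScores answers the scheduler's question: what fraction of this
// prompt's leading blocks does each candidate pod already hold?
func prefixScores(idx blockIndex, tokens []int, blockSize int, pods []string) map[string]float64 {
	hashes := chainedHashes(tokens, blockSize)
	scores := map[string]float64{}
	for _, pod := range pods {
		matched := 0
		for _, h := range hashes {
			if !idx[h][pod] {
				break // longest-prefix semantics: stop at the first missing block
			}
			matched++
		}
		if len(hashes) > 0 {
			scores[pod] = float64(matched) / float64(len(hashes))
		}
	}
	return scores
}

func main() {
	idx := blockIndex{}
	prompt := make([]int, 64) // a 64-token prompt (4 blocks of 16)

	// Simulate KVEvents reporting that pod-a has cached the first 3 blocks.
	for _, h := range chainedHashes(prompt, 16)[:3] {
		idx.recordBlock(h, "pod-a")
	}

	// pod-a scores 0.75, pod-b scores 0.
	fmt.Println(prefixScores(idx, prompt, 16, []string{"pod-a", "pod-b"}))
}
```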
@@ -170,8 +170,8 @@ The four strategies compared:

* **`random-scheduling`**: A naive scheduler, acting as the control group.
* **`load-scheduling`**: A scheduler aware only of load scorers: vLLM queueing \+ kv-cache-utilization.
-* **`estimated-scheduling`**: The default configuration in the intelligent inference scheduling path, extending load-aware scheduling with the [**approximate** prefix-cache scorer](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/).
-  * This plugin builds an estimated-locality index based on routing history.
+* **`approximate-scheduling`**: The default configuration in the intelligent inference scheduling path, extending load-aware scheduling with the [**approximate** prefix-cache scorer](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/).
+  * This plugin builds an approximate-locality index based on routing history.
* **`precise-scheduling`**: The advanced well-lit path described in this post.

This benchmark, therefore, tests the scheduler's ability to efficiently manage the disaggregated KV-cache. In a production environment, if the total cache demand were to exceed the cluster's capacity, an autoscaling system would be responsible for spinning up more replicas to maintain SLOs. Here, we focus on **maximizing the performance of the existing hardware**.
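
The "approximate-locality index based on routing history" works, at a high level, by remembering where previous prefixes were sent rather than observing the cache itself. A rough sketch of that idea under illustrative names (this is not the gateway-api-inference-extension implementation, which hashes prefix chunks and bounds the index, for example with an LRU):

```go
// Rough sketch of an approximate-locality index: remember which pod each
// prefix was routed to and score future requests by how many leading chunks
// a pod has already "seen". Names and chunking are illustrative assumptions.
package main

import "fmt"

type approxIndex struct {
	chunkSize int
	seen      map[string]string // prefix (up to a chunk boundary) -> pod last routed to
}

func newApproxIndex(chunkSize int) *approxIndex {
	return &approxIndex{chunkSize: chunkSize, seen: map[string]string{}}
}

// record is called after a routing decision. There is no feedback from the
// cache itself, which is why the estimate can drift (e.g. after evictions).
func (a *approxIndex) record(prompt, pod string) {
	for i := a.chunkSize; i <= len(prompt); i += a.chunkSize {
		a.seen[prompt[:i]] = pod
	}
}

// score counts how many leading chunks of the prompt were last routed to pod.
func (a *approxIndex) score(prompt, pod string) int {
	matched := 0
	for i := a.chunkSize; i <= len(prompt); i += a.chunkSize {
		if a.seen[prompt[:i]] != pod {
			break
		}
		matched++
	}
	return matched
}

func main() {
	idx := newApproxIndex(8)
	idx.record("You are a helpful SaaS support agent. Ticket #1: login fails", "pod-a")

	next := "You are a helpful SaaS support agent. Ticket #2: billing bug"
	fmt.Println(idx.score(next, "pod-a")) // several shared leading chunks
	fmt.Println(idx.score(next, "pod-b")) // 0
}
```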
@@ -180,18 +180,18 @@ This benchmark, therefore, tests the scheduler's ability to efficiently manage t

The summary table below shows the difference across the key performance indicators.

-| Experiment | Output toks/s | TTFT p90 (s) | TTFT mean (s) | vLLM Wait Queue (mean) |
-| :---- | :---- | :---- | :---- | :---- |
+| Experiment | Output toks/s | TTFT p90 (s) | TTFT mean (s) | vLLM Wait Queue (mean) |
+|:-----------------------| :---- | :---- | :---- | :---- |
| **precise-scheduling** | **8730.0** | **0.542** | **0.298** | **0.1** |
-| estimated-scheduling | 6944.4 | 31.083 | 13.316 | 8.1 |
-| load-scheduling | 4428.7 | 94.865 | 46.987 | 28.9 |
-| random-scheduling | 4428.7 | 92.551 | 45.281 | 27.3 |
+| approximate-scheduling | 6944.4 | 31.083 | 13.316 | 8.1 |
+| load-scheduling | 4428.7 | 94.865 | 46.987 | 28.9 |
+| random-scheduling | 4428.7 | 92.551 | 45.281 | 27.3 |

#### **Time to First Token (TTFT)**

The most dramatic impact was on user-facing latency. `precise-scheduling` delivered a P90 TTFT of just **0.542 seconds**. In contrast, the approximate scheduler took over **31 seconds**, and the cache-blind schedulers took over **90 seconds**.

-* **`precise-scheduling` is 57x faster than `estimated-scheduling`.**
+* **`precise-scheduling` is 57x faster than `approximate-scheduling`.**
* **`precise-scheduling` is over 170x faster than `random-scheduling`.**

This is the difference between an interactive experience and a system that is functionally unusable at scale.
@@ -200,7 +200,7 @@ This is the difference between an interactive experience and a system that is fu

This efficiency in latency directly translates to higher system capacity. `precise-scheduling` achieved a total throughput of **8,730 output tokens/second**. This represents:

-* A **25% increase** over the **`estimated-scheduling`** baseline.
+* A **25% increase** over the **`approximate-scheduling`** baseline.
* Over **double the throughput** of the cache-blind configurations.

This allows you to handle significantly more traffic on the exact same hardware, simply by eliminating the waste of cache misses.
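
For reference, the headline ratios quoted above follow directly from the summary table; a quick arithmetic check (values copied from the table, rounded as in the text):

```go
// Quick arithmetic check of the headline ratios, using the numbers from the
// summary table above.
package main

import "fmt"

func main() {
	// TTFT p90 in seconds and output tokens/s, copied from the table.
	preciseTTFT, approxTTFT, randomTTFT := 0.542, 31.083, 92.551
	preciseTput, approxTput := 8730.0, 6944.4

	fmt.Printf("TTFT speedup vs approximate-scheduling: %.0fx\n", approxTTFT/preciseTTFT) // ~57x
	fmt.Printf("TTFT speedup vs random-scheduling:      %.0fx\n", randomTTFT/preciseTTFT) // ~171x, i.e. over 170x
	fmt.Printf("Throughput gain vs approximate-scheduling: %.1f%%\n", (preciseTput/approxTput-1)*100) // ~25%
}
```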
@@ -211,13 +211,13 @@ This allows you to handle significantly more traffic on the exact same hardware,

<br/><br/>

-The charts above clearly illustrate these wins. The blue line (`precise-scheduling`) maintains the lowest Mean TTFT (tested for up to 50 QPS) and achieves the highest Total Throughput as the request rate increases.
+The charts above clearly illustrate these wins. The blue line (`precise-scheduling`) maintains the lowest Mean TTFT and achieves the highest Total Throughput as the request rate increases.

#### **The "Why": From Saved Work to System Throughput**

The dramatic performance gains seen in the benchmarks are a direct result of **system efficiency**, a difference that is immediately visible in the **real-time Grafana metrics**.

-The following graphs were captured throughout the benchmark runs. Schedulers are shown in order: `precise-scheduling` *(left)*, `estimated-scheduling` *(center)*, and `random-scheduling` *(right)*.
+The following graphs were captured throughout the benchmark runs. Schedulers are shown in order: `precise-scheduling` *(left)*, `approximate-scheduling` *(center)*, and `random-scheduling` *(right)*.

##### **1\. Effective Cache Throughput: Quantifying Saved Work**

@@ -229,19 +229,19 @@ First, we measure the **Effective Cache Throughput** \- the number of prompt **t

<br/><br/>

-The chart clearly shows that `precise-scheduling` sustains a massive and stable throughput of saved work by hitting the prefixes effectively. In the middle, we see `estimate-scheduling` with good but lower efficiency, and on the right, `random-scheduling` saving almost no work.
+The chart clearly shows that `precise-scheduling` sustains a massive and stable throughput of saved work by hitting the prefixes effectively. In the middle, we see `approximate-scheduling` with good but lower efficiency, and on the right, `random-scheduling` saving almost no work.

##### **2\. System State: The Consequence of Efficiency**

-This saved work translates directly into system health. By avoiding prefill bottlenecks, the GPUs can focus on productive decoding. We can see this by comparing the number of "**Waiting**" requests (**queued**) to "**Running**" requests (**in decode**).
+This saved work translates directly into system health. By avoiding prefill bottlenecks, the GPUs can focus on productive decoding. We can see this by comparing the number of "**Waiting**" requests (**queued**) and "**Running**" requests (**in decode**).

![vLLM waiting requests metrics](/img/blogs/kv-cache-wins/image7.png)
<small>*__FIGURE 7__: The number of **waiting requests** in vLLM over the course of the benchmark.*</small>

![vLLM running requests metrics](/img/blogs/kv-cache-wins/image8.png)
<small>*__FIGURE 8__: The number of **running requests** **(decoding)** in vLLM over the course of the benchmark.*</small>

-The **`precise-scheduling`** plots on the left show a stable system. By keeping the waiting queue minimal, it maximizes the number of actively running requests. In contrast, the other schedulers are clearly overwhelmed; their growing waiting queues choke the system and prevent work from being done efficiently.
+The **`precise-scheduling`** plots on the left show a stable system. By effectively utilizing the disaggregated KV-cache, it maintains minimal waiting queues and maximizes the number of actively running requests. In contrast, the other schedulers are clearly overwhelmed; their growing waiting queues choke the system and prevent work from being done efficiently.

This instability is caused by **"cache thrashing."** Cache-blind schedulers constantly **duplicate and evict** the same prefixes across different pods, wasting GPU cycles on **redundant prefill**. `precise-scheduling` avoids this entirely. It is precisely aware of prefix locations and consistently routes requests for cache-hits \- as long as the load allows \- resulting in less work, virtually no queues, and a healthy system.

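For readers who want to reproduce the Effective Cache Throughput panel discussed above, a minimal sketch of how such a "saved work" series could be derived from per-request cache-hit counts. The struct fields and windowing here are assumptions for the sketch, not vLLM's actual telemetry schema:

```go
// Illustrative sketch: bucket the prompt tokens that were served from the
// prefix cache (and therefore needed no prefill compute) into fixed time
// windows, yielding "prefill tokens avoided per second" over time.
// Field names are assumptions, not vLLM's telemetry schema.
package main

import "fmt"

type requestStat struct {
	finishedAtSec float64 // when the request's prefill phase completed
	promptTokens  int     // total prompt tokens in the request
	cachedTokens  int     // prompt tokens served from the prefix cache
}

// effectiveCacheThroughput buckets saved prompt tokens into fixed windows.
func effectiveCacheThroughput(stats []requestStat, windowSec float64) map[int]float64 {
	series := map[int]float64{}
	for _, s := range stats {
		bucket := int(s.finishedAtSec / windowSec)
		series[bucket] += float64(s.cachedTokens) / windowSec
	}
	return series
}

func main() {
	stats := []requestStat{
		{finishedAtSec: 1.2, promptTokens: 6000, cachedTokens: 5800}, // near-full prefix hit
		{finishedAtSec: 3.7, promptTokens: 6000, cachedTokens: 0},    // cold pod: full redundant prefill
		{finishedAtSec: 8.4, promptTokens: 6000, cachedTokens: 5800},
	}
	fmt.Println(effectiveCacheThroughput(stats, 5.0)) // tokens/s of avoided prefill per 5s window
}
```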
@@ -270,6 +270,19 @@ The journey of llm-d reflects a broader shift in how we think about LLM inferenc

By moving from AI-blind routing to a precise, KV-cache aware strategy, **we can unlock order-of-magnitude improvements in latency and throughput on the exact same hardware**. The well-lit path of precise prefix-cache awareness offers a tested, benchmarked solution to make your distributed deployments dramatically more efficient.

+:::tip Choosing the Right Strategy
+The optimal scheduler depends on the complexity of the workload. Below is a hierarchy of supported strategies, where each level addresses the limitations of the one before it.
+
+* **1. Random/Round-Robin Scheduling**: This simple approach works well for symmetric workloads where all requests have similar computational costs and minimal cache reuse.
+
+* **2. Load-Aware Scheduling**: The necessary next step for asymmetric workloads. By routing requests based on Pod serving capacity, it prevents overload and improves resource utilization.
+
+* **3. Approximate Prefix-Cache Scheduling**: This strategy introduces cache-awareness for workloads with context reuse (as described in this post).
+  Its estimates can become unreliable at high scale or with dynamic workloads, leading to suboptimal routing.
+
+* **4. Precise Prefix-Cache Aware Scheduling**: In production environments with tight SLOs, this is the most effective strategy for dynamic, high-scale workloads where maximizing the cache-hit ratio is a primary performance driver.
+:::
+
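The ladder above boils down to which signals feed the pod score. A conceptual sketch of that composition, under illustrative names and weights (not the actual EPP plugin configuration): level 1 uses no signals, level 2 adds load, and levels 3 and 4 add a prefix-cache score whose accuracy is what separates them.

```go
// Conceptual sketch of how the ladder's levels compose: each strategy picks
// the pod with the best weighted score, and the levels differ only in which
// scorers contribute. Names and weights are illustrative assumptions.
package main

import "fmt"

type podStats struct {
	queueDepth  int     // waiting requests on the pod
	kvCacheUtil float64 // 0..1 fraction of KV-cache blocks in use
	prefixScore float64 // 0..1 fraction of the request prefix already cached here
}

// score combines load signals (levels 2+) with a prefix-cache signal (levels 3-4).
// Level 1 (random/round-robin) is the degenerate case where all weights are zero.
func score(p podStats, loadWeight, prefixWeight float64) float64 {
	load := 1.0/(1.0+float64(p.queueDepth)) - p.kvCacheUtil // higher is better
	return loadWeight*load + prefixWeight*p.prefixScore
}

func main() {
	pods := map[string]podStats{
		"pod-a": {queueDepth: 2, kvCacheUtil: 0.6, prefixScore: 0.9}, // warm cache, some load
		"pod-b": {queueDepth: 0, kvCacheUtil: 0.2, prefixScore: 0.0}, // idle but cold
	}
	for name, p := range pods {
		// Load-only scoring prefers the idle-but-cold pod; adding the prefix
		// signal flips the decision toward the warm cache.
		fmt.Printf("%s: load-only=%.2f  cache-aware=%.2f\n", name, score(p, 1, 0), score(p, 1, 2))
	}
}
```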
## **Get Involved with llm-d**

The llm-d project thrives on community contributions, and there are many ways to get involved:
@@ -291,7 +304,7 @@ The llm-d project thrives on community contributions, and there are many ways to
* **Schedulers Compared**:
  * **`random-scheduling`**: A naive scheduler, acting as the control group.
  * **`load-scheduling`**: A scheduler aware only of load scorers: vLLM queueing \+ kv-cache-utilization.
-  * **`estimated-scheduling`**: The baseline intelligent scheduler extending load-scheduling with the approximate prefix-cache scorer.
+  * **`approximate-scheduling`**: The baseline intelligent scheduler extending load-scheduling with the approximate prefix-cache scorer.
  * **`precise-scheduling`**: The advanced well-lit path described in this post.

### **A.2: Workload Details \- Real-World B2B SaaS Scenario**

static/img/blogs/hangyin.png

-386 KB
