By moving from AI-blind routing to a precise, KV-cache aware strategy, **we can unlock order-of-magnitude improvements in latency and throughput on the exact same hardware**. The well-lit path of precise prefix-cache awareness offers a tested, benchmarked solution to make your distributed deployments dramatically more efficient.
:::tip Choosing the Right Strategy
The optimal scheduler depends on the complexity of the workload. Below is a hierarchy of supported strategies, where each level addresses the limitations of the one before it. A minimal sketch contrasting these levels follows this tip.

* **1. Random/Round-Robin Scheduling**

  This simplest approach works well for symmetric workloads where all requests have similar computational costs and minimal cache reuse.
  * **Limitation:** It creates load imbalance when workloads are asymmetric.

* **2. Load-Aware Scheduling**

  The necessary next step for asymmetric workloads. By routing requests based on serving capacity, it prevents overload and improves resource utilization.
  * **Limitation:** It cannot exploit caching opportunities, resulting in redundant computation.

* **3. Approximate Prefix-Cache Scheduling**

  This strategy introduces cache-awareness for workloads with predictable prefix reuse. It is effective when its estimates of the cache state are reliable.
  * **Limitation:** These estimates can become stale at high scale or with dynamic workloads, leading to suboptimal routing.

* **4. Precise Prefix-Cache Aware Scheduling**

  In production environments with tight SLOs, this is the most effective strategy for dynamic, high-scale workloads where maximizing the cache-hit ratio is a primary performance driver.

:::
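
To make the hierarchy concrete, here is a minimal, self-contained Python sketch of the four levels as interchangeable routing functions. Everything in it is illustrative: the pod names, `BLOCK_SIZE`, the `loads` and `cache_index` inputs, and the scoring weights are hypothetical stand-ins for this post, not llm-d's actual scheduler API.

```python
import hashlib
import itertools
from collections import defaultdict

# Hypothetical inputs for the sketch: `loads` maps pod -> in-flight requests,
# `cache_index` maps pod -> set of KV-block hashes the pod reports holding.
PODS = ["pod-a", "pod-b", "pod-c"]
BLOCK_SIZE = 16  # tokens per KV-cache block (assumed for this sketch)

_rr = itertools.cycle(PODS)
_history = defaultdict(set)  # blocks this router previously sent to each pod


def block_hashes(tokens):
    """Hash the prompt in fixed-size blocks, chaining each block's hash onto
    the preceding ones, mirroring how prefix caches key KV blocks."""
    h, hashes = hashlib.sha256(), []
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        h.update(repr(tokens[i:i + BLOCK_SIZE]).encode())
        hashes.append(h.hexdigest())
    return hashes


def round_robin(tokens, loads, cache_index):
    # Level 1: blind rotation; fine when every request costs about the same.
    return next(_rr)


def load_aware(tokens, loads, cache_index):
    # Level 2: route to the pod with the most spare serving capacity.
    return min(PODS, key=lambda p: loads[p])


def approx_prefix_aware(tokens, loads, cache_index):
    # Level 3: estimate cache hits from what this router *previously sent*
    # to each pod; the estimate goes stale once pods evict blocks.
    hashes = block_hashes(tokens)
    best = max(PODS, key=lambda p: (sum(b in _history[p] for b in hashes),
                                    -loads[p]))
    _history[best].update(hashes)
    return best


def precise_prefix_aware(tokens, loads, cache_index):
    # Level 4: score against the cache contents the pods actually report,
    # blended with load so a single hot pod does not saturate.
    hashes = block_hashes(tokens)

    def score(p):
        hits = sum(b in cache_index[p] for b in hashes)
        return 2.0 * hits - loads[p]  # weights are arbitrary in this sketch

    return max(PODS, key=score)
```

The contrast that matters is between the last two functions: the approximate scorer trusts its own bookkeeping, while the precise scorer consults state reported by the serving pods themselves, which is exactly the staleness gap the tip above describes.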