`blog/2025-08-05_mastering-kv-cache-aware-routing.md`
authors:
- cnuland

tags: [llm-d, blog, community]
draft: true
---

## Introduction
This blog post is written for **llm-d v0.2.0**.

In this blog post, we'll cover:

- What KV-cache-aware routing is and why it matters
- How llm-d implements this feature with EPPs, Redis, and NIXL
- The critical Kubernetes YAML assets that make it work
- A test case showing our latest 87.4% cache hit rate
- Where to go to learn more and get started

<!-- truncate -->
**llm-d** is an open source project built by Red Hat and the AI infrastructure community to manage large-scale LLM inference using cloud-native patterns. llm-d introduces:

- Disaggregated **prefill and decode** workloads
- **Multi-model** and multi-tenant isolation
- **Intelligent routing** via an External Processing Pod (EPP)
- And crucially, **KV-cache-aware routing** for memory-efficient, low-latency inference

---
## The Problem: Stateless Inference Fails to Reuse Cache

In traditional deployments, even if KV-caches are enabled inside the model server (like vLLM), the **gateway is unaware of cache state**. That leads to:

- Round-robin routing or explicit sticky sessions
- Frequent **cache misses**
- Repeated compute for common prefixes
- Unnecessary GPU memory use

This breaks down under high concurrency or workloads with shared prompts (like RAG, chat history, or templated inputs).
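
To see why this wastes work, here is a toy Python sketch (not llm-d code): three requests sharing a system prompt are spread round-robin across two replicas, so the shared prefix is prefilled twice, while cache-aware placement prefills it only once. Word counts stand in for tokens, and the two-replica setup and prompts are illustrative assumptions.

```python
# Toy illustration (not llm-d code): why cache-unaware routing recomputes
# shared prefixes. Two replicas, round-robin routing, three requests that
# share the same system prompt.

SYSTEM_PROMPT = "You are a helpful assistant. Answer concisely. "  # shared prefix
requests = [SYSTEM_PROMPT + q for q in (
    "What is a KV-cache?",
    "Why does prefix reuse matter?",
    "How does vLLM store attention state?",
)]

def prefill_cost(prompt: str, pod_cache: set[str]) -> int:
    """Rough proxy: 'tokens' (words here) that must be prefilled on this pod."""
    if SYSTEM_PROMPT in pod_cache:
        return len(prompt[len(SYSTEM_PROMPT):].split())  # only the new suffix
    pod_cache.add(SYSTEM_PROMPT)
    return len(prompt.split())  # prefix not cached on this pod yet: full prompt

pods = [set(), set()]  # per-pod "cache" of prefixes already seen
round_robin = sum(prefill_cost(p, pods[i % 2]) for i, p in enumerate(requests))

pods = [set(), set()]
cache_aware = sum(prefill_cost(p, pods[0]) for p in requests)  # all to the warm pod

print(f"round-robin prefill tokens: {round_robin}")
print(f"cache-aware prefill tokens: {cache_aware}")
```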
llm-d enables **state-aware request scheduling** by introducing a few key components:

- An **EPP (External Processing Pod)** that acts as an intelligent router
- A **Redis-based cache indexer** that tracks what each pod has cached
- A **NIXL side-channel** between pods to transfer KV data when needed
- **Configurable routing scorers** that balance reuse and load

The result is a scheduling layer that favors pods with warm cache states—cutting inference times and GPU load.
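
To make the scoring idea concrete, here is a small, hypothetical Python sketch of how a weighted scorer could rank candidate pods. The pod fields, formulas, and the load weight of 5 are illustrative assumptions; only the relative weighting (session stickiness highest, then KV-cache affinity, then load) mirrors the configuration shown later in this post. It is not the actual EPP implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of weighted pod scoring. The weights echo the scorer
# configuration later in this post (KVCACHE_AWARE_SCORER_WEIGHT=10,
# SESSION_AWARE_SCORER_WEIGHT=20); the load weight, fields, and formulas
# are illustrative assumptions, not the real EPP logic.

@dataclass
class PodState:
    name: str
    cached_prefix_ratio: float   # fraction of the prompt prefix already in KV-cache
    queue_depth: int             # pending requests on this pod
    owns_session: bool           # sticky-hash says this pod owns the session

KV_WEIGHT, LOAD_WEIGHT, SESSION_WEIGHT = 10, 5, 20

def score(pod: PodState, max_queue: int = 16) -> float:
    kv_score = pod.cached_prefix_ratio                        # 0.0 .. 1.0
    load_score = 1.0 - min(pod.queue_depth, max_queue) / max_queue
    session_score = 1.0 if pod.owns_session else 0.0
    return (KV_WEIGHT * kv_score
            + LOAD_WEIGHT * load_score
            + SESSION_WEIGHT * session_score)

pods = [
    PodState("decode-0", cached_prefix_ratio=0.9, queue_depth=6, owns_session=False),
    PodState("decode-1", cached_prefix_ratio=0.1, queue_depth=1, owns_session=True),
]
best = max(pods, key=score)
print(f"route to {best.name} (score={score(best):.1f})")
```

With weights like these, session ownership dominates the decision, KV-cache affinity breaks ties, and load keeps any single pod from being overwhelmed.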
To follow this guide, you should have:

- OpenShift or Kubernetes with GPU-enabled nodes
- The [llm-d Operator](https://llm-d.ai/docs/guide/Installation/prerequisites) installed
- A Hugging Face token (for downloading LLaMA or other models)
- [Project Code & Performance Test on GitHub](https://github.com/cnuland/hello-chris-llm-d)

---

## 🔧 Core Configurations

```yaml
      # (2.2) CRITICAL: Session-aware scoring for 99.91% stickiness
      - name: ENABLE_SESSION_AWARE_SCORER
        value: "true"

      # (2.3) Optimized scoring weights for session stickiness
      - name: KVCACHE_AWARE_SCORER_WEIGHT
        value: "10"
      # ...
        value: "5"
      - name: SESSION_AWARE_SCORER_WEIGHT
        value: "20" # Highest weight for session stickiness

      # (2.4) Session scoring configuration
      - name: SESSION_SCORING_ALGORITHM
        value: "sticky_hash"
      - name: SESSION_HEADER_NAMES
        value: "session-id,x-session-id,authorization"
      - name: SESSION_STICKY_DURATION
        value: "3600"

      # (2.5) Enhanced Redis indexing with faster updates
```
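
Because `SESSION_HEADER_NAMES` includes `x-session-id`, a client can keep a multi-turn conversation pinned to the same cache-warm pod simply by reusing one header value across requests. The sketch below uses Python's `requests` library against an OpenAI-style chat endpoint; the gateway URL and model name are placeholders for whatever your deployment exposes, not values from this guide.

```python
import uuid
import requests  # assumes the 'requests' package is installed

# Minimal sketch: reuse one session id across turns so the sticky-hash scorer
# keeps routing this conversation to the same (cache-warm) pod.
# GATEWAY_URL and the model name below are placeholders for your deployment.
GATEWAY_URL = "http://llm-d-gateway.example.com/v1/chat/completions"
session_id = str(uuid.uuid4())

def chat(content: str) -> str:
    resp = requests.post(
        GATEWAY_URL,
        headers={"x-session-id": session_id},  # matches SESSION_HEADER_NAMES above
        json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
            "messages": [{"role": "user", "content": content}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(chat("Summarize KV-cache-aware routing in one sentence."))
print(chat("Now give a concrete example."))  # same session id -> same pod
```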