Commit 15ba193

draft that blogpost! (#79)
Gotta draft em all...wait no this one.

Signed-off-by: JJ Asghar <[email protected]>
1 parent fb67b32 commit 15ba193

File tree

1 file changed: +33 -32 lines changed


blog/2025-08-05_mastering-kv-cache-aware-routing.md

Lines changed: 33 additions & 32 deletions
@@ -7,6 +7,7 @@ authors:
   - cnuland

 tags: [llm-d, blog, community]
+draft: true
 ---

 ## Introduction
@@ -19,11 +20,11 @@ This blog post is written for **llm-d v0.2.0**. For detailed release information

 In this blog post, we'll cover:

-- What KV-cache-aware routing is and why it matters
-- How llm-d implements this feature with EPPs, Redis, and NIXL
-- The critical Kubernetes YAML assets that make it work
-- A test case showing our latest 87.4% cache hit rate
-- Where to go to learn more and get started
+- What KV-cache-aware routing is and why it matters
+- How llm-d implements this feature with EPPs, Redis, and NIXL
+- The critical Kubernetes YAML assets that make it work
+- A test case showing our latest 87.4% cache hit rate
+- Where to go to learn more and get started

 <!-- truncate -->

@@ -36,21 +37,21 @@ In this blog post, we'll cover:

 **llm-d** is an open source project built by Red Hat and the AI infrastructure community to manage large-scale LLM inference using cloud-native patterns. llm-d introduces:

-- Disaggregated **prefill and decode** workloads
-- **Multi-model** and multi-tenant isolation
-- **Intelligent routing** via an External Processing Pod (EPP)
-- And crucially, **KV-cache-aware routing** for memory-efficient, low-latency inference
+- Disaggregated **prefill and decode** workloads
+- **Multi-model** and multi-tenant isolation
+- **Intelligent routing** via an External Processing Pod (EPP)
+- And crucially, **KV-cache-aware routing** for memory-efficient, low-latency inference

 ---

 ## The Problem: Stateless Inference Fails to Reuse Cache

 In traditional deployments, even if KV-caches are enabled inside the model server (like vLLM), the **gateway is unaware of cache state**. That leads to:

-- Round-robin routing or explicit sticky sessions
-- Frequent **cache misses**
-- Repeated compute for common prefixes
-- Unnecessary GPU memory use
+- Round-robin routing or explicit sticky sessions
+- Frequent **cache misses**
+- Repeated compute for common prefixes
+- Unnecessary GPU memory use

 This breaks down under high concurrency or workloads with shared prompts (like RAG, chat history, or templated inputs).

@@ -60,10 +61,10 @@ This breaks down under high concurrency or workloads with shared prompts (like R

 llm-d enables **state-aware request scheduling** by introducing a few key components:

-- An **EPP (External Processing Pod)** that acts as an intelligent router
-- A **Redis-based cache indexer** that tracks what each pod has cached
-- A **NIXL side-channel** between pods to transfer KV data when needed
-- **Configurable routing scorers** that balance reuse and load
+- An **EPP (External Processing Pod)** that acts as an intelligent router
+- A **Redis-based cache indexer** that tracks what each pod has cached
+- A **NIXL side-channel** between pods to transfer KV data when needed
+- **Configurable routing scorers** that balance reuse and load

 The result is a scheduling layer that favors pods with warm cache states—cutting inference times and GPU load.

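The hunk above names the components; to make the scoring idea concrete, here is a minimal sketch of how a router like the EPP might combine per-pod signals. The class, function, and field names are illustrative assumptions rather than llm-d's real code; the default weights simply mirror the values set in the EPP ConfigMap later in this diff (KV-cache 10, load 5, session 20).

```python
# Illustrative sketch only -- not llm-d's actual scorer implementation.
# Combines per-pod signals the way the EPP ConfigMap weights further down
# in this diff suggest: session stickiness (20) > KV-cache overlap (10) > load (5).
from dataclasses import dataclass

@dataclass
class PodState:
    name: str
    cached_prefix_tokens: int  # prompt tokens already resident in this pod's KV-cache
    active_requests: int       # current in-flight load
    sticky_session: bool       # an existing session hash maps to this pod

def score(pod: PodState, prompt_tokens: int,
          w_kv: float = 10.0, w_load: float = 5.0, w_session: float = 20.0) -> float:
    kv_score = pod.cached_prefix_tokens / max(prompt_tokens, 1)  # 0..1 prefix overlap
    load_score = 1.0 / (1 + pod.active_requests)                 # prefer less-loaded pods
    session_score = 1.0 if pod.sticky_session else 0.0
    return w_kv * kv_score + w_load * load_score + w_session * session_score

def pick_pod(pods: list[PodState], prompt_tokens: int) -> PodState:
    # Route to the pod with the highest combined score.
    return max(pods, key=lambda p: score(p, prompt_tokens))

# Example: a pod holding a warm 900-token prefix beats an idle cold pod.
pods = [PodState("decode-0", cached_prefix_tokens=900, active_requests=3, sticky_session=True),
        PodState("decode-1", cached_prefix_tokens=0, active_requests=0, sticky_session=False)]
print(pick_pod(pods, prompt_tokens=1000).name)  # -> decode-0
```

With this weighting, a pod that already holds a warm prefix or an active session wins even when it is somewhat busier, which is the trade-off the scorer weights in the ConfigMap below encode.
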
@@ -76,10 +77,10 @@ The result is a scheduling layer that favors pods with warm cache states—cutti

 To follow this guide, you should have:

-- OpenShift or Kubernetes with GPU-enabled nodes
-- The [llm-d Operator](https://llm-d.ai/docs/guide/Installation/prerequisites) installed
+- OpenShift or Kubernetes with GPU-enabled nodes
+- The [llm-d Operator](https://llm-d.ai/docs/guide/Installation/prerequisites) installed
 - A Hugging Face token (for downloading LLaMA or other models)
-- [Project Code & Performance Test on GitHub](https://github.com/cnuland/hello-chris-llm-d)
+- [Project Code & Performance Test on GitHub](https://github.com/cnuland/hello-chris-llm-d)
 ---

 ## 🔧 Core Configurations
@@ -209,7 +210,7 @@ data:
   # (2.2) CRITICAL: Session-aware scoring for 99.91% stickiness
   - name: ENABLE_SESSION_AWARE_SCORER
     value: "true"
-
+
   # (2.3) Optimized scoring weights for session stickiness
   - name: KVCACHE_AWARE_SCORER_WEIGHT
     value: "10"
@@ -219,27 +220,27 @@ data:
     value: "5"
   - name: SESSION_AWARE_SCORER_WEIGHT
     value: "20" # Highest weight for session stickiness
-
+
   # (2.4) Session scoring configuration
   - name: SESSION_SCORING_ALGORITHM
     value: "sticky_hash"
   - name: SESSION_HEADER_NAMES
     value: "session-id,x-session-id,authorization"
   - name: SESSION_STICKY_DURATION
     value: "3600"
-
+
   # (2.5) Enhanced Redis indexing with faster updates
   - name: KVCACHE_INDEXER_REDIS_ADDR
     value: llm-d-operator-redis-master.llm-d.svc.cluster.local:8100
   - name: KVCACHE_INDEX_UPDATE_INTERVAL
     value: "500ms"
-
+
   # (2.6) Prefill/Decode disaggregation with cache-aware routing
   - name: PD_ENABLED
     value: "true"
   - name: PD_CACHE_AWARE_ROUTING
     value: "true"
-
+
   # (2.7) Enhanced prefill scoring configuration
   - name: PREFILL_ENABLE_KVCACHE_AWARE_SCORER
     value: "true"
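Given the SESSION_HEADER_NAMES, SESSION_STICKY_DURATION, and KV-cache settings in the hunk above, a client that reuses one session header and a shared prompt prefix should keep landing on the same warm pod. Below is a minimal client sketch under the assumption of an OpenAI-compatible completions endpoint exposed by the llm-d gateway; the gateway URL, model name, and prompt text are placeholders, not values from the post.

```python
# Illustrative client sketch: repeated requests that reuse the same session
# header and a shared prompt prefix, so the EPP's session-aware and
# KV-cache-aware scorers can route them to the same warm pod.
# The gateway URL and model name below are placeholders.
import uuid
import requests

GATEWAY_URL = "http://llm-d-gateway.example.com/v1/completions"  # placeholder
SESSION_ID = str(uuid.uuid4())  # matches one of SESSION_HEADER_NAMES ("session-id")
SHARED_PREFIX = "You are a helpful assistant. Answer concisely.\n\n"

for question in ["What is KV-cache reuse?", "Why does prefix sharing help latency?"]:
    resp = requests.post(
        GATEWAY_URL,
        headers={"session-id": SESSION_ID},
        json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
            "prompt": SHARED_PREFIX + question,
            "max_tokens": 64,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["text"])
```

Every request carries the same session-id and starts with the same prefix, so both the sticky_hash session scorer and the Redis prefix index point the EPP at the pod that already holds those cached tokens.
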
@@ -360,9 +361,9 @@ To validate that KV-cache-aware routing was functioning correctly, we designed a

 **Monitored Signals:**

-- Redis telemetry for prefix index hits
-- vLLM logs for cache use
-- Tekton metrics for latency
+- Redis telemetry for prefix index hits
+- vLLM logs for cache use
+- Tekton metrics for latency
 - Grafana dashboard for visibility

 ### 🔍 Latest Validation Results
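The 87.4% figure the post cites is a plain hit/(hit+miss) ratio over these monitored counters. A tiny sketch of that arithmetic follows; the counter names and example numbers are hypothetical stand-ins, not the test's raw output.

```python
# Hypothetical counter names; the real values come from Redis telemetry,
# vLLM logs, or the Grafana dashboard mentioned above.
def cache_hit_rate(prefix_hits: int, prefix_misses: int) -> float:
    """Fraction of lookups served from an existing KV-cache prefix."""
    total = prefix_hits + prefix_misses
    return prefix_hits / total if total else 0.0

# Example: counts in this ballpark would reproduce the reported ~87.4% hit rate.
print(f"{cache_hit_rate(874, 126):.1%}")  # -> 87.4%
```
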
@@ -436,7 +437,7 @@ The bottom line: KV-cache-aware routing isn't just technically impressive—it's
 ---

 ## 📚 Learn More
-- [Project Code & Performance Test on GitHub](https://github.com/cnuland/hello-chris-llm-d)
-- [llm-d GitHub](https://github.com/llm-d/llm-d)
-- [llm-d Operator Quickstart](https://llm-d.ai/docs/guide/Installation/prerequisites)
-- [vLLM Documentation](https://docs.vllm.ai)
+- [Project Code & Performance Test on GitHub](https://github.com/cnuland/hello-chris-llm-d)
+- [llm-d GitHub](https://github.com/llm-d/llm-d)
+- [llm-d Operator Quickstart](https://llm-d.ai/docs/guide/Installation/prerequisites)
+- [vLLM Documentation](https://docs.vllm.ai)
