
Commit ed88e8f

Add intelligent inference blog and new authors to the blog authors list (#87)
* Add intelligent inference blog and new authors to the blog authors list
* Reduce the image sizes for the post
* Update blog label in configuration from "News" to "Blog" for consistency

Signed-off-by: Pete Cheslock <[email protected]>
1 parent fa97500 commit ed88e8f

14 files changed: +181 -2 lines changed
Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
---
title: Intelligent Inference Scheduling with llm-d
description: How llm-d enables smarter, prefix-aware, load- and SLO-aware routing for better latency and throughput
slug: intelligent-inference-scheduling-with-llm-d
date: 2025-09-03T09:00

authors:
- niliguy
- vitabortnikov
- etailevran
- robshaw
- smarterclayton

tags: [blog, updates, llm-d]
---

# Intelligent Inference Scheduling with llm-d
The llm-d project lays out clear, “well-lit” paths for anyone to adopt the leading inference optimizations within their existing deployment framework: Kubernetes. These are tested approaches designed to make complex deployments easier and more efficient. In this post, we explore the first of these paths: **intelligent inference scheduling**. Unlike basic round-robin load balancing, this method takes the unique demands of LLMs into account, leading to better performance across the board: higher throughput, lower latency, and efficient use of resources.

### Why Intelligent Scheduling Is Needed for LLM Inference

Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same time. In contrast, LLM requests can vary wildly in token count and compute needs, making simple load-spreading strategies prone to bottlenecks and imbalanced traffic.

<div style={{textAlign: 'center', margin: '20px 0'}}>
  <img src="/img/blogs/inference-scheduling/image01.png" alt="Intelligent inference scheduling diagram" style={{width: '75%', height: 'auto'}} />
</div>

<!-- truncate -->

LLM inference pipelines also consist of two distinct phases, a compute-bound prefill stage and a memory-bound decode stage, that have fundamentally different resource profiles. Without specialization, every replica must handle both phases, leading to wasted GPU cycles or memory bandwidth. At the same time, many LLM use cases involve multi-turn chats or agentic flows where cached prefix computations dramatically speed up response times if the request is routed back to the same instance.

On top of these challenges, LLM endpoints often serve a spectrum of quality-of-service needs. Interactive tasks like code completion demand millisecond-level latency, chat agents can tolerate a few seconds, and batch jobs might take minutes or more. Satisfying tight latency SLOs for expensive inference calls can be prohibitively costly if every pod is treated identically.

To address these unique demands, an intelligent inference scheduler that understands both the shape of incoming requests and the real-time state of your cluster can boost throughput, slash tail latencies, and maximize GPU resource utilization.

### Recap: Inference Serving in Kubernetes, the Gateway API, and the Inference Gateway Extension

Kubernetes Services paired with Deployments and standard load balancing distribute traffic evenly across identical replicas. That model works well for stateless microservices with uniform, short-lived requests. But as we saw earlier, LLM inference calls vary wildly in compute intensity, benefit from stateful routing (e.g., prefix caches), and demand tight tail-latency control, none of which vanilla load balancing handles well.

The Gateway API modernizes Kubernetes networking by offering a CRD-based, L7 routing framework that replaces and extends traditional Ingress. It gives you fine-grained route definitions, pluggable data planes, and native compatibility with multi-cluster or cross-team routing policies. Yet on its own, the Gateway API has no notion of LLM inference serving and cannot route based on inference-specific characteristics and metrics.

To bridge that gap, the Gateway API Inference Extension project introduces the Inference Gateway (IGW). IGW reuses the Gateway API’s core primitives but adds new CRDs, most notably **InferencePool**, to represent collections of model-serving pods. InferencePools can carry additional metadata such as base model, accelerator type, and runtime capabilities. Gateways then invoke a pluggable **EndpointPicker (EPP)** to perform “smart” load balancing, leveraging Envoy’s External Processing (ext-proc) to steer traffic to the right inference endpoint.

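To make this concrete, the sketch below shows roughly what an InferencePool and an HTTPRoute that sends traffic to it might look like. The resource names and labels are hypothetical, and the exact field names and API versions may differ across releases of the Inference Extension.

```yaml
# Illustrative sketch only: resource names and labels are hypothetical, and
# field names/apiVersions may differ in your release of the Inference Extension.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool                # hypothetical pool of model-serving pods
spec:
  targetPortNumber: 8000          # port the model servers listen on
  selector:
    app: vllm-llama               # label selecting the serving pods
  extensionRef:
    name: llama-epp               # EndpointPicker invoked by the gateway via ext-proc
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-route
spec:
  parentRefs:
    - name: inference-gateway     # hypothetical Gateway
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool     # route to the pool rather than a plain Service
          name: llama-pool
```
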
The default EPP in IGW follows a structured scheduling cycle for each incoming request:

* **Endpoint discovery:** Enumerate all InferencePool pods and gather their metadata (waiting queue state, loaded models, cache contents, etc.).
* **Filtering:** Exclude pods that can’t serve the request due to overload, incompatible resources, or memory pressure.
* **Scoring:** Assign each remaining pod a score via extensible scorers, evaluating factors like queue depth, session affinity, prefix cache hits, and custom SLO indicators.
* **Selection:** Pick appropriate endpoints, with built-in tie-breaking and fallback logic.

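Inside the EPP, these stages are realized as a pipeline of pluggable filters, scorers, and pickers. The configuration sketch below illustrates that shape; the plugin names and schema here are assumptions for illustration, not the exact format shipped by IGW or llm-d.

```yaml
# Illustrative plugin pipeline; plugin names and this schema are assumptions,
# not the project's exact configuration format.
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: queue-filter          # Filtering: drop saturated endpoints
      - pluginRef: queue-scorer          # Scoring: prefer short request queues
        weight: 1
      - pluginRef: kv-cache-scorer       # Scoring: prefer endpoints with free KV-cache memory
        weight: 1
      - pluginRef: prefix-cache-scorer   # Scoring: prefer endpoints holding matching prefixes
        weight: 2
      - pluginRef: max-score-picker      # Selection: pick the highest weighted score
```
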
Building on IGW’s foundation, **llm-d augments the EPP with more advanced scheduling capabilities**. It introduces scorers that optimize for KV-cache locality (boosting prefix-cache hit rates) and orchestrates multiple scheduling passes to disaggregate the prefill and decode phases onto specialized pod variants. The result is a fully LLM-aware scheduler that drives higher throughput, lower tail latencies, and better resource efficiency across the board.

<div style={{textAlign: 'center', margin: '20px 0'}}>
  <img src="/img/blogs/inference-scheduling/image02.png" alt="Diagram" style={{width: '75%', height: 'auto'}} />
</div>

### Intelligent Inference Scheduling with llm-d

A key differentiator of llm-d is the ability to plug configurable, AI-aware scorers into the inference gateway scheduling pipeline. These scorers go beyond generic load balancing by factoring in LLM-specific workload characteristics, such as token-count variability, compute/memory phase differences, and KV-cache locality, when deciding where each request should run.

LLM workloads are not uniform. Some use cases, like multi-turn conversations, RAG pipelines, or agentic flows, naturally lead to **high prefix reuse**, where requests repeatedly share large portions of the prompt. Others, like diverse batch inference jobs or single-shot completions, exhibit **low prefix sharing**, where cache hits are rare and every request is essentially unique.

Because of this diversity, llm-d’s pluggable, AI-aware scorers allow operators to tailor scheduling strategies to workload profiles. We evaluated two configurations (sketched below):

* **Prefix-only scorer** – routes to maximize KV-cache hits.
* **Prefix + Load scorer** – adds dynamic load-awareness while still exploiting cache opportunities.

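In terms of the illustrative pipeline sketched earlier, the two configurations differ only in which scorers carry weight. Again, the plugin names and schema below are assumptions for illustration, not llm-d’s exact configuration.

```yaml
# Two hypothetical scheduling profiles corresponding to the evaluated configurations.
schedulingProfiles:
  - name: prefix-only                    # route purely to maximize KV-cache hits
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 1
      - pluginRef: max-score-picker
  - name: prefix-plus-load               # blend cache affinity with load-awareness
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 1
      - pluginRef: queue-scorer
        weight: 1
      - pluginRef: max-score-picker
```
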
#### **Why AI-Aware Scorers Win**

The following benchmarks show how performance evolves when cache reuse opportunities are abundant and when they are minimal, and they illustrate an important point: **the optimal scheduling strategy depends on the workload profile**.

#### **High Prefix Sharing Workload**

When cache locality is abundant, the results are dramatic:

* **Success rate:** The prefix-only scorer frequently overloaded replicas, succeeding in only ~55% of requests, while Prefix + Load maintained 100% success across all QPS levels.

* **Time to First Token (TTFT):** Prefix + Load kept TTFT consistently near zero, while Prefix-only degraded rapidly, exceeding 140s at high QPS.

* **Inter-Token Latency (ITL):** Prefix + Load achieved ITL of ~30ms, versus ~160ms with Prefix-only — more than a 5× improvement in responsiveness.

* **Throughput:** Prefix + Load scaled linearly with QPS, reaching ~60k tokens/sec at 20 QPS. Prefix-only flatlined near 2k–3k tokens/sec.

<div style={{margin: '20px 0'}}>
  <div style={{marginBottom: '20px'}}>
    <img src="/img/blogs/inference-scheduling/image03.png" alt="Throughput vs Request Rate" style={{width: '100%', height: 'auto'}} />
    <p style={{textAlign: 'center', fontSize: '0.9em', marginTop: '8px'}}><em>Throughput vs Request Rate</em></p>
  </div>

  <div style={{display: 'grid', gridTemplateColumns: '1fr 1fr 1fr', gap: '15px', alignItems: 'start'}}>
    <div style={{display: 'flex', flexDirection: 'column', justifyContent: 'center', height: '100%'}}>
      <img src="/img/blogs/inference-scheduling/image04.png" alt="Success Rate" style={{width: '100%', height: 'auto'}} />
      <p style={{textAlign: 'center', fontSize: '0.85em', marginTop: '6px'}}><em>Success Rate</em></p>
    </div>
    <div style={{display: 'flex', flexDirection: 'column', justifyContent: 'center', height: '100%'}}>
      <img src="/img/blogs/inference-scheduling/image05.png" alt="TTFT and QPS" style={{width: '100%', height: 'auto'}} />
      <p style={{textAlign: 'center', fontSize: '0.85em', marginTop: '6px'}}><em>TTFT and QPS</em></p>
    </div>
    <div style={{display: 'flex', flexDirection: 'column', justifyContent: 'center', height: '100%'}}>
      <img src="/img/blogs/inference-scheduling/image06.png" alt="Intertoken Latency" style={{width: '100%', height: 'auto'}} />
      <p style={{textAlign: 'center', fontSize: '0.85em', marginTop: '6px'}}><em>Intertoken Latency</em></p>
    </div>
  </div>
</div>

In workloads with heavy prefix reuse, prefix-aware scheduling must be combined with load-awareness to avoid bottlenecks and maximize GPU utilization. With both in place, llm-d achieves **100% request success, lower latencies, and linear throughput scaling** — the essence of intelligent, AI-aware scheduling.

#### **Low Prefix Sharing Workload**

When cache hits are rare, prefix-awareness provides little benefit, and both scorers perform similarly:

**Throughput:** Both scorers perform **nearly identically**, scaling linearly with QPS. Output throughput reaches ~400 tokens/sec and total throughput ~60k tokens/sec at 20 QPS for both strategies.

**Latency:**

* **Time to First Token (TTFT):** Both remain stable in the **300–380 ms range** as load increases. Small variations exist, but neither scorer shows a clear advantage.

* **Normalized time per token:** Flat around **0.65 ms/token**, with both scorers tightly overlapping across QPS levels.

* **Inter-Token Latency (ITL):** Increases linearly with load, from ~25 ms at 2 QPS to ~50 ms at 20 QPS — again, no significant gap between scorers.

**Reliability:**
Both scorers achieve **100% success rate** across the full load range, confirming that load balancing alone is sufficient when prefix reuse is low.

Under low prefix sharing workloads, the benefits of prefix-aware routing naturally diminish. In this case, adding load-awareness or prefix-awareness makes little difference: both strategies scale smoothly and meet latency targets.

![Latency vs request rate](/img/blogs/inference-scheduling/image07.png)
![Throughput vs request rate](/img/blogs/inference-scheduling/image08.png)

### **Takeaway**

These benchmarks illustrate why **configurable scorers matter in llm-d**.

* In **prefix-heavy workloads**, Prefix + Load scoring ensures cache hits are exploited without overloading replicas — yielding linear throughput scaling, low latencies, and high success rates.

* In **prefix-light workloads**, simple load balancing suffices, and the system avoids unnecessary complexity.

This adaptability means operators can choose (or combine) scorers based on workload characteristics, achieving the best **token-per-dollar efficiency** while consistently meeting latency and throughput SLOs.

### Looking Ahead: Roadmap and Future Plans

The IGW and `llm-d` projects are evolving rapidly, with several exciting directions on the horizon:

* **Dynamic Scheduling Goals**: Support for runtime reconfiguration of scheduling strategies based on workload type, latency targets, or user-defined policies.
* **Multi-Model Awareness**: Enhanced routing logic that accounts for model compatibility, adapter stacking, and ensemble inference (the topic of our next blog post).
* **Plugin Ecosystem**: A curated set of reusable plugins for common LLM use cases, contributed by the community. We’re considering supporting out-of-process plugins, written in any language, to allow researchers to experiment with new scheduling algorithms and ideas. Let us know if you have an idea we can help enable!

### Closing Thoughts

The journey of llm-d reflects a broader shift in how we think about LLM inference: not just as a stateless function call, but as a dynamic, resource-aware orchestration problem. By building on IGW and pushing its boundaries, llm-d offers a flexible, extensible foundation for intelligent scheduling at scale.

Whether you're running a single model or a fleet of fine-tuned variants, the goal is the same: **maximize performance, minimize latency, and make smarter use of available compute**.

### Get Involved with llm-d

The llm-d project thrives on community contributions, and there are many ways to get involved:

- **Explore the llm-d Community Quickstart Guide** → [Start here](https://llm-d.ai/docs/community) to learn more about getting involved in the llm-d project
- **Join our Slack** → [Get your invite](https://llm-d.ai/slack) and connect with maintainers and contributors
- **Explore the code** → Browse our [GitHub organization](https://github.com/llm-d) and find issues that interest you
- **Attend meetings** → All meetings are open! Add our [public calendar](https://llm-d.ai/docs/community#public-meeting-calendar) and join discussions

blog/authors.yml

Lines changed: 18 additions & 0 deletions
@@ -37,3 +37,21 @@ cnuland:
  title: Principal Technical Marketing Manager for AI, Red Hat
  url: https://github.com/cnuland
  image_url: /img/blogs/cnuland.jpeg

niliguy:
  name: Nili Guy
  title: R&D Manager, AI Infrastructure, IBM
  url: https://www.linkedin.com/in/nilig/
  image_url: /img/blogs/niliguy.jpg

etailevran:
  name: Etai Lev Ran
  title: Cloud Architect, IBM
  url: https://www.linkedin.com/in/elevran/
  image_url: /img/blogs/etailevran.jpg

vitabortnikov:
  name: Vita Bortnikov
  title: IBM Fellow, IBM
  url: https://www.linkedin.com/in/vita-bortnikov/
  image_url: /img/blogs/vitabortnikov.jpg

docusaurus.config.js

Lines changed: 2 additions & 2 deletions
@@ -156,7 +156,7 @@ const config = {
       position: "left",
       label: "Community",
     },
-    { to: "/blog", label: "News", position: "left" },
+    { to: "/blog", label: "Blog", position: "left" },
     {
       type: 'html',
       position: 'right',
@@ -209,7 +209,7 @@ const config = {
         title: "More",
         items: [
           {
-            label: "News",
+            label: "Blog",
             to: "/blog",
           },
           {

static/img/blogs/etailevran.jpg

28.6 KB
