Collapse authors to the yaml configuration and create a June update blog post (#51)
* Refactor author information in blog posts and fix typos
- Updated author entries in `2025-05-20_announce.md`, `2025-06-03_week_1_round_up.md`, and `authors.yml` to use usernames instead of full details.
- Corrected minor typographical errors in various blog posts for consistency and clarity.
Signed-off-by: Pete Cheslock <[email protected]>
* Add June 2025 Community Update and new tags
- Introduced a new blog post titled "llm-d Community Update - June 2025" detailing project progress, new YouTube channel, and ways to get involved.
- Added new tags for community engagement, updates, and SIG-Benchmarking in `tags.yml`.
Signed-off-by: Pete Cheslock <[email protected]>
* Move the reminder down - and bring the survey higher up.
Signed-off-by: Pete Cheslock <[email protected]>
* Add June 2025 Community Update
- Introduced a new blog post detailing the latest updates from the llm-d community, including a call for participation in a performance benchmarks survey, the launch of a new YouTube channel, and ways to get involved with the project.
Signed-off-by: Pete Cheslock <[email protected]>
* Add new blog post for June 2025 Community Update
- Introduced a comprehensive update detailing project progress, a call for participation in a community survey, the launch of a new YouTube channel, and various ways for community members to get involved with the llm-d project.
Signed-off-by: Pete Cheslock <[email protected]>
* Remove date as a header
Signed-off-by: Pete Cheslock <[email protected]>
* Remove duplicate H1 header
Signed-off-by: Pete Cheslock <[email protected]>
---------
Signed-off-by: Pete Cheslock <[email protected]>
@@ -57,11 +43,11 @@ The LLM inference workload, however, is unique with slow, non-uniform, expensive

-Let’s take a look at each one step-by-step:
+Let's take a look at each one step-by-step:

*A. Requests are expensive with significant variance in resource utilization.*

-* Each LLM inference request has a different “shape” to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.
+* Each LLM inference request has a different "shape" to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.
  * RAG has long inputs \- prompt and retrieved docs \- and short generated outputs
  * Reasoning has short or medium inputs and long generated outputs
@@ -71,13 +57,13 @@ Let’s take a look at each one step-by-step:

*B. Routing to specific replicas with cached prior computation can achieve orders of magnitude better latency.*

-* Many common LLM workloads have “multi-turn” request patterns, where the same prompt is sent iteratively to the same instance.
+* Many common LLM workloads have "multi-turn" request patterns, where the same prompt is sent iteratively to the same instance.
  * Agentic (tool calls are an iterative request flow)
  * Code completion task (requests reuse the current codebase as context)
-* LLM inference servers like vLLM implement a method called “automatic prefix caching”, which enables “skipping” a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.
+* LLM inference servers like vLLM implement a method called "automatic prefix caching", which enables "skipping" a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that already have the data in their cache, we skip that computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.
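To make the cache-affinity idea concrete, here is a minimal sketch of prefix-cache-aware scoring. It is illustrative only: the block size, chained hashing, and in-memory cache index below are assumptions for the example, not the actual vLLM or llm-d-inference-scheduler implementation.

```python
# Illustrative sketch of prefix-cache-aware replica scoring (not llm-d's real code).
# Idea: hash the prompt in fixed-size token blocks, then prefer the replica that
# already holds the longest matching chain of those blocks in its KV cache.
import hashlib

BLOCK_SIZE = 16  # tokens per cache block; assumed for the example

def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Chain-hash the prompt block by block, so each hash encodes the full prefix."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[i : i + block_size]
        prev = hashlib.sha256(prev + str(block).encode()).digest()
        hashes.append(prev.hex())
    return hashes

def prefix_score(prompt_blocks: list[str], cached_blocks: set[str]) -> float:
    """Fraction of the prompt's leading blocks already cached on a replica."""
    hits = 0
    for h in prompt_blocks:
        if h not in cached_blocks:
            break
        hits += 1
    return hits / max(len(prompt_blocks), 1)

def pick_replica(token_ids: list[int], replica_caches: dict[str, set[str]]) -> str:
    """Choose the replica with the best prefix-cache overlap (ties go to the first)."""
    blocks = block_hashes(token_ids)
    return max(replica_caches, key=lambda name: prefix_score(blocks, replica_caches[name]))

# Example: replica "b" already has the first two blocks of this prompt cached, so it wins.
prompt = list(range(48))  # 48 fake token ids = 3 blocks
caches = {"a": set(), "b": set(block_hashes(prompt)[:2])}
print(pick_replica(prompt, caches))  # -> "b"
```

A production scorer would blend this signal with load and queue depth; that combination is what the inference scheduler's filtering and scoring pipeline, described later in the post, is responsible for.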
@@ -97,16 +83,16 @@ Let’s take a look at each one step-by-step:

*D. Production deployments often have a range of quality of service (QoS) requirements.*

* Use cases for a single LLM endpoint can have a wide variety of quality of service requirements. Consider the following examples:
-  * Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an “in the loop” experience. O(ms) latency tolerance.
+  * Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an "in the loop" experience. O(ms) latency tolerance.
  * Latency is important: Chat agent sessions and email drafting with interactive use cases. O(seconds) latency tolerance.
-  * Latency tolerant: Video call and email summarization and “deep research” agents with daily or hourly usage patterns. O(minutes) latency tolerance.
+  * Latency tolerant: Video call and email summarization and "deep research" agents with daily or hourly usage patterns. O(minutes) latency tolerance.
* Given the compute intensity (and, therefore, high costs) of LLMs, tight latency SLOs are substantially more expensive to achieve. This spectrum of latency requirements presents an opportunity to further optimize infrastructure efficiency: the more latency tolerant a workload is, the more room we have to optimize infrastructure efficiency across workloads.
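As a rough sketch of how such tiers can drive scheduling decisions (the class names and latency budgets below are invented for illustration; real priority handling is what IGW's serving-priority and flow-control features are for), a scheduler can tag each request with a latency class and serve the strictest class first:

```python
# Illustrative only: map request classes to latency budgets and drain a queue
# strictest-first. The class names and budgets are made up for this sketch.
from dataclasses import dataclass, field
import heapq

LATENCY_BUDGET_S = {
    "interactive": 0.2,     # code completion, search: O(ms) tolerance
    "conversational": 5.0,  # chat, drafting: O(seconds) tolerance
    "batch": 600.0,         # summarization, deep research: O(minutes) tolerance
}

@dataclass(order=True)
class PendingRequest:
    budget_s: float                        # ordering key: tighter budget first
    request_id: str = field(compare=False)
    prompt: str = field(compare=False)

def enqueue(queue: list, request_id: str, prompt: str, qos_class: str) -> None:
    heapq.heappush(queue, PendingRequest(LATENCY_BUDGET_S[qos_class], request_id, prompt))

queue: list[PendingRequest] = []
enqueue(queue, "r1", "summarize this meeting ...", "batch")
enqueue(queue, "r2", "def quicksort(", "interactive")
enqueue(queue, "r3", "draft a reply to ...", "conversational")
print([heapq.heappop(queue).request_id for _ in range(len(queue))])  # ['r2', 'r3', 'r1']
```

The point is not the queue itself but that latency-tolerant traffic gives the platform freedom to defer work and raise utilization.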
### Why llm-d?

-To exploit these characteristics and achieve optimal performance for LLM workloads, the inference serving landscape is rapidly transitioning towards distributed cluster-scale architectures. For instance, in its “Open Source Week”, the DeepSeek team published the design of its [inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which aggressively leverages disaggregation and KV caching to achieve remarkable performance per $ of compute.
+To exploit these characteristics and achieve optimal performance for LLM workloads, the inference serving landscape is rapidly transitioning towards distributed cluster-scale architectures. For instance, in its "Open Source Week", the DeepSeek team published the design of its [inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which aggressively leverages disaggregation and KV caching to achieve remarkable performance per $ of compute.

However, for most GenAI innovators, ML platform teams, and IT operations groups, these benefits remain out of reach. Building and operating a complex, monolithic system is time-consuming and challenging, especially given the rapid pace of innovation and enterprise deployments with tens or hundreds of models for divergent use cases. This complexity risks slower time to market, higher operational costs and sprawl, and difficulty adopting and experimenting.
@@ -129,25 +115,25 @@ To achieve this objective, we designed llm-d with a modular and layered architec

* [**Kubernetes**](https://kubernetes.io/docs/home/) **(K8s)**. K8s is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications. It is the industry standard for deploying and updating LLM inference engines across various hardware accelerators.

-* [**Inference Gateway**](https://gateway-api-inference-extension.sigs.k8s.io/) **(IGW)**. IGW is an official Kubernetes project that extends the [Gateway API](https://gateway-api.sigs.k8s.io/) (the next generation of Kubernetes Ingress and Load Balancing API) with inference-specific routing. IGW includes many important features like model routing, serving priority, and extensible scheduling logic for “smart” load balancing. IGW integrates with many different gateway implementations, such as Envoy, making it widely portable across Kubernetes clusters.
+* [**Inference Gateway**](https://gateway-api-inference-extension.sigs.k8s.io/) **(IGW)**. IGW is an official Kubernetes project that extends the [Gateway API](https://gateway-api.sigs.k8s.io/) (the next generation of Kubernetes Ingress and Load Balancing API) with inference-specific routing. IGW includes many important features like model routing, serving priority, and extensible scheduling logic for "smart" load balancing. IGW integrates with many different gateway implementations, such as Envoy, making it widely portable across Kubernetes clusters.

-* **vLLM Optimized Inference Scheduler** \- IGW defines a pattern for customizable “smart” load-balancing via the [Endpoint Picker Protocol (EPP)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol). Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make “smart” scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to be used out-of-the-box by llm-d users. Advanced teams can also tweak or implement their own scorers and filterers to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.
+* **vLLM Optimized Inference Scheduler** \- IGW defines a pattern for customizable "smart" load-balancing via the [Endpoint Picker Protocol (EPP)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol). Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make "smart" scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to work out-of-the-box for llm-d users. Advanced teams can also tweak or implement their own scorers and filters to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.
  * For more details, see our Northstar: [\[PUBLIC\] llm-d Scheduler Northstar](https://docs.google.com/document/d/1kE1LY8OVjiOgKVD9-9Po96HODbTIbgHp4qgvw06BCOc/edit?tab=t.0)

-* **Disaggregated Serving with [vLLM](https://github.com/vllm-project/vllm)** \- llm-d leverages vLLM’s recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like [NVIDIA’s NIXL](https://github.com/ai-dynamo/nixl).
+* **Disaggregated Serving with [vLLM](https://github.com/vllm-project/vllm)** \- llm-d leverages vLLM's recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like [NVIDIA's NIXL](https://github.com/ai-dynamo/nixl).

-  In llm-d, we plan to support two “well-lit” paths for prefill/decode (P/D) disaggregation:
+  In llm-d, we plan to support two "well-lit" paths for prefill/decode (P/D) disaggregation:
  * Latency optimized implementation using fast interconnects (IB, RDMA, ICI)
  * Throughput optimized implementation using data center networking
  * For more details, see our Northstar: [\[PUBLIC\] llm-d Disaggregated Serving Northstar](https://docs.google.com/document/d/1FNN5snmipaTxEA1FGEeSH7Z_kEqskouKD1XYhVyTHr8/edit?tab=t.0#heading=h.ycwld2oth1kj)
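The sketch below shows the basic decision that disaggregated serving introduces: short prompts run prefill and decode together, while prefill-heavy requests go to a prefill pool first and then hand their KV cache off to a decode pool. The threshold and pool names are invented for illustration; the real handoff happens through vLLM's KV Connector API and the scheduler's filters.

```python
# Illustrative P/D disaggregation decision (not the actual llm-d code path):
# prefill-heavy requests go to a prefill worker first, then hand the KV cache to
# a decode worker; short prompts skip disaggregation entirely.
from dataclasses import dataclass

PREFILL_THRESHOLD_TOKENS = 256  # made-up cutoff; real deployments would tune this

@dataclass
class Plan:
    prefill_pool: str | None  # None means "run prefill and decode on one instance"
    decode_pool: str

def plan_request(prompt_tokens: int) -> Plan:
    """Decide whether a request is worth disaggregating."""
    # Long prompts dominate prefill cost, so only then is the KV-transfer overhead worth paying.
    if prompt_tokens >= PREFILL_THRESHOLD_TOKENS:
        return Plan(prefill_pool="prefill-workers", decode_pool="decode-workers")
    return Plan(prefill_pool=None, decode_pool="decode-workers")

print(plan_request(prompt_tokens=2048))  # Plan(prefill_pool='prefill-workers', decode_pool='decode-workers')
print(plan_request(prompt_tokens=64))    # Plan(prefill_pool=None, decode_pool='decode-workers')
```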
* **Disaggregated Prefix Caching with vLLM** \- llm-d uses the same vLLM KV connector API used in disaggregated serving to provide a pluggable cache for previous calculations, including offloading KVs to host, remote storage, and systems like [LMCache](https://github.com/LMCache/LMCache).

-  In llm-d, we plan to support two “well-lit” paths for KV cache disaggregation:
+  In llm-d, we plan to support two "well-lit" paths for KV cache disaggregation:
  * Independent caching with basic offloading to host memory and disk, providing a zero operational cost mechanism that utilizes all system resources.
  * Shared caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.
  * For more details, see our Northstar: [\[PUBLIC\] llm-d Prefix Caching Northstar](https://docs.google.com/document/d/1d-jKVHpTJ_tkvy6Pfbl3q2FM59NpfnqPAh__Uz_bEZ8/edit?tab=t.0#heading=h.6qazyl873259)
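As a toy illustration of the offloading path (this is not how LMCache or the vLLM KV connector are implemented; the block format, eviction policy, and on-disk layout are all invented here), a two-tier cache keeps hot KV blocks in host memory and spills cold ones to disk:

```python
# Toy two-tier KV-block cache: hot blocks in host memory, cold blocks spilled to
# disk. Purely conceptual -- real offloading manages GPU/host/remote tiers with
# very different data structures and transports.
import os
import pickle
import tempfile
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, max_host_blocks: int = 4, spill_dir: str | None = None):
        self.max_host_blocks = max_host_blocks
        self.host: OrderedDict[str, bytes] = OrderedDict()  # LRU order, oldest first
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="kv-spill-")

    def _spill_path(self, block_id: str) -> str:
        return os.path.join(self.spill_dir, f"{block_id}.pkl")

    def put(self, block_id: str, kv_bytes: bytes) -> None:
        self.host[block_id] = kv_bytes
        self.host.move_to_end(block_id)
        while len(self.host) > self.max_host_blocks:
            evicted_id, evicted = self.host.popitem(last=False)  # evict least recently used
            with open(self._spill_path(evicted_id), "wb") as f:
                pickle.dump(evicted, f)

    def get(self, block_id: str) -> bytes | None:
        if block_id in self.host:                 # host-memory hit
            self.host.move_to_end(block_id)
            return self.host[block_id]
        path = self._spill_path(block_id)
        if os.path.exists(path):                  # disk hit: promote back to host memory
            with open(path, "rb") as f:
                kv = pickle.load(f)
            os.remove(path)
            self.put(block_id, kv)
            return kv
        return None                               # miss: prefill must be recomputed

cache = TieredKVCache(max_host_blocks=2)
for i in range(3):
    cache.put(f"block-{i}", b"fake-kv-tensor-%d" % i)
print(cache.get("block-0") is not None)  # True: served from the disk tier
```

The shared-caching path replaces the local disk tier with remote storage plus a global index, so any replica can locate blocks produced elsewhere.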
@@ -163,7 +149,7 @@ And our key new contributions:

#### Example llm-d Features

-llm-d integrates IGW and vLLM together, enabling a high performance distributed serving stack. Let’s discuss some of the example features enabled by llm-d.
+llm-d integrates IGW and vLLM together, enabling a high-performance distributed serving stack. Let's discuss some of the example features enabled by llm-d.

**Prefix and KV cache-aware routing**
@@ -185,13 +171,13 @@ We conducted a series of experiments to evaluate the performance of the [llm-d-i

* **S2:** llm-d delivers \~50% higher QPS than the baseline while meeting SLO requirements (higher is better).
* **S3:** llm-d sustains 2X the baseline QPS under SLO constraints (higher is better).

-These results show that llm-d’s cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLA requirements.
+These results show that llm-d's cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLA requirements.

Try it out with the `base.yaml` config in our [quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart). And as a customization example, see the [template](https://github.com/llm-d/llm-d-inference-scheduler/blob/main/docs/create_new_filter.md) for adding your own scheduler filter.
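If you want to sanity-check TTFT against your own deployment, the simplest probe measures the time from sending a streaming request until the first completion chunk arrives. Below is a minimal sketch against an OpenAI-compatible endpoint; the gateway URL and model name are placeholders for whatever your quickstart deployment exposes.

```python
# Minimal TTFT probe against an OpenAI-compatible streaming endpoint.
# ENDPOINT and MODEL are placeholders -- point them at the gateway address and
# model served by your deployment.
import json
import time
import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder gateway URL
MODEL = "meta-llama/Llama-3.1-8B-Instruct"         # placeholder model name

def measure_ttft(prompt: str) -> float:
    """Return seconds from request send to the first streamed completion chunk."""
    start = time.monotonic()
    with requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 64, "stream": True},
        stream=True,
        timeout=120,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue  # skip blank lines and SSE keep-alives
            if line == b"data: [DONE]":
                break
            chunk = json.loads(line[len(b"data: "):])
            if chunk.get("choices"):
                return time.monotonic() - start  # first token arrived
    raise RuntimeError("stream ended before any tokens were produced")

if __name__ == "__main__":
    print(f"TTFT: {measure_ttft('Explain prefix caching in one sentence.') * 1000:.1f} ms")
```

Looping this over a set of representative prompts for both your baseline and your llm-d deployment gives a rough, do-it-yourself version of the TTFT comparison above.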
**P/D disaggregation**

-We’ve completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D \+ EP\<\>DP for large scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.
+We've completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D \+ EP\<\>DP for large scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.

Try it out with the `pd-nixl.yaml` config in our [quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart).
@@ -25,13 +21,15 @@ We've hit 1000 ⭐️'s on [GitHub](https://github.com/llm-d/llm-d)

+<!-- truncate -->

**Here are some of the active design conversations:**

:::tip Join our Google Group
We use Google Groups to share architecture diagrams and other content. Please join: [llm-d-contributors Google Group](https://groups.google.com/g/llm-d-contributors)