
Commit b3a3188

Collapse authors to the yaml configuration and create a June update blog post (#51)
* Refactor author information in blog posts and fix typos
  - Updated author entries in `2025-05-20_announce.md`, `2025-06-03_week_1_round_up.md`, and `authors.yml` to use usernames instead of full details.
  - Corrected minor typographical errors in various blog posts for consistency and clarity.

  Signed-off-by: Pete Cheslock <[email protected]>

* Add June 2025 Community Update and new tags
  - Introduced a new blog post titled "llm-d Community Update - June 2025" detailing project progress, new YouTube channel, and ways to get involved.
  - Added new tags for community engagement, updates, and SIG-Benchmarking in `tags.yml`.

  Signed-off-by: Pete Cheslock <[email protected]>

* Move the reminder down - and bring the survey higher up.

  Signed-off-by: Pete Cheslock <[email protected]>

* Add June 2025 Community Update
  - Introduced a new blog post detailing the latest updates from the llm-d community, including a call for participation in a performance benchmarks survey, the launch of a new YouTube channel, and ways to get involved with the project.

  Signed-off-by: Pete Cheslock <[email protected]>

* Add new blog post for June 2025 Community Update
  - Introduced a comprehensive update detailing project progress, a call for participation in a community survey, the launch of a new YouTube channel, and various ways for community members to get involved with the llm-d project.

  Signed-off-by: Pete Cheslock <[email protected]>

* Remove date as a header

  Signed-off-by: Pete Cheslock <[email protected]>

* Remove duplicate H1 header

  Signed-off-by: Pete Cheslock <[email protected]>

---------

Signed-off-by: Pete Cheslock <[email protected]>
1 parent 97bba56 commit b3a3188

File tree

6 files changed, +121 -39 lines changed
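Only three of the six changed files appear in the view below; the `authors.yml` and `tags.yml` updates named in the commit message are not shown here. As a rough sketch of the kind of entries the commit's new tags might add to a Docusaurus `blog/tags.yml` (the keys, labels, permalinks, and descriptions are illustrative assumptions, not taken from the commit):

```yaml
# Hypothetical blog/tags.yml entries -- the actual keys and values are not shown in this commit view.
community-update:
  label: Community Update
  permalink: /community-update
  description: Periodic progress updates from the llm-d community

sig-benchmarking:
  label: SIG-Benchmarking
  permalink: /sig-benchmarking
  description: Posts from the llm-d benchmarking special interest group
```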

blog/2025-05-20_News.md

Lines changed: 0 additions & 1 deletion
@@ -3,7 +3,6 @@ title: llm-d Press Release
 description: Official Press Release for llm-d
 slug: llm-d-press-release
 
-
 tags: [news]
 
 hide_table_of_contents: false

blog/2025-05-20_announce.md

Lines changed: 18 additions & 32 deletions
@@ -5,23 +5,9 @@ slug: llm-d-announce
 date: 2025-05-20T08:00
 
 authors:
-  - name: Robert Shaw
-    title: Director of Engineering, Red Hat
-    url: https://github.com/robertgshaw2-redhat
-    image_url: https://avatars.githubusercontent.com/u/114415538?v=4
-
-
-  - name: Clayton Coleman
-    title: Distinguished Engineer, Google
-    url: https://github.com/smarterclayton
-    image_url: https://avatars.githubusercontent.com/u/1163175?v=4
-
-
-  - name: Carlos Costa
-    title: Distinguished Engineer, IBM
-    url: https://github.com/chcost
-    image_url: https://avatars.githubusercontent.com/u/26551701?v=4
-
+  - robshaw
+  - smarterclayton
+  - chcost
 
 tags: [hello, welcome, llm-d]
 hide_table_of_contents: false
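The username keys above resolve against the blog's shared author registry. A minimal sketch of the corresponding `blog/authors.yml` entries, assuming the standard Docusaurus global-authors format and reusing the fields from the removed frontmatter (the actual file contents are not part of this view):

```yaml
# Sketch of the blog/authors.yml entries implied by the frontmatter change above.
# Field values are copied from the removed per-post frontmatter; the real file is not shown in this commit view.
robshaw:
  name: Robert Shaw
  title: Director of Engineering, Red Hat
  url: https://github.com/robertgshaw2-redhat
  image_url: https://avatars.githubusercontent.com/u/114415538?v=4

smarterclayton:
  name: Clayton Coleman
  title: Distinguished Engineer, Google
  url: https://github.com/smarterclayton
  image_url: https://avatars.githubusercontent.com/u/1163175?v=4

chcost:
  name: Carlos Costa
  title: Distinguished Engineer, IBM
  url: https://github.com/chcost
  image_url: https://avatars.githubusercontent.com/u/26551701?v=4
```

Keeping these details in one place means each post's frontmatter only needs the key, which is exactly what the three `+` lines above switch to.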
@@ -57,11 +43,11 @@ The LLM inference workload, however, is unique with slow, non-uniform, expensive
 
 ![Figure 2: Comparison of modern HTTP requests](../docs/assets/images/image7_33.png)
 
-Lets take a look at each one step-by-step:
+Let's take a look at each one step-by-step:
 
 *A. Requests are expensive with significant variance in resource utilization.*
 
-* Each LLM inference request has a different shape to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.
+* Each LLM inference request has a different "shape" to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.
 * RAG has long inputs \- prompt and retrieved docs \- and short generated outputs
 * Reasoning has a short or medium inputs and long generated outputs
 
@@ -71,13 +57,13 @@ Let’s take a look at each one step-by-step:
 
 *B. Routing to specific replicas with cached prior computation can achieve orders of magnitude better latency.*
 
-* Many common LLM workloads have multi-turn request patterns, where the same prompt is sent iteratively to the same instance.
+* Many common LLM workloads have "multi-turn" request patterns, where the same prompt is sent iteratively to the same instance.
 * Agentic (tool calls are iterative request flow)
 * Code completion task (requests reuse current codebase as context)
 
 ![The agentic pattern sequence](../docs/assets/images/image8_0.jpg)
 
-* LLM inference servers like vLLM implement a method called automatic prefix caching, which enables skipping a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.
+* LLM inference servers like vLLM implement a method called "automatic prefix caching", which enables "skipping" a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.
 
 ![The prefix aching method](../docs/assets/images/image3.jpg)
 
@@ -97,16 +83,16 @@ Let’s take a look at each one step-by-step:
 *D. Production deployments often have a range of quality of service (QoS) requirements.*
 
 * Use cases for a single LLM endpoint can have a wide variety of quality of service requirements. Consider the following examples:
-* Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an in the loop experience. O(ms) latency tolerance.
+* Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an "in the loop" experience. O(ms) latency tolerance.
 * Latency is important: Chat agent sessions and email drafting with interactive use cases. O(seconds) latency tolerance.
-* Latency tolerant: Video call and email summarization and deep research agents with daily or hourly usage patterns. O(minutes) latency tolerance.
+* Latency tolerant: Video call and email summarization and "deep research" agents with daily or hourly usage patterns. O(minutes) latency tolerance.
 * Latency agnostic: Overnight batch processing workloads, meeting minute generation, and autonomous agents. O(hours) latency tolerance.
 
 * Given the compute intensity (and, therefore, high costs) of LLMs, tight latency SLOs are substantially more expensive to achieve. This spectrum of latency requirements presents an opportunity to further optimize infrastructure efficiency – the more latency tolerant a workload is, the more we can optimize infrastructure efficiency amongst other workloads.
 
 ### Why llm-d?
 
-To exploit these characteristics and achieve optimal performance for LLM workloads, the inference serving landscape is rapidly transitioning towards distributed cluster-scale architectures. For instance, in its Open Source Week, the DeepSeek team published the design of its [inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which aggressively leverages disaggregation and KV caching to achieve remarkable performance per $ of compute.
+To exploit these characteristics and achieve optimal performance for LLM workloads, the inference serving landscape is rapidly transitioning towards distributed cluster-scale architectures. For instance, in its "Open Source Week", the DeepSeek team published the design of its [inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which aggressively leverages disaggregation and KV caching to achieve remarkable performance per $ of compute.
 
 However, for most GenAI innovators, ML platform teams, and IT operations groups, these benefits remain out of reach. Building and operating a complex, monolithic system is time-consuming and challenging, especially in the context of the rapid pace of innovation and enterprise deployments with tens or hundreds of models for divergent use cases. This complexity risks time to market, higher operational costs and sprawl, and difficulty adopting and experimenting.
 
@@ -129,25 +115,25 @@ To achieve this objective, we designed llm-d with a modular and layered architec
 
 * [**Kubernetes**](https://kubernetes.io/docs/home/) **(K8s)**. K8s is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications. It is the industry standard for deploying and updating LLM inference engines across various hardware accelerators.
 
-* [**Inference Gateway**](https://gateway-api-inference-extension.sigs.k8s.io/) **(IGW)**. IGW is an official Kubernetes project that extends the [Gateway API](https://gateway-api.sigs.k8s.io/) (the next generation of Kubernetes Ingress and Load Balancing API) with inference-specific routing. IGW includes many important features like model routing, serving priority, and extensible scheduling logic for smart load balancing. IGW integrates with many different gateway implementations, such as Envoy, making it widely portable across Kubernetes clusters.
+* [**Inference Gateway**](https://gateway-api-inference-extension.sigs.k8s.io/) **(IGW)**. IGW is an official Kubernetes project that extends the [Gateway API](https://gateway-api.sigs.k8s.io/) (the next generation of Kubernetes Ingress and Load Balancing API) with inference-specific routing. IGW includes many important features like model routing, serving priority, and extensible scheduling logic for "smart" load balancing. IGW integrates with many different gateway implementations, such as Envoy, making it widely portable across Kubernetes clusters.
 
 ![](../docs/assets/images/llm-d-arch-simplified.svg)
 
 And our key new contributions:
 
-* **vLLM Optimized Inference Scheduler** \- IGW defines a pattern for customizable smart load-balancing via the [Endpoint Picker Protocol (EPP)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol). Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make smart scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to be used out-of-the-box by llm-d users. Advanced teams can also tweak or implement their own scorers and filterers to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.
+* **vLLM Optimized Inference Scheduler** \- IGW defines a pattern for customizable "smart" load-balancing via the [Endpoint Picker Protocol (EPP)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol). Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make "smart" scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to be used out-of-the-box by llm-d users. Advanced teams can also tweak or implement their own scorers and filterers to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.
 * For more details, see our Northstar: [\[PUBLIC\] llm-d Scheduler Northstar](https://docs.google.com/document/d/1kE1LY8OVjiOgKVD9-9Po96HODbTIbgHp4qgvw06BCOc/edit?tab=t.0)
 
-* **Disaggregated Serving with [vLLM](https://github.com/vllm-project/vllm) \-** llm-d leverages vLLMs recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like [NVIDIAs NIXL](https://github.com/ai-dynamo/nixl).
+* **Disaggregated Serving with [vLLM](https://github.com/vllm-project/vllm) \-** llm-d leverages vLLM's recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like [NVIDIA's NIXL](https://github.com/ai-dynamo/nixl).
 
-In llm-d, we plan to support two well-lit paths for prefill/decode (P/D) disaggregation:
+In llm-d, we plan to support two "well-lit" paths for prefill/decode (P/D) disaggregation:
 * Latency optimized implementation using fast interconnects (IB, RDMA, ICI)
 * Throughput optimized implementation using data center networking
 * For more details, see our Northstar:[\[PUBLIC\] llm-d Disaggregated Serving Northstar](https://docs.google.com/document/d/1FNN5snmipaTxEA1FGEeSH7Z_kEqskouKD1XYhVyTHr8/edit?tab=t.0#heading=h.ycwld2oth1kj)
 
 * **Disaggregated Prefix Caching with vLLM** \- llm-d uses the same vLLM KV connector API used in disaggregated serving to provide a pluggable cache for previous calculations, including offloading KVs to host, remote storage, and systems like [LMCache](https://github.com/LMCache/LMCache).
 
-In llm-d, we plan to support two well-lit paths for KV cache disaggregation:
+In llm-d, we plan to support two "well-lit" paths for KV cache disaggregation:
 * Independent caching with basic offloading to host memory and disk, providing a zero operational cost mechanism that utilizes all system resources
 * Shared caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.
 * For more details, see our Northstar: [\[PUBLIC\] llm-d Prefix Caching Northstar](https://docs.google.com/document/d/1d-jKVHpTJ_tkvy6Pfbl3q2FM59NpfnqPAh__Uz_bEZ8/edit?tab=t.0#heading=h.6qazyl873259)
@@ -163,7 +149,7 @@ And our key new contributions:
 
 #### Example llm-d Features
 
-llm-d integrates IGW and vLLM together, enabling a high performance distributed serving stack. Lets discuss some of the example features enabled by llm-d.
+llm-d integrates IGW and vLLM together, enabling a high performance distributed serving stack. Let's discuss some of the example features enabled by llm-d.
 
 **Prefix and KV cache-aware routing**
 
@@ -185,13 +171,13 @@ We conducted a series of experiments to evaluate the performance of the [llm-d-i
 * **S2:** llm-d delivers \~50% higher QPS than the baseline while meeting SLO requirements (higher is better).
 * **S3:** llm-d sustains 2X the baseline QPS under SLO constraints (higher is better).
 
-These results show that llm-ds cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLA requirements.
+These results show that llm-d's cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLA requirements.
 
 Try it out with the \`base.yaml\` config in our [quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart). And as a customization example, see the [template](https://github.com/llm-d/llm-d-inference-scheduler/blob/main/docs/create_new_filter.md) for adding your own scheduler filter.
 
 **P/D disaggregation**
 
-Weve completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D \+ EP\<\>DP for large scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.
+We've completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D \+ EP\<\>DP for large scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.
 
 Try it out with the pd-nixl.yaml config in our [quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart).
 
blog/2025-06-03_week_1_round_up.md

Lines changed: 3 additions & 5 deletions
@@ -4,11 +4,7 @@ description: Latest news and active design discussions from the llm-d project te
 slug: llm-d-week-1-round-up
 
 authors:
-  - name: Pete Cheslock
-    title: AI Community Architect, Red Hat
-    url: https://github.com/petecheslock
-    image_url: https://avatars.githubusercontent.com/u/511733?v=4
-
+  - petecheslock
 
 tags: [news]
 
@@ -25,13 +21,15 @@ We've hit 1000 ⭐️'s on [GitHub](https://github.com/llm-d/llm-d)
 
 ![llm-d Star Chart](../docs/assets/images/star-history-202563.png)
 
+<!-- truncate -->
 
 **Here are some of the active design conversations:**
 
 :::tip Join our Google Group
 We use Google Groups to share architecture diagrams and other content. Please join: [llm-d-contributors Google Group](https://groups.google.com/g/llm-d-contributors)
 :::
 
+
 * [2025-06-01 \[PUBLIC\] llm-d KVTransfer Protocol](https://docs.google.com/document/d/1zBkToR9XWjvBYLxu15JeoGpq16nH5sFFensZP_3lJQU/view)
 * [Revisiting The InferenceModel API](https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/view)
 * [ModelService: Declarative Inference Serving on llm-d](https://docs.google.com/document/d/1HA-2yNZpc1F4KhyeYA30shjZpYEDqGIJXqVgDVv3SWU/view)
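Two notes on the hunks above. The `<!-- truncate -->` marker added in the second hunk is Docusaurus's blog excerpt delimiter, so only the content above it appears in the blog index preview. And as with the announcement post, the author details removed in the first hunk are expected to resolve through `blog/authors.yml`; a sketch of that entry, again reusing the removed frontmatter fields (the file itself is not shown in this view):

```yaml
# Sketch of the blog/authors.yml entry implied by the frontmatter change above;
# the actual authors.yml contents are not part of this commit view.
petecheslock:
  name: Pete Cheslock
  title: AI Community Architect, Red Hat
  url: https://github.com/petecheslock
  image_url: https://avatars.githubusercontent.com/u/511733?v=4
```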
