Added a simple GDPR-compliant tracker so we can see hits and engagement on the main hosted sites. This is ideally a temporary solution until we find a more robust system.
Signed-off-by: JJ Asghar <[email protected]>
blog/2025-07-29_llm-d-v0.2-our-first-well-lit-paths.md (33 additions, 30 deletions)
@@ -27,8 +27,8 @@ Our deployments have been tested and benchmarked on recent GPUs, such as H200 no
We’ve defined and improved three well-lit paths that form the foundation of this release:
* [**Intelligent inference scheduling over any vLLM deployment**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/inference-scheduling): support for precise prefix-cache-aware routing with no additional infrastructure, out-of-the-box load-aware scheduling for better tail latency that “just works”, and a new configurable scheduling profile system that enables teams to see immediate latency wins while still customizing scheduling behavior for their workloads and infrastructure (a conceptual sketch of this routing idea follows this list).
* [**P/D disaggregation**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/pd-disaggregation): support for separating prefill and decode workloads to improve latency and GPU utilization for long-context scenarios.
* [**Wide expert parallelism for DeepSeek R1 (EP/DP)**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/wide-ep-lws): support for large-scale multi-node deployments using expert and data parallelism patterns for MoE models. This includes optimized deployments leveraging NIXL+UCX for inter-node communication, with fixes and improvements to reduce latency, and demonstrates the use of LeaderWorkerSet for Kubernetes-native inference orchestration.
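To make the first path a bit more concrete, here is a deliberately simplified, purely illustrative sketch of the idea behind prefix-cache-aware, load-aware routing. The replica fields, block hashing, and scoring below are assumptions made for illustration only; they are not llm-d's scheduler code or its configuration format.

```python
# Illustrative only: NOT llm-d's scheduler, just the routing idea in miniature.
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    inflight_requests: int = 0                          # hypothetical load signal
    cached_prefixes: set = field(default_factory=set)   # hashes of cached prompt blocks

def prefix_blocks(prompt: str, block_size: int = 64) -> list:
    """Hash fixed-size prompt prefixes, mimicking block-level prefix caching."""
    return [hash(prompt[: (i + 1) * block_size]) for i in range(len(prompt) // block_size)]

def pick_replica(replicas: list, prompt: str) -> Replica:
    """Prefer the replica with the longest cached prefix; break ties by lower load."""
    def score(replica: Replica):
        hits = 0
        for block in prefix_blocks(prompt):
            if block not in replica.cached_prefixes:
                break                                   # prefix reuse stops at the first miss
            hits += 1
        return (hits, -replica.inflight_requests)
    return max(replicas, key=score)

# Example: the replica that already holds the shared system-prompt prefix wins.
system_prompt = "You are a helpful assistant. " * 10
cold = Replica("vllm-0", inflight_requests=1)
warm = Replica("vllm-1", inflight_requests=3,
               cached_prefixes=set(prefix_blocks(system_prompt)))
print(pick_replica([cold, warm], system_prompt + "What is llm-d?").name)  # vllm-1
```

In llm-d, decisions of this kind are made by the inference scheduler over any vLLM deployment, with the actual signals and policies configurable through the scheduling profile system mentioned above.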
All of these scenarios are reproducible: we provide reference hardware specs, workloads, and benchmarking harness support, so others can evaluate, reproduce, and extend these benchmarks easily. This also reflects improvements to our deployment tooling and benchmarking framework, a new "machinery" that allows users to set up, test, and analyze these scenarios consistently.
@@ -47,9 +47,9 @@ We've refactored the deployer into a Helm-first, modular structure, splitting ch
The path for Prefill/Decode (P/D) disaggregation and multi-node DP/EP MoE deployments is now more clearly defined and tested. This work integrates and optimizes key [vLLM 0.10.0](https://github.com/vllm-project/vllm/releases/tag/v0.10.0) kernel improvements, including DeepGEMM and CUTLASS for expert-parallel compute as well as the PPLX and DeepEP kernels, along with intra- and inter-node communication fixes and optimizations for multi-node scenarios. We now include the following (a conceptual sketch of the prefill/decode split appears after this list):
* Kubernetes-native deployment recipes now support API servers per DP rank for one-pod-per-rank placement, enhancing scalability and control
* Helm charts are updated to support LeaderWorkerSet (LWS) for multi-node setups and direct one-pod-per-DP-rank deployments
* Optimized intra-node communication by enabling DeepEP to use cuda\_ipc efficiently
* Enhanced NIXL+UCX performance, with fixes and optimizations that significantly reduce inter-node communication overhead, particularly for long context workloads
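To tie these items back to the prefill/decode split itself, here is a small, purely conceptual sketch; it is not llm-d's implementation, and the class and method names are invented for illustration. The point is that prefill is a one-shot, compute-heavy pass over the whole prompt that produces the KV cache, while decode reuses that cache to emit tokens one at a time and is latency-sensitive, so the two phases can be placed and scaled independently.

```python
# Conceptual sketch only: not llm-d code. In llm-d the KV-cache hand-off between
# workers rides on NIXL/UCX; here it is just a plain Python object.
from dataclasses import dataclass

@dataclass
class KVHandle:
    """Stand-in for a reference to a transferred KV cache."""
    request_id: str
    num_prompt_tokens: int

class PrefillWorker:
    def prefill(self, request_id: str, prompt_tokens: list) -> KVHandle:
        # One compute-heavy forward pass over the full prompt builds the KV cache.
        return KVHandle(request_id, len(prompt_tokens))

class DecodeWorker:
    def decode(self, kv: KVHandle, max_new_tokens: int) -> list:
        # Reuses the transferred KV cache and generates tokens one at a time,
        # so it can run on replicas kept lightly loaded for low inter-token latency.
        return [f"token_{i}" for i in range(max_new_tokens)]

# A long-context request: the expensive prefill lands on throughput-oriented GPUs,
# while decode is served separately to protect tail latency.
kv = PrefillWorker().prefill("req-1", ["tok"] * 32_000)
print(len(DecodeWorker().decode(kv, max_new_tokens=8)))  # 8
```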
These validated scenarios are backed by benchmark baselines and example deployments via our quickstarts, offering clearer guidance on what works well today. As part of the "well-lit path", we have also identified limitations, including known edge cases around response sizes and failure modes where more work is required.
@@ -84,9 +84,9 @@ Multi-arch support, smaller images, and hardened configurations ensure a reliabl
Here are some key lessons we learned so far in our progress with llm-d:
* **Low-hanging fruit matters.** Targeted optimizations, like reducing KV‑cache transfer overhead between prefill and decode workers and refining prefix‑aware scheduling, delivered significant gains in throughput and tail latency. These quick wins required minimal change but paved the way for the deeper architectural improvements planned in upcoming releases.
* **Using bleeding-edge libraries is hard.** Many key libraries associated with distributed inference are immature. Through our applied experiments in our well-lit paths and in close collaboration with ecosystem partners, we have improved much of the key infrastructure the larger community relies on in real-world conditions.
* **Build on proven paths.** This validates why llm-d exists: to help users avoid discovering these problems themselves, offering reproducible deployments, performance baselines, and extensibility. llm-d focuses on building these paths so our users don’t need to troubleshoot these complex challenges in isolation.
* **Community matters.** Working closely with the NVIDIA Dynamo community, we've tackled NIXL/UCX performance overheads for long context workloads, leading to significant improvements and active upstream contributions.
### Our survey
@@ -99,10 +99,10 @@ Conversational AI (82.9%) and real-time applications (56.1%) stood out as the mo
* Modular Helm charts and clear deployment workflows.
* Verified support for P/D, DP/EP, pod-per-rank, and heterogeneous GPUs (H200, B200).
* Reproducible performance baselines, now with MoE support.
* New foundations for routing and scheduler extensibility.
* A developer- and researcher-friendly platform with tested examples, and detailed guides on the way.
## A growing community
@@ -111,31 +111,31 @@ The best part of llm-d has been watching the community grow around it. We're thr
Much of the work happens within our seven Special Interest Groups (SIGs), each focused on a key area:
* **Inference Scheduler** – Developing smarter routing and load‑balancing strategies, including KV‑cache‑aware scheduling.
* **P/D Disaggregation** – Advancing phase‑separation strategies to improve resource‑utilization efficiency.
* **KV Disaggregation** – Advancing and optimizing distributed KV‑cache management.
* **Installation** – Streamlining deployment on Kubernetes, from single‑node setups to large multi‑node clusters.
* **Benchmarking** – Building tools to automate performance validation and make scenarios easier to reproduce and extend.
* **Autoscaling** – Adapting resources dynamically based on workload demands.
* **Observability** – Providing deep visibility into system performance and health.
We're also collaborating with other great open-source communities like vLLM, Dynamo, and LMCache. Every one of these groups is open, and we’d love for you to join in. Whether you want to contribute code, share ideas, or just listen in, you are welcome. You can find details for each SIG, including their leaders and meeting times, on [our community page](https://llm-d.ai/docs/community/sigs).
## What's next
Looking ahead, our community is focusing on these key areas:
* **Core optimizations**
  * TCP-based request dispatch upstream
  * Disaggregation protocol refinements, including possible sidecar removal
  * CPU cache offloading to expand memory capacity
  * KV event awareness baked directly into routing decisions
  * SLO-driven scheduling architecture for predictable performance
* **Benchmarking enhancements:**
  * Expanded reproducibility guides.
  * Complete performance validation for core scenarios.
* **Developer experience improvements:**
  * Expanded examples for inference gateway and scheduler extensibility.
  * Central Helm charts and expanded documentation.
See our [roadmap issue](https://github.com/llm-d/llm-d/issues/146) for what is coming next and make your voice heard!
@@ -149,3 +149,6 @@ Community engagement is key to our success:
* [**Join our community calls**](https://red.ht/llm-d-public-calendar) (Wed 12:30pm ET)
Contribute on [GitHub](https://github.com/llm-d), join our community calls, join the SIGs, and build with us!