93 changes: 48 additions & 45 deletions blog/2025-05-20_News.md

Large diffs are not rendered by default.

82 changes: 42 additions & 40 deletions blog/2025-05-20_announce.md

Large diffs are not rendered by default.

7 changes: 5 additions & 2 deletions blog/2025-06-03_week_1_round_up.md
@@ -5,7 +5,7 @@ slug: llm-d-week-1-round-up

authors:
- petecheslock

tags: [news]

hide_table_of_contents: false
@@ -54,4 +54,7 @@ We use Google Groups to share architecture diagrams and other content. Please jo
* [LinkedIn](http://linkedin.com/company/llm-d)
* [@\_llm\_d\_](https://twitter.com/_llm_d_)
* [r/llm\_d](https://www.reddit.com/r/llm_d/)
* YouTube - coming soon

<script data-goatcounter="https://llm-d-tracker.asgharlabs.io/count"
async src="//llm-d-tracker.asgharlabs.io/count.js"></script>
3 changes: 3 additions & 0 deletions blog/2025-06-25_community_update.md
@@ -75,3 +75,6 @@ There are many ways to contribute to llm-d:
6. Check out our [Contributor Guidelines](https://llm-d.ai/docs/community/contribute) to start contributing code

We're looking forward to hearing from you and working together to make llm-d even better!

<script data-goatcounter="https://llm-d-tracker.asgharlabs.io/count"
async src="//llm-d-tracker.asgharlabs.io/count.js"></script>
63 changes: 33 additions & 30 deletions blog/2025-07-29_llm-d-v0.2-our-first-well-lit-paths.md
@@ -27,8 +27,8 @@ Our deployments have been tested and benchmarked on recent GPUs, such as H200 no

We’ve defined and improved three well-lit paths that form the foundation of this release:

* [**Intelligent inference scheduling over any vLLM deployment**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/inference-scheduling): precise prefix-cache-aware routing with no additional infrastructure, out-of-the-box load-aware scheduling for better tail latency that “just works”, and a new configurable scheduling profile system let teams see immediate latency wins while still customizing scheduling behavior for their workloads and infrastructure.
* [**P/D disaggregation**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/pd-disaggregation): support for separating prefill and decode workloads to improve latency and GPU utilization for long-context scenarios.
* [**Wide expert parallelism for DeepSeek R1 (EP/DP)**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/wide-ep-lws): support for large-scale multi-node deployments using expert and data parallelism patterns for MoE models. This includes optimized deployments leveraging NIXL+UCX for inter-node communication, with fixes and improvements to reduce latency, and demonstrates the use of LeaderWorkerSet for Kubernetes-native inference orchestration.
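
For a rough sense of how a well-lit path is selected and tuned, the fragment below sketches a Helm-values-style configuration for the inference-scheduling path. The field names are invented for illustration and do not reflect the actual chart schema; the linked quickstart contains the real values files.

```yaml
# Hypothetical values fragment for the inference-scheduling path.
# Field names are illustrative only; consult the quickstart's values files
# for the real schema.
modelService:
  model: meta-llama/Llama-3.1-8B-Instruct   # placeholder model
  replicas: 4
scheduler:
  profile: default            # a configurable scheduling profile
  scorers:
    - prefixCacheAware        # prefer pods likely to already hold the prefix in KV cache
    - loadAware               # spread load to protect tail latency
```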

All of these scenarios are reproducible: we provide reference hardware specs, workloads, and benchmarking harness support, so others can evaluate, reproduce, and extend these benchmarks easily. This also reflects improvements to our deployment tooling and benchmarking framework, a new "machinery" that allows users to set up, test, and analyze these scenarios consistently.
@@ -47,9 +47,9 @@ We've refactored the deployer into a Helm-first, modular structure, splitting ch

The path for Prefill/Decode (P/D) disaggregation and multi-node DP/EP MoE deployments is now more clearly defined and tested. This work integrates and optimizes key [vLLM 0.10.0](https://github.com/vllm-project/vllm/releases/tag/v0.10.0) kernel improvements, including DeepGEMM and CUTLASS for expert-parallel compute and the PPLX and DeepEP kernels, along with fixes and optimizations to intra- and inter-node communication in multi-node scenarios. We now include:

* Kubernetes-native deployment recipes now support API servers per DP rank for one-pod-per-rank placement, enhancing scalability and control
* Helm charts are updated to support LeaderWorkerSet (LWS) for multi-node setups and direct one-pod-per-DP-rank deployments
* Optimized intra-node communication by enabling DeepEP to use cuda\_ipc efficiently
* Enhanced NIXL+UCX performance, with fixes and optimizations that significantly reduce inter-node communication overhead, particularly for long context workloads
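
To make the one-pod-per-DP-rank pattern easier to picture, here is a minimal LeaderWorkerSet sketch. It is an illustration under stated assumptions, not our quickstart manifest: the resource layout follows the LeaderWorkerSet v1 API, while the image, model name, and vLLM flags are placeholders, and the per-rank coordination settings a real deployment needs are omitted.

```yaml
# Illustrative sketch of one-pod-per-DP-rank serving with LeaderWorkerSet.
# Image, model, and flags are placeholders; per-rank coordination flags
# (rank index, DP coordinator address, ports) are omitted. See the
# wide-ep-lws quickstart for the tested manifests.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: wide-ep-example
spec:
  replicas: 1                # one leader/worker group
  leaderWorkerTemplate:
    size: 8                  # 8 pods per group -> one pod per DP rank
    workerTemplate:          # a distinct leaderTemplate can also be set
      spec:
        containers:
          - name: vllm
            image: vllm/vllm-openai:latest          # placeholder image
            args:
              - --model=deepseek-ai/DeepSeek-R1     # placeholder model
              - --data-parallel-size=8
              - --enable-expert-parallel
            resources:
              limits:
                nvidia.com/gpu: "1"                 # one GPU per rank
```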

These validated scenarios are backed by benchmark baselines and example deployments via our quickstarts, offering clearer guidance on what works well today. As part of the "well-lit path," we have also identified limitations, including known edge cases around response sizes and failure modes where more work is required.
@@ -84,9 +84,9 @@ Multi-arch support, smaller images, and hardened configurations ensure a reliabl

Here are some key lessons we learned so far in our progress with llm-d:

* **Low-hanging fruit matters.** Targeted optimizations, like reducing KV‑cache transfer overhead between prefill and decode workers and refining prefix‑aware scheduling, delivered significant gains in throughput and tail latency. These quick wins required minimal change but paved the way for the deeper architectural improvements planned in upcoming releases.
* **Using bleeding-edge libraries is hard.** Many key libraries associated with distributed inference are immature. Through our applied experiments in our well-lit paths and in close collaboration with ecosystem partners, we have improved much of the key infrastructure the larger community relies on in real-world conditions.
* **Build on proven paths.** This validates why llm-d exists: to help users avoid discovering these problems themselves, offering reproducible deployments, performance baselines, and extensibility. llm-d focuses on building these paths so our users don’t need to troubleshoot these complex challenges in isolation.
* **Community matters.** Working closely with the NVIDIA Dynamo community, we've tackled NIXL/UCX performance overheads for long context workloads, leading to significant improvements and active upstream contributions.

### Our survey
@@ -99,10 +99,10 @@ Conversational AI (82.9%) and real-time applications (56.1%) stood out as the mo

Today, [llm-d 0.2](https://github.com/llm-d/llm-d/releases/tag/v0.2.0) offers:

* Modular Helm charts and clear deployment workflows.
* Verified support for P/D, DP/EP, pod-per-rank, and heterogeneous GPUs (H200, B200).
* Reproducible performance baselines, now with MoE support.
* New foundations for routing and scheduler extensibility.
* A developer- and researcher-friendly platform with tested examples, and detailed guides on the way.

## A growing community
@@ -111,31 +111,31 @@ The best part of llm-d has been watching the community grow around it. We're thr

Much of the work happens within our seven Special Interest Groups (SIGs), each focused on a key area:

* **Inference Scheduler** – Developing smarter routing and load‑balancing strategies, including KV‑cache‑aware scheduling.
* **P/D Disaggregation** – Advancing phase‑separation strategies to improve resource‑utilization efficiency.
* **KV Disaggregation** – Advancing and optimizing distributed KV‑cache management.
* **Installation** – Streamlining deployment on Kubernetes, from single‑node setups to large multi‑node clusters.
* **Benchmarking** – Building tools to automate performance validation and make scenarios easier to reproduce and extend.
* **Autoscaling** – Adapting resources dynamically based on workload demands.
* **Observability** – Providing deep visibility into system performance and health.

We're also collaborating with other great open-source communities like vLLM, Dynamo, and LMCache. Every one of these groups is open, and we’d love for you to join in. Whether you want to contribute code, share ideas, or just listen in, you are welcome. You can find details for each SIG, including their leaders and meeting times, on [our community page](https://llm-d.ai/docs/community/sigs).

## What's next:

Looking ahead, our community is focusing on these key areas:

* **Core optimizations**
  * TCP-based request dispatch upstream
  * Disaggregation protocol refinements, including possible sidecar removal
  * CPU cache offloading to expand memory capacity
  * KV event awareness baked directly into routing decisions
  * SLO-driven scheduling architecture for predictable performance
* **Benchmarking enhancements:**
  * Expanded reproducibility guides
  * Complete performance validation for core scenarios
* **Developer experience improvements:**
  * Expanded examples for inference gateway and scheduler extensibility
  * Central Helm charts and expanded documentation

See our [roadmap issue](https://github.com/llm-d/llm-d/issues/146) for what's coming next and make your voice heard!
@@ -149,3 +149,6 @@ Community engagement is key to our success:
* [**Join our community calls**](https://red.ht/llm-d-public-calendar) (Wed 12:30pm ET)

Contribute on [GitHub](https://github.com/llm-d), join our community calls and SIGs, and build with us!

<script data-goatcounter="https://llm-d-tracker.asgharlabs.io/count"
async src="//llm-d-tracker.asgharlabs.io/count.js"></script>
2 changes: 2 additions & 0 deletions docs/community/contact_us.md
Expand Up @@ -20,3 +20,5 @@ You can also find us on
- [**LinkedIn:** https://linkedin.com/company/llm-d ](https://linkedin.com/company/llm-d)
- [**X:** https://x.com/\_llm_d\_](https://x.com/_llm_d_)

<script data-goatcounter="https://llm-d-tracker.asgharlabs.io/count"
async src="//llm-d-tracker.asgharlabs.io/count.js"></script>
3 changes: 3 additions & 0 deletions src/components/Welcome/index.js
@@ -37,6 +37,9 @@ export default function Welcome() {
for most models across a diverse and comprehensive set of hardware accelerators.
</p>

<script data-goatcounter="https://llm-d-tracker.asgharlabs.io/count"
async src="//llm-d-tracker.asgharlabs.io/count.js"></script>

</div>

</div>