From a1cca7a28e3f4fc3f2a0625edb63a532916ff326 Mon Sep 17 00:00:00 2001
From: JJ Asghar
Date: Wed, 30 Jul 2025 11:47:33 -0500
Subject: [PATCH] Added "goat tracker"
Added a simple GDPR-compliant tracker so we can see hits and
engagements on the main hosted sites. This is intended as a
temporary solution until we find a more robust system.
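
For reviewers unfamiliar with the tool: assuming the "goat tracker" is a
GoatCounter-style script, the sketch below shows one way such an embed can be
wired into a React component like src/components/Welcome/index.js. The
useGoatCounter hook and the "llm-d" site code are illustrative placeholders,
not the exact lines added by this patch.

```js
// Illustrative sketch only, assuming a GoatCounter-style tracker; the hook name
// and the "llm-d" site code are placeholders, not the exact lines in this patch.
import { useEffect } from 'react';

export function useGoatCounter() {
  useEffect(() => {
    // GoatCounter loads from one async script tag and reports the page view to
    // the configured endpoint; it sets no cookies.
    const script = document.createElement('script');
    script.async = true;
    script.src = 'https://gc.zgo.at/count.js';
    script.dataset.goatcounter = 'https://llm-d.goatcounter.com/count'; // placeholder account
    document.head.appendChild(script);
    return () => {
      document.head.removeChild(script); // drop the tag if the component unmounts
    };
  }, []);
}
```

If this pattern is used, calling useGoatCounter() near the top of the Welcome
component is enough for basic page counting.
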
Signed-off-by: JJ Asghar
---
blog/2025-05-20_News.md | 93 ++++++++++---------
blog/2025-05-20_announce.md | 82 ++++++++--------
blog/2025-06-03_week_1_round_up.md | 7 +-
blog/2025-06-25_community_update.md | 3 +
...-29_llm-d-v0.2-our-first-well-lit-paths.md | 63 +++++++------
docs/community/contact_us.md | 2 +
src/components/Welcome/index.js | 3 +
7 files changed, 136 insertions(+), 117 deletions(-)
diff --git a/blog/2025-05-20_News.md b/blog/2025-05-20_News.md
index 82d225d..9b4b690 100644
--- a/blog/2025-05-20_News.md
+++ b/blog/2025-05-20_News.md
@@ -26,14 +26,14 @@ Red Hat and its industry partners are directly confronting this challenge with l
llm-d delivers a powerful suite of innovations, highlighted by:
-* **vLLM**, which has quickly become the open source de facto standard inference server, providing day 0 model support for emerging frontier models, and support for a broad list of accelerators, now including Google Cloud Tensor Processor Units (TPUs).
-* **Prefill and Decode Disaggregation** to separate the input context and token generation phases of AI into discrete operations, where they can then be distributed across multiple servers.
-* **KV (key-value) Cache Offloading**, based on LMCache, shifts the memory burden of the KV cache from GPU memory to more cost-efficient and abundant standard storage, like CPU memory or network storage.
-* **Kubernetes-powered clusters and controllers** for more efficient scheduling of compute and storage resources as workload demands fluctuate, while maintaining performance and lower latency.
-* **AI-Aware Network Routing** for scheduling incoming requests to the servers and accelerators that are most likely to have hot caches of past inference calculations.
+* **vLLM**, which has quickly become the open source de facto standard inference server, providing day 0 model support for emerging frontier models, and support for a broad list of accelerators, now including Google Cloud Tensor Processor Units (TPUs).
+* **Prefill and Decode Disaggregation** to separate the input context and token generation phases of AI into discrete operations, where they can then be distributed across multiple servers.
+* **KV (key-value) Cache Offloading**, based on LMCache, shifts the memory burden of the KV cache from GPU memory to more cost-efficient and abundant standard storage, like CPU memory or network storage.
+* **Kubernetes-powered clusters and controllers** for more efficient scheduling of compute and storage resources as workload demands fluctuate, while maintaining performance and lower latency.
+* **AI-Aware Network Routing** for scheduling incoming requests to the servers and accelerators that are most likely to have hot caches of past inference calculations.
* **High-performance communication APIs** for faster and more efficient data transfer between servers, with support for NVIDIA Inference Xfer Library (NIXL).
-### llm-d: Backed by industry leaders
+### llm-d: Backed by industry leaders
This new open source project has already garnered the support of a formidable coalition of leading gen AI model providers, AI accelerator pioneers, and premier AI cloud platforms. CoreWeave, Google Cloud, IBM Research and NVIDIA are founding contributors, with AMD, Cisco, Hugging Face, Intel, Lambda and Mistral AI as partners, underscoring the industry’s deep collaboration to architect the future of large-scale LLM serving. The llm-d community is further joined by founding supporters at the Sky Computing Lab at the University of California, originators of vLLM, and the LMCache Lab at the University of Chicago, originators of [LMCache](https://github.com/LMCache/LMCache)*.*
@@ -43,87 +43,90 @@ Rooted in its unwavering commitment to open collaboration, Red Hat recognizes th
The future of AI must be defined by limitless opportunity, not constrained by infrastructure silos. Red Hat sees a horizon where organizations can deploy any model, on any accelerator, across any cloud, delivering an exceptional, more consistent user experience without exorbitant costs. To unlock the true potential of gen AI investments, enterprises require a universal inference platform \- a standard for more seamless, high-performance AI innovation, both today and in the years to come.
-Just as Red Hat pioneered the open enterprise by transforming Linux into the bedrock of modern IT, the company is now poised to architect the future of AI inference. vLLM’s potential is that of a linchpin for standardized gen AI inference, and Red Hat is committed to building a thriving ecosystem around not just the vLLM community but also llm-d for distributed inference at scale. The vision is clear: regardless of the AI model or the underlying accelerator or the deployment environment, Red Hat intends to make vLLM the definitive open standard for inference across the new hybrid cloud.
+Just as Red Hat pioneered the open enterprise by transforming Linux into the bedrock of modern IT, the company is now poised to architect the future of AI inference. vLLM’s potential is that of a linchpin for standardized gen AI inference, and Red Hat is committed to building a thriving ecosystem around not just the vLLM community but also llm-d for distributed inference at scale. The vision is clear: regardless of the AI model or the underlying accelerator or the deployment environment, Red Hat intends to make vLLM the definitive open standard for inference across the new hybrid cloud.
-**Red Hat Summit**
+**Red Hat Summit**
Join the Red Hat Summit keynotes to hear the latest from Red Hat executives, customers and partners:
-* [**Modernized infrastructure meets enterprise-ready AI**](https://events.experiences.redhat.com/widget/redhat/sum25/SessionCatalog2025/session/1737554802676001HJ8q) — Tuesday, May 20, 8-10 a.m. EDT ([YouTube](https://youtube.com/live/Gr8jomztY2s?feature=share))
+* [**Modernized infrastructure meets enterprise-ready AI**](https://events.experiences.redhat.com/widget/redhat/sum25/SessionCatalog2025/session/1737554802676001HJ8q) — Tuesday, May 20, 8-10 a.m. EDT ([YouTube](https://youtube.com/live/Gr8jomztY2s?feature=share))
* [**Hybrid cloud evolves to deliver enterprise innovation**](https://events.experiences.redhat.com/widget/redhat/sum25/SessionCatalog2025/session/1737554802763001Hr0T) — Wednesday, May 21, 8-9:30 a.m. EDT ([YouTube](https://youtube.com/live/g0K0pJIKHBU?feature=share))
-**Supporting Quotes**
-*Brian Stevens, senior vice president and AI CTO, Red Hat*
+**Supporting Quotes**
+*Brian Stevens, senior vice president and AI CTO, Red Hat*
“The launch of the llm-d community, backed by a vanguard of AI leaders, marks a pivotal moment in addressing the need for scalable gen AI inference, a crucial obstacle that must be overcome to enable broader enterprise AI adoption. By tapping the innovation of vLLM and the proven capabilities of Kubernetes, llm-d paves the way for distributed, scalable and high-performing AI inference across the expanded hybrid cloud, supporting any model, any accelerator, on any cloud environment and helping realize a vision of limitless AI potential.”
-*Ramine Roane, corporate vice president, AI Product Management, AMD*
+*Ramine Roane, corporate vice president, AI Product Management, AMD*
"AMD is proud to be a founding member of the llm-d community, contributing our expertise in high-performance GPUs to advance AI inference for evolving enterprise AI needs. As organizations navigate the increasing complexity of generative AI to achieve greater scale and efficiency, AMD looks forward to meeting this industry demand through the llm-d project."
-*Shannon McFarland, vice president, Cisco Open Source Program Office & Head of Cisco DevNet*
+*Shannon McFarland, vice president, Cisco Open Source Program Office & Head of Cisco DevNet*
“The llm-d project is an exciting step forward for practical generative AI. llm-d empowers developers to programmatically integrate and scale generative AI inference, unlocking new levels of innovation and efficiency in the modern AI landscape. Cisco is proud to be part of the llm-d community, where we’re working together to explore real-world use cases that help organizations apply AI more effectively and efficiently.”
-*Chen Goldberg, senior vice president, Engineering, CoreWeave*
-“CoreWeave is proud to be a founding contributor to the llm-d project and to deepen our long-
-standing commitment to open source AI. From our early partnership with EleutherAI to our ongoing work advancing inference at scale, we’ve consistently invested in making powerful AI infrastructure more accessible. We’re excited to collaborate with an incredible group of partners
-and the broader developer community to build a flexible, high-performance inference engine
+*Chen Goldberg, senior vice president, Engineering, CoreWeave*
+“CoreWeave is proud to be a founding contributor to the llm-d project and to deepen our long-
+standing commitment to open source AI. From our early partnership with EleutherAI to our ongoing work advancing inference at scale, we’ve consistently invested in making powerful AI infrastructure more accessible. We’re excited to collaborate with an incredible group of partners
+and the broader developer community to build a flexible, high-performance inference engine
that accelerates innovation and lays the groundwork for open, interoperable AI.”
-*Mark Lohmeyer, vice president and general manager, AI & Computing Infrastructure, Google Cloud*
+*Mark Lohmeyer, vice president and general manager, AI & Computing Infrastructure, Google Cloud*
"Efficient AI inference is paramount as organizations move to deploying AI at scale and deliver value for their users. As we enter this new age of inference, Google Cloud is proud to build upon our legacy of open source contributions as a founding contributor to the llm-d project. This new community will serve as a critical catalyst for distributed AI inference at scale, helping users realize enhanced workload efficiency with increased optionality for their infrastructure resources."
-*Jeff Boudier, Head of Product, Hugging Face*
+*Jeff Boudier, Head of Product, Hugging Face*
“We believe every company should be able to build and run their own models. With vLLM leveraging the Hugging Face transformers library as the source of truth for model definitions; a wide diversity of models large and small is available to power text, audio, image and video AI applications. Eight million AI Builders use Hugging Face to collaborate on over two million AI models and datasets openly shared with the global community. We are excited to support the llm-d project to enable developers to take these applications to scale.”
-*Priya Nagpurkar, vice president, Hybrid Cloud and AI Platform, IBM Research*
+*Priya Nagpurkar, vice president, Hybrid Cloud and AI Platform, IBM Research*
“At IBM, we believe the next phase of AI is about efficiency and scale. We’re focused on unlocking value for enterprises through AI solutions they can deploy effectively. As a founding contributor to llm-d, IBM is proud to be a key part of building a differentiated hardware agnostic distributed AI inference platform. We’re looking forward to continued contributions towards the growth and success of this community to transform the future of AI inference.”
-*Bill Pearson, vice president, Data Center & AI Software Solutions and Ecosystem, Intel*
-“The launch of llm-d will serve as a key inflection point for the industry in driving AI transformation at scale, and Intel is excited to participate as a founding supporter. Intel’s involvement with llm-d is the latest milestone in our decades-long collaboration with Red Hat to empower enterprises with open source solutions that they can deploy anywhere, on their platform of choice. We look forward to further extending and building AI innovation through the llm-d community.”
+*Bill Pearson, vice president, Data Center & AI Software Solutions and Ecosystem, Intel*
+“The launch of llm-d will serve as a key inflection point for the industry in driving AI transformation at scale, and Intel is excited to participate as a founding supporter. Intel’s involvement with llm-d is the latest milestone in our decades-long collaboration with Red Hat to empower enterprises with open source solutions that they can deploy anywhere, on their platform of choice. We look forward to further extending and building AI innovation through the llm-d community.”
- *Eve Callicoat, senior staff engineer, ML Platform, Lambda*
+ *Eve Callicoat, senior staff engineer, ML Platform, Lambda*
"Inference is where the real-world value of AI is delivered, and llm-d represents a major leap forward. Lambda is proud to support a project that makes state-of-the-art inference accessible, efficient, and open."
-*Ujval Kapasi, vice president, Engineering AI Frameworks, NVIDIA*
+*Ujval Kapasi, vice president, Engineering AI Frameworks, NVIDIA*
“The llm-d project is an important addition to the open source AI ecosystem and reflects NVIDIA’s support for collaboration to drive innovation in generative AI. Scalable, highly performant inference is key to the next wave of generative and agentic AI. We’re working with Red Hat and other supporting partners to foster llm-d community engagement and industry adoption, helping accelerate llm-d with innovations from NVIDIA Dynamo such as NIXL.”
-*Ion Stoica, Professor and Director of Sky Computing Lab, University of California, Berkeley*
-“We are pleased to see Red Hat build upon the established success of vLLM, which originated in our lab to help address the speed and memory challenges that come with running large AI models. Open source projects like vLLM, and now llm-d anchored in vLLM, are at the frontier of AI innovation tackling the most demanding AI inference requirements and moving the needle for the industry at large.”
+*Ion Stoica, Professor and Director of Sky Computing Lab, University of California, Berkeley*
+“We are pleased to see Red Hat build upon the established success of vLLM, which originated in our lab to help address the speed and memory challenges that come with running large AI models. Open source projects like vLLM, and now llm-d anchored in vLLM, are at the frontier of AI innovation tackling the most demanding AI inference requirements and moving the needle for the industry at large.”
-*Junchen Jiang, CS Professor, LMCache Lab, University of Chicago*
+*Junchen Jiang, CS Professor, LMCache Lab, University of Chicago*
“Distributed KV cache optimizations, such as offloading, compression, and blending, have been a key focus of our lab, and we are excited to see llm-d leveraging LMCache as a core component to reduce time to first token as well as improve throughput, particularly in long-context inference.”
**Additional Resources**
-* Learn more about [llm-d](https://www.llm-d.ai)
-* Read more about [vLLM](https://www.redhat.com/en/topics/ai/what-is-vllm)
-* Find out more about [contributing to llm-d](https://github.com/llm-d)
-* Learn more about [Red Hat Summit](http://red.ht/I2Zk1e)
-* See all of Red Hat’s announcements this week in the [Red Hat Summit newsroom](https://red.ht/3QrRUAh)
+* Learn more about [llm-d](https://www.llm-d.ai)
+* Read more about [vLLM](https://www.redhat.com/en/topics/ai/what-is-vllm)
+* Find out more about [contributing to llm-d](https://github.com/llm-d)
+* Learn more about [Red Hat Summit](http://red.ht/I2Zk1e)
+* See all of Red Hat’s announcements this week in the [Red Hat Summit newsroom](https://red.ht/3QrRUAh)
* Follow [@RedHatSummit](https://twitter.com/redhatsummit) or [\#RHSummit](https://twitter.com/hashtag/rhsummit) on X for event-specific updates
**Connect with Red Hat**
-* Learn more about [Red Hat](http://red.ht/IOS5vm)
-* Get more news in the [Red Hat newsroom](http://red.ht/1qeXuma)
-* Read the [Red Hat blog](http://red.ht/1zzgkXp)
-* Follow [Red Hat on X](https://red.ht/3Ghe0TT.)
-* Follow [Red Hat on Instagram](https://red.ht/4iBsqwB)
-* Follow [Red Hat on LinkedIn](https://red.ht/4hHewrv)
+* Learn more about [Red Hat](http://red.ht/IOS5vm)
+* Get more news in the [Red Hat newsroom](http://red.ht/1qeXuma)
+* Read the [Red Hat blog](http://red.ht/1zzgkXp)
+* Follow [Red Hat on X](https://red.ht/3Ghe0TT)
+* Follow [Red Hat on Instagram](https://red.ht/4iBsqwB)
+* Follow [Red Hat on LinkedIn](https://red.ht/4hHewrv)
* Watch [Red Hat videos on YouTube](https://red.ht/44B8oxL)
-**About Red Hat**
+**About Red Hat**
[Red Hat](https://www.redhat.com/en) is the open hybrid cloud technology leader, delivering a trusted, consistent and comprehensive foundation for transformative IT innovation and AI applications. Its portfolio of cloud, developer, AI, Linux, automation and application platform technologies enables any application, anywhere—from the datacenter to the edge. As the world's leading provider of enterprise open source software solutions, Red Hat invests in open ecosystems and communities to solve tomorrow's IT challenges. Collaborating with partners and customers, Red Hat helps them build, connect, automate, secure and manage their IT environments, supported by consulting services and [award-winning](https://access.redhat.com/recognition) training and certification offerings.
-**Forward-Looking Statements**
+**Forward-Looking Statements**
Except for the historical information and discussions contained herein, statements contained in this press release may constitute forward-looking statements within the meaning of the Private Securities Litigation Reform Act of 1995\. Forward-looking statements are based on the company’s current assumptions regarding future business and financial performance. These statements involve a number of risks, uncertainties and other factors that could cause actual results to differ materially. Any forward-looking statement in this press release speaks only as of the date on which it is made. Except as required by law, the company assumes no obligation to update or revise any forward-looking statements.
-**Media Contact:**
-John Terrill
-Red Hat
-\+1-571-421-8132
+**Media Contact:**
+John Terrill
+Red Hat
+\+1-571-421-8132
[jterrill@redhat.com](mailto:jterrill@redhat.com)
*\#\#\#*
-*Red Hat and the Red Hat logo are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the U.S. and other countries.*
+*Red Hat and the Red Hat logo are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the U.S. and other countries.*
[^1]: Forecast Analysis: AI Semiconductors, Worldwide, Alan Priestley, Gartner, 2 August 2024 \- ID G00818912 GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in
+
+
diff --git a/blog/2025-05-20_announce.md b/blog/2025-05-20_announce.md
index dd21f9c..9626511 100644
--- a/blog/2025-05-20_announce.md
+++ b/blog/2025-05-20_announce.md
@@ -8,14 +8,14 @@ authors:
- robshaw
- smarterclayton
- chcost
-
+
tags: [hello, welcome, llm-d]
hide_table_of_contents: false
---
## Announcing the llm-d community
-llm-d is a Kubernetes-native high-performance distributed LLM inference framework
+llm-d is a Kubernetes-native high-performance distributed LLM inference framework
\- a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.
With llm-d, users can operationalize gen AI deployments with a modular, high-performance, end-to-end serving solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in [Inference Gateway (IGW)](https://github.com/kubernetes-sigs/gateway-api-inference-extension?tab=readme-ov-file).
@@ -32,9 +32,9 @@ Kubernetes typically scales out application workloads with uniform replicas and
This simple pattern is very effective for most request patterns, which have the following characteristics:
-* Requests are short-lived and generally uniform in resource utilization
-* Requests have generally uniform latency service level objectives (SLOs)
-* Each replica can process each request equally well
+* Requests are short-lived and generally uniform in resource utilization
+* Requests have generally uniform latency service level objectives (SLOs)
+* Each replica can process each request equally well
* Specializing variants and coordinating replicas to process a single request is not useful
#### LLM Serving Is Unique
@@ -47,8 +47,8 @@ Let's take a look at each one step-by-step:
*A. Requests are expensive with significant variance in resource utilization.*
-* Each LLM inference request has a different "shape" to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.
- * RAG has long inputs \- prompt and retrieved docs \- and short generated outputs
+* Each LLM inference request has a different "shape" to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.
+ * RAG has long inputs \- prompt and retrieved docs \- and short generated outputs
* Reasoning has a short or medium inputs and long generated outputs

@@ -57,8 +57,8 @@ Let's take a look at each one step-by-step:
*B. Routing to specific replicas with cached prior computation can achieve orders of magnitude better latency.*
-* Many common LLM workloads have "multi-turn" request patterns, where the same prompt is sent iteratively to the same instance.
- * Agentic (tool calls are iterative request flow)
+* Many common LLM workloads have "multi-turn" request patterns, where the same prompt is sent iteratively to the same instance.
+ * Agentic (tool calls are iterative request flow)
* Code completion task (requests reuse current codebase as context)

@@ -73,7 +73,7 @@ Let's take a look at each one step-by-step:
* Standard LLM deployments perform the prefill and decode phases of inference within a single replica. Given that prefill and decode phases of inference have different resource requirements, co-locating these phases on the same replica leads to inefficient resource use, especially for long sequences.
-* **Disaggregation** (e.g. [Distserve](https://arxiv.org/abs/2401.09670)) separates prefill and decode phases onto different variants, enabling independent optimization and scaling of each phase.
+* **Disaggregation** (e.g. [Distserve](https://arxiv.org/abs/2401.09670)) separates prefill and decode phases onto different variants, enabling independent optimization and scaling of each phase.
* Google [leverages disaggregated serving on TPUs](https://cloud.google.com/blog/products/compute/whats-new-with-ai-hypercomputer) to provide better first-token latency and simplify operational scaling.
* DeepSeek released a [discussion of the design of their inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which leverages aggressive disaggregation to achieve remarkable performance at scale.
@@ -82,10 +82,10 @@ Let's take a look at each one step-by-step:
*D. Production deployments often have a range of quality of service (QoS) requirements.*
-* Use cases for a single LLM endpoint can have a wide variety of quality of service requirements. Consider the following examples:
- * Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an "in the loop" experience. O(ms) latency tolerance.
- * Latency is important: Chat agent sessions and email drafting with interactive use cases. O(seconds) latency tolerance.
- * Latency tolerant: Video call and email summarization and "deep research" agents with daily or hourly usage patterns. O(minutes) latency tolerance.
+* Use cases for a single LLM endpoint can have a wide variety of quality of service requirements. Consider the following examples:
+ * Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an "in the loop" experience. O(ms) latency tolerance.
+ * Latency is important: Chat agent sessions and email drafting with interactive use cases. O(seconds) latency tolerance.
+ * Latency tolerant: Video call and email summarization and "deep research" agents with daily or hourly usage patterns. O(minutes) latency tolerance.
* Latency agnostic: Overnight batch processing workloads, meeting minute generation, and autonomous agents. O(hours) latency tolerance.
* Given the compute intensity (and, therefore, high costs) of LLMs, tight latency SLOs are substantially more expensive to achieve. This spectrum of latency requirements presents an opportunity to further optimize infrastructure efficiency – the more latency tolerant a workload is, the more we can optimize infrastructure efficiency amongst other workloads.
@@ -102,8 +102,8 @@ The objective of llm-d is to create a well-lit path for anyone to adopt the lead
To achieve this goal, we have the following design principles for the project:
-* **Operationalizability:** modular and resilient architecture with native integration into Kubernetes via Inference Gateway API
-* **Flexibility:** cross-platform (active work to support NVIDIA, Google TPU, AMD, and Intel), with extensible implementations of key composable layers of the stack
+* **Operationalizability:** modular and resilient architecture with native integration into Kubernetes via Inference Gateway API
+* **Flexibility:** cross-platform (active work to support NVIDIA, Google TPU, AMD, and Intel), with extensible implementations of key composable layers of the stack
* **Performance**: leverage distributed optimizations like disaggregation and prefix-aware routing to achieve the highest tok/$ while meeting SLOs
#### Architecture
@@ -111,7 +111,7 @@ To achieve this goal, we have the following design principles for the project:
To achieve this objective, we designed llm-d with a modular and layered architecture on top of industry-standard open-source technologies \- vLLM, Kubernetes, and Inference Gateway.
-* [**vLLM**. vLLM](https://docs.vllm.ai/en/latest/) is the leading open-source LLM inference engine, supporting a wide range of models (including Llama and DeepSeek) and hardware accelerators (including NVIDIA GPU, Google TPU, AMD ) with high performance.
+* [**vLLM**](https://docs.vllm.ai/en/latest/). vLLM is the leading open-source LLM inference engine, supporting a wide range of models (including Llama and DeepSeek) and hardware accelerators (including NVIDIA GPU, Google TPU, and AMD) with high performance.
* [**Kubernetes**](https://kubernetes.io/docs/home/) **(K8s)**. K8s is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications. It is the industry standard for deploying and updating LLM inference engines across various hardware accelerators.
@@ -121,30 +121,30 @@ To achieve this objective, we designed llm-d with a modular and layered architec
And our key new contributions:
-* **vLLM Optimized Inference Scheduler** \- IGW defines a pattern for customizable "smart" load-balancing via the [Endpoint Picker Protocol (EPP)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol). Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make "smart" scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to be used out-of-the-box by llm-d users. Advanced teams can also tweak or implement their own scorers and filterers to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.
+* **vLLM Optimized Inference Scheduler** \- IGW defines a pattern for customizable "smart" load-balancing via the [Endpoint Picker Protocol (EPP)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol). Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make "smart" scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to be used out-of-the-box by llm-d users. Advanced teams can also tweak or implement their own scorers and filterers to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.
* For more details, see our Northstar: [\[PUBLIC\] llm-d Scheduler Northstar](https://docs.google.com/document/d/1kE1LY8OVjiOgKVD9-9Po96HODbTIbgHp4qgvw06BCOc/edit?tab=t.0)
-* **Disaggregated Serving with [vLLM](https://github.com/vllm-project/vllm) \-** llm-d leverages vLLM's recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like [NVIDIA's NIXL](https://github.com/ai-dynamo/nixl).
-
- In llm-d, we plan to support two "well-lit" paths for prefill/decode (P/D) disaggregation:
- * Latency optimized implementation using fast interconnects (IB, RDMA, ICI)
- * Throughput optimized implementation using data center networking
+* **Disaggregated Serving with [vLLM](https://github.com/vllm-project/vllm) \-** llm-d leverages vLLM's recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like [NVIDIA's NIXL](https://github.com/ai-dynamo/nixl).
+
+ In llm-d, we plan to support two "well-lit" paths for prefill/decode (P/D) disaggregation:
+ * Latency optimized implementation using fast interconnects (IB, RDMA, ICI)
+ * Throughput optimized implementation using data center networking
* For more details, see our Northstar:[\[PUBLIC\] llm-d Disaggregated Serving Northstar](https://docs.google.com/document/d/1FNN5snmipaTxEA1FGEeSH7Z_kEqskouKD1XYhVyTHr8/edit?tab=t.0#heading=h.ycwld2oth1kj)
-* **Disaggregated Prefix Caching with vLLM** \- llm-d uses the same vLLM KV connector API used in disaggregated serving to provide a pluggable cache for previous calculations, including offloading KVs to host, remote storage, and systems like [LMCache](https://github.com/LMCache/LMCache).
-
- In llm-d, we plan to support two "well-lit" paths for KV cache disaggregation:
- * Independent caching with basic offloading to host memory and disk, providing a zero operational cost mechanism that utilizes all system resources
- * Shared caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.
+* **Disaggregated Prefix Caching with vLLM** \- llm-d uses the same vLLM KV connector API used in disaggregated serving to provide a pluggable cache for previous calculations, including offloading KVs to host, remote storage, and systems like [LMCache](https://github.com/LMCache/LMCache).
+
+ In llm-d, we plan to support two "well-lit" paths for KV cache disaggregation:
+ * Independent caching with basic offloading to host memory and disk, providing a zero operational cost mechanism that utilizes all system resources
+ * Shared caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.
* For more details, see our Northstar: [\[PUBLIC\] llm-d Prefix Caching Northstar](https://docs.google.com/document/d/1d-jKVHpTJ_tkvy6Pfbl3q2FM59NpfnqPAh__Uz_bEZ8/edit?tab=t.0#heading=h.6qazyl873259)
* **Variant Autoscaling over Hardware, Workload, and Traffic** \- Accelerator hardware varies dramatically in terms of compute, memory, and cost, workloads sharing the same models vary by their required quality of service, the distinct phases of LLM inference and large mixture-of-expert models vary on whether they are compute, memory, or network bound, and incoming traffic varies over time and by workload. Today, all of these decisions are made at deployment time, and almost all deployers struggle to enable autoscaling to reduce their costs safely.
- Drawing on extensive experience from end users and OSS collaborators like AIBrix, we plan to implement a traffic- and hardware-aware autoscaler that:
- * Measures the capacity of each model server instance
- * Derive a load function that takes into account different request shapes and QoS
- * Using the recent traffic mix \- QPS (Queries Per Second), QoS, and shape distribution \- calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, and label each instance with a grouping
- * Report load metrics per grouping that allows Kubernetes horizontal pod autoscaling to match hardware in use to hardware needed without violating SLOs
+ Drawing on extensive experience from end users and OSS collaborators like AIBrix, we plan to implement a traffic- and hardware-aware autoscaler that:
+ * Measures the capacity of each model server instance
+ * Derives a load function that takes into account different request shapes and QoS
+ * Uses the recent traffic mix \- QPS (Queries Per Second), QoS, and shape distribution \- to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, and labels each instance with a grouping
+ * Reports load metrics per grouping that allow Kubernetes horizontal pod autoscaling to match hardware in use to hardware needed without violating SLOs
* For more details, see our Northstar: [\[PUBLIC\] llm-d Autoscaling Northstar](https://docs.google.com/document/d/1inTneLEZTv3rDEBB9KLOB9K6oMq8c3jkogARJqdt_58/edit?tab=t.0)
#### Example llm-d Features
@@ -153,9 +153,9 @@ llm-d integrates IGW and vLLM together, enabling a high performance distributed
**Prefix and KV cache-aware routing**
-The first key collaboration between IGW and vLLM in llm-d was developing prefix-cache aware routing to complement the existing KV cache utilization aware load balancing in IGW.
+The first key collaboration between IGW and vLLM in llm-d was developing prefix-cache aware routing to complement the existing KV cache utilization aware load balancing in IGW.
-We conducted a series of experiments to evaluate the performance of the [llm-d-inference-scheduler](https://github.com/llm-d/llm-d-inference-scheduler) with prefix-aware routing on 2 NVIDIA 8xH100 nodes using the [LMbenchmark in a long-input/short-output configuration designed](https://github.com/LMCache/LMBenchmark/tree/main/synthetic-multi-round-qa) to stress KV cache reuse and routing decision quality.
+We conducted a series of experiments to evaluate the performance of the [llm-d-inference-scheduler](https://github.com/llm-d/llm-d-inference-scheduler) with prefix-aware routing on 2 NVIDIA 8xH100 nodes using the [LMbenchmark in a long-input/short-output configuration designed](https://github.com/LMCache/LMBenchmark/tree/main/synthetic-multi-round-qa) to stress KV cache reuse and routing decision quality.
| | Model | Configuration | ISL | OSL | Latency SLO |
| :---- | :---- | :---- | :---- | :---- | :---- |
@@ -167,8 +167,8 @@ We conducted a series of experiments to evaluate the performance of the [llm-d-i
**Key Observations:**
-* **S1:** At 4 QPS, llm-d achieves a mean TTFT approximately 3X lower than the baseline (lower is better).
-* **S2:** llm-d delivers \~50% higher QPS than the baseline while meeting SLO requirements (higher is better).
+* **S1:** At 4 QPS, llm-d achieves a mean TTFT approximately 3X lower than the baseline (lower is better).
+* **S2:** llm-d delivers \~50% higher QPS than the baseline while meeting SLO requirements (higher is better).
* **S3:** llm-d sustains 2X the baseline QPS under SLO constraints (higher is better).
These results show that llm-d's cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLA requirements.
@@ -179,7 +179,7 @@ Try it out with the \`base.yaml\` config in our [quickstart](https://github.com/
We've completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D \+ EP\<\>DP for large scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.
-Try it out with the pd-nixl.yaml config in our [quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart).
+Try it out with the pd-nixl.yaml config in our [quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart).
### Get started with llm-d
@@ -187,9 +187,11 @@ llm-d builds brings together the performance of vLLM with the operationalizabili
We welcome AI engineers and researchers to join the llm-d community and contribute:
-* Check out our repository on Github: [https://github.com/llm-d/llm-d](https://github.com/llm-d/llm-d)
-* Join our developer slack: [https://inviter.co/llm-d-slack](https://inviter.co/llm-d-slack)
+* Check out our repository on GitHub: [https://github.com/llm-d/llm-d](https://github.com/llm-d/llm-d)
+* Join our developer Slack: [https://inviter.co/llm-d-slack](https://inviter.co/llm-d-slack)
* Try out our quick starts to deploy llm-d on your Kubernetes cluster: [https://github.com/llm-d/llm-d-deployer/tree/main/quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart)
Please join us. The future of AI is open.
+
diff --git a/blog/2025-06-03_week_1_round_up.md b/blog/2025-06-03_week_1_round_up.md
index c3e7345..5b626a1 100644
--- a/blog/2025-06-03_week_1_round_up.md
+++ b/blog/2025-06-03_week_1_round_up.md
@@ -5,7 +5,7 @@ slug: llm-d-week-1-round-up
authors:
- petecheslock
-
+
tags: [news]
hide_table_of_contents: false
@@ -54,4 +54,7 @@ We use Google Groups to share architecture diagrams and other content. Please jo
* [LinkedIn](http://linkedin.com/company/llm-d)
* [@\_llm\_d\_](https://twitter.com/_llm_d_)
* [r/llm\_d](https://www.reddit.com/r/llm_d/)
-* YouTube - coming soon
\ No newline at end of file
+* YouTube - coming soon
+
+
diff --git a/blog/2025-06-25_community_update.md b/blog/2025-06-25_community_update.md
index d8b6027..cf66e4c 100644
--- a/blog/2025-06-25_community_update.md
+++ b/blog/2025-06-25_community_update.md
@@ -75,3 +75,6 @@ There are many ways to contribute to llm-d:
6. Check out our [Contributor Guidelines](https://llm-d.ai/docs/community/contribute) to start contributing code
We're looking forward to hearing from you and working together to make llm-d even better!
+
+
diff --git a/blog/2025-07-29_llm-d-v0.2-our-first-well-lit-paths.md b/blog/2025-07-29_llm-d-v0.2-our-first-well-lit-paths.md
index 985ab82..56a4c57 100644
--- a/blog/2025-07-29_llm-d-v0.2-our-first-well-lit-paths.md
+++ b/blog/2025-07-29_llm-d-v0.2-our-first-well-lit-paths.md
@@ -27,8 +27,8 @@ Our deployments have been tested and benchmarked on recent GPUs, such as H200 no
We’ve defined and improved three well-lit paths that form the foundation of this release:
-* [**Intelligent inference scheduling over any vLLM deployment**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/inference-scheduling): support for precise prefix-cache aware routing with no additional infrastructure, out-of-the-box load-aware scheduling for better tail latency that “just works”, and a new configurable scheduling profile system enable teams to see immediate latency wins and still customize scheduling behavior for their workloads and infrastructure.
-* [**P/D disaggregation**:](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/pd-disaggregation) support for separating prefill and decode workloads to improve latency and GPU utilization for long-context scenarios.
+* [**Intelligent inference scheduling over any vLLM deployment**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/inference-scheduling): support for precise prefix-cache aware routing with no additional infrastructure, out-of-the-box load-aware scheduling for better tail latency that “just works”, and a new configurable scheduling profile system enable teams to see immediate latency wins and still customize scheduling behavior for their workloads and infrastructure.
+* [**P/D disaggregation**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/pd-disaggregation): support for separating prefill and decode workloads to improve latency and GPU utilization for long-context scenarios.
* [**Wide expert parallelism for DeepSeek R1 (EP/DP)**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/wide-ep-lws): support for large-scale multi-node deployments using expert and data parallelism patterns for MoE models. This includes optimized deployments leveraging NIXL+UCX for inter-node communication, with fixes and improvements to reduce latency, and demonstrates the use of LeaderWorkerSet for Kubernetes-native inference orchestration.
All of these scenarios are reproducible: we provide reference hardware specs, workloads, and benchmarking harness support, so others can evaluate, reproduce, and extend these benchmarks easily. This also reflects improvements to our deployment tooling and benchmarking framework, a new "machinery" that allows users to set up, test, and analyze these scenarios consistently.
@@ -47,9 +47,9 @@ We've refactored the deployer into a Helm-first, modular structure, splitting ch
The path for Prefill/Decode (P/D) disaggregation and multi-node DP/EP MoE deployments is now more clearly defined and tested. This work integrates and optimizes key [vLLM 0.10.0](https://github.com/vllm-project/vllm/releases/tag/v0.10.0) kernel improvements, including DeepGEMM and CUTLASS for expert parallel compute, as well as PPLX and DeepEP kernels and intra- and inter-node communication fixes and optimizations and multi-node scenarios. We now include:
-* Kubernetes-native deployment recipes now support API servers per DP rank for one-pod-per-rank placement, enhancing scalability and control
-* Helm charts are updated to support LeaderWorkerSet (LWS) for multi-node setups and direct one-pod-per-DP-rank deployments
-* Optimized intra-node communication by enabling DeepEP to use cuda\_ipc efficiently
+* Kubernetes-native deployment recipes now support API servers per DP rank for one-pod-per-rank placement, enhancing scalability and control
+* Helm charts are updated to support LeaderWorkerSet (LWS) for multi-node setups and direct one-pod-per-DP-rank deployments
+* Optimized intra-node communication by enabling DeepEP to use cuda\_ipc efficiently
* Enhanced NIXL+UCX performance, with fixes and optimizations that significantly reduce inter-node communication overhead, particularly for long context workloads
These validated scenarios are backed by benchmark baselines and example deployments via our quickstarts, offering clearer guidance on what works well today. As part of the "well-lit path" we have also identified limitations including known edge cases around response sizes and failure modes where more work is required.
@@ -84,9 +84,9 @@ Multi-arch support, smaller images, and hardened configurations ensure a reliabl
Here are some key lessons we learned so far in our progress with llm-d:
-* **Low-hanging fruit matters.** Targeted optimizations, like reducing KV‑cache transfer overhead between prefill and decode workers and refining prefix‑aware scheduling, delivered significant gains in throughput and tail latency. These quick wins required minimal change but paved the way for the deeper architectural improvements planned in upcoming releases.
-* **Using bleeding-edge libraries is hard.** Many key libraries associated with distributed inference are immature. Through our applied experiments in our well-lit paths and in close collaboration with ecosystem partners, we have improved much of the key infrastructure the larger community relies on in real-world conditions.
-* **Build on proven paths.** This validates why llm-d exists: to help users avoid discovering these problems themselves, offering reproducible deployments, performance baselines, and extensibility. llm-d focuses on building these paths so our users don’t need to troubleshoot these complex challenges in isolation.
+* **Low-hanging fruit matters.** Targeted optimizations, like reducing KV‑cache transfer overhead between prefill and decode workers and refining prefix‑aware scheduling, delivered significant gains in throughput and tail latency. These quick wins required minimal change but paved the way for the deeper architectural improvements planned in upcoming releases.
+* **Using bleeding-edge libraries is hard.** Many key libraries associated with distributed inference are immature. Through our applied experiments in our well-lit paths and in close collaboration with ecosystem partners, we have improved much of the key infrastructure the larger community relies on in real-world conditions.
+* **Build on proven paths.** This validates why llm-d exists: to help users avoid discovering these problems themselves, offering reproducible deployments, performance baselines, and extensibility. llm-d focuses on building these paths so our users don’t need to troubleshoot these complex challenges in isolation.
* **Community matters.** Working closely with the NVIDIA Dynamo community, we've tackled NIXL/UCX performance overheads for long context workloads, leading to significant improvements and active upstream contributions.
### Our survey
@@ -99,10 +99,10 @@ Conversational AI (82.9%) and real-time applications (56.1%) stood out as the mo
Today, [llm-d 0.2](https://github.com/llm-d/llm-d/releases/tag/v0.2.0) offers:
-* Modular Helm charts and clear deployment workflows.
-* Verified support for P/D, DP/EP, pod-per-rank, and heterogeneous GPUs (H200, B200).
-* Reproducible performance baselines, now with MoE support.
-* New foundations for routing and scheduler extensibility.
+* Modular Helm charts and clear deployment workflows.
+* Verified support for P/D, DP/EP, pod-per-rank, and heterogeneous GPUs (H200, B200).
+* Reproducible performance baselines, now with MoE support.
+* New foundations for routing and scheduler extensibility.
* A developer, and researcher-friendly platform with tested examples, with detailed guides on the way.
## A growing community
@@ -111,31 +111,31 @@ The best part of llm-d has been watching the community grow around it. We're thr
Much of the work happens within our seven Special Interest Groups (SIGs), each focused on a key area:
-* **Inference Scheduler** – Developing smarter routing and load‑balancing strategies, including KV‑cache‑aware scheduling.
-* **P/D Disaggregation** – Advancing phase‑separation strategies to improve resource‑utilization efficiency.
-* **KV Disaggregation** – Advancing and optimizing distributed KV‑cache management.
-* **Installation** – Streamlining deployment on Kubernetes, from single‑node setups to large multi‑node clusters.
-* **Benchmarking** – Building tools to automate performance validation and make scenarios easier to reproduce and extend.
-* **Autoscaling** – Adapting resources dynamically based on workload demands.
+* **Inference Scheduler** – Developing smarter routing and load‑balancing strategies, including KV‑cache‑aware scheduling.
+* **P/D Disaggregation** – Advancing phase‑separation strategies to improve resource‑utilization efficiency.
+* **KV Disaggregation** – Advancing and optimizing distributed KV‑cache management.
+* **Installation** – Streamlining deployment on Kubernetes, from single‑node setups to large multi‑node clusters.
+* **Benchmarking** – Building tools to automate performance validation and make scenarios easier to reproduce and extend.
+* **Autoscaling** – Adapting resources dynamically based on workload demands.
* **Observability** – Providing deep visibility into system performance and health.
We're also collaborating with other great open-source communities like vLLM, Dynamo, and LMCache. Every one of these groups is open, and we’d love for you to join in. Whether you want to contribute code, share ideas, or just listen in, you are welcome. You can find details for each SIG, including their leaders and meeting times, on [our community page](https://llm-d.ai/docs/community/sigs).
-## What's next:
+## What's next:
Looking ahead, our community is focusing on these key areas:
-* **Core optimizations**
- * TCP-based request dispatch upstream
- * Disaggregation protocol refinements, including possible sidecar removal
- * CPU cache offloading to expand memory capacity
- * KV event awareness baked directly into routing decisions
- * SLO-driven scheduling architecture for predictable performance
-* **Benchmarking enhancements:**
- * Expanded reproducibility guides.
- * Complete performance validation for core scenarios.
-* **Developer experience improvements:**
- * Expanded examples for inference gateway and scheduler extensibility.
+* **Core optimizations**
+ * TCP-based request dispatch upstream
+ * Disaggregation protocol refinements, including possible sidecar removal
+ * CPU cache offloading to expand memory capacity
+ * KV event awareness baked directly into routing decisions
+ * SLO-driven scheduling architecture for predictable performance
+* **Benchmarking enhancements:**
+ * Expanded reproducibility guides.
+ * Complete performance validation for core scenarios.
+* **Developer experience improvements:**
+ * Expanded examples for inference gateway and scheduler extensibility.
* Central Helm charts and expanded documentation.
See our [roadmap issue](https://github.com/llm-d/llm-d/issues/146) to see what is coming next and make your voice heard\!
@@ -149,3 +149,6 @@ Community engagement is key to our success:
* [**Join our community calls**](https://red.ht/llm-d-public-calendar) (Wed 12:30pm ET)
Contribute on [GitHub](https://github.com/llm-d), join our community calls, join the SIGs and build with us\!
+
+
diff --git a/docs/community/contact_us.md b/docs/community/contact_us.md
index f20258c..6c2c607 100644
--- a/docs/community/contact_us.md
+++ b/docs/community/contact_us.md
@@ -20,3 +20,5 @@ You can also find us on
- [**LinkedIn:** https://linkedin.com/company/llm-d ](https://linkedin.com/company/llm-d)
- [**X:** https://x.com/\_llm_d\_](https://x.com/_llm_d_)
+
diff --git a/src/components/Welcome/index.js b/src/components/Welcome/index.js
index 334e22b..bf174a6 100644
--- a/src/components/Welcome/index.js
+++ b/src/components/Welcome/index.js
@@ -37,6 +37,9 @@ export default function Welcome() {
for most models across a diverse and comprehensive set of hardware accelerators.
+
+