Collapse authors to the yaml configuration and create a June update blog post (#51)
* Refactor author information in blog posts and fix typos
- Updated author entries in `2025-05-20_announce.md`, `2025-06-03_week_1_round_up.md`, and `authors.yml` to use usernames instead of full details.
- Corrected minor typographical errors in various blog posts for consistency and clarity.
Signed-off-by: Pete Cheslock <[email protected]>
* Add June 2025 Community Update and new tags
- Introduced a new blog post titled "llm-d Community Update - June 2025" detailing project progress, new YouTube channel, and ways to get involved.
- Added new tags for community engagement, updates, and SIG-Benchmarking in `tags.yml`.
Signed-off-by: Pete Cheslock <[email protected]>
* Move the reminder down - and bring the survey higher up.
Signed-off-by: Pete Cheslock <[email protected]>
* Add June 2025 Community Update
- Introduced a new blog post detailing the latest updates from the llm-d community, including a call for participation in a performance benchmarks survey, the launch of a new YouTube channel, and ways to get involved with the project.
Signed-off-by: Pete Cheslock <[email protected]>
* Add new blog post for June 2025 Community Update
- Introduced a comprehensive update detailing project progress, a call for participation in a community survey, the launch of a new YouTube channel, and various ways for community members to get involved with the llm-d project.
Signed-off-by: Pete Cheslock <[email protected]>
* Remove date as a header
Signed-off-by: Pete Cheslock <[email protected]>
* Remove duplicate H1 header
Signed-off-by: Pete Cheslock <[email protected]>
---------
Signed-off-by: Pete Cheslock <[email protected]>
@@ -57,11 +43,11 @@ The LLM inference workload, however, is unique with slow, non-uniform, expensive

-Let’s take a look at each one step-by-step:
+Let's take a look at each one step-by-step:

*A. Requests are expensive with significant variance in resource utilization.*

-* Each LLM inference request has a different “shape” to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.
+* Each LLM inference request has a different "shape" to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.
  * RAG has long inputs \- prompt and retrieved docs \- and short generated outputs
  * Reasoning has short or medium inputs and long generated outputs
@@ -71,13 +57,13 @@ Let’s take a look at each one step-by-step:

*B. Routing to specific replicas with cached prior computation can achieve orders of magnitude better latency.*

-* Many common LLM workloads have “multi-turn” request patterns, where the same prompt is sent iteratively to the same instance.
+* Many common LLM workloads have "multi-turn" request patterns, where the same prompt is sent iteratively to the same instance.
  * Agentic (tool calls are an iterative request flow)
  * Code completion task (requests reuse the current codebase as context)
-* LLM inference servers like vLLM implement a method called “automatic prefix caching”, which enables “skipping” a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.
+* LLM inference servers like vLLM implement a method called "automatic prefix caching", which enables "skipping" a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that already have the data in their cache, we skip that computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.
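To make the cache-affinity idea concrete, here is a minimal sketch of prefix-cache-aware scoring. It is illustrative only: the block size, chained hashing, and in-memory cache index below are assumptions for the example, not the actual vLLM or llm-d-inference-scheduler implementation.

```python
# Illustrative sketch of prefix-cache-aware replica scoring (not llm-d's real code).
# Idea: hash the prompt in fixed-size token blocks, then prefer the replica that
# already holds the longest matching chain of those blocks in its KV cache.
import hashlib

BLOCK_SIZE = 16  # tokens per cache block; assumed for the example

def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Chain-hash the prompt block by block, so each hash encodes the full prefix."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[i : i + block_size]
        prev = hashlib.sha256(prev + str(block).encode()).digest()
        hashes.append(prev.hex())
    return hashes

def prefix_score(prompt_blocks: list[str], cached_blocks: set[str]) -> float:
    """Fraction of the prompt's leading blocks already cached on a replica."""
    hits = 0
    for h in prompt_blocks:
        if h not in cached_blocks:
            break
        hits += 1
    return hits / max(len(prompt_blocks), 1)

def pick_replica(token_ids: list[int], replica_caches: dict[str, set[str]]) -> str:
    """Choose the replica with the best prefix-cache overlap (ties go to the first)."""
    blocks = block_hashes(token_ids)
    return max(replica_caches, key=lambda name: prefix_score(blocks, replica_caches[name]))

# Example: replica "b" already has the first two blocks of this prompt cached, so it wins.
prompt = list(range(48))  # 48 fake token ids = 3 blocks
caches = {"a": set(), "b": set(block_hashes(prompt)[:2])}
print(pick_replica(prompt, caches))  # -> "b"
```

A production scorer would blend this signal with load and queue depth; that combination is what the inference scheduler's filtering and scoring pipeline, described later in the post, is responsible for.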
@@ -97,16 +83,16 @@ Let’s take a look at each one step-by-step:

*D. Production deployments often have a range of quality of service (QoS) requirements.*

* Use cases for a single LLM endpoint can have a wide variety of quality of service requirements. Consider the following examples:
-  * Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an “in the loop” experience. O(ms) latency tolerance.
+  * Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an "in the loop" experience. O(ms) latency tolerance.
  * Latency is important: Chat agent sessions and email drafting with interactive use cases. O(seconds) latency tolerance.
-  * Latency tolerant: Video call and email summarization and “deep research” agents with daily or hourly usage patterns. O(minutes) latency tolerance.
+  * Latency tolerant: Video call and email summarization and "deep research" agents with daily or hourly usage patterns. O(minutes) latency tolerance.
* Given the compute intensity (and, therefore, high costs) of LLMs, tight latency SLOs are substantially more expensive to achieve. This spectrum of latency requirements presents an opportunity to further optimize infrastructure efficiency: the more latency tolerant a workload is, the more room we have to optimize infrastructure efficiency across workloads.
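As a rough sketch of how such tiers can drive scheduling decisions (the class names and latency budgets below are invented for illustration; real priority handling is what IGW's serving-priority and flow-control features are for), a scheduler can tag each request with a latency class and serve the strictest class first:

```python
# Illustrative only: map request classes to latency budgets and drain a queue
# strictest-first. The class names and budgets are made up for this sketch.
from dataclasses import dataclass, field
import heapq

LATENCY_BUDGET_S = {
    "interactive": 0.2,     # code completion, search: O(ms) tolerance
    "conversational": 5.0,  # chat, drafting: O(seconds) tolerance
    "batch": 600.0,         # summarization, deep research: O(minutes) tolerance
}

@dataclass(order=True)
class PendingRequest:
    budget_s: float                        # ordering key: tighter budget first
    request_id: str = field(compare=False)
    prompt: str = field(compare=False)

def enqueue(queue: list, request_id: str, prompt: str, qos_class: str) -> None:
    heapq.heappush(queue, PendingRequest(LATENCY_BUDGET_S[qos_class], request_id, prompt))

queue: list[PendingRequest] = []
enqueue(queue, "r1", "summarize this meeting ...", "batch")
enqueue(queue, "r2", "def quicksort(", "interactive")
enqueue(queue, "r3", "draft a reply to ...", "conversational")
print([heapq.heappop(queue).request_id for _ in range(len(queue))])  # ['r2', 'r3', 'r1']
```

The point is not the queue itself but that latency-tolerant traffic gives the platform freedom to defer work and raise utilization.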
### Why llm-d?

-To exploit these characteristics and achieve optimal performance for LLM workloads, the inference serving landscape is rapidly transitioning towards distributed cluster-scale architectures. For instance, in its “Open Source Week”, the DeepSeek team published the design of its [inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which aggressively leverages disaggregation and KV caching to achieve remarkable performance per $ of compute.
+To exploit these characteristics and achieve optimal performance for LLM workloads, the inference serving landscape is rapidly transitioning towards distributed cluster-scale architectures. For instance, in its "Open Source Week", the DeepSeek team published the design of its [inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which aggressively leverages disaggregation and KV caching to achieve remarkable performance per $ of compute.

However, for most GenAI innovators, ML platform teams, and IT operations groups, these benefits remain out of reach. Building and operating a complex, monolithic system is time-consuming and challenging, especially given the rapid pace of innovation and enterprise deployments with tens or hundreds of models for divergent use cases. This complexity risks slower time to market, higher operational costs and sprawl, and difficulty adopting and experimenting.
@@ -129,25 +115,25 @@ To achieve this objective, we designed llm-d with a modular and layered architec

* [**Kubernetes**](https://kubernetes.io/docs/home/) **(K8s)**. K8s is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications. It is the industry standard for deploying and updating LLM inference engines across various hardware accelerators.

-* [**Inference Gateway**](https://gateway-api-inference-extension.sigs.k8s.io/) **(IGW)**. IGW is an official Kubernetes project that extends the [Gateway API](https://gateway-api.sigs.k8s.io/) (the next generation of Kubernetes Ingress and Load Balancing API) with inference-specific routing. IGW includes many important features like model routing, serving priority, and extensible scheduling logic for “smart” load balancing. IGW integrates with many different gateway implementations, such as Envoy, making it widely portable across Kubernetes clusters.
+* [**Inference Gateway**](https://gateway-api-inference-extension.sigs.k8s.io/) **(IGW)**. IGW is an official Kubernetes project that extends the [Gateway API](https://gateway-api.sigs.k8s.io/) (the next generation of Kubernetes Ingress and Load Balancing API) with inference-specific routing. IGW includes many important features like model routing, serving priority, and extensible scheduling logic for "smart" load balancing. IGW integrates with many different gateway implementations, such as Envoy, making it widely portable across Kubernetes clusters.

-* **vLLM Optimized Inference Scheduler** \- IGW defines a pattern for customizable “smart” load-balancing via the [Endpoint Picker Protocol (EPP)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol). Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make “smart” scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to be used out-of-the-box by llm-d users. Advanced teams can also tweak or implement their own scorers and filterers to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.
+* **vLLM Optimized Inference Scheduler** \- IGW defines a pattern for customizable "smart" load-balancing via the [Endpoint Picker Protocol (EPP)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol). Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make "smart" scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to work out-of-the-box for llm-d users. Advanced teams can also tweak or implement their own scorers and filters to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.
  * For more details, see our Northstar: [\[PUBLIC\] llm-d Scheduler Northstar](https://docs.google.com/document/d/1kE1LY8OVjiOgKVD9-9Po96HODbTIbgHp4qgvw06BCOc/edit?tab=t.0)

-* **Disaggregated Serving with [vLLM](https://github.com/vllm-project/vllm)** \- llm-d leverages vLLM’s recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like [NVIDIA’s NIXL](https://github.com/ai-dynamo/nixl).
+* **Disaggregated Serving with [vLLM](https://github.com/vllm-project/vllm)** \- llm-d leverages vLLM's recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like [NVIDIA's NIXL](https://github.com/ai-dynamo/nixl).

-  In llm-d, we plan to support two “well-lit” paths for prefill/decode (P/D) disaggregation:
+  In llm-d, we plan to support two "well-lit" paths for prefill/decode (P/D) disaggregation:
  * Latency optimized implementation using fast interconnects (IB, RDMA, ICI)
  * Throughput optimized implementation using data center networking
  * For more details, see our Northstar: [\[PUBLIC\] llm-d Disaggregated Serving Northstar](https://docs.google.com/document/d/1FNN5snmipaTxEA1FGEeSH7Z_kEqskouKD1XYhVyTHr8/edit?tab=t.0#heading=h.ycwld2oth1kj)
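The sketch below shows the basic decision that disaggregated serving introduces: short prompts run prefill and decode together, while prefill-heavy requests go to a prefill pool first and then hand their KV cache off to a decode pool. The threshold and pool names are invented for illustration; the real handoff happens through vLLM's KV Connector API and the scheduler's filters.

```python
# Illustrative P/D disaggregation decision (not the actual llm-d code path):
# prefill-heavy requests go to a prefill worker first, then hand the KV cache to
# a decode worker; short prompts skip disaggregation entirely.
from dataclasses import dataclass

PREFILL_THRESHOLD_TOKENS = 256  # made-up cutoff; real deployments would tune this

@dataclass
class Plan:
    prefill_pool: str | None  # None means "run prefill and decode on one instance"
    decode_pool: str

def plan_request(prompt_tokens: int) -> Plan:
    """Decide whether a request is worth disaggregating."""
    # Long prompts dominate prefill cost, so only then is the KV-transfer overhead worth paying.
    if prompt_tokens >= PREFILL_THRESHOLD_TOKENS:
        return Plan(prefill_pool="prefill-workers", decode_pool="decode-workers")
    return Plan(prefill_pool=None, decode_pool="decode-workers")

print(plan_request(prompt_tokens=2048))  # Plan(prefill_pool='prefill-workers', decode_pool='decode-workers')
print(plan_request(prompt_tokens=64))    # Plan(prefill_pool=None, decode_pool='decode-workers')
```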
* **Disaggregated Prefix Caching with vLLM** \- llm-d uses the same vLLM KV connector API used in disaggregated serving to provide a pluggable cache for previous calculations, including offloading KVs to host, remote storage, and systems like [LMCache](https://github.com/LMCache/LMCache).

-  In llm-d, we plan to support two “well-lit” paths for KV cache disaggregation:
+  In llm-d, we plan to support two "well-lit" paths for KV cache disaggregation:
  * Independent caching with basic offloading to host memory and disk, providing a zero operational cost mechanism that utilizes all system resources.
  * Shared caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.
  * For more details, see our Northstar: [\[PUBLIC\] llm-d Prefix Caching Northstar](https://docs.google.com/document/d/1d-jKVHpTJ_tkvy6Pfbl3q2FM59NpfnqPAh__Uz_bEZ8/edit?tab=t.0#heading=h.6qazyl873259)
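As a toy illustration of the offloading path (this is not how LMCache or the vLLM KV connector are implemented; the block format, eviction policy, and on-disk layout are all invented here), a two-tier cache keeps hot KV blocks in host memory and spills cold ones to disk:

```python
# Toy two-tier KV-block cache: hot blocks in host memory, cold blocks spilled to
# disk. Purely conceptual -- real offloading manages GPU/host/remote tiers with
# very different data structures and transports.
import os
import pickle
import tempfile
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, max_host_blocks: int = 4, spill_dir: str | None = None):
        self.max_host_blocks = max_host_blocks
        self.host: OrderedDict[str, bytes] = OrderedDict()  # LRU order, oldest first
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="kv-spill-")

    def _spill_path(self, block_id: str) -> str:
        return os.path.join(self.spill_dir, f"{block_id}.pkl")

    def put(self, block_id: str, kv_bytes: bytes) -> None:
        self.host[block_id] = kv_bytes
        self.host.move_to_end(block_id)
        while len(self.host) > self.max_host_blocks:
            evicted_id, evicted = self.host.popitem(last=False)  # evict least recently used
            with open(self._spill_path(evicted_id), "wb") as f:
                pickle.dump(evicted, f)

    def get(self, block_id: str) -> bytes | None:
        if block_id in self.host:                 # host-memory hit
            self.host.move_to_end(block_id)
            return self.host[block_id]
        path = self._spill_path(block_id)
        if os.path.exists(path):                  # disk hit: promote back to host memory
            with open(path, "rb") as f:
                kv = pickle.load(f)
            os.remove(path)
            self.put(block_id, kv)
            return kv
        return None                               # miss: prefill must be recomputed

cache = TieredKVCache(max_host_blocks=2)
for i in range(3):
    cache.put(f"block-{i}", b"fake-kv-tensor-%d" % i)
print(cache.get("block-0") is not None)  # True: served from the disk tier
```

The shared-caching path replaces the local disk tier with remote storage plus a global index, so any replica can locate blocks produced elsewhere.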
@@ -163,7 +149,7 @@ And our key new contributions:

#### Example llm-d Features

-llm-d integrates IGW and vLLM together, enabling a high performance distributed serving stack. Let’s discuss some of the example features enabled by llm-d.
+llm-d integrates IGW and vLLM together, enabling a high-performance distributed serving stack. Let's discuss some of the example features enabled by llm-d.

**Prefix and KV cache-aware routing**
@@ -185,13 +171,13 @@ We conducted a series of experiments to evaluate the performance of the [llm-d-i

* **S2:** llm-d delivers \~50% higher QPS than the baseline while meeting SLO requirements (higher is better).
* **S3:** llm-d sustains 2X the baseline QPS under SLO constraints (higher is better).

-These results show that llm-d’s cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLA requirements.
+These results show that llm-d's cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLA requirements.

Try it out with the `base.yaml` config in our [quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart). And as a customization example, see the [template](https://github.com/llm-d/llm-d-inference-scheduler/blob/main/docs/create_new_filter.md) for adding your own scheduler filter.
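If you want to sanity-check TTFT against your own deployment, the simplest probe measures the time from sending a streaming request until the first completion chunk arrives. Below is a minimal sketch against an OpenAI-compatible endpoint; the gateway URL and model name are placeholders for whatever your quickstart deployment exposes.

```python
# Minimal TTFT probe against an OpenAI-compatible streaming endpoint.
# ENDPOINT and MODEL are placeholders -- point them at the gateway address and
# model served by your deployment.
import json
import time
import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder gateway URL
MODEL = "meta-llama/Llama-3.1-8B-Instruct"         # placeholder model name

def measure_ttft(prompt: str) -> float:
    """Return seconds from request send to the first streamed completion chunk."""
    start = time.monotonic()
    with requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 64, "stream": True},
        stream=True,
        timeout=120,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue  # skip blank lines and SSE keep-alives
            if line == b"data: [DONE]":
                break
            chunk = json.loads(line[len(b"data: "):])
            if chunk.get("choices"):
                return time.monotonic() - start  # first token arrived
    raise RuntimeError("stream ended before any tokens were produced")

if __name__ == "__main__":
    print(f"TTFT: {measure_ttft('Explain prefix caching in one sentence.') * 1000:.1f} ms")
```

Looping this over a set of representative prompts for both your baseline and your llm-d deployment gives a rough, do-it-yourself version of the TTFT comparison above.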
**P/D disaggregation**

-We’ve completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D \+ EP\<\>DP for large scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.
+We've completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D \+ EP\<\>DP for large scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.

Try it out with the `pd-nixl.yaml` config in our [quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart).
@@ -25,13 +21,15 @@ We've hit 1000 ⭐️'s on [GitHub](https://github.com/llm-d/llm-d)

+<!-- truncate -->

**Here are some of the active design conversations:**

:::tip Join our Google Group
We use Google Groups to share architecture diagrams and other content. Please join: [llm-d-contributors Google Group](https://groups.google.com/g/llm-d-contributors)