14 changes: 0 additions & 14 deletions blog/2025-05-07_welcome_llmd.md

This file was deleted.

13 changes: 0 additions & 13 deletions blog/2025-05-19.md

This file was deleted.

209 changes: 209 additions & 0 deletions blog/2025-05-20_announce.md

Large diffs are not rendered by default.

40 changes: 22 additions & 18 deletions blog/authors.yml
@@ -1,22 +1,26 @@
Huey:
name: Huw
title: The Nephew in Red

Dewey:
name: Dewydd
title: The one in Blue
redhat:
name: RedHat
url: https://redhat.com
image_url: https://www.redhat.com/rhdc/managed-files/Logo-Red_Hat-Hat_icon-Standard-RGB.svg


Louie:
name: Lewellyn
title: That one in green
robshaw:
name: Robert Shaw
title: Director of Engineering, Red Hat
url: https://github.com/robertgshaw2-redhat
image_url: https://avatars.githubusercontent.com/u/114415538?v=4
email: [email protected]

kahuna:
name: Big kahuna
title: The one in charge
smarterclayton:
name: Clayton Coleman
title: Distinguished Engineer, Google
url: https://github.com/smarterclayton
image_url: https://avatars.githubusercontent.com/u/1163175?v=4
email: [email protected]

redhat-author:
name: RedHat
title: One of the sponsors
url: https://redhat.com
image_url: https://www.redhat.com/rhdc/managed-files/Logo-Red_Hat-Hat_icon-Standard-RGB.svg
chcost:
name: Carlos Costa
title: Distinguished Engineer, IBM
url: https://github.com/chcost
image_url: https://avatars.githubusercontent.com/u/26551701?v=4
email: [email protected]
10 changes: 8 additions & 2 deletions blog/tags.yml
@@ -19,7 +19,7 @@ llm-d:
description: llm-d tag description

news:
label: News Releases!
label: News Releases
permalink: /news-releases
description: Used for "official" news releases in the blog

@@ -34,6 +34,12 @@ hola:
description: Hola tag description

blog:
label: just a blog
label: blog posts
permalink: /blog
description: everyday blog posts


announce:
label: Announcements
permalink: /announce
description: Announcements that aren't news releases
5 changes: 3 additions & 2 deletions docs/architecture/00_architecture.md
@@ -3,7 +3,7 @@ sidebar_position: 0
label: llm-d Architecture
---
# Overview of llm-d architecture
`llm-d` is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.
`llm-d` is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve large language models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

With `llm-d`, users can operationalize GenAI deployments with a modular solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in [Inference Gateway (IGW)](https://github.com/kubernetes-sigs/gateway-api-inference-extension).

@@ -14,7 +14,7 @@ Built by leaders in the Kubernetes and vLLM projects, `llm-d` is a community-dri
`llm-d` adopts a layered architecture on top of industry-standard open technologies: vLLM, Kubernetes, and Inference Gateway.


![llm-d Architecture](../assets/images/llm-d-arch.svg)
![llm-d Architecture](../assets/images/llm-d-arch-simplified.svg)



@@ -31,6 +31,7 @@ Key features of `llm-d` include:
- **Variant Autoscaling over Hardware, Workload, and Traffic** (🚧): We plan to implement a traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes into account different request shapes and QoS, and (c) assesses the recent traffic mix (QPS, QoS, and shapes), using that mix to calculate the optimal set of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency. [See our Northstar design](https://docs.google.com/document/d/1inTneLEZTv3rDEBB9KLOB9K6oMq8c3jkogARJqdt_58/edit?tab=t.0)

For more, see the [project proposal](https://github.com/llm-d/llm-d/blob/dev/docs/proposals/llm-d.md)

## Getting Started

113 changes: 0 additions & 113 deletions docs/architecture/Component Architecture/02_inf-extension.md

This file was deleted.

120 changes: 120 additions & 0 deletions docs/architecture/Component Architecture/02_inf-simulator.md
@@ -0,0 +1,120 @@
---
sidebar_position: 2
sidebar_label: Inference Simulator
---
# vLLM Simulator
To help with development and testing, we have developed a lightweight vLLM simulator. It does not truly
run inference, but it emulates responses to the HTTP REST endpoints of vLLM.
Currently it supports a partial OpenAI-compatible API:
- /v1/chat/completions
- /v1/completions
- /v1/models

In addition, it supports a subset of vLLM's Prometheus metrics. These metrics are exposed via the /metrics HTTP REST endpoint. Currently supported are the following metrics:
- vllm:lora_requests_info
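
As a quick illustration, the metrics endpoint can be scraped directly once the simulator is running; the host and port below are assumptions matching the examples later on this page:

```bash
# A sketch, assuming the simulator is reachable at localhost:8000:
# scrape the Prometheus endpoint and filter for the supported metric.
curl -s http://localhost:8000/metrics | grep "vllm:lora_requests_info"
```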

The simulated inference has no connection with the model and LoRA adapters specified in the command line parameters. The /v1/models endpoint returns simulated results based on those same command line parameters.

The simulator supports two modes of operation:
- `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message for the role=`user` is used.
- `random` mode: the response is randomly chosen from a set of pre-defined sentences.
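
For instance, in `echo` mode a completion request simply returns the prompt text back. This is a minimal sketch, assuming the simulator is already running locally on port 8000 and serving the model used in the examples below:

```bash
# Echo mode: the returned text mirrors the prompt sent in the request.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "Hello, simulator!", "stream": false}'
```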

Timing of the response is defined by two parameters: `time-to-first-token` and `inter-token-latency`.

For a request with `stream=true`: `time-to-first-token` defines the delay before the first token is returned, and `inter-token-latency` defines the delay between subsequent tokens in the stream.

For a request with `stream=false`: the response is returned after a delay of `<time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1))`

It can be run standalone or in a Pod for testing under tools such as Kind.

## Limitations
API responses contain a subset of the fields provided by the OpenAI API.

<details>
<summary>Click to show the structure of requests/responses</summary>

- `/v1/chat/completions`
- **request**
- stream
- model
- messages
- role
- content
- **response**
- id
- created
- model
- choices
- index
- finish_reason
- message
- `/v1/completions`
- **request**
- stream
- model
- prompt
- max_tokens (for future usage)
- **response**
- id
- created
- model
- choices
- text
- `/v1/models`
- **response**
- object (list)
- data
- id
- object (model)
- created
- owned_by
- root
- parent
</details>
<br/>
For more details, see the [vLLM documentation](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#openai-completions-api-with-vllm)
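
As an illustration, a chat completion can be requested using only the field subset listed above. This is a sketch, assuming the simulator is running locally on port 8000 and serving the model used in the examples below:

```bash
# Request uses only the supported fields: model, stream, and messages (role/content).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "stream": false,
        "messages": [{"role": "user", "content": "Hello, simulator!"}]
      }'
```

The response is expected to contain only the subset of fields listed above (id, created, model, and choices).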

## Command line parameters
- `port`: the port the simulator listens on, mandatory
- `model`: the currently 'loaded' model, mandatory
- `lora`: a list of available LoRA adapters, separated by commas, optional, by default empty
- `mode`: the simulator mode, optional, by default `random`
- `echo`: returns the same text that was sent in the request
- `random`: returns a sentence chosen at random from a set of pre-defined sentences
- `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero
- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero
- `max-loras`: maximum number of LoRAs in a single batch, optional, default is one
- `max-cpu-loras`: maximum number of LoRAs to store in CPU memory, optional, must be >= `max-loras`, default is `max-loras`
- `max-running-requests`: maximum number of inference requests that can be processed at the same time
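
For example, a hypothetical invocation combining several of these parameters might look like the following; the flag values are illustrative rather than defaults, and the binary path matches the standalone build described below:

```bash
# Illustrative only: echo mode, two LoRA adapters, and simulated latencies (in milliseconds).
./bin/llm-d-inference-sim \
  --port 8000 \
  --model "Qwen/Qwen2.5-1.5B-Instruct" \
  --lora "tweet-summary-0,tweet-summary-1" \
  --mode echo \
  --time-to-first-token 100 \
  --inter-token-latency 20 \
  --max-loras 2
```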


## Working with the Docker image

### Building
To build a Docker image of the vLLM Simulator, run:
```bash
make build-llm-d-inference-sim-image
```

### Running
To run the vLLM Simulator image under Docker, run:
```bash
docker run --rm --publish 8000:8000 ai-aware-router/llm-d-inference-sim:0.0.1 /ai-aware-router/llm-d-inference-sim --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1"
```
**Note:** The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.
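
To sanity-check that the container is up, the simulated model list can be queried; this is a sketch assuming the port mapping from the command above:

```bash
# List the simulated models served by the container started above.
curl -s http://localhost:8000/v1/models
```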

## Standalone testing

### Building
To build the vLLM simulator, run:
```bash
make build-llm-d-inference-sim
```

### Running
To run the simulator in a standalone test environment, run:
```bash
./bin/llm-d-inference-sim --model my_model --port 8000
```


Binary file added docs/assets/images/carlos costa.jpeg
Binary file added docs/assets/images/clayton coleman.jpeg
Binary file added docs/assets/images/image1_116.png
Binary file added docs/assets/images/image2_4.jpg
Binary file added docs/assets/images/image3.jpg
Binary file added docs/assets/images/image4_57.png
Binary file added docs/assets/images/image5_46.png
Binary file added docs/assets/images/image7_33.png
Binary file added docs/assets/images/image8_0.jpg
1 change: 1 addition & 0 deletions docs/assets/images/llm-d-arch-simplified.svg
Binary file added docs/assets/images/robert shaw headshot.jpeg
2 changes: 2 additions & 0 deletions docs/community/contact_us.md
@@ -9,6 +9,8 @@ There are several ways you can join the community effort to develop and enhance
- Via the [**GitHub pages for llm-d:** https://github.com/llm-d](https://github.com/llm-d)
- Via our [**Slack Workspace:** https://llm-d.slack.com](https://llm-d.slack.com)
- Via [**Reddit:** https://www.reddit.com/r/llm_d/](https://www.reddit.com/r/llm_d/)
- We host a weekly standup for contributors on Wednesdays at 12:30 PM ET. Please join: [Meeting Details](https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=NG9yZ3AyYTN0N3VlaW01b21xbWV2c21uNjRfMjAyNTA1MjhUMTYzMDAwWiByb2JzaGF3QHJlZGhhdC5jb20&tmsrc=robshaw%40redhat.com&scp=ALL)
- We use Google Groups to share architecture diagrams and other content. Please join: [Google Group](https://groups.google.com/g/llm-d-contributors)

You can also find us on
