9 changes: 5 additions & 4 deletions docs/architecture/00_architecture.mdx
@@ -21,7 +21,7 @@ import VideoEmbed from '@site/src/components/VideoEmbed';


![llm-d Architecture](../assets/images/llm-d-arch-simplified.svg)


Key features of `llm-d` include:

@@ -54,12 +54,13 @@ See the guided experience with our [quickstart](https://github.com/llm-d/llm-d-d
The llm-d repo is a metaproject with subcomponents that can be cloned individually.

To clone all the components:

```bash
git clone --recurse-submodules https://github.com/llm-d/llm-d.git
```

**Tip**
As a customization example, see [this template](https://github.com/llm-d/llm-d/tree/dev) for adding a scheduler scorer.
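For a sense of what a scorer computes (the `Pod` type and function below are purely illustrative, not the scheduler's real plugin interface — follow the linked template for that), a scorer maps candidate pods to per-pod scores that the scheduler combines when picking a target:

```go
// Illustrative only: a stand-alone sketch of what a "scorer" computes.
// A real scorer plugs into the llm-d-inference-scheduler plugin framework;
// its actual interface is defined by the template linked above.
package main

import "fmt"

// Pod is a hypothetical view of a candidate endpoint.
type Pod struct {
	Name        string
	QueueLength int
}

// queueScorer favors pods with shorter request queues, returning a score per pod.
func queueScorer(pods []Pod) map[string]float64 {
	scores := make(map[string]float64, len(pods))
	for _, p := range pods {
		scores[p.Name] = 1.0 / float64(1+p.QueueLength)
	}
	return scores
}

func main() {
	pods := []Pod{{"pod-a", 4}, {"pod-b", 0}, {"pod-c", 2}}
	fmt.Println(queueScorer(pods)) // pod-b scores highest and would be preferred
}
```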

## Releases

46 changes: 23 additions & 23 deletions docs/architecture/Components/06_kv-cache.md
@@ -6,24 +6,24 @@ sidecar_label: KV-Cache Manager

## Introduction

LLM inference can be computationally expensive due to the sequential nature of token generation.
KV-caching plays a critical role in optimizing this process. By storing previously computed key and value attention vectors,
KV-cache reuse avoids redundant computations during inference, significantly reducing latency and resource consumption.
This is particularly beneficial for long-context, multi-turn conversations or agentic (and RAG) applications where
previously computed information can be leveraged effectively.
Efficient KV-cache management and routing are essential for scaling LLM inference and delivering a responsive user experience.
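To make the reuse mechanism concrete, here is a minimal sketch of chained block hashing — the kind of scheme prefix caching relies on — written in Go with an illustrative block size and hash function; it is not the actual vLLM or llm-d implementation, only a demonstration that requests sharing a prefix map to identical cache keys:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// blockSize is illustrative; real engines choose their own block granularity.
const blockSize = 16

// blockKeys derives one key per full block of token IDs, chaining each block's
// hash with its parent's, so two prompts that share a prefix produce the same
// leading key sequence and can reuse the same cached KV blocks.
func blockKeys(tokens []uint32) []string {
	var keys []string
	parent := []byte{}
	for start := 0; start+blockSize <= len(tokens); start += blockSize {
		h := sha256.New()
		h.Write(parent)
		for _, t := range tokens[start : start+blockSize] {
			var buf [4]byte
			binary.LittleEndian.PutUint32(buf[:], t)
			h.Write(buf[:])
		}
		parent = h.Sum(nil)
		keys = append(keys, fmt.Sprintf("%x", parent[:8]))
	}
	return keys
}

func main() {
	shared := make([]uint32, 48) // stand-in for a shared system prompt or document
	for i := range shared {
		shared[i] = uint32(i)
	}
	a := append(append([]uint32{}, shared...), 101, 102) // request A: shared prefix + its own turn
	b := append(append([]uint32{}, shared...), 201, 202) // request B: same prefix, different turn
	fmt.Println(blockKeys(a))
	fmt.Println(blockKeys(b)) // the three prefix-block keys match request A's
}
```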

`llm-d-kv-cache-manager` is a pluggable KV-cache manager for KV-cache-aware routing in LLM serving platforms.

This initial work will expand in scope and capability as development continues.

See the [docs folder in the repository](https://github.com/llm-d/llm-d-kv-cache-manager/blob/main/docs/README.md) for more information on goals, architecture and more.

## Goals

The KV-Cache-Manager is designed to connect high-level serving-stack goals with concrete system capabilities through a layered objective structure:

- **Improve user experience**
- By reducing Time-To-First-Token (TTFT)
- Enabled through higher KVCache hit rates and reduced tensor transfers
- Supported by smart routing and distributed cache availability
@@ -38,26 +38,26 @@ The KV-Cache-Manager is designed to connect high-level serving-stack goals with
- User session duplication/migration for true and seamless load balancing


## Vision

The goal structure above is shaped by our vision for emerging use cases like RAG and agentic workflows,
which involve heavy context-reuse across sessions and instances.
Shared documents, tool prompts, and workflow steps create overlapping token streams that benefit significantly from
cross-instance KVCache coordination.

To implement this vision, the KVCache-Manager incorporates proactive cache placement, session duplication,
and cluster-level cache APIs - bridging gaps in current serving stacks where KVCache management and utilization are
not yet treated as first-class concerns.

## Architecture Overview

The code defines a [kvcache.Indexer](https://github.com/llm-d/llm-d-kv-cache-manager/blob/main/pkg/kvcache/indexer.go) module that efficiently maintains a global view of KV-cache states and localities.
In the current state of vLLM, the only available information on KV-cache availability concerns the tensors offloaded to KV-cache engines via the Connector API.

The `kvcache.Indexer` module is a pluggable Go package designed for use by orchestrators to enable KV-cache-aware scheduling decisions.
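A rough sketch of how an orchestrator might consume this package is given below; the `Indexer` interface, `Score` signature, and pod names are illustrative assumptions rather than the module's actual API (see the repository for the real one), but the shape mirrors the flow in the diagram that follows: score candidate pods for a prompt, then route to the best-scoring pod.

```go
// Hypothetical sketch of KV-cache-aware routing on top of an indexer.
// The Indexer interface, Score signature, and pod names are illustrative
// assumptions; the real API lives in the llm-d-kv-cache-manager repository.
package main

import (
	"context"
	"fmt"
)

// Indexer models the role of kvcache.Indexer: score candidate pods by how much
// of the prompt's KV-cache they are believed to hold.
type Indexer interface {
	Score(ctx context.Context, prompt, modelName string, pods []string) (map[string]float64, error)
}

// pickPod routes to the highest-scoring pod, falling back to the first candidate.
func pickPod(ctx context.Context, idx Indexer, prompt, model string, pods []string) (string, error) {
	if len(pods) == 0 {
		return "", fmt.Errorf("no candidate pods")
	}
	scores, err := idx.Score(ctx, prompt, model, pods)
	if err != nil {
		return "", err
	}
	best, bestScore := pods[0], -1.0
	for _, p := range pods {
		if s, ok := scores[p]; ok && s > bestScore {
			best, bestScore = p, s
		}
	}
	return best, nil
}

// staticIndexer is a stub used only to make the example runnable.
type staticIndexer map[string]float64

func (s staticIndexer) Score(context.Context, string, string, []string) (map[string]float64, error) {
	return s, nil
}

func main() {
	pods := []string{"pod-a", "pod-b", "pod-c"}
	idx := staticIndexer{"pod-b": 0.9, "pod-c": 0.4} // pretend pod-b holds most of the prefix
	target, _ := pickPod(context.Background(), idx, "shared system prompt...", "example-model", pods)
	fmt.Println("route to:", target) // route to: pod-b
}
```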

```mermaid
graph
subgraph Cluster
Router
subgraph KVCacheManager[KV-cache Manager]
@@ -75,7 +75,7 @@ graph
Router -->|"Score(prompt, ModelName, relevantPods)"| kvcache.Indexer
kvcache.Indexer -->|"{Pod to Scores map}"| Router
Router -->|Route| vLLMNode

kvcache.Indexer -->|"FindLongestTokenizedPrefix(prompt, ModelName) -> tokens"| PrefixStore
PrefixStore -->|"DigestPromptAsync"| PrefixStore
kvcache.Indexer -->|"GetPodsForKeys(tokens) -> {KVBlock keys to Pods} availability map"| KVBlockToPodIndex
@@ -88,7 +88,7 @@ This overview greatly simplifies the actual architecture and combines steps acro



## Architecture

For a more detailed architecture, refer to the [architecture](https://github.com/llm-d/llm-d-kv-cache-manager/tree/main/docs/architecture.md) document.

@@ -98,7 +98,7 @@ The architecture is designed to efficiently maintain a global view of KV-cache s

```mermaid
sequenceDiagram
participant U as User
participant KVI as kvcache.Indexer
box
participant KVBS as KVBlockScorer
@@ -130,7 +130,7 @@ KVI->>PS: 2. FindLongestTokenizedPrefix(prompt, ModelName)
end
PS->>KVI: 2.2 Tokens of longest prefix

# get block keys
KVI->>TPR: 3 GetBlockKeys(tokens, ModelName)
TPR->>KVI: 3.1 BlockKeys

@@ -207,8 +207,8 @@ Future enhancements will enable the `llm-d-kv-cache-manager` component to proces

## Examples

- [KV-cache Indexer](https://github.com/llm-d/llm-d-kv-cache-manager/tree/main/examples/kv-cache-index/):
  - A reference implementation of using the `kvcache.Indexer` module.
- [KV-cache Aware Scorer](https://github.com/llm-d/llm-d-kv-cache-manager/tree/main/examples/kv-cache-aware-scorer/):
  - A reference implementation of integrating the `kvcache.Indexer` module in
[llm-d-inference-scheduler](https://github.com/llm-d/llm-d-inference-scheduler) in a KV-cache aware scorer.
9 changes: 4 additions & 5 deletions docs/guide/guide.md
@@ -10,16 +10,15 @@ The user guide is organized in sections to help you get started with llm-d and t

llm-d is an open source project providing distributed inferencing for GenAI runtimes on any Kubernetes cluster. Its highly performant, scalable architecture helps reduce costs through a spectrum of hardware efficiency improvements. The project prioritizes ease of deployment and use, as well as SRE needs and day-2 operations associated with running large GPU clusters.

[For more information, check out the Architecture Documentation](./architecture/architecture)

## Installation: Start here to minimize your frustration

This guide will walk you through the steps to install and deploy the llm-d quickstart demo on a Kubernetes cluster.

- [Prerequisites](./Installation/prerequisites.md) Make sure your compute resources and system configuration are ready
- [Quick Start](./Installation/quickstart.md) If your resources are ready, "kick the tires" with our Quick Start!




6 changes: 3 additions & 3 deletions docs/intro.md
@@ -6,7 +6,7 @@ sidebar_position: 1

## Fork and Clone the Repository

Fork the repository of the site on our [GitHub](https://github.com/RedHatOfficial/static-website-template-for-ospo).

Clone your fork to your own computer using this command, replacing the link with your own HTTPS clone link found underneath the **Code** button (see image below):

@@ -35,12 +35,12 @@ Run the server itself with this command:
npm start
```

The `cd` command changes the directory you're working with.

The `npm start` command builds your website locally and serves it through a development server, ready for you to view at http://localhost:3000/.

Open `docs/intro.md` (this page) and edit some lines: the site **reloads automatically** and displays your changes.

## Using Docusaurus

Docusaurus is a static-site generator that takes Markdown files and converts them into a documentation website. It is written in JavaScript; however, no prior knowledge of JavaScript is needed to edit the website.
6 changes: 3 additions & 3 deletions src/components/Install/index.js
@@ -22,7 +22,7 @@ export default function Install() {
alt="1. "
src={require('/docs/assets/counting-01.png').default}
></img>
<a className="link" href="docs/guide/installation/prerequisites#compute">
<a className="link" href="docs/guide/Installation/prerequisites#compute">
Check the Prerequisites
</a>
</h3>
@@ -34,7 +34,7 @@ export default function Install() {
alt="2. "
src={require('/docs/assets/counting-02.png').default}
></img>
<a className="link" href="docs/guide/installation/quickstart#install">
<a className="link" href="docs/guide/Installation/quickstart#install">
Run the Quickstart
</a>
</h3>
@@ -46,7 +46,7 @@ export default function Install() {
alt="3. "
src={require('/docs/assets/counting-03.png').default}
></img>
<a className="link" href="docs/guide/installation/quickstart#examples">Explore llm-d!</a></h3>
<a className="link" href="docs/guide/Installation/quickstart#examples">Explore llm-d!</a></h3>
{/* -------------------------------------------------------------------------- */}
<a className="static-button install-button button-link" href="docs/guide">
Complete install methods here
13 changes: 5 additions & 8 deletions src/components/Welcome/index.js
@@ -14,16 +14,13 @@ export default function Welcome() {
<h2 className="welcome-h2">
llm-d: a Kubernetes-native high-performance distributed LLM inference framework
</h2>


<div className="button-group">
<a className="static-button button-link" href="docs/architecture/architecture">
Architecture
</a>
<a className="static-button button-link" href="docs/guide/Installation/prerequisites" >
{/* Link to install page on the docs */}
Installation
</a>
@@ -35,13 +35,13 @@

<div className="hidden-for-mobile">
<p>
llm-d is a well-lit path for anyone to serve at scale,
with the fastest time-to-value and competitive performance per dollar,
for most models across a diverse and comprehensive set of hardware accelerators.
</p>

</div>

</div>
</div>
);