* Fixing broken links.
This should fix the broken links from the Jun 25th check.
Signed-off-by: JJ Asghar <[email protected]>
* Fixed MDX and relative paths
Fixed MDX and relative paths for the docs.
Signed-off-by: JJ Asghar <[email protected]>
* Removed endings of files
Not sure if this will actually work...
Signed-off-by: JJ Asghar <[email protected]>
* Fixing the architecture links.
- Fixed some whitespace
- Fixed the architecture link
Signed-off-by: JJ Asghar <[email protected]>
* We need "I", not "i"
Docusaurus can redirect, but it still gives a 404 first.
Signed-off-by: JJ Asghar <[email protected]>
* Removed the hard-coded llm-d.ai URL
The link will now work as a relative path.
Signed-off-by: JJ Asghar <[email protected]>
---------
Signed-off-by: JJ Asghar <[email protected]>
docs/architecture/Components/06_kv-cache.md (23 additions, 23 deletions)
@@ -6,24 +6,24 @@ sidecar_label: KV-Cache Manager
 
 ## Introduction
 
-LLM inference can be computationally expensive due to the sequential nature of token generation.
-KV-caching plays a critical role in optimizing this process. By storing previously computed key and value attention vectors,
-KV-cache reuse avoids redundant computations during inference, significantly reducing latency and resource consumption.
-This is particularly beneficial for long context multi-turn conversations or Agentic (&RAG) applications where
-previously computed information can be leveraged effectively.
+LLM inference can be computationally expensive due to the sequential nature of token generation.
+KV-caching plays a critical role in optimizing this process. By storing previously computed key and value attention vectors,
+KV-cache reuse avoids redundant computations during inference, significantly reducing latency and resource consumption.
+This is particularly beneficial for long context multi-turn conversations or Agentic (&RAG) applications where
+previously computed information can be leveraged effectively.
 
 Efficient KV-cache management and routing are essential for scaling LLM inference and delivering a responsive user experience.
 
 llmd-kv-cache-manager is a pluggable KV-cache Manager for KV-cache Aware Routing in LLM serving platforms.
 
 This initial work will expand in capacity as development continues.
-
+
 See the [docs folder in the repository](https://github.com/llm-d/llm-d-kv-cache-manager/blob/main/docs/README.md) for more information on goals, architecture and more.
 
 ## Goals
 
 The KV-Cache-Manager is designed to connect high-level serving-stack goals with concrete system capabilities through a layered objective structure:
 
--**Improve user experience**
+-**Improve user experience**
  - By reducing Time-To-First-Token (TTFT)
  - Enabled through higher KVCache hit rates and reduced tensor transfers
  - Supported by smart routing and distributed cache availability
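The introduction quoted in the hunk above rests on one mechanism: requests that share a token prefix also share their leading KV blocks, so a router that knows which pod already holds those blocks can skip repeated prefill work. Below is a minimal Go sketch of how tokens can be mapped to per-block cache keys by chaining a hash over fixed-size chunks; the block size, hash, and function names are assumptions made for illustration, not the scheme llm-d-kv-cache-manager actually uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// blockSize is illustrative; real systems typically chunk prompts into
// larger fixed-size token blocks (e.g. 16 tokens per KV block).
const blockSize = 4

// blockKeys chains a hash over fixed-size token chunks, so two prompts that
// share a prefix also share the keys of their leading blocks. This is an
// assumed scheme for illustration only.
func blockKeys(tokens []int) []uint64 {
	keys := []uint64{}
	h := fnv.New64a()
	for i := 0; i+blockSize <= len(tokens); i += blockSize {
		for _, t := range tokens[i : i+blockSize] {
			fmt.Fprintf(h, "%d,", t) // fold each token of the block into the running hash
		}
		// The hash is never reset, so each key depends on the full prefix
		// up to and including this block.
		keys = append(keys, h.Sum64())
	}
	return keys
}

func main() {
	a := blockKeys([]int{1, 2, 3, 4, 5, 6, 7, 8})        // two full blocks
	b := blockKeys([]int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}) // same prefix; partial last block dropped
	fmt.Println(a[0] == b[0], a[1] == b[1])               // true true: both leading blocks are reusable
}
```

Because the hash is chained rather than reset per block, each key encodes its full prefix, which is what lets two requests (or two pods) agree that a cached block is the same.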
@@ -38,26 +38,26 @@ The KV-Cache-Manager is designed to connect high-level serving-stack goals with
  - User session duplication/migration for true and seamless load balancing
 
 
-## Vision
+## Vision
 
-This goal structure above is shaped by our vision for emerging use cases like RAG and agentic workflows,
-which involve heavy context-reuse across sessions and instances.
-Shared documents, tool prompts, and workflow steps create overlapping token streams that benefit significantly from
-cross-instance KVCache coordination.
+This goal structure above is shaped by our vision for emerging use cases like RAG and agentic workflows,
+which involve heavy context-reuse across sessions and instances.
+Shared documents, tool prompts, and workflow steps create overlapping token streams that benefit significantly from
+cross-instance KVCache coordination.
 
-To implement this vision, the KVCache-Manager incorporates proactive cache placement, session duplication,
-and cluster-level cache APIs - bridging gaps in current serving stacks where KVCache management and utilization is
+To implement this vision, the KVCache-Manager incorporates proactive cache placement, session duplication,
+and cluster-level cache APIs - bridging gaps in current serving stacks where KVCache management and utilization is
 not yet treated as a first-class concern.
 
 ## Architecture Overview
 
-The code defines a [kvcache.Indexer](https://github.com/llm-d/llm-d-kv-cache-manager/tree/main/pkg/kv-cache/indexer.go) module that efficiently maintains a global view of KV-cache states and localities.
+The code defines a [kvcache.Indexer](https://github.com/llm-d/llm-d-kv-cache-manager/blob/main/pkg/kvcache/indexer.go) module that efficiently maintains a global view of KV-cache states and localities.
 In the current state of vLLM, the only available information on KV-cache availability is that of the offloaded tensors to KV-cache Engines via the Connector API.
 
 The `kvcache.Indexer` module is a pluggable Go package designed for use by orchestrators to enable KV-cache-aware scheduling decisions.
 kvcache.Indexer -->|"GetPodsForKeys(tokens) -> {KVBlock keys to Pods} availability map"| KVBlockToPodIndex
@@ -88,7 +88,7 @@ This overview greatly simplifies the actual architecture and combines steps acro
 
 
 
-## Architecture
+## Architecture
 
 For even more a detailed architecture, refer to the [architecture](https://github.com/llm-d/llm-d-kv-cache-manager/tree/main/docs/architecture.md) document.
 
@@ -98,7 +98,7 @@ The architecture is designed to efficiently maintain a global view of KV-cache s
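Reading the `GetPodsForKeys(tokens) -> {KVBlock keys to Pods}` mapping shown earlier in this diff literally, KV-cache-aware scheduling reduces to scoring pods by how many of a request's blocks they already hold. The Go sketch below shows that scoring step against a hypothetical `Indexer` interface; the interface, types, and pod names are invented for illustration and do not mirror the actual llm-d-kv-cache-manager API.

```go
package main

import "fmt"

// Hypothetical types sketching the "KVBlock keys to Pods" availability map.
type BlockKey uint64
type Pod string

// Indexer answers, for a set of KV-block keys, which pods already hold each block.
// Illustrative interface only; not the real library API.
type Indexer interface {
	GetPodsForKeys(keys []BlockKey) map[BlockKey][]Pod
}

// pickPod scores pods by how many of the request's KV blocks they already hold
// and returns the pod with the highest hit count (KV-cache-aware routing).
func pickPod(idx Indexer, keys []BlockKey) (Pod, int) {
	hits := map[Pod]int{}
	for _, pods := range idx.GetPodsForKeys(keys) {
		for _, p := range pods {
			hits[p]++
		}
	}
	var best Pod
	bestHits := -1
	for p, h := range hits {
		if h > bestHits {
			best, bestHits = p, h
		}
	}
	return best, bestHits
}

// staticIndex is a toy in-memory Indexer used only to exercise pickPod.
type staticIndex map[BlockKey][]Pod

func (s staticIndex) GetPodsForKeys(keys []BlockKey) map[BlockKey][]Pod {
	out := map[BlockKey][]Pod{}
	for _, k := range keys {
		if pods, ok := s[k]; ok {
			out[k] = pods
		}
	}
	return out
}

func main() {
	idx := staticIndex{
		1: {"pod-a"},
		2: {"pod-a", "pod-b"},
		3: {"pod-a"},
	}
	pod, hits := pickPod(idx, []BlockKey{1, 2, 3, 4})
	fmt.Printf("route to %s (%d cached blocks)\n", pod, hits) // route to pod-a (3 cached blocks)
}
```

A real scheduler would combine this hit count with load and queue-depth signals rather than routing purely on cache locality.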
docs/guide/guide.md (4 additions, 5 deletions)
@@ -10,16 +10,15 @@ The user guide is organized in sections to help you get started with llm-d and t
 
 llm-d is an open source project providing distributed inferencing for GenAI runtimes on any Kubernetes cluster. Its highly performant, scalable architecture helps reduce costs through a spectrum of hardware efficiency improvements. The project prioritizes ease of deployment+use as well as SRE needs + day 2 operations associated with running large GPU clusters.
 
-[For more information check out the Architecture Documentation](/architecture/00_architecture.md)
+[For more information check out the Architecture Documentation](./architecture/architecture)
 
 ## Installation: Start here to minimize your frustration
 
 This guide will walk you through the steps to install and deploy the llm-d quickstart demo on a Kubernetes cluster.
 
+-[Prerequisites](./guide/Installation/prerequisites) Make sure your compute resources and system configuration are ready
+-[Quick Start](./guide/Installation/quickstart) If your resources are ready, "kick the tires" with our Quick Start!
+
 
--[Prerequisites](./Installation/prerequisites.md) Make sure your compute resources and system configuration are ready
--[Quick Start](./Installation/quickstart.md) If your resources are ready, "kick the tires" with our Quick Start!
docs/intro.md (3 additions, 3 deletions)
@@ -6,7 +6,7 @@ sidebar_position: 1
 
 ## Fork and Clone the Repository
 
-Fork the repositorty of the site on our [GitHub](https://https://github.com/RedHatOfficial/static-website-template-for-ospo).
+Fork the repositorty of the site on our [GitHub](https://https://github.com/RedHatOfficial/static-website-template-for-ospo).
 
 Clone your fork to your own computer then using this command and replacing the link with your own HTTPS clone link found underneath the **Code** button (see image below):
 
@@ -35,12 +35,12 @@ Run the server itself with this command:
 npm start
 ```
 
-The `cd` command changes the directory you're working with.
+The `cd` command changes the directory you're working with.
 
 The `npm start` command builds your website locally and serves it through a development server, ready for you to view at http://localhost:3000/.
 
 Open `docs/intro.md` (this page) and edit some lines: the site **reloads automatically** and displays your changes.
 
 ## Using Docusaurus
 
-Docusaurus is a static-site generator that convetakes Markdown files and donverts them into a documentation website. It is written in JavaScript however no prior knowledge of JavaScript is needed to edit the website.
+Docusaurus is a static-site generator that convetakes Markdown files and donverts them into a documentation website. It is written in JavaScript however no prior knowledge of JavaScript is needed to edit the website.