feat: Add vLLM Data Parallel support to llm-d-inference-scheduler #392
Conversation
resources:
  - https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.0.0
  - https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.1.0-rc.1
v1.1.0 is out. Let's use the formal release:
Suggested change:
-   - https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.1.0-rc.1
+   - https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.1.0
The go.mod file still refers to v1.1.0-rc.1. I'd prefer to keep things in sync.
The move to GIE 1.1.0 should be in a separate PR.
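For reference, the suggested pin amounts to a kustomization like the sketch below once go.mod is bumped (the file name, apiVersion, and kind are assumptions; only the resource URL comes from this PR):

```yaml
# kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Pin the Gateway API Inference Extension CRDs to the formal release once
  # go.mod is moved off v1.1.0-rc.1 in a follow-up PR.
  - https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.1.0
```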
| - "watch" | ||
| - "list" | ||
| - apiGroups: | ||
| - "inference.networking.k8s.io" |
Can we remove inferencepools from the inference.networking.x-k8s.io group?
I left it in, in case someone wants to use the v1alpha2 version...
The kgateway deployment files still use the v1alpha2 InferencePool
verbs:
- update
- patch
- apiGroups:
Ditto; remove from the inference.networking.x-k8s.io group.
The kgateway deployment files still use the v1alpha2 InferencePool
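For context, a minimal sketch of the ClusterRole rules with both API groups kept; the verbs are taken from the fragments above, while the exact grouping and the "get" verb are assumptions:

```yaml
# ClusterRole rules (sketch). Keeping both groups lets the scheduler read the
# v1 InferencePool (inference.networking.k8s.io) while still supporting the
# v1alpha2 InferencePool (inference.networking.x-k8s.io) that the kgateway
# deployment files use.
rules:
  - apiGroups: ["inference.networking.k8s.io"]
    resources: ["inferencepools"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["inference.networking.x-k8s.io"]
    resources: ["inferencepools"]
    verbs: ["get", "watch", "list"]
```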
| - "--zmq-endpoint=tcp://${EPP_NAME}.default.svc.cluster.local:5557" | ||
| - "--event-batch-size=16" | ||
| - "--tokenizers-cache-dir=/tokenizer-cache" | ||
| - "--data-parallel-size=${VLLM_DATA_PARALLEL_SIZE}" |
I'm a bit confused. The --data-parallel-size flag reads its value from the VLLM_DATA_PARALLEL_SIZE env var, which makes this configurable (which is great), but the ports list below is hardcoded.
The ports just need to be exposed. There can't be more than eight ports. The flag simply says how many we want to use.
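A sketch of the pattern being described, assuming the base port 8200 used elsewhere in this PR and consecutive ports for the extra servers (port numbers are illustrative):

```yaml
# The flag value comes from the VLLM_DATA_PARALLEL_SIZE env var, while the
# port list is fixed: it only declares which ports may be used (at most eight).
args:
  - "--data-parallel-size=${VLLM_DATA_PARALLEL_SIZE}"
ports:
  # All eight possible DP ports are exposed up front; only the first
  # ${VLLM_DATA_PARALLEL_SIZE} of them are actually used.
  - containerPort: 8200
  - containerPort: 8201
  - containerPort: 8202
  - containerPort: 8203
  - containerPort: 8204
  - containerPort: 8205
  - containerPort: 8206
  - containerPort: 8207
```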
| - "--vllm-port=8200" | ||
| - "--connector=lmcache" | ||
| - "--secure-proxy=false" | ||
| - "--data-parallel-size=${VLLM_DATA_PARALLEL_SIZE}" |
I thought DP is not used together with PD (this file is vllm-sim-pd). What am I missing?
I have updated the non-PD deployment files to include the sidecar. The script has been changed to launch using the non-PD deployment files even if VLLM_DATA_PARALLEL_SIZE is greater than one.
A subsequent PR will add DP support for PD deployments. For now we'll leave this change here.
The primary use for DP will be with PD deployments; I think supporting it is a feature blocker. Today's hot deployments are TP+DP+EP with PD 😆.
for _, target := range profileResult.TargetPods {
	// Clone the pod info rather than mutating it in place, then point the
	// clone at the handler's target ("base") port.
	newPodInfo := target.GetPod().Clone()
	newPodInfo.Port = h.targetPort
	targetPod := &types.PodMetrics{Pod: newPodInfo, MetricsState: target.GetMetrics().Clone()}
	newResult.TargetPods = append(newResult.TargetPods, targetPod)
}
// Return a result map that contains only the rewritten single profile.
modifiedResults := map[string]*types.ProfileRunResult{singleProfileName: &newResult}
I'm not sure I get why we need newResult or modifiedResults. Can't we iterate over profileResult.TargetPods and update only the port to PrimaryPort?
I don't want to depend on the fact that somewhere up the stack the PodInfo that the plugin received was a clone....
kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer
- type: decode-filter
This is a bit confusing because DP is currently used without PD. I understand why it's used, but maybe there is a less confusing alternative.
That filter is in all of the configuration files here. I actually don't know why....
For proxies that support multi-port, will we be able to use the canonical approach of connecting to the individual ports directly?
@ahg-g wrote:
Yes. The code in this PR is basically a workaround for gateways like Istio that, for now, don't support routing to multiple ports in the same vLLM pod. If your gateway, like KGateway, supports multi-port, then simply don't use the new data-parallel-profile-handler plugin. There will be a follow-up PR that adds PD support for Data Parallel in the sidecar and in the profile handler. I will make sure that any changes there are optional.
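To make the choice concrete, enabling the workaround is just a matter of declaring the new profile handler in the EPP configuration. A minimal sketch, assuming the usual EndpointPickerConfig layout (the apiVersion and the schedulingProfiles section are assumptions; the plugin type names come from this PR):

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1   # assumed
kind: EndpointPickerConfig
plugins:
  - type: prefix-cache-scorer
  - type: decode-filter
  # Only needed when the gateway (e.g. Istio) cannot route to the extra vLLM
  # DP ports directly; omit it for multi-port gateways such as KGateway.
  - type: data-parallel-profile-handler
schedulingProfiles:                                   # layout assumed
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
      - pluginRef: decode-filter
```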
/lgtm
This PR adds support to the llm-d-inference-scheduler for vLLM's Data Parallel (DP) feature, which enables running multiple HTTP servers in the same container. In particular, this PR works around Istio issue 57638.

The PR adds a new profile handler plugin, data-parallel-profile-handler, which needs to be used to enable DP when Disaggregated Prefill/Decode (PD) isn't used. The plugin changes the scheduling result to always schedule the request on the "base" port of the vLLM decode pod and adds a header for the sidecar.

Additionally, the sidecar has been extended to work with the header added by the new data-parallel-profile-handler plugin. This is a workaround for Istio issue 57638, in which we can only route to the "base" port of vLLM pods. The sidecar also always proxies all direct HTTP calls to the "extra" HTTP servers started by vLLM.

Ref #380
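To make the description concrete, here is a rough sketch of a non-PD decode pod under this workaround. Container names, images, and port numbers are illustrative assumptions (8200 is taken from the sidecar's --vllm-port flag, and the extra vLLM servers are assumed to listen on consecutive ports):

```yaml
# Sketch only: a non-PD decode pod running vLLM Data Parallel plus the routing sidecar.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-decode-dp                  # hypothetical name
spec:
  containers:
    - name: routing-sidecar
      image: routing-sidecar:dev        # placeholder image
      args:
        - "--vllm-port=8200"            # vLLM base port, as in the flags above
        - "--data-parallel-size=${VLLM_DATA_PARALLEL_SIZE}"
      ports:
        - containerPort: 8000           # assumed: the single port the scheduler targets
      # The header added by the data-parallel-profile-handler is what the
      # sidecar uses when proxying requests on to the "extra" vLLM servers.
    - name: vllm
      image: vllm:dev                   # placeholder image
      args:
        - "--port=8200"
        - "--data-parallel-size=${VLLM_DATA_PARALLEL_SIZE}"
      # vLLM starts one HTTP server per data-parallel rank (8200, 8201, ...,
      # consecutive ports assumed), at most eight in total.
```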