
Conversation

shmuelk (Collaborator) commented Oct 28, 2025

This PR adds support for vLLM's Data Parallel (DP) feature to the llm-d-inference-scheduler. This vLLM feature runs multiple HTTP servers in the same container.

In particular, this PR attempts to work around Istio issue 57638.

This PR adds a new profile handler plugin, data-parallel-profile-handler, which needs to be used to enable DP when Disaggregated Prefill/Decode (PD) isn't used. This plugin changes the scheduling result to always schedule the request on the "base" port of the vLLM decode pod and adds a header for the sidecar.

Additionally, the sidecar has been extended to work with the header added by the new data-parallel-profile-handler plugin. This is a workaround for Istio issue 57638, in which we can only route to the "base" port of vLLM pods. The sidecar also proxies all direct HTTP calls to the "extra" HTTP servers started by vLLM.

Ref #380
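
For context, a minimal sketch of what enabling the new plugin might look like in the EPP configuration. This mirrors the EndpointPickerConfig excerpts later in this thread; the exact schema, plugin parameters, and profile wiring are defined in this PR, not here:

kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer
- type: data-parallel-profile-handler  # rewrites results onto the "base" port and sets the sidecar header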


resources:
- https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.0.0
- https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.1.0-rc.1
Collaborator

v1.1.0 is out. Let's use the formal release:

Suggested change
- https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.1.0-rc.1
- https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.1.0

Collaborator Author

The go.mod file still refers to v1.1.0-rc.1. I'd prefer to keep things in sync.

The move to GIE 1.1.0 should be in a separate PR.

- "watch"
- "list"
- apiGroups:
- "inference.networking.k8s.io"
Collaborator

Can we remove inferencepools from the inference.networking.x-k8s.io group?

Collaborator Author

I left it in, in case someone wants to use the v1alpha2 version...

Collaborator Author

The kgateway deployment files still use the v1alpha2 InferencePool

verbs:
- update
- patch
- apiGroups:
Collaborator

ditto - remove from inference.networking.x-k8s.io group.

Collaborator Author

The kgateway deployment files still use the v1alpha2 InferencePool

- "--zmq-endpoint=tcp://${EPP_NAME}.default.svc.cluster.local:5557"
- "--event-batch-size=16"
- "--tokenizers-cache-dir=/tokenizer-cache"
- "--data-parallel-size=${VLLM_DATA_PARALLEL_SIZE}"
Collaborator

I'm a bit confused. The flag --data-parallel-size reads from the env var VLLM_DATA_PARALLEL_SIZE, which makes this configurable (which is great).

But the ports list below is hardcoded.

Collaborator Author

The ports just need to be exposed. There can't be more than eight ports. The flag simply says how many we want to use.
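
For illustration only, assuming a base port of 8000 with the extra servers on consecutive ports (the actual numbers come from the deployment files), the fixed list would look like:

ports:
- containerPort: 8000  # "base" HTTP server (DP rank 0)
- containerPort: 8001  # "extra" HTTP server (DP rank 1)
- containerPort: 8002
- containerPort: 8003
- containerPort: 8004
- containerPort: 8005
- containerPort: 8006
- containerPort: 8007  # eight at most; --data-parallel-size says how many are actually used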

- "--vllm-port=8200"
- "--connector=lmcache"
- "--secure-proxy=false"
- "--data-parallel-size=${VLLM_DATA_PARALLEL_SIZE}"
Collaborator

I thought DP is not used together with PD (this file is vllm-sim-pd). What am I missing?

shmuelk (Collaborator Author) commented Oct 28, 2025

I have updated the non-pd deployment files to include the sidecar. The script has been changed to launch using the non-pd deployment files even if VLLM_DATA_PARALLEL_SIZE is greater than one.

A subsequent PR will add DP support for PD deployments. For now, we'll leave this change here.

vMaroon (Member) commented Nov 3, 2025

The primary use for DP will be with PD deployments; I think supporting it is a feature blocker. Today's hot deployments are TP+DP+EP with PD 😆.

Comment on lines 108 to 114
for _, target := range profileResult.TargetPods {
	// Clone the pod info so the original, possibly shared, PodInfo is not mutated.
	newPodInfo := target.GetPod().Clone()
	// Rewrite the target to the "base" port; the sidecar forwards to the right DP rank.
	newPodInfo.Port = h.targetPort
	targetPod := &types.PodMetrics{Pod: newPodInfo, MetricsState: target.GetMetrics().Clone()}
	newResult.TargetPods = append(newResult.TargetPods, targetPod)
}
modifiedResults := map[string]*types.ProfileRunResult{singleProfileName: &newResult}
Collaborator

Not sure I got the idea of why we need newResult or modifiedResults.

Can't we iterate over profileResult.TargetPods and update only the port to PrimaryPort?

Collaborator Author

I don't want to depend on the fact that somewhere up the stack the PodInfo that the plugin received was a clone....
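
For clarity, a sketch of the two approaches under discussion, reusing names from the snippet above (fragments, not standalone code; the in-place variant is the reviewer's suggestion, the clone-first variant is what the PR does):

// In-place: shorter, but mutates a PodInfo that may be shared elsewhere
// in the scheduler if nothing up the stack cloned it.
for _, target := range profileResult.TargetPods {
	target.GetPod().Port = h.targetPort
}

// Clone-first: copy the PodInfo before changing the port, so the plugin
// never depends on its callers having cloned it.
for _, target := range profileResult.TargetPods {
	newPodInfo := target.GetPod().Clone()
	newPodInfo.Port = h.targetPort
	// ...build a new PodMetrics and append to newResult, as above
}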

kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer
- type: decode-filter
Collaborator

This is a bit confusing because DP is currently used without PD. I understand why it's used, but maybe there is a less confusing alternative.

Collaborator Author

That filter is in all of the configuration files here. I actually don't know why....

ahg-g commented Oct 28, 2025

For proxies that support multi-port, will we be able to use the canonical approach of connecting to the individual ports directly?

shmuelk (Collaborator Author) commented Oct 28, 2025

@ahg-g wrote:

For proxies that support multi-port, will we be able to use the canonical approach of connecting to the individual ports directly?

Yes. The code in this PR is basically a workaround for gateways, like Istio today, that don't support routing to multiple ports on the same vLLM pod. If your gateway supports multi-port, as kgateway does, simply don't use the new data-parallel-profile-handler plugin.

There will be a follow-up PR that adds PD support for Data Parallel in the sidecar and in the profile handler. I will make sure that any changes there are optional.
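
To make the workaround concrete, here is a minimal, self-contained sketch of the sidecar's header-based routing. The header name x-target-port and the port numbers are hypothetical; the real names and defaults are defined by the plugin and sidecar in this PR:

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// The sidecar listens on the one port the gateway can route to.
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// The (hypothetical) header carries the port of the DP rank
		// that the scheduler actually picked.
		port := r.Header.Get("x-target-port")
		if port == "" {
			port = "8000" // no header: fall through to the "base" vLLM server
		}
		target, err := url.Parse("http://localhost:" + port)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", handler))
}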

nirrozenbaum (Collaborator)

/lgtm
/approve

@github-actions github-actions bot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 3, 2025
@github-actions github-actions bot merged commit 90dcc59 into llm-d:main Nov 3, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in llm-d-inference-scheduler Nov 3, 2025
@shmuelk shmuelk deleted the data-parallel branch November 3, 2025 10:18