🚀 Feature Description and Motivation
In LLM inference scenarios, model parallelism and distributed inference place extremely high demands on network throughput for data exchange, making the network a critical bottleneck. Modern datacenters have diverse network architectures (e.g., IB, RoCE, NVSwitch) with multiple levels of switches that differ in throughput and latency characteristics.
Network Topology Aware Scheduling allows workloads to be scheduled to the best performance domain with the highest throughput and lowest latency, accelerating data exchange for inference operations.
Therefore, I'm proposing adding Network Topology Aware Scheduling support to AIBrix by integrating with Volcano's network topology aware scheduling features: https://volcano.sh/en/docs/network_topology_aware_scheduling/. This will enable LLM inference workloads to be scheduled within optimal network performance domains, significantly improving inference performance.
Use Case
Consider a datacenter with 8 GPU nodes under a 3-tier switch hierarchy:

S0 to S6 can be considered switches. If the scheduler is unaware of the underlying network topology, then in a 1P1D (one Prefill, one Decode) scenario the worst case is that one instance lands on node0 and the other on node7. Traffic between them must traverse multiple layers of switches, resulting in long data-exchange paths and degraded inference performance. The best case is that both 1P1D instances are scheduled under the same leaf switch, e.g., S0.
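Assuming the 8 nodes hang off leaf switches S0 to S3, which aggregate under S4/S5 and then S6, this hierarchy could be described to Volcano with its HyperNode resources. The following is a sketch based on Volcano's network topology documentation; the names `s0`, `s4`, `node0`, `node1` are placeholders for this example:

```yaml
# Tier-1 (leaf) HyperNode: switch S0 connects node0 and node1.
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: s0
spec:
  tier: 1
  members:
  - type: Node
    selector:
      exactMatch:
        name: node0
  - type: Node
    selector:
      exactMatch:
        name: node1
---
# Tier-2 HyperNode: switch S4 aggregates the leaf switches S0 and S1.
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: s4
spec:
  tier: 2
  members:
  - type: HyperNode
    selector:
      exactMatch:
        name: s0
  - type: HyperNode
    selector:
      exactMatch:
        name: s1
```

With such a hierarchy in place, the scheduler can prefer placements that stay within the lowest possible tier.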
Proposed Solution
Volcano already supports network topology-aware scheduling, and the Kubeflow community also supports configuring training jobs with network topology-aware scheduling.
- https://www.kubeflow.org/docs/components/trainer/operator-guides/job-scheduling/volcano/
- https://volcano.sh/en/docs/network_topology_aware_scheduling/
Therefore, I think StormService could also support network topology configuration, either through a dedicated field or through annotations on the StormService. The StormService Controller could then automatically create a Volcano PodGroup for the StormService and propagate the StormService's network topology constraints to that PodGroup, letting the Volcano scheduler place the PodGroup accordingly. This would enable Prefill and Decode pods to be scheduled into the most efficient network performance domains.
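To make this concrete, a minimal sketch of what the controller could generate is shown below. The `networkTopology` field on the PodGroup is Volcano's documented API; the annotation keys on the StormService are hypothetical and only illustrate where the constraints might come from, since no such AIBrix API exists yet:

```yaml
# Hypothetical StormService annotations (illustrative keys, not an existing
# AIBrix API) that the controller would translate into PodGroup constraints:
#   metadata:
#     annotations:
#       aibrix.ai/network-topology-mode: "hard"
#       aibrix.ai/network-topology-highest-tier: "1"
#
# PodGroup the StormService Controller would create, carrying the
# constraints via Volcano's networkTopology spec:
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: my-stormservice-pg   # placeholder name
spec:
  minMember: 2               # e.g., 1 Prefill + 1 Decode pod
  queue: default
  networkTopology:
    mode: hard               # pods must stay within one performance domain
    highestTierAllowed: 1    # restrict placement to a single leaf switch
```

With `mode: hard` and `highestTierAllowed: 1`, both pods would only be scheduled if they fit under one leaf switch; `mode: soft` would instead express a preference that the scheduler may relax.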