🚀 Feature Description and Motivation
In LLM inference scenarios, model parallelism and distributed inference place extremely high demands on network throughput for data exchange, making the network a critical bottleneck. Modern datacenters have diverse network architectures (e.g., IB, RoCE, NVSwitch) with multiple levels of switches that differ in throughput and latency characteristics.
Network Topology Aware Scheduling allows workloads to be scheduled to the best performance domain with the highest throughput and lowest latency, accelerating data exchange for inference operations.
Therefore, I'm proposing adding Network Topology Aware Scheduling support to AIBrix by integrating with Volcano's network topology aware scheduling features: https://volcano.sh/en/docs/network_topology_aware_scheduling/. This will enable LLM inference workloads to be scheduled within optimal network performance domains, significantly improving inference performance.
Use Case
Consider a datacenter with 8 GPU nodes under a 3-tier switch hierarchy:

S0 to S6 can be considered switches. If the scheduler is unaware of the underlying network topology, then in a 1P1D (one Prefill, one Decode) scenario the worst case is that one instance lands on node0 and the other on node7. Traffic between them must traverse multiple layers of switches, resulting in long data-exchange paths and degraded inference performance. The best case is that both 1P1D instances are scheduled under the same leaf switch, e.g., S0.
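Assuming the 8 nodes hang off leaf switches S0 to S3, which aggregate under S4/S5 and then S6, this hierarchy could be described to Volcano with its HyperNode resources. The following is a sketch based on Volcano's network topology documentation; the names `s0`, `s4`, `node0`, `node1` are placeholders for this example:

```yaml
# Tier-1 (leaf) HyperNode: switch S0 connects node0 and node1.
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: s0
spec:
  tier: 1
  members:
  - type: Node
    selector:
      exactMatch:
        name: node0
  - type: Node
    selector:
      exactMatch:
        name: node1
---
# Tier-2 HyperNode: switch S4 aggregates the leaf switches S0 and S1.
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: s4
spec:
  tier: 2
  members:
  - type: HyperNode
    selector:
      exactMatch:
        name: s0
  - type: HyperNode
    selector:
      exactMatch:
        name: s1
```

With such a hierarchy in place, the scheduler can prefer placements that stay within the lowest possible tier.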
Proposed Solution
Volcano already supports network topology-aware scheduling, and the Kubeflow community also supports configuring training jobs with network topology-aware scheduling.
- https://www.kubeflow.org/docs/components/trainer/operator-guides/job-scheduling/volcano/
- https://volcano.sh/en/docs/network_topology_aware_scheduling/
Therefore, I think StormService could also support network topology configuration, either through a dedicated field or through annotations on the StormService. The StormService Controller could then automatically create a Volcano PodGroup for the StormService and propagate the StormService's network topology constraints to that PodGroup, letting the Volcano scheduler place the PodGroup accordingly. This would enable Prefill and Decode pods to be scheduled into the most efficient network performance domains.
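To make this concrete, a minimal sketch of what the controller could generate is shown below. The `networkTopology` field on the PodGroup is Volcano's documented API; the annotation keys on the StormService are hypothetical and only illustrate where the constraints might come from, since no such AIBrix API exists yet:

```yaml
# Hypothetical StormService annotations (illustrative keys, not an existing
# AIBrix API) that the controller would translate into PodGroup constraints:
#   metadata:
#     annotations:
#       aibrix.ai/network-topology-mode: "hard"
#       aibrix.ai/network-topology-highest-tier: "1"
#
# PodGroup the StormService Controller would create, carrying the
# constraints via Volcano's networkTopology spec:
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: my-stormservice-pg   # placeholder name
spec:
  minMember: 2               # e.g., 1 Prefill + 1 Decode pod
  queue: default
  networkTopology:
    mode: hard               # pods must stay within one performance domain
    highestTierAllowed: 1    # restrict placement to a single leaf switch
```

With `mode: hard` and `highestTierAllowed: 1`, both pods would only be scheduled if they fit under one leaf switch; `mode: soft` would instead express a preference that the scheduler may relax.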