Scheduling
Topology Aware Scheduling (TAS)
Kaiwo supports Kueue's Topology Aware Scheduling to co-locate workload pods within the same topology domain (e.g., same rack or network block). This can significantly improve performance for distributed training by reducing inter-node communication latency.
TAS is opt-in at the workload level. If you do not set either field, your workload is scheduled normally without topology constraints.
Fields:
- `preferredTopologyLabel`: Kueue will try to place all pods within the topology domain identified by this label. If the pods cannot fit at that level, Kueue moves up to the next topology level; if they cannot fit even at the highest level, pods are distributed across multiple domains. This is a best-effort preference.
- `requiredTopologyLabel`: Kueue must place all pods within a single topology domain at the specified level. If this is not possible, the workload will not be admitted.
Topology label values correspond to the levels defined in the cluster's Topology resource. Common levels from most specific to least specific:
| Label | Meaning |
|---|---|
| `kubernetes.io/hostname` | Same physical node |
| `kaiwo/topology-rack` | Same network rack |
| `kaiwo/topology-block` | Same network block |
Example:

```yaml
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoJob
metadata:
  name: distributed-training
  namespace: ai-research
spec:
  gpus: 16
  gpuVendor: amd
  ray: true
  preferredTopologyLabel: kaiwo/topology-rack # Try to place all workers in the same rack
  image: my-training-image:latest
```
To require placement within a single rack (workload will wait if not possible):
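The following is the same spec as the example above; only the topology field differs:

```yaml
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoJob
metadata:
  name: distributed-training
  namespace: ai-research
spec:
  gpus: 16
  gpuVendor: amd
  ray: true
  requiredTopologyLabel: kaiwo/topology-rack # All workers must land in one rack, or the job waits
  image: my-training-image:latest
```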
Prerequisites
TAS requires that the cluster administrator has configured a Topology and that the ResourceFlavor used by your workload has topologyName set. If the flavor does not reference a topology, setting these fields will have no effect on scheduling. See the admin configuration guide for setup details.
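For orientation, here is a sketch of what the admin-side objects might look like, assuming Kueue's alpha TAS API; the topology and flavor names are illustrative:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: default # illustrative name
spec:
  levels: # ordered from broadest to most specific
  - nodeLabel: kaiwo/topology-block
  - nodeLabel: kaiwo/topology-rack
  - nodeLabel: kubernetes.io/hostname
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: amd-gpu # illustrative name
spec:
  topologyName: default # without this reference, the TAS fields have no effect
```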
Resource Allocation
`replicas`, `gpus`, `gpusPerReplica`, and `gpuVendor`
These fields collectively control the number of workload instances and how GPUs are allocated across them. Their interaction depends on the workload type (Job/Service) and whether Ray is used (ray: true).
Purpose:
- `replicas`: Sets the desired number of instances (pods). Default: 1. Ignored for non-Ray Jobs.
- `gpus`: Specifies the total number of GPUs requested across all replicas. Default: 0.
- `gpusPerReplica`: Specifies the number of GPUs requested per replica. Default: 0.
- `gpuVendor`: Either `amd` (default) or `nvidia`. Determines the GPU resource key (e.g., `amd.com/gpu`, `nvidia.com/gpu`).
Behavior:

- Non-Ray Workloads (`ray: false`):
  - KaiwoJob: Only one pod is created; `replicas` is ignored. `gpus` or `gpusPerReplica` (if set > 0) determines the GPU request for the single pod's container. If both are set, `gpusPerReplica` takes precedence when > 0; otherwise `gpus` is used.
  - KaiwoService (Deployment): `replicas` directly sets `deployment.spec.replicas`. `gpus` or `gpusPerReplica` (if set > 0) determines the GPU request for each replica's container. If both are set, `gpusPerReplica` takes precedence when > 0; otherwise `gpus` is used (implying `gpusPerReplica = gpus / replicas`, though this division isn't explicitly performed; the per-pod request is set from the determined `gpusPerReplica` value).
- Ray Workloads (`ray: true`):
  - The controller performs a calculation (`CalculateNumberOfReplicas`) that considers cluster node capacity, specifically the minimum GPU capacity available on nodes matching the `gpuVendor`, referred to as `minGpusPerNode`.
  - User precedence: If the user explicitly sets both `replicas` (> 0) and `gpusPerReplica` (> 0), these values are used directly, provided the total requested GPUs (`replicas * gpusPerReplica`) does not exceed the total available GPUs of the specified `gpuVendor` in the cluster. The `gpus` field is ignored in this case.
  - Calculation fallback: If the user does not explicitly set both `replicas` and `gpusPerReplica`, or if the requested total exceeds cluster capacity, the controller derives `replicas` and `gpusPerReplica` from the `gpus` field and the cluster's `minGpusPerNode` (see the sketch after this list):
    - `totalUserRequestedGpus` is taken from the `gpus` field, capped at total cluster capacity.
    - The final `replicas` is calculated as `ceil(totalUserRequestedGpus / minGpusPerNode)`.
    - The final `gpusPerReplica` is calculated as `totalUserRequestedGpus / replicas`.
  - The calculated or user-provided `replicas` value sets the Ray worker group replica count (`minReplicas`, `maxReplicas`, and `replicas`), because Kueue does not support Ray's autoscaling.
  - The calculated or user-provided `gpusPerReplica` value sets the GPU resource request/limit for each Ray worker pod's container.
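To make the fallback arithmetic concrete, here is a minimal Go sketch of the calculation described above. The function name and signature are hypothetical; the actual controller logic in `CalculateNumberOfReplicas` also derives cluster capacity and `minGpusPerNode` from live node data.

```go
package main

import "fmt"

// fallbackReplicas sketches the documented fallback: cap the request at
// cluster capacity, then spread it over the fewest replicas that fit on
// the smallest matching GPU node.
func fallbackReplicas(requestedGpus, clusterCapacity, minGpusPerNode int) (replicas, gpusPerReplica int) {
	if requestedGpus <= 0 || minGpusPerNode <= 0 {
		return 1, 0 // no GPUs requested: one replica, zero GPUs
	}
	// Cap the request at what the cluster can actually provide.
	if requestedGpus > clusterCapacity {
		requestedGpus = clusterCapacity
	}
	// replicas = ceil(requestedGpus / minGpusPerNode), in integer arithmetic.
	replicas = (requestedGpus + minGpusPerNode - 1) / minGpusPerNode
	gpusPerReplica = requestedGpus / replicas
	return replicas, gpusPerReplica
}

func main() {
	// gpus: 16 on a cluster whose smallest matching GPU node has 8 GPUs
	// -> 2 replicas with 8 GPUs each.
	fmt.Println(fallbackReplicas(16, 32, 8)) // 2 8
}
```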
Summary Table (Ray Workloads):

| User Input (`spec.*`) | Calculation Performed? | Outcome (`replicas`, `gpusPerReplica`) | Notes |
|---|---|---|---|
| `replicas > 0`, `gpusPerReplica > 0` | No* | Uses user's `replicas`, user's `gpusPerReplica` | *If total fits cluster. `gpus` ignored. Highest precedence. |
| `gpus > 0` (only) | Yes | Calculated from `gpus` and `minGpusPerNode` | Aims to maximize GPUs per node up to `minGpusPerNode`. |
| `replicas > 0`, `gpus > 0` | Yes | Calculated from `gpus` and `minGpusPerNode` (user's `replicas` ignored) | Falls back to calculation based on total `gpus`. |
| `gpusPerReplica > 0`, `gpus > 0` | Yes | Calculated from `gpus` and `minGpusPerNode` (user's `gpusPerReplica` ignored) | Falls back to calculation based on total `gpus`. |
| All three set | No* | Uses user's `replicas`, user's `gpusPerReplica` | *If total fits cluster (as in row 1); otherwise calculated from `gpus`. |
| None set (or only `gpuVendor`) | No | `replicas = 1`, `gpusPerReplica = 0` | No GPUs requested. |
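For example, a job that takes the highest-precedence path in the table above by setting both `replicas` and `gpusPerReplica` explicitly (a sketch; the job name and image are illustrative):

```yaml
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoJob
metadata:
  name: ray-training
  namespace: ai-research
spec:
  ray: true
  replicas: 2        # both fields set and > 0, so they are used directly...
  gpusPerReplica: 8  # ...as long as 2 * 8 GPUs fit the cluster's amd capacity
  gpuVendor: amd
  image: my-training-image:latest
```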