Configuration Guide
Administrators configure Kaiwo primarily through
- The cluster-scoped
KaiwoQueueConfigCustom Resource Definition (CRD) - The cluster-scoped
KaiwoConfigCRD - Environment variables or flags passed to the Kaiwo operator
Users configure the CLI tool separately.
KaiwoQueueConfig CRD
This is the central point for managing how Kaiwo interacts with Kueue. There can be only one KaiwoQueueConfig resource in the cluster, and its metadata.name must be kaiwo (or the value specified in the KaiwoConfig Custom Resource field spec.defaultKaiwoQueueConfigName, which defaults to kaiwo, see more below).
Default Configuration on Startup:
The Kaiwo operator includes a startup routine that checks if a KaiwoQueueConfig named kaiwo exists. If it does not, the operator automatically creates a default one. This default configuration aims to provide a functional baseline:
- It attempts to auto-discover node pools based on common GPU labels (e.g.,
amd.com/gpu.product-name,nvidia.com/gpu.product,nvidia.com/gpu.count) and CPU/Memory capacity. - It creates corresponding Kueue
ResourceFlavorresources based on this discovery, labeling the nodes withkaiwo/nodepool=<generated-flavor-name>. - It defines a single Kueue
ClusterQueuenamedkaiwo(or the value ofDEFAULT_CLUSTER_QUEUE_NAME), configured to use all discoveredResourceFlavorsand their estimated capacities asnominalQuota. - It specifies that this default
ClusterQueueshould have a correspondingLocalQueueautomatically created in thekaiwonamespace. - It does not define any
WorkloadPriorityClassresources by default.
You can modify this automatically created configuration or create your own kaiwo resource manually using: kubectl edit kaiwoqueueconfig kaiwo or by applying a YAML manifest.
Key Fields (spec):
-
resourceFlavors: Defines the types of hardware resources available in the cluster, corresponding to KueueResourceFlavorresources.name: A unique name for the flavor (e.g.,amd-mi300-8gpu,nvidia-a100-40gb,cpu-standard).nodeLabels: A map of labels that nodes must possess to be considered part of this flavor. This is crucial for scheduling pods onto the correct hardware. Example:{"kaiwo/nodepool": "amd-mi300-nodes"}.taints: (Optional) A list of Kubernetes taints associated with this flavor. Pods scheduled to this flavor will need corresponding tolerations. Kaiwo automatically adds tolerations for GPU taints ifADD_TAINTS_TO_GPU_NODESis enabled.
Auto-Discovery vs. Explicit Definition
If
spec.resourceFlavorsis empty or omitted in thekaiwoKaiwoQueueConfig, the operator's startup logic attempts to auto-discover node pools and create corresponding flavors as described above. While convenient for initial setup, explicitly definingresourceFlavorsin theKaiwoQueueConfigprovides more precise control and is generally recommended for production environments. Explicitly defined flavors will override any auto-discovered ones during reconciliation. -
clusterQueues: Defines the KueueClusterQueueresources managed by Kaiwo.name: The name of theClusterQueue(e.g.,team-a-queue,default-gpu-queue).spec: The full KueueClusterQueueSpec. This is where you define resource quotas, cohorts, preemption policies, etc. See Kueue ClusterQueue Documentation.resourceGroups: Define sets of flavors and their associated quotas (nominalQuota). This links the queue to the available hardware defined inresourceFlavors.namespaceSelector: Controls which namespaces can use this queue viaLocalQueueresources if those LocalQueues exist. Note that Kaiwo's automaticLocalQueuecreation relies on thenamespacesfield below, not this selector.
namespaces: A list of namespace names where Kaiwo should automatically create and manage a KueueLocalQueuepointing to thisClusterQueue. TheLocalQueuecreated will have the same name as theClusterQueue.
-
workloadPriorityClasses: Defines KueueWorkloadPriorityClassresources.- Follows the standard Kueue
WorkloadPriorityClassstructure (name,value,description). Kaiwo ensures these exist as defined. See Kueue Priority Documentation.
- Follows the standard Kueue
Example KaiwoQueueConfig:
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoQueueConfig
metadata:
name: kaiwo # Must be named 'kaiwo' (or DEFAULT_KAIWO_QUEUE_CONFIG_NAME)
spec:
resourceFlavors:
- name: amd-mi300-8gpu
nodeLabels:
kaiwo/nodepool: amd-mi300-nodes # Nodes with this label belong to this flavor
# Add other identifying labels if needed, e.g., topology.amd.com/gpu-count: '8'
# taints: # Optional, define if specific taints apply ONLY to these nodes
# - key: "amd.com/gpu"
# operator: "Exists"
# effect: "NoSchedule"
- name: cpu-high-mem
nodeLabels:
kaiwo/nodepool: cpu-high-mem-nodes
clusterQueues:
- name: ai-research-queue # Name of the ClusterQueue
namespaces: # Auto-create/manage LocalQueues in these namespaces
- ai-research-ns-1
- ai-research-ns-2
spec: # Standard Kueue ClusterQueueSpec
queueingStrategy: BestEffortFIFO
resourceGroups:
- coveredResources: ["cpu", "memory", "amd.com/gpu"] # Resources managed by this group
flavors:
- name: amd-mi300-8gpu # Reference to a defined resourceFlavor
resources:
- name: "cpu"
nominalQuota: "192" # Total CPU quota for this flavor in this queue
- name: "memory"
nominalQuota: "1024Gi" # Total Memory quota
- name: "amd.com/gpu"
nominalQuota: "8" # Total GPU quota
- coveredResources: ["cpu", "memory"]
flavors:
- name: cpu-high-mem
resources:
- name: "cpu"
nominalQuota: "256"
- name: "memory"
nominalQuota: "2048Gi"
# cohort: "gpu-cohort" # Optional: Group queues for borrowing/preemption
# preemption: ...
workloadPriorityClasses:
- name: high-priority
value: 1000
- name: low-priority
value: 100
Controller Operation and Kueue Resource Synchronization
The KaiwoQueueConfigController acts as a translator, continuously ensuring that the Kueue resources in your cluster accurately reflect the configuration defined in the single kaiwo KaiwoQueueConfig resource. It monitors this resource and automatically manages the lifecycle of the associated Kueue objects:
-
spec.resourceFlavors-> KueueResourceFlavor:- Each entry in this list directly defines a Kueue
ResourceFlavor. - The controller ensures a corresponding
ResourceFlavorexists for each entry, creating or updating it as necessary based on the specifiedname,nodeLabels, andtaints. - If an entry is removed from this list, the controller deletes the corresponding
ResourceFlavor.
- Each entry in this list directly defines a Kueue
-
spec.clusterQueues-> KueueClusterQueueandLocalQueue:- Each entry in this list defines a Kueue
ClusterQueue. The controller translates the structure into a standardClusterQueueSpecand ensures the resource exists and matches the definition. Removing an entry deletes the correspondingClusterQueue. - The
namespacesfield within eachclusterQueuesentry dictates where KueueLocalQueues should exist. The controller automatically creates aLocalQueue(named after theClusterQueue) in each listed namespace, pointing to the correspondingClusterQueue. If a namespace is removed from the list, or the parentClusterQueueentry is removed, the controller deletes the associatedLocalQueuein that namespace.
- Each entry in this list defines a Kueue
-
spec.workloadPriorityClasses-> KueueWorkloadPriorityClass:- Each entry defines a Kueue
WorkloadPriorityClass. - The controller manages these resources, ensuring they exist with the specified
name,value, anddescription. - Removing an entry results in the deletion of the corresponding
WorkloadPriorityClass.
- Each entry defines a Kueue
Owner References
The controller establishes the kaiwo KaiwoQueueConfig as the owner of all the Kueue resources it creates. This linkage ensures that if the KaiwoQueueConfig is deleted, Kubernetes automatically cleans up all the managed Kueue resources (ResourceFlavor, ClusterQueue, LocalQueue, WorkloadPriorityClass).
Kueue resource management
Kaiwo takes ownership of Kueue ResourceFlavor, ClusterQueue, LocalQueue and WorkloadPriorityClass resources. This means that resources of these types that are created manually, i.e. not via the KaiwoQueueConfig, may be deleted by the Kaiwo Controller
The controller updates the status.status field of the KaiwoQueueConfig resource (Pending, Ready, or Failed) to indicate the current state of synchronization between the desired configuration and the actual Kueue resources in the cluster. This continuous reconciliation keeps the Kueue setup aligned with the central KaiwoQueueConfig.
KaiwoConfig CRD
The Kaiwo Operator's runtime configuration is managed through the KaiwoConfig Custom Resource Definition (CRD). This approach allows Kubernetes administrators to dynamically adjust operator behavior without requiring a restart. The operator always retrieves the most recent configuration values during each reconcile loop.
Configuration Structure
The primary configuration resource is the KaiwoConfig CRD, typically maintained as a singleton within the Kubernetes cluster. Its key components are encapsulated in the KaiwoConfigSpec, which briefly includes:
kueue: Configures default integration settings with Kueue, including the default cluster queue name.ray: Specifies Ray-specific parameters, including default container images and memory allocations.storage: Manages default filesystem paths for mounting data storage and HuggingFace caches.nodes: Defines node-specific settings such as GPU resource keys, GPU node taints, and node pool exclusions.scheduling: Sets scheduling-related configurations, like the Kubernetes scheduler name.resourceMonitoring: Configures resource monitoring, including averaging intervals, utilization thresholds, and targeted namespaces.defaultKaiwoQueueConfigName: Specifies the default name for the Kaiwo queue configuration object.
Specifying the Configuration CR
The Kaiwo Operator identifies its configuration resource via the environment variable CONFIG_NAME. By default, this is set to kaiwo. Ensure that a KaiwoConfig resource with this exact name exists in your cluster. The operator automatically creates a default configuration at startup if none exists.
Note
The operator waits up to 30 seconds for the specified configuration resource to be found. If no resource is detected within this period, the operator pod will fail with an error.
Example KaiwoConfig CR
Here's a minimal example of a valid KaiwoConfig definition:
apiVersion: config.kaiwo.silogen.ai/v1alpha1
kind: KaiwoConfig
metadata:
name: kaiwo
spec:
scheduling:
kubeSchedulerName: "kaiwo-scheduler"
resourceMonitoring:
averagingTime: "20m"
lowUtilizationThreshold: 20
profile: "gpu"
For detailed descriptions of individual configuration fields, please see the full API reference.
Operator Environmental Variables
Some configuration is not changeable during runtime, or they may be often referenced from other config maps or secrets, and thus they are stored as environmental variables. These settings are not dynamic, and the operator must be restarted in order for changes to these values to take effect.
Kueue
DEFAULT_KAIWO_QUEUE_CONFIG_NAME: The name of the singletonKaiwoQueueConfigcustom resource to be used (defaults tokaiwo)DEFAULT_CLUSTER_QUEUE_NAME: The name of the default Kueue cluster queue (defaults tokaiwo)
Resource Monitoring
To enable and configure resource monitoring within the Kaiwo Operator, the following environment variables must be set on the operator deployment:
RESOURCE_MONITORING_ENABLED=true– Enables the resource monitoring component.RESOURCE_MONITORING_PROMETHEUS_ENDPOINT=<prometheus-endpoint>– Specifies the Prometheus endpoint to query metrics from.RESOURCE_MONITORING_POLLING_INTERVAL=10m– Sets the interval between metric polling queries.
Other configuration options
WEBHOOK_CERT_DIRECTORY: Path to manually provided webhook certificates (overrides automatic management if set). See Installation.
Forthcoming feature
ENFORCE_KAIWO_ON_GPU_WORKLOADS (Default: false): If true, the mutating admission webhook for batchv1.Job will automatically add the kaiwo.silogen.ai/managed: "true" label to any job requesting GPU resources, forcing it to be managed by Kaiwo/Kueue.
All environmental variables are typically set in the operator's Deployment manifest.
Command-Line Flags:
Refer to the output of kaiwo-operator --help (or check cmd/operator/main.go) for flags controlling metrics, health probes, leader election, and certificate paths.