API Reference

Packages

kaiwo.silogen.ai/v1alpha1

Package v1alpha1 contains API Schema definitions for the kaiwo v1alpha1 API group.

Resource Types

AzureBlobStorageDownloadItem

AzureBlobStorageDownloadItem defines parameters for downloading data from Azure Blob Storage.

Appears in: - DownloadTaskConfig - ObjectStorageDownloadSpec

Field Description Default Validation
connectionString ValueReference ConnectionString references a Kubernetes Secret containing the Azure Storage connection string. See ValueReference.
containers CloudDownloadBucket array Containers lists the Azure Blob Storage containers and the specific files/folders to download from them. See CloudDownloadBucket.

CloudDownloadBucket

CloudDownloadBucket represents a specific bucket (S3, GCS) or container (Azure) to download from.

Appears in: - AzureBlobStorageDownloadItem - GCSDownloadItem - S3DownloadItem

Field Description Default Validation
name string Name is the name of the bucket or container.
files CloudDownloadFile array Files lists specific files to download from this bucket/container.
folders CloudDownloadFolder array Folders lists specific folders (prefixes) to download from this bucket/container.

ClusterQueue

ClusterQueue defines the configuration for a Kueue ClusterQueue managed by Kaiwo.

Appears in: - KaiwoQueueConfigSpec

Field Description Default Validation
name string Name specifies the name of the Kueue ClusterQueue resource.
spec ClusterQueueSpec Spec contains the desired Kueue ClusterQueueSpec. Kaiwo ensures the corresponding ClusterQueue resource matches this spec. See Kueue documentation for ClusterQueueSpec fields like resourceGroups, cohort, preemption, etc.
namespaces string array Namespaces optionally lists Kubernetes namespaces where Kaiwo should automatically create a Kueue LocalQueue resource pointing to this ClusterQueue.
If one or more namespaces are provided, the KaiwoQueueConfig controller takes over managing the LocalQueues for this ClusterQueue.
Leave this empty if you want to be able to create your own LocalQueues for this ClusterQueue.
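
The fields above can be combined into a KaiwoQueueConfig manifest. The following is a minimal sketch, not a reference configuration: the queue name `team-a-queue`, the namespace `team-a`, the flavor name `default-flavor`, and the quota values are all hypothetical, and the `resourceGroups` layout follows the upstream Kueue ClusterQueueSpec rather than anything specific to Kaiwo.

```yaml
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoQueueConfig
metadata:
  name: kaiwo                      # a single resource named 'kaiwo' is the typical setup
spec:
  clusterQueues:
    - name: team-a-queue           # hypothetical ClusterQueue name
      namespaces:                  # Kaiwo creates and manages a LocalQueue in each of these
        - team-a                   # hypothetical namespace
      spec:
        namespaceSelector: {}      # empty selector: all namespaces are eligible
        resourceGroups:
          - coveredResources: ["cpu", "memory"]
            flavors:
              - name: default-flavor   # hypothetical; must match an existing ResourceFlavor
                resources:
                  - name: cpu
                    nominalQuota: "64"
                  - name: memory
                    nominalQuota: 256Gi
```

Omitting `namespaces` leaves LocalQueue creation to you, as described above.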

ClusterQueueSpec

Appears in: - ClusterQueue

Field Description Default Validation
resourceGroups ResourceGroup array resourceGroups describes groups of resources.
Each resource group defines the list of resources and a list of flavors
that provide quotas for these resources.
Each resource and each flavor can only form part of one resource group.
At most 16 resourceGroups can be defined.
MaxItems: 16
cohort CohortReference cohort that this ClusterQueue belongs to. CQs that belong to the
same cohort can borrow unused resources from each other.

A CQ can be a member of a single borrowing cohort. A workload submitted
to a queue referencing this CQ can borrow quota from any CQ in the cohort.
Only quota for the [resource, flavor] pairs listed in the CQ can be
borrowed.
If empty, this ClusterQueue cannot borrow from any other ClusterQueue and
vice versa.

A cohort is a name that links CQs together, but it doesn't reference any
object.
queueingStrategy QueueingStrategy QueueingStrategy indicates the queueing strategy of the workloads
across the queues in this ClusterQueue.
Currently supported strategies:

- StrictFIFO: workloads are ordered strictly by creation time.
Older workloads that can't be admitted will block admitting newer
workloads even if they fit available quota.
- BestEffortFIFO: workloads are ordered by creation time,
however older workloads that can't be admitted will not block
admitting newer workloads that fit existing quota.
BestEffortFIFO Enum: [StrictFIFO BestEffortFIFO]
namespaceSelector LabelSelector namespaceSelector defines which namespaces are allowed to submit workloads to
this clusterQueue. Beyond this basic support for policy, a policy agent like
Gatekeeper should be used to enforce more advanced policies.
Defaults to null which is a nothing selector (no namespaces eligible).
If set to an empty selector {}, then all namespaces are eligible.
flavorFungibility FlavorFungibility flavorFungibility defines whether a workload should try the next flavor
before borrowing or preempting in the flavor being evaluated.
{ }
preemption ClusterQueuePreemption { }
admissionChecks AdmissionCheckReference array admissionChecks lists the AdmissionChecks required by this ClusterQueue.
Cannot be used along with admissionChecksStrategy.
admissionChecksStrategy AdmissionChecksStrategy admissionChecksStrategy defines a list of strategies to determine which ResourceFlavors require AdmissionChecks.
This property cannot be used in conjunction with the 'admissionChecks' property.
stopPolicy StopPolicy stopPolicy, if set to a value other than None, marks the ClusterQueue as Inactive, and no new reservations are made.

Depending on its value, its associated workloads will:

- None - Workloads are admitted
- HoldAndDrain - Admitted workloads are evicted and Reserving workloads will cancel the reservation.
- Hold - Admitted workloads will run to completion and Reserving workloads will cancel the reservation.
None Enum: [None Hold HoldAndDrain]
fairSharing FairSharing fairSharing defines the properties of the ClusterQueue when
participating in FairSharing. The values are only relevant
if FairSharing is enabled in the Kueue configuration.
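
As an illustration of how the selector, queueing, and stop-policy fields fit together, here is a hedged sketch of a `spec` fragment for one ClusterQueue. The cohort name `research` and the namespace `team-a` are hypothetical; `kubernetes.io/metadata.name` is the standard label that Kubernetes sets on every namespace.

```yaml
spec:
  cohort: research                   # hypothetical cohort name; links CQs that may borrow from each other
  queueingStrategy: BestEffortFIFO   # the default; older blocked workloads do not block newer ones
  namespaceSelector:                 # omit entirely and no namespace is eligible
    matchLabels:
      kubernetes.io/metadata.name: team-a   # hypothetical namespace
  stopPolicy: None                   # Hold or HoldAndDrain would make the queue Inactive
```

Replacing the selector with `namespaceSelector: {}` would admit workloads from every namespace.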

CommonMetaSpec

CommonMetaSpec defines reusable metadata fields for workloads.

Appears in: - KaiwoJobSpec - KaiwoServiceSpec

Field Description Default Validation
user string User specifies the owner or creator of the workload. It should typically be the user's email address. This value is primarily used for labeling (kaiwo.silogen.ai/user) the generated resources (like Pods, Jobs, Deployments) for identification and filtering (e.g., with kaiwo list --user <email>).

In the future, if authentication is enabled, this must be the user's email address, which will be checked against the authenticated user.
podTemplateSpecLabels object (keys:string, values:string) PodTemplateSpecLabels allows you to specify custom labels that will be added to the template.metadata.labels section of the generated Pods (within Jobs, Deployments, or RayCluster specs). Standard Kaiwo system labels (like kaiwo.silogen.ai/user, kaiwo.silogen.ai/name, etc.) are added automatically and take precedence if there are conflicts.
gpus integer Gpus specifies the total number of GPUs allocated to the workload. See here for more details on how this field impacts scheduling. 0
gpuVendor string GpuVendor specifies the GPU vendor (e.g., amd, nvidia, etc.). See here for more details on how this field impacts scheduling. amd
gpuModels string array GpuModels optionally specifies the GPU models that your workload may run on. You can list the available models either by running kaiwo status amd/nvidia with the CLI or by running the kubectl command kubectl get nodes -o custom-columns=NAME:.metadata.name,MODEL:.metadata.labels.kaiwo\/gpu-model
This field is used to filter the available nodes for scheduling. You can specify multiple models, and Kaiwo will select the best available node that matches one of the specified models.
version string Version allows you to specify an optional version string for the workload. This can be useful for tracking different iterations or configurations of the same logical workload. It does not directly affect resource creation but serves as metadata.
replicas integer Replicas specifies the number of replicas for the workload. See here for more details on how this field impacts scheduling. 1
gpusPerReplica integer GpusPerReplica specifies the number of GPUs allocated per replica. See here for more details on how this field impacts scheduling.

If you specify gpusPerReplica, you must also specify replicas.
duration Duration Duration specifies the maximum duration over which the workload can run. This is useful for avoiding workloads running indefinitely.
preferredTopologyLabel string PreferredTopologyLabel specifies the preferred topology label for scheduling the workload. This is used to influence how the workload is distributed across nodes in the cluster.
If not specified, Kaiwo will use the default topology labels defined in the default topology of KaiwoQueueConfig starting at the host level.
The levels are evaluated one-by-one going up from the level indicated by the label. If the PodSet cannot fit within a given topology label then the next topology level up is considered.
If the PodSet cannot fit at the highest topology level, then it is distributed among multiple topology domains.
requiredTopologyLabel string RequiredTopologyLabel specifies the required topology label for scheduling the workload. This is used to ensure that the workload is scheduled on nodes that match the specified topology label.
resources ResourceRequirements Resources specifies the default resource requirements applied to all pods in the workload.

This field defines default Kubernetes ResourceRequirements (requests and limits for CPU,
memory, ephemeral-storage) applied to all containers (including init containers) within
the workload's pods.

Behavior:

These values act as defaults. If a container within the underlying Job, Deployment,
or Ray spec (if provided by the user) already defines a specific request or limit
(e.g., memory limit), the value from resources for that specific metric will not override it.

Interaction with GPU fields: The GPU requests/limits (amd.com/gpu or nvidia.com/gpu)
are controlled exclusively by the gpus, gpusPerReplica, and gpuVendor fields
(and the associated calculation logic described above). Any GPU specifications within
the resources field are ignored.

Default CPU/Memory with GPUs: When Kaiwo generates the underlying
Job/Deployment/RayCluster spec (i.e., the user did not provide spec.job,
spec.deployment, or spec.rayService/spec.rayJob), and GPUs are requested
(gpusPerReplica > 0), Kaiwo applies default CPU and Memory requests/limits
based on the GPU count (e.g., 4 CPU cores and 32Gi Memory per GPU).
These GPU-derived defaults will override any CPU/Memory settings defined in
the resources field in this specific scenario. If the user does provide
the underlying spec, these GPU-derived CPU/Memory defaults are not applied,
respecting the user's definition or the values from the resources field.
image string Image specifies the default container image to be used for the primary workload container(s).

- If containers defined within the underlying Job, Deployment, or Ray spec do not specify an image, this image will be used.
- If this field is also empty, the latest tag of ghcr.io/silogen/rocm-ray is used.
imagePullSecrets LocalObjectReference array ImagePullSecrets is a list of Kubernetes LocalObjectReference (containing just the secret name) referencing secrets needed to pull the container image(s). These are added to the imagePullSecrets field of the PodSpec for all generated pods.
env EnvVar array Env is a list of Kubernetes EnvVar structs. These environment variables are added to the primary workload container(s) in the generated pods. They are appended to any environment variables already defined in the underlying Job, Deployment, or Ray spec.
secretVolumes SecretVolume array SecretVolumes allows you to mount specific keys from Kubernetes Secrets as files into the workload containers.
ray boolean Ray determines whether the operator should use RayCluster for workload execution.
If true, Kaiwo will create Ray-specific resources.
If false (default), Kaiwo will create standard Kubernetes resources (BatchJob for KaiwoJob, Deployment for KaiwoService).
This setting dictates which underlying spec (job/rayJob or deployment/rayService) is primarily used.
false
storage StorageSpec Storage configures persistent storage using Kubernetes PersistentVolumeClaims (PVCs).

Enabling storage.data.download or storage.huggingFace.preCacheRepos will cause Kaiwo to create a temporary Kubernetes Job (the "download job") before starting the main workload. This job runs a container that performs the downloads into the respective PVCs. The main workload only starts after the download job completes successfully.
dangerous boolean Dangerous, when set to true, tells Kaiwo not to add the default PodSecurityContext (which normally sets runAsUser: 1000, runAsGroup: 1000, fsGroup: 1000) to the generated pods. Use this only if you need to run containers as root or a different specific user and understand the security implications. false
clusterQueue string ClusterQueue specifies the name of the Kueue ClusterQueue that the workload should be submitted to for scheduling and resource management.

This value is set as the kueue.x-k8s.io/queue-name label on the underlying resources.

If omitted, it defaults to the value specified by the DEFAULT_CLUSTER_QUEUE_NAME environment variable in the Kaiwo controller (typically "kaiwo"), which is set during installation.

Note! If the applied KaiwoQueueConfig includes no quota for the default queue, no workload will run that tries to fall back on it.

The kaiwo submit CLI command can override this using the --queue flag or the clusterQueue field in the kaiwoconfig.yaml file.
priorityClass string WorkloadPriorityClass specifies the name of Kueue WorkloadPriorityClass to be assigned to the job's pods. This influences the scheduling priority relative to other pods in the cluster.
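
To make the interplay of the common fields more concrete, the following `spec` fragment is a sketch that could sit inside a KaiwoJob or KaiwoService. All values (email address, label, environment variable, sizes) are hypothetical, and the `duration` value assumes a Go-style duration string, which this section does not confirm.

```yaml
spec:
  user: alice@example.com        # hypothetical; becomes the kaiwo.silogen.ai/user label
  gpuVendor: amd                 # the default
  replicas: 2
  gpusPerReplica: 4              # requires replicas to be set as well
  duration: 12h                  # assumed format; maximum run time of the workload
  clusterQueue: kaiwo            # set as the kueue.x-k8s.io/queue-name label
  podTemplateSpecLabels:
    team: research               # hypothetical custom label
  resources:
    requests:
      ephemeral-storage: 20Gi    # CPU/memory set here may be overridden by GPU-derived defaults
  env:
    - name: LOG_LEVEL            # hypothetical variable, appended to existing env entries
      value: info
```

Because GPUs are requested and no underlying Job, Deployment, or Ray spec is provided, any CPU or memory values placed under `resources` would be replaced by the GPU-derived defaults described in the resources row, which is why the sketch only sets ephemeral storage there.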

CommonStatusSpec

Appears in: - KaiwoJobStatus - KaiwoServiceStatus

Field Description Default Validation
startTime Time StartTime records the timestamp when the first pod associated with the workload started running.
conditions Condition array Conditions lists the observed conditions of the workload resource, following standard Kubernetes conventions. May include conditions reflecting the underlying Deployment or RayService state.
status WorkloadStatus Status reflects the current high-level phase of the workload lifecycle (e.g., PENDING, STARTING, READY, FAILED).
duration integer Duration indicates how long the workload has been running since StartTime, in seconds. Calculated periodically while running.
observedGeneration integer ObservedGeneration records the .metadata.generation of the workload resource that was last processed by the controller.

DataStorageSpec

DataStorageSpec configures the primary data volume for the workload.

Appears in: - StorageSpec

Field Description Default Validation
mountPath string MountPath specifies the path inside the workload containers where the data PersistentVolumeClaim will be mounted. /workload
storageSize string StorageSize specifies the requested size for the data PersistentVolumeClaim (e.g., "100Gi", "1Ti"). If set, a PVC will be created.
download ObjectStorageDownloadSpec Download configures optional tasks to download data from various sources into the data volume before the main workload starts. See ObjectStorageDownloadSpec.
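
A minimal sketch of the data volume configuration, assuming it is nested under `storage.data` as the CommonMetaSpec storage row suggests. StorageSpec may define further fields (not shown in this section) that are needed for a working setup.

```yaml
spec:
  storage:
    data:
      mountPath: /workload   # the default mount path
      storageSize: 100Gi     # setting a size causes a PVC to be created
```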

GCSDownloadItem

GCSDownloadItem defines parameters for downloading data from Google Cloud Storage.

Appears in: - DownloadTaskConfig - ObjectStorageDownloadSpec

Field Description Default Validation
applicationCredentials ValueReference ApplicationCredentials references a Kubernetes Secret containing the GCS service account key JSON file content. See ValueReference.
buckets CloudDownloadBucket array Buckets lists the GCS buckets and the specific files/folders to download from them. See CloudDownloadBucket.

GitDownloadItem

GitDownloadItem defines parameters for cloning a Git repository or parts of it.

Appears in: - DownloadTaskConfig - ObjectStorageDownloadSpec

Field Description Default Validation
repository string Repository specifies the Git repository URL (e.g., "https://github.com/user/repo.git").
branch string Branch specifies the branch to clone. This takes precedence over commit.
commit string Commit specifies the exact commit hash to check out. This is ignored if branch is specified.
username ValueReference Username optionally references a Secret containing the Git username for authentication. See ValueReference.
token ValueReference Token optionally references a Secret containing the Git token (or password) for authentication. See ValueReference.
path string Path specifies a sub-path within the repository to copy. If omitted, the entire repository is copied.
targetPath string TargetPath specifies the destination path relative to the data volume's mount point (DataStorageSpec.MountPath) where the repository or path content should be copied.
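
The fields of a single Git download item might look like the sketch below. The branch name and both paths are hypothetical, and the parent key under which such items are listed in ObjectStorageDownloadSpec is not shown in this section, so the fragment is deliberately left un-nested. The `username` and `token` fields are omitted because the ValueReference structure is documented elsewhere.

```yaml
repository: https://github.com/user/repo.git
branch: main               # hypothetical; when set, commit is not used
path: configs              # hypothetical sub-path; omit to copy the whole repository
targetPath: repo/configs   # hypothetical; relative to the data volume's mount point
```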

HfStorageSpec

HfStorageSpec configures storage specifically for Hugging Face model caching.

Appears in: - StorageSpec

Field Description Default Validation
mountPath string MountPath specifies the path inside workload containers where the Hugging Face cache PVC will be mounted.
This path is also automatically set as the HF_HOME environment variable in the containers.
/hf_cache
storageSize string StorageSize specifies the requested size for the Hugging Face cache PersistentVolumeClaim (e.g., "50Gi", "200Gi"). If set, a PVC will be created.
preCacheRepos HuggingFaceDownloadItem array PreCacheRepos is a list of Hugging Face repositories to download into the cache volume before the main workload starts.

HuggingFaceDownloadItem

HuggingFaceDownloadItem defines parameters for pre-caching a Hugging Face repository or specific files from it.

Appears in: - DownloadTaskConfig - HfStorageSpec

Field Description Default Validation
repoId string RepoID is the Hugging Face Hub repository ID (e.g., "meta-llama/Llama-2-7b-chat-hf").
files string array Files is an optional list of specific files to download from the repository. If omitted, the entire repository is downloaded.
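
Pre-caching a Hugging Face repository could be configured roughly as follows. This is a sketch assuming the `storage.huggingFace` nesting mentioned in the CommonMetaSpec storage row; the file names are illustrative and should be checked against the actual repository contents.

```yaml
spec:
  storage:
    huggingFace:
      mountPath: /hf_cache       # the default; also exported as HF_HOME
      storageSize: 50Gi
      preCacheRepos:
        - repoId: meta-llama/Llama-2-7b-chat-hf
          files:                 # optional; omit to download the entire repository
            - config.json        # illustrative file names
            - tokenizer.json
```

With `preCacheRepos` set, Kaiwo runs the download job first and starts the main workload only after it succeeds.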

KaiwoJob

KaiwoJob represents a batch workload managed by Kaiwo. It encapsulates either a standard Kubernetes Job or a RayJob, along with common metadata, storage configurations, and scheduling preferences. The Kaiwo controller reconciles this resource to create and manage the underlying workload objects.

Appears in: - KaiwoJobList

Field Description Default Validation
apiVersion string kaiwo.silogen.ai/v1alpha1
kind string KaiwoJob
metadata ObjectMeta Refer to Kubernetes API documentation for fields of metadata.
spec KaiwoJobSpec Spec defines the desired state of the KaiwoJob, including workload type (Job/RayJob), configuration, resources, and common metadata.
status KaiwoJobStatus Status reflects the most recently observed state of the KaiwoJob, including its phase, start/completion times, and conditions.

KaiwoJobList

KaiwoJobList contains a list of KaiwoJob resources.

Field Description Default Validation
apiVersion string kaiwo.silogen.ai/v1alpha1
kind string KaiwoJobList
metadata ListMeta Refer to Kubernetes API documentation for fields of metadata.
items KaiwoJob array

KaiwoJobSpec

KaiwoJobSpec defines the desired state of KaiwoJob.

Appears in: - KaiwoJob

Field Description Default Validation
user string User specifies the owner or creator of the workload. It should typically be the user's email address. This value is primarily used for labeling (kaiwo.silogen.ai/user) the generated resources (like Pods, Jobs, Deployments) for identification and filtering (e.g., with kaiwo list --user <email>).

In the future, if authentication is enabled, this must be the user's email address, which will be checked against the authenticated user.
podTemplateSpecLabels object (keys:string, values:string) PodTemplateSpecLabels allows you to specify custom labels that will be added to the template.metadata.labels section of the generated Pods (within Jobs, Deployments, or RayCluster specs). Standard Kaiwo system labels (like kaiwo.silogen.ai/user, kaiwo.silogen.ai/name, etc.) are added automatically and take precedence if there are conflicts.
gpus integer Gpus specifies the total number of GPUs allocated to the workload. See here for more details on how this field impacts scheduling. 0
gpuVendor string GpuVendor specifies the GPU vendor (e.g., amd, nvidia, etc.). See here for more details on how this field impacts scheduling. amd
gpuModels string array GpuModels optionally specifies the GPU models that your workload may run on. You can list the available models either by running kaiwo status amd/nvidia with the CLI or by running the kubectl command kubectl get nodes -o custom-columns=NAME:.metadata.name,MODEL:.metadata.labels.kaiwo\/gpu-model
This field is used to filter the available nodes for scheduling. You can specify multiple models, and Kaiwo will select the best available node that matches one of the specified models.
version string Version allows you to specify an optional version string for the workload. This can be useful for tracking different iterations or configurations of the same logical workload. It does not directly affect resource creation but serves as metadata.
replicas integer Replicas specifies the number of replicas for the workload. See here for more details on how this field impacts scheduling. 1
gpusPerReplica integer GpusPerReplica specifies the number of GPUs allocated per replica. See here for more details on how this field impacts scheduling.

If you specify gpusPerReplica, you must also specify replicas.
duration Duration Duration specifies the maximum duration over which the workload can run. This is useful for avoiding workloads running indefinitely.
preferredTopologyLabel string PreferredTopologyLabel specifies the preferred topology label for scheduling the workload. This is used to influence how the workload is distributed across nodes in the cluster.
If not specified, Kaiwo will use the default topology labels defined in the default topology of KaiwoQueueConfig starting at the host level.
The levels are evaluated one-by-one going up from the level indicated by the label. If the PodSet cannot fit within a given topology label then the next topology level up is considered.
If the PodSet cannot fit at the highest topology level, then it is distributed among multiple topology domains.
requiredTopologyLabel string RequiredTopologyLabel specifies the required topology label for scheduling the workload. This is used to ensure that the workload is scheduled on nodes that match the specified topology label.
resources ResourceRequirements Resources specifies the default resource requirements applied to all pods in the workload.

This field defines default Kubernetes ResourceRequirements (requests and limits for CPU,
memory, ephemeral-storage) applied to all containers (including init containers) within
the workload's pods.

Behavior:

These values act as defaults. If a container within the underlying Job, Deployment,
or Ray spec (if provided by the user) already defines a specific request or limit
(e.g., memory limit), the value from resources for that specific metric will not override it.

Interaction with GPU fields: The GPU requests/limits (amd.com/gpu or nvidia.com/gpu)
are controlled exclusively by the gpus, gpusPerReplica, and gpuVendor fields
(and the associated calculation logic described above). Any GPU specifications within
the resources field are ignored.

Default CPU/Memory with GPUs: When Kaiwo generates the underlying
Job/Deployment/RayCluster spec (i.e., the user did not provide spec.job,
spec.deployment, or spec.rayService/spec.rayJob), and GPUs are requested
(gpusPerReplica > 0), Kaiwo applies default CPU and Memory requests/limits
based on the GPU count (e.g., 4 CPU cores and 32Gi Memory per GPU).
These GPU-derived defaults will override any CPU/Memory settings defined in
the resources field in this specific scenario. If the user does provide
the underlying spec, these GPU-derived CPU/Memory defaults are not applied,
respecting the user's definition or the values from the resources field.
image string Image specifies the default container image to be used for the primary workload container(s).

- If containers defined within the underlying Job, Deployment, or Ray spec do not specify an image, this image will be used.
- If this field is also empty, the latest tag of ghcr.io/silogen/rocm-ray is used.
imagePullSecrets LocalObjectReference array ImagePullSecrets is a list of Kubernetes LocalObjectReference (containing just the secret name) referencing secrets needed to pull the container image(s). These are added to the imagePullSecrets field of the PodSpec for all generated pods.
env EnvVar array Env is a list of Kubernetes EnvVar structs. These environment variables are added to the primary workload container(s) in the generated pods. They are appended to any environment variables already defined in the underlying Job, Deployment, or Ray spec.
secretVolumes SecretVolume array SecretVolumes allows you to mount specific keys from Kubernetes Secrets as files into the workload containers.
ray boolean Ray determines whether the operator should use RayCluster for workload execution.
If true, Kaiwo will create Ray-specific resources.
If false (default), Kaiwo will create standard Kubernetes resources (BatchJob for KaiwoJob, Deployment for KaiwoService).
This setting dictates which underlying spec (job/rayJob or deployment/rayService) is primarily used.
false
storage StorageSpec Storage configures persistent storage using Kubernetes PersistentVolumeClaims (PVCs).

Enabling storage.data.download or storage.huggingFace.preCacheRepos will cause Kaiwo to create a temporary Kubernetes Job (the "download job") before starting the main workload. This job runs a container that performs the downloads into the respective PVCs. The main workload only starts after the download job completes successfully.
dangerous boolean Dangerous, when set to true, tells Kaiwo not to add the default PodSecurityContext (which normally sets runAsUser: 1000, runAsGroup: 1000, fsGroup: 1000) to the generated pods. Use this only if you need to run containers as root or a different specific user and understand the security implications. false
clusterQueue string ClusterQueue specifies the name of the Kueue ClusterQueue that the workload should be submitted to for scheduling and resource management.

This value is set as the kueue.x-k8s.io/queue-name label on the underlying resources.

If omitted, it defaults to the value specified by the DEFAULT_CLUSTER_QUEUE_NAME environment variable in the Kaiwo controller (typically "kaiwo"), which is set during installation.

Note! If the applied KaiwoQueueConfig includes no quota for the default queue, no workload will run that tries to fall back on it.

The kaiwo submit CLI command can override this using the --queue flag or the clusterQueue field in the kaiwoconfig.yaml file.
priorityClass string WorkloadPriorityClass specifies the name of Kueue WorkloadPriorityClass to be assigned to the job's pods. This influences the scheduling priority relative to other pods in the cluster.
entrypoint string EntryPoint defines the command or script that the primary container in the job's pod(s) should execute.

It can be a multi-line string. Shell script shebangs (#!/bin/bash) are detected.

For standard Kubernetes Jobs (ray: false), this populates the command and args fields of the container spec (typically ["/bin/sh", "-c", "<entrypoint_script>"]).

For RayJobs (ray: true), this populates the rayJob.spec.entrypoint field. For RayJobs, this must reference a Python script.

This overrides any default command specified in the container image or the underlying job or rayJob spec sections if they are also defined.
rayJob RayJob RayJob defines the RayJob configuration.

If this field is present (or if spec.ray is true), Kaiwo will create a RayJob resource instead of a standard batchv1.Job.

Common fields like image, resources, gpus, replicas, etc., will be merged into this spec, potentially overriding values defined here unless explicitly configured otherwise.

This provides fine-grained control over the Ray cluster configuration (head/worker groups) and Ray job submission parameters.
job Job Job defines the Kubernetes Job configuration.

If this field is present and spec.ray is false, Kaiwo will use this as the base for the created batchv1.Job.

Common fields like image, resources, gpus, entrypoint, etc., will be merged into this spec, potentially overriding values defined here.

This provides fine-grained control over standard Kubernetes Job parameters like backoffLimit, ttlSecondsAfterFinished, pod template details, etc.
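
Putting the pieces together, a small KaiwoJob that relies on Kaiwo to generate the underlying batchv1.Job might look like the sketch below. The resource name, user, and script are hypothetical; `image` is left out so that the documented default image applies.

```yaml
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoJob
metadata:
  name: train-example          # hypothetical name
spec:
  user: alice@example.com      # hypothetical user
  gpus: 1
  ray: false                   # the default: a standard Kubernetes Job is created
  entrypoint: |
    #!/bin/bash
    # hypothetical training script shipped in the image
    python train.py --epochs 1
```

Since neither `job` nor `rayJob` is provided here, Kaiwo builds the Job spec itself and fills the container command from `entrypoint`.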

KaiwoJobStatus

KaiwoJobStatus defines the observed state of KaiwoJob.

Appears in: - KaiwoJob

Field Description Default Validation
startTime Time StartTime records the timestamp when the first pod associated with the workload started running.
conditions Condition array Conditions lists the observed conditions of the workload resource, following standard Kubernetes conventions. May include conditions reflecting the underlying Deployment or RayService state.
status WorkloadStatus Status reflects the current high-level phase of the workload lifecycle (e.g., PENDING, STARTING, READY, FAILED).
duration integer Duration indicates how long the workload has been running since StartTime, in seconds. Calculated periodically while running.
observedGeneration integer ObservedGeneration records the .metadata.generation of the workload resource that was last processed by the controller.
completionTime Time CompletionTime records the timestamp when the KaiwoJob finished execution (either successfully or with failure).

KaiwoQueueConfig

KaiwoQueueConfig manages Kueue resources like ClusterQueues, ResourceFlavors, and WorkloadPriorityClasses based on its spec. It acts as a central configuration point for Kaiwo's integration with Kueue. Typically, only one cluster-scoped resource named 'kaiwo' should exist. The controller ensures that the specified Kueue resources are created, updated, or deleted to match the desired state defined here.

Appears in: - KaiwoQueueConfigList

Field Description Default Validation
apiVersion string kaiwo.silogen.ai/v1alpha1
kind string KaiwoQueueConfig
metadata ObjectMeta Refer to Kubernetes API documentation for fields of metadata.
spec KaiwoQueueConfigSpec Spec defines the desired state for Kueue resources managed by Kaiwo.
status KaiwoQueueConfigStatus Status reflects the most recently observed state of the Kueue resource synchronization.

KaiwoQueueConfigList

KaiwoQueueConfigList contains a list of KaiwoQueueConfig resources.

Field Description Default Validation
apiVersion string kaiwo.silogen.ai/v1alpha1
kind string KaiwoQueueConfigList
metadata ListMeta Refer to Kubernetes API documentation for fields of metadata.
items KaiwoQueueConfig array

KaiwoQueueConfigSpec

KaiwoQueueConfigSpec defines the desired configuration for Kaiwo's management of Kueue resources. There should typically be only one KaiwoQueueConfig resource in the cluster, named 'kaiwo'.

Appears in: - KaiwoQueueConfig

Field Description Default Validation
clusterQueues ClusterQueue array ClusterQueues defines a list of Kueue ClusterQueues that Kaiwo should manage. Kaiwo ensures these ClusterQueues exist and match the provided specs. MaxItems: 1000
resourceFlavors ResourceFlavorSpec array ResourceFlavors defines a list of Kueue ResourceFlavors that Kaiwo should manage. Kaiwo ensures these ResourceFlavors exist and match the provided specs. If omitted or empty, Kaiwo attempts to automatically discover node pools and create default flavors based on node labels. MaxItems: 20
workloadPriorityClasses WorkloadPriorityClass array WorkloadPriorityClasses defines a list of Kueue WorkloadPriorityClasses that Kaiwo should manage. Kaiwo ensures these priority classes exist with the specified values. See Kueue documentation for WorkloadPriorityClass. MaxItems: 20
topologies Topology array Topologies defines a list of Kueue Topologies that Kaiwo should manage. Kaiwo ensures these Topologies exist with the specified values. See Kueue documentation for Topology. MaxItems: 10

KaiwoQueueConfigStatus

KaiwoQueueConfigStatus represents the observed state of KaiwoQueueConfig.

Appears in: - KaiwoQueueConfig

Field Description Default Validation
conditions Condition array Conditions lists the observed conditions of the KaiwoQueueConfig resource, such as whether the managed Kueue resources are synchronized and ready.
status QueueConfigStatusDescription Status reflects the overall status of the Kueue resource synchronization managed by this config (e.g., READY, FAILED).

KaiwoService

KaiwoService represents a long-running service workload managed by Kaiwo. It encapsulates either a standard Kubernetes Deployment or a RayService (via an AppWrapper), along with common metadata, storage configurations, and scheduling preferences. The Kaiwo controller reconciles this resource to create and manage the underlying workload objects.

Appears in: - KaiwoServiceList

Field Description Default Validation
apiVersion string kaiwo.silogen.ai/v1alpha1
kind string KaiwoService
metadata ObjectMeta Refer to Kubernetes API documentation for fields of metadata.
spec KaiwoServiceSpec Spec defines the desired state of the KaiwoService, including workload type (Deployment/RayService), configuration, resources, and common metadata.
status KaiwoServiceStatus Status reflects the most recently observed state of the KaiwoService, including its phase, start time, duration, and conditions.
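
A minimal sketch of a KaiwoService that lets Kaiwo generate the underlying Deployment could look like the following. The resource name, namespace, user, and entrypoint command are hypothetical placeholders.

```yaml
# Sketch only: metadata, user, and entrypoint are illustrative assumptions.
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoService
metadata:
  name: example-service
  namespace: team-a
spec:
  user: user@example.com           # used for the kaiwo.silogen.ai/user label
  gpus: 1
  gpuVendor: amd
  image: ghcr.io/silogen/rocm-ray:latest
  entrypoint: python serve.py      # hypothetical command run in the Deployment
```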

KaiwoServiceList

KaiwoServiceList contains a list of KaiwoService resources.

Field Description Default Validation
apiVersion string kaiwo.silogen.ai/v1alpha1
kind string KaiwoServiceList
metadata ListMeta Refer to Kubernetes API documentation for fields of metadata.
items KaiwoService array

KaiwoServiceSpec

KaiwoServiceSpec defines the desired state of KaiwoService.

Appears in: - KaiwoService

Field Description Default Validation
user string User specifies the owner or creator of the workload. It should typically be the user's email address. This value is primarily used for labeling (kaiwo.silogen.ai/user) the generated resources (like Pods, Jobs, Deployments) for identification and filtering (e.g., with kaiwo list --user <email>).

In the future, if authentication is enabled, this must be the email address of the authenticated user, and the two values will be checked for a match.
podTemplateSpecLabels object (keys:string, values:string) PodTemplateSpecLabels allows you to specify custom labels that will be added to the template.metadata.labels section of the generated Pods (within Jobs, Deployments, or RayCluster specs). Standard Kaiwo system labels (like kaiwo.silogen.ai/user, kaiwo.silogen.ai/name, etc.) are added automatically and take precedence if there are conflicts.
gpus integer Gpus specifies the total number of GPUs allocated to the workload. See here for more details on how this field impacts scheduling. 0
gpuVendor string GpuVendor specifies the GPU vendor (e.g., amd, nvidia, etc.). See here for more details on how this field impacts scheduling. amd
gpuModels string array GpuModels allows you to optionally specify the GPU models that your workload will run on. You can see the available models either by running kaiwo status amd/nvidia with the CLI or by using the kubectl command kubectl get nodes -o custom-columns=NAME:.metadata.name,MODEL:.metadata.labels.kaiwo\/gpu-model
This field is used to filter the available nodes for scheduling. You can specify multiple models, and Kaiwo will select the best available node that matches one of the specified models.
version string Version allows you to specify an optional version string for the workload. This can be useful for tracking different iterations or configurations of the same logical workload. It does not directly affect resource creation but serves as metadata.
replicas integer Replicas specifies the number of replicas for the workload. See here for more details on how this field impacts scheduling. 1
gpusPerReplica integer GpusPerReplica specifies the number of GPUs allocated per replica. See here for more details on how this field impacts scheduling.

If you specify gpusPerReplica, you must also specify replicas.
duration Duration Duration specifies the maximum duration over which the workload can run. This is useful for avoiding workloads running indefinitely.
preferredTopologyLabel string PreferredTopologyLabel specifies the preferred topology label for scheduling the workload. This is used to influence how the workload is distributed across nodes in the cluster.
If not specified, Kaiwo will use the default topology labels defined in the default topology of KaiwoQueueConfig starting at the host level.
The levels are evaluated one-by-one going up from the level indicated by the label. If the PodSet cannot fit within a given topology label then the next topology level up is considered.
If the PodSet cannot fit at the highest topology level, then it is distributed among multiple topology domains.
requiredTopologyLabel string RequiredTopologyLabel specifies the required topology label for scheduling the workload. This is used to ensure that the workload is scheduled on nodes that match the specified topology label.
resources ResourceRequirements Resources specifies the default resource requirements applied to all pods inside the workload.

This field defines default Kubernetes ResourceRequirements (requests and limits for CPU,
memory, ephemeral-storage) applied to all containers (including init containers) within
the workload's pods.

Behavior:

These values act as defaults. If a container within the underlying Job, Deployment,
or Ray spec (if provided by the user) already defines a specific request or limit
(e.g., memory limit), the value from resources for that specific metric will not override it.

Interaction with GPU fields: The GPU requests/limits (amd.com/gpu or nvidia.com/gpu)
are controlled exclusively by the gpus, gpusPerReplica, and gpuVendor fields
(and the associated calculation logic described above). Any GPU specifications within
the resources field are ignored.

Default CPU/Memory with GPUs: When Kaiwo generates the underlying
Job/Deployment/RayCluster spec (i.e., the user did not provide spec.job,
spec.deployment, or spec.rayService/spec.rayJob), and GPUs are requested
(gpusPerReplica > 0), Kaiwo applies default CPU and Memory requests/limits
based on the GPU count (e.g., 4 CPU cores and 32Gi Memory per GPU).
These GPU-derived defaults will override any CPU/Memory settings defined in
the resources field in this specific scenario. If the user does provide
the underlying spec, these GPU-derived CPU/Memory defaults are not applied,
respecting the user's definition or the values from the resources field.
image string Image specifies the default container image to be used for the primary workload container(s).

- If containers defined within the underlying Job, Deployment, or Ray spec do not specify an image, this image will be used.
- If this field is also empty, the latest tag of ghcr.io/silogen/rocm-ray is used.
imagePullSecrets LocalObjectReference array ImagePullSecrets is a list of Kubernetes LocalObjectReference (containing just the secret name) referencing secrets needed to pull the container image(s). These are added to the imagePullSecrets field of the PodSpec for all generated pods.
env EnvVar array Env is a list of Kubernetes EnvVar structs. These environment variables are added to the primary workload container(s) in the generated pods. They are appended to any environment variables already defined in the underlying Job, Deployment, or Ray spec.
secretVolumes SecretVolume array SecretVolumes allows you to mount specific keys from Kubernetes Secrets as files into the workload containers.
ray boolean Ray determines whether the operator should use RayCluster for workload execution.
If true, Kaiwo will create Ray-specific resources.
If false (default), Kaiwo will create standard Kubernetes resources (BatchJob for KaiwoJob, Deployment for KaiwoService).
This setting dictates which underlying spec (job/rayJob or deployment/rayService) is primarily used.
false
storage StorageSpec Storage configures persistent storage using Kubernetes PersistentVolumeClaims (PVCs).

Enabling storage.data.download or storage.huggingFace.preCacheRepos will cause Kaiwo to create a temporary Kubernetes Job (the "download job") before starting the main workload. This job runs a container that performs the downloads into the respective PVCs. The main workload only starts after the download job completes successfully.
dangerous boolean Dangerous, when set to true, prevents Kaiwo from adding the default PodSecurityContext (which normally sets runAsUser: 1000, runAsGroup: 1000, fsGroup: 1000) to the generated pods. Use this only if you need to run containers as root or a different specific user and understand the security implications. false
clusterQueue string ClusterQueue specifies the name of the Kueue ClusterQueue that the workload should be submitted to for scheduling and resource management.

This value is set as the kueue.x-k8s.io/queue-name label on the underlying resources.

If omitted, it defaults to the value specified by the DEFAULT_CLUSTER_QUEUE_NAME environment variable in the Kaiwo controller (typically "kaiwo"), which is set during installation.

Note! If the applied KaiwoQueueConfig includes no quota for the default queue, no workload will run that tries to fall back on it.

The kaiwo submit CLI command can override this using the --queue flag or the clusterQueue field in the kaiwoconfig.yaml file.
priorityClass string WorkloadPriorityClass specifies the name of the Kueue WorkloadPriorityClass to be assigned to the workload's pods. This influences the scheduling priority relative to other pods in the cluster.
entrypoint string EntryPoint specifies the command or script executed in a Deployment.
It can also be defined inside the Deployment struct as a regular command in the form of a string array.

It is not used when ray: true (use serveConfigV2 or the rayService spec instead for Ray entrypoints).
serveConfigV2 string ServeConfigV2 defines the applications and deployments to deploy. It should be a YAML multi-line scalar string.
It can also be defined inside the RayService struct.
rayService RayService RayService allows providing a full rayv1.RayService spec.

If present (or spec.ray is true), Kaiwo creates a RayService (wrapped in an AppWrapper for Kueue integration) instead of a Deployment.

Common fields are merged into the RayClusterSpec within this spec.

Allows fine-grained control over the Ray cluster and Ray Serve configurations.
deployment Deployment Deployment allows providing a full appsv1.Deployment spec.

If present and spec.ray is false, this is used as the base for the created Deployment.

Common fields are merged into this spec.

Allows fine-grained control over Kubernetes Deployment parameters (strategy, selectors, pod template, etc.).
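
To make the interaction between resources and a user-provided deployment more concrete, here is a hedged sketch. The names, labels, and command are hypothetical. Because the deployment spec is supplied by the user, the GPU-derived CPU/memory defaults described above should not apply, and the values under resources act only as defaults for anything the container does not set itself.

```yaml
# Sketch only: names, labels, and commands are illustrative assumptions.
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoService
metadata:
  name: example-api
spec:
  user: user@example.com
  clusterQueue: kaiwo              # set as the kueue.x-k8s.io/queue-name label
  gpus: 1                          # GPU requests/limits come from this field only
  gpuVendor: amd
  image: ghcr.io/silogen/rocm-ray:latest
  resources:                       # defaults for containers lacking these values
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      cpu: "2"
      memory: 8Gi
  deployment:                      # full appsv1.Deployment, used because ray is false
    spec:
      selector:
        matchLabels:
          app: example-api
      template:
        metadata:
          labels:
            app: example-api
        spec:
          containers:
            - name: api
              command: ["python", "serve.py"]
              resources:
                limits:
                  memory: 16Gi     # already set, so the 8Gi default does not override it
```

Under the behavior described for the resources field, the container in this sketch would keep its 16Gi memory limit and pick up the remaining CPU and memory values from resources.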

KaiwoServiceStatus

KaiwoServiceStatus defines the observed state of KaiwoService.

Appears in: - KaiwoService

Field Description Default Validation
startTime Time StartTime records the timestamp when the first pod associated with the workload started running.
conditions Condition array Conditions lists the observed conditions of the workload resource, following standard Kubernetes conventions. May include conditions reflecting the underlying Deployment or RayService state.
status WorkloadStatus Status reflects the current high-level phase of the workload lifecycle (e.g., PENDING, STARTING, READY, FAILED).
duration integer Duration indicates how long the service has been running since StartTime, in seconds. Calculated periodically while running.
observedGeneration integer ObservedGeneration records the .metadata.generation of the workload resource that was last processed by the controller.

ObjectStorageDownloadSpec

ObjectStorageDownloadSpec aggregates download tasks for various object storage and Git sources within the DataStorageSpec.

Appears in: - DataStorageSpec

Field Description Default Validation
s3 S3DownloadItem array S3 lists any S3 downloads
gcs GCSDownloadItem array GCS lists any Google Cloud Storage downloads
azureBlob AzureBlobStorageDownloadItem array AzureBlob lists any Azure Blob Storage downloads
git GitDownloadItem array Git lists any Git downloads

QueueConfigStatusDescription

Underlying type: string

Appears in: - KaiwoQueueConfigStatus

Field Description
READY
FAILED

ResourceFlavorSpec

ResourceFlavorSpec defines the configuration for a Kueue ResourceFlavor managed by Kaiwo.

Appears in: - KaiwoQueueConfigSpec

Field Description Default Validation
name string Name specifies the name of the Kueue ResourceFlavor resource (e.g., "amd-mi300-8gpu").
nodeLabels object (keys:string, values:string) NodeLabels specifies the labels that pods requesting this flavor must match on nodes. This is used by Kueue for scheduling decisions. Keys and values should correspond to actual node labels. Example: {"kaiwo/nodepool": "amd-gpu-nodes"} MaxProperties: 10
taints Taint array Taints specifies a list of taints associated with this flavor. MaxItems: 5
tolerations Toleration array Tolerations specifies a list of tolerations associated with this flavor. This is less common than using Taints; Kueue primarily uses Taints to derive Tolerations. MaxItems: 5
topologyName string TopologyName specifies the name of the Kueue Topology that this flavor belongs to. If specified, it must match one of the Topologies defined in the KaiwoQueueConfig.
This is used to group flavors by topology for scheduling purposes.
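
As a sketch of a single flavor entry under spec.resourceFlavors of a KaiwoQueueConfig, the fragment below combines node labels, a taint, and a topology reference. The taint key and the topology name are hypothetical; the taint fields themselves (key, value, effect) are the standard Kubernetes Taint fields.

```yaml
# Fragment of KaiwoQueueConfig.spec; taint key and topology name are assumptions.
resourceFlavors:
  - name: amd-mi300-8gpu
    nodeLabels:
      kaiwo/nodepool: amd-gpu-nodes
    taints:
      - key: example.com/gpu-only  # hypothetical taint key
        value: "true"
        effect: NoSchedule
    topologyName: default-topology # must match an entry under spec.topologies
```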

S3DownloadItem

S3DownloadItem defines parameters for downloading data from an S3-compatible object store.

Appears in: - DownloadTaskConfig - ObjectStorageDownloadSpec

Field Description Default Validation
endpointUrl string EndpointUrl specifies the S3 API endpoint URL (e.g., "https://s3.us-east-1.amazonaws.com" or a MinIO endpoint).
accessKeyId ValueReference AccessKeyId optionally references a Kubernetes Secret containing the S3 access key ID. See ValueReference.
secretKey ValueReference SecretKey optionally references a Kubernetes Secret containing the S3 secret access key. See ValueReference.
buckets CloudDownloadBucket array Buckets lists the S3 buckets and the specific files/folders to download from them. See CloudDownloadBucket.
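
The following fragment sketches one S3DownloadItem as it might appear in the s3 list of an ObjectStorageDownloadSpec. The Secret name, its keys, and the bucket name are hypothetical, and the file and folder entries are left out because their fields are defined by CloudDownloadFile and CloudDownloadFolder.

```yaml
# Fragment of an ObjectStorageDownloadSpec; Secret and bucket names are assumptions.
s3:
  - endpointUrl: https://s3.us-east-1.amazonaws.com
    accessKeyId:                   # ValueReference
      secretName: s3-credentials
      secretKey: access-key-id
    secretKey:                     # ValueReference
      secretName: s3-credentials
      secretKey: secret-access-key
    buckets:
      - name: training-data
        # files: and folders: entries go here; see CloudDownloadFile and
        # CloudDownloadFolder for their fields.
```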

SecretVolume

SecretVolume defines how to mount a specific key from a Kubernetes Secret into the workload's containers.

Appears in: - CommonMetaSpec - KaiwoJobSpec - KaiwoServiceSpec

Field Description Default Validation
name string Name defines the name of the Kubernetes Volume that will be created. Should be unique within the pod.
secretName string SecretName specifies the name of the Kubernetes Secret resource to mount from.
key string Key specifies the key within the Secret whose value should be mounted. If omitted, the entire secret might be mounted as files (depending on Kubernetes behavior).
subPath string SubPath defines the filename within the MountPath directory where the secret Key's content will be placed. Useful for mounting a single secret key as a file.
mountPath string MountPath defines the directory path inside the container where the secret volume (or the SubPath file) should be mounted.
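
For instance, mounting a single key from a Secret as a file could be sketched as follows. The volume name, Secret name, key, and paths are hypothetical.

```yaml
# Fragment of a KaiwoJob or KaiwoService spec; all names are assumptions.
secretVolumes:
  - name: api-token-volume         # must be unique within the pod
    secretName: api-credentials
    key: token
    subPath: token                 # filename inside mountPath
    mountPath: /etc/secrets
```

Going by the field descriptions above, the value of the token key would be expected to appear as the file /etc/secrets/token inside the workload containers.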

StorageSpec

StorageSpec defines the storage configuration for the workload.

Appears in: - CommonMetaSpec - KaiwoJobSpec - KaiwoServiceSpec

Field Description Default Validation
storageEnabled boolean StorageEnabled must be true to enable the creation of any PersistentVolumeClaims defined within this spec. If false, data and huggingFace sections are ignored.
storageClassName string StorageClassName specifies the name of the Kubernetes StorageClass to use when creating PersistentVolumeClaims for data and huggingFace volumes. Must refer to an existing StorageClass in the cluster.
accessMode PersistentVolumeAccessMode AccessMode determines the access mode (e.g., ReadWriteOnce, ReadWriteMany, ReadOnlyMany) for the created PersistentVolumeClaims.

In a multi-node setting, ReadWriteMany is generally required, as pods scheduled on different nodes cannot access ReadWriteOnce PVCs. This is true even when replicas: 1 if you are using download jobs, as the download pod may get scheduled on a different node than the main workload pod.
ReadWriteMany
data DataStorageSpec Data configures the main data PersistentVolumeClaim and optional pre-download tasks for it.
huggingFace HfStorageSpec HuggingFace configures a PersistentVolumeClaim specifically for caching Hugging Face models and datasets, with options for pre-caching.
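
A minimal sketch of the top-level storage settings is shown below. The StorageClass name is hypothetical, and the data and huggingFace sections are omitted because their fields are defined by DataStorageSpec and HfStorageSpec.

```yaml
# Fragment of a KaiwoJob or KaiwoService spec; the StorageClass name is an assumption.
storage:
  storageEnabled: true             # without this, data and huggingFace are ignored
  storageClassName: shared-rwx     # must refer to an existing StorageClass
  accessMode: ReadWriteMany        # lets download and workload pods run on different nodes
```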

Topology

Topology is the Schema for the topology API.

Appears in: - KaiwoQueueConfigSpec

Field Description Default Validation
metadata ObjectMeta Refer to Kubernetes API documentation for fields of metadata.
spec TopologySpec Required: {}

TopologySpec

Appears in: - Topology

Field Description Default Validation
levels TopologyLevel array levels define the levels of topology. MaxItems: 8
MinItems: 1
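
As a sketch, a three-level topology under spec.topologies of a KaiwoQueueConfig might look like the following. The block and rack labels are hypothetical, and the nodeLabel field name comes from the upstream Kueue TopologyLevel type rather than from this reference.

```yaml
# Fragment of KaiwoQueueConfig.spec; block and rack labels are assumptions.
topologies:
  - metadata:
      name: default-topology
    spec:
      levels:                      # ordered from the broadest level to the narrowest
        - nodeLabel: example.com/topology-block
        - nodeLabel: example.com/topology-rack
        - nodeLabel: kubernetes.io/hostname
```

With a topology like this, a workload could set preferredTopologyLabel to example.com/topology-rack to ask that its pods be placed within one rack where possible, falling back to the block level if they do not fit.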

ValueReference

ValueReference provides a way to reference sensitive values stored in Kubernetes Secrets, typically used for credentials needed by download tasks.

Appears in: - AzureBlobStorageDownloadItem - GCSDownloadItem - GitDownloadItem - S3DownloadItem

Field Description Default Validation
file string File specifies the expected path within the download job's container where the secret value will be mounted as a file. This path is usually automatically generated by the controller based on SecretName and SecretKey.
secretName string SecretName is the name of the Kubernetes Secret resource containing the value.
secretKey string SecretKey is the key within the specified Secret whose value should be used.

WorkloadStatus

Underlying type: string

Appears in: - CommonStatusSpec - KaiwoJobStatus - KaiwoServiceStatus

Field Description
`` WorkloadStatusNew indicates the resource has been created but not yet processed by the controller.
DOWNLOADING WorkloadStatusDownloading indicates that the resource is currently running the download job.
PENDING WorkloadStatusPending indicates the resource is waiting for prerequisites (like Kueue admission) to complete.
STARTING WorkloadStatusStarting indicates the Kaiwo workload has been admitted, and the underlying workload (Job, Deployment, RayService) is being created or started.
RUNNING WorkloadStatusRunning indicates the workload pods are running. For KaiwoJob, this means the job has started execution. For KaiwoService, pods are up but may not yet be fully ready/healthy.
COMPLETE WorkloadStatusComplete indicates a KaiwoJob has finished successfully.
ERROR WorkloadStatusError indicates the workload encountered an error which can be recovered from.
FAILED WorkloadStatusFailed indicates the workload (KaiwoJob or KaiwoService) encountered an error and cannot proceed or recover.
TERMINATING WorkloadStatusTerminating indicates that the workload should begin to terminate the underlying resources.
TERMINATED WorkloadStatusTerminated indicates the workload has been terminated by the user or system. This could be due to the duration deadline being met while other workloads are competing for GPUs.