API Reference

Packages

kaiwo.silogen.ai/v1alpha1

kaiwo.silogen.ai/v1alpha1

Package v1alpha1 contains API Schema definitions for the kaiwo v1alpha1 API group.

Resource Types

KaiwoJob
KaiwoJobList
KaiwoQueueConfig
KaiwoQueueConfigList
KaiwoService
KaiwoServiceList

AzureBlobStorageDownloadItem

AzureBlobStorageDownloadItem defines parameters for downloading data from Azure Blob Storage.

Appears in: - DownloadTaskConfig - ObjectStorageDownloadSpec

Field	Description	Default	Validation
`connectionString` ValueReference	ConnectionString references a Kubernetes Secret containing the Azure Storage connection string. See `ValueReference`.
`containers` CloudDownloadBucket array	Containers lists the Azure Blob Storage containers and the specific files/folders to download from them. See `CloudDownloadBucket`.

CloudDownloadBucket

CloudDownloadBucket represents a specific bucket (S3, GCS) or container (Azure) to download from.

Appears in: - AzureBlobStorageDownloadItem - GCSDownloadItem - S3DownloadItem

Field	Description	Default	Validation
`name` string	Name is the name of the bucket or container.
`files` CloudDownloadFile array	Files lists specific files to download from this bucket/container.
`folders` CloudDownloadFolder array	Folders lists specific folders (prefixes) to download from this bucket/container.

ClusterQueue

ClusterQueue defines the configuration for a Kueue ClusterQueue managed by Kaiwo.

Appears in: - KaiwoQueueConfigSpec

Field	Description	Default	Validation
`name` string	Name specifies the name of the Kueue ClusterQueue resource.
`spec` ClusterQueueSpec	Spec contains the desired Kueue `ClusterQueueSpec`. Kaiwo ensures the corresponding ClusterQueue resource matches this spec. See Kueue documentation for `ClusterQueueSpec` fields like `resourceGroups`, `cohort`, `preemption`, etc.
`namespaces` string array	Namespaces optionally lists Kubernetes namespaces where Kaiwo should automatically create a Kueue `LocalQueue` resource pointing to this ClusterQueue. If one or more namespaces are provided, the KaiwoQueueConfig controller takes over managing the LocalQueues for this ClusterQueue. Leave this empty if you want to be able to create your own LocalQueues for this ClusterQueue.
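
As an illustration of how `name`, `spec`, and `namespaces` fit together, a minimal `KaiwoQueueConfig` with one managed ClusterQueue might look like the sketch below. The queue, namespace, and flavor names and all quota values are hypothetical. The contents of `spec` follow the Kueue `ClusterQueueSpec` shape (`resourceGroups`, `coveredResources`, `flavors`, `nominalQuota`) and should be checked against the Kueue version in use.

```yaml
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoQueueConfig
metadata:
  name: kaiwo                        # the single cluster-scoped config this reference describes
spec:
  clusterQueues:
    - name: team-a-queue             # hypothetical ClusterQueue name
      namespaces:                    # Kaiwo manages a LocalQueue in each listed namespace
        - team-a
        - team-a-dev
      spec:
        namespaceSelector: {}        # empty selector: all namespaces are eligible
        resourceGroups:
          - coveredResources: ["cpu", "memory", "amd.com/gpu"]
            flavors:
              - name: example-flavor # hypothetical ResourceFlavor name
                resources:
                  - name: cpu
                    nominalQuota: 64
                  - name: memory
                    nominalQuota: 512Gi
                  - name: amd.com/gpu
                    nominalQuota: 8
```

Leaving `namespaces` empty keeps LocalQueue creation in your own hands for this ClusterQueue.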

ClusterQueueSpec

Appears in: - ClusterQueue

Field	Description	Default	Validation
`resourceGroups` ResourceGroup array	resourceGroups describes groups of resources. Each resource group defines the list of resources and a list of flavors that provide quotas for these resources. Each resource and each flavor can only form part of one resource group. At most 16 resource groups are allowed.		MaxItems: 16
`cohort` CohortReference	cohort that this ClusterQueue belongs to. CQs that belong to the same cohort can borrow unused resources from each other. A CQ can be a member of a single borrowing cohort. A workload submitted to a queue referencing this CQ can borrow quota from any CQ in the cohort. Only quota for the [resource, flavor] pairs listed in the CQ can be borrowed. If empty, this ClusterQueue cannot borrow from any other ClusterQueue and vice versa. A cohort is a name that links CQs together, but it doesn't reference any object.
`queueingStrategy` QueueingStrategy	QueueingStrategy indicates the queueing strategy of the workloads across the queues in this ClusterQueue. Current Supported Strategies: - StrictFIFO: workloads are ordered strictly by creation time. Older workloads that can't be admitted will block admitting newer workloads even if they fit available quota. - BestEffortFIFO: workloads are ordered by creation time, however older workloads that can't be admitted will not block admitting newer workloads that fit existing quota.	BestEffortFIFO	Enum: [StrictFIFO BestEffortFIFO]
`namespaceSelector` LabelSelector	namespaceSelector defines which namespaces are allowed to submit workloads to this clusterQueue. Beyond this basic support for policy, a policy agent like Gatekeeper should be used to enforce more advanced policies. Defaults to null, which is a nothing selector (no namespaces are eligible). If set to an empty selector `{}`, then all namespaces are eligible.
`flavorFungibility` FlavorFungibility	flavorFungibility defines whether a workload should try the next flavor before borrowing or preempting in the flavor being evaluated.	{ }
`preemption` ClusterQueuePreemption		{ }
`admissionChecks` AdmissionCheckReference array	admissionChecks lists the AdmissionChecks required by this ClusterQueue. Cannot be used together with `admissionChecksStrategy`.
`admissionChecksStrategy` AdmissionChecksStrategy	admissionChecksStrategy defines a list of strategies to determine which ResourceFlavors require AdmissionChecks. Cannot be used together with `admissionChecks`.
`stopPolicy` StopPolicy	stopPolicy: if set to a value other than None, the ClusterQueue is considered inactive and no new reservations are made. Depending on its value, the associated workloads behave as follows: - None: workloads are admitted. - HoldAndDrain: admitted workloads are evicted and reserving workloads cancel their reservation. - Hold: admitted workloads run to completion and reserving workloads cancel their reservation.	None	Enum: [None Hold HoldAndDrain]
`fairSharing` FairSharing	fairSharing defines the properties of the ClusterQueue when participating in FairSharing. The values are only relevant if FairSharing is enabled in the Kueue configuration.
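
The fragment below sketches how a few of these fields could be combined inside a ClusterQueue entry's `spec`. The cohort and AdmissionCheck names are hypothetical, and the field shapes are assumed to match the Kueue `ClusterQueueSpec` that this table mirrors.

```yaml
# Fragment of a ClusterQueue `spec`; all values are illustrative only.
cohort: shared-pool            # hypothetical cohort; queues in the same cohort can borrow unused quota
queueingStrategy: StrictFIFO   # BestEffortFIFO is the default
namespaceSelector: {}          # empty selector: all namespaces may submit workloads
stopPolicy: None               # one of None, Hold, HoldAndDrain
admissionChecks:               # do not combine with admissionChecksStrategy
  - example-check              # hypothetical AdmissionCheck name
```

Omitting `namespaceSelector` entirely has the opposite effect of `{}`: the default null selector matches no namespaces.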

CommonMetaSpec

CommonMetaSpec defines reusable metadata fields for workloads.

Appears in: - KaiwoJobSpec - KaiwoServiceSpec

Field	Description	Default
`user` string	User specifies the owner or creator of the workload. It should typically be the user's email address. This value is primarily used for labeling (`kaiwo.silogen.ai/user`) the generated resources (like Pods, Jobs, Deployments) for identification and filtering (e.g., with `kaiwo list --user <email>`). In the future, if authentication is enabled, this must be the email address of the authenticated user, and it will be checked for a match.
`podTemplateSpecLabels` object (keys:string, values:string)	PodTemplateSpecLabels allows you to specify custom labels that will be added to the `template.metadata.labels` section of the generated Pods (within Jobs, Deployments, or RayCluster specs). Standard Kaiwo system labels (like `kaiwo.silogen.ai/user`, `kaiwo.silogen.ai/name`, etc.) are added automatically and take precedence if there are conflicts.
`gpus` integer	Gpus specifies the total number of GPUs allocated to the workload. See here for more details on how this field impacts scheduling.	0
`gpuVendor` string	GpuVendor specifies the GPU vendor (e.g., amd, nvidia, etc.). See here for more details on how this field impacts scheduling.	amd
`gpuModels` string array	GpuModels allows you to optionally specify the GPU models that your workload will run on. You can list the available models either with the CLI by running `kaiwo status amd/nvidia` or with kubectl by running `kubectl get nodes -o custom-columns=NAME:.metadata.name,MODEL:.metadata.labels.kaiwo\/gpu-model`. This field is used to filter the available nodes for scheduling. You can specify multiple models, and Kaiwo will select the best available node that matches one of the specified models.
`version` string	Version allows you to specify an optional version string for the workload. This can be useful for tracking different iterations or configurations of the same logical workload. It does not directly affect resource creation but serves as metadata.
`replicas` integer	Replicas specifies the number of replicas for the workload. See here for more details on how this field impacts scheduling.	1
`gpusPerReplica` integer	GpusPerReplica specifies the number of GPUs allocated per replica. See here for more details on how this field impacts scheduling. If you specify `gpusPerReplica`, you must also specify `replicas`.
`duration` Duration	Duration specifies the maximum duration over which the workload can run. This is useful for avoiding workloads running indefinitely.
`preferredTopologyLabel` string	PreferredTopologyLabel specifies the preferred topology label for scheduling the workload. This is used to influence how the workload is distributed across nodes in the cluster. If not specified, Kaiwo will use the default topology labels defined in the default topology of KaiwoQueueConfig, starting at the host level. The levels are evaluated one by one, going up from the level indicated by the label. If the PodSet cannot fit within a given topology label, then the next topology level up is considered. If the PodSet cannot fit at the highest topology level, then it is distributed among multiple topology domains.
`requiredTopologyLabel` string	RequiredTopologyLabel specifies the required topology label for scheduling the workload. This is used to ensure that the workload is scheduled on nodes that match the specified topology label.
`resources` ResourceRequirements	Resources specifies the default resource requirements applied to all pods inside the workload. This field defines default Kubernetes `ResourceRequirements` (requests and limits for CPU, memory, ephemeral-storage) applied to all containers (including init containers) within the workload's pods. Behavior: These values act as defaults. If a container within the underlying Job, Deployment, or Ray spec (if provided by the user) already defines a specific request or limit (e.g., `memory` limit), the value from `resources` for that specific metric will not override it. Interaction with GPU fields: The GPU requests/limits (`amd.com/gpu` or `nvidia.com/gpu`) are controlled exclusively by the `gpus`, `gpusPerReplica`, and `gpuVendor` fields (and the associated calculation logic described above). Any GPU specifications within the `resources` field are ignored. Default CPU/Memory with GPUs: When Kaiwo generates the underlying Job/Deployment/RayCluster spec (i.e., the user did not provide `spec.job`, `spec.deployment`, or `spec.rayService`/`spec.rayJob`), and GPUs are requested (`gpusPerReplica` > 0), Kaiwo applies default CPU and Memory requests/limits based on the GPU count (e.g., 4 CPU cores and 32Gi Memory per GPU). These GPU-derived defaults will override any CPU/Memory settings defined in the `resources` field in this specific scenario. If the user does provide the underlying spec, these GPU-derived CPU/Memory defaults are not applied, respecting the user's definition or the values from the `resources` field.
`image` string	Image specifies the default container image to be used for the primary workload container(s). - If containers defined within the underlying Job, Deployment, or Ray spec do not specify an image, this image will be used. - If this field is also empty, the latest tag of ghcr.io/silogen/rocm-ray is used.
`imagePullSecrets` LocalObjectReference array	ImagePullSecrets is a list of Kubernetes `LocalObjectReference` (containing just the secret `name`) referencing secrets needed to pull the container image(s). These are added to the `imagePullSecrets` field of the PodSpec for all generated pods.
`env` EnvVar array	Env is a list of Kubernetes `EnvVar` structs. These environment variables are added to the primary workload container(s) in the generated pods. They are appended to any environment variables already defined in the underlying Job, Deployment, or Ray spec.
`secretVolumes` SecretVolume array	SecretVolumes allows you to mount specific keys from Kubernetes Secrets as files into the workload containers.
`ray` boolean	Ray determines whether the operator should use RayCluster for workload execution. If `true`, Kaiwo will create Ray-specific resources. If `false` (default), Kaiwo will create standard Kubernetes resources (BatchJob for `KaiwoJob`, Deployment for `KaiwoService`). This setting dictates which underlying spec (`job`/`rayJob` or `deployment`/`rayService`) is primarily used.	false
`storage` StorageSpec	Storage configures persistent storage using Kubernetes PersistentVolumeClaims (PVCs). Enabling `storage.data.download` or `storage.huggingFace.preCacheRepos` will cause Kaiwo to create a temporary Kubernetes Job (the "download job") before starting the main workload. This job runs a container that performs the downloads into the respective PVCs. The main workload only starts after the download job completes successfully.
`dangerous` boolean	If Dangerous is set to `true`, Kaiwo will not add the default `PodSecurityContext` (which normally sets `runAsUser: 1000`, `runAsGroup: 1000`, `fsGroup: 1000`) to the generated pods. Use this only if you need to run containers as root or a different specific user and understand the security implications.	false
`clusterQueue` string	ClusterQueue specifies the name of the Kueue `ClusterQueue` that the workload should be submitted to for scheduling and resource management. This value is set as the `kueue.x-k8s.io/queue-name` label on the underlying resources. If omitted, it defaults to the value specified by the `DEFAULT_CLUSTER_QUEUE_NAME` environment variable in the Kaiwo controller (typically "kaiwo"), which is set during installation. Note: if the applied KaiwoQueueConfig includes no quota for the default queue, workloads that fall back on it will not run. The `kaiwo submit` CLI command can override this using the `--queue` flag or the `clusterQueue` field in the `kaiwoconfig.yaml` file.
`priorityClass` string	WorkloadPriorityClass specifies the name of Kueue `WorkloadPriorityClass` to be assigned to the job's pods. This influences the scheduling priority relative to other pods in the cluster.
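
To make the field descriptions above more concrete, here is a hedged sketch of how several common fields might appear together in a workload `spec`. Every value is illustrative: the email address, label, GPU model, secret name, and priority class are hypothetical, and the `duration` string assumes the usual Kubernetes duration format.

```yaml
# Fragment of a KaiwoJob or KaiwoService `spec`; all values are illustrative.
user: alice@example.com          # also applied as the kaiwo.silogen.ai/user label
gpuVendor: amd                   # default vendor
replicas: 2                      # required whenever gpusPerReplica is set
gpusPerReplica: 4
gpuModels:
  - example-model                # hypothetical; list real values with `kaiwo status amd/nvidia`
duration: 2h                     # assumed duration format; caps the workload's run time
podTemplateSpecLabels:
  team: research                 # Kaiwo system labels win on conflict
env:
  - name: LOG_LEVEL              # appended to env already defined in the underlying spec
    value: info
imagePullSecrets:
  - name: registry-credentials   # hypothetical Secret name
clusterQueue: kaiwo              # typical default queue name
priorityClass: example-priority  # hypothetical Kueue WorkloadPriorityClass
resources:
  requests:
    ephemeral-storage: 10Gi      # GPU entries placed here would be ignored
```

CPU and memory are left out of `resources` on purpose: when Kaiwo generates the underlying spec and GPUs are requested, the GPU-derived CPU and memory defaults override whatever `resources` sets for them.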

CommonStatusSpec

Appears in: - KaiwoJobStatus - KaiwoServiceStatus

Field	Description	Default	Validation
`startTime` Time	StartTime records the timestamp when the first pod associated with the workload started running.
`conditions` Condition array	Conditions lists the observed conditions of the workload resource, following standard Kubernetes conventions. May include conditions reflecting the underlying Deployment or RayService state.
`status` WorkloadStatus	Status reflects the current high-level phase of the workload lifecycle (e.g., PENDING, STARTING, READY, FAILED).
`duration` integer	Duration indicates how long the service has been running since StartTime, in seconds. Calculated periodically while running.
`observedGeneration` integer	ObservedGeneration records the `.metadata.generation` of the workload resource that was last processed by the controller.

DataStorageSpec

DataStorageSpec configures the primary data volume for the workload.

Appears in: - StorageSpec

Field	Description	Default
`mountPath` string	MountPath specifies the path inside the workload containers where the data PersistentVolumeClaim will be mounted.	/workload
`storageSize` string	StorageSize specifies the requested size for the data PersistentVolumeClaim (e.g., "100Gi", "1Ti"). If set, a PVC will be created.
`download` ObjectStorageDownloadSpec	Download configures optional tasks to download data from various sources into the data volume before the main workload starts. See `ObjectStorageDownloadSpec`.
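
A minimal sketch of the data volume configuration follows. The nesting under `storage.data` is inferred from the `storage.data.download` path mentioned in the `storage` field description, and the size is illustrative.

```yaml
# Fragment of a workload `spec`.
storage:
  data:
    mountPath: /workload   # default mount path
    storageSize: 100Gi     # setting a size causes a PVC to be created
```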

GCSDownloadItem

GCSDownloadItem defines parameters for downloading data from Google Cloud Storage.

Appears in: - DownloadTaskConfig - ObjectStorageDownloadSpec

Field	Description	Default	Validation
`applicationCredentials` ValueReference	ApplicationCredentials references a Kubernetes Secret containing the GCS service account key JSON file content. See `ValueReference`.
`buckets` CloudDownloadBucket array	Buckets lists the GCS buckets and the specific files/folders to download from them. See `CloudDownloadBucket`.

GitDownloadItem

GitDownloadItem defines parameters for cloning a Git repository or parts of it.

Appears in: - DownloadTaskConfig - ObjectStorageDownloadSpec

Field	Description	Default	Validation
`repository` string	Repository specifies the Git repository URL (e.g., "https://github.com/user/repo.git").
`branch` string	Branch specifies the branch to clone. This takes precedence over `commit`.
`commit` string	Commit specifies the exact commit hash to check out. This is ignored if `branch` is specified.
`username` ValueReference	Username optionally references a Secret containing the Git username for authentication. See `ValueReference`.
`token` ValueReference	Token optionally references a Secret containing the Git token (or password) for authentication. See `ValueReference`.
`path` string	Path specifies a sub-path within the repository to copy. If omitted, the entire repository is copied.
`targetPath` string	TargetPath specifies the destination path relative to the data volume's mount point (`DataStorageSpec.MountPath`) where the repository or `path` content should be copied.
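
For illustration, a single Git download item could be written as in the sketch below. Only the item itself is shown, because the name of the list that holds it inside `ObjectStorageDownloadSpec` is not given in this section. The sub-path and target path are hypothetical.

```yaml
# One GitDownloadItem; values are illustrative.
repository: https://github.com/user/repo.git
branch: main                 # if both branch and commit are set, branch wins
path: configs                # hypothetical sub-path; omit to copy the whole repository
targetPath: repo-configs     # hypothetical; relative to the data volume's mount path
```

For private repositories, `username` and `token` would additionally reference Secrets through `ValueReference`, whose fields are documented separately.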

HfStorageSpec

HfStorageSpec configures storage specifically for Hugging Face model caching.

Appears in: - StorageSpec

Field	Description	Default
`mountPath` string	MountPath specifies the path inside workload containers where the Hugging Face cache PVC will be mounted. This path is also automatically set as the `HF_HOME` environment variable in the containers.	/hf_cache
`storageSize` string	StorageSize specifies the requested size for the Hugging Face cache PersistentVolumeClaim (e.g., "50Gi", "200Gi"). If set, a PVC will be created.
`preCacheRepos` HuggingFaceDownloadItem array	PreCacheRepos is a list of Hugging Face repositories to download into the cache volume before the main workload starts.

HuggingFaceDownloadItem

HuggingFaceDownloadItem defines parameters for pre-caching a Hugging Face repository or specific files from it.

Appears in: - DownloadTaskConfig - HfStorageSpec

Field	Description	Default	Validation
`repoId` string	RepoID is the Hugging Face Hub repository ID (e.g., "meta-llama/Llama-2-7b-chat-hf").
`files` string array	Files is an optional list of specific files to download from the repository. If omitted, the entire repository is downloaded.
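
The sketch below shows how a Hugging Face cache volume with one pre-cached repository might be declared. The nesting under `storage.huggingFace` is inferred from the `storage.huggingFace.preCacheRepos` path mentioned in the `storage` field description, and the size and file names are illustrative.

```yaml
# Fragment of a workload `spec`.
storage:
  huggingFace:
    mountPath: /hf_cache     # default; also exported to containers as HF_HOME
    storageSize: 200Gi
    preCacheRepos:
      - repoId: meta-llama/Llama-2-7b-chat-hf
        files:               # optional; omit to download the entire repository
          - config.json
          - tokenizer.json
```

Because `preCacheRepos` is set, Kaiwo would run the download job first and start the main workload only after it completes successfully.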

KaiwoJob

KaiwoJob represents a batch workload managed by Kaiwo. It encapsulates either a standard Kubernetes Job or a RayJob, along with common metadata, storage configurations, and scheduling preferences. The Kaiwo controller reconciles this resource to create and manage the underlying workload objects.

Appears in: - KaiwoJobList

Field	Description	Default	Validation
`apiVersion` string	`kaiwo.silogen.ai/v1alpha1`
`kind` string	`KaiwoJob`
`metadata` ObjectMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`spec` KaiwoJobSpec	Spec defines the desired state of the KaiwoJob, including workload type (Job/RayJob), configuration, resources, and common metadata.
`status` KaiwoJobStatus	Status reflects the most recently observed state of the KaiwoJob, including its phase, start/completion times, and conditions.

KaiwoJobList

Field	Description	Default	Validation
`apiVersion` string	`kaiwo.silogen.ai/v1alpha1`
`kind` string	`KaiwoJobList`
`metadata` ListMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`items` KaiwoJob array

KaiwoJobSpec

KaiwoJobSpec defines the desired state of KaiwoJob.

Appears in: - KaiwoJob

Field	Description	Default
`user` string	User specifies the owner or creator of the workload. It should typically be the user's email address. This value is primarily used for labeling (`kaiwo.silogen.ai/user`) the generated resources (like Pods, Jobs, Deployments) for identification and filtering (e.g., with `kaiwo list --user <email>`). In the future, if authentication is enabled, this must be the email address of the authenticated user, and it will be checked for a match.
`podTemplateSpecLabels` object (keys:string, values:string)	PodTemplateSpecLabels allows you to specify custom labels that will be added to the `template.metadata.labels` section of the generated Pods (within Jobs, Deployments, or RayCluster specs). Standard Kaiwo system labels (like `kaiwo.silogen.ai/user`, `kaiwo.silogen.ai/name`, etc.) are added automatically and take precedence if there are conflicts.
`gpus` integer	Gpus specifies the total number of GPUs allocated to the workload. See here for more details on how this field impacts scheduling.	0
`gpuVendor` string	GpuVendor specifies the GPU vendor (e.g., amd, nvidia, etc.). See here for more details on how this field impacts scheduling.	amd
`gpuModels` string array	GpuModels allows you to optionally specify the GPU models that your workload will run on. You can list the available models either with the CLI by running `kaiwo status amd/nvidia` or with kubectl by running `kubectl get nodes -o custom-columns=NAME:.metadata.name,MODEL:.metadata.labels.kaiwo\/gpu-model`. This field is used to filter the available nodes for scheduling. You can specify multiple models, and Kaiwo will select the best available node that matches one of the specified models.
`version` string	Version allows you to specify an optional version string for the workload. This can be useful for tracking different iterations or configurations of the same logical workload. It does not directly affect resource creation but serves as metadata.
`replicas` integer	Replicas specifies the number of replicas for the workload. See here for more details on how this field impacts scheduling.	1
`gpusPerReplica` integer	GpusPerReplica specifies the number of GPUs allocated per replica. See here for more details on how this field impacts scheduling. If you specify `gpusPerReplica`, you must also specify `replicas`.
`duration` Duration	Duration specifies the maximum duration over which the workload can run. This is useful for avoiding workloads running indefinitely.
`preferredTopologyLabel` string	PreferredTopologyLabel specifies the preferred topology label for scheduling the workload. This is used to influence how the workload is distributed across nodes in the cluster. If not specified, Kaiwo will use the default topology labels defined in the default topology of KaiwoQueueConfig, starting at the host level. The levels are evaluated one by one, going up from the level indicated by the label. If the PodSet cannot fit within a given topology label, then the next topology level up is considered. If the PodSet cannot fit at the highest topology level, then it is distributed among multiple topology domains.
`requiredTopologyLabel` string	RequiredTopologyLabel specifies the required topology label for scheduling the workload. This is used to ensure that the workload is scheduled on nodes that match the specified topology label.
`resources` ResourceRequirements	Resources specifies the default resource requirements applied to all pods inside the workload. This field defines default Kubernetes `ResourceRequirements` (requests and limits for CPU, memory, ephemeral-storage) applied to all containers (including init containers) within the workload's pods. Behavior: These values act as defaults. If a container within the underlying Job, Deployment, or Ray spec (if provided by the user) already defines a specific request or limit (e.g., `memory` limit), the value from `resources` for that specific metric will not override it. Interaction with GPU fields: The GPU requests/limits (`amd.com/gpu` or `nvidia.com/gpu`) are controlled exclusively by the `gpus`, `gpusPerReplica`, and `gpuVendor` fields (and the associated calculation logic described above). Any GPU specifications within the `resources` field are ignored. Default CPU/Memory with GPUs: When Kaiwo generates the underlying Job/Deployment/RayCluster spec (i.e., the user did not provide `spec.job`, `spec.deployment`, or `spec.rayService`/`spec.rayJob`), and GPUs are requested (`gpusPerReplica` > 0), Kaiwo applies default CPU and Memory requests/limits based on the GPU count (e.g., 4 CPU cores and 32Gi Memory per GPU). These GPU-derived defaults will override any CPU/Memory settings defined in the `resources` field in this specific scenario. If the user does provide the underlying spec, these GPU-derived CPU/Memory defaults are not applied, respecting the user's definition or the values from the `resources` field.
`image` string	Image specifies the default container image to be used for the primary workload container(s). - If containers defined within the underlying Job, Deployment, or Ray spec do not specify an image, this image will be used. - If this field is also empty, the latest tag of ghcr.io/silogen/rocm-ray is used.
`imagePullSecrets` LocalObjectReference array	ImagePullSecrets is a list of Kubernetes `LocalObjectReference` (containing just the secret `name`) referencing secrets needed to pull the container image(s). These are added to the `imagePullSecrets` field of the PodSpec for all generated pods.
`env` EnvVar array	Env is a list of Kubernetes `EnvVar` structs. These environment variables are added to the primary workload container(s) in the generated pods. They are appended to any environment variables already defined in the underlying Job, Deployment, or Ray spec.
`secretVolumes` SecretVolume array	SecretVolumes allows you to mount specific keys from Kubernetes Secrets as files into the workload containers.
`ray` boolean	Ray determines whether the operator should use RayCluster for workload execution. If `true`, Kaiwo will create Ray-specific resources. If `false` (default), Kaiwo will create standard Kubernetes resources (BatchJob for `KaiwoJob`, Deployment for `KaiwoService`). This setting dictates which underlying spec (`job`/`rayJob` or `deployment`/`rayService`) is primarily used.	false
`storage` StorageSpec	Storage configures persistent storage using Kubernetes PersistentVolumeClaims (PVCs). Enabling `storage.data.download` or `storage.huggingFace.preCacheRepos` will cause Kaiwo to create a temporary Kubernetes Job (the "download job") before starting the main workload. This job runs a container that performs the downloads into the respective PVCs. The main workload only starts after the download job completes successfully.
`dangerous` boolean	If Dangerous is set to `true`, Kaiwo will not add the default `PodSecurityContext` (which normally sets `runAsUser: 1000`, `runAsGroup: 1000`, `fsGroup: 1000`) to the generated pods. Use this only if you need to run containers as root or a different specific user and understand the security implications.	false
`clusterQueue` string	ClusterQueue specifies the name of the Kueue `ClusterQueue` that the workload should be submitted to for scheduling and resource management. This value is set as the `kueue.x-k8s.io/queue-name` label on the underlying resources. If omitted, it defaults to the value specified by the `DEFAULT_CLUSTER_QUEUE_NAME` environment variable in the Kaiwo controller (typically "kaiwo"), which is set during installation. Note: if the applied KaiwoQueueConfig includes no quota for the default queue, workloads that fall back on it will not run. The `kaiwo submit` CLI command can override this using the `--queue` flag or the `clusterQueue` field in the `kaiwoconfig.yaml` file.
`priorityClass` string	WorkloadPriorityClass specifies the name of Kueue `WorkloadPriorityClass` to be assigned to the job's pods. This influences the scheduling priority relative to other pods in the cluster.
`entrypoint` string	EntryPoint defines the command or script that the primary container in the job's pod(s) should execute. It can be a multi-line string. Shell script shebangs (`#!/bin/bash`) are detected. For standard Kubernetes Jobs (`ray: false`), this populates the `command` and `args` fields of the container spec (typically `["/bin/sh", "-c", "<entrypoint_script>"]`). For RayJobs (`ray: true`), this populates the `rayJob.spec.entrypoint` field. For RayJobs, this must reference a Python script. This overrides any default command specified in the container image or the underlying `job` or `rayJob` spec sections if they are also defined.
`rayJob` RayJob	RayJob defines the RayJob configuration. If this field is present (or if `spec.ray` is `true`), Kaiwo will create a `RayJob` resource instead of a standard `batchv1.Job`. Common fields like `image`, `resources`, `gpus`, `replicas`, etc., will be merged into this spec, potentially overriding values defined here unless explicitly configured otherwise. This provides fine-grained control over the Ray cluster configuration (head/worker groups) and Ray job submission parameters.
`job` Job	Job defines the Kubernetes Job configuration. If this field is present and `spec.ray` is `false`, Kaiwo will use this as the base for the created `batchv1.Job`. Common fields like `image`, `resources`, `gpus`, `entrypoint`, etc., will be merged into this spec, potentially overriding values defined here. This provides fine-grained control over standard Kubernetes Job parameters like `backoffLimit`, `ttlSecondsAfterFinished`, pod template details, etc.
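
Putting the pieces together, a small non-Ray `KaiwoJob` could look like the sketch below. The resource name, user, image, and script are hypothetical. Since `job` is omitted, Kaiwo would generate the underlying `batchv1.Job` itself and populate its container `command` and `args` from `entrypoint`.

```yaml
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoJob
metadata:
  name: example-training-job                  # hypothetical name
spec:
  user: alice@example.com                     # hypothetical user
  image: registry.example.com/team/train:1.0  # hypothetical image
  gpus: 1
  ray: false                                  # standard Kubernetes Job rather than a RayJob
  entrypoint: |
    #!/bin/bash
    python train.py --epochs 1
```

Setting `ray: true` instead would route `entrypoint` to `rayJob.spec.entrypoint`, in which case it must reference a Python script.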

KaiwoJobStatus

KaiwoJobStatus defines the observed state of KaiwoJob.

Appears in: - KaiwoJob

Field	Description	Default	Validation
`startTime` Time	StartTime records the timestamp when the first pod associated with the workload started running.
`conditions` Condition array	Conditions lists the observed conditions of the workload resource, following standard Kubernetes conventions. May include conditions reflecting the underlying Deployment or RayService state.
`status` WorkloadStatus	Status reflects the current high-level phase of the workload lifecycle (e.g., PENDING, STARTING, READY, FAILED).
`duration` integer	Duration indicates how long the service has been running since StartTime, in seconds. Calculated periodically while running.
`observedGeneration` integer	ObservedGeneration records the `.metadata.generation` of the workload resource that was last processed by the controller.
`completionTime` Time	CompletionTime records the timestamp when the KaiwoJob finished execution (either successfully or with failure).

KaiwoQueueConfig

KaiwoQueueConfig manages Kueue resources like ClusterQueues, ResourceFlavors, and WorkloadPriorityClasses based on its spec. It acts as a central configuration point for Kaiwo's integration with Kueue. Typically, only one cluster-scoped resource named 'kaiwo' should exist. The controller ensures that the specified Kueue resources are created, updated, or deleted to match the desired state defined here.

Appears in: - KaiwoQueueConfigList

Field	Description	Default	Validation
`apiVersion` string	`kaiwo.silogen.ai/v1alpha1`
`kind` string	`KaiwoQueueConfig`
`metadata` ObjectMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`spec` KaiwoQueueConfigSpec	Spec defines the desired state for Kueue resources managed by Kaiwo.
`status` KaiwoQueueConfigStatus	Status reflects the most recently observed state of the Kueue resource synchronization.

KaiwoQueueConfigList

KaiwoQueueConfigList contains a list of KaiwoQueueConfig resources.

Field	Description	Default	Validation
`apiVersion` string	`kaiwo.silogen.ai/v1alpha1`
`kind` string	`KaiwoQueueConfigList`
`metadata` ListMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`items` KaiwoQueueConfig array

KaiwoQueueConfigSpec

KaiwoQueueConfigSpec defines the desired configuration for Kaiwo's management of Kueue resources. There should typically be only one KaiwoQueueConfig resource in the cluster, named 'kaiwo'.

Appears in: - KaiwoQueueConfig

Field	Description	Validation
`clusterQueues` ClusterQueue array	ClusterQueues defines a list of Kueue ClusterQueues that Kaiwo should manage. Kaiwo ensures these ClusterQueues exist and match the provided specs.	MaxItems: 1000
`resourceFlavors` ResourceFlavorSpec array	ResourceFlavors defines a list of Kueue ResourceFlavors that Kaiwo should manage. Kaiwo ensures these ResourceFlavors exist and match the provided specs. If omitted or empty, Kaiwo attempts to automatically discover node pools and create default flavors based on node labels.	MaxItems: 20
`workloadPriorityClasses` WorkloadPriorityClass array	WorkloadPriorityClasses defines a list of Kueue WorkloadPriorityClasses that Kaiwo should manage. Kaiwo ensures these priority classes exist with the specified values. See Kueue documentation for `WorkloadPriorityClass`.	MaxItems: 20
`topologies` Topology array	Topologies defines a list of Kueue Topologies that Kaiwo should manage. Kaiwo ensures these Topologies exist with the specified values. See Kueue documentation for `Topology`.	MaxItems: 10
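
The following manifest is a minimal sketch of how these fields might fit together, not a recommended configuration. The namespace `team-a` and all quota numbers are hypothetical, and the contents of `clusterQueues[].spec` follow the upstream Kueue `ClusterQueueSpec` as an assumption rather than anything defined in this reference.

```yaml
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoQueueConfig
metadata:
  name: kaiwo                          # typically the only KaiwoQueueConfig in the cluster
spec:
  resourceFlavors:
    - name: amd-mi300-8gpu
      nodeLabels:
        kaiwo/nodepool: amd-gpu-nodes  # should correspond to actual node labels
  clusterQueues:
    - name: kaiwo
      namespaces:
        - team-a                       # hypothetical; Kaiwo manages a LocalQueue here
      spec:                            # Kueue ClusterQueueSpec (assumed upstream shape)
        namespaceSelector: {}
        resourceGroups:
          - coveredResources: ["cpu", "memory", "amd.com/gpu"]
            flavors:
              - name: amd-mi300-8gpu   # must match a flavor defined above
                resources:
                  - name: cpu
                    nominalQuota: 64          # illustrative quota
                  - name: memory
                    nominalQuota: 512Gi       # illustrative quota
                  - name: amd.com/gpu
                    nominalQuota: 8           # illustrative quota
```

Leaving `resourceFlavors` out entirely would, per the field description, let Kaiwo attempt to discover node pools and create default flavors itself.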

KaiwoQueueConfigStatus

KaiwoQueueConfigStatus represents the observed state of KaiwoQueueConfig.

Appears in: - KaiwoQueueConfig

Field	Description	Default	Validation
`conditions` Condition array	Conditions lists the observed conditions of the KaiwoQueueConfig resource, such as whether the managed Kueue resources are synchronized and ready.
`status` QueueConfigStatusDescription	Status reflects the overall status of the Kueue resource synchronization managed by this config (e.g., READY, FAILED).

KaiwoService

KaiwoService represents a long-running service workload managed by Kaiwo. It encapsulates either a standard Kubernetes Deployment or a RayService (via an AppWrapper), along with common metadata, storage configurations, and scheduling preferences. The Kaiwo controller reconciles this resource to create and manage the underlying workload objects.

Appears in: - KaiwoServiceList

Field	Description	Default	Validation
`apiVersion` string	`kaiwo.silogen.ai/v1alpha1`
`kind` string	`KaiwoService`
`metadata` ObjectMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`spec` KaiwoServiceSpec	Spec defines the desired state of the KaiwoService, including workload type (Deployment/RayService), configuration, resources, and common metadata.
`status` KaiwoServiceStatus	Status reflects the most recently observed state of the KaiwoService, including its phase, start time, duration, and conditions.

KaiwoServiceList

KaiwoServiceList contains a list of KaiwoService resources.

Field	Description	Default	Validation
`apiVersion` string	`kaiwo.silogen.ai/v1alpha1`
`kind` string	`KaiwoServiceList`
`metadata` ListMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`items` KaiwoService array

KaiwoServiceSpec

KaiwoServiceSpec defines the desired state of KaiwoService.

Appears in: - KaiwoService

Field	Description	Default
`user` string	User specifies the owner or creator of the workload. It should typically be the user's email address. This value is primarily used for labeling (`kaiwo.silogen.ai/user`) the generated resources (like Pods, Jobs, Deployments) for identification and filtering (e.g., with `kaiwo list --user <email>`). In the future, if authentication is enabled, this must be the email address of the authenticated user; the two are checked for a match.
`podTemplateSpecLabels` object (keys:string, values:string)	PodTemplateSpecLabels allows you to specify custom labels that will be added to the `template.metadata.labels` section of the generated Pods (within Jobs, Deployments, or RayCluster specs). Standard Kaiwo system labels (like `kaiwo.silogen.ai/user`, `kaiwo.silogen.ai/name`, etc.) are added automatically and take precedence if there are conflicts.
`gpus` integer	Gpus specifies the total number of GPUs allocated to the workload. See the Kaiwo scheduling documentation for more details on how this field impacts scheduling.	0
`gpuVendor` string	GpuVendor specifies the GPU vendor (e.g., amd, nvidia). See the Kaiwo scheduling documentation for more details on how this field impacts scheduling.	amd
`gpuModels` string array	GpuModels allows you to optionally specify the GPU models that your workload will run on. You can see available models either by running the CLI command `kaiwo status amd/nvidia` or by running the kubectl command `kubectl get nodes -o custom-columns=NAME:.metadata.name,MODEL:.metadata.labels.kaiwo\/gpu-model`. This field is used to filter the available nodes for scheduling. You can specify multiple models, and Kaiwo will select the best available node that matches one of the specified models.
`version` string	Version allows you to specify an optional version string for the workload. This can be useful for tracking different iterations or configurations of the same logical workload. It does not directly affect resource creation but serves as metadata.
`replicas` integer	Replicas specifies the number of replicas for the workload. See the Kaiwo scheduling documentation for more details on how this field impacts scheduling.	1
`gpusPerReplica` integer	GpusPerReplica specifies the number of GPUs allocated per replica. See the Kaiwo scheduling documentation for more details on how this field impacts scheduling. If you specify `gpusPerReplica`, you must also specify `replicas`.
`duration` Duration	Duration specifies the maximum duration over which the workload can run. This is useful for avoiding workloads running indefinitely.
`preferredTopologyLabel` string	PreferredTopologyLabel specifies the preferred topology label for scheduling the workload. This is used to influence how the workload is distributed across nodes in the cluster. If not specified, Kaiwo will use the default topology labels defined in the default topology of KaiwoQueueConfig, starting at the host level. The levels are evaluated one by one, going up from the level indicated by the label. If the PodSet cannot fit within a given topology level, the next level up is considered. If the PodSet cannot fit at the highest topology level, it is distributed among multiple topology domains.
`requiredTopologyLabel` string	RequiredTopologyLabel specifies the required topology label for scheduling the workload. This is used to ensure that the workload is scheduled on nodes that match the specified topology label.
`resources` ResourceRequirements	Resources specifies the default resource requirements applied to all pods in the workload. This field defines default Kubernetes `ResourceRequirements` (requests and limits for CPU, memory, ephemeral-storage) applied to all containers (including init containers) within the workload's pods. Behavior: These values act as defaults. If a container within the underlying Job, Deployment, or Ray spec (if provided by the user) already defines a specific request or limit (e.g., `memory` limit), the value from `resources` for that specific metric will not override it. Interaction with GPU fields: The GPU requests/limits (`amd.com/gpu` or `nvidia.com/gpu`) are controlled exclusively by the `gpus`, `gpusPerReplica`, and `gpuVendor` fields (and the associated calculation logic described above). Any GPU specifications within the `resources` field are ignored. Default CPU/Memory with GPUs: When Kaiwo generates the underlying Job/Deployment/RayCluster spec (i.e., the user did not provide `spec.job`, `spec.deployment`, or `spec.rayService`/`spec.rayJob`), and GPUs are requested (`gpusPerReplica` > 0), Kaiwo applies default CPU and Memory requests/limits based on the GPU count (e.g., 4 CPU cores and 32Gi Memory per GPU). These GPU-derived defaults will override any CPU/Memory settings defined in the `resources` field in this specific scenario. If the user does provide the underlying spec, these GPU-derived CPU/Memory defaults are not applied, respecting the user's definition or the values from the `resources` field.
`image` string	Image specifies the default container image to be used for the primary workload container(s). If containers defined within the underlying Job, Deployment, or Ray spec do not specify an image, this image is used. If this field is also empty, the latest tag of `ghcr.io/silogen/rocm-ray` is used.
`imagePullSecrets` LocalObjectReference array	ImagePullSecrets is a list of Kubernetes `LocalObjectReference` (containing just the secret `name`) referencing secrets needed to pull the container image(s). These are added to the `imagePullSecrets` field of the PodSpec for all generated pods.
`env` EnvVar array	Env is a list of Kubernetes `EnvVar` structs. These environment variables are added to the primary workload container(s) in the generated pods. They are appended to any environment variables already defined in the underlying Job, Deployment, or Ray spec.
`secretVolumes` SecretVolume array	SecretVolumes allows you to mount specific keys from Kubernetes Secrets as files into the workload containers.
`ray` boolean	Ray determines whether the operator should use RayCluster for workload execution. If `true`, Kaiwo will create Ray-specific resources. If `false` (default), Kaiwo will create standard Kubernetes resources (BatchJob for `KaiwoJob`, Deployment for `KaiwoService`). This setting dictates which underlying spec (`job`/`rayJob` or `deployment`/`rayService`) is primarily used.	false
`storage` StorageSpec	Storage configures persistent storage using Kubernetes PersistentVolumeClaims (PVCs). Enabling `storage.data.download` or `storage.huggingFace.preCacheRepos` will cause Kaiwo to create a temporary Kubernetes Job (the "download job") before starting the main workload. This job runs a container that performs the downloads into the respective PVCs. The main workload only starts after the download job completes successfully.
`dangerous` boolean	If Dangerous is set to `true`, Kaiwo will not add the default `PodSecurityContext` (which normally sets `runAsUser: 1000`, `runAsGroup: 1000`, `fsGroup: 1000`) to the generated pods. Use this only if you need to run containers as root or a different specific user and understand the security implications.	false
`clusterQueue` string	ClusterQueue specifies the name of the Kueue `ClusterQueue` that the workload should be submitted to for scheduling and resource management. This value is set as the `kueue.x-k8s.io/queue-name` label on the underlying resources. If omitted, it defaults to the value specified by the `DEFAULT_CLUSTER_QUEUE_NAME` environment variable in the Kaiwo controller (typically "kaiwo"), which is set during installation. Note: if the applied KaiwoQueueConfig includes no quota for the default queue, workloads that fall back on it will not run. The `kaiwo submit` CLI command can override this using the `--queue` flag or the `clusterQueue` field in the `kaiwoconfig.yaml` file.
`priorityClass` string	WorkloadPriorityClass specifies the name of the Kueue `WorkloadPriorityClass` to be assigned to the workload's pods. This influences the scheduling priority relative to other pods in the cluster.
`entrypoint` string	EntryPoint specifies the command or script executed in a Deployment. It can also be defined inside the Deployment struct as a regular command in the form of a string array. It is not used when `ray: true` (use `serveConfigV2` or the `rayService` spec instead for Ray entrypoints).
`serveConfigV2` string	ServeConfigV2 defines the applications and deployments to deploy. It should be a YAML multi-line scalar string. It can also be defined inside the RayService struct.
`rayService` RayService	RayService allows providing a full `rayv1.RayService` spec. If present (or `spec.ray` is `true`), Kaiwo creates a `RayService` (wrapped in an AppWrapper for Kueue integration) instead of a `Deployment`. Common fields are merged into the `RayClusterSpec` within this spec. Allows fine-grained control over the Ray cluster and Ray Serve configurations.
`deployment` Deployment	Deployment allows providing a full `appsv1.Deployment` spec. If present and `spec.ray` is `false`, this is used as the base for the created `Deployment`. Common fields are merged into this spec. Allows fine-grained control over Kubernetes Deployment parameters (strategy, selectors, pod template, etc.).
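
As a rough illustration of the fields above, the manifest below sketches a Deployment-backed KaiwoService. The resource name, namespace, email address, environment variable, and entrypoint command are hypothetical placeholders, and the `duration` string assumes the usual Kubernetes duration format.

```yaml
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: KaiwoService
metadata:
  name: demo-service                 # hypothetical
  namespace: team-a                  # hypothetical
spec:
  user: alice@example.com            # used for the kaiwo.silogen.ai/user label
  ray: false                         # default; a Deployment is created
  replicas: 2
  gpusPerReplica: 1                  # requires replicas to be set as well
  gpuVendor: amd
  image: ghcr.io/silogen/rocm-ray:latest
  entrypoint: python -m http.server 8080   # hypothetical command
  env:
    - name: LOG_LEVEL                # hypothetical variable
      value: info
  duration: 24h                      # assumed duration format
  clusterQueue: kaiwo
```

Because no `deployment` spec is supplied and `gpusPerReplica` is greater than zero, the `resources` description suggests Kaiwo would apply its GPU-derived CPU and memory defaults here (the reference gives 4 CPU cores and 32Gi of memory per GPU as an example).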

KaiwoServiceStatus

KaiwoServiceStatus defines the observed state of KaiwoService.

Appears in: - KaiwoService

Field	Description	Default	Validation
`startTime` Time	StartTime records the timestamp when the first pod associated with the workload started running.
`conditions` Condition array	Conditions lists the observed conditions of the workload resource, following standard Kubernetes conventions. May include conditions reflecting the underlying Deployment or RayService state.
`status` WorkloadStatus	Status reflects the current high-level phase of the workload lifecycle (e.g., PENDING, STARTING, RUNNING, FAILED).
`duration` integer	Duration indicates how long the service has been running since StartTime, in seconds. Calculated periodically while running.
`observedGeneration` integer	ObservedGeneration records the `.metadata.generation` of the workload resource that was last processed by the controller.

ObjectStorageDownloadSpec

ObjectStorageDownloadSpec aggregates download tasks for various object storage and Git sources within the DataStorageSpec.

Appears in: - DataStorageSpec

Field	Description	Default	Validation
`s3` S3DownloadItem array	S3 lists any S3 downloads.
`gcs` GCSDownloadItem array	GCS lists any Google Cloud Storage downloads.
`azureBlob` AzureBlobStorageDownloadItem array	AzureBlob lists any Azure Blob Storage downloads.
`git` GitDownloadItem array	Git lists any Git downloads.

QueueConfigStatusDescription

Underlying type: string

Appears in: - KaiwoQueueConfigStatus

Field	Description
`READY`
`FAILED`

ResourceFlavorSpec

ResourceFlavorSpec defines the configuration for a Kueue ResourceFlavor managed by Kaiwo.

Appears in: - KaiwoQueueConfigSpec

Field	Description	Validation
`name` string	Name specifies the name of the Kueue ResourceFlavor resource (e.g., "amd-mi300-8gpu").
`nodeLabels` object (keys:string, values:string)	NodeLabels specifies the labels a node must carry to host pods that request this flavor. This is used by Kueue for scheduling decisions. Keys and values should correspond to actual node labels. Example: `{"kaiwo/nodepool": "amd-gpu-nodes"}`	MaxProperties: 10
`taints` Taint array	Taints specifies a list of taints associated with this flavor.	MaxItems: 5
`tolerations` Toleration array	Tolerations specifies a list of tolerations associated with this flavor. This is less common than using Taints; Kueue primarily uses Taints to derive Tolerations.	MaxItems: 5
`topologyName` string	TopologyName specifies the name of the Kueue Topology that this flavor belongs to. If specified, it must match one of the Topologies defined in the KaiwoQueueConfig. This is used to group flavors by topology for scheduling purposes.

S3DownloadItem

S3DownloadItem defines parameters for downloading data from an S3-compatible object store.

Appears in: - DownloadTaskConfig - ObjectStorageDownloadSpec

Field	Description	Default	Validation
`endpointUrl` string	EndpointUrl specifies the S3 API endpoint URL (e.g., "https://s3.us-east-1.amazonaws.com" or a MinIO endpoint).
`accessKeyId` ValueReference	AccessKeyId optionally references a Kubernetes Secret containing the S3 access key ID. See `ValueReference`.
`secretKey` ValueReference	SecretKey optionally references a Kubernetes Secret containing the S3 secret access key. See `ValueReference`.
`buckets` CloudDownloadBucket array	Buckets lists the S3 buckets and the specific files/folders to download from them. See `CloudDownloadBucket`.
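
A single S3 download entry could look roughly like the fragment below, shown as the value of the `s3` field of `ObjectStorageDownloadSpec`. The Secret name, its keys, and the bucket name are hypothetical, and the `files`/`folders` entries are left out because their fields are documented under `CloudDownloadFile` and `CloudDownloadFolder`.

```yaml
s3:
  - endpointUrl: https://s3.us-east-1.amazonaws.com
    accessKeyId:
      secretName: s3-credentials     # hypothetical Secret
      secretKey: access-key-id       # hypothetical key within the Secret
    secretKey:
      secretName: s3-credentials
      secretKey: secret-access-key   # hypothetical key within the Secret
    buckets:
      - name: training-data          # hypothetical bucket
        # files: and folders: entries go here
```

Both credential fields are `ValueReference` objects, so the access key values themselves never appear in the manifest.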

SecretVolume

SecretVolume defines how to mount a specific key from a Kubernetes Secret into the workload's containers.

Appears in: - CommonMetaSpec - KaiwoJobSpec - KaiwoServiceSpec

Field	Description	Default	Validation
`name` string	Name defines the name of the Kubernetes Volume that will be created. Should be unique within the pod.
`secretName` string	SecretName specifies the name of the Kubernetes Secret resource to mount from.
`key` string	Key specifies the key within the Secret whose value should be mounted. If omitted, the entire secret might be mounted as files (depending on Kubernetes behavior).
`subPath` string	SubPath defines the filename within the `MountPath` directory where the secret `Key`'s content will be placed. Useful for mounting a single secret key as a file.
`mountPath` string	MountPath defines the directory path inside the container where the secret volume (or the `SubPath` file) should be mounted.
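
To make the interplay of `key`, `subPath`, and `mountPath` concrete, here is a small sketch of a `secretVolumes` entry. The volume name, Secret name, key, and paths are hypothetical.

```yaml
secretVolumes:
  - name: api-token                  # volume name, unique within the pod
    secretName: my-api-secret        # hypothetical Secret
    key: token                       # the single key to mount
    mountPath: /etc/secrets          # directory inside the container
    subPath: token                   # filename within mountPath
```

Going by the field descriptions, the value of the `token` key should then be readable inside the container at `/etc/secrets/token`.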

StorageSpec

StorageSpec defines the storage configuration for the workload.

Appears in: - CommonMetaSpec - KaiwoJobSpec - KaiwoServiceSpec

Field	Description	Default
`storageEnabled` boolean	StorageEnabled must be `true` to enable the creation of any PersistentVolumeClaims defined within this spec. If `false`, `data` and `huggingFace` sections are ignored.
`storageClassName` string	StorageClassName specifies the name of the Kubernetes `StorageClass` to use when creating PersistentVolumeClaims for `data` and `huggingFace` volumes. Must refer to an existing StorageClass in the cluster.
`accessMode` PersistentVolumeAccessMode	AccessMode determines the access mode (e.g., `ReadWriteOnce`, `ReadWriteMany`, `ReadOnlyMany`) for the created PersistentVolumeClaims. In a multi-node setting, ReadWriteMany is generally required, as pods scheduled on different nodes cannot access ReadWriteOnce PVCs. This is true even when `replicas: 1` if you are using download jobs, as the download pod may get scheduled on a different node than the main workload pod.	ReadWriteMany
`data` DataStorageSpec	Data configures the main data PersistentVolumeClaim and optional pre-download tasks for it.
`huggingFace` HfStorageSpec	HuggingFace configures a PersistentVolumeClaim specifically for caching Hugging Face models and datasets, with options for pre-caching.
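
The top-level storage settings might be combined as in the sketch below. The StorageClass name is hypothetical and must exist in the cluster; the `data` and `huggingFace` blocks are omitted here because their fields are defined by `DataStorageSpec` and `HfStorageSpec`.

```yaml
storage:
  storageEnabled: true               # without this, data and huggingFace are ignored
  storageClassName: shared-nfs       # hypothetical StorageClass
  accessMode: ReadWriteMany          # default; needed when pods land on different nodes
  # data: and huggingFace: sections go here
```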

Topology

Topology is the Schema for the topology API.

Appears in: - KaiwoQueueConfigSpec

Field	Description	Default	Validation
`metadata` ObjectMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`spec` TopologySpec			Required: {}

TopologySpec

Appears in: - Topology

Field	Description	Default	Validation
`levels` TopologyLevel array	levels define the levels of topology.		MaxItems: 8 MinItems: 1
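
The fragment below sketches how a topology and a flavor referencing it might appear inside a KaiwoQueueConfig `spec`. It assumes that `TopologyLevel` carries a `nodeLabel` field, as in the upstream Kueue Topology API, and that levels are listed from the broadest domain down to the host. The topology name and the block and rack label keys are placeholders.

```yaml
topologies:
  - metadata:
      name: default-topology                 # hypothetical
    spec:
      levels:                                # 1 to 8 levels allowed
        - nodeLabel: example.com/topology-block   # placeholder label key
        - nodeLabel: example.com/topology-rack    # placeholder label key
        - nodeLabel: kubernetes.io/hostname       # host level
resourceFlavors:
  - name: amd-mi300-8gpu
    nodeLabels:
      kaiwo/nodepool: amd-gpu-nodes
    topologyName: default-topology           # must match a topology defined above
```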

ValueReference

ValueReference provides a way to reference sensitive values stored in Kubernetes Secrets, typically used for credentials needed by download tasks.

Appears in: - AzureBlobStorageDownloadItem - GCSDownloadItem - GitDownloadItem - S3DownloadItem

Field	Description	Default	Validation
`file` string	File specifies the expected path within the download job's container where the secret value will be mounted as a file. This path is usually automatically generated by the controller based on SecretName and SecretKey.
`secretName` string	SecretName is the name of the Kubernetes Secret resource containing the value.
`secretKey` string	SecretKey is the key within the specified Secret whose value should be used.

WorkloadStatus

Underlying type: string

Appears in: - CommonStatusSpec - KaiwoJobStatus - KaiwoServiceStatus

Field	Description
``	WorkloadStatusNew indicates the resource has been created but not yet processed by the controller.
`DOWNLOADING`	WorkloadStatusDownloading indicates that the resource is currently running the download job.
`PENDING`	WorkloadStatusPending indicates the resource is waiting for prerequisites (like Kueue admission) to complete.
`STARTING`	WorkloadStatusStarting indicates the Kaiwo workload has been admitted, and the underlying workload (Job, Deployment, RayService) is being created or started.
`RUNNING`	WorkloadStatusRunning indicates the workload pods are running. For KaiwoJob, this means the job has started execution. For KaiwoService, pods are up but may not yet be fully ready/healthy.
`COMPLETE`	WorkloadStatusComplete indicates a KaiwoJob has finished successfully.
`ERROR`	WorkloadStatusError indicates the workload encountered an error which can be recovered from.
`FAILED`	WorkloadStatusFailed indicates the workload (KaiwoJob or KaiwoService) encountered an error and cannot proceed or recover.
`TERMINATING`	WorkloadStatusTerminating indicates that the workload should begin to terminate the underlying resources.
`TERMINATED`	WorkloadStatusTerminated indicates the workload has been terminated by the user or system. This can happen, for example, when the duration deadline has been met and there is demand pressure for GPUs.