Overview for Administrators
Kaiwo provides a layer on top of Kubernetes, Kueue, and Ray to streamline the management and execution of AI workloads, particularly focusing on efficient GPU utilization. As an administrator, your role involves deploying, configuring, and maintaining the Kaiwo system.
Key Components and Concepts
- Kaiwo Operator:
  - Runs as a deployment within the Kubernetes cluster.
  - Manages the lifecycle of Kaiwo Custom Resources (KaiwoJob, KaiwoService, KaiwoQueueConfig).
  - Controllers: Includes specific controllers for each CRD:
    - KaiwoJobController: Translates KaiwoJob into batchv1.Job or rayv1.RayJob, manages dependencies (like download jobs, PVCs), and updates status.
    - KaiwoServiceController: Translates KaiwoService into appsv1.Deployment or rayv1.RayService (wrapped in an AppWrapper), manages dependencies, and updates status.
    - KaiwoQueueConfigController: Manages Kueue resources (ClusterQueue, ResourceFlavor, WorkloadPriorityClass) based on the cluster-scoped KaiwoQueueConfig CRD. Ensures a default configuration exists.
  - Integration: Interacts with the Kubernetes API, Kueue, and Ray operators.
- Kaiwo CRDs:
  - KaiwoJob / KaiwoService: User-facing resources defined by AI Scientists to describe their workloads. They abstract away much of the underlying Kubernetes/Ray/Kueue complexity.
  - KaiwoQueueConfig: A cluster-scoped resource (typically one named kaiwo) used by administrators to define and manage Kueue configurations centrally. This includes defining queues, resource types (flavors), and priorities.
- Kueue Integration:
  - Kaiwo relies on Kueue for job queueing, scheduling, and resource quota management.
  - The Kaiwo Operator, specifically the KaiwoQueueConfigController, manages the creation and synchronization of Kueue ClusterQueue, ResourceFlavor, and WorkloadPriorityClass resources based on the KaiwoQueueConfig CRD.
  - Workloads (KaiwoJob / KaiwoService) are submitted to a specific ClusterQueue (via the kueue.x-k8s.io/queue-name label, derived from spec.clusterQueue).
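The label derivation described above can be illustrated with a minimal Python sketch. It is not the operator's actual code: the function name with_queue_label and the dictionary layout of the workload are hypothetical, chosen only for illustration. The only details taken from this page are the label key kueue.x-k8s.io/queue-name and the spec.clusterQueue field.

```python
# Illustrative sketch only; the real operator logic may differ.
QUEUE_NAME_LABEL = "kueue.x-k8s.io/queue-name"  # label key named in this document


def with_queue_label(workload: dict) -> dict:
    """Return the workload's labels plus the Kueue queue label.

    `workload` is a hypothetical dict form of a KaiwoJob or KaiwoService.
    The queue name is read from spec.clusterQueue; if that field is
    missing, the labels are returned unchanged.
    """
    queue = workload.get("spec", {}).get("clusterQueue")
    labels = dict(workload.get("metadata", {}).get("labels", {}))
    if queue:
        labels[QUEUE_NAME_LABEL] = queue
    return labels


# Hypothetical workload; the names "train" and "team-a" are placeholders.
job = {
    "kind": "KaiwoJob",
    "metadata": {"name": "train", "labels": {"app": "demo"}},
    "spec": {"clusterQueue": "team-a"},
}
labels = with_queue_label(job)
```

Here labels contains the original app label plus kueue.x-k8s.io/queue-name set to team-a. When diagnosing a workload that is not being admitted, checking that this label matches an existing queue is a reasonable first step.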
- Ray Integration:
  - If spec.ray: true is set in a KaiwoJob or KaiwoService, the operator creates RayJob or RayService resources instead of standard Kubernetes ones.
  - This leverages Ray for distributed execution capabilities. Requires the KubeRay operator to be installed.
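The effect of spec.ray on the generated resource can be summarized as a small lookup table. The Python sketch below is an illustration of the mapping described on this page, not the operator's implementation; the function name target_kind is hypothetical, and details such as the AppWrapper around a RayService are left out.

```python
# Illustrative sketch only: maps a Kaiwo resource kind and its spec.ray flag
# to the underlying resource type named in this document.
TARGETS = {
    ("KaiwoJob", False): "batchv1.Job",
    ("KaiwoJob", True): "rayv1.RayJob",
    ("KaiwoService", False): "appsv1.Deployment",
    ("KaiwoService", True): "rayv1.RayService",
}


def target_kind(kaiwo_kind: str, ray: bool) -> str:
    """Return the resource type the operator is described as creating."""
    try:
        return TARGETS[(kaiwo_kind, ray)]
    except KeyError:
        raise ValueError(f"unsupported Kaiwo kind: {kaiwo_kind!r}") from None


ray_job = target_kind("KaiwoJob", ray=True)
plain_service = target_kind("KaiwoService", ray=False)
```

In this sketch ray_job is rayv1.RayJob and plain_service is appsv1.Deployment. The two Ray rows are the ones that depend on the KubeRay operator being installed.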
- Kaiwo CLI:
  - The primary user interface for AI Scientists.
  - Communicates with the Kubernetes API to create and manage Kaiwo CRDs.
  - Requires kubeconfig access similar to kubectl.
Administrator Responsibilities
- Installation: Deploying the Kaiwo operator and its dependencies (Kueue, Ray Operator, Cert-Manager, GPU Operator, etc.).
- Configuration: Defining cluster-wide queuing policies, resource flavors (mapping to node types/pools), and priorities using the KaiwoQueueConfig CRD. Managing storage classes referenced by users.
- Maintenance: Upgrading Kaiwo components, monitoring operator health, managing certificates.
- Monitoring: Observing cluster resource utilization, queue lengths, and workload statuses. Integrating with monitoring tools like Prometheus.
- User Management: Potentially managing namespaces and ensuring users target appropriate Kueue queues.
- Troubleshooting: Diagnosing issues related to scheduling, resource allocation, operator errors, or workload failures.