Overview for Administrators
Kaiwo provides a layer on top of Kubernetes, Kueue, and Ray to streamline the management and execution of AI workloads, particularly focusing on efficient GPU utilization. As an administrator, your role involves deploying, configuring, and maintaining the Kaiwo system.
Key Components and Concepts
- Kaiwo Operator:
    - Runs as a deployment within the Kubernetes cluster.
    - Manages the lifecycle of Kaiwo Custom Resources (KaiwoJob, KaiwoService, KaiwoQueueConfig).
    - Controllers: Includes a dedicated controller for each CRD:
        - KaiwoJobController: Translates KaiwoJob into batchv1.Job or rayv1.RayJob, manages dependencies (like download jobs, PVCs), and updates status.
        - KaiwoServiceController: Translates KaiwoService into appsv1.Deployment or rayv1.RayService (wrapped in an AppWrapper), manages dependencies, and updates status.
        - KaiwoQueueConfigController: Manages Kueue resources (ClusterQueue, ResourceFlavor, WorkloadPriorityClass) based on the cluster-scoped KaiwoQueueConfig CRD. Ensures a default configuration exists.
    - Integration: Interacts with the Kubernetes API, Kueue, and Ray operators.
- Kaiwo CRDs:
    - KaiwoJob / KaiwoService: User-facing resources defined by AI Scientists to describe their workloads. They abstract away much of the underlying Kubernetes/Ray/Kueue complexity.
    - KaiwoQueueConfig: A cluster-scoped resource (typically one named kaiwo) used by administrators to define and manage Kueue configurations centrally. This includes defining queues, resource types (flavors), and priorities.
- Kueue Integration:
    - Kaiwo relies on Kueue for job queueing, scheduling, and resource quota management.
    - The Kaiwo Operator, specifically the KaiwoQueueConfigController, manages the creation and synchronization of Kueue ClusterQueue, ResourceFlavor, and WorkloadPriorityClass resources based on the KaiwoQueueConfig CRD.
    - Workloads (KaiwoJob / KaiwoService) are submitted to a specific ClusterQueue (via the kueue.x-k8s.io/queue-name label, derived from spec.clusterQueue).
- Ray Integration:
    - If spec.ray: true is set in a KaiwoJob or KaiwoService, the operator creates RayJob or RayService resources instead of standard Kubernetes ones.
    - This leverages Ray for distributed execution capabilities. Requires the KubeRay operator to be installed.
- Kaiwo CLI:
    - The primary user interface for AI Scientists.
    - Communicates with the Kubernetes API to create and manage Kaiwo custom resources.
    - Requires kubeconfig access similar to kubectl.
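The translation rules described above (which underlying kind the operator creates, and how the Kueue label is derived) can be sketched in a few lines of Python. This is only an illustrative model working on plain dictionaries, not the operator's actual code: the function name plan_workload, the TARGET_KINDS table, and the queue name example-queue are hypothetical. Note also that in upstream Kueue the kueue.x-k8s.io/queue-name label names a LocalQueue; the sketch simply copies spec.clusterQueue into the label, as the text describes.

```python
from typing import Any, Dict

QUEUE_LABEL = "kueue.x-k8s.io/queue-name"

# Target kinds named in the text, keyed by (Kaiwo kind, value of spec.ray).
TARGET_KINDS = {
    ("KaiwoJob", False): "batchv1.Job",
    ("KaiwoJob", True): "rayv1.RayJob",
    ("KaiwoService", False): "appsv1.Deployment",
    ("KaiwoService", True): "rayv1.RayService",
}


def plan_workload(resource: Dict[str, Any]) -> Dict[str, Any]:
    """Return the kind that would be created and the Kueue label to attach.

    Hypothetical helper: it mirrors the rules stated in the text and
    performs no Kubernetes API calls.
    """
    kind = resource["kind"]
    spec = resource.get("spec", {})
    use_ray = bool(spec.get("ray", False))
    if (kind, use_ray) not in TARGET_KINDS:
        raise ValueError(f"unsupported kind: {kind}")

    labels = {}
    queue = spec.get("clusterQueue")
    if queue:
        # The label value is derived from spec.clusterQueue.
        labels[QUEUE_LABEL] = queue
    return {"target": TARGET_KINDS[(kind, use_ray)], "labels": labels}


if __name__ == "__main__":
    # "example-queue" is a made-up queue name used for illustration only.
    job = {"kind": "KaiwoJob", "spec": {"ray": True, "clusterQueue": "example-queue"}}
    print(plan_workload(job))
```

A lookup table keeps the four kind/ray combinations visible at a glance, which is handy when checking why a given workload ended up as a Ray resource or a standard Kubernetes one.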
Administrator Responsibilities
- Installation: Deploying the Kaiwo operator and its dependencies (Kueue, Ray Operator, Cert-Manager, GPU Operator, etc.).
- Configuration: Defining cluster-wide queuing policies, resource flavors (mapping to node types/pools), and priorities using the KaiwoQueueConfig CRD. Managing storage classes referenced by users.
- Maintenance: Upgrading Kaiwo components, monitoring operator health, managing certificates.
- Monitoring: Observing cluster resource utilization, queue lengths, and workload statuses. Integrating with monitoring tools like Prometheus.
- User Management: Potentially managing namespaces and ensuring users target appropriate Kueue queues.
- Troubleshooting: Diagnosing issues related to scheduling, resource allocation, operator errors, or workload failures.
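As a rough illustration of the monitoring task, queue lengths can be estimated by counting Kueue Workload objects that have been neither admitted nor finished. The sketch below is a minimal example, not part of Kaiwo: the helper names are hypothetical, and it assumes the input is the items list of Workload objects as JSON-decoded dictionaries (for example from kubectl get workloads.kueue.x-k8s.io -A -o json), relying on the upstream Kueue fields spec.queueName and status.conditions.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def has_condition(workload: Dict[str, Any], cond_type: str) -> bool:
    """True if the Workload has the given condition type with status "True"."""
    conditions = workload.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in conditions
    )


def pending_per_queue(workloads: Iterable[Dict[str, Any]]) -> Counter:
    """Count workloads that are neither admitted nor finished, per queue name."""
    counts: Counter = Counter()
    for wl in workloads:
        if has_condition(wl, "Admitted") or has_condition(wl, "Finished"):
            continue
        # Workloads without a queue name are grouped under a placeholder key.
        queue = wl.get("spec", {}).get("queueName", "<none>")
        counts[queue] += 1
    return counts


if __name__ == "__main__":
    # Hand-written sample data; queue names are illustrative only.
    sample = [
        {"spec": {"queueName": "example-queue"}, "status": {}},
        {
            "spec": {"queueName": "example-queue"},
            "status": {"conditions": [{"type": "Admitted", "status": "True"}]},
        },
    ]
    for queue, count in pending_per_queue(sample).items():
        print(f"{queue}: {count} pending")
```

For continuous monitoring, a Prometheus-based setup is likely a better fit than ad hoc scripts; a snippet like this is mainly useful for quick checks while troubleshooting.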