Installation Guide
This guide provides detailed steps for installing the Kaiwo operator and its dependencies on a Kubernetes cluster.
Prerequisites
- A running Kubernetes cluster (v1.22+ recommended).
- kubectl installed and configured with cluster-admin privileges.
- git (if cloning repositories).
- Go (if using Cluster Forge).
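Before starting, it can help to confirm that the tools listed above are actually on your PATH. The following is a minimal sketch, not part of Kaiwo: the helper name check_prereqs is hypothetical, and the commented kubectl commands assume your current context points at the target cluster.

```shell
# Preflight sketch: report which prerequisite tools are missing from PATH.
# check_prereqs is a hypothetical helper name, not something Kaiwo ships.
check_prereqs() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (run against your own environment):
#   check_prereqs kubectl git go
#   kubectl version                              # check the reported server version
#   kubectl auth can-i '*' '*' --all-namespaces  # prints "yes" when you hold cluster-admin
```

The function returns a non-zero status when any tool is missing, so it can gate the rest of an install script.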
Dependency Overview
Kaiwo requires several core Kubernetes components to function correctly.
- Cert-Manager: Manages TLS certificates for webhooks.
- GPU Operator:
- NVIDIA: NVIDIA GPU Operator + GPU Feature Discovery.
- AMD: AMD GPU Operator. (Includes Node Labeller).
- Ensures GPU drivers are installed and nodes are correctly labeled with GPU information.
- Kueue: Provides job queueing, fair sharing, and quota management. (Docs).
- KubeRay Operator: Required only if users will run Ray-based workloads (spec.ray: true). Manages Ray clusters. (Docs).
- AppWrapper: Used by Kueue to manage atomic scheduling of complex workloads, particularly Ray clusters/services. (GitHub).
- Prometheus (Recommended): For monitoring the Kaiwo operator and cluster metrics.
Step 1: Install Kaiwo and its dependencies
You can install the Kaiwo dependencies and operator in several ways. The options below are references that you can adapt to your environment and workflow.
Dependencies via convenience script
You can install the dependencies using the convenience script:
From the remote script
curl -sSL https://raw.githubusercontent.com/silogen/kaiwo/refs/heads/main/dependencies/install_dependencies.sh | bash -s --
Or if you have cloned the repository
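For the cloned-repository case, one possible approach is sketched below. It assumes the script lives at dependencies/install_dependencies.sh inside the clone, mirroring the path in the remote URL above, and that the repository URL is https://github.com/silogen/kaiwo.git. The helper name run_local_installer is hypothetical.

```shell
# Sketch: run the dependency installer from a local clone of the Kaiwo repository.
# Assumption: the script path matches the one in the remote URL
# (dependencies/install_dependencies.sh). run_local_installer is a hypothetical name.
run_local_installer() {
  local repo_dir="${1:-.}"
  local script="$repo_dir/dependencies/install_dependencies.sh"
  if [ ! -f "$script" ]; then
    echo "installer not found: $script" >&2
    return 1
  fi
  bash "$script"
}

# Usage:
#   git clone https://github.com/silogen/kaiwo.git
#   run_local_installer kaiwo
```

Running from a clone also lets you read the script before executing it, which is harder when piping curl output straight into bash.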
GPU Operator Not Included
You must install the AMD GPU Operator separately according to its documentation before running the convenience script or installing Kaiwo. Ensure node labeling features are enabled.
Kaiwo operator via install manifest
Once dependencies are ready, install the Kaiwo operator itself.
You can install the latest version via:
kubectl apply -f https://github.com/silogen/kaiwo/releases/latest/download/install.yaml --server-side
If you want to choose the release, follow these steps:
- Choose Release: Find the latest stable release tag on the Kaiwo GitHub Releases page.
- Apply Manifest: Use kubectl apply with the --server-side flag (recommended for managing large manifests and CRDs). Replace vX.Y.Z with your chosen release tag.
export KAIWO_VERSION=vX.Y.Z
kubectl apply -f https://github.com/silogen/kaiwo/releases/download/${KAIWO_VERSION}/install.yaml --server-side
This installs:
- Kaiwo CRDs (KaiwoJob, KaiwoService, KaiwoQueueConfig).
- The Kaiwo Controller Manager Deployment in the kaiwo-system namespace.
- RBAC rules (ClusterRole, Role, ClusterRoleBinding, RoleBinding).
- Webhook configurations (if enabled in the release).
- Service for webhooks/metrics.
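When pinning a release in automation, a typo in the tag produces a confusing download error. The sketch below builds the manifest URL from a tag and rejects values that do not look like vX.Y.Z. The helper name kaiwo_manifest_url is hypothetical, and the tag-shape check is an assumption based on the vX.Y.Z placeholder; loosen it if the project publishes tags in another form.

```shell
# Sketch: build the install manifest URL for a pinned release tag.
# kaiwo_manifest_url is a hypothetical helper; the URL pattern is the one
# used for pinned releases in this guide.
kaiwo_manifest_url() {
  local version="$1"
  case "$version" in
    v[0-9]*.[0-9]*.[0-9]*) ;;  # assumed tag shape: v<major>.<minor>.<patch>
    *) echo "expected a tag like vX.Y.Z, got: '$version'" >&2; return 1 ;;
  esac
  echo "https://github.com/silogen/kaiwo/releases/download/${version}/install.yaml"
}

# Usage, after exporting KAIWO_VERSION to a real tag:
#   kubectl apply -f "$(kaiwo_manifest_url "$KAIWO_VERSION")" --server-side
```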
Everything via Cluster Forge
Cluster Forge is a tool for managing Kubernetes stacks. You can use it to install Kaiwo and its dependencies.
- Clone the Cluster Forge repository:
git clone https://github.com/silogen/cluster-forge.git
- Navigate into the directory:
cd cluster-forge
- Ensure Go is installed (go version).
- Run the forge command, selecting kaiwo-all and optionally the relevant GPU operator (amd-gpu-operator).
- Deploy the selected stack.
- Verify pods in relevant namespaces (kaiwo-system, cert-manager, kueue-system, etc.).
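The final verification step can be sketched as a loop over the namespaces named above. The helper name list_stack_pods is hypothetical; extend the namespace list with whatever your GPU operator uses.

```shell
# Sketch: list pods in each namespace the stack is expected to populate.
# list_stack_pods is a hypothetical helper name.
list_stack_pods() {
  local ns
  for ns in "$@"; do
    echo "== $ns =="
    kubectl get pods -n "$ns"
  done
}

# Usage:
#   list_stack_pods kaiwo-system cert-manager kueue-system
```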
Manually
If you prefer to manage the dependencies yourself, inspect the /dependencies folder to see what is required, then install Kaiwo using the install.yaml manifest from the releases page.
Step 2: Verify Installation
- Check Operator Pod: Ensure the Kaiwo controller manager pod is running.
- Check CRDs: Verify that the Kaiwo Custom Resource Definitions are installed.
- Check Default QueueConfig: The operator should automatically create a default KaiwoQueueConfig. If this is missing, check the operator logs:
kubectl logs -n kaiwo-system -l control-plane=kaiwo-controller-manager
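The three checks above might look like the following. This is a sketch under stated assumptions: the pod label is taken from the logs command, the default KaiwoQueueConfig is assumed to be named kaiwo (the name used with kubectl edit under Next Steps), and verify_kaiwo is a hypothetical wrapper name.

```shell
# Sketch of the three verification checks; verify_kaiwo is a hypothetical name.
verify_kaiwo() {
  local ns="${1:-kaiwo-system}"
  # 1. Operator pod: look for STATUS Running in the output.
  kubectl get pods -n "$ns" -l control-plane=kaiwo-controller-manager || return 1
  # 2. CRDs: expect entries for KaiwoJob, KaiwoService, and KaiwoQueueConfig.
  kubectl get crds | grep -i kaiwo || return 1
  # 3. Default queue configuration (assumed to be named "kaiwo").
  kubectl get kaiwoqueueconfig kaiwo
}

# Usage:
#   verify_kaiwo              # defaults to the kaiwo-system namespace
```

The function stops at the first failing check, so a non-zero exit status tells you which stage to investigate.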
Step 3: Provide Kaiwo CLI to Users
Instruct your users (AI Scientists/Engineers) on how to download and install the kaiwo CLI tool. Point them to the User Quickstart guide or the CLI Installation instructions.
Next Steps
- Configure Kaiwo: Customize the default KaiwoQueueConfig (kubectl edit kaiwoqueueconfig kaiwo) to define appropriate Kueue ResourceFlavors and ClusterQueues reflecting your cluster's hardware and policies. See the Configuration Guide.
- Set up Monitoring: Integrate Kaiwo operator metrics with your monitoring system (e.g., Prometheus). See the Monitoring Guide.
- Authentication: Ensure users have the necessary kubeconfig files and any required authentication plugins installed. See Authentication & Authorization.