Installation Guide
This guide provides clear, step‑by‑step instructions to install the Kaiwo operator and its dependencies on a Kubernetes cluster.
Prerequisites
- A running Kubernetes cluster (v1.22+ recommended)
- kubectl configured with cluster-admin privileges
- helm (for Helm-based install)
- git (if using the helper scripts)
Optional (for GPU workloads): GPU-capable nodes and the appropriate GPU operator (AMD or NVIDIA).
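The tool prerequisites above can be checked before starting. The following is a minimal sketch, not part of the Kaiwo tooling; the function name check_prereqs is hypothetical, and it only confirms that each tool is on PATH, not that its version is suitable.

```shell
# Report which of the given CLI tools are available on PATH.
# Returns non-zero if at least one tool is missing.
check_prereqs() {
  prereq_missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool" >&2
      prereq_missing=1
    fi
  done
  return "$prereq_missing"
}
```

Call it as `check_prereqs kubectl helm git`. Cluster-admin access can be probed separately with `kubectl auth can-i '*' '*' --all-namespaces`, which prints yes or no for the current kubeconfig context.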
Dependency Overview
Kaiwo requires several core Kubernetes components to function correctly:
- Cert-Manager: Manages TLS certificates for webhooks.
- GPU Operator: Ensures GPU drivers are installed and nodes are correctly labeled with GPU information.
  - AMD: AMD GPU Operator (includes Node Labeler).
  - NVIDIA: NVIDIA GPU Operator + GPU Feature Discovery.
- Kueue: Provides job queueing, fair sharing, and quota management. (Docs).
- KubeRay Operator: Required only if users will run Ray-based workloads (spec.ray: true). Manages Ray clusters. (Docs).
- AppWrapper: Used by Kueue to manage atomic scheduling of complex workloads, particularly Ray clusters/services. (GitHub).
- Prometheus (Recommended): For monitoring the Kaiwo operator and cluster metrics.
Installation Methods
There are two main phases: install dependencies (Step 1) and install Kaiwo (Step 2). Choose the option(s) that fit your environment.
Step 1: Install Dependencies
You can either install dependencies yourself or use the helper script (handy for dev/test).
Clone the repository and install dependencies using the script:
git clone https://github.com/silogen/kaiwo.git
cd kaiwo
dependencies/deploy.sh kind-test up # Use appropriate environment
Available environments:
- kind-test: For Kind/testing clusters
- tw-009-038: GPU environment example
- banff-sc-cx42-43: GPU environment example
Info
The GPU environments above are examples with hard-coded values for specific environments. To use the helper script with your own GPU cluster:
- Create a new environment file: dependencies/environments/<my-env>.yaml
- Create a new overlay: dependencies/kustomization-server-side/overlays/environments/<my-env>/kustomization.yaml
Then install: dependencies/deploy.sh <my-env> up
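The two files for a custom environment can be scaffolded as follows. This is a sketch under assumptions: scaffold_env is a hypothetical helper, and it only creates empty placeholders at the paths named above. This guide does not specify the file contents, so the existing example environments are the natural reference for what to put in them.

```shell
# Create empty placeholder files for a custom dependency environment.
# Usage: scaffold_env <repo-root> <env-name>
scaffold_env() {
  repo_root=$1
  env_name=$2
  env_file="$repo_root/dependencies/environments/$env_name.yaml"
  overlay_dir="$repo_root/dependencies/kustomization-server-side/overlays/environments/$env_name"

  mkdir -p "$(dirname "$env_file")" "$overlay_dir" || return 1

  # Never overwrite a file that already exists.
  [ -e "$env_file" ] || : > "$env_file"
  [ -e "$overlay_dir/kustomization.yaml" ] || : > "$overlay_dir/kustomization.yaml"

  # Print the paths that now need to be filled in.
  echo "$env_file"
  echo "$overlay_dir/kustomization.yaml"
}
```

From the repository root, `scaffold_env . my-env` prints the two paths to edit before running the deploy script.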
Step 2: Install the Kaiwo Operator
You can install Kaiwo via Helm (recommended) or by applying a prebuilt manifest with Kustomize.
Option A — Helm
Install from the OCI registry:
# Install latest version to kaiwo-system namespace
helm install kaiwo oci://ghcr.io/silogen/kaiwo-operator \
--namespace kaiwo-system --create-namespace
# Install a specific version
helm install kaiwo oci://ghcr.io/silogen/kaiwo-operator \
--version <version> \
--namespace kaiwo-system --create-namespace
Option B — Kustomize Manifests
Install the latest version:
kubectl apply -f https://github.com/silogen/kaiwo/releases/latest/download/install.yaml --server-side
Or install a specific version:
export KAIWO_VERSION=vX.Y.Z
kubectl apply -f https://github.com/silogen/kaiwo/releases/download/${KAIWO_VERSION}/install.yaml --server-side
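Both manifest URLs follow the same pattern, so a small helper can build them and catch a malformed version string before kubectl is called. This is an illustrative sketch; kaiwo_manifest_url is a hypothetical name, and the version check is deliberately loose.

```shell
# Print the install.yaml URL for a Kaiwo release.
# Usage: kaiwo_manifest_url [latest|vX.Y.Z]
kaiwo_manifest_url() {
  release=${1:-latest}
  base="https://github.com/silogen/kaiwo/releases"
  case "$release" in
    latest)
      echo "$base/latest/download/install.yaml" ;;
    v[0-9]*.[0-9]*.[0-9]*)
      # Loose check: a leading "v" followed by three dot-separated numeric parts.
      echo "$base/download/$release/install.yaml" ;;
    *)
      echo "unexpected version: $release (want latest or vX.Y.Z)" >&2
      return 1 ;;
  esac
}
```

For example, `kubectl apply -f "$(kaiwo_manifest_url "$KAIWO_VERSION")" --server-side` applies the manifest for the exported version.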
Install from a local build (useful for development):
This installs:
- Kaiwo CRDs (cluster-scoped)
  - kaiwojobs.kaiwo.silogen.ai
  - kaiwoservices.kaiwo.silogen.ai
  - kaiwoqueueconfigs.kaiwo.silogen.ai
  - kaiwoconfigs.config.kaiwo.silogen.ai
  - resourceflavors.kaiwo.silogen.ai
  - topologies.kaiwo.silogen.ai
- The Kaiwo controller Deployment in the kaiwo-system namespace
- RBAC rules (ClusterRole, Role, ClusterRoleBinding, RoleBinding)
- Webhook configurations and services
Verification
After installation, verify that all components are running correctly:
1. Check Dependencies
Verify that the dependency components are running (check only the ones you installed):
# Check Cert-Manager
kubectl get pods -n cert-manager
# Check Kueue
kubectl get pods -n kueue-system
# Check KubeRay (if Ray workloads are used)
kubectl get pods -A | grep kuberay-operator || true
# Check AppWrapper
kubectl get pods -n appwrapper-system
2. Check Kaiwo Operator
Ensure the Kaiwo controller manager pod is running:
kubectl get pods -n kaiwo-system
# Expected output:
# NAME READY STATUS RESTARTS AGE
# kaiwo-controller-manager-xxxxxxxxxx-xxxxx 2/2 Running 0 2m
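Instead of polling the pod list by hand, `kubectl wait` can block until the pods report Ready. The wrapper below is a sketch; wait_for_pods is a hypothetical name and the 120s default timeout is an arbitrary choice.

```shell
# Block until every pod in a namespace is Ready, or fail after a timeout.
# Usage: wait_for_pods <namespace> [timeout]
wait_for_pods() {
  pod_namespace=$1
  wait_timeout=${2:-120s}
  kubectl wait --for=condition=Ready pods --all \
    --namespace "$pod_namespace" --timeout="$wait_timeout"
}
```

Run `wait_for_pods kaiwo-system` after installing the operator. The same call works for the dependency namespaces checked earlier, such as cert-manager or kueue-system. Note that `kubectl wait` reports an error when the namespace contains no pods, and that pods which ran to completion never become Ready.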
3. Verify CRDs
Check that the Kaiwo Custom Resource Definitions are installed:
kubectl get crds | grep -E 'kaiwo\.silogen\.ai|config\.kaiwo\.silogen\.ai'
# Expected output (at minimum):
# kaiwojobs.kaiwo.silogen.ai
# kaiwoservices.kaiwo.silogen.ai
# kaiwoqueueconfigs.kaiwo.silogen.ai
# kaiwoconfigs.config.kaiwo.silogen.ai
# resourceflavors.kaiwo.silogen.ai
# topologies.kaiwo.silogen.ai
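The grep above shows what is present but leaves it to the reader to spot a missing entry. A stricter check, sketched here with the hypothetical function name check_kaiwo_crds, looks up each expected CRD by name and reports any that are absent.

```shell
# Look up each expected Kaiwo CRD by name.
# Prints found CRDs on stdout and missing ones on stderr;
# returns non-zero if any CRD is missing.
check_kaiwo_crds() {
  crd_rc=0
  for crd in \
    kaiwojobs.kaiwo.silogen.ai \
    kaiwoservices.kaiwo.silogen.ai \
    kaiwoqueueconfigs.kaiwo.silogen.ai \
    kaiwoconfigs.config.kaiwo.silogen.ai \
    resourceflavors.kaiwo.silogen.ai \
    topologies.kaiwo.silogen.ai
  do
    if kubectl get crd "$crd" >/dev/null 2>&1; then
      echo "found: $crd"
    else
      echo "missing: $crd" >&2
      crd_rc=1
    fi
  done
  return "$crd_rc"
}
```

Calling `check_kaiwo_crds` exits with a non-zero status when a CRD is missing, which makes it usable in CI or a post-install script.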
4. Check Default Configuration
The operator should automatically create a default KaiwoQueueConfig:
If this is missing, check the operator logs:
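The two checks above could be combined along the following lines. This is a sketch under assumptions, not the guide's original commands: the Deployment name kaiwo-controller-manager is inferred from the pod name shown in step 2, the resource is addressed by the full CRD name listed in step 3, and check_default_queue_config is a hypothetical helper name.

```shell
# List KaiwoQueueConfig resources; if none exist, show recent operator logs.
check_default_queue_config() {
  # --all-namespaces is harmless if the resource is cluster-scoped.
  found=$(kubectl get kaiwoqueueconfigs.kaiwo.silogen.ai \
    --all-namespaces -o name 2>/dev/null) || found=""
  if [ -n "$found" ]; then
    echo "$found"
    return 0
  fi
  echo "no KaiwoQueueConfig found; recent operator logs follow" >&2
  # Assumption: the operator Deployment is named kaiwo-controller-manager.
  kubectl logs --namespace kaiwo-system deployment/kaiwo-controller-manager \
    --all-containers --tail=100
  return 1
}
```

Running `check_default_queue_config` prints the resource names when a configuration exists, and otherwise prints the last 100 log lines from each operator container.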
If pods are pending or webhooks fail, see Troubleshooting.
Uninstallation
Remove Kaiwo Operator
For Helm installations:
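A minimal sketch of the Helm removal, assuming the release name kaiwo and the namespace kaiwo-system used by the install commands in Step 2. The function name uninstall_kaiwo_helm is hypothetical; after uninstalling, it lists any Kaiwo CRDs that remain on the cluster.

```shell
# Uninstall the Helm release, then list the Kaiwo CRDs left behind.
uninstall_kaiwo_helm() {
  # Release name and namespace match the `helm install` commands in Step 2.
  helm uninstall kaiwo --namespace kaiwo-system || return 1

  # Show CRDs that are still installed, if any.
  kubectl get crds -o name | grep 'kaiwo\.silogen\.ai' \
    || echo "no Kaiwo CRDs remain"
}
```

Run `uninstall_kaiwo_helm` and review the CRD list it prints before deciding whether to delete those CRDs.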
CRD Removal
Helm uninstall keeps CRDs by default. Deleting CRDs will remove all Kaiwo resources. Only delete CRDs if you intend to wipe all Kaiwo state.
For Kustomize installations:
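A manifest-based install is typically removed by deleting the same manifest that was applied. The sketch below assumes the operator was installed from a released install.yaml as in Option B; uninstall_kaiwo_manifest is a hypothetical helper name, and the version must match the one that was installed.

```shell
# Delete everything defined in a released install.yaml, CRDs included.
# Usage: uninstall_kaiwo_manifest <vX.Y.Z>
uninstall_kaiwo_manifest() {
  if [ -z "$1" ]; then
    echo "usage: uninstall_kaiwo_manifest <version>" >&2
    return 1
  fi
  kubectl delete -f \
    "https://github.com/silogen/kaiwo/releases/download/$1/install.yaml"
}
```

For example, `uninstall_kaiwo_manifest "$KAIWO_VERSION"` removes the release that was applied with the same variable.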
CRD Removal
Kustomize will delete CRDs, which removes all Kaiwo resources. Only delete CRDs if you intend to wipe all Kaiwo state.
Remove Dependencies
To remove dependencies:
cd kaiwo # Your cloned repository
dependencies/deploy.sh kind-test down # Use same environment as installation
Provide CLI to Users
Instruct your users (AI Scientists/Engineers) on how to download and install the kaiwo CLI tool. Point them to the User Quickstart guide or the CLI Installation instructions.
Next Steps
- Configure Kaiwo: Customize KaiwoQueueConfig and (optionally) KaiwoConfig to reflect your cluster’s hardware and policies. See the Configuration Guide.
- Set up Monitoring: Integrate Kaiwo operator metrics with your monitoring system (e.g., Prometheus). See the Monitoring Guide.
- Authentication: Ensure users have the necessary kubeconfig files and any required authentication plugins installed. See Authentication & Authorization.
- Troubleshooting: If something isn’t working, review common issues and fixes in Troubleshooting.