
Monitoring Kaiwo

Monitoring the Kaiwo operator and the workloads it manages is crucial for ensuring system health and performance.

Operator Metrics

The Kaiwo operator exposes metrics in Prometheus format.

  • Endpoint: By default, metrics are exposed on port 8080 (HTTP) or, when --metrics-secure=true (the default), on port 8443 (HTTPS). The bind address can be configured via the --metrics-bind-address flag; it defaults to 0, which disables the endpoint unless overridden, so the install.yaml manifest typically configures it explicitly. A quick way to verify the endpoint is sketched after this list.
  • Security: When --metrics-secure=true, the endpoint uses TLS. Certificates can be auto-generated by controller-runtime, managed by Cert-Manager, or provided manually via flags (--metrics-cert-path, etc.). Authentication and authorization can be enabled via controller-runtime's filters (metricsServerOptions.FilterProvider = filters.WithAuthenticationAndAuthorization). RBAC for accessing the metrics endpoint needs to be configured separately (see config/rbac/kustomization.yaml in the source repository for examples).
  • Key Metrics: The operator exposes standard controller-runtime metrics (e.g., reconcile times, errors, queue lengths) and potentially custom metrics related to Kaiwo operations.
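To verify that the endpoint is reachable before wiring up Prometheus, you can port-forward to the operator's metrics Service and query it directly. A minimal sketch, assuming the Service is named kaiwo-controller-manager-metrics-service and exposes port 8443 (check the Services created by install.yaml):

# Forward the metrics port locally (Service name and port are assumptions; adjust to your install)
kubectl -n kaiwo-system port-forward svc/kaiwo-controller-manager-metrics-service 8443:8443

# In another terminal, query the endpoint; -k skips certificate verification for self-signed certs
curl -k https://localhost:8443/metrics

# If authentication/authorization filters are enabled, pass a token from a ServiceAccount bound to the metrics-reader RBAC
curl -k -H "Authorization: Bearer $(kubectl create token <serviceaccount> -n kaiwo-system)" https://localhost:8443/metrics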

Integration with Prometheus:

  1. Ensure a Prometheus instance (e.g., Prometheus Operator, managed Prometheus service) is running in your cluster.
  2. Configure Prometheus to scrape the Kaiwo operator's metrics endpoint. This typically involves creating a ServiceMonitor or PodMonitor resource targeting the kaiwo-controller-manager service/pods in the kaiwo-system namespace (a minimal ServiceMonitor sketch follows this list).
  3. If using TLS (--metrics-secure=true), configure Prometheus scraping job with the appropriate TLS configuration (e.g., insecure_skip_verify: true for self-signed certs, or proper CA/client certs).
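A minimal ServiceMonitor sketch for the Prometheus Operator is shown below. The Service label (control-plane: kaiwo-controller-manager) and the port name (https) are assumptions; verify them against the metrics Service that install.yaml creates:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kaiwo-controller-manager
  namespace: kaiwo-system
spec:
  selector:
    matchLabels:
      control-plane: kaiwo-controller-manager   # assumed Service label; check with: kubectl get svc -n kaiwo-system --show-labels
  endpoints:
    - port: https                               # assumed port name on the metrics Service
      scheme: https
      path: /metrics
      tlsConfig:
        insecureSkipVerify: true                # acceptable for self-signed certs; prefer a proper CA in production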

Consult the controller-runtime documentation and your Prometheus setup guide for detailed scraping configuration.

Operator Logs

Monitor the logs of the Kaiwo operator pod for errors or important events:

kubectl logs -n kaiwo-system -l control-plane=kaiwo-controller-manager -f

Consider shipping these logs to a central logging system (e.g., Loki, Elasticsearch, Splunk) for easier analysis and alerting.

Kueue Monitoring

Kueue also exposes its own metrics and has status conditions on its resources (ClusterQueue, LocalQueue, Workload). Monitoring Kueue is essential for understanding queue lengths, resource utilization, admission decisions, and potential bottlenecks.

  • Kueue Metrics: Scrape metrics from the kueue-controller-manager similar to the Kaiwo operator.
  • Queue Status: Check the status of ClusterQueue and LocalQueue resources (a jsonpath shortcut for the headline counters appears below):
    kubectl get clusterqueue <queue-name> -o yaml
    kubectl get localqueue -n <namespace> <queue-name> -o yaml
    
  • Workload Status: Inspect Workload resources created by Kueue for admitted/pending jobs:
    kubectl get workloads -n <namespace>
    kubectl describe workload -n <namespace> <workload-name>
    

Refer to the Kueue documentation for details on its metrics.
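To pull just the headline counters from a ClusterQueue without reading the full YAML, a jsonpath query works well. The pendingWorkloads and admittedWorkloads fields are part of the ClusterQueue status in recent Kueue releases, but verify the names against your installed version:

# Print pending and admitted workload counts for a ClusterQueue (field names may vary by Kueue version)
kubectl get clusterqueue <queue-name> -o jsonpath='pending={.status.pendingWorkloads} admitted={.status.admittedWorkloads}{"\n"}'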

Workload Status and Events

Monitor the status of the KaiwoJob and KaiwoService resources themselves:

kubectl get kaiwojobs -A
kubectl get kaiwoservices -A

kubectl describe kaiwojob -n <namespace> <job-name>
kubectl describe kaiwoservice -n <namespace> <service-name>

Check the status field for the overall phase (PENDING, RUNNING, COMPLETE, FAILED, READY) and conditions.
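The exact layout of the status field is defined by the Kaiwo CRDs and can be inspected directly from the cluster with kubectl explain; a custom-columns query can then surface the phase across all namespaces. The .status.status path below is an assumption; substitute whatever field the CRD schema actually reports:

# Inspect the status schema of the KaiwoJob CRD
kubectl explain kaiwojob.status

# List jobs with their phase (the STATUS field path is an assumption; adjust to the actual schema)
kubectl get kaiwojobs -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.status'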

Also, monitor Kubernetes events related to Kaiwo resources and the underlying pods/jobs/deployments:

kubectl get events -n <namespace> --sort-by='.lastTimestamp'
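To narrow events down to a single Kaiwo resource or pod rather than scanning the whole namespace, a field selector on the involved object helps; the kind and name below are placeholders:

kubectl get events -n <namespace> --field-selector involvedObject.kind=KaiwoJob,involvedObject.name=<job-name> --sort-by='.lastTimestamp'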

Cluster Resource Utilization

Use standard Kubernetes monitoring tools (e.g., kubectl top nodes, kubectl top pods, Prometheus with kube-state-metrics and node-exporter) to track overall cluster CPU, memory, and GPU utilization. Pay special attention to GPU utilization on nodes designated for AI workloads.
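A quick way to see how many GPUs each node advertises is to read the node allocatable resources; the resource name below assumes AMD GPUs (amd.com/gpu), so substitute nvidia.com/gpu or your vendor's resource name as appropriate:

# Show allocatable GPUs per node (resource name depends on your GPU vendor/operator)
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.amd\.com/gpu'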

Dashboards and Alerting

Create dashboards (e.g., in Grafana) combining metrics from the Kaiwo operator, Kueue, GPU operator, and standard Kubernetes components to get a holistic view of the system. Set up alerts based on key metrics or status conditions (e.g., high queue lengths, operator errors, low GPU utilization, failed workloads).
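As a starting point for alerting, the PrometheusRule sketch below fires on sustained reconcile errors from the operator. It uses the standard controller-runtime counter controller_runtime_reconcile_errors_total; the rule name, namespace label, threshold, and severity are assumptions to adapt to your environment:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kaiwo-operator-alerts
  namespace: kaiwo-system
spec:
  groups:
    - name: kaiwo-operator
      rules:
        - alert: KaiwoReconcileErrors
          # Standard controller-runtime counter; the namespace label assumes Prometheus attaches it from the scrape target
          expr: sum by (controller) (rate(controller_runtime_reconcile_errors_total{namespace="kaiwo-system"}[5m])) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Kaiwo controller {{ $labels.controller }} is reporting reconcile errors"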