Monitoring Kaiwo
Monitoring the Kaiwo operator and the workloads it manages is crucial for ensuring system health and performance.
Operator Metrics
The Kaiwo operator exposes metrics in Prometheus format.
- Endpoint: By default, metrics are exposed on port `8080` (HTTP) or `8443` (HTTPS, if `--metrics-secure=true`, which is the default). The bind address can be configured via the `--metrics-bind-address` flag (defaults to `0`, which disables the endpoint unless overridden). The `install.yaml` manifest typically configures this.
- Security: When `--metrics-secure=true`, the endpoint uses TLS. Certificates can be auto-generated by controller-runtime, managed by Cert-Manager, or provided manually via flags (`--metrics-cert-path`, etc.). Authentication and authorization can be enabled via controller-runtime's filters (`metricsServerOptions.FilterProvider = filters.WithAuthenticationAndAuthorization`). RBAC for accessing the metrics endpoint needs to be configured separately (see `config/rbac/kustomization.yaml` in the source repository for examples).
- Key Metrics: The operator exposes standard controller-runtime metrics (e.g., reconcile times, errors, queue lengths) and potentially custom metrics related to Kaiwo operations.
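To verify the endpoint is reachable before wiring up Prometheus, you can port-forward to the operator and request `/metrics` directly. This is a minimal sketch: the deployment name, namespace, and port are assumptions based on a default install, and with `--metrics-secure=true` and the authn/authz filters enabled you need a token for a ServiceAccount that is authorized to GET `/metrics` (see the RBAC note above).

```bash
# Port-forward to the operator (names assumed from a default install).
kubectl -n kaiwo-system port-forward deployment/kaiwo-controller-manager 8443:8443 &

# Request a token for a ServiceAccount that your metrics RBAC authorizes;
# "default" is only a placeholder here.
TOKEN=$(kubectl -n kaiwo-system create token default)

# -k skips certificate verification; acceptable only for self-signed dev certs.
curl -k -H "Authorization: Bearer ${TOKEN}" https://localhost:8443/metrics | head
```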
Integration with Prometheus:
- Ensure a Prometheus instance (e.g., Prometheus Operator, managed Prometheus service) is running in your cluster.
- Configure Prometheus to scrape the Kaiwo operator's metrics endpoint. This typically involves creating a `ServiceMonitor` or `PodMonitor` resource targeting the `kaiwo-controller-manager` service/pods in the `kaiwo-system` namespace (see the sketch below).
- If using TLS (`--metrics-secure=true`), configure the Prometheus scraping job with the appropriate TLS configuration (e.g., `insecure_skip_verify: true` for self-signed certs, or proper CA/client certs).
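A minimal `ServiceMonitor` sketch is shown below (requires the Prometheus Operator). The selector labels and port name are assumptions; match them to the metrics Service actually created by `install.yaml`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kaiwo-controller-manager
  namespace: kaiwo-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager   # assumed label on the metrics Service
  endpoints:
    - port: https                         # assumed port name on the Service
      scheme: https
      path: /metrics
      tlsConfig:
        insecureSkipVerify: true          # only acceptable for self-signed certs
```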
Consult the controller-runtime documentation and your Prometheus setup guide for detailed scraping configuration.
Operator Logs
Monitor the logs of the Kaiwo operator pod for errors or important events:
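For example, assuming the operator runs as the `kaiwo-controller-manager` Deployment in the `kaiwo-system` namespace (adjust names to your install):

```bash
# Follow the operator logs; add --previous to inspect a crashed container.
kubectl logs -n kaiwo-system deployment/kaiwo-controller-manager -f
```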
Consider shipping these logs to a central logging system (e.g., Loki, Elasticsearch, Splunk) for easier analysis and alerting.
Kueue Monitoring
Kueue also exposes its own metrics and has status conditions on its resources (ClusterQueue, LocalQueue, Workload). Monitoring Kueue is essential for understanding queue lengths, resource utilization, admission decisions, and potential bottlenecks.
- Kueue Metrics: Scrape metrics from the `kueue-controller-manager` similar to the Kaiwo operator.
- Queue Status: Check the status of `ClusterQueue` and `LocalQueue` resources (see the commands below).
- Workload Status: Inspect the `Workload` resources created by Kueue for admitted/pending jobs (see below).
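For example, a few commands for inspecting queue and workload state (resource names are placeholders):

```bash
# Queue-level view: ClusterQueue is cluster-scoped, LocalQueue is namespaced.
kubectl get clusterqueues
kubectl get localqueues -A
kubectl describe clusterqueue <clusterqueue-name>

# Workload objects created by Kueue for each job, including admission status.
kubectl get workloads -A
kubectl describe workload -n <namespace> <workload-name>
```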
Refer to the Kueue documentation for details on its metrics.
Workload Status and Events
Monitor the status of the KaiwoJob and KaiwoService resources themselves:
```bash
kubectl get kaiwojobs -A
kubectl get kaiwoservices -A
kubectl describe kaiwojob -n <namespace> <job-name>
kubectl describe kaiwoservice -n <namespace> <service-name>
```
Check the status field for the overall phase (PENDING, RUNNING, COMPLETE, FAILED, READY) and conditions.
Also, monitor Kubernetes events related to Kaiwo resources and the underlying pods/jobs/deployments:
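For example, to list recent events in a workload's namespace (names are placeholders):

```bash
# All events in the namespace, most recent last.
kubectl get events -n <namespace> --sort-by=.lastTimestamp

# Events involving a specific resource, e.g. a KaiwoJob or one of its pods.
kubectl get events -n <namespace> --field-selector involvedObject.name=<resource-name>
```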
Cluster Resource Utilization
Use standard Kubernetes monitoring tools (e.g., kubectl top nodes, kubectl top pods, Prometheus with kube-state-metrics and node-exporter) to track overall cluster CPU, memory, and GPU utilization. Pay special attention to GPU utilization on nodes designated for AI workloads.
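As a quick command-line check (the `amd.com/gpu` resource name is an assumption based on the AMD GPU device plugin; adjust it to the GPU resource name used in your cluster):

```bash
# Node-level CPU and memory usage (requires metrics-server).
kubectl top nodes

# Allocatable GPUs per node; dots in the extended resource name must be escaped.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.amd\.com/gpu}{"\n"}{end}'
```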
Dashboards and Alerting
Create dashboards (e.g., in Grafana) combining metrics from the Kaiwo operator, Kueue, GPU operator, and standard Kubernetes components to get a holistic view of the system. Set up alerts based on key metrics or status conditions (e.g., high queue lengths, operator errors, low GPU utilization, failed workloads).
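As a starting point, the sketch below defines a single alert on the standard controller-runtime metric `controller_runtime_reconcile_errors_total` (requires the Prometheus Operator); the namespace label, threshold, and duration are illustrative assumptions to adapt to your environment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kaiwo-operator-alerts
  namespace: kaiwo-system
spec:
  groups:
    - name: kaiwo-operator
      rules:
        - alert: KaiwoReconcileErrors
          # Fires if any reconcile errors were recorded over the last 10 minutes.
          expr: increase(controller_runtime_reconcile_errors_total{namespace="kaiwo-system"}[10m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Kaiwo operator is reporting reconcile errors
```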