# Monitoring Kaiwo
Monitoring the Kaiwo operator and the workloads it manages is crucial for ensuring system health and performance.
## Operator Metrics
The Kaiwo operator exposes metrics in Prometheus format.
- Endpoint: By default, metrics are exposed on port `8080` (HTTP) or `8443` (HTTPS, if `--metrics-secure=true`, which is the default). The bind address can be configured via the `--metrics-bind-address` flag (it defaults to `0`, which disables the endpoint unless overridden). The `install.yaml` manifest typically configures this.
- Security: When `--metrics-secure=true`, the endpoint uses TLS. Certificates can be auto-generated by controller-runtime, managed by Cert-Manager, or provided manually via flags (`--metrics-cert-path`, etc.). Authentication and authorization can be enabled via `controller-runtime`'s filters (`metricsServerOptions.FilterProvider = filters.WithAuthenticationAndAuthorization`). RBAC for accessing the metrics endpoint needs to be configured separately (see `config/rbac/kustomization.yaml` in the source repository for examples).
- Key Metrics: The operator exposes standard controller-runtime metrics (e.g., reconcile times, errors, queue lengths) and potentially custom metrics related to Kaiwo operations.
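A quick way to verify the endpoint is to port-forward and query it directly. This is a minimal sketch: it assumes a metrics Service named `kaiwo-controller-manager-metrics-service` in the `kaiwo-system` namespace serving HTTPS on port 8443; adjust both to match your install.

```bash
# Forward the (assumed) metrics Service locally, then query it
kubectl -n kaiwo-system port-forward svc/kaiwo-controller-manager-metrics-service 8443:8443 &

# -k skips certificate verification (acceptable for self-signed certs in a quick check).
# If authentication/authorization filters are enabled, the request also needs a
# bearer token bound to the appropriate RBAC.
curl -k https://localhost:8443/metrics | head
```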
Integration with Prometheus:

- Ensure a Prometheus instance (e.g., Prometheus Operator, managed Prometheus service) is running in your cluster.
- Configure Prometheus to scrape the Kaiwo operator's metrics endpoint. This typically involves creating a `ServiceMonitor` or `PodMonitor` resource targeting the `kaiwo-controller-manager` service/pods in the `kaiwo-system` namespace (see the sketch after this list).
- If using TLS (`--metrics-secure=true`), configure the Prometheus scrape job with the appropriate TLS configuration (e.g., `insecure_skip_verify: true` for self-signed certs, or proper CA/client certs).

Consult the controller-runtime documentation and your Prometheus setup guide for detailed scraping configuration.
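The following `ServiceMonitor` is a minimal sketch, assuming the Prometheus Operator is installed, the metrics Service carries the label `control-plane: controller-manager`, and its HTTPS port is named `https`; adjust names and labels to match your install.

```bash
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kaiwo-controller-manager   # illustrative name
  namespace: kaiwo-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager   # assumed label on the metrics Service
  endpoints:
    - port: https
      scheme: https
      tlsConfig:
        insecureSkipVerify: true   # self-signed certs; use a proper CA in production
EOF
```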
## Operator Logs
Monitor the logs of the Kaiwo operator pod for errors or important events.
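For example, assuming the default `kaiwo-controller-manager` Deployment in the `kaiwo-system` namespace:

```bash
# Follow the operator logs (Deployment name and namespace assume the default install)
kubectl logs -n kaiwo-system deployment/kaiwo-controller-manager -f

# Show only error-level entries
kubectl logs -n kaiwo-system deployment/kaiwo-controller-manager | grep -i error
```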
Consider shipping these logs to a central logging system (e.g., Loki, Elasticsearch, Splunk) for easier analysis and alerting.
## Kueue Monitoring
Kueue also exposes its own metrics and has status conditions on its resources (`ClusterQueue`, `LocalQueue`, `Workload`). Monitoring Kueue is essential for understanding queue lengths, resource utilization, admission decisions, and potential bottlenecks.
- Kueue Metrics: Scrape metrics from the `kueue-controller-manager` similar to the Kaiwo operator.
- Queue Status: Check the status of `ClusterQueue` and `LocalQueue` resources (see the commands after this list).
- Workload Status: Inspect `Workload` resources created by Kueue for admitted/pending jobs.
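For example, using standard `kubectl` queries against Kueue's CRDs (namespace and names are placeholders):

```bash
# Queue status: capacity, admitted workloads, pending workloads
kubectl get clusterqueues
kubectl describe clusterqueue <cluster-queue-name>
kubectl get localqueues -n <namespace>

# Workload status: which jobs Kueue has admitted or is still holding
kubectl get workloads -n <namespace>
kubectl describe workload -n <namespace> <workload-name>
```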
Refer to the Kueue documentation for details on its metrics.
## Workload Status and Events
Monitor the status of the `KaiwoJob` and `KaiwoService` resources themselves:
```bash
kubectl get kaiwojobs -A
kubectl get kaiwoservices -A
kubectl describe kaiwojob -n <namespace> <job-name>
kubectl describe kaiwoservice -n <namespace> <service-name>
```
Check the `status` field for the overall phase (`PENDING`, `RUNNING`, `COMPLETE`, `FAILED`, `READY`) and `conditions`.
Also, monitor Kubernetes events related to Kaiwo resources and the underlying pods, jobs, and deployments.
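For example (namespace and object names are placeholders):

```bash
# Recent events in a workload's namespace, newest last
kubectl get events -n <namespace> --sort-by=.lastTimestamp

# Events for a specific object
kubectl get events -n <namespace> --field-selector involvedObject.name=<job-name>
```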
## Cluster Resource Utilization
Use standard Kubernetes monitoring tools (e.g., `kubectl top nodes`, `kubectl top pods`, Prometheus with `kube-state-metrics` and `node-exporter`) to track overall cluster CPU, memory, and GPU utilization. Pay special attention to GPU utilization on nodes designated for AI workloads.
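As a quick sketch, per-node GPU capacity can also be inspected directly with `kubectl`. This assumes the AMD resource name `amd.com/gpu`; substitute your vendor's resource name if different.

```bash
# Requires metrics-server for `kubectl top`
kubectl top nodes

# Allocatable GPUs per node (assumes the amd.com/gpu resource name)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.amd\.com/gpu'
```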
## Dashboards and Alerting
Create dashboards (e.g., in Grafana) combining metrics from the Kaiwo operator, Kueue, GPU operator, and standard Kubernetes components to get a holistic view of the system. Set up alerts based on key metrics or status conditions (e.g., high queue lengths, operator errors, low GPU utilization, failed workloads).
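As one hedged example, a Prometheus Operator `PrometheusRule` could alert on sustained reconcile errors using the standard controller-runtime metric `controller_runtime_reconcile_errors_total`; the resource name, namespace, and threshold below are illustrative, not prescribed by Kaiwo.

```bash
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kaiwo-operator-alerts   # illustrative name
  namespace: kaiwo-system
spec:
  groups:
    - name: kaiwo.rules
      rules:
        - alert: KaiwoReconcileErrors
          # controller_runtime_reconcile_errors_total is a standard controller-runtime metric
          expr: rate(controller_runtime_reconcile_errors_total[5m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Kaiwo operator is reporting reconcile errors
EOF
```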