Resource Monitoring

The Kaiwo Operator includes a resource monitoring utility which continuously watches your Kaiwo workloads (Jobs or Services) and checks their GPU utilization via metrics endpoints. If any pod of a workload that reserves GPUs is underutilizing the GPU, the operator marks the workload as Underutilized and emits an event. If the workload does not utilize the GPU for a given amount of time, it is automatically terminated. This termination feature is enabled by default if resource monitoring is enabled, but it can be disabled in case you want to implement your own termination logic.

In order for workloads to be monitored, they must be deployed via Kaiwo CRDs (KaiwoJob or KaiwoService). This ensures that the created resources have the correct labels and are inspected by the resource monitor.
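
For illustration, a minimal KaiwoJob that reserves a GPU might look roughly like the sketch below. The apiVersion and field names are assumptions for illustration only; consult the KaiwoJob CRD reference for the exact schema.

apiVersion: kaiwo.silogen.ai/v1alpha1       # assumption: check the API version of your installed CRDs
kind: KaiwoJob
metadata:
  name: example-training-job
  namespace: ml-team                        # placeholder namespace
spec:
  image: ghcr.io/example/trainer:latest     # placeholder image
  gpus: 1                                   # assumed field for requesting GPUs
  entrypoint: python train.py               # assumed field for the command to run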

Configuration

Operator Environment Variables

Resource monitoring is enabled via environment variables passed to the Kaiwo operator:

| Parameter | Description | Default |
|---|---|---|
| RESOURCE_MONITORING_ENABLED | Enable or disable monitoring (true/false) | false |
| RESOURCE_MONITORING_METRICS_ENDPOINT | URL of your metrics endpoint | (required) |
| RESOURCE_MONITORING_POLLING_INTERVAL | How often to check metrics (e.g. 30s, 1m) | (required) |

Note

Setting a very long polling interval for workloads that only use GPUs occasionally may cause false early terminations if the GPU happens to be idle during each polling check. Ensure that your polling interval is short enough to catch GPU usage for your workloads.

These options are set as operator environment variables and cannot be changed at runtime.
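
For example, a minimal sketch of the corresponding container environment in the operator Deployment could look like the following; the metrics endpoint URL is only a placeholder, replace it with your own endpoint:

env:
  - name: RESOURCE_MONITORING_ENABLED
    value: "true"
  - name: RESOURCE_MONITORING_METRICS_ENDPOINT
    value: "http://metrics.example.svc:9090"   # placeholder, not a value shipped with Kaiwo
  - name: RESOURCE_MONITORING_POLLING_INTERVAL
    value: "30s"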

resourceMonitoring field in KaiwoConfig

Please see the CRD documentation for the runtime configuration options available for resource monitoring. Changes to these fields take effect immediately.

By default, terminateUnderutilizingAfter is set to 24 hours and lowUtilizationThreshold is set to 1 (percent). This means a workload is terminated only if its GPU utilization stays below 1% continuously for 24 hours. These values should most likely be adjusted to suit your environment.
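
As a sketch, a KaiwoConfig that tightens these defaults might look like the following; the apiVersion, object name, and duration format are assumptions, so check the CRD documentation for the exact schema:

apiVersion: config.kaiwo.silogen.ai/v1alpha1   # assumption: check your installed CRD version
kind: KaiwoConfig
metadata:
  name: kaiwo                        # placeholder object name
spec:
  resourceMonitoring:
    lowUtilizationThreshold: 20      # percent; below this a GPU counts as underutilized
    terminateUnderutilizingAfter: 2h # continuous underutilization allowed before termination
    terminateUnderutilizing: true    # the default; set to false to implement your own logic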

Terminating Underutilizing Workloads

If the KaiwoConfig field spec.resourceMonitoring.terminateUnderutilizing is true (the default), then once a workload has been underutilizing one or more GPUs continuously for the duration specified in spec.resourceMonitoring.terminateUnderutilizingAfter, it is flagged for termination: the early termination condition is set and the status is set to TERMINATING. The Kaiwo operator then deletes the dependent resources but keeps the Kaiwo workload object so that the reason for termination can be inspected.

Status Conditions

Once monitoring begins, each KaiwoWorkload will have a condition under .status.conditions:

- type: ResourceUnderutilization
  status: "False"    # “True” means Underutilized, “False” means Normal
  reason: GpuUtilizationNormal    # or GpuUtilizationLow
  message: "GPU utilization normal"

If a workload is flagged for early termination, it will have an additional condition:

- type: WorkloadTerminatedEarly
  status: "True"
  reason: GpuUtilizationLow
  message: "Early termination due to low GPU usage"

Best Practices

  • Right-size your thresholds
    Choose a sensible cutoff (e.g. 10–30%) so that you catch idle pods without false positives.
  • Namespace filtering
    Use the KaiwoConfig field spec.resourceMonitoring.targetNamespaces to restrict monitoring to critical workloads only.
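
For example, a sketch of namespace filtering in KaiwoConfig (namespace names are placeholders):

spec:
  resourceMonitoring:
    targetNamespaces:
      - ml-team-prod        # placeholder namespace
      - ml-team-research    # placeholder namespace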

Troubleshooting

  • No status updates?
    • Ensure RESOURCE_MONITORING_ENABLED=true and that RESOURCE_MONITORING_METRICS_ENDPOINT is reachable from the operator.
    • Check the operator logs for query errors (see the sketch after this list).
  • Excessive events?
    • Increase lowUtilizationThreshold to reduce sensitivity.
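
As a sketch, the operator logs can be inspected as follows; the namespace and deployment name are assumptions that depend on how Kaiwo was installed:

kubectl logs -n kaiwo-system deployment/kaiwo-controller-manager   # adjust namespace/name to your installation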