Resource Monitoring
The Kaiwo Operator includes a resource monitoring utility which continuously watches your Kaiwo workloads (Jobs or Services) and checks their GPU utilization via metrics endpoints. If any pod of a workload that reserves GPUs is underutilizing the GPU, the operator marks the workload as Underutilized and emits an event. If the workload does not utilize the GPU for a given amount of time, it is automatically terminated. This termination feature is enabled by default if resource monitoring is enabled, but it can be disabled in case you want to implement your own termination logic.
In order for workloads to be monitored, they must be deployed via the Kaiwo CRDs (KaiwoJob or KaiwoService). This ensures that the created resources have the correct labels and are inspected by the resource monitor.
Configuration
Operator Environment Variables
Resource monitoring is enabled via environment variables given to the Kaiwo operator:
| Parameter | Description | Default |
|---|---|---|
| `RESOURCE_MONITORING_ENABLED` | Enable or disable monitoring (`true`/`false`) | `false` |
| `RESOURCE_MONITORING_METRICS_ENDPOINT` | URL of your metrics endpoint | (required) |
| `RESOURCE_MONITORING_POLLING_INTERVAL` | How often to check metrics (e.g. `30s`, `1m`) | (required) |
Note
Setting a very long polling interval for workloads that only use GPUs occasionally may cause false early terminations if the GPU happens to be idle during the polling check. Ensure your polling interval is short enough to catch GPU usage for your workload.
These options are set as operator environment variables and cannot be changed at runtime.
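As a minimal sketch, the variables could be set on the operator Deployment, for example as below. The Deployment name, namespace, container name, and endpoint URL are assumptions; adjust them to match your installation.

```yaml
# Hypothetical excerpt of the Kaiwo operator Deployment; names and the
# endpoint URL are assumptions and may differ in your installation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kaiwo-controller-manager
  namespace: kaiwo-system
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            - name: RESOURCE_MONITORING_ENABLED
              value: "true"
            - name: RESOURCE_MONITORING_METRICS_ENDPOINT
              value: "http://prometheus.monitoring.svc:9090"  # example metrics endpoint URL
            - name: RESOURCE_MONITORING_POLLING_INTERVAL
              value: "1m"
```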
resourceMonitoring field in KaiwoConfig
Please see the CRD documentation for the available runtime configuration options for resource monitoring. Changes to these fields take effect immediately.
By default, terminateUnderutilizingAfter is set to 24 hours and lowUtilizationThreshold is set to 1 (percent). This means that a workload is terminated only if its GPU utilization stays below 1% for 24 hours straight; if it reaches at least 1% GPU utilization at least once during that window, it is not terminated. These defaults should most likely be adjusted to suit your environment.
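As an illustrative sketch, a KaiwoConfig using these fields might look like the following. The apiVersion and object name are assumptions; check the CRD documentation for the exact schema.

```yaml
# Hypothetical KaiwoConfig excerpt; apiVersion and metadata are assumptions,
# but the field names are the ones described in this section.
apiVersion: config.kaiwo.silogen.ai/v1alpha1
kind: KaiwoConfig
metadata:
  name: kaiwo
spec:
  resourceMonitoring:
    terminateUnderutilizing: true        # terminate workloads that stay underutilized (default)
    terminateUnderutilizingAfter: "1h"   # terminate after one hour of continuous low utilization
    lowUtilizationThreshold: 10          # flag GPU utilization below 10% as low
```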
Terminating Underutilizing Workloads
If the KaiwoConfig field spec.resourceMonitoring.terminateUnderutilizing is true (the default), a workload that has been underutilizing one or more of its GPUs continuously for the time specified in spec.resourceMonitoring.terminateUnderutilizingAfter is flagged for termination: the early termination condition is set and the status is set to TERMINATING. The Kaiwo operator then takes care of deleting the dependent resources, but keeps the Kaiwo workload object so that the reason for the termination can still be inspected.
Status Conditions
Once monitoring begins, each KaiwoWorkload has a condition under .status.conditions:
```yaml
- type: ResourceUnderutilization
  status: "False"               # "True" means Underutilized, "False" means Normal
  reason: GpuUtilizationNormal  # or GpuUtilizationLow
  message: "GPU utilization normal"
```
If a workload is flagged for early termination, it will have an additional condition:
```yaml
- type: WorkloadTerminatedEarly
  status: "True"
  reason: GpuUtilizationLow
  message: "Early termination due to low GPU usage"
```
Best Practices
- Right-size your thresholds: choose a sensible cutoff (e.g. 10–30%) so that you catch idle pods without false positives.
- Namespace filtering: use the KaiwoConfig field spec.resourceMonitoring.targetNamespaces to restrict monitoring to critical workloads only (see the sketch below).
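A minimal sketch of namespace filtering, assuming the same hypothetical KaiwoConfig object as above and that the field takes a list of namespace names:

```yaml
spec:
  resourceMonitoring:
    targetNamespaces:   # assumed list format; only these namespaces are monitored
      - ml-training
      - ml-inference
```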
Troubleshooting
- No status updates?
  - Ensure RESOURCE_MONITORING_ENABLED=true and that RESOURCE_MONITORING_METRICS_ENDPOINT is reachable.
  - Check the operator logs for query errors.
- Excessive events?
  - Lower lowUtilizationThreshold to reduce sensitivity.