GPU Preemption
GPU preemption automatically detects idle GPU workloads and terminates them to free resources for pending work. Unlike Resource Monitoring, which is limited to Kaiwo CRDs (KaiwoJob / KaiwoService), GPU preemption works with any Kubernetes workload that requests AMD GPUs -- bare Pods, Jobs, Deployments, KaiwoJobs, RayJobs, or any other resource type.
The system relies on a CRD, GpuWorkload, which tracks each opted-in workload's GPU utilization and lifecycle. These CRs are created automatically when a qualifying Pod is scheduled, and they persist after the workload is deleted to provide an audit trail.
How It Works
flowchart LR
P[Pod created with<br>amd.com/gpu request] --> D{Owner or namespace<br>has preemption<br>annotation?}
D -- Yes --> G[GpuWorkload CR created]
D -- No --> X[Ignored]
G --> S[Metrics scraper updates<br>per-pod utilization]
S --> R[Reconciler computes<br>aggregated utilization]
R --> Ph{Phase?}
Ph -- Active --> W[Requeue]
Ph -- Idle past<br>grace-period --> E[Preemption evaluation]
E -- Always --> Del[Delete workload]
E -- OnPressure +<br>pending demand<br>+ fit check --> Del
- Discovery -- The controller watches all Pods. When a Pod requesting any
amd.com/*GPU resource (e.g.amd.com/gpu,amd.com/gpu-0, partitioned resources) is created, the controller resolves its root owner (walkingownerReferencesup through ReplicaSets, Jobs, etc.) and checks for preemption annotations on both the root owner and the pod's namespace. Namespace annotations serve as defaults; workload-level annotations override them. - Tracking -- A
GpuWorkloadCR is created for each unique root owner, capturing the GPU resource counts and annotation-derived settings. - Metrics -- A background scraper polls the AMD GPU operator's Prometheus endpoint (
gpu_gfx_activity) and writes per-pod utilization into theGpuWorkloadstatus. - Reconciliation -- The reconciler aggregates utilization, determines the phase (
PendingOther,PendingGpu,Active,Idle, etc.), and evaluates preemption when conditions are met. - Preemption -- When an idle workload becomes eligible, a Lease-guarded evaluation ensures only one controller replica runs the global preemption decision at a time, preventing race conditions and over-preemption.
AMD GPUs only
GPU preemption currently supports AMD GPU resources (amd.com/*). NVIDIA GPUs are not tracked by this feature.
Supported Owner Types
The controller resolves pod ownership by walking the ownerReferences chain. The operator has RBAC permissions for the following owner types out of the box:
batch/v1Jobsapps/v1Deployments, ReplicaSets, StatefulSetsray.io/v1RayClusters, RayJobs, RayServiceskaiwo.silogen.ai/v1alpha1KaiwoJobs, KaiwoServicesaim.silogen.ai/v1alpha1AIMServicesaim.eai.amd.com/v1alpha1AIMServicesworkload.codeflare.dev/v1beta2AppWrappers
For workloads owned by other resource types (e.g. CronJobs, DaemonSets, custom controllers), add get and delete permissions for those resources to the operator ServiceAccount. get is needed to resolve the owner chain and read annotations; delete is needed for preemption. Without get, the controller will log an RBAC error and skip the pod.
Configuration Hierarchy
Each setting is resolved through a 5-tier chain. The first non-empty value wins:
per-workload annotation > namespace annotation > KaiwoConfig > operator env var > hardcoded fallback
This lets you set cluster-wide runtime defaults (via KaiwoConfig), override them per namespace (via namespace annotations), and further override per workload (via workload annotations), while env vars and hardcoded values provide a safety net.
Enabling GPU Preemption
1. Operator startup: Helm values / env vars
Enable the feature and point it at the AMD GPU metrics exporter:
gpuPreemption:
enabled: true
# Replace with the actual in-cluster URL of your AMD GPU metrics exporter.
metricsEndpoint: "http://amd-gpu-metrics-exporter.gpu-operator.svc:5000/metrics"
pollingInterval: "15s"
defaultThreshold: "5"
defaultGracePeriod: "10m"
defaultPolicy: "OnPressure"
defaultAggregation: "Max"
defaultTTL: "24h"
These map to operator environment variables (see Environment Variables below). Changing them requires an operator restart.
Operator restart required
enabled, metricsEndpoint, and pollingInterval can only be set via Helm values / environment variables — they are not available in KaiwoConfig or per-workload annotations. These settings control the controller and metrics scraper startup, so any change requires an operator restart to take effect.
2. Cluster-wide runtime defaults: KaiwoConfig
The KaiwoConfig cluster-scoped singleton provides live-tunable defaults that take effect without restarting the operator. Add a gpuPreemption section to the spec:
apiVersion: config.kaiwo.silogen.ai/v1alpha1
kind: KaiwoConfig
metadata:
name: kaiwo
spec:
gpuPreemption:
defaultThreshold: 10
defaultGracePeriod: "15m"
defaultPolicy: "OnPressure"
defaultAggregation: "Max"
defaultTTL: "12h"
All fields are optional. Omitted fields fall through to the env var / hardcoded fallback.
Note
enabled, metricsEndpoint, and pollingInterval are not available in KaiwoConfig. These are startup-only settings (Helm / env vars) that require an operator restart to change. All other settings (defaultThreshold, defaultGracePeriod, defaultPolicy, defaultAggregation, defaultTTL) can be tuned live via KaiwoConfig without restarting.
3. Per-namespace: Annotations
Annotate a Namespace to opt in all GPU workloads in that namespace. This is convenient when you want every GPU workload in a namespace to be tracked without annotating each one individually:
apiVersion: v1
kind: Namespace
metadata:
name: ml-team
annotations:
kaiwo.silogen.ai/gpu-preemption.enabled: "true"
kaiwo.silogen.ai/gpu-preemption.grace-period: "15m"
kaiwo.silogen.ai/gpu-preemption.policy: "OnPressure"
Any GPU workload created in this namespace will be automatically tracked. Individual workloads can still override specific settings with their own annotations (see below).
4. Per-workload: Annotations
Add annotations to the root owner resource (the Job, Deployment, KaiwoJob, etc. -- not the Pod template) to opt it in or override namespace defaults:
apiVersion: batch/v1
kind: Job
metadata:
name: my-training-job
annotations:
kaiwo.silogen.ai/gpu-preemption.grace-period: "5m"
kaiwo.silogen.ai/gpu-preemption.policy: "Always"
spec:
template:
spec:
containers:
- name: train
resources:
requests:
amd.com/gpu: "4"
limits:
amd.com/gpu: "4"
# ...
Note
Any annotation with the kaiwo.silogen.ai/gpu-preemption. prefix enables preemption tracking. This means:
.enabled: "true"on its own is valid -- all settings use cluster-wide defaults..grace-period: "5m"on its own is also valid -- it implicitly enables tracking without needing.enabled.- Any combination works; omitted settings fall back through namespace annotations, then KaiwoConfig, then env vars, then hardcoded defaults.
- Annotations on the workload take precedence over identical annotations on the namespace.
Annotations Reference
All annotations use the prefix kaiwo.silogen.ai/gpu-preemption. and can be placed on the root owner resource and/or the namespace. Each setting follows the resolution chain: workload annotation > namespace annotation > KaiwoConfig > env var > hardcoded fallback.
| Annotation | Description | KaiwoConfig field | Env var | Default |
|---|---|---|---|---|
.enabled |
Explicitly enable tracking. Can be used on its own (all other settings fall back to defaults). Not required if any other preemption annotation is set. | -- | -- | -- |
.grace-period |
Duration the workload must be continuously idle before it becomes preemptible (e.g. 5m, 1h). |
gpuPreemption.defaultGracePeriod |
GPU_PREEMPTION_DEFAULT_GRACE_PERIOD |
10m |
.threshold |
Utilization percentage below which the workload is considered idle. | gpuPreemption.defaultThreshold |
GPU_PREEMPTION_DEFAULT_THRESHOLD |
5 |
.policy |
Preemption policy: OnPressure or Always. |
gpuPreemption.defaultPolicy |
GPU_PREEMPTION_DEFAULT_POLICY |
OnPressure |
.aggregation |
How utilization is aggregated across pods: Min, Max, or Avg. |
gpuPreemption.defaultAggregation |
GPU_PREEMPTION_DEFAULT_AGGREGATION |
Max |
.ttl |
How long the GpuWorkload CR is retained after the workload reaches a terminal state. 0 means retain forever. |
gpuPreemption.defaultTTL |
GPU_PREEMPTION_DEFAULT_TTL |
24h |
Preemption Policies
OnPressure (default)
A workload is only preempted when all of the following are true:
- It has been idle longer than
grace-period. - Another
GpuWorkloadis in thePendingGpuphase -- meaning its pods are unschedulable specifically because of insufficient GPU resources (the scheduler'sPodScheduledcondition message must containInsufficient amd.com/gpu). Pods pending for other reasons show asPendingOtherand do not trigger preemption. - A fit check passes (see below).
Fit check and multi-victim accumulation: The evaluator does not consider idle workloads in isolation. For each pending workload, it walks the idle pool (sorted oldest-idle first) and accumulates GPU counts across multiple idle workloads until the combined total meets the pending demand. If the combined total of all eligible idle workloads is still less than what the pending workload needs, none of them are preempted -- the system avoids pointless evictions. If the total does meet the demand, all workloads in the accumulated set are marked for preemption together. This is matched by GPU resource name (e.g. amd.com/gpu), so workloads using different GPU partition profiles do not interfere with each other.
Already-claimed victims are excluded from subsequent pending workloads' evaluations, preventing double-preemption.
Always
The workload is preempted as soon as it has been idle longer than grace-period, regardless of whether any other workload is waiting. This is useful for strict resource reclamation policies.
Environment Variables
These are set on the Kaiwo operator deployment, typically via Helm values under gpuPreemption.*:
| Variable | Description | Helm value | Hardcoded fallback |
|---|---|---|---|
GPU_PREEMPTION_ENABLED |
Enable the GPU preemption controller and metrics scraper (true/false). |
gpuPreemption.enabled |
false |
GPU_PREEMPTION_METRICS_ENDPOINT |
URL of the AMD GPU metrics exporter Prometheus endpoint. | gpuPreemption.metricsEndpoint |
(required) |
GPU_PREEMPTION_POLLING_INTERVAL |
How often the scraper polls the metrics endpoint (e.g. 15s, 1m). |
gpuPreemption.pollingInterval |
15s |
GPU_PREEMPTION_DEFAULT_THRESHOLD |
Default utilization threshold (%) below which a workload is idle. | gpuPreemption.defaultThreshold |
5 |
GPU_PREEMPTION_DEFAULT_GRACE_PERIOD |
Default idle duration before preemption eligibility (e.g. 10m, 1h). |
gpuPreemption.defaultGracePeriod |
10m |
GPU_PREEMPTION_DEFAULT_POLICY |
Default preemption policy (OnPressure or Always). |
gpuPreemption.defaultPolicy |
OnPressure |
GPU_PREEMPTION_DEFAULT_AGGREGATION |
Default multi-pod aggregation policy (Min, Max, or Avg). |
gpuPreemption.defaultAggregation |
Max |
GPU_PREEMPTION_DEFAULT_TTL |
Default TTL for terminal GpuWorkload CRs (0 = retain forever). |
gpuPreemption.defaultTTL |
24h |
GPU_PREEMPTION_OPERATOR_NAMESPACE |
Namespace where the operator runs (used for Lease coordination). | -- | (auto-detected) |
GpuWorkload Lifecycle
Each GpuWorkload CR transitions through the following phases:
| Phase | Meaning |
|---|---|
PendingOther |
Pods exist but are not yet Running for non-GPU reasons: image pulls, init containers, PVC binding, node affinity, taints, etc. No idle time is counted and no preemption evaluation is triggered. |
PendingGpu |
Pods are unschedulable specifically because of insufficient GPU resources (validated by checking the scheduler's PodScheduled condition for Insufficient amd.com/gpu). Acts as the demand signal for OnPressure preemption. |
Active |
Pods are running and aggregated GPU utilization is at or above the threshold. |
Idle |
Pods are running but aggregated GPU utilization is below the threshold. The grace-period timer starts. |
Preempting |
The workload has been selected for preemption; deletion of the root owner is in progress. |
Preempted |
The root owner was successfully deleted by the preemption system. |
Deleted |
The root owner disappeared on its own (completed, manually deleted, etc.). |
After reaching Preempted or Deleted, the CR is retained for ttlAfterFinished (default 24h) to provide an audit trail, then automatically cleaned up.
GpuWorkload naming
GpuWorkload CRs are named deterministically from the root owner: <lowercase-kind>-<name>-<uid-prefix>. For example, a Job named training with UID a1b2c3d4-... produces a GpuWorkload named job-training-a1b2c3d4.
Inspecting GpuWorkloads
# List all tracked GPU workloads
kubectl get gpuworkloads -A
# Detailed view with utilization and phase
kubectl get gpuworkloads -A -o wide
# Inspect a specific workload
kubectl describe gpuworkload job-training-a1b2c3d4 -n my-namespace
The status includes:
trackedPods-- owned pods and their per-GPU utilization metrics (pod name, GPU ID, utilization %, last update)aggregatedUtilization-- current utilization percentage across all podsidleSince-- when the workload became idleownerChain-- the full owner reference path (e.g.Pod/worker-0 -> Job/training -> KaiwoJob/my-job)preemptedFor-- the pending workload this was preempted to make room forpreemptionReason-- human-readable explanation
Example GpuWorkload CR
The following shows a representative GpuWorkload for an idle training Job, with all fields populated as they would appear during normal operation:
apiVersion: kaiwo.silogen.ai/v1alpha1
kind: GpuWorkload
metadata:
name: job-batch-training-a1b2c3d4 # <lowercase-kind>-<name>-<uid-prefix>
namespace: ml-team
spec:
workloadRef:
apiVersion: batch/v1
kind: Job
name: batch-training
uid: a1b2c3d4-e5f6-7890-abcd-ef1234567890
gpuResources:
amd.com/gpu: 4 # total GPUs requested across all pods
utilizationThreshold: 5.0 # % below which the workload is idle (optional override)
gracePeriod: 10m # idle duration before preemption eligibility (optional override)
preemptionPolicy: OnPressure # OnPressure or Always (optional override)
aggregationPolicy: Max # Min, Max, or Avg across pods (optional override)
ttlAfterFinished: 24h # how long to retain after terminal phase (optional override)
status:
phase: Idle
aggregatedUtilization: 1.2 # current %, below threshold → Idle
idleSince: "2026-03-04T10:00:00Z"
ownerChain: "Pod/batch-training-xw7n5 -> Job/batch-training"
lastMetricsUpdate: "2026-03-04T10:14:45Z"
trackedPods:
- podName: batch-training-xw7n5
gpuMetrics:
- gpuId: "0"
utilization: 2.1
lastUpdate: "2026-03-04T10:14:45Z"
- gpuId: "1"
utilization: 0.3
lastUpdate: "2026-03-04T10:14:45Z"
conditions:
- type: PhaseUpdated
status: "True"
reason: PhaseIdle
message: "GPU utilization dropped to 1.2%, workload is now considered idle"
lastTransitionTime: "2026-03-04T10:00:00Z"
For a preempted workload the status also includes:
status:
phase: Preempted
preemptedAt: "2026-03-04T10:15:00Z"
preemptedFor: "job-high-priority-demand-b2c3d4e5"
preemptionReason: "Preempted to free 4 amd.com/gpu for pending workload job-high-priority-demand-b2c3d4e5"
finishedAt: "2026-03-04T10:15:00Z" # TTL countdown starts here
Examples
Reclaim idle GPUs under pressure (default behavior)
apiVersion: batch/v1
kind: Job
metadata:
name: batch-inference
annotations:
kaiwo.silogen.ai/gpu-preemption.grace-period: "10m"
spec:
# ...
This uses the default OnPressure policy. The Job will only be preempted if it has been idle for 10 minutes and another GPU workload is pending due to insufficient GPUs and the combined GPUs freed by this (and potentially other idle) workloads would satisfy the pending demand.
Strict reclamation with short idle window
apiVersion: apps/v1
kind: Deployment
metadata:
name: dev-notebook
annotations:
kaiwo.silogen.ai/gpu-preemption.policy: "Always"
kaiwo.silogen.ai/gpu-preemption.grace-period: "5m"
kaiwo.silogen.ai/gpu-preemption.threshold: "10"
spec:
# ...
The Deployment is terminated if GPU utilization stays below 10% for 5 minutes, regardless of cluster demand.
Opt-in with defaults only
Uses all cluster-wide defaults from KaiwoConfig, then Helm values / environment variables.
Namespace-wide opt-in
apiVersion: v1
kind: Namespace
metadata:
name: ml-team
annotations:
kaiwo.silogen.ai/gpu-preemption.enabled: "true"
kaiwo.silogen.ai/gpu-preemption.grace-period: "15m"
All GPU workloads in ml-team are automatically tracked with a 15-minute grace period. Individual workloads can still override settings with their own annotations -- for example, setting .policy: "Always" on one Job overrides the default OnPressure inherited from the namespace.
Comparison with Resource Monitoring
| Aspect | Resource Monitoring | GPU Preemption |
|---|---|---|
| Scope | KaiwoJob and KaiwoService only |
Any Kubernetes GPU workload |
| Opt-in | Automatic for Kaiwo CRDs | Annotation on the root owner or namespace |
| CRD | None (status on Kaiwo CRDs) | GpuWorkload per tracked workload |
| Demand awareness | No | Yes (OnPressure with fit checks) |
| Audit trail | Status conditions on Kaiwo CRDs | Persistent GpuWorkload CRs with TTL |
| GPU vendor | NVIDIA or AMD | AMD (via amd.com/gpu* resources) |
Troubleshooting
-
GpuWorkload not created?
- Verify the root owner (not the Pod template) or its namespace has at least one
kaiwo.silogen.ai/gpu-preemption.*annotation. - Check that the Pod requests
amd.com/gpu*resources. - Check operator logs for RBAC errors during owner resolution -- the operator needs
getpermissions on all resource types in the owner chain.
- Verify the root owner (not the Pod template) or its namespace has at least one
-
Workload stuck in Idle but not preempted?
- With
OnPressurepolicy: preemption only happens when anotherGpuWorkloadis in thePendingGpuphase (pods unschedulable due toInsufficient amd.com/gpu, not other reasons) and its demand fits. Checkkubectl get gpuworkloadsfor aPendingGpuphase entry. - With
Alwayspolicy: ensuregrace-periodhas elapsed. CheckidleSinceon the GpuWorkload status.
- With
-
Metrics not updating?
- Verify
GPU_PREEMPTION_METRICS_ENDPOINTis reachable from the operator pod. - Check that the AMD GPU operator's metrics exporter is running and exposing
gpu_gfx_activity.
- Verify
-
Over-preemption concerns?
- The system uses a Kubernetes Lease for mutual exclusion during preemption evaluation. Only one replica evaluates at a time. The fit check accumulates idle workloads until demand is met and skips entirely if even all idle workloads together cannot satisfy the demand. Already-claimed victims are excluded from subsequent evaluations to prevent double-preemption.