# Troubleshooting Guide
This guide provides steps for diagnosing common issues with Kaiwo.
## Operator Issues
### Operator Pod Not Running or Crashing
- **Check Pod Status:** Look for pods in `CrashLoopBackOff`, `Error`, or `Pending` states (commands for the checks in this list are sketched below it).
- **Examine Pod Logs:** Look for error messages related to startup, configuration, API connectivity, or reconciliation loops.
- **Describe Pod:** Check for events related to scheduling failures (resource constraints, taints/tolerations), image pull errors, readiness/liveness probe failures, or volume mount issues.
- **Check Dependencies:** Ensure all dependencies (Cert-Manager, Kueue, Ray Operator, GPU Operator, AppWrapper) are running correctly in their respective namespaces. Check their logs if necessary.
- **RBAC Permissions:** Verify that the Kaiwo operator's `ServiceAccount`, `ClusterRole`, and `ClusterRoleBinding` grant sufficient permissions. Errors mentioning "forbidden" access often point to RBAC issues.
- **Webhook Issues:** If webhooks are enabled, check Cert-Manager status and webhook service connectivity. Invalid certificates or network policies blocking webhook calls can prevent resource creation/updates.
    - Check webhook configurations: `kubectl get mutatingwebhookconfigurations`, `kubectl get validatingwebhookconfigurations`
    - Check certificate status: `kubectl get certificates -n kaiwo-system`
    - Test the webhook service endpoint.
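The checks above can be run as a short command sequence. This is a minimal sketch: it assumes the operator runs in the `kaiwo-system` namespace with the `control-plane=kaiwo-controller-manager` label (both appear elsewhere in this guide); the ServiceAccount name and the grep pattern are placeholders for whatever your installation uses.

```bash
# 1. Pod status: look for CrashLoopBackOff / Error / Pending
kubectl get pods -n kaiwo-system

# 2. Operator logs: startup, configuration, and reconciliation errors
kubectl logs -n kaiwo-system -l control-plane=kaiwo-controller-manager --tail=200

# 3. Pod events: scheduling, image pulls, probes, volume mounts
kubectl describe pod -n kaiwo-system <operator-pod-name>

# Dependencies: confirm supporting operators are running
# (the grep pattern is illustrative; match your actual namespaces)
kubectl get pods -A | grep -Ei "cert-manager|kueue|ray|gpu|appwrapper"

# RBAC: spot-check a permission as the operator's ServiceAccount
# (<service-account> is a placeholder; look it up in your install)
kubectl auth can-i list nodes \
  --as=system:serviceaccount:kaiwo-system:<service-account>

# Webhooks: configurations and certificate status
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
kubectl get certificates -n kaiwo-system
```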
### Default `KaiwoQueueConfig` Not Created
- Check the operator logs (`kubectl logs -n kaiwo-system -l control-plane=kaiwo-controller-manager`) for errors during the startup routine that creates the default configuration.
- Common causes include an inability to list Nodes (an RBAC issue) or errors during node labeling/tainting, if enabled.
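A quick way to confirm whether the default configuration exists (the resource name `kaiwo` comes from the next subsection; the grep keywords are illustrative, not exact log strings):

```bash
# Does the default KaiwoQueueConfig exist?
kubectl get kaiwoqueueconfig kaiwo -o yaml

# If not, look for node-listing or labeling errors during startup
kubectl logs -n kaiwo-system -l control-plane=kaiwo-controller-manager \
  | grep -iE "node|queueconfig"
```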
### Kueue Resources Not Syncing
- Ensure the `kaiwo` `KaiwoQueueConfig` resource exists (`kubectl get kaiwoqueueconfig kaiwo`).
- Check the operator logs for errors related to creating or updating Kueue `ResourceFlavors`, `ClusterQueues`, or `WorkloadPriorityClasses`.
- Verify the operator has RBAC permissions to manage these Kueue resources.
- Check the Kueue controller logs (`kubectl logs -n kueue-system -l control-plane=controller-manager -f`) for related errors.
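To see whether the Kueue objects were actually materialized, you can list them directly; `ResourceFlavor`, `ClusterQueue`, and `WorkloadPriorityClass` are standard cluster-scoped Kueue resources:

```bash
# Kueue objects the operator manages from the KaiwoQueueConfig
kubectl get resourceflavors
kubectl get clusterqueues
kubectl get workloadpriorityclasses

# Compare against the desired state
kubectl get kaiwoqueueconfig kaiwo -o yaml
```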
## Workload Issues
### Workload Stuck in `PENDING`
This usually means Kueue has not admitted the workload yet.
- **Check Kueue Workload Status:** Find the Kueue `Workload` resource corresponding to your `KaiwoJob`/`KaiwoService`.

    ```bash
    # Find the workload (often named after the Kaiwo resource)
    kubectl get workloads -n <namespace>

    # Describe the workload to see admission status and reasons for pending
    kubectl describe workload -n <namespace> <workload-name>
    ```

    Look for conditions like `Admitted` being `False` and check the `Message` for reasons (e.g., quota exhaustion, no matching `ResourceFlavor`).
- **Check ClusterQueue Status:** Look at usage vs. quota (`nominalQuota`) for the relevant resource flavors.
- **Check ResourceFlavor Definitions:** Ensure the `ResourceFlavors` defined in `KaiwoQueueConfig` correctly match node labels in your cluster.
- **Check LocalQueue:** Ensure a `LocalQueue` pointing to the correct `ClusterQueue` exists in the workload's namespace (`kubectl get localqueue -n <namespace> <queue-name>`). The Kaiwo operator should create these if they are specified in `KaiwoQueueConfig.spec.clusterQueues[].namespaces`.
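The remaining checks in this list, sketched as commands (`<cluster-queue>` and `<flavor-name>` are placeholders):

```bash
# Usage vs. nominalQuota per resource flavor
kubectl describe clusterqueue <cluster-queue>

# Flavor node labels must match labels on real nodes
kubectl get resourceflavor <flavor-name> -o yaml
kubectl get nodes --show-labels

# A LocalQueue must exist in the workload's namespace
kubectl get localqueue -n <namespace>
```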
### Workload Fails Immediately (Status `FAILED`)
- **Check Kaiwo Resource Events:** Look for events indicating failures during dependency creation (e.g., PVC, download job) or underlying resource creation.
- **Check Download Job Logs (if applicable):** If using `spec.storage` with downloads, check the logs of the downloader job pod. Look for errors related to accessing storage secrets, connecting to S3/GCS/Git, or filesystem permissions.
- **Check Underlying Resource Events/Logs:**
    - For `KaiwoJob` -> `BatchJob`: `kubectl describe job -n <namespace> <job-name>` and check pod events/logs.
    - For `KaiwoJob` -> `RayJob`: `kubectl describe rayjob -n <namespace> <job-name>` and check Ray cluster/pod events/logs.
    - For `KaiwoService` -> `Deployment`: `kubectl describe deployment -n <namespace> <service-name>` and check pod events/logs.
    - For `KaiwoService` -> `RayService`: `kubectl describe rayservice -n <namespace> <service-name>` and check Ray cluster/pod events/logs.
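One way to walk the chain from the Kaiwo resource down to its pods, sketched for the `KaiwoJob` -> `BatchJob` pairing (the same pattern applies to the other pairings above; the `job-name` label is the standard one set by the batch Job controller):

```bash
# Events on the Kaiwo resource itself
kubectl describe kaiwojob -n <namespace> <job-name>

# Recent events in the namespace, oldest first
kubectl get events -n <namespace> --sort-by=.lastTimestamp

# The underlying Job and its pods
kubectl describe job -n <namespace> <job-name>
kubectl get pods -n <namespace> -l job-name=<job-name>
kubectl logs -n <namespace> <pod-name>
```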
### Pods Not Scheduling / Stuck in Pending
This occurs after Kueue admits the workload but before Kubernetes schedules the pod(s).
- **Describe Pod:** Check the `Events` section for messages from the scheduler (e.g., `FailedScheduling`). Common reasons include:
    - **Insufficient Resources:** Not enough CPU, memory, or GPUs available on any node.
    - **Node Affinity/Selector Mismatch:** The pod requires labels that no node possesses (often related to `ResourceFlavor` `nodeLabels`).
    - **Taint/Toleration Mismatch:** The pod lacks tolerations for taints present on suitable nodes (e.g., a GPU taint). Kaiwo should add GPU tolerations automatically if GPUs are requested.
    - **PVC Binding Issues:** If using `storage`, check whether the `PersistentVolumeClaim` is stuck in `Pending` (`kubectl get pvc -n <namespace>`). This could be due to no available `PersistentVolume` or StorageClass issues.
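The same checks as commands, a minimal sketch:

```bash
# Scheduler messages appear under Events
kubectl describe pod -n <namespace> <pod-name>

# Or filter namespace events for scheduling failures directly
kubectl get events -n <namespace> --field-selector reason=FailedScheduling

# Node labels and taints, to compare against the pod's requirements
kubectl get nodes --show-labels
kubectl describe node <node-name> | grep -i taints

# PVC binding state
kubectl get pvc -n <namespace>
```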
### Pods Crashing / `CrashLoopBackOff`
- **Check Pod Logs:** This is the most important step. Look for application errors, missing files, permission issues, OOMKilled errors, or GPU driver/runtime errors.
- **Describe Pod:** Check events for reasons like OOMKilled.
- **Exec into Pod (if possible):** Use `kaiwo exec` or `kubectl exec` to inspect the container environment.
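For a crash-looping container, the previous instance's logs and exit reason are usually the fastest signal:

```bash
# Logs from the last (crashed) container run
kubectl logs -n <namespace> <pod-name> --previous

# Exit code and reason (e.g. OOMKilled) appear under "Last State"
kubectl describe pod -n <namespace> <pod-name>

# Inspect the environment interactively, if the container stays up long enough
kubectl exec -it -n <namespace> <pod-name> -- /bin/sh
```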
## Developer Debugging (`kaiwo-dev`)
!!! info
    This feature is only intended for contributors.
The `kaiwo-dev` tool (built separately from the main CLI/operator) provides debugging utilities.
- `kaiwo-dev debug chainsaw`: Helps debug Kyverno Chainsaw E2E tests by collecting and correlating logs and events from a specific test namespace. This command gathers Kaiwo controller logs relevant to the namespace, pod logs within the namespace, and Kubernetes events, sorts them chronologically, and prints them with color-coding. This is useful for understanding the sequence of events during a failed test run.