Troubleshooting Guide
This guide provides steps for diagnosing common issues with Kaiwo.
Operator Issues
Operator Pod Not Running or Crashing
- Check Pod Status: Look for pods in `CrashLoopBackOff`, `Error`, or `Pending` states (example commands for these checks follow this list).
- Examine Pod Logs: Look for error messages related to startup, configuration, API connectivity, or reconciliation loops.
- Describe Pod: Check for events related to scheduling failures (resource constraints, taints/tolerations), image pull errors, readiness/liveness probe failures, or volume mount issues.
- Check Dependencies: Ensure all dependencies (Cert-Manager, Kueue, Ray Operator, GPU Operator, AppWrapper) are running correctly in their respective namespaces. Check their logs if necessary.
- RBAC Permissions: Verify the Kaiwo operator's `ServiceAccount`, `ClusterRole`, and `ClusterRoleBinding` grant sufficient permissions. Errors related to "forbidden" access often point to RBAC issues.
- Webhook Issues: If webhooks are enabled, check Cert-Manager status and webhook service connectivity. Invalid certificates or network policies blocking webhook calls can prevent resource creation/updates.
    - Check webhook configurations: `kubectl get mutatingwebhookconfigurations`, `kubectl get validatingwebhookconfigurations`
    - Check certificate status: `kubectl get certificates -n kaiwo-system`
    - Test the webhook service endpoint.
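A minimal sketch of these checks, assuming the operator runs in the `kaiwo-system` namespace with the `control-plane=kaiwo-controller-manager` label (as used elsewhere in this guide); the dependency namespaces and the ServiceAccount name are placeholders to adapt to your installation:

```bash
# Pod status: look for CrashLoopBackOff, Error, or Pending
kubectl get pods -n kaiwo-system

# Operator logs and pod events
kubectl logs -n kaiwo-system -l control-plane=kaiwo-controller-manager --tail=200
kubectl describe pod -n kaiwo-system -l control-plane=kaiwo-controller-manager

# Dependencies (namespaces may differ in your installation)
kubectl get pods -n cert-manager
kubectl get pods -n kueue-system

# RBAC: verify a permission the operator needs, using its ServiceAccount
kubectl auth can-i list nodes \
  --as=system:serviceaccount:kaiwo-system:<operator-serviceaccount>
```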
Default KaiwoQueueConfig Not Created
- Check operator logs (`kubectl logs -n kaiwo-system -l control-plane=kaiwo-controller-manager`) for errors during the startup routine that creates the default configuration.
- Common causes include inability to list Nodes (an RBAC issue) or errors during node labeling/tainting, if enabled.
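For example (the default object is named `kaiwo`, as referenced in the next section):

```bash
# Does the default KaiwoQueueConfig exist?
kubectl get kaiwoqueueconfig kaiwo

# Startup errors from the operator, e.g. failures to list or label nodes
kubectl logs -n kaiwo-system -l control-plane=kaiwo-controller-manager | grep -i error
```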
Kueue Resources Not Syncing
- Ensure the `kaiwo` `KaiwoQueueConfig` resource exists (`kubectl get kaiwoqueueconfig kaiwo`).
- Check operator logs for errors related to creating/updating Kueue `ResourceFlavors`, `ClusterQueues`, or `WorkloadPriorityClasses`.
- Verify the operator has RBAC permissions to manage these Kueue resources.
- Check Kueue controller logs (`kubectl logs -n kueue-system -l control-plane=controller-manager -f`) for related errors.
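A quick check that the expected Kueue objects actually exist (these are cluster-scoped; compare the output against what the `KaiwoQueueConfig` declares):

```bash
kubectl get resourceflavors
kubectl get clusterqueues
kubectl get workloadpriorityclasses
```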
Workload Issues
Workload Stuck in PENDING
This usually means Kueue has not admitted the workload yet.
- Check Kueue Workload Status: Find the Kueue `Workload` resource corresponding to your `KaiwoJob`/`KaiwoService`:

    ```bash
    # Find the workload (often named after the Kaiwo resource)
    kubectl get workloads -n <namespace>

    # Describe the workload to see admission status and reasons for pending
    kubectl describe workload -n <namespace> <workload-name>
    ```

    Look for conditions like `Admitted` being `False` and check the `Message` for reasons (e.g., quota exhaustion, no matching `ResourceFlavor`).
- Check ClusterQueue Status: Look at usage vs. quota (`nominalQuota`) for relevant resource flavors (see the sketch after this list).
- Check ResourceFlavor Definitions: Ensure `ResourceFlavors` defined in `KaiwoQueueConfig` correctly match node labels in your cluster.
- Check LocalQueue: Ensure a `LocalQueue` pointing to the correct `ClusterQueue` exists in the workload's namespace (`kubectl get localqueue -n <namespace> <queue-name>`). The Kaiwo operator should create these if specified in `KaiwoQueueConfig.spec.clusterQueues[].namespaces`.
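A sketch of the ClusterQueue, ResourceFlavor, and LocalQueue checks above; queue and flavor names come from your `KaiwoQueueConfig`, and the node label is an illustrative placeholder:

```bash
# Usage vs. nominalQuota per resource flavor, plus admission details
kubectl describe clusterqueue <clusterqueue-name>

# Labels the flavor expects must exist on real nodes
kubectl get resourceflavor <flavor-name> -o yaml
kubectl get nodes -l <expected-label-key>=<expected-label-value>

# LocalQueues in the workload's namespace
kubectl get localqueue -n <namespace>
```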
Workload Fails Immediately (Status FAILED)
- Check Kaiwo Resource Events: Look for events indicating failures during dependency creation (e.g., PVC, download job) or underlying resource creation.
- Check Download Job Logs (if applicable): If using `spec.storage` with downloads, check the logs of the downloader job pod. Look for errors related to accessing storage secrets, connecting to S3/GCS/Git, or filesystem permissions.
- Check Underlying Resource Events/Logs:
    - For `KaiwoJob` -> `BatchJob`: `kubectl describe job -n <namespace> <job-name>` and check pod events/logs.
    - For `KaiwoJob` -> `RayJob`: `kubectl describe rayjob -n <namespace> <job-name>` and check Ray cluster/pod events/logs.
    - For `KaiwoService` -> `Deployment`: `kubectl describe deployment -n <namespace> <service-name>` and check pod events/logs.
    - For `KaiwoService` -> `RayService`: `kubectl describe rayservice -n <namespace> <service-name>` and check Ray cluster/pod events/logs.
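A sketch of the first two checks; the lowercase resource names follow the Kaiwo CRD kinds, and the downloader job name is a placeholder to look up with `kubectl get jobs`:

```bash
# Events recorded on the Kaiwo resource itself
kubectl describe kaiwojob -n <namespace> <job-name>
kubectl describe kaiwoservice -n <namespace> <service-name>

# All recent events in the namespace, sorted by time (most recent last)
kubectl get events -n <namespace> --sort-by=.lastTimestamp

# Downloader job logs (if spec.storage downloads are used)
kubectl logs -n <namespace> job/<download-job-name>
```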
Pods Not Scheduling / Stuck in Pending
This occurs after Kueue admits the workload but before Kubernetes schedules the pod(s).
- Describe Pod: Check the `Events` section for messages from the scheduler (e.g., `FailedScheduling`). Common reasons include:
    - Insufficient Resources: Not enough CPU, memory, or GPUs available on any node.
    - Node Affinity/Selector Mismatch: Pod requires labels that no node possesses (often related to `ResourceFlavor` `nodeLabels`).
    - Taint/Toleration Mismatch: Pod lacks tolerations for taints present on suitable nodes (e.g., GPU taint). Kaiwo should add GPU tolerations automatically if GPUs are requested.
    - PVC Binding Issues: If using `storage`, check if the `PersistentVolumeClaim` is stuck in `Pending` (`kubectl get pvc -n <namespace>`). This could be due to no available `PersistentVolume` or StorageClass issues.
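A few commands that usually narrow this down; the node label is an illustrative placeholder:

```bash
# Scheduler events (FailedScheduling messages) for the stuck pod
kubectl describe pod -n <namespace> <pod-name>

# Capacity and taints per node
kubectl describe nodes | grep -A 8 "Allocated resources"
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Are there nodes with the labels the ResourceFlavor selects on?
kubectl get nodes -l <expected-label-key>=<expected-label-value>

# PVC binding status
kubectl get pvc -n <namespace>
```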
Pods Crashing / CrashLoopBackOff
- Check Pod Logs: This is the most important step. Look for application errors, missing files, permission issues, OOMKilled errors, and GPU driver/runtime errors.
- Describe Pod: Check events for reasons like OOMKilled.
- Exec into Pod (if possible): Use `kaiwo exec` or `kubectl exec` to inspect the container environment.
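For example; `--previous` returns logs from the last crashed container instance:

```bash
# Logs from the running container and from the previous crashed instance
kubectl logs -n <namespace> <pod-name>
kubectl logs -n <namespace> <pod-name> --previous

# Restart reason (e.g. OOMKilled) and exit code
kubectl describe pod -n <namespace> <pod-name>

# Interactive inspection, if the container stays up long enough
kubectl exec -it -n <namespace> <pod-name> -- /bin/sh
```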
Developer Debugging (kaiwo-dev)
Info: This feature is only intended for contributors.
The kaiwo-dev tool (built separately from the main CLI/operator) provides debugging utilities.
- `kaiwo-dev debug chainsaw`: Helps debug Kyverno Chainsaw E2E tests by collecting and correlating logs and events from a specific test namespace. This command gathers Kaiwo controller logs relevant to the namespace, pod logs within the namespace, and Kubernetes events, sorts them chronologically, and prints them with color-coding. Useful for understanding the sequence of events during a failed test run.