Troubleshooting#
Common issues and diagnostic steps for AIM Engine.
Service Status#
Check the overall status:
kubectl get aimservice <name> -n <namespace>
For detailed diagnostics, inspect conditions and component health:
kubectl get aimservice <name> -n <namespace> -o jsonpath='{.status.conditions}' | jq
Common Issues#
Service Stuck in “Pending”#
The service is waiting for upstream dependencies.
Check which conditions are blocking readiness:
kubectl get aimservice <name> -n <namespace> -o jsonpath='{.status.conditions}' | jq
| Blocked Component | Likely Cause |
|---|---|
| Model | Model not found — check |
| Template | No matching template — verify templates exist and are |
| RuntimeConfig | Runtime config not found or invalid |
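To narrow the raw condition list down to the entries that are actually blocking, a small helper can filter on the condition status. This is a minimal sketch: the function name `aim_blocking_conditions` is hypothetical, and it assumes each condition carries the usual Kubernetes `type`, `status`, `reason`, and `message` fields and that a healthy condition reports `True`.

```shell
# Hypothetical helper: print every condition on an AIMService whose status is not "True".
# Assumption: conditions expose type/status/reason/message like standard Kubernetes conditions.
# Arguments: $1 = service name, $2 = namespace.
aim_blocking_conditions() {
  kubectl get aimservice "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}' |
    awk -F '\t' 'NF && $2 != "True" { printf "%s (%s): %s\n", $1, $3, $4 }'
}
```

Call it as `aim_blocking_conditions <name> <namespace>`. An empty result suggests no condition is reporting a problem, in which case the operator logs are the next place to look.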
Service Stuck in “Starting”#
Downstream resources are being created but haven’t become ready.
Check the InferenceService:
kubectl get inferenceservice -n <namespace> -l aim.eai.amd.com/service.name=<name>
kubectl describe inferenceservice <isvc-name> -n <namespace>
Check pods:
kubectl get pods -l serving.kserve.io/inferenceservice=<isvc-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
Use the InferenceService name returned by the first command as <isvc-name>.
Common causes:
Image pull errors — Wrong image URL or missing imagePullSecrets
Insufficient resources — Not enough GPU, memory, or CPU available
PVC not binding — Storage class doesn’t support RWX, or insufficient capacity
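The causes above can often be told apart from the pod phase and the container waiting reason. The following sketch is one way to summarize them; the function name `aim_pod_waiting_reasons` is hypothetical, and the mapping from reasons to causes is a heuristic rather than anything AIM Engine defines.

```shell
# Hypothetical helper: summarize why pods of an InferenceService are not running.
# Arguments: $1 = InferenceService name, $2 = namespace.
aim_pod_waiting_reasons() {
  kubectl get pods -n "$2" -l "serving.kserve.io/inferenceservice=$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' |
    awk -F '\t' '
      # Kubelet reports these reasons when an image cannot be pulled.
      $3 ~ /ErrImagePull|ImagePullBackOff/ { print $1 ": image pull error (" $3 ")"; next }
      # Pending with no container status usually means the pod has not been scheduled.
      $2 == "Pending" && $3 == "" { print $1 ": unscheduled or waiting on a volume; check pod events"; next }
      # Any other waiting reason is printed as-is.
      $3 != "" { print $1 ": waiting (" $3 ")" }
    '
}
```

Call it as `aim_pod_waiting_reasons <isvc-name> <namespace>`. For pods reported as unscheduled, the Events section of `kubectl describe pod` normally states whether the scheduler is short on GPU, memory, or CPU, or is waiting for a PVC to bind.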
Template Selection Fails#
“No templates found”:
# List namespace-scoped and cluster-scoped templates
kubectl get aimservicetemplates -n <namespace>
kubectl get aimclusterservicetemplates
# Check template status
kubectl get aimservicetemplates -n <namespace> -o custom-columns=NAME:.metadata.name,STATUS:.status.status
Templates may be excluded because:
Status is not Ready (still discovering or failed)
Status is NotAvailable (required GPU not in cluster)
Profile is unoptimized and allowUnoptimized is not set
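To see at a glance which templates are excluded by status, the custom-columns output can be filtered. A minimal sketch, assuming the `.status.status` field used above; the function name `aim_unready_templates` is hypothetical.

```shell
# Hypothetical helper: list namespace-scoped templates whose status is not "Ready".
# Argument: $1 = namespace.
aim_unready_templates() {
  kubectl get aimservicetemplates -n "$1" --no-headers \
    -o custom-columns=NAME:.metadata.name,STATUS:.status.status |
    awk '$2 != "Ready" { print $1 " -> " $2 }'
}
```

Call it as `aim_unready_templates <namespace>`. A template with no status yet is printed with `<none>`, which is how kubectl renders a missing field in custom columns.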
“Ambiguous selection”:
Multiple templates scored equally. Resolve by specifying template.name explicitly.
Cache or Artifact Failures#
# Check template cache
kubectl get aimtemplatecache -n <namespace>
# Check artifacts
kubectl get aimartifact -n <namespace>
# Check download job
kubectl get jobs -l aim.eai.amd.com/artifact=<artifact-name> -n <namespace>
kubectl logs job/<job-name> -n <namespace>
Common causes:
StorageSizeError — Model size not yet discovered; typically resolves automatically
Download failure — Network issues, authentication errors, or protocol incompatibility
PVC binding failure — Storage class doesn’t support ReadWriteMany
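For the PVC binding case, it can help to list only the claims that are not yet bound, together with their storage class. This is a sketch using standard PersistentVolumeClaim fields; the function name `aim_unbound_pvcs` is hypothetical.

```shell
# Hypothetical helper: list PersistentVolumeClaims in a namespace that are not Bound.
# Argument: $1 = namespace.
aim_unbound_pvcs() {
  kubectl get pvc -n "$1" --no-headers \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,CLASS:.spec.storageClassName |
    awk '$2 != "Bound" { print $1 ": " $2 " (storageClass=" $3 ")" }'
}
```

Call it as `aim_unbound_pvcs <namespace>`, then run `kubectl describe pvc <pvc-name> -n <namespace>` on any claim it reports; the events there usually explain why provisioning or binding failed.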
Routing Not Working#
# Check HTTPRoute
kubectl get httproute -n <namespace>
kubectl describe httproute <name> -n <namespace>
# Check the gateway
kubectl get gateway -n <gateway-namespace>
Common causes:
Gateway doesn’t exist or isn’t ready
routing.enabled is not set (check runtime config)
Gateway namespace mismatch in gatewayRef
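Gateway API controllers report attachment problems in the HTTPRoute's `status.parents` conditions, so those are worth checking before reading the full `describe` output. A minimal sketch, assuming a standard Gateway API HTTPRoute; the function name `aim_route_problems` is hypothetical.

```shell
# Hypothetical helper: print HTTPRoute parent conditions that are False.
# Arguments: $1 = HTTPRoute name, $2 = namespace.
aim_route_problems() {
  kubectl get httproute "$1" -n "$2" \
    -o jsonpath='{range .status.parents[*].conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}' |
    awk -F '\t' '
      NF { seen = 1 }
      NF && $2 == "False" { print $1 " is False (" $3 "): " $4 }
      # No conditions at all means no gateway controller has reported on this route.
      END { if (!seen) print "no parent status reported; the route may not be attached to a gateway" }
    '
}
```

Call it as `aim_route_problems <name> <namespace>`. A route with no parent status at all often points to a missing gateway or a namespace mismatch in the gateway reference.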
Operator Logs#
View operator logs for detailed error information:
kubectl logs -n aim-system deployment/aim-engine-controller-manager -f
Filter log entries for a specific resource (this assumes JSON-formatted log lines):
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
jq 'select(.name == "<resource-name>")'
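If the log stream contains lines that are not valid JSON, `jq` stops at the first one. A plain-text fallback is to match the resource name with `grep` over a bounded time window. This is a sketch: the function name `aim_logs_for` and the one-hour default window are illustrative choices.

```shell
# Hypothetical helper: show recent operator log lines that mention a resource name.
# Arguments: $1 = resource name, $2 = optional time window for --since (default 1h).
aim_logs_for() {
  kubectl logs -n aim-system deployment/aim-engine-controller-manager --since="${2:-1h}" |
    grep -F -- "$1"
}
```

Call it as `aim_logs_for <resource-name> 30m`. Because it matches the name as a fixed string anywhere in the line, it may also return entries for other resources whose names contain it.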
Status Values#
| Status | Meaning |
|---|---|
| | Waiting for upstream dependencies |
| | Creating downstream resources |
| | Resources created, waiting for readiness |
| | Fully operational |
| | Resource is ready (for non-service CRDs) |
| | Partially functional |
| | Required infrastructure not present |
| | Critical failure |
Next Steps#
Monitoring — Log format and metrics details
CLI and Operator Flags — Enable debug logging