Monitoring and Observability
AIM Engine exposes metrics and structured logs for monitoring operator health and inference workloads.
Metrics
Endpoint
The controller exposes metrics on port 8443 (HTTPS by default). Configure via Helm:
| Value | Default | Description |
|---|---|---|
| | | Enable metrics endpoint |
| | | Metrics port |
Prometheus ServiceMonitor
Enable automatic scraping with Prometheus:
helm install aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
  --version <version> \
  --namespace aim-system \
  --set prometheus.enable=true
This creates a ServiceMonitor resource that Prometheus Operator picks up automatically.
Controller Runtime Metrics
AIM Engine exposes standard controller-runtime metrics:
controller_runtime_reconcile_total: Total reconciliations by controller and result
controller_runtime_reconcile_errors_total: Total reconciliation errors
controller_runtime_reconcile_time_seconds: Reconciliation duration
workqueue_depth: Current work queue depth per controller
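As a rough illustration of reading these metrics outside Prometheus, the sketch below sums every sample of one metric from a saved scrape in Prometheus text format. The helper name `sum_metric` and the file `metrics.txt` are hypothetical, and fetching the scrape itself (service address, authentication for the HTTPS endpoint) depends on your deployment and is not shown.

```shell
#!/bin/sh
# Hypothetical helper: sum all samples of one metric across its label sets.
# Reads Prometheus text exposition format on stdin; the metric name is $1.
# Assumes sample lines carry no trailing timestamp, so the last field is the value.
sum_metric() {
  awk -v name="$1" '
    # Match "name{...} value" or "name value". HELP and TYPE lines begin with "#" and never match.
    index($0, name "{") == 1 || index($0, name " ") == 1 { total += $NF }
    END { print total + 0 }
  '
}

# Example (metrics.txt is an assumed file holding one scrape of the endpoint):
#   sum_metric controller_runtime_reconcile_errors_total < metrics.txt
```

Comparing the summed error counter between two scrapes gives a quick sense of whether reconciliation errors are still accumulating.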
Logs
Format
Operator logs are JSON-formatted with these key fields:
| Field | Description | Example |
|---|---|---|
| level | Log level | |
| controller | Controller name | |
| namespace | Resource namespace | |
| | Resource name | |
| | Condition being updated | |
| | Condition status | |
| | Condition reason | |
Log Levels
Configure via operator flags:
| Flag | Values | Default |
|---|---|---|
| | | |
| | | |
| | - | |
Enable debug logging in Helm:
helm install aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
  --version <version> \
  --namespace aim-system \
  --set 'manager.args={--leader-elect,--zap-log-level=debug}'
Useful Log Queries
# View operator logs
kubectl logs -n aim-system deployment/aim-engine-controller-manager -f
# Filter for errors
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.level == "error")'
# Filter by controller
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.controller == "aimservice")'
# Filter by namespace
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.namespace == "ml-team")'
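The filters above can be combined into a small summary. The sketch below counts error-level entries per controller, using the `level` and `controller` fields from the queries above. The helper name `errors_by_controller` is hypothetical. It reads lines with `jq -R` and parses them with `fromjson?`, so any line that is not valid JSON is skipped instead of aborting the pipeline; whether such lines appear in your logs is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: count error-level log entries per controller.
# Reads operator log lines on stdin and prints "controller count" pairs.
errors_by_controller() {
  # -R reads raw lines; fromjson? drops lines that fail to parse;
  # objects keeps only JSON objects so field access cannot fail.
  jq -R -r 'fromjson? | objects | select(.level == "error") | .controller // "unknown"' \
    | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example (requires cluster access):
#   kubectl logs -n aim-system deployment/aim-engine-controller-manager | errors_by_controller
```

Entries without a `controller` field are grouped under `unknown`.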
Kubernetes Events
The operator emits Kubernetes Events on AIM resources when conditions change. Events provide a timeline of state transitions visible via kubectl describe.
Event Types
| Type | When Emitted |
|---|---|
| Normal | Condition transitions to a healthy state |
| Warning | Condition transitions to an unhealthy state, or persists unhealthy on every reconcile |
Event Reasons
Events use the condition’s reason field as the event reason. Common event reasons:
AIMService:
| Reason | Type | Description |
|---|---|---|
| | Normal | Model found and ready |
| | Warning | Referenced model does not exist |
| | Normal | Template resolved successfully |
| | Warning | Multiple templates scored equally |
| | Normal | Model cache is populated |
| | Warning | Cache download failed |
| | Normal | InferenceService is serving |
| | Warning | Model image URI is invalid |
| | Warning | Routing path template failed to resolve |
AIMModel:
| Reason | Type | Description |
|---|---|---|
| | Normal | All discovered templates are ready |
| | Warning | All discovered templates failed |
| | Warning | Failed to extract model metadata |
AIMArtifact:
| Reason | Type | Description |
|---|---|---|
| | Normal | Download complete and verified |
| | Normal | Download in progress |
Viewing Events
# Events for a specific resource
kubectl describe aimservice qwen-chat -n <namespace>
# All AIM-related events in a namespace
kubectl get events -n <namespace> --field-selector involvedObject.apiVersion=aim.eai.amd.com/v1alpha1
Recurring Events
Some warning events are emitted on every reconcile (not just on transitions) for critical conditions that remain unhealthy. These are useful for alerting — a persistent stream of warnings indicates a stuck or failing resource.
See Conditions Reference for the full catalog of conditions and reasons.
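One way to spot the recurring warnings described above is to look at the repeat count that Kubernetes stores on each event. The sketch below keeps only events whose count reaches a threshold. The helper name `filter_recurring`, the default threshold, and the namespace in the example are hypothetical, and it assumes the event recorder aggregates repeats into the `.count` field; events without a numeric count are skipped.

```shell
#!/bin/sh
# Hypothetical helper: keep events whose COUNT column is at least a threshold.
# Input lines: "COUNT KIND NAME REASON"; the threshold is $1 (default 5).
filter_recurring() {
  awk -v min="${1:-5}" '$1 ~ /^[0-9]+$/ && ($1 + 0) >= (min + 0) { print }'
}

# Example (requires cluster access; my-namespace is a placeholder):
#   kubectl get events -n my-namespace --field-selector type=Warning \
#     -o custom-columns=COUNT:.count,KIND:.involvedObject.kind,NAME:.involvedObject.name,REASON:.reason \
#     --no-headers | filter_recurring 10
```

A resource that keeps appearing in this output across checks is a reasonable candidate for an alert or a closer look with kubectl describe.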
Health Probes
The operator exposes health and readiness probes:
| Probe | Path | Port |
|---|---|---|
| Liveness | | 8081 |
| Readiness | | 8081 |
These are configured automatically in the Helm chart deployment.
Next Steps
Troubleshooting — Diagnosing common issues
CLI and Operator Flags — Full operator flag reference