Monitoring and Observability

AIM Engine exposes metrics, structured logs, Kubernetes Events, and health probes for monitoring operator health and inference workloads.

Metrics

Endpoint

The controller exposes metrics on port 8443 (HTTPS by default). Configure via Helm:

| Value | Default | Description |
|---|---|---|
| `metrics.enable` | `true` | Enable metrics endpoint |
| `metrics.port` | `8443` | Metrics port |

Prometheus ServiceMonitor

Enable automatic scraping with Prometheus:

helm install aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
  --version <version> \
  --namespace aim-system \
  --set prometheus.enable=true

This creates a ServiceMonitor resource, which Prometheus Operator picks up automatically as long as the Prometheus instance's ServiceMonitor selectors match it.

Controller Runtime Metrics

AIM Engine exposes standard controller-runtime metrics:

`controller_runtime_reconcile_total`: Total reconciliations by controller and result
`controller_runtime_reconcile_errors_total`: Total reconciliation errors
`controller_runtime_reconcile_time_seconds`: Reconciliation duration
`workqueue_depth`: Current work queue depth per controller
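
One way these counters might be combined is a per-controller error ratio. The following is a minimal sketch, not part of AIM Engine: it parses a saved copy of the endpoint's Prometheus text output and divides `controller_runtime_reconcile_errors_total` by `controller_runtime_reconcile_total`. The `SAMPLE` text, its values, and the controller label values are made up for illustration, and the parser only handles simple `name{labels} value` lines.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; the numbers and controller names are made up.
SAMPLE = """\
# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="aimservice",result="success"} 90
controller_runtime_reconcile_total{controller="aimservice",result="error"} 10
controller_runtime_reconcile_errors_total{controller="aimservice"} 10
controller_runtime_reconcile_total{controller="aimmodel",result="success"} 40
controller_runtime_reconcile_errors_total{controller="aimmodel"} 0
workqueue_depth{name="aimservice"} 3
"""

# Simplified matcher: metric name, optional {labels}, then a single value.
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels, value) for each sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        labels = dict(LABEL.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def error_ratio_by_controller(text):
    """Return {controller: errors / total reconciliations}."""
    totals = defaultdict(float)
    errors = defaultdict(float)
    for name, labels, value in parse_samples(text):
        controller = labels.get("controller")
        if controller is None:
            continue  # e.g. workqueue_depth carries no controller label here
        if name == "controller_runtime_reconcile_total":
            totals[controller] += value
        elif name == "controller_runtime_reconcile_errors_total":
            errors[controller] += value
    return {c: (errors[c] / t if t else 0.0) for c, t in totals.items()}


ratios = error_ratio_by_controller(SAMPLE)
for controller, ratio in sorted(ratios.items()):
    print(f"{controller}: {ratio:.1%} of reconciles failed")
```

In a Prometheus setup the same ratio would normally be computed over a time window in a query rather than from a single scrape; the sketch only illustrates how the two counters relate.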

Logs

Format

Operator logs are JSON-formatted with these key fields:

| Field | Description | Example |
|---|---|---|
| `level` | Log level | `info`, `error`, `debug` |
| `controller` | Controller name | `artifact`, `service`, `model` |
| `namespace` | Resource namespace | `ml-team` |
| `name` | Resource name | `qwen-chat` |
| `condition` | Condition being updated | `Ready` |
| `status` | Condition status | `True`, `False` |
| `reason` | Condition reason | `RuntimeReady` |

Log Levels

Configure via operator flags:

| Flag | Values | Default |
|---|---|---|
| `--zap-log-level` | `debug`, `info`, `error`, or a positive integer (higher is more verbose) | `info` |
| `--zap-encoder` | `json`, `console` | `json` |
| `--zap-devel` | `true`, `false` | `false` (production mode) |

Enable debug logging in Helm:

helm install aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
  --version <version> \
  --namespace aim-system \
  --set 'manager.args={--leader-elect,--zap-log-level=debug}'

Useful Log Queries

# View operator logs
kubectl logs -n aim-system deployment/aim-engine-controller-manager -f

# Filter for errors
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.level == "error")'

# Filter by controller (use the exact value shown in the "controller" field of your logs)
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.controller == "aimservice")'

# Filter by namespace
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.namespace == "ml-team")'
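
The `jq` filters above assume every line is valid JSON; a stray plain-text line makes `jq` stop with a parse error. As a hedged alternative, the sketch below does the same field matching in Python and skips lines it cannot parse. The function name `filter_logs` and the sample records are hypothetical; the records use only fields from the format table.

```python
import json


def filter_logs(lines, **criteria):
    """Yield parsed log records whose fields equal every given criterion.

    Lines that are not JSON objects (for example stack traces) are skipped
    instead of aborting the whole run.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        if all(record.get(key) == value for key, value in criteria.items()):
            yield record


# Hypothetical log lines for illustration only; not captured operator output.
SAMPLE_LINES = [
    '{"level":"info","controller":"service","namespace":"ml-team","name":"qwen-chat","reason":"RuntimeReady"}',
    '{"level":"error","controller":"service","namespace":"ml-team","name":"qwen-chat","reason":"CacheFailed"}',
    'plain-text line that is not JSON',
    '{"level":"error","controller":"model","namespace":"other-team","name":"example-model"}',
]

errors_in_ml_team = list(filter_logs(SAMPLE_LINES, level="error", namespace="ml-team"))
for record in errors_in_ml_team:
    print(record["name"], record["reason"])
```

To use it on live logs, pass `sys.stdin` as `lines` and pipe the `kubectl logs` output into the script.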

Kubernetes Events

The operator emits Kubernetes Events on AIM resources when conditions change. Events provide a timeline of state transitions, visible via `kubectl describe`.

Event Types

| Type | When Emitted |
|---|---|
| `Normal` | Condition transitions to a healthy state |
| `Warning` | Condition transitions to an unhealthy state; for some critical conditions, repeated on every reconcile while the condition stays unhealthy |

Event Reasons

Events use the condition’s reason field as the event reason. Common event reasons:

AIMService:

| Reason | Type | Description |
|---|---|---|
| `ModelResolved` | Normal | Model found and ready |
| `ModelNotFound` | Warning | Referenced model does not exist |
| `Resolved` | Normal | Template resolved successfully |
| `TemplateSelectionAmbiguous` | Warning | Multiple templates scored equally |
| `CacheReady` | Normal | Model cache is populated |
| `CacheFailed` | Warning | Cache download failed |
| `RuntimeReady` | Normal | InferenceService is serving |
| `InvalidImageReference` | Warning | Model image URI is invalid |
| `PathTemplateInvalid` | Warning | Routing path template failed to resolve |

AIMModel:

| Reason | Type | Description |
|---|---|---|
| `AllTemplatesReady` | Normal | All discovered templates are ready |
| `AllTemplatesFailed` | Warning | All discovered templates failed |
| `MetadataExtractionFailed` | Warning | Failed to extract model metadata |

AIMArtifact:

| Reason | Type | Description |
|---|---|---|
| `Verified` | Normal | Download complete and verified |
| `Downloading` | Normal | Download in progress |

Viewing Events

# Events for a specific resource
kubectl describe aimservice qwen-chat -n <namespace>

# All AIM-related events in a namespace
kubectl get events -n <namespace> --field-selector involvedObject.apiVersion=aim.eai.amd.com/v1alpha1

Recurring Events

Some warning events are emitted on every reconcile (not just on transitions) for critical conditions that remain unhealthy. These are useful for alerting — a persistent stream of warnings indicates a stuck or failing resource.
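
As a rough illustration of alerting on such a stream, the sketch below reads the JSON produced by `kubectl get events -n <namespace> -o json` and reports Warning events whose accumulated `count` reaches a threshold. It assumes repeated events are folded into one Event object with an increasing `count`, which is common but not guaranteed. The function name, the threshold of 5, and the sample data are all hypothetical.

```python
import json
from collections import Counter


def recurring_warnings(event_list_json, threshold=5):
    """Return {(kind, name, reason): count} for Warning events at or over threshold."""
    items = json.loads(event_list_json).get("items", [])
    counts = Counter()
    for event in items:
        if event.get("type") != "Warning":
            continue
        involved = event.get("involvedObject", {})
        key = (involved.get("kind"), involved.get("name"), event.get("reason"))
        # "count" may be missing or null on some events; treat that as one occurrence.
        counts[key] += event.get("count") or 1
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical event list shaped like `kubectl get events -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"type": "Warning", "reason": "CacheFailed", "count": 12,
         "involvedObject": {"kind": "AIMService", "name": "qwen-chat"}},
        {"type": "Warning", "reason": "ModelNotFound", "count": 1,
         "involvedObject": {"kind": "AIMService", "name": "other-svc"}},
        {"type": "Normal", "reason": "RuntimeReady", "count": 30,
         "involvedObject": {"kind": "AIMService", "name": "qwen-chat"}},
    ]
})

stuck = recurring_warnings(SAMPLE)
for (kind, name, reason), n in stuck.items():
    print(f"{kind}/{name}: {reason} seen {n} times")
```

A suitable threshold depends on how often the operator reconciles, so the default here is only a starting point.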

See Conditions Reference for the full catalog of conditions and reasons.

Health Probes

The operator exposes health and readiness probes:

| Probe | Path | Port |
|---|---|---|
| Liveness | `/healthz` | 8081 |
| Readiness | `/readyz` | 8081 |

The Helm chart configures both probes on the operator Deployment automatically.
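
For manual checks, the probes can also be queried directly. The sketch below is an assumption-laden example rather than a supported tool: it assumes the probe port serves plain HTTP and has been made reachable locally, for instance with `kubectl port-forward -n aim-system deployment/aim-engine-controller-manager 8081:8081`. The function names and the default base URL are hypothetical.

```python
import urllib.error
import urllib.request

# Ignore proxy settings from the environment: the target is a local port-forward.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_ok(base_url, path, timeout=2.0):
    """Return True when a GET on base_url + path answers with HTTP 200."""
    url = base_url.rstrip("/") + path
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


def check_probes(base_url="http://127.0.0.1:8081"):
    """Check both documented probe paths and return {path: healthy}."""
    return {path: probe_ok(base_url, path) for path in ("/healthz", "/readyz")}
```

Calling `check_probes()` while the port-forward is running returns a dictionary such as `{"/healthz": True, "/readyz": True}` when both probes pass.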

Next Steps

Troubleshooting: Diagnosing common issues
CLI and Operator Flags: Full operator flag reference