Monitoring and Observability#

AIM Engine exposes metrics and structured logs for monitoring operator health and inference workloads.

Metrics#

Endpoint#

The controller exposes metrics on port 8443 (HTTPS by default). Configure via Helm:

Value

Default

Description

metrics.enable

true

Enable metrics endpoint

metrics.port

8443

Metrics port

Prometheus ServiceMonitor#

Enable automatic scraping with Prometheus:

helm install aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
  --version <version> \
  --namespace aim-system \
  --set prometheus.enable=true

This creates a ServiceMonitor resource that Prometheus Operator picks up automatically.

Controller Runtime Metrics#

AIM Engine exposes standard controller-runtime metrics:

  • controller_runtime_reconcile_total — Total reconciliations by controller and result

  • controller_runtime_reconcile_errors_total — Total reconciliation errors

  • controller_runtime_reconcile_time_seconds — Reconciliation duration

  • workqueue_depth — Current work queue depth per controller

Logs#

Format#

Operator logs are JSON-formatted with these key fields:

Field

Description

Example

level

Log level

info, error, debug

controller

Controller name

artifact, service, model

namespace

Resource namespace

ml-team

name

Resource name

qwen-chat

condition

Condition being updated

Ready

status

Condition status

True, False

reason

Condition reason

RuntimeReady

Log Levels#

Configure via operator flags:

Flag

Values

Default

--zap-log-level

debug, info, error, or integer

info

--zap-encoder

json, console

json

--zap-devel

false (production mode)

Enable debug logging in Helm:

helm install aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
  --version <version> \
  --namespace aim-system \
  --set 'manager.args={--leader-elect,--zap-log-level=debug}'

Useful Log Queries#

# View operator logs
kubectl logs -n aim-system deployment/aim-engine-controller-manager -f

# Filter for errors
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.level == "error")'

# Filter by controller
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.controller == "aimservice")'

# Filter by namespace
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.namespace == "ml-team")'

Kubernetes Events#

The operator emits Kubernetes Events on AIM resources when conditions change. Events provide a timeline of state transitions visible via kubectl describe.

Event Types#

Type

When Emitted

Normal

Condition transitions to a healthy state

Warning

Condition transitions to an unhealthy state, or persists unhealthy on every reconcile

Event Reasons#

Events use the condition’s reason field as the event reason. Common event reasons:

AIMService:

Reason

Type

Description

ModelResolved

Normal

Model found and ready

ModelNotFound

Warning

Referenced model does not exist

Resolved

Normal

Template resolved successfully

TemplateSelectionAmbiguous

Warning

Multiple templates scored equally

CacheReady

Normal

Model cache is populated

CacheFailed

Warning

Cache download failed

RuntimeReady

Normal

InferenceService is serving

InvalidImageReference

Warning

Model image URI is invalid

PathTemplateInvalid

Warning

Routing path template failed to resolve

AIMModel:

Reason

Type

Description

AllTemplatesReady

Normal

All discovered templates are ready

AllTemplatesFailed

Warning

All discovered templates failed

MetadataExtractionFailed

Warning

Failed to extract model metadata

AIMArtifact:

Reason

Type

Description

Verified

Normal

Download complete and verified

Downloading

Normal

Download in progress

Viewing Events#

# Events for a specific resource
kubectl describe aimservice qwen-chat -n <namespace>

# All AIM-related events in a namespace
kubectl get events -n <namespace> --field-selector involvedObject.apiVersion=aim.eai.amd.com/v1alpha1

Recurring Events#

Some warning events are emitted on every reconcile (not just on transitions) for critical conditions that remain unhealthy. These are useful for alerting — a persistent stream of warnings indicates a stuck or failing resource.

See Conditions Reference for the full catalog of conditions and reasons.

Health Probes#

The operator exposes health and readiness probes:

Probe

Path

Port

Liveness

/healthz

8081

Readiness

/readyz

8081

These are configured automatically in the Helm chart deployment.

Next Steps#