Scaling and Autoscaling#

AIM Engine supports static replica scaling and KEDA-based autoscaling with OpenTelemetry metrics.

Static Scaling#

Set a fixed number of replicas:

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  replicas: 3
```

Autoscaling with KEDA#

For demand-based scaling, use minReplicas and maxReplicas instead of replicas. AIM Engine configures KServe to use KEDA as the autoscaler.

Prerequisites#

Install KEDA and the OpenTelemetry integration:
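The exact commands depend on your cluster; as a sketch, KEDA itself is typically installed with its standard Helm chart, while the chart name and repository for the OTel scaler add-on below are assumptions to verify against your KEDA OpenTelemetry integration's documentation:

```shell
# Install KEDA (standard Helm chart)
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

# Install the KEDA OTel scaler add-on.
# NOTE: the repo and chart name here are assumptions; check the
# integration's own docs for the exact chart and values.
helm install keda-otel-scaler kedify/otel-add-on --namespace keda
```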

Basic Autoscaling#

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  minReplicas: 1
  maxReplicas: 4
```

AIM Engine automatically:

  1. Sets the KServe autoscaler class to keda

  2. Injects an OpenTelemetry sidecar for metrics collection

  3. Relies on KEDA to create the HPA (keda-hpa-{isvc-name}-predictor, where {isvc-name} is the derived InferenceService name)

Custom Metrics#

Override the default scaling behavior with custom metrics:

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  minReplicas: 1
  maxReplicas: 8
  autoScaling:
    metrics:
      - type: PodMetric
        podmetric:
          metric:
            backend: opentelemetry
            metricNames:
              - vllm:num_requests_running
            query: "vllm:num_requests_running"
            operationOverTime: "avg"
          target:
            type: Value
            value: "1"
```
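The same spec can target a per-pod average instead of an absolute value. The sketch below is illustrative only: the serverAddress shown is the documented default, the metric is switched to queue depth, and the target number is made up.

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  minReplicas: 1
  maxReplicas: 8
  autoScaling:
    metrics:
      - type: PodMetric
        podmetric:
          metric:
            backend: opentelemetry
            serverAddress: keda-otel-scaler.keda.svc:4317
            metricNames:
              - vllm:num_requests_waiting
            query: "vllm:num_requests_waiting"
            operationOverTime: "avg"
          target:
            type: AverageValue
            averageValue: "4"
```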

Available Metrics#

Common vLLM metrics for scaling decisions:

| Metric | Description | Use Case |
| --- | --- | --- |
| vllm:num_requests_running | Currently processing requests | Scale on active load |
| vllm:num_requests_waiting | Queued requests | Scale on queue depth |

Metric Configuration#

| Field | Description |
| --- | --- |
| backend | Metrics backend (opentelemetry) |
| serverAddress | KEDA OTel scaler address (default: keda-otel-scaler.keda.svc:4317) |
| metricNames | Metric names to query |
| query | Query expression |
| operationOverTime | Aggregation: last_one, avg, max, min, rate, count |

Target Types#

| Type | Field | Description |
| --- | --- | --- |
| Value | value | Scale when the metric exceeds this absolute value |
| AverageValue | averageValue | Scale when the per-pod average exceeds this value |
| Utilization | averageUtilization | Scale on percentage utilization |
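As a rough, non-AIM-specific illustration of the arithmetic behind an AverageValue target: the HPA scales toward ceil(total metric / per-pod target), clamped to the replica bounds. The numbers below are made up.

```shell
# Illustrative numbers: 12 queued requests total, per-pod target of 4,
# minReplicas=1, maxReplicas=8 -> the HPA wants ceil(12/4) = 3 replicas.
total=12; target=4; min_r=1; max_r=8
desired=$(( (total + target - 1) / target ))   # integer ceiling division
if (( desired < min_r )); then desired=$min_r; fi
if (( desired > max_r )); then desired=$max_r; fi
echo "$desired"   # prints 3
```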

Monitoring Scaling#

Check the current scaling state:

```shell
# AIMService status
kubectl get aimservice qwen-chat -o jsonpath='{.status.runtime}' | jq

# KEDA HPA status
kubectl get hpa -n <namespace> -l aim.eai.amd.com/service.name=qwen-chat
```

Next Steps#