Scaling and Autoscaling#

AIM Engine supports static replica scaling and KEDA-based autoscaling with OpenTelemetry metrics.

Static Scaling#

Set a fixed number of replicas:

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  replicas: 3
```

Autoscaling with KEDA#

For demand-based scaling, use minReplicas and maxReplicas instead of replicas. AIM Engine configures KServe to use KEDA as the autoscaler.

Prerequisites#

Install KEDA and the OpenTelemetry integration:
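The exact commands depend on your cluster; as a sketch, KEDA itself is typically installed with its standard Helm chart, while the chart name and repository for the OTel scaler add-on below are assumptions to verify against your KEDA OpenTelemetry integration's documentation:

```shell
# Install KEDA (standard Helm chart)
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

# Install the KEDA OTel scaler add-on.
# NOTE: the repo and chart name here are assumptions; check the
# integration's own docs for the exact chart and values.
helm install keda-otel-scaler kedify/otel-add-on --namespace keda
```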

Basic Autoscaling#

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  minReplicas: 1
  maxReplicas: 4
```

AIM Engine automatically:

  1. Sets the KServe autoscaler class to keda

  2. Injects an OpenTelemetry sidecar for metrics collection

  3. Relies on KEDA to create the HPA (keda-hpa-{isvc-name}-predictor, where {isvc-name} is the derived InferenceService name)

Custom Metrics#

Override the default scaling behavior with custom metrics:

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  minReplicas: 1
  maxReplicas: 8
  autoScaling:
    metrics:
      - type: PodMetric
        podmetric:
          metric:
            backend: opentelemetry
            metricNames:
              - vllm:num_requests_running
            query: "vllm:num_requests_running"
            operationOverTime: "avg"
          target:
            type: Value
            value: "1"
```
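The same spec can target a per-pod average instead of an absolute value. The sketch below is illustrative only: the serverAddress shown is the documented default, the metric is switched to queue depth, and the target number is made up.

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  minReplicas: 1
  maxReplicas: 8
  autoScaling:
    metrics:
      - type: PodMetric
        podmetric:
          metric:
            backend: opentelemetry
            serverAddress: keda-otel-scaler.keda.svc:4317
            metricNames:
              - vllm:num_requests_waiting
            query: "vllm:num_requests_waiting"
            operationOverTime: "avg"
          target:
            type: AverageValue
            averageValue: "4"
```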

Available Metrics#

Common vLLM metrics for scaling decisions:

| Metric | Description | Use Case |
| --- | --- | --- |
| vllm:num_requests_running | Currently processing requests | Scale on active load |
| vllm:num_requests_waiting | Queued requests | Scale on queue depth |

Metric Configuration#

| Field | Description |
| --- | --- |
| backend | Metrics backend (opentelemetry) |
| serverAddress | KEDA OTel scaler address (default: keda-otel-scaler.keda.svc:4317) |
| metricNames | Metric names to query |
| query | Query expression |
| operationOverTime | Aggregation: last_one, avg, max, min, rate, count |

Target Types#

| Type | Field | Description |
| --- | --- | --- |
| Value | value | Scale when the metric exceeds this absolute value |
| AverageValue | averageValue | Scale when the per-pod average exceeds this value |
| Utilization | averageUtilization | Scale on percentage utilization |
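As a rough, non-AIM-specific illustration of the arithmetic behind an AverageValue target: the HPA scales toward ceil(total metric / per-pod target), clamped to the replica bounds. The numbers below are made up.

```shell
# Illustrative numbers: 12 queued requests total, per-pod target of 4,
# minReplicas=1, maxReplicas=8 -> the HPA wants ceil(12/4) = 3 replicas.
total=12; target=4; min_r=1; max_r=8
desired=$(( (total + target - 1) / target ))   # integer ceiling division
if (( desired < min_r )); then desired=$min_r; fi
if (( desired > max_r )); then desired=$max_r; fi
echo "$desired"   # prints 3
```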

Monitoring Scaling#

Check the current scaling state:

```shell
# AIMService status
kubectl get aimservice qwen-chat -o jsonpath='{.status.runtime}' | jq

# KEDA HPA status
kubectl get hpa -n <namespace> -l aim.eai.amd.com/service.name=qwen-chat
```

Next Steps#