# Scaling and Autoscaling
AIM Engine supports static replica scaling and KEDA-based autoscaling with OpenTelemetry metrics.
## Static Scaling

Set a fixed number of replicas:

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  replicas: 3
```
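Applying the manifest and confirming the replica count can be sketched as follows. The filename and the label selector are illustrative assumptions; check the labels your deployment actually sets:

```shell
# Apply the service definition (assumes it is saved as qwen-chat.yaml)
kubectl apply -f qwen-chat.yaml

# Confirm three predictor pods come up; the label selector is illustrative
kubectl get pods -l serving.kserve.io/inferenceservice=qwen-chat
```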
## Autoscaling with KEDA

For demand-based scaling, use `minReplicas` and `maxReplicas` instead of `replicas`. AIM Engine configures KServe to use KEDA as the autoscaler.

### Prerequisites

Install KEDA and the OpenTelemetry integration:

- KEDA v2.18+
- KEDA OpenTelemetry scaler (`keda-otel-scaler`)
### Basic Autoscaling

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  minReplicas: 1
  maxReplicas: 4
```
AIM Engine automatically:

- Sets the KServe autoscaler class to `keda`
- Injects an OpenTelemetry sidecar for metrics collection

KEDA creates an HPA (`keda-hpa-{isvc-name}-predictor`, based on the derived InferenceService name).
## Custom Metrics

Override the default scaling behavior with custom metrics:

```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  minReplicas: 1
  maxReplicas: 8
  autoScaling:
    metrics:
      - type: PodMetric
        podmetric:
          metric:
            backend: opentelemetry
            metricNames:
              - vllm:num_requests_running
            query: "vllm:num_requests_running"
            operationOverTime: "avg"
          target:
            type: Value
            value: "1"
```
### Available Metrics

Common vLLM metrics for scaling decisions:

| Metric | Description | Use Case |
|---|---|---|
| `vllm:num_requests_running` | Currently processing requests | Scale on active load |
| `vllm:num_requests_waiting` | Queued requests | Scale on queue depth |
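To see what these counters report for a live replica, you can port-forward to the vLLM server and scrape its Prometheus endpoint. The pod name below is illustrative, and port 8000 is vLLM's default HTTP port; adjust both to your deployment:

```shell
# Forward the vLLM HTTP port from one predictor pod (pod name is illustrative)
kubectl port-forward pod/qwen-chat-predictor-0 8000:8000 &

# Scrape the Prometheus metrics endpoint and filter the request gauges
curl -s localhost:8000/metrics | grep -E 'vllm:num_requests_(running|waiting)'
```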
### Metric Configuration

| Field | Description |
|---|---|
| `backend` | Metrics backend (`opentelemetry`) |
| `serverAddress` | KEDA OTel scaler address (default: …) |
| `metricNames` | Metric names to query |
| `query` | Query expression |
| `operationOverTime` | Aggregation over time, e.g. `avg` |
### Target Types

| Type | Field | Description |
|---|---|---|
| `Value` | `value` | Scale when metric exceeds this absolute value |
| `AverageValue` | `averageValue` | Scale when per-pod average exceeds this value |
| `Utilization` | `averageUtilization` | Scale on percentage utilization |
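KEDA hands these targets to a Kubernetes HPA, which applies the standard scaling rule: desired = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured `minReplicas`/`maxReplicas`. A quick sketch of that rule (the helper function is illustrative, not part of AIM Engine or KEDA):

```shell
# desired_replicas CURRENT METRIC TARGET MIN MAX
# Computes ceil(CURRENT * METRIC / TARGET), clamped to [MIN, MAX].
desired_replicas() {
  awk -v cur="$1" -v metric="$2" -v target="$3" -v min="$4" -v max="$5" \
    'BEGIN {
       d = cur * metric / target
       d = (d == int(d)) ? d : int(d) + 1   # ceiling
       if (d < min) d = min
       if (d > max) d = max
       print d
     }'
}

# Target value 1 running request per pod, as in the example above:
# 2 replicas averaging 3 running requests each scales toward 6 pods.
desired_replicas 2 3 1 1 8   # prints 6
```

With `maxReplicas: 4` the same load would be capped at 4; a near-idle service drops back toward `minReplicas`.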
## Monitoring Scaling

Check the current scaling state:

```shell
# AIMService status
kubectl get aimservice qwen-chat -o jsonpath='{.status.runtime}' | jq

# KEDA HPA status
kubectl get hpa -n <namespace> -l aim.eai.amd.com/service.name=qwen-chat
```
## Next Steps

- Deploying Services — Full service configuration reference
- Monitoring — Metrics and observability