# Quickstart
Deploy your first inference service in minutes.
## Prerequisites

- AIM Engine installed on your cluster
- AMD GPUs available in the cluster
- `kubectl` configured to access your cluster
## Step 1: Check Available Models
If you enabled model discovery during installation, models are already available:
```shell
kubectl get aimclustermodels
```
If no models are listed, create one manually (optional):
```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMClusterModel
metadata:
  name: qwen3-32b
spec:
  image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
```

Save this as `model.yaml`, then apply it:

```shell
kubectl apply -f model.yaml
```
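If you want to consume the model catalog programmatically, `kubectl get aimclustermodels -o json` returns a standard Kubernetes `List` object. A minimal Python sketch for extracting names and images from that output (the sample document below is illustrative, not real cluster output):

```python
import json

def model_images(kubectl_json: str) -> dict:
    """Map AIMClusterModel names to their container images."""
    doc = json.loads(kubectl_json)
    return {
        item["metadata"]["name"]: item["spec"]["image"]
        for item in doc.get("items", [])
    }

# Illustrative shape of `kubectl get aimclustermodels -o json` output:
sample = """{
  "apiVersion": "v1",
  "kind": "List",
  "items": [
    {"metadata": {"name": "qwen3-32b"},
     "spec": {"image": "amdenterpriseai/aim-qwen-qwen3-32b:0.8.5"}}
  ]
}"""
print(model_images(sample))
```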
## Step 2: Deploy an Inference Service
Create an AIMService to deploy the model:
```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
  namespace: default
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
```

Save this as `service.yaml`, then apply it:

```shell
kubectl apply -f service.yaml
```
AIM Engine automatically:

- Resolves or creates a matching model
- Selects the best runtime template for your GPU hardware
- Downloads the model weights (this can take several minutes for large models)
- Creates a KServe InferenceService once the download completes
- Starts serving the model
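Since an AIMService is a plain Kubernetes object, the manifest above can also be generated programmatically. A minimal sketch; the JSON output is equivalent to the YAML and is also accepted by `kubectl apply -f`:

```python
import json

def aim_service(name: str, image: str, namespace: str = "default") -> dict:
    """Build an AIMService manifest matching the example in this guide."""
    return {
        "apiVersion": "aim.eai.amd.com/v1alpha1",
        "kind": "AIMService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"model": {"image": image}},
    }

manifest = aim_service("qwen-chat", "amdenterpriseai/aim-qwen-qwen3-32b:0.8.5")
# Pipe this to `kubectl apply -f -` to create the service.
print(json.dumps(manifest, indent=2))
```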
## Caching
Model weights are always downloaded to a persistent volume before the InferenceService starts. The caching mode controls whether that PVC is shared or isolated:
- **Shared** (default) — The PVC is shared across all services using the same template. Once one service downloads the model, others reuse it immediately.
- **Dedicated** — Each service gets its own PVC, isolated from other services.
```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
  namespace: default
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  caching:
    mode: Dedicated
```
See Model Caching for more on caching modes and configuration.
## Step 3: Monitor Progress
Watch the service status:
```shell
kubectl get aimservice qwen-chat -w
```
The status progresses through `Pending` → `Starting` → `Running`. The service pauses in `Starting` while model weights are downloaded.
For more detail, check the conditions:
```shell
kubectl get aimservice qwen-chat -o jsonpath='{.status.conditions}' | jq
```
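The jsonpath query above returns a standard Kubernetes condition list. A small sketch for summarizing it, useful when scripting health checks; the condition type, reason, and message in the sample are illustrative placeholders, not guaranteed AIMService values:

```python
import json

def failing_conditions(conditions: list) -> list:
    """Return a summary line for every condition whose status is not True."""
    return [
        f"{c.get('type')}: {c.get('reason', '')} {c.get('message', '')}".strip()
        for c in conditions
        if c.get("status") != "True"
    ]

# Illustrative condition list, shaped like the jq output above:
sample = json.loads("""[
  {"type": "Ready", "status": "False", "reason": "Downloading",
   "message": "model weights are being fetched"}
]""")
print(failing_conditions(sample))
```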
## Step 4: Send a Request
Once the service is `Running`, find the inference endpoint:

```shell
kubectl get inferenceservice -n default -l aim.eai.amd.com/service.name=qwen-chat
```
InferenceService names are derived, so use the name returned by the command above and port-forward its predictor service:
```shell
kubectl port-forward -n default svc/<isvc-name>-predictor 8080:80
```

Then send a chat request:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-chat",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
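The same request can be issued from Python with only the standard library. A minimal sketch, assuming the port-forward above is active on `localhost:8080`; `qwen-chat` matches the AIMService name used throughout this guide:

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """Chat-completions request body matching the curl example."""
    return {
        "model": "qwen-chat",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> dict:
    """POST the request to the forwarded endpoint and return the parsed reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires the port-forward to be running):
#   reply = chat("Hello!")
```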
## Next Steps
- Deploying Services — Scaling, caching, routing, and more configuration options
- Model Catalog — Browse and manage available models
- Architecture — Understand how AIM Engine components work together