KServe Deployment#
This guide provides step-by-step instructions for deploying the AIM inference server on a Kubernetes cluster with KServe. KServe's InferenceService abstraction simplifies model deployment and management on Kubernetes infrastructure. The guide covers the prerequisites, the deployment process, and how to test the endpoint to verify that everything works correctly.
Overview#
A KServe deployment uses two main resources:
ClusterServingRuntime - Defines the container image, ports, and configuration for the AIM inference engine
InferenceService - Declares which model to serve, resource requirements, and scaling policies
This approach separates the runtime configuration from the model deployment, allowing you to reuse the same runtime for multiple models and manage them independently.
Prerequisites#
Kubernetes cluster with kubectl configured (v1.32.8+rke2r1)
KServe installed on the cluster (v0.15.2)
AMD GPU with ROCm support (e.g., MI300X)
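Before proceeding, you can sanity-check these prerequisites from the command line. The commands below are a sketch; the amd.com/gpu resource name assumes the AMD GPU device plugin is installed on the cluster.
# Confirm kubectl can reach the cluster and report client/server versions
kubectl version
# Confirm the KServe CRDs are installed
kubectl get crd inferenceservices.serving.kserve.io clusterservingruntimes.serving.kserve.io
# Confirm at least one node advertises allocatable AMD GPUs
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.amd\.com/gpu'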
Deployment#
1. Install the Serving Runtime#
The ClusterServingRuntime defines the AIM container image and configuration used by the inference service.
Create a file named servingruntime-aim-qwen3-32b.yaml with the following contents:
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: aim-qwen3-32b-runtime
spec:
  supportedModelFormats:
    - name: aim-qwen3-32b
  containers:
    - name: kserve-container
      image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.4
      imagePullPolicy: Always
      ports:
        - name: http
          containerPort: 8000
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 8Gi
Deploy the serving runtime:
kubectl apply -f servingruntime-aim-qwen3-32b.yaml
Expected output:
clusterservingruntime.serving.kserve.io/aim-qwen3-32b-runtime created
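Optionally, confirm that the runtime was registered before moving on:
kubectl get clusterservingruntime aim-qwen3-32b-runtime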
2. Deploy the Inference Service#
The InferenceService creates a model serving endpoint using the pre-built Qwen3 32B model.
Create a file named aim-qwen3-32b.yaml with the following contents:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: aim-qwen3-32b
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      runtime: aim-qwen3-32b-runtime
      modelFormat:
        name: aim-qwen3-32b
      resources:
        limits:
          memory: "128Gi"
          cpu: "8"
          amd.com/gpu: "1"
        requests:
          memory: "64Gi"
          cpu: "4"
          amd.com/gpu: "1"
Deploy the inference service:
kubectl apply -f aim-qwen3-32b.yaml
Expected output:
inferenceservice.serving.kserve.io/aim-qwen3-32b created
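The model image is large, so the first rollout can take several minutes while the image is pulled and the model weights are loaded. You can watch progress with the commands below; the pod label selector shown is the one KServe typically applies and may differ in your installation.
# Wait for the InferenceService to report READY=True
kubectl get inferenceservice aim-qwen3-32b -w
# Inspect the predictor pod and its logs if the service stays unready
kubectl get pods -l serving.kserve.io/inferenceservice=aim-qwen3-32b
kubectl logs -l serving.kserve.io/inferenceservice=aim-qwen3-32b -c kserve-container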
3. Port forward the service to access it locally#
KServe automatically creates a service with the name <inferenceservice-name>-predictor (in this case aim-qwen3-32b-predictor) that exposes port 80 by default.
kubectl port-forward service/aim-qwen3-32b-predictor 8000:80
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
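With the port-forward running in a separate terminal, you can optionally confirm the server is up before sending a full request. This sketch assumes the AIM container exposes the standard vLLM-style health and model-listing endpoints, which may vary by image version.
# Liveness check (returns HTTP 200 when the server is ready)
curl -i http://localhost:8000/health
# List the model(s) served by this endpoint
curl http://localhost:8000/v1/models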
4. Test the inference endpoint#
Make a request to the inference endpoint using curl:
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
Expected output:
{
  "id": "cmpl-bfb8650209b74010b2a89489b31d8c7c",
  "object": "text_completion",
  "created": 1762361538,
  "model": "Qwen/Qwen3-32B-FP8",
  "choices": [
    {
      "index": 0,
      "text": " city that has long been a beacon",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 11,
    "completion_tokens": 7,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
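The completion response follows the OpenAI-compatible schema, so you can also exercise the chat endpoint. The request below is a sketch and assumes the image exposes /v1/chat/completions for the served model name shown above (Qwen/Qwen3-32B-FP8).
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B-FP8",
    "messages": [
      {"role": "user", "content": "Give me a one-sentence fact about San Francisco."}
    ],
    "max_tokens": 64,
    "temperature": 0
  }'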
Cleanup#
Remove the deployed resources:
kubectl delete inferenceservice aim-qwen3-32b
kubectl delete clusterservingruntime aim-qwen3-32b-runtime
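To verify that both resources were removed, re-run the corresponding get commands; each should report NotFound once cleanup is complete:
kubectl get inferenceservice aim-qwen3-32b
kubectl get clusterservingruntime aim-qwen3-32b-runtime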