Pure Kubernetes Deployment#

This guide provides step-by-step instructions for deploying AIM on a Kubernetes cluster. Using the provided example manifests, you can set up and run an AI model on your own Kubernetes infrastructure. This document covers the prerequisites, the deployment process, and how to test the endpoint to confirm everything is working correctly.

Prerequisites#

  • Kubernetes cluster with kubectl configured (tested with v1.32.8+rke2r1; see the quick check below)

  • AMD GPU with ROCm support (e.g., MI300X)
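
Before you begin, a quick check that kubectl is configured and can reach the cluster:

kubectl get nodes -o wide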

Deployment#

1. Create a secret#

AIM images are hosted in the Docker Hub container registry. The images are public, so no authentication is required to pull them. However, some models are gated on Hugging Face and require a token to download. Therefore, you need to create a Kubernetes secret holding that token.

Create a secret containing your Hugging Face token, which AIM uses to download models:

kubectl create secret generic hf-token \
    --from-literal="hf-token=YOUR_HUGGINGFACE_TOKEN" \
    -n YOUR_K8S_NAMESPACE

Expected output:

secret/hf-token created
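
Verify that the secret exists:

kubectl get secret hf-token -n YOUR_K8S_NAMESPACE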

2. Install the AMD device plugin if it is not already installed#

Fetch the plugin manifest and create the DaemonSet:

kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml

Expected output:

daemonset.apps/amdgpu-device-plugin-daemonset created
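
Once the plugin pods are running, each GPU node should advertise an amd.com/gpu resource in its capacity. One way to check:

kubectl describe nodes | grep amd.com/gpu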

3. Deploy Kubernetes manifest#

Example of deployment manifest#

Here is an example deployment.yaml for deploying AIM with a specific model; a corresponding service.yaml follows. The timeouts are deliberately generous: progressDeadlineSeconds allows an hour for the rollout and the startup probe allows up to 10 minutes (60 checks at 10-second intervals), since the first start must pull the image and download model weights. The 256Gi emptyDir gives the container ample scratch space at /tmp. Adjust these values for your model size and network speed.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: minimal-aim-deployment
  labels:
    app: minimal-aim-deployment
spec:
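  # Generous rollout deadline (1 hour): the first start pulls the image and downloads model weights.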
  progressDeadlineSeconds: 3600
  replicas: 1
  selector:
    matchLabels:
      app: minimal-aim-deployment
  template:
    metadata:
      labels:
        app: minimal-aim-deployment
    spec:
      containers:
        - name: minimal-aim-deployment
          image: "amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.8.4"
          imagePullPolicy: Always
          env:
            - name: AIM_PRECISION
              value: "auto"
            - name: AIM_GPU_COUNT
              value: "1"
            - name: AIM_GPU_MODEL
              value: "auto"
            - name: AIM_ENGINE
              value: "vllm"
            - name: AIM_METRIC
              value: "latency"
            - name: AIM_LOG_LEVEL_ROOT
              value: "INFO"
            - name: AIM_LOG_LEVEL
              value: "INFO"
            - name: AIM_PORT
              value: "8000"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: hf-token
          ports:
            - name: http
              containerPort: 8000
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
              amd.com/gpu: "1"
            limits:
              memory: "16Gi"
              cpu: "4"
              amd.com/gpu: "1"
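          # Give the model server up to 10 minutes (60 failures x 10s period) to come up.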
          startupProbe:
            httpGet:
              path: /v1/models
              port: http
            periodSeconds: 10
            failureThreshold: 60
          livenessProbe:
            httpGet:
              path: /health
              port: http
          readinessProbe:
            httpGet:
              path: /v1/models
              port: http
          volumeMounts:
            - name: ephemeral-storage
              mountPath: /tmp
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: ephemeral-storage
          emptyDir:
            sizeLimit: 256Gi
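        # Memory-backed /dev/shm for the inference engine's shared-memory communication.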
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi

Example of service.yaml#

apiVersion: v1
kind: Service
metadata:
  name: minimal-aim-deployment
  labels:
    app: minimal-aim-deployment
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8000
  selector:
    app: minimal-aim-deployment

Apply the manifests#

Deploy AIM with the specific model by applying both manifests from the current directory:

kubectl apply -f . -n YOUR_K8S_NAMESPACE

Expected output:

deployment.apps/minimal-aim-deployment created
service/minimal-aim-deployment created
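
The first rollout can take several minutes while the model downloads. Watch its progress with:

kubectl rollout status deployment/minimal-aim-deployment -n YOUR_K8S_NAMESPACE
kubectl get pods -n YOUR_K8S_NAMESPACE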

Testing#

1. Port-forward the service to access it locally#

Forward local port 8000 to the service:

kubectl port-forward service/minimal-aim-deployment 8000:80 -n YOUR_K8S_NAMESPACE

Expected output:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

2. Test the inference endpoint#

Make a request to the inference endpoint using curl:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

Expected output:

{
  "id": "cmpl-703ff7b124a944849d64d063720a28f4",
  "object": "text_completion",
  "created": 1758657978,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " city that is known for its v",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
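
Since the example sets AIM_ENGINE to vllm, the OpenAI-compatible chat endpoint should also be available (this assumes the image exposes the standard vLLM API surface):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Name three uses of Kubernetes."}],
        "max_tokens": 64,
        "temperature": 0
    }'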

Removing the deployment#

To remove the deployment and service, run:

kubectl delete -f . -n YOUR_K8S_NAMESPACE

Expected output:

deployment.apps "minimal-aim-deployment" deleted
service "minimal-aim-deployment" deleted