Model Caching#

Model caching pre-downloads model artifacts to shared persistent volumes, reducing startup time and bandwidth usage across service replicas and restarts.

Caching Modes#

Control caching behavior with spec.caching.mode:

Mode

Behavior

Shared

Reuses shared cache assets across services that use the same template. This is the default.

Dedicated

Creates service-owned dedicated cache assets, isolated from other services.

apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  caching:
    mode: Shared

Note

The caching mode is immutable after creation. Legacy values Always, Auto, and Never are accepted for backward compatibility (Always/Auto map to Shared, Never maps to Dedicated).

When caching is active, AIM Engine creates a hierarchy of resources:

  1. AIMTemplateCache — Groups all model artifacts for a specific template on a shared PVC

  2. AIMArtifact — Manages the download of individual model sources

  3. PVC + Download Job — The actual storage and download execution

The template cache is owned by the template (not the service), so multiple services sharing the same template reuse the same cache.

Download Protocols#

For Hugging Face models (hf:// sources), AIM Engine tries download protocols in sequence:

Protocol

Description

XET

XetHub protocol (fastest for large models)

HF_TRANSFER

Hugging Face Transfer (optimized multi-part download)

HTTP

Standard HTTP download (slowest, most compatible)

The default order is XET,HF_TRANSFER. If a protocol fails, the downloader cleans up incomplete files and tries the next one.

Override the protocol order via runtime configuration:

apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMRuntimeConfig
metadata:
  name: default
  namespace: ml-team
spec:
  env:
    - name: AIM_DOWNLOADER_PROTOCOL
      value: "HF_TRANSFER,HTTP"

Storage Sizing#

AIM Engine automatically sizes PVCs based on discovered model sizes plus a headroom percentage (default 10%). Configure the headroom via cluster runtime config:

apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMClusterRuntimeConfig
metadata:
  name: default
spec:
  storage:
    pvcHeadroomPercent: 15
    defaultStorageClassName: longhorn

Monitoring Cache Status#

Check the status of template caches and artifacts:

# List template caches
kubectl get aimtemplatecache -n <namespace>

# List artifacts
kubectl get aimartifact -n <namespace>

# Check artifact download progress
kubectl get aimartifact <name> -n <namespace> -o jsonpath='{.status}' | jq

Next Steps#