Docker deployment

This guide provides step-by-step instructions for deploying the AMD Inference Microservice (AIM) container, which supports different variants of the Llama-3.1-8B-Instruct model. Follow these instructions to get a model running on AMD GPUs quickly. For more detailed information, refer to the main README.

Prerequisites

AMD GPU with ROCm support (e.g., MI300X, MI325X for Instinct; W7900, R9700 for Radeon Pro)
Docker installed and configured with GPU support
Access to model repositories (Hugging Face account with appropriate permissions for gated models)
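
The prerequisites above can be checked on the host before starting a container. The following is a minimal sketch, not part of AIM: the helper names check_path and check_command are hypothetical, and the two device paths are the ones passed with --device in the commands throughout this guide. It only reports presence; it does not verify that the ROCm driver or GPU is working.

```shell
#!/usr/bin/env bash
# Preflight sketch. The helper names are hypothetical and not part of AIM.

# Report whether a file or device node exists on the host.
check_path() {
  if [ -e "$1" ]; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

# Report whether a command is available on PATH.
check_command() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

check_command docker
check_path /dev/kfd   # AMD GPU compute device, passed via --device
check_path /dev/dri   # GPU render nodes, passed via --device
```

Any line starting with "missing:" points at a prerequisite to fix before continuing.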

1. Docker deployment

1.1 Running the container

docker run \
  -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
  --device=/dev/kfd --device=/dev/dri \
  -p 8000:8000 \
  amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0

Here, <YOUR_HUGGINGFACE_TOKEN> is your Hugging Face access token, which is required for gated models.
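
Typing the token inline leaves it in shell history. One possible alternative, sketched below, relies on standard Docker behavior: an -e flag with a variable name but no value copies that variable from the calling shell's environment. The helper name build_run_cmd is hypothetical, and the sketch assumes HF_TOKEN has already been exported in the current shell.

```shell
#!/usr/bin/env bash
# Sketch: build the docker run command without embedding the token value.
# build_run_cmd is a hypothetical helper; IMAGE is the tag used in this guide.
IMAGE="amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0"

build_run_cmd() {
  # "-e HF_TOKEN" (no "=value") makes Docker read HF_TOKEN from the
  # environment of the shell that launches the container.
  echo "docker run -e HF_TOKEN --device=/dev/kfd --device=/dev/dri -p 8000:8000 $IMAGE"
}

# Print the command for review before running it.
build_run_cmd
```

After exporting HF_TOKEN, the printed command can be pasted into the shell as-is.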

1.2 Customizing deployment with environment variables

Customize your deployment with optional environment variables. The example below sets AIM_PORT to 8080 instead of 8000, AIM_METRIC to throughput instead of latency, AIM_GPU_COUNT to 1 instead of auto, and AIM_PRECISION to fp16 instead of auto. Note that the published port in -p must match AIM_PORT.

docker run \
  -e AIM_PRECISION=fp16 \
  -e AIM_GPU_COUNT=1 \
  -e AIM_METRIC=throughput \
  -e AIM_PORT=8080 \
  --device=/dev/kfd --device=/dev/dri \
  -p 8080:8080 \
  amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0
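
Because AIM_PORT and the -p mapping have to agree, it can help to derive both from a single value. The sketch below shows one way to do that; aim_port_args is a hypothetical helper, not an AIM command, and it assumes the host port should equal the container port as in the example above.

```shell
#!/usr/bin/env bash
# Sketch: derive AIM_PORT and the -p mapping from one value.
# aim_port_args is a hypothetical helper, not part of AIM.
aim_port_args() {
  # Fall back to 8000, the port used when AIM_PORT is not set in this guide.
  echo "-e AIM_PORT=${1:-8000} -p ${1:-8000}:${1:-8000}"
}

aim_port_args 8080
```

The printed fragment can be spliced into the docker run command in place of the separate -e AIM_PORT and -p lines.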

To override automatic profile selection, specify a profile directly. The example below sets AIM_PROFILE_ID to vllm-mi300x-fp8-tp1-latency. The values of all other environment variables are then set implicitly according to the specified profile.

docker run \
  -e AIM_PROFILE_ID=vllm-mi300x-fp8-tp1-latency \
  --device=/dev/kfd --device=/dev/dri \
  -p 8000:8000 \
  amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0
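
The profile ID appears to encode the settings it implies. The sketch below splits an ID into fields purely for illustration. The field order (engine, GPU, precision, tensor-parallel size, metric) is an assumption inferred from the single example ID in this guide, not a documented format, and parse_profile_id is a hypothetical helper. Use the list-profiles subcommand for the authoritative set of IDs.

```shell
#!/usr/bin/env bash
# Sketch: split a profile ID on "-". The field meanings are an ASSUMPTION
# inferred from one example ID; they are not a documented AIM format.
parse_profile_id() {
  old_ifs=$IFS
  IFS=-
  # Re-split the ID into this function's positional parameters.
  set -- $1
  IFS=$old_ifs
  echo "engine=$1 gpu=$2 precision=$3 tp=${4#tp} metric=$5"
}

parse_profile_id vllm-mi300x-fp8-tp1-latency
```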

2. Model caching for production

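For production environments, pre-download models to a persistent cache on the host and mount it into the container, so that the model is not downloaded again on every start.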

2.1 Download model to cache

# Create persistent cache directory
mkdir -p /path/to/model-cache

# Download model using the download-to-cache command
docker run --rm \
  -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
  -v /path/to/model-cache:/workspace/model-cache \
  amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0 \
  download-to-cache --model-id meta-llama/Llama-3.1-8B-Instruct

2.2 Run with pre-cached model

docker run \
  -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
  -v /path/to/model-cache:/workspace/model-cache \
  --device=/dev/kfd --device=/dev/dri \
  -p 8000:8000 \
  amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0
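
Before starting the container against a cache, a quick guard can catch a missing or empty cache directory. This is a minimal sketch: cache_is_populated is a hypothetical helper, CACHE_DIR is an illustrative variable, and the check only confirms that the directory contains something, not that the download completed or matches the model.

```shell
#!/usr/bin/env bash
# Sketch: check the host cache directory before running the container.
# cache_is_populated is a hypothetical helper, not part of AIM.
cache_is_populated() {
  # True only if the directory exists and has at least one entry.
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

CACHE_DIR="${CACHE_DIR:-/path/to/model-cache}"
if cache_is_populated "$CACHE_DIR"; then
  echo "cache ready: $CACHE_DIR"
else
  echo "cache empty or missing: $CACHE_DIR (run download-to-cache first)"
fi
```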

3. Monitoring and troubleshooting

3.1 Getting help on the commands

A general help command is available as follows:

docker run \
  amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0 \
  --help

A help command for specific subcommands is also available:

docker run \
  amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0 \
  <subcommand> --help

3.2 Enabling detailed logging

docker run \
  -e AIM_LOG_LEVEL=DEBUG \
  -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
  --device=/dev/kfd --device=/dev/dri \
  -p 8000:8000 \
  amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0

3.3 Checking profile selection results

To check which profile AIM selects for a given set of environment variables, run the dry-run subcommand:

docker run \
  -e AIM_GPU_COUNT=1 \
  -e AIM_PRECISION=fp16 \
  -e AIM_GPU_MODEL=MI300X \
  -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
  amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0 \
  dry-run
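
To compare how the selection changes with one variable, the dry-run command can be generated once per value. The sketch below does this for the two AIM_METRIC values mentioned in this guide. dry_run_cmd is a hypothetical helper, and the sketch assumes HF_TOKEN is exported in the calling shell so that -e HF_TOKEN can pass it through. It prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Sketch: print one dry-run command per AIM_METRIC value.
# dry_run_cmd is a hypothetical helper; IMAGE is the tag used in this guide.
IMAGE="amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0"

dry_run_cmd() {
  # --rm removes the container once the dry run exits.
  echo "docker run --rm -e AIM_METRIC=$1 -e AIM_GPU_MODEL=MI300X -e HF_TOKEN $IMAGE dry-run"
}

for metric in latency throughput; do
  dry_run_cmd "$metric"
done
```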

3.4 Listing available profiles

docker run \
  amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0 \
  list-profiles