Docker deployment
This guide provides step-by-step instructions for deploying an AMD Inference Microservice (AIM) container that supports different variants of the Llama-3.1-8B-Instruct model. Follow these instructions to get an AI model running on AMD GPUs quickly. For more detailed information, refer to the main README.
Prerequisites
AMD GPU with ROCm support (e.g., MI300X, MI325X for Instinct; W7900, R9700 for Radeon Pro)
Docker installed and configured with GPU support
Access to model repositories (Hugging Face account with appropriate permissions for gated models)
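Before pulling the image, it can help to confirm that the host exposes the GPU device nodes this guide passes to Docker. The sketch below is a minimal preflight check and is not part of AIM: missing_prereqs is a hypothetical helper name, and it only tests that /dev/kfd, /dev/dri, and the docker command exist, which is necessary but not sufficient for a working ROCm setup.

```shell
# Hypothetical preflight helper (not part of AIM).
# Prints one line per missing prerequisite; prints nothing if all are found.
# Arguments: device paths to check (defaults to the two devices used in this guide).
missing_prereqs() {
    [ "$#" -gt 0 ] || set -- /dev/kfd /dev/dri
    for _path in "$@"; do
        [ -e "$_path" ] || echo "missing device: $_path"
    done
    command -v docker >/dev/null 2>&1 || echo "missing command: docker"
}

# Check the default devices and the docker binary on this host.
missing_prereqs
```

An empty result only means the basics are in place; driver or permission problems would still surface when the container starts.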
1. Docker deployment
1.1 Running the container
docker run \
-e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
--device=/dev/kfd --device=/dev/dri \
-p 8000:8000 \
amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0
Here, <YOUR_HUGGINGFACE_TOKEN> is your Hugging Face access token, which is required for gated models.
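Downloading and loading the model may take some time, so a script that calls the service right after docker run may need to wait for it. The helper below is a generic retry loop, sketched under assumptions: wait_until and WAIT_INTERVAL are hypothetical names, and the probe URL in the comment assumes the container serves vLLM's OpenAI-compatible /v1/models route on the published port, which this guide does not confirm.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until MAX_ATTEMPTS COMMAND [ARGS...]
# WAIT_INTERVAL sets the pause in seconds between attempts (default 2).
wait_until() {
    _attempts=$1
    shift
    _i=1
    while [ "$_i" -le "$_attempts" ]; do
        # Discard the probe's output; only its exit status matters.
        "$@" >/dev/null 2>&1 && return 0
        _i=$((_i + 1))
        sleep "${WAIT_INTERVAL:-2}"
    done
    return 1
}

# Example probe (ASSUMPTION: the endpoint path is not documented in this guide):
# wait_until 60 curl -fsS http://localhost:8000/v1/models
```

If the assumed route does not exist in your image, any command that exits with status 0 once the service is ready can be substituted as the probe.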
1.2 Customizing deployment with environment variables
Customize your deployment with optional environment variables. The example below sets AIM_PORT to 8080 instead
of 8000, AIM_METRIC to throughput instead of latency, AIM_GPU_COUNT to 1 instead of auto, and
AIM_PRECISION to fp16 instead of auto.
docker run \
-e AIM_PRECISION=fp16 \
-e AIM_GPU_COUNT=1 \
-e AIM_METRIC=throughput \
-e AIM_PORT=8080 \
--device=/dev/kfd --device=/dev/dri \
-p 8080:8080 \
amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0
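In the example above, the value of AIM_PORT and the -p mapping both use 8080. Presumably the container side of the mapping has to match AIM_PORT for the service to be reachable. A minimal sketch of one way to keep the two in sync is shown below: aim_run_command is a hypothetical helper, not an AIM command, and it only prints the docker run line so it can be reviewed before being executed.

```shell
# Hypothetical helper (not part of AIM): print a docker run command in which
# AIM_PORT and both sides of the -p mapping are derived from a single port.
aim_run_command() {
    _port=${1:-8000}
    printf 'docker run -e AIM_PORT=%s --device=/dev/kfd --device=/dev/dri -p %s:%s %s\n' \
        "$_port" "$_port" "$_port" \
        "amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0"
}

# Print the command for port 8080; nothing is executed.
aim_run_command 8080
```

Printing rather than executing keeps the sketch safe to try on a machine without a GPU; the remaining -e options from the example can be added to the format string in the same way.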
Override automatic profile selection by specifying a profile directly. In the example below, AIM_PROFILE_ID is set to
vllm-mi300x-fp8-tp1-latency. The values of all other environment variables are then set implicitly by the specified
profile.
docker run \
-e AIM_PROFILE_ID=vllm-mi300x-fp8-tp1-latency \
--device=/dev/kfd --device=/dev/dri \
-p 8000:8000 \
amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0
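The profile ID used above appears to pack several settings into one hyphen-separated name. The sketch below splits such an ID into labeled fields to make it easier to read. The five-field layout (engine, GPU, precision, parallelism, metric) is inferred from this single example and is an assumption, not a documented format; describe_profile_id is a hypothetical helper.

```shell
# Hypothetical helper: split a profile ID on hyphens and label the parts.
# ASSUMPTION: IDs have exactly five fields, as in vllm-mi300x-fp8-tp1-latency.
describe_profile_id() {
    _old_ifs=$IFS
    IFS=-
    # Unquoted on purpose: word splitting on "-" fills the positional parameters.
    set -- $1
    IFS=$_old_ifs
    if [ "$#" -ne 5 ]; then
        echo "unrecognized profile id layout" >&2
        return 1
    fi
    printf 'engine=%s gpu=%s precision=%s parallelism=%s metric=%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

describe_profile_id vllm-mi300x-fp8-tp1-latency
```

For the authoritative set of profile IDs in a given image, rely on the list-profiles subcommand rather than on this inferred pattern.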
2. Model caching for production
For production environments, pre-download models to a persistent cache:
2.1 Download model to cache
# Create persistent cache directory
mkdir -p /path/to/model-cache
# Download model using the download-to-cache command
docker run --rm \
-e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
-v /path/to/model-cache:/workspace/model-cache \
amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0 \
download-to-cache --model-id meta-llama/Llama-3.1-8B-Instruct
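Before starting a production container against the cache, a quick check that the download actually left files in the directory can catch a failed or skipped download early. The following is a minimal sketch: cache_is_populated is a hypothetical helper, and it only checks that the directory is non-empty, not that the model files are complete or valid.

```shell
# Hypothetical helper: succeed if DIR exists and holds at least one entry
# (hidden entries included). It does not validate the model files themselves.
cache_is_populated() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# /path/to/model-cache is the placeholder path used in this guide.
if cache_is_populated /path/to/model-cache; then
    echo "cache ready"
else
    echo "cache empty or missing"
fi
```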
2.2 Run with pre-cached model
docker run \
-e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
-v /path/to/model-cache:/workspace/model-cache \
--device=/dev/kfd --device=/dev/dri \
-p 8000:8000 \
amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0
3. Monitoring and troubleshooting
3.1 Getting help on the commands
A general help command is available as follows:
docker run \
amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0 \
--help
A help command for specific subcommands is also available:
docker run \
amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0 \
<subcommand> --help
3.2 Enabling detailed logging
docker run \
-e AIM_LOG_LEVEL=DEBUG \
-e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
--device=/dev/kfd --device=/dev/dri \
-p 8000:8000 \
amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0
3.3 Checking profile selection results
To check which profile AIM selects for a given set of environment variables, use the dry-run subcommand:
docker run \
-e AIM_GPU_COUNT=1 \
-e AIM_PRECISION=fp16 \
-e AIM_GPU_MODEL=MI300X \
-e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0 \
dry-run
3.4 List available profiles
docker run \
amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.11.0 \
list-profiles