LLM Inference Benchmarking Workload#
This Helm chart submits a job to benchmark the performance of vLLM running a model in the same container.
Prerequisites#
Helm: Install helm. Refer to the Helm documentation for instructions.
MinIO Storage (optional): To use pre-downloaded model weights from MinIO storage, set the following environment variables; otherwise, models will be downloaded from Hugging Face. MinIO storage is also used for saving benchmark results.
BUCKET_STORAGE_HOST
BUCKET_STORAGE_ACCESS_KEY
BUCKET_STORAGE_SECRET_KEY
BUCKET_MODEL_PATH
HF Token (optional): If you need to download gated models from Hugging Face (e.g., Mistral and LLaMA 3.x) that are not available locally, ensure a secret named hf-token exists in the namespace.
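A quick pre-flight check can catch a missing prerequisite before the job is submitted. The sketch below is hypothetical and not part of the chart. It assumes the four MinIO variables are visible in the shell it runs in (the chart may read them from elsewhere, such as values.yaml or the cluster), and the helper names missing_bucket_vars and hf_token_secret_exists are made up for illustration.

```shell
# Hypothetical helper: print the name of each MinIO variable that is unset or empty.
missing_bucket_vars() {
  for name in BUCKET_STORAGE_HOST BUCKET_STORAGE_ACCESS_KEY \
              BUCKET_STORAGE_SECRET_KEY BUCKET_MODEL_PATH; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Hypothetical helper: succeed only if a secret named hf-token exists
# in the namespace passed as the first argument.
hf_token_secret_exists() {
  kubectl get secret hf-token --namespace "$1" > /dev/null 2>&1
}
```

missing_bucket_vars prints nothing when all four variables are set. If it lists any names, weights would be downloaded from Hugging Face instead, per the note above.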
Implementation#
Basic configurations are defined in the values.yaml file, with key settings:
env_vars.TESTOPT: Must be set to either “latency” or “throughput”
env_vars.USE_MAD: Controls whether to apply the MAD approach (see below)
Note: If the specified model cannot be found locally, the workload will attempt to download it from Hugging Face.
A. Scenario-specific approach#
This approach is used when env_vars.USE_SCENARIO is not “false”. Scenarios are defined in the mount/scenarios_{$TESTOPT}.csv file. Modify this file to specify the models, parameters, and environment variables to benchmark. Each column defines a parameter or variable, and each row represents one scenario.
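To preview which scenarios a CSV file describes before submitting the job, each row can be printed as column=value pairs. This is a minimal sketch, not something the chart provides: print_scenarios is a hypothetical helper, and it assumes a simple CSV with a header row and no quoted fields or embedded commas. The column names in the usage comment are placeholders, not the chart's actual columns.

```shell
# Hypothetical helper: print one line per scenario, as "column=value" pairs.
# $1 is the path to a scenarios CSV whose first row names the columns.
print_scenarios() {
  awk -F',' '
    # Remember the column names from the header row.
    NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
    # Skip blank lines; pair every value with its column name.
    NF > 0 {
      line = ""
      for (i = 1; i <= NF; i++) line = line (i > 1 ? " " : "") header[i] "=" $i
      print line
    }
  ' "$1"
}

# Usage, with placeholder columns "model,batch_size" in the file:
#   print_scenarios mount/scenarios_latency.csv
```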
The default configuration benchmarks latency using benchmark_latency.py from vLLM. Setting env_vars.TESTOPT to “throughput” will use benchmark_throughput.py instead.
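The relationship between env_vars.TESTOPT, the vLLM script, and the scenario file can be sketched as a small lookup. The functions below are illustrative only and are not taken from the chart. They assume the file name pattern above means the TESTOPT value is substituted into mount/scenarios_<TESTOPT>.csv, and the names benchmark_script_for and scenario_file_for are hypothetical.

```shell
# Hypothetical helper: map a TESTOPT value to the vLLM benchmark script it selects.
benchmark_script_for() {
  case "$1" in
    latency)    printf '%s\n' "benchmark_latency.py" ;;
    throughput) printf '%s\n' "benchmark_throughput.py" ;;
    *)          printf 'unsupported TESTOPT: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: map a TESTOPT value to its scenario file path.
scenario_file_for() {
  # Reject values other than "latency" or "throughput".
  benchmark_script_for "$1" > /dev/null || return 1
  printf 'mount/scenarios_%s.csv\n' "$1"
}
```

Any value other than “latency” or “throughput” makes both helpers return a non-zero status, mirroring the requirement that TESTOPT be one of those two values.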
Example 1: Benchmark latency scenarios (default)
helm template . | kubectl apply -f -
Example 2: Benchmark throughput scenarios
helm template . --set env_vars.TESTOPT="throughput" | kubectl apply -f -
B. ROCm/MAD standalone approach#
When env_vars.USE_MAD is not “false”, the ROCm/MAD repository will be cloned. The specified model (env_vars.MAD_MODEL) will be benchmarked according to preset scripts.
Example 3: Benchmark using MAD standalone approach with override settings
helm template . -f overrides/methods/MAD-Qwen2.5_0.5B.yaml | kubectl apply -f -
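Because each example pipes the rendered manifest straight into kubectl apply, it can be useful to render to a file and validate it first. The following is a sketch under stated assumptions: render_manifest and validate_manifest are hypothetical wrappers, and they rely only on helm template and the client-side dry-run mode of kubectl apply.

```shell
# Hypothetical wrapper: render the chart to a file instead of piping it.
# $1 is the output file; remaining arguments are passed to "helm template .".
render_manifest() {
  out="$1"
  shift
  helm template . "$@" > "$out"
}

# Hypothetical wrapper: validate the rendered manifest on the client
# without creating anything in the cluster.
validate_manifest() {
  kubectl apply --dry-run=client -f "$1"
}

# Usage:
#   render_manifest rendered.yaml -f overrides/methods/MAD-Qwen2.5_0.5B.yaml
#   validate_manifest rendered.yaml && kubectl apply -f rendered.yaml
```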