# LLM Inference with SGLang
This Helm chart deploys the SGLang LLM inference workload.
## Prerequisites
Ensure the following prerequisites are met before deploying any workloads:
- **Helm**: Install `helm`. Refer to the Helm documentation for instructions.
- **Secrets**: Create the following secrets in the namespace:
  - `minio-credentials` with keys `minio-access-key` and `minio-secret-key`.
  - `hf-token` with key `hf-token`.
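One way to create these secrets is with `kubectl create secret generic` (substitute your own values and namespace):

```bash
# MinIO credentials secret, with the key names the chart expects.
kubectl create secret generic minio-credentials \
  --from-literal=minio-access-key=<your-access-key> \
  --from-literal=minio-secret-key=<your-secret-key> \
  -n <namespace>

# Hugging Face token secret.
kubectl create secret generic hf-token \
  --from-literal=hf-token=<your-hf-token> \
  -n <namespace>
```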
## Deploying the Workload
It is recommended to use `helm template` and pipe the result to `kubectl apply`, rather than using `helm install`. In general, a command looks as follows:
```bash
helm template [optional-release-name] <helm-dir> -f <overrides/xyz.yaml> --set <name>=<value> | kubectl apply -n <namespace> -f -
```
The chart provides three main ways to deploy models, detailed below.
### Alternative 1: Deploy a Specific Model Configuration
To deploy a specific model along with its settings, use the following command from the helm directory:
```bash
helm template tiny-llama . -f overrides/models/tinyllama_tinyllama-1.1b-chat-v1.0.yaml | kubectl apply -f -
```
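The contents of the files under `overrides/models/` are not shown here; as a rough sketch, such a file presumably pins at least the `model` value, which is the only chart value confirmed in this guide (see `values.yaml` for the full set):

```yaml
# Hypothetical minimal overrides file; consult values.yaml for the
# keys this chart actually supports.
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
```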
### Alternative 2: Override the Model
You can also override the model on the command line:
```bash
helm template qwen2-0-5b . --set model=Qwen/Qwen2-0.5B-Instruct | kubectl apply -f -
```
### Alternative 3: Deploy a Model from Bucket Storage
If you have downloaded your model to bucket storage, use:
```bash
helm template qwen2-0-5b . --set model=s3://models/Qwen/Qwen2-0.5B-Instruct | kubectl apply -f -
```
The model is downloaded from the bucket automatically before the inference server starts.
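If the model still needs to be staged in the bucket, one possible workflow, assuming a reachable MinIO endpoint with the credentials from the prerequisites and the `huggingface-cli` and `aws` CLIs installed, is:

```bash
# Download the model from Hugging Face to a local directory.
huggingface-cli download Qwen/Qwen2-0.5B-Instruct --local-dir ./Qwen2-0.5B-Instruct

# Copy it into the bucket; <minio-endpoint-url> stands in for your MinIO endpoint.
aws s3 cp ./Qwen2-0.5B-Instruct s3://models/Qwen/Qwen2-0.5B-Instruct \
  --recursive --endpoint-url <minio-endpoint-url>
```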
## User Input Values
Refer to the `values.yaml` file for the user input values you can provide, along with instructions.
## Interacting with the Deployed Model
### Verify Deployment
Check the deployment status:
```bash
kubectl get deployment
```
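It can also help to watch the pods until the inference server is ready:

```bash
# Watch pod status; stop with Ctrl-C once the pod reports Running and ready.
kubectl get pods -w
```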
### Port Forwarding
Forward the port to access the service (assuming the service is named `llm-inference-sglang-tiny-llama`):
```bash
kubectl port-forward services/llm-inference-sglang-tiny-llama 8080:80
```
### Test the Deployment
Send a test request to verify the service, assuming the `TinyLlama/TinyLlama-1.1B-Chat-v1.0` model is deployed:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'
```
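If the request fails with an unknown-model error, you can check which model identifiers the server exposes; OpenAI-compatible servers such as SGLang typically serve a models listing:

```bash
curl http://localhost:8080/v1/models
```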