LLM Inference with SGLang#

This Helm chart deploys the LLM Inference SGLang workload, which serves large language models with SGLang.

Prerequisites#

Ensure the following prerequisites are met before deploying any workloads:

  1. Helm: Install helm. Refer to the Helm documentation for instructions.

  2. Secrets: Create the following secrets in the target namespace (example kubectl commands follow this list):

    • minio-credentials with keys minio-access-key and minio-secret-key.

    • hf-token with key hf-token.
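
For example, assuming placeholder credential values, both secrets can be created with kubectl (the key names must match exactly):

kubectl create secret generic minio-credentials -n <namespace> \
    --from-literal=minio-access-key=<your-access-key> \
    --from-literal=minio-secret-key=<your-secret-key>

kubectl create secret generic hf-token -n <namespace> \
    --from-literal=hf-token=<your-hf-token>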

Deploying the Workload#

It is recommended to use helm template and pipe the result to kubectl apply, rather than using helm install. In general, the command looks as follows:

helm template [optional-release-name] <helm-dir> -f <overrides/xyz.yaml> --set <name>=<value> | kubectl apply -n <namespace> -f -

The chart provides three main ways to deploy models, detailed below.

Alternative 1: Deploy a Specific Model Configuration#

To deploy a specific model along with its settings, use the following command from the helm directory:

helm template tiny-llama . -f overrides/models/tinyllama_tinyllama-1.1b-chat-v1.0.yaml | kubectl apply -f -
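
Note that the release name (tiny-llama here) is incorporated into the generated resource names; the service llm-inference-sglang-tiny-llama used later in this guide follows that pattern.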

Alternative 2: Override the Model#

You can also override the model on the command line:

helm template qwen2-0-5b . --set model=Qwen/Qwen2-0.5B-Instruct | kubectl apply -f -

Alternative 3: Deploy a Model from Bucket Storage#

If you have downloaded your model to bucket storage, use:

helm template qwen2-0-5b . --set model=s3://models/Qwen/Qwen2-0.5B-Instruct | kubectl apply -f -

The model is downloaded from bucket storage automatically before the inference server starts.

User Input Values#

Refer to the values.yaml file for the full list of user input values you can provide, along with instructions for each.
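
For example, a minimal custom overrides file could set just the model (my-overrides.yaml is a hypothetical file name; model is the same key used with --set above):

# my-overrides.yaml
model: Qwen/Qwen2-0.5B-Instruct

It can then be passed to helm template like the bundled overrides:

helm template qwen2-0-5b . -f my-overrides.yaml | kubectl apply -f -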

Interacting with the Deployed Model#

Verify Deployment#

Check the deployment status:

kubectl get deployment
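
The pods behind the deployment can be inspected with kubectl get pods. To block until the rollout completes (assuming the deployment is named llm-inference-sglang-tiny-llama, matching the service name used below):

kubectl rollout status deployment/llm-inference-sglang-tiny-llama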

Port Forwarding#

Forward the port to access the service (assuming the service is named llm-inference-sglang-tiny-llama):

kubectl port-forward services/llm-inference-sglang-tiny-llama 8080:80
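
Note that kubectl port-forward runs in the foreground; run it in a separate terminal, or background it:

kubectl port-forward services/llm-inference-sglang-tiny-llama 8080:80 &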

Test the Deployment#

Send a test request to verify the service, assuming the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model is deployed:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
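
The response follows the OpenAI chat completions schema. To print only the assistant's reply, pipe the same request through jq (assuming jq is installed):

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq -r '.choices[0].message.content'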