# LLM Inference with SGLang

This Helm chart deploys an LLM inference workload served by SGLang.
## Prerequisites
Ensure the following prerequisites are met before deploying any workloads:
- Helm: Install `helm`. Refer to the Helm documentation for instructions.
- Secrets: Create the following secrets in the namespace (see the example commands after this list):
    - `minio-credentials` with keys `minio-access-key` and `minio-secret-key`.
    - `hf-token` with key `hf-token`.
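A minimal sketch of creating these secrets with `kubectl`; the placeholder values must be replaced with your actual MinIO credentials and Hugging Face token:

```bash
# Create the MinIO credentials secret (replace the placeholder values).
kubectl create secret generic minio-credentials \
  --namespace <namespace> \
  --from-literal=minio-access-key=<your-access-key> \
  --from-literal=minio-secret-key=<your-secret-key>

# Create the Hugging Face token secret.
kubectl create secret generic hf-token \
  --namespace <namespace> \
  --from-literal=hf-token=<your-hf-token>
```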
## Deploying the Workload
It is recommended to use `helm template` and pipe the result to `kubectl apply`, rather than using `helm install`. In general, a command looks as follows:

```bash
helm template [optional-release-name] <helm-dir> -f <overrides/xyz.yaml> --set <name>=<value> | kubectl apply -n <namespace> -f -
```
The chart provides three main ways to deploy models, detailed below.
### Alternative 1: Deploy a Specific Model Configuration
To deploy a specific model along with its settings, use the following command from the `helm` directory:

```bash
helm template tiny-llama . -f overrides/models/tinyllama_tinyllama-1.1b-chat-v1.0.yaml | kubectl apply -f -
```
### Alternative 2: Override the Model
You can also override the model on the command line:
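A sketch of such an override, assuming the chart exposes a `model.name` value (check `values.yaml` for the actual key):

```bash
# model.name is a hypothetical value name; verify the actual key in values.yaml.
helm template my-model . --set model.name=TinyLlama/TinyLlama-1.1B-Chat-v1.0 | kubectl apply -f -
```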
### Alternative 3: Deploy a Model from Bucket Storage
If you have downloaded your model to bucket storage, use:
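As a sketch, assuming the chart reads the bucket location from a value such as `model.bucketPath` and accesses the bucket via the `minio-credentials` secret (verify the actual keys in `values.yaml`):

```bash
# model.bucketPath is a hypothetical value name; verify the actual key in values.yaml.
helm template my-model . --set model.bucketPath=s3://<bucket>/<path-to-model> | kubectl apply -f -
```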
The model will be automatically downloaded before starting the inference server.
## User Input Values
Refer to the `values.yaml` file for the user input values you can provide, along with instructions.
## Interacting with the Deployed Model
### Verify Deployment
Check the deployment status:
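For example, with standard `kubectl` commands (the deployment name below follows the example used in the next section):

```bash
# List the pods and check that the inference pod is Running and Ready.
kubectl get pods -n <namespace>

# Inspect the deployment itself.
kubectl get deployment llm-inference-sglang-tiny-llama -n <namespace>
```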
### Port Forwarding
Forward the port to access the service (assuming the deployment is named `llm-inference-sglang-tiny-llama`):
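A sketch, assuming the server listens on SGLang's default port 30000 (adjust the port and resource name to match your deployment):

```bash
# Forward local port 30000 to port 30000 on the inference deployment.
kubectl port-forward deployment/llm-inference-sglang-tiny-llama 30000:30000 -n <namespace>
```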
### Test the Deployment
Send a test request to verify the service, assuming the `TinyLlama/TinyLlama-1.1B-Chat-v1.0` model:
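A sketch using SGLang's OpenAI-compatible chat completions endpoint, assuming the port forward above is active on local port 30000:

```bash
# Send a chat completion request to the forwarded SGLang server.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```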