# LLM Inference with vLLM

This Helm chart deploys the LLM Inference vLLM workload.
## Prerequisites

Ensure the following prerequisites are met before deploying any workloads:

- Helm: Install `helm`. Refer to the Helm documentation for instructions.
- Secrets: Create the following secrets in the namespace:
    - `minio-credentials` with keys `minio-access-key` and `minio-secret-key`
    - `hf-token` with key `hf-token`
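For reference, one way to create these secrets with `kubectl`; the namespace and credential values are placeholders you must supply:

```bash
# Create the MinIO credentials secret (replace the placeholder values)
kubectl create secret generic minio-credentials \
  --namespace <namespace> \
  --from-literal=minio-access-key=<your-access-key> \
  --from-literal=minio-secret-key=<your-secret-key>

# Create the Hugging Face token secret
kubectl create secret generic hf-token \
  --namespace <namespace> \
  --from-literal=hf-token=<your-hf-token>
```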
## Deploying the Workload

It is recommended to use `helm template` and pipe the result to `kubectl apply`, rather than using `helm install`. The command generally looks as follows:

```bash
helm template [optional-release-name] <helm-dir> -f <overrides/xyz.yaml> --set <name>=<value> | kubectl apply -f -
```
The chart provides three main ways to deploy models, detailed below.
### Alternative 1: Deploy a Specific Model Configuration

To deploy a specific model along with its settings, use the following command from the `helm` directory:

```bash
helm template tiny-llama . -f overrides/models/tinyllama_tinyllama-1.1b-chat-v1.0.yaml | kubectl apply -f -
```
### Alternative 2: Override the Model

You can also override the model on the command line:
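A minimal sketch, assuming the chart's `model` value accepts a Hugging Face model ID; the release name `tiny-llama` is illustrative:

```bash
helm template tiny-llama . \
  --set model=TinyLlama/TinyLlama-1.1B-Chat-v1.0 | kubectl apply -f -
```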
### Alternative 3: Deploy a Model from Bucket Storage

If you have downloaded your model to bucket storage, use:
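A sketch, assuming the `model` value accepts an `s3://` URI as in Alternative 4; the bucket path and release name are placeholders:

```bash
helm template my-model . \
  --set model=s3://<bucket>/<path-to-model> | kubectl apply -f -
```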
The model will be automatically downloaded before starting the inference server.
### Alternative 4: Deploy with Custom Served Model Name

You can decouple the served model name from the storage path by using the `served_model_name` parameter:

```bash
helm template qwen2-0-5b . \
  --set model=s3://default-bucket/engineering/models/OdiaGenAI-LLM/qwen_1.5_odia_7b \
  --set served_model_name=OdiaGenAI-LLM/qwen_1.5_odia_7b | kubectl apply -f -
```
This allows you to use clean, user-friendly model names in your API requests while keeping the actual storage path separate.
## User Input Values

Refer to the `values.yaml` file for the user input values you can provide, along with instructions.
## Interacting with Deployed Model

### Verify Deployment

Check the deployment status:
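For example (the exact resource names depend on your release name and namespace):

```bash
# Check that the pods are running and the service has been created
kubectl get pods
kubectl get svc
```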
### Port Forwarding

Forward the port to access the service (assuming the service is named `llm-inference-vllm-tiny-llama`):
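A sketch of the port-forward command; local port 8080 matches the test request below, while the service port is assumed here and may differ depending on the chart's configuration:

```bash
# Forward local port 8080 to the service (adjust the target port if the chart exposes a different one)
kubectl port-forward svc/llm-inference-vllm-tiny-llama 8080:8080
```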
### Test the Deployment

Send a test request to verify the service, assuming the `TinyLlama/TinyLlama-1.1B-Chat-v1.0` model:
```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
```
If you deployed with a custom `served_model_name`, use that name instead:
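For example, using the served model name from Alternative 4:

```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{
        "model": "OdiaGenAI-LLM/qwen_1.5_odia_7b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
```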