# LLM Inference with vLLM

This Helm chart deploys the LLM Inference vLLM workload.
## Prerequisites

Ensure the following prerequisites are met before deploying any workloads:

- Helm: Install `helm`. Refer to the Helm documentation for instructions.
- Secrets: Create the following secrets in the namespace:
    - `minio-credentials` with keys `minio-access-key` and `minio-secret-key`
    - `hf-token` with key `hf-token`
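For reference, one way to create these secrets with `kubectl`; the namespace and credential values are placeholders you must supply:

```bash
# Create the MinIO credentials secret (replace the placeholder values)
kubectl create secret generic minio-credentials \
  --namespace <namespace> \
  --from-literal=minio-access-key=<your-access-key> \
  --from-literal=minio-secret-key=<your-secret-key>

# Create the Hugging Face token secret
kubectl create secret generic hf-token \
  --namespace <namespace> \
  --from-literal=hf-token=<your-hf-token>
```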
## Deploying the Workload

It is recommended to use `helm template` and pipe the result to `kubectl apply`, rather than using `helm install`. The command generally looks as follows:

```bash
helm template [optional-release-name] <helm-dir> -f <overrides/xyz.yaml> --set <name>=<value> | kubectl apply -f -
```
The chart provides three main ways to deploy models, detailed below.
### Alternative 1: Deploy a Specific Model Configuration

To deploy a specific model along with its settings, use the following command from the `helm` directory:

```bash
helm template tiny-llama . -f overrides/models/tinyllama_tinyllama-1.1b-chat-v1.0.yaml | kubectl apply -f -
```
### Alternative 2: Override the Model

You can also override the model on the command line:
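A minimal sketch, assuming the chart's `model` value accepts a Hugging Face model ID; the release name `tiny-llama` is illustrative:

```bash
helm template tiny-llama . \
  --set model=TinyLlama/TinyLlama-1.1B-Chat-v1.0 | kubectl apply -f -
```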
### Alternative 3: Deploy a Model from Bucket Storage

If you have downloaded your model to bucket storage, use:
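A sketch, assuming the `model` value accepts an `s3://` URI as in Alternative 4; the bucket path and release name are placeholders:

```bash
helm template my-model . \
  --set model=s3://<bucket>/<path-to-model> | kubectl apply -f -
```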
The model will be automatically downloaded before starting the inference server.
### Alternative 4: Deploy with Custom Served Model Name

You can decouple the served model name from the storage path by using the `served_model_name` parameter:

```bash
helm template qwen2-0-5b . \
  --set model=s3://default-bucket/engineering/models/OdiaGenAI-LLM/qwen_1.5_odia_7b \
  --set served_model_name=OdiaGenAI-LLM/qwen_1.5_odia_7b | kubectl apply -f -
```
This allows you to use clean, user-friendly model names in your API requests while keeping the actual storage path separate.
## User Input Values

Refer to the `values.yaml` file for the user input values you can provide, along with instructions.
## Interacting with Deployed Model

### Verify Deployment

Check the deployment status:
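For example (the exact resource names depend on your release name and namespace):

```bash
# Check that the pods are running and the service has been created
kubectl get pods
kubectl get svc
```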
### Port Forwarding

Forward the port to access the service (assuming the service is named `llm-inference-vllm-tiny-llama`):
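A sketch of the port-forward command; local port 8080 matches the test request below, while the service port is assumed here and may differ depending on the chart's configuration:

```bash
# Forward local port 8080 to the service (adjust the target port if the chart exposes a different one)
kubectl port-forward svc/llm-inference-vllm-tiny-llama 8080:8080
```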
### Test the Deployment

Send a test request to verify the service, assuming the `TinyLlama/TinyLlama-1.1B-Chat-v1.0` model:
```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
```
If you deployed with a custom `served_model_name`, use that name instead:
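For example, using the served model name from Alternative 4:

```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{
        "model": "OdiaGenAI-LLM/qwen_1.5_odia_7b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
```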