# Embedding Inference with Infinity

This Helm chart deploys the embedding inference workload, served via Infinity.
## Deploying the Workload

```sh
helm template [optional-release-name] <helm-dir> -f <overrides/xyz.yaml> --set <name>=<value> | kubectl apply -f -
```
### Example commands
Use default settings:
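A minimal sketch, assuming the command is run from the chart directory and every value in values.yaml is left at its default:

```sh
# Render the chart with default values and apply the manifests
helm template . | kubectl apply -f -
```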
Use a custom model (one that is supported by Infinity):

```sh
helm template bilingual-embedding-large . --set model=Lajavaness/bilingual-embedding-large | kubectl apply -f -
```
The model will be automatically downloaded before starting the inference server.
## User Input Values

Refer to the values.yaml file for the user input values you can provide, along with instructions for each.
## Interacting with the Deployed Model

### Verify Deployment
Check the deployment status:
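For example (the exact resource names depend on the release name you chose):

```sh
# Confirm the chart's pods are Running and Ready
kubectl get pods
```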
### Port Forwarding
Forward the port to access the service:
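A minimal sketch, assuming the chart exposes the Infinity port (7997) through a Kubernetes service; substitute the actual service name reported by `kubectl get svc`:

```sh
# Forward the Infinity port from the service to localhost
kubectl port-forward service/<service-name> 7997:7997
```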
### Test the Deployment
The Infinity server UI will be accessible at http://0.0.0.0:7997/docs.
Send a test request to verify the service:
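A minimal example using curl against Infinity's OpenAI-compatible embeddings endpoint; the `model` value is assumed to match the model you deployed (shown here with the custom model from the example above):

```sh
# Request an embedding for a sample input through the forwarded port
curl -X POST http://0.0.0.0:7997/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "Lajavaness/bilingual-embedding-large", "input": ["Hello, world!"]}'
```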