# LLM Inference Service with Llama.cpp
This Helm chart deploys an LLM inference service workload via llama.cpp.
## Prerequisites
- Helm: Install `helm`. Refer to the Helm documentation for instructions.
- Secrets: (Optional) Create the secret `minio-credentials` with keys `minio-access-key` and `minio-secret-key` in the namespace if you want to download pre-built executables and models from MinIO; see the example after this list.
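For example, the secret can be created with `kubectl`; the key names come from the text above, while the namespace and credential values below are placeholders:

```bash
# Create the MinIO credentials secret expected by the chart (placeholder values).
kubectl create secret generic minio-credentials \
  --namespace default \
  --from-literal=minio-access-key=<YOUR_ACCESS_KEY> \
  --from-literal=minio-secret-key=<YOUR_SECRET_KEY>
```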
## Deploying the Workload
Basic configurations are defined in the `values.yaml` file.
The default model is the 1.73-bit quantized DeepSeek-R1-UD-IQ1_M, which fits on a single MI300X GPU and can be served with a context length of 4K.
For example, run the following command within the `helm/` folder to deploy the service:
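A minimal sketch of the install command, assuming the chart lives in the current `helm/` directory and a release name of `llm-inference` (chosen to match the service name used later in this guide):

```bash
# Install the chart from the current directory with the default values.yaml.
helm install llm-inference .
```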
Note: Compiling llama.cpp executables and downloading/merging the GGUF files of DeepSeek R1 (~200GB) from HuggingFace can take a significant amount of time. The deployment process may take over 30 minutes before the LLM inference service is ready.
## Interacting with the Deployed Model
### Verify Deployment
Check the deployment and service status:
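For example, assuming the default resource name `llm-inference-llamacpp`:

```bash
# Confirm the deployment and service were created.
kubectl get deployment,service llm-inference-llamacpp

# Optionally watch the pods until they become Ready.
kubectl get pods -w
```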
### Port Forwarding
To access the service locally, forward the port using the following commands. This assumes the service name is `llm-inference-llamacpp`:
The service exposes HTTP on port 80, while the deployment's container listens on port 8080 by default.
```bash
kubectl port-forward services/llm-inference-llamacpp 8080:80 || \
kubectl port-forward deployments/llm-inference-llamacpp 8080:8080
```
You can access the Llama.cpp server's WebUI at http://localhost:8080 using a web browser.
Additionally, an OpenAI-compatible API endpoint is available at http://localhost:8080/v1.
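For example, after port forwarding you can send a minimal chat completion request with `curl`; the `model` value below is a placeholder, since the server hosts a single loaded model:

```bash
# Send a chat completion request to the OpenAI-compatible endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Explain what llama.cpp is in one sentence."}],
        "max_tokens": 128
      }'
```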