LLM Inference Service with Ollama¶
This Helm chart deploys an LLM inference service workload via Ollama.
Prerequisites¶
Install Helm. Refer to the Helm documentation for instructions.
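As a reference, one common way to install Helm on Linux or macOS is the official installer script, sketched below; see the Helm documentation for the method appropriate to your platform.

```bash
# Download and run the official Helm 3 installer script
# (see https://helm.sh/docs/intro/install/ for other installation methods)
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

# Verify the installation
helm version
```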
Deploying the Workload¶
Basic configurations for the deployment are specified in the values.yaml file. By default, the service uses the quantized Gemma3:4b model. For a comprehensive list of available models, visit the Ollama Model Library.
For example, run the following command from within the helm/ folder to deploy the service:
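A minimal sketch of the deploy command is shown below; the release name llm-inference-ollama, the chart path, and the ollama.model values key are assumptions, so substitute the names actually used in your chart and values.yaml.

```bash
# Install the chart from the current (helm/) folder.
# "llm-inference-ollama" is an assumed release name; the chart path "."
# assumes Chart.yaml lives directly in helm/.
helm install llm-inference-ollama . -f values.yaml

# Or, override the model at install time
# ("ollama.model" is a hypothetical values key; check values.yaml for the actual key).
helm install llm-inference-ollama . --set ollama.model=gemma3:4b
```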
Note: Compiling Ollama executables and downloading models can take a significant amount of time. The deployment process may take over 10 minutes before the LLM inference service is ready.
Interacting with the Deployed Model¶
Verify Deployment¶
Check the deployment and service status:
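A sketch of the verification commands, assuming the chart was installed into the current namespace and its resources are named after the llm-inference-ollama release:

```bash
# List the deployment and wait for the pod to become Ready
kubectl get deployments
kubectl get pods

# Confirm the service exists (assumed name: llm-inference-ollama)
kubectl get svc llm-inference-ollama

# Follow the pod logs to watch compilation and model download progress
kubectl logs -f deployment/llm-inference-ollama
```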
Port Forwarding¶
To access the service locally, forward the port using the following commands. This assumes the service name is llm-inference-ollama:
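A minimal sketch, assuming the service exposes Ollama's default port 11434 and you want it reachable on local port 8080; adjust the ports to match the chart's service definition:

```bash
# Forward local port 8080 to the service port
# (11434 is Ollama's default; adjust if the chart exposes a different port)
kubectl port-forward svc/llm-inference-ollama 8080:11434
```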
Note: The Ollama server provides both the Ollama API at http://localhost:8080/api and an OpenAI-compatible API at http://localhost:8080/v1.
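For example, the deployed model can be queried through either endpoint. The sketch below assumes the default gemma3:4b model and the port forwarding set up above:

```bash
# Ollama-native API: generate a completion
curl http://localhost:8080/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# OpenAI-compatible API: chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:4b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
```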