# OpenAI-compatible Endpoint Benchmarking
This Helm chart defines a batch job that benchmarks LLM performance by running vLLM's benchmarking script against OpenAI-compatible API endpoints. It follows best practices for optimized inference on AMD Instinct GPUs.
## Prerequisites and Configuration
- **Helm**: Ensure `helm` is installed. Refer to the Helm documentation for installation instructions.
- **MinIO Storage**: Required for saving benchmark results. Configure the following environment variables in `values.yaml` (see the sketch after this list):
    - `BUCKET_STORAGE_HOST`
    - `BUCKET_STORAGE_ACCESS_KEY`
    - `BUCKET_STORAGE_SECRET_KEY`
    - `BUCKET_RESULT_PATH`
- **API Endpoint**: An OpenAI-compatible API endpoint is required. Configure it in `values.yaml` as `env_vars.OPENAI_API_BASE_URL`, or override it with Helm's `--set` option.
- **Tokenizer**: Required for token calculations. Specify a HuggingFace model repository in `values.yaml` by setting `env_vars.TOKENIZER`. The default is `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`.
- **HuggingFace Token** (optional): Set the `env_vars.HF_TOKEN` environment variable if using gated tokenizers (e.g., Mistral and Llama models) from HuggingFace.
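For orientation, a minimal `values.yaml` sketch combining these settings might look like the following. This is an assumption about the chart's key layout (in particular, that the bucket variables sit under `env_vars` alongside the others), and every value shown is a placeholder:

```yaml
# Hypothetical values.yaml sketch -- key layout and all values are placeholders.
env_vars:
  OPENAI_API_BASE_URL: "http://vllm-server:8000/v1"  # placeholder endpoint
  TOKENIZER: "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # chart default
  HF_TOKEN: ""  # set only for gated tokenizers (e.g., Mistral, Llama)
  BUCKET_STORAGE_HOST: "minio.example.svc:9000"  # placeholder MinIO host
  BUCKET_STORAGE_ACCESS_KEY: "minio-access-key"  # placeholder
  BUCKET_STORAGE_SECRET_KEY: "minio-secret-key"  # placeholder
  BUCKET_RESULT_PATH: "benchmarks/results"  # placeholder result path
```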
## Benchmark Configuration
The benchmark behavior can be customized using the following environment variables (a sketch of overriding them follows below):

- `INPUT_LENGTH` (default: `2048`): Sets the input token length for benchmark requests.
- `OUTPUT_LENGTH` (default: `2048`): Sets the output token length for benchmark requests.
- `QPS` (default: `inf`): Sets the queries-per-second rate. It can be:
    - A single value: `"10"`, `"inf"` (unlimited)
    - Multiple space-separated values: `"1 5 10 inf"` (runs tests at each rate)
The benchmark automatically runs tests at request concurrency levels of 1, 2, 4, 8, 16, 32, 64, 128, and 256.
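As a sketch, these knobs could be set in `values.yaml` like this, assuming they live under `env_vars` like the variables above; the values are illustrative:

```yaml
# Illustrative overrides -- assumes the benchmark variables sit under env_vars.
env_vars:
  INPUT_LENGTH: "1024"   # input tokens per request
  OUTPUT_LENGTH: "512"   # output tokens per request
  QPS: "1 5 10 inf"      # one benchmark pass per listed rate
```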
## Deployment Example
To deploy the chart, run the following command in the `helm/` directory:
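An invocation might look like the sketch below; the release name and chart path are illustrative, and the `--set` flags simply demonstrate the command-line override mechanism mentioned above:

```bash
# Illustrative only -- release name, chart path, and endpoint are placeholders.
helm install llm-benchmark . \
  --set env_vars.OPENAI_API_BASE_URL=http://vllm-server:8000/v1 \
  --set env_vars.QPS="1 5 10 inf"
```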