# OpenAI-compatible Endpoint Benchmarking
This Helm chart defines a batch job that benchmarks LLM performance against OpenAI-compatible API endpoints using vLLM's benchmarking script. It follows best practices for optimized inference on AMD Instinct GPUs.
## Prerequisites and Configuration
- **Helm**: Ensure `helm` is installed. Refer to the Helm documentation for installation instructions.
- **MinIO Storage**: Required for saving benchmark results. Configure the following environment variables in `values.yaml` (see the example fragment after this list):
  - `BUCKET_STORAGE_HOST`
  - `BUCKET_STORAGE_ACCESS_KEY`
  - `BUCKET_STORAGE_SECRET_KEY`
  - `BUCKET_RESULT_PATH`
- **API Endpoint**: An OpenAI-compatible API endpoint is required. Configure this in `values.yaml` as `env_vars.OPENAI_API_BASE_URL`, or override it using Helm's `--set` option.
- **Tokenizer**: Required for token calculations. Specify a HuggingFace model repository in `values.yaml` by setting `env_vars.TOKENIZER`. The default is `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`.
- **HuggingFace Token** (optional): Set the `env_vars.HF_TOKEN` environment variable if using gated tokenizers (e.g., Mistral and Llama models) from HuggingFace.
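For reference, a `values.yaml` fragment wired up for these prerequisites might look like the sketch below. The key names mirror the variables listed above, but the exact nesting (in particular whether the `BUCKET_*` variables sit under `env_vars`) depends on the chart's values schema, and the endpoint URL, bucket host, and credentials are placeholders:

```yaml
# Illustrative values.yaml fragment -- adjust to the chart's actual schema.
env_vars:
  # OpenAI-compatible endpoint to benchmark (placeholder URL)
  OPENAI_API_BASE_URL: "http://vllm-server:8000/v1"
  # Tokenizer repository used for token counting
  TOKENIZER: "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
  # Only needed for gated tokenizers (e.g., Llama, Mistral)
  HF_TOKEN: "<your-huggingface-token>"
  # MinIO storage for benchmark results (placeholder values)
  BUCKET_STORAGE_HOST: "http://minio.minio.svc.cluster.local:9000"
  BUCKET_STORAGE_ACCESS_KEY: "<access-key>"
  BUCKET_STORAGE_SECRET_KEY: "<secret-key>"
  BUCKET_RESULT_PATH: "benchmark-results/"
```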
## Benchmark Configuration
The benchmark behavior can be customized using the following environment variables:
- `INPUT_LENGTH` (default: `2048`): Sets the input token length for benchmark requests.
- `OUTPUT_LENGTH` (default: `2048`): Sets the output token length for benchmark requests.
- `QPS` (default: `inf`): Sets the queries-per-second rate. Can be:
  - A single value: `"10"` or `"inf"` (unlimited)
  - Multiple space-separated values: `"1 5 10 inf"` (runs tests at each rate)
The benchmark automatically tests with request concurrency levels of 1, 2, 4, 8, 16, 32, 64, 128, and 256.
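As an illustration, these variables can be set alongside the others; the fragment below uses placeholder values and assumes they live under `env_vars` like the variables in the previous section. It runs shorter prompts and sweeps several request rates in one job:

```yaml
env_vars:
  # 1K-token prompts and 1K-token completions (illustrative values)
  INPUT_LENGTH: "1024"
  OUTPUT_LENGTH: "1024"
  # Benchmark at 1, 5, and 10 requests/s, then with no rate limit
  QPS: "1 5 10 inf"
```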
## Deployment Example
To deploy the chart, run the following command in the `helm/` directory:
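A representative invocation is sketched below; the release name `llm-benchmark` is illustrative, the chart is assumed to be in the current directory, and the endpoint URL is a placeholder:

```bash
# Install the benchmark job chart; release name is illustrative.
# Individual values can be overridden on the command line with --set.
helm install llm-benchmark . \
  --set env_vars.OPENAI_API_BASE_URL="http://vllm-server:8000/v1" \
  --set env_vars.QPS="1 5 10 inf"
```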