# LLM Inference Benchmarking Workload
This Helm chart submits a job to benchmark the performance of vLLM running a model in the same container.
## Prerequisites
- **Helm**: Install `helm`. Refer to the Helm documentation for instructions.
- **MinIO storage (optional)**: To use pre-downloaded model weights from MinIO storage, set the following environment variables (see the sketch after this list); otherwise, models are downloaded from Hugging Face. MinIO storage is also used for saving benchmark results.
  - `BUCKET_STORAGE_HOST`
  - `BUCKET_STORAGE_ACCESS_KEY`
  - `BUCKET_STORAGE_SECRET_KEY`
  - `BUCKET_MODEL_PATH`
- **HF token (optional)**: To download gated models from Hugging Face (e.g., Mistral and LLaMA 3.x) that are not available locally, ensure a secret named `hf-token` exists in the namespace (see the example after this list).
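As a sketch of wiring up MinIO, assuming these variables are exposed through the chart's `env_vars` map like the other settings (the endpoint, credentials, and bucket path below are placeholders):

```bash
# Hypothetical MinIO settings; replace the endpoint, credentials, and
# bucket path with values for your cluster.
helm install llm-bench . \
  --set env_vars.BUCKET_STORAGE_HOST=http://minio.minio.svc:9000 \
  --set env_vars.BUCKET_STORAGE_ACCESS_KEY=<access-key> \
  --set env_vars.BUCKET_STORAGE_SECRET_KEY=<secret-key> \
  --set env_vars.BUCKET_MODEL_PATH=models/
```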
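To create the `hf-token` secret, a minimal sketch; the key name (`HF_TOKEN`) is an assumption, so check the chart templates for the key it actually reads:

```bash
# Create the hf-token secret from a Hugging Face access token.
# The key name HF_TOKEN is an assumption; verify it against the chart.
kubectl create secret generic hf-token \
  --from-literal=HF_TOKEN=<your-huggingface-token>
```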
## Implementation
Basic configuration is defined in the `values.yaml` file, with these key settings:

- `env_vars.TESTOPT`: Must be set to either `"latency"` or `"throughput"`.
- `env_vars.USE_MAD`: Controls whether to apply the MAD approach (see below).
**Note:** If the specified model cannot be found locally, the workload will attempt to download it from Hugging Face.
### A. Scenario-specific approach
In this approach (`env_vars.USE_SCENARIO` is not `"false"`), scenarios are defined in the `mount/scenarios_{$TESTOPT}.csv` file. Modify this file to specify the models, parameters, and environment variables for benchmarking. Each column defines a parameter or variable, and each row represents a unique scenario to benchmark.
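For illustration, a scenario file might look like the sketch below; the column names here are hypothetical, so inspect the shipped `mount/scenarios_latency.csv` for the actual schema:

```csv
model,tensor_parallel_size,input_len,output_len
meta-llama/Llama-3.1-8B-Instruct,1,128,128
mistralai/Mistral-7B-Instruct-v0.3,1,2048,128
```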
The default configuration benchmarks latency using `benchmark_latency.py` from vLLM. Setting `env_vars.TESTOPT` to `"throughput"` will use `benchmark_throughput.py` instead.
**Example 1: Benchmark latency scenarios (default)**
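A minimal sketch, where the release name `llm-bench` and the chart path (`.`) are placeholders:

```bash
# Latency is the default test, so setting TESTOPT explicitly is optional.
helm install llm-bench . --set env_vars.TESTOPT=latency
```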
**Example 2: Benchmark throughput scenarios**
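A corresponding sketch with the same placeholder release name and chart path:

```bash
# Switch to throughput benchmarking; scenarios are read from
# mount/scenarios_throughput.csv.
helm install llm-bench . --set env_vars.TESTOPT=throughput
```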
### B. ROCm/MAD standalone approach
When `env_vars.USE_MAD` is not `"false"`, the ROCm/MAD repository is cloned, and the specified model (`env_vars.MAD_MODEL`) is benchmarked using that repository's preset scripts.
**Example 3: Benchmark using the MAD standalone approach with override settings**
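A sketch of enabling MAD and overriding its settings at install time; the model tag below is a placeholder, so consult the ROCm/MAD repository for valid model names:

```bash
# Enable the MAD standalone approach and choose the model to benchmark.
# The MAD_MODEL tag is a placeholder; see ROCm/MAD for valid names.
helm install llm-bench . \
  --set env_vars.USE_MAD=true \
  --set env_vars.MAD_MODEL=pyt_vllm_llama-3.1-8b
```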