# LLM Inference Benchmarking Workload
This Helm chart submits a job to benchmark the performance of vLLM running a model in the same container.
## Prerequisites
- **Helm**: Install `helm`. Refer to the Helm documentation for instructions.
- **MinIO Storage (optional)**: To use pre-downloaded model weights from MinIO storage, the following environment variables must be set; otherwise, models will be downloaded from HuggingFace. MinIO storage is also used for saving benchmark results.
  - `BUCKET_STORAGE_HOST`
  - `BUCKET_STORAGE_ACCESS_KEY`
  - `BUCKET_STORAGE_SECRET_KEY`
  - `BUCKET_MODEL_PATH`
- **HF Token (optional)**: To download gated models from HuggingFace (e.g., Mistral and LLaMA 3.x) that are not available locally, ensure a secret named `hf-token` exists in the namespace (see the sketch after this list).
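
One way to create that secret is sketched below. The namespace and token value are placeholders, and the key name inside the secret (`hf-token`) is an assumption; verify the key that the chart's templates actually reference.

```bash
# Minimal sketch: create the HuggingFace token secret in the target namespace.
# The key name "hf-token" inside the secret is an assumption; check which key
# the chart's templates read.
kubectl create secret generic hf-token \
  --namespace <your-namespace> \
  --from-literal=hf-token=<your-huggingface-token>
```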
## Implementation
Basic configurations are defined in the `values.yaml` file, with key settings:

- `env_vars.TESTOPT`: Must be set to either `"latency"` or `"throughput"`.
- `env_vars.USE_MAD`: Controls whether to apply the ROCm/MAD approach (see below).
Note: If the specified model cannot be found locally, the workload will attempt to download it from HuggingFace.
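
As a sketch of how these settings can be overridden at install time (the override file name is arbitrary; the `env_vars` keys come from `values.yaml`):

```bash
# Write a small values override; the env_vars block mirrors values.yaml.
cat > overrides.yaml <<'EOF'
env_vars:
  TESTOPT: "latency"   # or "throughput"
  USE_MAD: "false"     # set to "true" for the ROCm/MAD approach
EOF
```

Pass the file with `helm install -f overrides.yaml ...`, or set the same keys with `--set` as shown in the examples below.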
### A. Scenario-specific approach
In this approach (when `env_vars.USE_SCENARIO` is not `"false"`), scenarios are defined in the `mount/scenarios_${TESTOPT}.csv` file. Modify this file to specify the models, parameters, and environment variables to benchmark. Each column defines a parameter or variable, and each row represents a unique scenario to benchmark.
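
For illustration only, a latency scenario file might look like the sketch below; the column names and model IDs are assumptions, so keep the headers that ship in `mount/scenarios_latency.csv`.

```bash
# Illustrative sketch only: the column names and values are assumptions,
# not the chart's actual headers. Each row is one benchmark scenario.
cat <<'EOF'
model,tensor_parallel_size,input_len,output_len,batch_size
meta-llama/Llama-3.1-8B-Instruct,1,128,128,8
mistralai/Mistral-7B-Instruct-v0.3,1,2048,256,4
EOF
```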
The default configuration benchmarks latency using vLLM's `benchmark_latency.py`. Setting `env_vars.TESTOPT` to `"throughput"` will use `benchmark_throughput.py` instead.
**Example 1: Benchmark latency scenarios (default)**
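
A minimal sketch, assuming the chart is installed from a local chart directory; the release name, chart path, and namespace are placeholders.

```bash
# Defaults already select the latency benchmark and the scenario-specific
# approach; the --set flag is shown only to make the choice explicit.
helm install llm-latency-bench <path-to-chart> \
  --namespace <your-namespace> \
  --set env_vars.TESTOPT=latency
```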
**Example 2: Benchmark throughput scenarios**
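
The same sketch with throughput selected, so scenarios are read from `mount/scenarios_throughput.csv`; names and paths remain placeholders.

```bash
# Switch the benchmark mode to throughput.
helm install llm-throughput-bench <path-to-chart> \
  --namespace <your-namespace> \
  --set env_vars.TESTOPT=throughput
```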
### B. ROCm/MAD standalone approach
When `env_vars.USE_MAD` is not `"false"`, the ROCm/MAD repository will be cloned, and the specified model (`env_vars.MAD_MODEL`) will be benchmarked according to its preset scripts.
**Example 3: Benchmark using MAD standalone approach with override settings**
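
A sketch under the same placeholder assumptions; the `MAD_MODEL` value must be a model tag defined in the ROCm/MAD repository, and the tag shown here is a placeholder.

```bash
# Enable the ROCm/MAD standalone approach and pick a MAD model tag.
helm install llm-mad-bench <path-to-chart> \
  --namespace <your-namespace> \
  --set env_vars.USE_MAD=true \
  --set env_vars.MAD_MODEL=<mad-model-tag>
```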