LLM Inference Benchmarking Workload (ROCm Best Practices)
This Helm chart deploys a job to benchmark the performance of vLLM running a model within the same container. It follows the best practices for optimized inference on AMD Instinct GPUs.
Prerequisites
- Helm: Ensure `helm` is installed. Refer to the Helm documentation for installation instructions.
- MinIO Storage (optional): MinIO storage is also used for saving benchmark results. To use pre-downloaded model weights from MinIO storage, set the following environment variables; if they are not set, models are downloaded from HuggingFace:
  - BUCKET_STORAGE_HOST
  - BUCKET_STORAGE_ACCESS_KEY
  - BUCKET_STORAGE_SECRET_KEY
  - BUCKET_MODEL_PATH
- HuggingFace Token (optional): Required for downloading gated models (e.g., Mistral and LLaMA 3.x) from HuggingFace if they are not available locally.
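The fallback described above (MinIO when the bucket variables are set, HuggingFace otherwise) can be checked before a job is submitted. The sketch below is only an illustration: the variable names come from the list above, but the helper functions are hypothetical, and requiring all four variables is an assumption. The chart's actual logic is not shown in this document.

```python
import os

# Variable names are taken from the prerequisites list above.
BUCKET_VARS = (
    "BUCKET_STORAGE_HOST",
    "BUCKET_STORAGE_ACCESS_KEY",
    "BUCKET_STORAGE_SECRET_KEY",
    "BUCKET_MODEL_PATH",
)


def missing_bucket_vars(env=None):
    """Return the bucket variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in BUCKET_VARS if not env.get(name)]


def select_model_source(env=None):
    """Return 'minio' when every bucket variable is set, else 'huggingface'.

    Assumption: all four variables are needed for the MinIO path.
    """
    return "huggingface" if missing_bucket_vars(env) else "minio"


if __name__ == "__main__":
    print("model source:", select_model_source())
    unset = missing_bucket_vars()
    if unset:
        print("unset bucket variables:", ", ".join(unset))
```

Running the script in the shell that will launch the job shows which source the weights would likely come from and which variables are still unset.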
Implementation
Basic configurations are defined in the `values.yaml` file. YAML files in the `overrides/models/` directory can be used to reproduce benchmarks for specific scenarios, such as different models, tensor-parallelism settings, data types, and quantization schemes.
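To show how an override file layers on top of `values.yaml`, here is a simplified model of Helm's value merging: nested maps are merged key by key, while scalars and lists from the override replace the base value. It is a sketch, not Helm's implementation; it ignores Helm's special handling of `null`, and the keys (`model.name`, `model.dtype`, `tensorParallelSize`) are hypothetical, not taken from this chart.

```python
import copy


def merge_values(base, override):
    """Return a new dict with override layered on top of base.

    Nested dicts are merged recursively; any other value in override
    (scalar or list) replaces the value from base.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical values, for illustration only.
base_values = {
    "model": {"name": "example/base-model", "dtype": "auto"},
    "tensorParallelSize": 1,
}
override_values = {
    "model": {"name": "example/fp8-model"},
    "tensorParallelSize": 2,
}

effective_values = merge_values(base_values, override_values)

if __name__ == "__main__":
    print(effective_values)
```

Keys that the override does not mention (here `model.dtype`) keep their base value, which is why an override file only needs to list the settings that differ for a given scenario.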
Example: Benchmarking a Specific Model Configuration
To benchmark a specific model (e.g., Mistral-7B-Instruct-v0.3-FP8) with its settings, run the following command from the `helm` directory:
The benchmark results are printed at the end of the job log. An example result follows:
============ Serving Benchmark Result ============
Successful requests: 256
Benchmark duration (s): 63.26
Total input tokens: 524288
Total generated tokens: 524288
Request throughput (req/s): 4.05
Output token throughput (tok/s): 8287.48
Total Token throughput (tok/s): 16574.96
---------------Time to First Token----------------
Mean TTFT (ms): 5749.02
Median TTFT (ms): 5569.11
P99 TTFT (ms): 10835.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 28.06
Median TPOT (ms): 28.15
P99 TPOT (ms): 30.52
---------------Inter-token Latency----------------
Mean ITL (ms): 28.06
Median ITL (ms): 25.17
P99 ITL (ms): 40.60
----------------End-to-end Latency----------------
Mean E2EL (ms): 63192.59
Median E2EL (ms): 63190.51
P99 E2EL (ms): 63229.48
==================================================
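The headline figures in the report can be cross-checked against each other. The sketch below parses the `label: value` lines into a dictionary and recomputes the throughput values from the token counts and the duration. It also estimates mean end-to-end latency as TTFT plus TPOT times the remaining output tokens, which assumes every request generated the same number of tokens. The function names are hypothetical, and the sample is a subset of the report above.

```python
import math
import re

SAMPLE_LOG = """\
============ Serving Benchmark Result ============
Successful requests: 256
Benchmark duration (s): 63.26
Total input tokens: 524288
Total generated tokens: 524288
Request throughput (req/s): 4.05
Output token throughput (tok/s): 8287.48
Total Token throughput (tok/s): 16574.96
---------------Time to First Token----------------
Mean TTFT (ms): 5749.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 28.06
----------------End-to-end Latency----------------
Mean E2EL (ms): 63192.59
==================================================
"""

# A metric line is "<label>: <number>"; separator lines contain no colon.
METRIC_LINE = re.compile(r"^(.+?):\s+(-?\d+(?:\.\d+)?)\s*$")


def parse_benchmark_result(text):
    """Map each metric label in the report to its numeric value."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


def recompute(metrics):
    """Recompute derived metrics from the raw counts and the duration."""
    duration = metrics["Benchmark duration (s)"]
    requests = metrics["Successful requests"]
    generated = metrics["Total generated tokens"]
    total = metrics["Total input tokens"] + generated
    return {
        "Request throughput (req/s)": requests / duration,
        "Output token throughput (tok/s)": generated / duration,
        "Total Token throughput (tok/s)": total / duration,
        # Assumes every request generated the same number of tokens.
        "Mean E2EL (ms)": metrics["Mean TTFT (ms)"]
        + metrics["Mean TPOT (ms)"] * (generated / requests - 1),
    }


if __name__ == "__main__":
    reported = parse_benchmark_result(SAMPLE_LOG)
    for name, value in recompute(reported).items():
        consistent = math.isclose(value, reported[name], rel_tol=0.01)
        print(f"{name}: reported={reported[name]} "
              f"recomputed={value:.2f} consistent={consistent}")
```

Small differences between reported and recomputed values are expected because the report rounds the duration to two decimal places; the 1% tolerance is an arbitrary choice for this sketch.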