Tutorial 03: Deliver model and data to cluster MinIO, then run Megatron-LM continuous pretraining¶
This tutorial involves the following steps:
1. Download a model from the HuggingFace Hub in HuggingFace Transformers format, convert it to the Megatron-LM compatible format, and save it to the cluster-internal MinIO storage server.
2. Download a sample dataset from the HuggingFace Hub in jsonl format, preprocess it into the Megatron-LM compatible format, and store it in a cluster-internal MinIO storage server.
3. Execute a multi-node Megatron-LM continuous pretraining job using the base model and dataset prepared in steps 1 and 2, and save the resulting checkpoints to the cluster-internal MinIO storage.
4. Perform an inference workload using the final checkpoint from step 3 in Megatron-LM format to validate the results.
1. Setup¶
Follow the setup in the tutorial 0 prerequisites section.
2. Run workloads¶
2.1 Prepare model in Megatron-LM format¶
2.1.1 Download model¶
To download the meta-llama/Llama-3.1-8B model from the HuggingFace Hub and upload it to the in-cluster MinIO bucket, use the Helm chart located at workloads/download-huggingface-model-to-bucket/helm.
helm template workloads/download-huggingface-model-to-bucket/helm \
--values workloads/download-huggingface-model-to-bucket/helm/overrides/tutorial-03-llama-3.1-8b.yaml \
--name-template "download-llama3-1-8b" \
| kubectl apply -f -
The model will be stored in the remote MinIO bucket at the path default-bucket/models/meta-llama/Llama-3.1-8B after being downloaded from the HuggingFace Hub.
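To verify the upload, you can check that the downloader pod has completed and, if you have a MinIO client alias configured for the in-cluster MinIO, list the uploaded objects. Both the grep pattern (the release name used above) and the mc alias myminio are assumptions about your setup:

kubectl get pods | grep download-llama3-1-8b
mc ls myminio/default-bucket/models/meta-llama/Llama-3.1-8B/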
2.1.2 Convert model checkpoints to Megatron-LM format¶
To convert the model checkpoints into the Megatron-LM compatible format, use the Helm chart located at workloads/llm-megatron-ckpt-conversion/helm.
helm template workloads/llm-megatron-ckpt-conversion/helm \
--values workloads/llm-megatron-ckpt-conversion/helm/overrides/tutorial-03-llama-3.1-8b.yaml \
--name-template "llama3-1-8b" \
| kubectl create -f -
The conversion process begins by copying the model checkpoint files from the MinIO storage to the workload's working directory. These checkpoint files are then processed within the conversion container to transform them into the Megatron-LM compatible format. Once the conversion is complete, the transformed checkpoint is uploaded back to the internal MinIO storage at the location default-bucket/megatron-models/meta-llama/Llama-3.1-8B/ for subsequent use.
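As a quick way to follow the conversion, and assuming the chart creates a Kubernetes Job whose name contains the release name llama3-1-8b used above (an assumption about the chart's templates), you can check the Job and stream its logs:

kubectl get jobs | grep llama3-1-8b
kubectl logs -f job/<conversion-job-name>

Replace <conversion-job-name> with the Job name printed by the first command.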
2.2 Prepare data in Megatron-LM format¶
We will use the Helm chart located at workloads/prepare-data-for-megatron-lm/helm to download and preprocess a sample of the HuggingFaceFW/fineweb-edu dataset using the HuggingFace tokenizer downloaded in the previous step (Download model). The user input file is workloads/prepare-data-for-megatron-lm/helm/overrides/tutorial-03-fineweb-data-sample.yaml.
helm template workloads/prepare-data-for-megatron-lm/helm \
--values workloads/prepare-data-for-megatron-lm/helm/overrides/tutorial-03-fineweb-data-sample.yaml \
--name-template "prepare-fineweb-data" \
| kubectl apply -f -
Refer to the Monitoring progress, logs, and GPU utilization with k9s section to track data and tokenizer downloads, data preprocessing, and uploads to the in-cluster MinIO bucket.
2.3 Run multi-node Megatron-LM continuous pretraining job¶
To launch the Megatron-LM pretraining job, use the Helm chart located at workloads/llm-pretraining-megatron-lm-ray/helm with the following command:
helm template workloads/llm-pretraining-megatron-lm-ray/helm \
--values workloads/llm-pretraining-megatron-lm-ray/helm/overrides/tutorial-03-values-llama-8b-16ddp.yaml \
| kubectl apply -f -
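Monitoring is covered in more detail below; as a quick check right after launching, and assuming the chart spawns regular pods in your namespace (pod names depend on the chart's templates), you can watch them come up and tail the logs of one of them:

kubectl get pods -w
kubectl logs -f <pretraining-pod-name>

Here <pretraining-pod-name> is a placeholder for one of the pod names printed by the first command.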
2.4 Run inference workload with the final checkpoint (2.3) and query it using sample prompts on Llama-3.1-8B¶
To perform inference with the just-trained Llama-3.1-8B model and verify its quality, follow these steps:
- Execute the Llama-3.1-8B single-node Megatron-LM inference workload. This step verifies that the model is correctly deployed and can respond to basic prompts.
- Query the model with a simple prompt to confirm it generates coherent responses.
2.4.1 Run Megatron-LM inference workload¶
helm template workloads/llm-inference-megatron-lm/helm/ \
--values workloads/llm-inference-megatron-lm/helm/overrides/tutorial-03-llama-3-1-8b.yaml \
| kubectl apply -f -
2.4.2 Monitoring progress, logs, and GPU utilization with k9s¶
To monitor training progress, view workload logs, and observe GPU utilization, we recommend using k9s. Refer to the official documentation for detailed guidance. Below are basic commands for this tutorial:
To access the Pods view in your namespace, run:
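k9s -n <your-namespace>

Replace <your-namespace> with the namespace you used during setup; once k9s is open, type :pods to switch to the Pods view if it is not shown already.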
Navigate using the arrow keys to select the pod containing the keyword "inference" and press Enter to view the pod running the inference server. View logs by pressing l. Logs display output messages generated during runtime. Press Esc to return to the previous k9s view.
2.4.3 Connect to the inference service and query it to sample prompt continuations¶
First, check the deployment status:
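A plain listing of the deployments in your namespace works here, for example:

kubectl get deployments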
You should see a deployment with a name in the format llm-inference-megatron-lm-YYYYMMDD-HHMM (e.g. llm-inference-megatron-lm-20250811-1229) in the ready state.
Next, get the name of the corresponding service deployed by the workload.
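For example, you can list the services in your namespace with:

kubectl get services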
The service should have the same name as the deployment above, with the format llm-inference-megatron-lm-YYYYMMDD-HHMM. Note the port exposed by the service; it is expected to be port 80.
Forward the service port to your local machine; in the example below, remote port 80 is mapped to local port 5000. Do not forget to replace llm-inference-megatron-lm-YYYYMMDD-HHMM with your real service name:
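kubectl port-forward service/llm-inference-megatron-lm-YYYYMMDD-HHMM 5000:80

This is the standard kubectl port-forward form (LOCAL_PORT:SERVICE_PORT); the service name here is the placeholder from above, not a literal name.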
Now the inference API is available at http://localhost:5000.
You can use curl to send requests to the inference API. Make sure you have the service port forwarded as shown above. Send a simple prompt to the model to check if it responds coherently. For example:
curl -X PUT -H "Content-Type: application/json" \
-d '{"prompts": ["What is the capital of France?"], "tokens_to_generate": 32}' \
http://localhost:5000/api
You should receive a JSON response with the model's answer. For a healthy model, the answer should be "Paris", possibly followed by some extra text.
Try a few more prompts to check basic reasoning and language ability:
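The prompts below are only illustrative; any short factual or reasoning questions work with the same request format as above:

curl -X PUT -H "Content-Type: application/json" \
-d '{"prompts": ["The three primary colors are", "If a train travels 60 km in 30 minutes, its average speed is"], "tokens_to_generate": 64}' \
http://localhost:5000/api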