Dataset Preprocessing with VeRL¶
Kubernetes Jobs convert dataset from MinIO or Hugging Face into VeRL-ready Parquet shards for SFT.
Quick reference¶
- Input:
datasetRemotepulls from MinIO, whilepreprocess.localDatasetPathskips the download step. Setpreprocess.modetohfScriptfor Hugging Face pulls and describe the dataset viapreprocess.hfScriptArgs.*. - Transformation: Point
preprocess.builtinScriptto any packaged VeRL helper or rely on the HF helper for Q&A (questionField/answerField) and multi-turn (conversationField) corpora. Increasepreprocess.hfScriptArgs.numProcfor fasterdatasets.mapcalls. - If
preprocess.builtinScriptis a relative path (e.g.,examples/data_preprocess/gsm8k.py), it is automatically resolved againstpreprocess.verlRootDir(default:/workspace/verl). - Custom scripts: Switch
preprocess.modetocustomScriptto bake an inline Python helper (preprocess.customScript) into the ConfigMap. - Output:
preprocess.outputDiris the local staging folder;outputRemotePathmirrors the Parquet shards back to MinIO when set. - Access + auth: make sure
bucketStorageHost,bucketCredentialsSecret, and (if required)hfTokenSecretline up with your cluster. Private images go throughimagePullSecrets. - Sizing: tune
resources.requests/limitsfor CPU-bound preprocessing. Jobs are namespace-agnostic; append-n <ns>to your kubectl commands.
values.yaml documents every option.
Running the workload¶
Use an existing override or create a new one that sets bucket paths, auth, and selects the right preprocess.mode (builtinScript, customScript, or hfScript).
Custom scripts live directly under preprocess.customScript. When preprocess.mode=customScript, the chart renders that block into /configs/custom_script.py and automatically points entrypoint.sh at it, so the standard download/upload logic still runs before and after your custom Python.
Render the chart from the local directory and submit it to Kubernetes. This command is intended to be run from the aim-fine-tuning directory:
helm template dolly-15k aimtrain-dataprep-verl/helm \
--values aimtrain-dataprep-verl/helm/overrides/sft/dolly-15k.yaml \
| kubectl create -f -
Data format¶
The helper writes VeRL SFT rows (data_source, messages, extra_info). For details and downstream expectations, see the VeRL data preparation guide.