
Fine-tuning config structure and parameters for SFT

This document describes the structure of the SFT fine-tuning configuration, and the parameters and values that can be defined there.

See the fine-tuning config section of this config file for an example of a valid configuration. See the various sub-configs for their options. Additional properties are not allowed.

Top-level properties:

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| data_conf | object | | ChatTrainValidConfig | | The data input config |
| training_args | object | | SilogenTrainingArguments | | Transformers TrainingArguments with some restrictions |
| batchsize_conf | object | | BatchsizeConfig | | Batch size configuration |
| peft_conf | object | | GenericPeftConfig and/or NoPeftConfig and/or PretrainedPeftConfig | | Adapter configuration |
| run_conf | object | | RunConfig | | Model related configuration |
| sft_args | object | | SFTArguments | | SFT specific arguments |
| method | const | | sft | "sft" | |
| overrides | object | | Overrides | {"lr_multiplier": 1.0, "lr_batch_size_scaling": "none"} | Override options to simplify the config interface |
| tracking | object or null | | FinetuningTrackingConfig | null | MLflow tracking configuration |
| quant_conf | object | | BnBQuantizationConfig and/or NoQuantizationConfig | {"quantization_type": "no-quantization"} | Quantization configuration |
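
For orientation, here is a minimal sketch of a top-level SFT config, assuming the experiment config is written as YAML (values are illustrative placeholders, not recommendations; see the sub-config definitions below for all options):

```yaml
method: sft
data_conf:
  training_data:
    type: CONCATENATION
    datasets:
      - path: /local_resources/data/train.jsonl  # hypothetical path
  validation_data:
    type: AUTO_SPLIT
training_args:
  learning_rate: 2.0e-5
  num_train_epochs: 1
batchsize_conf:
  total_train_batch_size: 64
  max_per_device_train_batch_size: 8
peft_conf:
  peft_type: NO_PEFT
run_conf:
  model: /local_resources/basemodel
sft_args:
  max_seq_length: 2048
```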

Definitions

AutoSplitDataInput

Automatic validation split from the training data

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| type | const | | AUTO_SPLIT | | |
| data_type | string | | string | "ChatConversation" | Generally, the data_type is automatically set based on the experiment config method. |
| ratio | number | | number | 0.2 | Ratio of the training data to use for validation |
| seed | integer | | integer | 1289525893 | Seed for the random number generator for splitting |
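
For example, to hold out 10% of the training data for validation, the validation input could look like this sketch (YAML assumed, as above):

```yaml
validation_data:
  type: AUTO_SPLIT
  ratio: 0.1   # use 10% of the training data for validation
```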

BatchsizeConfig

Config for determining the total batch size

Total batch size is the effective batch size for the complete training run. It is equal to: number of processes × per-device batch size × gradient accumulation steps.

The maximum batch size per device is the largest batch size that can be accommodated on a single device. This is mostly limited by the memory capacity of the device.

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| total_train_batch_size | integer | | integer | | The total batch size for the training run |
| max_per_device_train_batch_size | integer | | integer | | The maximum training batch size per device |
| per_device_eval_batch_size | integer or null | | integer | null | The maximum eval batch size per device; if not given, the training batch size is used |
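
As a worked example of the relation above: on 4 GPUs with the settings below, each device can take a batch of 8, so reaching the total of 64 implies 64 / (4 × 8) = 2 gradient accumulation steps. A sketch, assuming the trainer derives accumulation this way:

```yaml
batchsize_conf:
  total_train_batch_size: 64           # effective batch size for the whole run
  max_per_device_train_batch_size: 8   # limited by per-GPU memory
  # With 4 processes: accumulation = 64 / (4 * 8) = 2
```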

BnBQuantizationConfig

Bits and Bytes configuration

The options are from the BitsAndBytes config, see: https://huggingface.co/docs/transformers/en/main_classes/quantization#transformers.BitsAndBytesConfig

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| quantization_type | const | | bits-and-bytes | "bits-and-bytes" | |
| load_in_8bit | boolean | | boolean | false | |
| load_in_4bit | boolean | | boolean | false | |
| llm_int8_threshold | number | | number | 6.0 | |
| llm_int8_skip_modules | array or null | | string | null | |
| llm_int8_enable_fp32_cpu_offload | boolean | | boolean | false | |
| llm_int8_has_fp16_weight | boolean | | boolean | false | |
| bnb_4bit_compute_dtype | string or null | | string | null | |
| bnb_4bit_quant_type | const | | fp4 and/or nf4 | "fp4" | |
| bnb_4bit_use_double_quant | boolean | | boolean | false | |
| bnb_4bit_quant_storage | string or null | | string | null | |
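
For instance, a common QLoRA-style 4-bit setup could be expressed like this (an illustrative sketch, not a recommended setting):

```yaml
quant_conf:
  quantization_type: bits-and-bytes
  load_in_4bit: true
  bnb_4bit_quant_type: nf4           # NormalFloat4 quantization
  bnb_4bit_use_double_quant: true    # also quantize the quantization constants
  bnb_4bit_compute_dtype: bfloat16   # compute in bf16 while weights stay 4-bit
```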

ChatTemplateName

Chat template to use.

Type: string

Possible Values: mistral-with-system or chat-ml or poro or keep-original or simplified-llama31

ChatTrainValidConfig

Training time data configuration

Always defines some DataInput for training data and can include validation DataInput, though a trivial NoneDataInput is also allowed for the validation side.

Additionally includes chat template and padding configurations, as those are part of the data input pipeline.

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| training_data | object | | ConcatenationDataInput and/or WeightedMixDataInput | | |
| validation_data | object | | AutoSplitDataInput and/or ConcatenationDataInput and/or NoneDataInput | | |
| chat_template_name | string | | ChatTemplateName | "mistral-with-system" | |
| padding_side | string | | string | "right" | Padding side; "right" is usually the right choice. |
| missing_pad_token_strategy | string | | MissingPadTokenStrategy | "bos-repurpose" | See MissingPadTokenStrategy for descriptions of the options |
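
Put together, a data_conf section might look like this sketch (YAML assumed; the path is hypothetical):

```yaml
data_conf:
  training_data:
    type: CONCATENATION
    datasets:
      - path: /local_resources/data/train.jsonl   # hypothetical path
  validation_data:
    type: NONE                                    # skip validation entirely
  chat_template_name: chat-ml
  padding_side: right
  missing_pad_token_strategy: bos-repurpose
```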

ConcatenationDataInput

A simple list of datasets

These are simply concatenated, the same as sampling all with equal weight.

The datasets themselves need to be in the supported fine-tuning JSONL formats. For SFT this means lines of the form:

{"messages": [{"content": "string", "role": "string"}]}

For DPO this means lines of the form:

{"prompt_messages": [{"content": "string", "role": "string"}], "chosen_messages": [{"content": "string", "role": "string"}], "rejected_messages": [{"content": "string", "role": "string"}]}

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| type | const | | CONCATENATION | | |
| datasets | array | | DatasetDefinition | | |
| data_type | string | | string | "ChatConversation" | Generally, the data_type is automatically set based on the experiment config method. |

DatasetDefinition

Define how to load a dataset

Type: object

| Property | Type | Required | Possible values | Description |
| --- | --- | --- | --- | --- |
| path | string | | string | Local path to a JSONL file in the finetuning data format |

FinetuningTrackingConfig

Settings that define how run details are logged

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| mlflow_server_uri | string | | string | | MLflow server URI. Can be a local path. |
| experiment_name | string | | string | | Experiment name that is used for MLflow tracking. |
| hf_mlflow_log_artifacts | string | | string | "False" | Whether to store model artifacts in MLflow. |
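
A tracking section could look like this sketch (the server URI and experiment name are hypothetical):

```yaml
tracking:
  mlflow_server_uri: http://mlflow.example.internal:5000   # hypothetical server
  experiment_name: sft-experiments
  hf_mlflow_log_artifacts: "False"   # note: a string, not a boolean
```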

GenericPeftConfig

Config for any new initialized PEFT Adapter

See https://huggingface.co/docs/peft/tutorial/peft_model_config for the possible kwargs and https://github.com/huggingface/peft/blob/v0.7.1/src/peft/utils/peft_types.py for the types.

Example:

>>> loaded_data = {'peft_type':'LORA', 'task_type': 'CAUSAL_LM',
...         'peft_kwargs': {'r': 32, 'target_modules': ['v_proj']}}
>>> generic_conf = GenericPeftConfig(**loaded_data)
>>> generic_conf.get_peft_config()
LoraConfig(task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, peft_type=<PeftType.LORA: 'LORA'>, ...)

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| peft_type | string | | PeftType | | |
| task_type | string | | TaskType | "CAUSAL_LM" | |
| peft_kwargs | object | | object | | |
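
In the experiment config, the adapter from the doctest above would be specified under peft_conf; a sketch (lora_alpha is just an example of an additional kwarg, see the PEFT docs for the full list):

```yaml
peft_conf:
  peft_type: LORA
  task_type: CAUSAL_LM
  peft_kwargs:
    r: 32
    lora_alpha: 64             # example extra kwarg; any LoraConfig parameter fits here
    target_modules: [v_proj]
```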

MissingPadTokenStrategy

Specifies the available missing pad token strategies.

We've shown in a small set of experiments that repurposing EOS can start to hurt performance while the other options seem to work equally well.

Repurposing EOS is the default in many online sources, but it is actually a bad idea if we want to predict EOS, as all the pad_token_ids get ignored in loss computation, and thus the model does not learn to predict the end of the text. However, for models that have additional tokens for end of message, end of turn, etc. this is not so dangerous.

Repurposing BOS is similar to repurposing EOS, but since we do not need to predict BOS, this may be more sensible.

Repurposing UNK can work with tokenizers that never produce UNKs in normal data (e.g. Mistral tokenizers should have a byte fall-back so that everything can be tokenized).

UNK_CONVERT_TO_EOS uses a hack where the unk_token_id is initially used for padding, but in the collation phase the input-side UNKs (padding) get set to EOS, so that the input-side padding looks like EOS. On the output side, the UNKs (padding) are still ignored. NOTE: This will leave the tokenizer's pad_token_id set to the unk_token_id, so any subsequent use of the model where padding is involved should explicitly set the pad_token_id again.

Type: string

Possible Values: eos-repurpose or bos-repurpose or unk-repurpose or unk-convert-to-eos

ModelArguments

These are passed to AutoModelForCausalLM.from_pretrained

See the parameter docstrings and help at: https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained. Also see the "Parameters for big model inference" section there; it affects training too. Note that this link points to the transformers main branch version, so be sure to compare against the installed version of transformers (the API keeps changing over time and it is difficult to keep this docstring up to date, which is why we link to the latest here).

Some important parameters to consider are:

  • device_map : A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name, once a given module name is inside, every submodule of it will be sent to the same device. If we only pass the device (e.g., "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1) on which the model will be allocated, the device map will map the entire model to this device. Passing device_map = 0 means put the whole model on GPU 0.
  • attn_implementation : The attention implementation to use in the model (if relevant). Can be any of "eager" (manual implementation of the attention), "sdpa" (using F.scaled_dot_product_attention), or "flash_attention_2" (using Dao-AILab/flash-attention). By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual "eager" implementation.

NOTE: This does not include quantization_config. Quantization config is specified separately.

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| silogen_extra_args | object | | object | | Don't specify directly - this gathers additional args passed to the model |
| dtype | const or string | | auto and/or string | "auto" | |
| pretrained_model_name_or_path | string or null | | Format: path and/or string | null | Can be either: a string, the model id of a pretrained model hosted inside a model repo on huggingface.co; a path to a directory containing model weights saved using ~PreTrainedModel.save_pretrained; a path or URL to a TensorFlow index checkpoint file; a path or URL to a model folder containing a Flax checkpoint file in .msgpack format; or None if you are providing both the configuration and state dictionary. |
| config | string or null | | Format: path and/or string | null | Configuration for the model to use instead of an automatically loaded configuration. Can be either an instance of a class derived from PretrainedConfig, or a string/path valid as input to PretrainedConfig.from_pretrained. |
| cache_dir | string or null | | Format: path and/or string | null | Path to a directory in which a downloaded pretrained model configuration should be cached. |
| from_tf | boolean | | boolean | false | Load the model weights from a TensorFlow checkpoint save file. |
| from_flax | boolean | | boolean | false | Load the model weights from a Flax checkpoint save file. |
| ignore_mismatched_sizes | boolean | | boolean | false | Whether or not to raise an error if some of the weights from the checkpoint do not have the same size as the weights of the model. |
| force_download | boolean | | boolean | false | Whether or not to force the (re-)download of the model weights and configuration files. |
| proxies | object or null | | object | null | A dictionary of proxy servers to use by protocol or endpoint. |
| output_loading_info | boolean | | boolean | false | Whether or not to also return a dictionary containing missing keys, unexpected keys and error messages. |
| local_files_only | boolean | | boolean | false | Whether or not to only look at local files (i.e., do not try to download the model). |
| token | boolean or string or null | | boolean and/or string | null | The token to use as HTTP bearer authorization for remote files. |
| revision | string | | string | "main" | The specific model version to use. It can be a branch name, a tag name, or a commit id. |
| attn_implementation | string or null | | string | null | The attention implementation to use in the model. Can be any of 'eager', 'sdpa', 'flash_attention_2', or 'flash_attention_3'. Also accepts HF kernel references in the form `<namespace>/<repo_name>[@<revision>][:<kernel_name>]`. |
| device_map | integer or object or string or null | | integer and/or object and/or string | null | A map that specifies where each submodule should go. |
| max_memory | object or null | | object | null | A dictionary of device identifiers to maximum memory, if using device_map. |
| tp_plan | string or null | | string | null | A torch tensor parallel plan. Currently only accepts 'auto'. |
| tp_size | string or null | | string | null | A torch tensor parallel degree. If not provided, defaults to the world size. |
| offload_folder | string or null | | Format: path and/or string | null | If the device_map contains any value 'disk', the folder where we will offload weights. |
| offload_buffers | boolean | | boolean | false | Whether or not to offload the buffers with the model parameters. |
| subfolder | string | | string | "" | In case the relevant files are located inside a subfolder of the model repo on huggingface.co. |
| variant | string or null | | string | null | If specified, load weights from the variant filename, e.g. `pytorch_model.<variant>.bin`. |
| use_safetensors | boolean or null | | boolean | null | Whether or not to use safetensors checkpoints. |
| weights_only | boolean | | boolean | true | Indicates whether the unpickler should be restricted to loading only tensors and primitive types. |
| key_mapping | object or null | | object | null | A potential mapping of the weight names if using a model on the Hub which is compatible with a Transformers architecture but was not converted accordingly. |
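
For example, overriding a few of these via run_conf.model_args could look like this sketch (values are illustrative):

```yaml
run_conf:
  model_args:
    dtype: bfloat16                         # instead of the default "auto"
    attn_implementation: flash_attention_2
    device_map: auto
```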

NoPeftConfig

A trivial config specifying that no peft is used

Type: object

| Property | Type | Required | Possible values | Description |
| --- | --- | --- | --- | --- |
| peft_type | const | | NO_PEFT | |

NoQuantizationConfig

A marker not to use quantization

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| quantization_type | const | | no-quantization | "no-quantization" | |

NoneDataInput

A special type for not using data e.g. in validation

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| type | const | | NONE | | |
| data_type | string | | string | "ChatConversation" | Generally, the data_type is automatically set based on the experiment config method. |

Overrides

Override options

These implement dynamic scaling for the learning rate.

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| lr_multiplier | number | | number | 1.0 | Multiplier applied to the learning rate in the training_args |
| lr_batch_size_scaling | string | | none, sqrt, linear | "none" | Scales the learning rate in the training_args by a factor derived from the total training batch size. 'none': No scaling. 'sqrt': Multiplies learning rate by square root of batch size (a classic scaling rule). 'linear': Multiplies learning rate by the batch size (a more modern scaling rule). |
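
A worked example, assuming the multiplier and the batch-size factor compose multiplicatively (the exact composition is up to the implementation):

```yaml
overrides:
  lr_multiplier: 0.5
  lr_batch_size_scaling: sqrt
# With training_args.learning_rate = 1.0e-5 and total_train_batch_size = 64:
#   effective lr = 1.0e-5 * 0.5 * sqrt(64) = 4.0e-5
# (assumes the multiplier and the scaling factor are simply multiplied together)
```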

PeftType

Enum class for the different types of adapters in PEFT.

Supported PEFT types: PROMPT_TUNING, MULTITASK_PROMPT_TUNING, P_TUNING, PREFIX_TUNING, LORA, ADALORA, BOFT, ADAPTION_PROMPT, IA3, LOHA, LOKR, OFT, XLORA, POLY, LN_TUNING, VERA, FOURIERFT, HRA, VBLORA, CPT, BONE, RANDLORA, TRAINABLE_TOKENS, C3A.

Type: string

Possible Values: PROMPT_TUNING or MULTITASK_PROMPT_TUNING or P_TUNING or PREFIX_TUNING or LORA or ADALORA or BOFT or ADAPTION_PROMPT or IA3 or LOHA or LOKR or OFT or POLY or LN_TUNING or VERA or FOURIERFT or XLORA or HRA or VBLORA or CPT or BONE or RANDLORA or TRAINABLE_TOKENS or C3A

PretrainedPeftConfig

PEFT adapter uses the config and initialisation from a pretrained adapter

Type: object

| Property | Type | Required | Possible values | Description |
| --- | --- | --- | --- | --- |
| peft_type | const | | PRETRAINED_PEFT | |
| name_or_path | string | | string | HF ID or path to the pretrained peft. |
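
A sketch of loading an existing adapter (the path is hypothetical):

```yaml
peft_conf:
  peft_type: PRETRAINED_PEFT
  name_or_path: /local_resources/adapters/my-lora   # hypothetical adapter path
```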

RunConfig

Experiment running configuration

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| model | string | | string | "/local_resources/basemodel" | Local path to the model to be fine-tuned. Normally this should be /local_resources/basemodel |
| model_args | object | | ModelArguments | {"dtype": "auto", "pretrained_model_name_or_path": null, "config": null, "cache_dir": null, "from_tf": false, "from_flax": false, "ignore_mismatched_sizes": false, "force_download": false, "proxies": null, "output_loading_info": false, "local_files_only": false, "token": null, "revision": "main", "attn_implementation": null, "device_map": "auto", "max_memory": null, "tp_plan": null, "tp_size": null, "offload_folder": null, "offload_buffers": false, "subfolder": "", "variant": null, "use_safetensors": null, "weights_only": true, "key_mapping": null} | |
| tokenizer | string or null | | string | null | Model HuggingFace ID, or path, or None to use the one associated with the model |
| use_fast_tokenizer | boolean | | boolean | true | Use the Fast version of the tokenizer. The 'slow' version may be compatible with more features. |
| resume_from_checkpoint | boolean or string | | boolean and/or string | false | Normally should be set to 'auto' to continue if a checkpoint exists. Set to true to always try to continue, false to never try, or a path to load a specific checkpoint. |
| final_checkpoint_name | string | | string | "checkpoint-final" | Name of the final checkpoint. Should be left as the default. |
| determinism | string | | no, half, full | "no" | Set the level of determinism in implementations. Deterministic implementations are not always available, and when they are, they are usually slower than their non-deterministic counterparts. Recommended for debugging only. 'no': No determinism. 'half': Prefer deterministic implementations. 'full': Only fully deterministic implementations, error out on operations that only have non-deterministic implementations. |
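
A typical run_conf sketch (YAML assumed):

```yaml
run_conf:
  model: /local_resources/basemodel
  resume_from_checkpoint: auto   # continue if a checkpoint exists
  determinism: "no"              # quote this: a bare `no` parses as boolean false in YAML 1.1
```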

SFTArguments

Supervised fine-tuning arguments

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| max_seq_length | integer | | integer | 2048 | Maximum input sequence length. Longer sequences will be filtered out or truncated. |
| length_handling | string | | filter, truncate | "filter" | How to handle examples that are longer than max_seq_length. 'filter': Filter out these examples from the training set. 'truncate': Truncate these examples to max_seq_length. Note that this might lead to loss of information and worse performance, especially if the important information is at the end of the sequence. |
| save_name_if_new_basemodel | string | | string | "checkpoint-new-basemodel" | If a new basemodel is saved, it will be saved with this name |
| train_on_completions_only | boolean | | boolean | false | Only compute loss on the assistant's turns. |
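
For example (illustrative values):

```yaml
sft_args:
  max_seq_length: 4096
  length_handling: truncate        # keep long examples, cut them at 4096 tokens
  train_on_completions_only: true  # mask the loss on everything but assistant turns
```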

SilogenTrainingArguments

HuggingFace TrainingArguments as Config with additional SiloGen conventions

The full list of training arguments is best viewed online (the version documented here may not be up to date): https://huggingface.co/docs/transformers/v4.57.3/en/main_classes/trainer#transformers.TrainingArguments

The TrainingArguments object does more than specify the training configuration options; for example, it also has computed properties such as the true training batch size.
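
A sketch of a training_args section using standard HuggingFace TrainingArguments fields (illustrative values only):

```yaml
training_args:
  learning_rate: 2.0e-5
  num_train_epochs: 3
  lr_scheduler_type: cosine
  warmup_ratio: 0.03
  logging_steps: 10
  save_strategy: epoch
  bf16: true
  # Batch sizes and gradient accumulation are determined by batchsize_conf,
  # so presumably they should not be set here (one of the restrictions mentioned above).
```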

TaskType

Enum class for the different types of tasks supported by PEFT.

Overview of the supported task types:

  • SEQ_CLS: Text classification.
  • SEQ_2_SEQ_LM: Sequence-to-sequence language modeling.
  • CAUSAL_LM: Causal language modeling.
  • TOKEN_CLS: Token classification.
  • QUESTION_ANS: Question answering.
  • FEATURE_EXTRACTION: Feature extraction. Provides the hidden states which can be used as embeddings or features for downstream tasks.

Type: string

Possible Values: SEQ_CLS or SEQ_2_SEQ_LM or CAUSAL_LM or TOKEN_CLS or QUESTION_ANS or FEATURE_EXTRACTION

WeightedDatasetDefinition

Define a dataset, with a weight for sampling

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| path | string | | string | | Local path to a JSONL file in the finetuning data format |
| sampling_weight | number | | number | 1.0 | |

WeightedMixDataInput

A list of datasets where each is sampled by a certain weight

These datasets are interleaved based on the sampling weights. The resulting dataset is fully precomputed, up to the point where every single sample in every dataset has been picked. This means that with small sampling weights it can take a lot of draws to see every sample from a dataset, so the resulting dataset can be very large.

The datasets themselves need to be in the supported fine-tuning JSONL formats. For SFT this means lines of the form:

{"messages": [{"content": "string", "role": "string"}]}

For DPO this means lines of the form:

{"prompt_messages": [{"content": "string", "role": "string"}], "chosen_messages": [{"content": "string", "role": "string"}], "rejected_messages": [{"content": "string", "role": "string"}]}

Type: object

| Property | Type | Required | Possible values | Default | Description |
| --- | --- | --- | --- | --- | --- |
| type | const | | PRECOMPUTE_WEIGHTED_MIX | | |
| datasets | array | | WeightedDatasetDefinition | | |
| data_type | string | | string | "ChatConversation" | Generally, the data_type is automatically set based on the experiment config method. |
| seed | integer | | integer | 19851243 | Seed for the random number generator for interleaving draws |
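
For example, sampling a secondary dataset at a quarter of the weight of the main one (paths hypothetical):

```yaml
training_data:
  type: PRECOMPUTE_WEIGHTED_MIX
  datasets:
    - path: /local_resources/data/chat_main.jsonl    # hypothetical path
      sampling_weight: 1.0
    - path: /local_resources/data/chat_extra.jsonl   # hypothetical path
      sampling_weight: 0.25
  seed: 19851243
```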