Finetuning config structure and parameters¶
This document describes the structure of the finetuning configuration, and the parameters and values that can be defined there.
See the finetuning config section of this config file for an example of a valid configuration. See the various sub-configs for their options. Additional properties are not allowed.
Top-level properties:
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| method | const |  | sft | "sft" |  |
| data_conf | object | ✅ | ChatTrainValidConfig |  | The data input config |
| training_args | object | ✅ | SilogenTrainingArguments |  | Transformers TrainingArguments with some restrictions |
| overrides | object |  | Overrides | {"num_train_epochs": null, "lr_multiplier": 1.0, "lr_batch_size_scaling": "none"} | Override options to simplify the config interface |
| batchsize_conf | object | ✅ | BatchsizeConfig |  | Batch size configuration |
| peft_conf | object | ✅ | NoPeftConfig or PretrainedPeftConfig or GenericPeftConfig |  | Adapter configuration |
| run_conf | object | ✅ | RunConfig |  | Model related configuration |
| tracking | object or null |  | FinetuningTrackingConfig |  | MLflow tracking configuration |
| quant_conf | object |  | BnBQuantizationConfig or NoQuantizationConfig | {"quantization_type": "no-quantization"} | Quantization configuration |
| sft_args | object | ✅ | SFTArguments |  | SFT specific arguments |
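For orientation, here is a minimal sketch of a complete config, assuming YAML syntax (as in a helm values file); all paths and values are illustrative, not recommendations:

```yaml
# Minimal illustrative finetuning config; paths and values are hypothetical
method: sft
data_conf:
  training_data:
    type: CONCATENATION
    datasets:
      - path: /data/train.jsonl      # hypothetical dataset path
  validation_data:
    type: AUTO_SPLIT
training_args:
  num_train_epochs: 1
  learning_rate: 2.0e-5
batchsize_conf:
  total_train_batch_size: 32
  max_per_device_train_batch_size: 4
peft_conf:
  peft_type: NO_PEFT
run_conf:
  model: /local_resources/basemodel
sft_args:
  max_seq_length: 2048
```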
Definitions¶
AutoSplitDataInput¶
Automatic validation split from the training data
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| type | const | ✅ | AUTO_SPLIT |  |  |
| data_type | string |  | string | "ChatConversation" | Generally, the data_type is automatically set based on the experiment config method |
| ratio | number |  | number | 0.2 | Ratio of the training data to use for validation |
| seed | integer |  | integer | 1289525893 | Seed for the random number generator for splitting |
BatchsizeConfig¶
Config for determining the total batch size
Total batch size is the effective batch size for the complete training run. It is equal to number of processes * per-device batch size * accumulation steps.
The maximum batch size per device is the largest batch size that can be accommodated on a single device. This is mostly limited by the memory capacity of the device.
Type: object
| Property | Type | Required | Possible values | Description |
|---|---|---|---|---|
| total_train_batch_size | integer | ✅ | integer | The total batch size for the training run |
| max_per_device_train_batch_size | integer | ✅ | integer | The maximum training batch size per device |
| per_device_eval_batch_size | integer or null |  | integer | The maximum eval batch size per device; if not given, the training batch size is used |
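As a worked example of the relation above, suppose training runs on 4 processes (e.g. 4 GPUs). Then a config like the following (YAML assumed) implies 64 / (4 * 4) = 4 accumulation steps:

```yaml
batchsize_conf:
  total_train_batch_size: 64          # effective batch size for the run
  max_per_device_train_batch_size: 4  # memory-bound per-device limit
  per_device_eval_batch_size: 8       # eval often fits larger batches (no gradients)
```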
BnBQuantizationConfig¶
Bits and Bytes configuration
The options are from the BitsAndBytes config, see: https://huggingface.co/docs/transformers/en/main_classes/quantization#transformers.BitsAndBytesConfig
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| quantization_type | const |  | bits-and-bytes | "bits-and-bytes" |  |
| load_in_8bit | boolean |  | boolean | False |  |
| load_in_4bit | boolean |  | boolean | False |  |
| llm_int8_threshold | number |  | number | 6.0 |  |
| llm_int8_skip_modules | array or null |  | string |  |  |
| llm_int8_enable_fp32_cpu_offload | boolean |  | boolean | False |  |
| llm_int8_has_fp16_weight | boolean |  | boolean | False |  |
| bnb_4bit_compute_dtype | string or null |  | string |  |  |
| bnb_4bit_quant_type | const |  | fp4 or nf4 | "fp4" |  |
| bnb_4bit_use_double_quant | boolean |  | boolean | False |  |
| bnb_4bit_quant_storage | string or null |  | string |  |  |
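A sketch of a 4-bit quantization block (YAML assumed; the combination below is a common QLoRA-style setup, not a recommendation from this document):

```yaml
quant_conf:
  quantization_type: bits-and-bytes
  load_in_4bit: true
  bnb_4bit_quant_type: nf4          # instead of the default "fp4"
  bnb_4bit_compute_dtype: bfloat16
  bnb_4bit_use_double_quant: true
```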
ChatTrainValidConfig¶
Training-time data configuration.
Always defines a DataInput for the training data, and can include a validation DataInput, though a trivial NoneDataInput is also allowed on the validation side.
Additionally includes the chat template and padding configurations, as those are part of the data input pipeline.
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| training_data | object | ✅ | ConcatenationDataInput or WeightedMixDataInput |  |  |
| validation_data | object | ✅ | AutoSplitDataInput or ConcatenationDataInput or NoneDataInput |  |  |
| chat_template_name | string |  | mistral-with-system or chat-ml or poro or keep-original or simplified-llama31 | "mistral-with-system" |  |
| padding_side | string |  | string | "right" | Padding side; "right" is usually the right choice |
| missing_pad_token_strategy | string |  | MissingPadTokenStrategy | "bos-repurpose" | See MissingPadTokenStrategy for descriptions of the options |
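A sketch of a full data_conf block (YAML assumed; the dataset path is hypothetical):

```yaml
data_conf:
  training_data:
    type: CONCATENATION
    datasets:
      - path: /data/train.jsonl   # hypothetical path
  validation_data:
    type: NONE                    # trivial NoneDataInput: skip validation
  chat_template_name: chat-ml
  padding_side: right
  missing_pad_token_strategy: bos-repurpose
```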
ConcatenationDataInput¶
A simple list of datasets
These are simply concatenated, which is the same as sampling all of them with equal weight.
The datasets themselves need to be in the finetuning-supported JSONL formats. For SFT this means lines of:
`{"messages": {"content": "string", "role": "string"}}`
For DPO this means lines of:
`{"prompt_messages": {"content": "string", "role": "string"}, "chosen_messages": {"content": "string", "role": "string"}, "rejected_messages": {"content": "string", "role": "string"}}`
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| type | const | ✅ | CONCATENATION |  |  |
| datasets | array | ✅ | DatasetDefinition |  |  |
| data_type | string |  | string | "ChatConversation" | Generally, the data_type is automatically set based on the experiment config method |
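A sketch of a concatenation input with two datasets (YAML assumed; paths hypothetical):

```yaml
training_data:
  type: CONCATENATION
  datasets:
    - path: /data/dataset_a.jsonl
    - path: /data/dataset_b.jsonl   # simply appended after dataset_a
```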
DatasetDefinition¶
Define how to load a dataset
Type: object
| Property | Type | Required | Possible values | Description |
|---|---|---|---|---|
| path | string | ✅ | string | Local path to a JSONL file in the finetuning data format |
FinetuningTrackingConfig¶
Settings that define how run details are logged
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| mlflow_server_uri | string | ✅ | string |  | MLflow server URI. Can be a local path |
| experiment_name | string | ✅ | string |  | Experiment name that is used for MLflow tracking |
| run_id | string or null |  | string |  | Run ID, to resume logging to a previously started run |
| run_name | string or null |  | string |  | Run name, to give the run a meaningful name displayed in the MLflow UI. Used only when run_id is unspecified |
| hf_mlflow_log_artifacts | string |  | string | "False" | Whether to store model artifacts in MLflow |
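A sketch of a tracking block (YAML assumed; the server URI and names are hypothetical):

```yaml
tracking:
  mlflow_server_uri: http://mlflow.example.internal:5000   # hypothetical server
  experiment_name: sft-experiments
  run_name: mistral-sft-baseline   # ignored if run_id is given
  hf_mlflow_log_artifacts: "False"
```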
GenericPeftConfig¶
Config for any new initialized PEFT Adapter
See https://huggingface.co/docs/peft/tutorial/peft_model_config for the possible kwargs and https://github.com/huggingface/peft/blob/v0.7.1/src/peft/utils/peft_types.py for the types.
Example¶
>>> loaded_data = {'peft_type':'LORA', 'task_type': 'CAUSAL_LM',
... 'peft_kwargs': {'r': 32, 'target_modules': ['v_proj']}}
>>> generic_conf = GenericPeftConfig(**loaded_data)
>>> # Then later in the code something like:
>>> model = transformers.AutoModel.from_pretrained('hf-internal-testing/tiny-random-MistralModel')
>>> peft.get_peft_model(model, generic_conf.get_peft_config())
PeftModelForCausalLM(
(base_model): LoraModel(
...
)
)
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| peft_type | string | ✅ | PeftType |  |  |
| task_type | string |  | TaskType | "CAUSAL_LM" |  |
| peft_kwargs | object |  | object |  |  |
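The same adapter configuration as in the Python example above, sketched as a config block (YAML assumed; lora_alpha is an additional illustrative LoRA kwarg, not taken from the example):

```yaml
peft_conf:
  peft_type: LORA
  task_type: CAUSAL_LM
  peft_kwargs:
    r: 32
    target_modules: ["v_proj"]
    lora_alpha: 16   # illustrative extra kwarg
```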
MissingPadTokenStrategy¶
Specifies the available missing pad token strategies.
We've shown in a small set of experiments that repurposing EOS can start to hurt performance while the other options seem to work equally well.
Repurposing EOS is the default in many online sources, but it is actually a bad idea if we want to predict EOS, as all the pad_token_ids get ignored in loss computation, and thus the model does not learn to predict the end of the text. However, for models that have additional tokens for end of message, end of turn, etc. this is not so dangerous.
Repurposing BOS is similar to repurposing EOS, but since we do not need to predict BOS, this may be more sensible.
Repurposing UNK can work with tokenizers that never produce UNKs in normal data (e.g. Mistral tokenizers should have a byte fall-back so that everything can be tokenized).
UNK_CONVERT_TO_EOS uses a hack where the unk_token_id is initially used for padding, but in the collation phase the input-side UNKs (padding) get set to EOS, so that the input-side padding looks like EOS. On the output side, the UNKs (padding) are still ignored. NOTE: This will leave the tokenizer's pad_token_id set to the unk_token_id, so any subsequent use of the model where padding is involved should explicitly set the pad_token_id again.
Type: string
Possible Values: eos-repurpose or bos-repurpose or unk-repurpose or unk-convert-to-eos
ModelArguments¶
These are passed to AutoModelForCausalLM.from_pretrained
See parameter docstrings and help at: https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained See also the "Parameters for big model inference" section below it, which affects training too. Note that this link takes you to the transformers main branch version; be sure to compare with the installed version of transformers (it keeps changing over time, and it is difficult to keep this docstring up to date, so we link to the latest here).
Some important parameters to consider are:

- device_map: A map that specifies where each submodule should go. It doesn't need to be refined to each parameter/buffer name; once a given module name is included, every submodule of it will be sent to the same device. If we only pass the device (e.g. "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1) on which the model will be allocated, the device map will map the entire model to this device. Passing device_map = 0 means put the whole model on GPU 0.
- attn_implementation: The attention implementation to use in the model (if relevant). Can be any of "eager" (manual implementation of the attention), "sdpa" (using F.scaled_dot_product_attention), or "flash_attention_2" (using Dao-AILab/flash-attention). By default, if available, SDPA will be used for torch>=2.1.1. Otherwise the default is the manual "eager" implementation.
NOTE: This does not include quantization_config. Quantization config is specified separately.
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| torch_dtype | const |  | auto | "auto" |  |
| device_map | object or string or null |  | object or string |  | Custom device map so that you can manually override the choices that HuggingFace would make. This can also be a string to specify "auto", "balanced_low_0", or "sequential" |
| max_memory | object or null |  | object |  |  |
| low_cpu_mem_usage | boolean |  | boolean | False |  |
| attn_implementation | string or null |  | string |  | Note: this can be set to "sdpa", "flash_attention_2", or "eager" |
| offload_folder | string or null |  | string |  |  |
| offload_state_dict | boolean or null |  | boolean |  | Default is True if offloading (otherwise no effect) |
| offload_buffers | boolean or null |  | boolean |  |  |
| use_cache | boolean |  | boolean | True | Saves generated hidden states to speed up generation. See: https://discuss.huggingface.co/t/what-is-the-purpose-of-use-cache-in-decoder/958 use_cache is mutually exclusive with gradient_checkpointing |
| cache_dir | string or null |  | string |  |  |
| force_download | boolean |  | boolean | False |  |
| local_files_only | boolean |  | boolean | False |  |
| proxies | object or null |  | object |  |  |
| resume_download | boolean |  | boolean | False |  |
| revision | string |  | string | "main" |  |
| code_revision | string |  | string | "main" |  |
| subfolder | string or null |  | string |  |  |
| token | string or null |  | string |  |  |
| use_safetensors | boolean or null |  | boolean |  |  |
| variant | string or null |  | string |  |  |
| trust_remote_code | boolean |  | boolean | False | Warning: if set to True, allows execution of downloaded remote code |
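A sketch of a model_args block overriding a few of the defaults (YAML assumed; whether flash_attention_2 is usable depends on the installed packages and hardware):

```yaml
model_args:
  torch_dtype: auto
  device_map: auto
  attn_implementation: flash_attention_2   # assumes flash-attention is installed
  trust_remote_code: false                 # keep remote code execution disabled
```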
NoPeftConfig¶
A trivial config specifying that no peft is used
Type: object
| Property | Type | Required | Possible values | Description |
|---|---|---|---|---|
| peft_type | const | ✅ | NO_PEFT |  |
NoQuantizationConfig¶
A marker not to use quantization
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| quantization_type | const |  | no-quantization | "no-quantization" |  |
NoneDataInput¶
A special type for not using data e.g. in validation
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| type | const | ✅ | NONE |  |  |
| data_type | string |  | string | "ChatConversation" | Generally, the data_type is automatically set based on the experiment config method |
Overrides¶
Override options that allow simple interfaces for charts using these configs
This is particularly useful for a helm chart interface where we include the finetuning package config as part of the values.yaml file. This gives a more flexible helm interface, with certain keys brought to the top level.
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| num_train_epochs | integer or number or null |  | number |  | Overrides the number of epochs in the training_args |
| lr_multiplier | number |  | number | 1.0 | Multiplier applied to the learning rate in the training_args |
| lr_batch_size_scaling | string |  | none or sqrt or linear | "none" | Scales the learning rate in the training_args by a factor derived from the total training batch size. none: no scaling. sqrt: multiplies the learning rate by the square root of the batch size (a classic scaling rule). linear: multiplies the learning rate by the batch size (a more modern scaling rule) |
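A sketch of an overrides block (YAML assumed). For example, with total_train_batch_size: 256 and a base learning rate of 1.0e-5 in the training_args, sqrt scaling would give 1.0e-5 * sqrt(256) = 1.6e-4; whether lr_multiplier is then applied on top is an assumption here:

```yaml
overrides:
  num_train_epochs: 3
  lr_multiplier: 0.5           # halve the configured learning rate
  lr_batch_size_scaling: sqrt  # classic sqrt scaling with batch size
```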
PeftType¶
Enum class for the different types of adapters in PEFT.
Supported PEFT types: PROMPT_TUNING, MULTITASK_PROMPT_TUNING, P_TUNING, PREFIX_TUNING, LORA, ADALORA, BOFT, ADAPTION_PROMPT, IA3, LOHA, LOKR, OFT, XLORA, POLY, LN_TUNING, VERA, FOURIERFT, HRA
Type: string
Possible Values: PROMPT_TUNING or MULTITASK_PROMPT_TUNING or P_TUNING or PREFIX_TUNING or LORA or ADALORA or BOFT or ADAPTION_PROMPT or IA3 or LOHA or LOKR or OFT or POLY or LN_TUNING or VERA or FOURIERFT or XLORA or HRA or VBLORA
PretrainedPeftConfig¶
PEFT adapter uses the config and initialisation from a pretrained adapter
Type: object
| Property | Type | Required | Possible values | Description |
|---|---|---|---|---|
| peft_type | const | ✅ | PRETRAINED_PEFT |  |
| name_or_path | string | ✅ | string | HF ID or path to the pretrained peft |
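A sketch of loading a pretrained adapter (YAML assumed; the adapter ID is hypothetical):

```yaml
peft_conf:
  peft_type: PRETRAINED_PEFT
  name_or_path: some-org/some-lora-adapter   # hypothetical HF ID or local path
```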
RunConfig¶
Experiment running configuration
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| model | string |  | string | "/local_resources/basemodel" | Local path to the model to be fine-tuned. Normally this should be /local_resources/basemodel |
| model_args | object |  | ModelArguments | {"torch_dtype": "auto", "device_map": "auto", "max_memory": null, "low_cpu_mem_usage": false, "attn_implementation": null, "offload_folder": null, "offload_state_dict": null, "offload_buffers": null, "use_cache": true, "cache_dir": null, "force_download": false, "local_files_only": false, "proxies": null, "resume_download": false, "revision": "main", "code_revision": "main", "subfolder": null, "token": null, "use_safetensors": null, "variant": null, "trust_remote_code": false} |  |
| tokenizer | string or null |  | string |  | Model HuggingFace ID, or path, or None to use the one associated with the model |
| use_fast_tokenizer | boolean |  | boolean | True | Use the fast version of the tokenizer. The 'slow' version may be compatible with more features |
| resume_from_checkpoint | boolean or string |  | boolean or string | False | Normally should be set to 'auto' to continue if a checkpoint exists. Can be set to True to always try to continue, False to never try, or a path to load from a specific path |
| final_checkpoint_name | string |  | string | "checkpoint-final" | Name of the final checkpoint. Should be left as the default |
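A sketch of a run_conf block (YAML assumed; values illustrative):

```yaml
run_conf:
  model: /local_resources/basemodel
  tokenizer: null               # use the tokenizer associated with the model
  use_fast_tokenizer: true
  resume_from_checkpoint: auto  # continue if a checkpoint exists
```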
SFTArguments¶
Supervised fine-tuning arguments
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| max_seq_length | integer |  | integer | 2048 | Maximum input sequence length. Longer sequences will be filtered out |
| save_name_if_new_basemodel | string |  | string | "checkpoint-new-basemodel" | If a new basemodel is saved, it will be saved with this name |
| train_on_completions_only | boolean |  | boolean | False | Only compute loss on the assistant's turns |
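A sketch of an sft_args block (YAML assumed; values illustrative):

```yaml
sft_args:
  max_seq_length: 4096             # sequences longer than this are filtered out
  train_on_completions_only: true  # loss only on the assistant's turns
```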
SilogenTrainingArguments¶
HuggingFace TrainingArguments as Config with additional SiloGen conventions
The full list of training arguments is best consulted online (the version documented here might not be up to date): https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments
The TrainingArguments object does a lot of things besides specifying the training configuration options (e.g. it has computed properties like the true training batch size, etc.).
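A sketch of a training_args block using standard HuggingFace TrainingArguments fields (YAML assumed; which fields the SiloGen restrictions disallow is not specified here, so treat this as illustrative only). Batch size fields are omitted since batchsize_conf handles them:

```yaml
training_args:
  num_train_epochs: 2
  learning_rate: 2.0e-5
  lr_scheduler_type: cosine
  warmup_ratio: 0.03
  logging_steps: 10
  save_strategy: epoch
```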
TaskType¶
Enum class for the different types of tasks supported by PEFT.
Overview of the supported task types:

- SEQ_CLS: Text classification.
- SEQ_2_SEQ_LM: Sequence-to-sequence language modeling.
- CAUSAL_LM: Causal language modeling.
- TOKEN_CLS: Token classification.
- QUESTION_ANS: Question answering.
- FEATURE_EXTRACTION: Feature extraction. Provides the hidden states which can be used as embeddings or features for downstream tasks.
Type: string
Possible Values: SEQ_CLS or SEQ_2_SEQ_LM or CAUSAL_LM or TOKEN_CLS or QUESTION_ANS or FEATURE_EXTRACTION
WeightedDatasetDefinition¶
Define a dataset, with a weight for sampling
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| path | string | ✅ | string |  | Local path to a JSONL file in the finetuning data format |
| sampling_weight | number |  | number | 1.0 |  |
WeightedMixDataInput¶
A list of datasets where each is sampled by a certain weight
These datasets are interleaved based on the sampling weights. The resulting dataset is fully precomputed, up to the point where every single sample in every dataset gets picked. This means that with small sampling weights, it can take many draws to see every sample from a dataset, and so the resulting dataset can be very large.
The datasets themselves need to be in the finetuning-supported JSONL formats. For SFT this means lines of:
`{"messages": {"content": "string", "role": "string"}}`
For DPO this means lines of:
`{"prompt_messages": {"content": "string", "role": "string"}, "chosen_messages": {"content": "string", "role": "string"}, "rejected_messages": {"content": "string", "role": "string"}}`
Type: object
| Property | Type | Required | Possible values | Default | Description |
|---|---|---|---|---|---|
| type | const | ✅ | PRECOMPUTE_WEIGHTED_MIX |  |  |
| datasets | array | ✅ | WeightedDatasetDefinition |  |  |
| data_type | string |  | string | "ChatConversation" | Generally, the data_type is automatically set based on the experiment config method |
| seed | integer |  | integer | 19851243 | Seed for the random number generator for interleaving draws |
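A sketch of a weighted mix (YAML assumed; paths hypothetical). Here the domain data is sampled with three times the weight of the general data:

```yaml
training_data:
  type: PRECOMPUTE_WEIGHTED_MIX
  seed: 1234
  datasets:
    - path: /data/general.jsonl
      sampling_weight: 1.0
    - path: /data/domain.jsonl
      sampling_weight: 3.0   # oversample the domain data
```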