Fine-tuning
Overview
LLMs are typically trained for generic use cases. Fine-tuning allows users to further train a model so it provides more accurate answers for a specific domain or use case. This process involves providing the model with a dataset of example inputs and outputs from that domain. LLM Labs simplifies this process by providing a user-friendly way to fine-tune and deploy open-source models, letting you tailor LLMs to your exact needs.
Fine-Tuning Providers
There are two main options available for fine-tuning:
Served by Datasaur
In this option, the fine-tuning process is fully managed and hosted by Datasaur. You don’t need to set up any infrastructure or manage cloud resources — Datasaur will handle everything for you.
Currently, we support the following providers and pre-trained model(s) for fine-tuning served by Datasaur:
Amazon Bedrock
Amazon Nova Micro
Amazon Nova Lite
Amazon Nova Pro
Amazon Titan Text G1 - Express
Amazon Titan Text G1 - Lite
Cohere Command
Cohere Command Light
Meta Llama 3.1 8B
Meta Llama 3.1 70B
Bring Your Own Key (BYOK)
In this option, the fine-tuning process runs in your own cloud account. This gives you full control over the environment — you can configure machine types, regions, and other resources as needed. BYOK is ideal for teams that require advanced customization.
Currently, we support fine-tuning with BYOK through Amazon SageMaker JumpStart for various models from Meta Llama*, AI Singapore, Qwen, and DeepSeek. Only instruct models are supported for fine-tuning at this time. If you’d like to fine-tune a model not yet available in LLM Labs, please contact our support team at [email protected]. BYOK models require additional prerequisites before you can fine-tune:
Your customer plan type should be Subscription, Enterprise Usage Credit, or Internal
Connect to the selected provider through Models catalog > Manage providers.
Ensure your account role for the selected provider has the necessary permissions to perform fine-tuning.
Create fine-tuned models
This section guides you through the process of fine-tuning your models in LLM Labs.
Step 1: Set up model
Navigate to the Models page.
On the Available tab, go to the Fine-tuned LLMs section, and click Create fine-tuned model.

Set up your fine-tuning job.

Name your fine-tuned model. This name uniquely identifies your fine-tuned model.
For AWS SageMaker, it will be used as the trainingJobId and must comply with AWS SageMaker’s naming requirements:
Uniqueness: Must be unique within the same AWS Region and AWS account.
Length: 1–63 characters.
Allowed characters: Letters (a–z, A–Z), numbers (0–9), and hyphens (-).
Format rules:
Must start and end with an alphanumeric character.
Cannot include spaces, underscores (_), or special characters other than hyphens.
(A quick way to check a name against these rules is sketched at the end of this step.)
Select a base model that you want to fine-tune. You can select either:
Pre-trained LLMs provided by Datasaur, or
Existing fine-tuned models
Choose a dataset. You can upload a .csv consisting of 2 columns, prompt and expected completion, or you can choose an existing dataset from the library. For the validation dataset, you have 3 options:
Split from selected dataset: Datasaur will split the uploaded dataset and use part of it as validation data. You will need to configure the validation size as a percentage.
Use new dataset: You will need to add a new dataset to use as validation.
None: Choose this option if you don't want to add a validation dataset.
Important: Validation data is required for some base models (e.g., Cohere Command, Cohere Command Light), while other models may not support validation data at all.
Configure your resources. This step applies to SageMaker only and determines the cost and performance of your training job.
Instance type: The machine used to fine-tune your model.
Instance count: The number of instances for training.
Training volume size: The storage capacity allocated to each instance.
For detailed recommendations, see the Choosing the Right Resources section.
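As referenced in the naming rules above, here is a small sketch for checking a candidate model name locally. This is an illustrative helper, not part of LLM Labs or the SageMaker SDK; the regex simply encodes the length and character rules listed in this step, and uniqueness within your AWS account and Region still has to be verified separately.

import re

# Illustrative check for the naming rules above:
# 1-63 characters, letters/digits/hyphens only, starting and ending with an alphanumeric character.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")

def is_valid_model_name(name):
    return bool(NAME_PATTERN.fullmatch(name))

print(is_valid_model_name("support-tickets-llama-3-1-8b"))  # True
print(is_valid_model_name("my_model"))                      # False: underscores are not allowed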
Step 2: Adjust hyperparameters
Next, you will need to configure the hyperparameters for your fine-tuning project.

The fundamental hyperparameters are epochs and learning rate.
Some models also support advanced hyperparameters, which Datasaur provides with default values. These parameters are always applied alongside the fundamental ones, but you can adjust them for further fine-tuning if desired.
The advanced hyperparameters include: early stopping callback (for fine-tuning with validation), batch size, and learning rate warm-up steps.
Optional Hyperparameters

If you wish to include additional hyperparameters, you can define them in Optional Hyperparameters using a JSON schema. For example, to enable LoRA, you can specify:
{
  "use_lora": true,
  "lora_config": {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05
  }
}
You cannot override the fundamental or advanced hyperparameters in Optional Hyperparameters.
If you partially specify a hyperparameter without providing its full configuration, the default values will be applied automatically. For example, if you configure use_lora without specifying lora_config, the default LoRA configuration will be:
{
  "lora_config": {
    "r": 8,
    "lora_alpha": 8,
    "lora_dropout": 0.0
  }
}
For more recommendations about hyperparameters, see the Common Fine-Tuning Hyperparameters section.
Step 3: Review job
The last step is to review your fine-tuning job before you start the process.

You can review the optional hyperparameters that you added by clicking View optional hyperparameters.

You can also view the predicted cost by clicking the View total predicted cost button in the Costs section. It will calculate and show the total predicted cost for starting the fine-tuning process.

Once you have reviewed the configuration, you will need to check the acknowledgement checkbox.
Lastly, click Start fine-tuning job and the training process will start.

Once the training process is complete, your model will be available to deploy.

Model management
Model status
There are 7 possible statuses for the fine-tuned models.
Training: The model is currently being trained on your dataset. This status indicates that the training process is in progress, and the model is learning from your data.

Training failed: The model training process has failed due to an error. This status indicates that the training process was interrupted, and you may need to investigate and resolve the issue.

Stopping training: The model training process is being stopped. This status indicates that someone has chosen to stop the training.

Training stopped: The model training process has been stopped. This status indicates that the training process has been successfully stopped, and you can’t resume the training once it has stopped.

Not deployed: The model has been trained but has not yet been deployed for use. You can deploy the model to use it in Sandbox.

Deploying: The model is being deployed for use. This status indicates that the deployment process is in progress, and the model will soon be available for use in Sandbox.

Deployed: The model has been successfully deployed. This status indicates that the model is now available for use in Sandbox, and you can start using it to generate predictions or responses.

Deploy models
To deploy a fine-tuned model:
Click Deploy model to start the deployment.

In the dialog that appears, specify the auto undeploy schedule.

Click Deploy model in the dialog to confirm and the process will start.

Once the process is finished, your model will be available for experimentation in Sandbox. Learn more about Sandbox

Undeploy models
Click the more menu (three-dots) in the right corner of the model card and select Undeploy.

Confirm the process by clicking Undeploy in the dialog that appears.

Your model will be undeployed, and you will no longer be charged the hourly cost.
View model details
To view the model details, click the more menu (three-dots) and select View details. The details of the fine-tuned model will be shown.

In this dialog, you can view the dataset, validation dataset, models used, hyperparameter configurations, the creator, and storage cost information.


Download models

You can use your fine-tuned model for further tasks by downloading it. The model can then be loaded in your local environment for inference or additional fine-tuning, uploaded to your own SageMaker or other cloud environment, or backed up for compliance or reproducibility purposes.
When you initiate a download, the selected fine-tuned model is exported as a .tar.gz archive that contains:
Model weights: The core parameters of the fine-tuned model.
Configuration files: Metadata needed to load the model correctly (e.g., tokenizer config, architecture).
Optional adapters: If the model includes LoRA or other PEFT adapters, these are included as separate files.
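As an example, here is a minimal sketch of loading the downloaded archive locally with Hugging Face Transformers. The archive name and extraction path are assumptions, not the exact export format; if the export contains unmerged LoRA/PEFT adapters, you would load those with a PEFT-compatible library instead of loading the directory directly.

import tarfile
from transformers import AutoModelForCausalLM, AutoTokenizer

# Extract the downloaded archive (the file name below is an assumption; use your actual download).
with tarfile.open("my-fine-tuned-model.tar.gz") as archive:
    archive.extractall("my-fine-tuned-model")

# Load the fine-tuned weights and tokenizer from the extracted directory.
tokenizer = AutoTokenizer.from_pretrained("my-fine-tuned-model")
model = AutoModelForCausalLM.from_pretrained("my-fine-tuned-model")

# Quick local inference check.
inputs = tokenizer("Hello! How can I help you today?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))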
Delete models
To delete a fine-tuned model, click the more menu (three-dots) and select Delete.

In the dialog that appears, check the acknowledgement checkbox, and confirm the deletion by clicking Delete model.

Use in Sandbox
Once a fine-tuned model is deployed, it will be available in the Sandbox for further experimentation and testing. This allows you to integrate and test the specialized model within your specific applications. Learn more about Sandbox.

Choosing the Right Resources
Every fine-tuning job has different resource requirements depending on model size, dataset size, and training configuration (e.g., batch size, sequence length, LoRA rank). Each instance differs in GPU memory (VRAM), number of GPUs, and storage capacity, which affect training performance, duration, and cost.
Example:
ml.g5.4xlarge — 1 GPU, 16 GB VRAM, 125 GB storage → not sufficient for a 7B model
ml.g5.12xlarge — 4 GPUs, 64 GB VRAM (16 GB per GPU), 375 GB storage
For detailed information, see AWS documentation.
Selecting the right instance ensures stable training, optimal performance, and cost efficiency. To do so, first estimate the fine-tuning requirements (VRAM, storage, and instance count) before choosing the most suitable instance.
VRAM (GPU Memory)
VRAM determines how large your model and training batch size can be. If VRAM is too small, your training may fail or become extremely slow due to gradient checkpointing or offloading.
Main factors affecting VRAM:
Model size (parameters) — Larger models require more memory to store weights.
Precision (FP32, FP16, BF16, FP8, etc.) — Lower precision reduces VRAM usage, but might have tradeoffs with accuracy.
Optimizer states — Extra memory for momentum, variance, etc. (e.g., Adam/AdamW).
Activations — Temporary memory for forward/backward pass.
Use of LoRA / PEFT — Fine-tunes only a subset of weights, so it will reduce memory usage significantly.
Activations are not included in this simplified estimation, as they are usually small compared to weights, gradients, and optimizer states for typical batch sizes.
Step 1: Identify model size
Model has 7B parameters.
This is the baseline factor for VRAM calculation.
Step 2: Apply precision
Precision determines how much memory each parameter uses.
Common precisions:
Precision | Bytes per parameter | Memory effect
FP32 (default) | 4 | Full memory
FP16 | 2 | Half the memory of FP32 (×2 reduction)
BF16 | 2 | Half the memory of FP32
Example:
VRAM = 7B parameters × 2 bytes (FP16) = 14 GB
Step 3: Add gradients
Each weight requires a gradient of the same precision.
Memory for gradients is approximately the same as weights, so total VRAM doubles compared to Step 2.
Example:
VRAM = 14 GB (weights) + 14 GB (gradients) = 28 GB
Step 4: Add optimizer states
Optimizers like Adam/AdamW store additional memory for momentum and variance (~2× weights). These extra tensors require additional memory for every trainable parameter.
Optimizer | Extra tensors | Approximate multiplier
Adam / AdamW | 2 (momentum + variance) | 2× weights
Adafactor | 1 (factored variance) | 1× weights (can be less in some implementations)
SGD | 0–1 (momentum) | 0–1× weights
Example (AdamW):
VRAM = 28 GB (weights + gradients) + 28 GB (optimizer states) = 56 GB
Step 5: Adjust for training method (full vs LoRA)
For full fine-tuning, all weights, gradients, and optimizer states are used.
Example:
VRAM = 56 GB (previous calculation)
LoRA fine-tunes only a small fraction of model parameters (approximately 10–20%) instead of all of them. This reduces the memory needed for gradients and optimizer states in proportion to that fraction, but the full model weights still need to be loaded for forward computations.
Example (Using 20% as fraction):
VRAM = Weights + (Gradients + Optimizer) × Fraction of trainable parameters
= 14 GB + ((14 GB + 28 GB) × 20%) = 14 GB + 8.4 GB = 22.4 GB
Please note that these are rough estimates. Actual VRAM usage can vary depending on factors such as batch size, sequence length, precision, gradient checkpointing, and framework overhead, which can increase memory requirements. At the same time, certain techniques, like LoRA, reduce memory usage for gradients and optimizer states in proportion to the fraction of weights being trained, although the full model weights still need to be loaded.
It is recommended to always leave some extra GPU memory as a buffer to account for unexpected spikes in memory usage, framework overhead, or temporary allocations during training, which helps prevent out-of-memory errors and ensures stable execution.
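Putting the steps above together, the same back-of-the-envelope estimate can be sketched as a small helper. This is a rough approximation only; as noted above, it ignores activations, batch size, sequence length, and framework overhead, and the function is illustrative rather than part of LLM Labs.

# Rough VRAM estimate in GB, following Steps 1-5 above (illustrative helper).
def estimate_vram_gb(params_billion, bytes_per_param=2, optimizer_multiplier=2.0, trainable_fraction=1.0):
    # bytes_per_param: 4 for FP32, 2 for FP16/BF16
    # optimizer_multiplier: ~2 for Adam/AdamW, ~1 for Adafactor, 0-1 for SGD
    # trainable_fraction: 1.0 for full fine-tuning, e.g. 0.2 for LoRA training ~20% of weights
    weights = params_billion * bytes_per_param            # 1B params x 1 byte ~= 1 GB
    gradients = weights                                   # same precision as the weights
    optimizer_states = weights * optimizer_multiplier
    return weights + (gradients + optimizer_states) * trainable_fraction

print(estimate_vram_gb(7))                              # full fine-tuning of a 7B model: ~56 GB
print(estimate_vram_gb(7, trainable_fraction=0.2))      # LoRA with ~20% trainable: ~22.4 GB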
Training Volume Size
Storage determines how much space you have for the model checkpoints, intermediate artifacts, and dataset.
Step 1: Base model storage
This is the size of the pre-trained weights. Most pre-trained models are stored in FP16 or compressed format.
Example:
7B model (FP16) ≈ 14 GB
Step 2: Checkpoints storage
During training, checkpoints are periodically saved to allow recovery or evaluation.
Each checkpoint contains model weights + optimizer states + metadata.
Typical checkpoint storage ≈ 2–3× the base model size, because:
Optimizer states (AdamW, Adafactor) require extra space (~1–2× weights)
Metadata and temporary artifacts add minor overhead
Example (1x checkpoint):
7B model (FP16) → 14 GB × 2–3 ≈ 28–42 GB
Step 3: Dataset size
Include the raw dataset size (CSV, JSON, etc.).
Example: 1 GB dataset adds 1 GB to volume
Step 4: Recommended volume size
Storage Size ≈ Base model + Checkpoints + Dataset
Example:
14 GB (weights) + 30 GB (checkpoints) + 1 GB (dataset) ≈ 45 GB
Recommendation: Round up (e.g., 50 GB) to allow buffer for temporary files and unexpected artifacts.
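The storage estimate can be sketched the same way. This helper is illustrative only; the checkpoint multiplier and buffer below are assumptions and depend on how often checkpoints are saved.

import math

# Rough training-volume estimate in GB, following Steps 1-4 above (illustrative helper).
def estimate_volume_gb(base_model_gb, dataset_gb, checkpoint_multiplier=2.5, buffer_gb=10):
    # checkpoint_multiplier: checkpoints are typically ~2-3x the base model size
    # buffer_gb: headroom for temporary files and unexpected artifacts
    checkpoints = base_model_gb * checkpoint_multiplier
    return math.ceil(base_model_gb + checkpoints + dataset_gb + buffer_gb)

print(estimate_volume_gb(base_model_gb=14, dataset_gb=1))  # 60 GB: roughly the ~45 GB estimate above plus buffer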
Instance Count
Instance count determines how many machines are used for distributed training.
Single-instance (1) — Simpler setup, lower cost, but slower for large models.
Multiple instances (≥2) — Distributes model and data across machines for parallel training, but requires more setup time and inter-node communication.
When to increase instance count:
Model is too large to fit in one GPU (e.g., 70B+)
Dataset is massive and you want to reduce training time
While multi-GPU and multi-instance setups help handle large models, beware of the overhead (time, VRAM, and compute efficiency).
Common Fine-Tuning Hyperparameters
Hyperparameters control how the model learns and affect training stability, performance, and final model quality. Choosing appropriate values balances training efficiency, accuracy, and resource cost. Common fine-tuning hyperparameters include:
Fundamental Hyperparameters
Epochs (num_train_epochs)
Number of times the model sees the full dataset. More epochs improve learning but increase training time and risk overfitting.
Learning rate
Step size for optimizer updates. High learning rates can cause unstable training, low rates may result in slow convergence or underfitting.
Advanced Hyperparameters
Batch size (per_device_train_batch_size)
Number of examples per batch before updating model parameters. Larger batch sizes increase VRAM usage but improve gradient estimation stability.
Learning rate warmup steps
Gradually increases learning rate at the start of training. Helps stabilize training and prevents early divergence.
Early stopping callback
Stops training if validation metrics do not improve after a set number of steps. Helps prevent overfitting and saves compute cost. Only necessary if training uses validation data.
early_stopping_patience determines the number of evaluation calls to wait before stopping training when the metric stops improving.
early_stopping_threshold determines the minimum improvement required to consider the metric as "improving".
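To illustrate how these two parameters interact, here is a minimal sketch of the stopping logic. It is illustrative only; the actual callback is applied by the training framework, and this example assumes a loss-style metric where lower is better.

def should_stop(eval_losses, patience=3, threshold=0.01):
    # Stop once `patience` consecutive evaluations fail to improve the best loss by more than `threshold`.
    best = float("inf")
    evaluations_without_improvement = 0
    for loss in eval_losses:
        if best - loss > threshold:
            best = loss
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
            if evaluations_without_improvement >= patience:
                return True
    return False

print(should_stop([2.0, 1.5, 1.499, 1.498, 1.497]))  # True: three evaluations in a row without improvement > 0.01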
Optional Hyperparameters
LoRA Configuration
Fine-tunes only a subset of model parameters, reducing memory and optimizer state requirements.
Key LoRA parameters and their effects:
Parameter | Description | Example / Effect
r | Rank of the low-rank matrices used for adaptation | Higher r can improve learning but uses more memory
alpha | Scaling factor applied to the LoRA updates | alpha = 32 → higher values amplify LoRA weight updates; often scaled relative to r
dropout | Dropout applied to LoRA layers | dropout = 0.05 → helps prevent overfitting by randomly zeroing some LoRA updates
target_modules | List of model layers to apply LoRA to | ["q_proj", "v_proj"] → only these attention projection layers are adapted; targeting more layers increases capacity and memory use
fan_in_fan_out | Whether LoRA weights are transposed for some model types | false for most standard models; needed for certain architectures
bias | How bias terms are handled in LoRA layers | "none", "all", "lora_only"; usually "none" for simplicity
Example:
{
  "use_lora": true,
  "lora_config": {
    "r": 8,
    "alpha": 8,
    "dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
    "fan_in_fan_out": false,
    "bias": "none"
  }
}
Merge Adapters
Determines whether the LoRA adapters are merged into the base model weights after fine-tuning.
If true, adapters are merged into the main model, so the model can be used independently without loading separate LoRA weights.
If false, adapters remain separate, allowing further fine-tuning or swapping different adapters.
Example:
{ "use_lora": true, "merge_adapters": true }
Logging Strategy
Frequency of recording training metrics, such as training loss and accuracy, to provide more insight; however, it may slightly slow down training.
Options: epoch, steps, or none. If using steps, logging_steps must also be set.
Example:
"sft_config": { "logging_strategy": "steps", "logging_steps": 50 }
Evaluation strategy
Determines when validation is performed (after every epoch or after a set number of steps).
Options: epoch, steps, or none. If using steps, eval_steps must be specified (or it defaults to logging_steps). If this differs from the logging strategy, evaluation logs may appear at a different interval.
Example:
"sft_config": { "eval_strategy": "steps", }
Mixed Precision Training
Controls the numerical precision used during training to balance memory usage and training speed.
Options:
Option | Description | Note
fp32 (32-bit floating point) | Highest precision but uses the most memory | Used when both fp16 and bf16 are disabled
fp16 (16-bit floating point) | ~50% memory reduction, faster training | Used when explicitly enabled
bf16 (Brain Float 16-bit) | Better numerical stability than fp16; requires Ampere+ GPUs | Used when explicitly enabled or when using FlashAttention2
Example:
"sft_config": { "fp16": "false", "bf16": "true, }
Attention Implementation
Controls which attention algorithm to use during training. Different implementations offer varying trade-offs between memory usage, speed, and compatibility.
Use use_cache: false for training to save memory
Options:
Attention Type | Description | Precision | Training Speed
eager | Universal compatibility but slowest and uses the most memory | fp16/fp32 | 1x (baseline)
sdpa | Good balance of speed and compatibility | fp16/fp32 | 1.5x
flash_attention_2 (default) | Fastest and most memory-efficient, but forces bf16 precision | Forced bf16 | 3x
Example:
{ "model_init_config": { "attn_implementation": "flash_attention_2", "use_cache": false }, }
Gradient Accumulation
Number of batches to accumulate before a backward pass.
Useful when the desired batch size exceeds GPU memory: it allows training with a larger effective batch size (per_device_train_batch_size × gradient_accumulation_steps × number of devices) without running out of VRAM.
Example:
"sft_config": { "gradient_accumulation_steps": 4 }