Fine-tuning

Overview

LLMs are typically trained for generic use cases. Fine-tuning allows users to further train a model to provide more accurate answers for a specific domain or use case. This process involves providing the model with a dataset of example inputs and outputs from that domain. LLM Labs simplifies this process by providing a user-friendly way to fine-tune and deploy open-source models, allowing you to tailor LLMs to your exact needs.

Fine-Tuning Providers

There are two main options available for fine-tuning:

  1. Served by Datasaur

    In this option, the fine-tuning process is fully managed and hosted by Datasaur. You don’t need to set up any infrastructure or manage cloud resources — Datasaur will handle everything for you.

    Currently, we support the following providers and pre-trained model(s) for fine-tuning served by Datasaur:

    1. Amazon Bedrock

      1. Amazon Nova Micro

      2. Amazon Nova Lite

      3. Amazon Nova Pro

      4. Amazon Titan Text G1 - Express

      5. Amazon Titan Text G1 - Lite

      6. Cohere Command

      7. Cohere Command Light

      8. Meta Llama 3.1 8B

      9. Meta Llama 3.1 70B

  2. Bring Your Own Key (BYOK)

    In this option, the fine-tuning process runs in your own cloud account. This gives you full control over the environment — you can configure machine types, regions, and other resources as needed. BYOK is ideal for teams that require advanced customization.

    Currently, we support fine-tuning with BYOK through Amazon SageMaker JumpStart for various models from Meta Llama*, AI Singapore, Qwen, and DeepSeek. Only instruct models are supported for fine-tuning at this time. If you’d like to fine-tune a model not yet available in LLM Labs, please contact our support team at [email protected]. BYOK models require additional prerequisites before you can fine-tune:

    • Your customer plan type should be Subscription, Enterprise Usage Credit, or Internal

    • Connect to the selected provider through Models catalog > Manage providers.

    • Ensure that the account role for the selected provider has the necessary permissions to perform fine-tuning.

*) Under the hood, AWS SageMaker fine-tuning uses the Hugging Face framework. Some base models, such as Meta Llama, are gated models — meaning you must log in to your Hugging Face account and accept the model’s license terms before using them.

To fine-tune a gated model, you must store your Hugging Face access token (hf_token) as a secret key in your AWS account. Ensure that your AWS account has the necessary permissions to access this secret key.

🔒 Datasaur does not have access to your secret key value — it remains fully secured within your AWS environment.
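
For example, the token can be stored with boto3. This is a minimal sketch, not Datasaur tooling; the secret name and region shown here are assumptions, so use whatever your setup expects:

import boto3

# Store the Hugging Face access token in AWS Secrets Manager.
# The secret name "hf_token" follows the convention above; adjust the
# region to match your SageMaker setup.
client = boto3.client("secretsmanager", region_name="us-east-1")
client.create_secret(
    Name="hf_token",
    SecretString="hf_xxxxxxxxxxxx",  # replace with your actual token
)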

Create fine-tuned models

This section guides you through the process of fine-tuning your models in LLM Labs.

Step 1: Set up model

  1. Navigate to the Models page.

  2. On the Available tab, go to the Fine-tuned LLMs section, and click Create fine-tuned model.

    My models
  3. Set up your fine-tuning job.

    1. Name your fine-tuned model. This name uniquely identifies your fine-tuned model.

      1. For AWS SageMaker, it will be used as the trainingJobId and must comply with AWS SageMaker’s naming requirements:

        1. Uniqueness: Must be unique within the same AWS Region and AWS account.

        2. Length: 1–63 characters.

        3. Allowed characters: Letters (a–z, A–Z), numbers (0–9), and hyphens (-).

        4. Format rules:

          1. Must start and end with an alphanumeric character.

          2. Cannot include spaces, underscores (_), or special characters other than hyphens.

    2. Select a base model that you want to fine-tune. You can select either:

      1. Pre-trained LLMs provided by Datasaur, or

      2. Existing fine-tuned models.

    3. Choose a dataset. You can upload a .csv file with two columns, prompt and expected completion (see the sample after this list), or choose an existing dataset from the library. For the validation dataset, you have 3 options:

      1. Split from selected dataset: Datasaur will split the uploaded dataset and use it for validation data. You will need to configure the validation size using a percentage.

        Split from selected dataset
      2. Use new dataset: You will need to add a new dataset to use as validation.

        Use new dataset
      3. None: Choose this option if you don't want to add a validation dataset.

  4. Configure your resources. This step applies to SageMaker only and determines the cost and performance of your training job.

    1. Instance type: The machine used to fine-tune your model.

    2. Instance count: The number of instances for training.

    3. Training volume size: The storage capacity allocated to each instance.

    For detailed recommendations, see the Choosing the Right Resources section.
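
For reference, a minimal training dataset might look like the following .csv. The two columns follow the prompt and expected completion format described above; the rows are illustrative only:

prompt,expected completion
"What is your return policy?","Items can be returned within 30 days of purchase with a valid receipt."
"How do I reset my password?","Click Forgot password on the login page and follow the emailed instructions."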

Step 2: Adjust hyperparameters

Next, you will need to configure the hyperparameters for your fine-tuning job.

Adjust hyperparameters

The fundamental hyperparameters are epochs and learning rate.

Some models also support advanced hyperparameters, which Datasaur provides with default values. These parameters are always applied alongside the fundamental ones, but you can adjust them to further tune the training process if desired.

The advanced hyperparameters include: early stopping callback (for fine-tuning with validation), batch size, and learning rate warmup steps.

Optional Hyperparameters

This section applies to AWS SageMaker JumpStart Bring Your Own Key (BYOK) fine-tuning only.

If you wish to include additional hyperparameters, you can define them in Optional Hyperparameters as a JSON object. For example, to enable LoRA, you can specify:

{
  "use_lora": true,
  "lora_config": {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05
  }
}

You cannot override the fundamental or advanced hyperparameters in Optional Hyperparameters.

If you partially specify a hyperparameter without providing its full configuration, default values will be applied automatically. For example, if you set use_lora without specifying lora_config, the default LoRA configuration will be:

{
  "lora_config": {
    "r": 8,
    "lora_alpha": 8,
    "lora_dropout": 0.0
  }
}

For more recommendations about hyperparameters, see the Common Fine-Tuning Hyperparameters section.

Step 3: Review job

  1. The last step is to review your fine-tuning job before you start the process.

    Review job
  2. You can review the optional hyperparameters that you added by clicking View optional hyperparameters.

    Review job
  3. You can also view the predicted cost by clicking the View total predicted cost button in the Costs section. It will calculate and show the total predicted cost of the fine-tuning process.

    Review job

Please note that this is just a cost prediction. The final cost may be higher or lower, as each model has its own tokenizer and the actual token count of your dataset may differ from the estimate.

  4. Once you have reviewed the configuration, you will need to check the acknowledgement checkbox.

  5. Lastly, click Start fine-tuning job and the training process will start.

    Training process

It will take several hours for the training process to be completed. Datasaur will notify you by email when the training process is complete.

  6. Once the training process is complete, your model will be available to deploy.

    My models page

Model management

Model status

There are 7 possible statuses for the fine-tuned models.

  1. Training: The model is currently being trained on your dataset. This status indicates that the training process is in progress, and the model is learning from your data.

    Training status
  2. Training failed: The model training process has failed due to an error. This status indicates that the training process was interrupted, and you may need to investigate and resolve the issue.

    Training failed
  3. Stopping training: The model training process is being stopped. This status indicates that someone has chosen to stop the training.

    Stopping training
  4. Training stopped: The model training process has been stopped. This status indicates that the training process has been successfully stopped, and you can’t resume the training once it has stopped.

    Training stopped
  5. Not deployed: The model has been trained but has not yet been deployed for use. You can deploy the model to use it in Sandbox.

    Not deployed
  6. Deploying: The model is being deployed for use. This status indicates that the deployment process is in progress, and the model will soon be available for use in Sandbox.

    Deploying
  7. Deployed: The model has been successfully deployed. This status indicates that the model is now available for use in Sandbox, and you can start using it to generate predictions or responses.

    Deployed

Deploy models

To deploy a fine-tuned model:

  1. Click Deploy model to start the deployment.

  2. In the dialog that appears, specify the auto undeploy schedule.

  3. Click Deploy model in the dialog to confirm and the process will start.

    Deploying process
  4. Once the process finishes, your model will be available for experimentation in the Sandbox. Learn more about Sandbox.

    Model deployed

Undeploy models

  1. Click the more menu (three-dots) in the right corner of the model card and select Undeploy.

    Undeploy model
  2. Confirm the process by clicking Undeploy in the dialog that appears.

  3. Your model will be undeployed and you will no longer be charged for the hourly cost.

View model details

To view the model details, click the more menu (three-dots) and select View details. The details of the fine-tuned model will be shown.

View details menu

In this dialog, you can view the dataset, validation dataset, models used, hyperparameter configurations, the creator, and storage cost information.

View details menu

Download models

This section applies to AWS SageMaker JumpStart Bring Your Own Key (BYOK) fine-tuning only.

Download model

You can use your fine-tuned model for further tasks by downloading it. The model can then be loaded in your local environment for inference or additional fine-tuning, uploaded to your own SageMaker or other cloud environment, or backed up for compliance or reproducibility purposes.

When you initiate a download, the selected fine-tuned model is exported as a .tar.gz archive that contains:

  • Model weights: The core parameters of the fine-tuned model.

  • Configuration files: Metadata needed to load the model correctly (e.g., tokenizer config, architecture).

  • Optional adapters: If the model includes LoRA or other PEFT adapters, these are included as separate files.

Delete models

To delete a fine-tuned model, click the more menu (three-dots) and select Delete.

Delete model

In the dialog that appears, check the acknowledgement checkbox, and confirm the deletion by clicking Delete model.

Use in Sandbox

Once a fine-tuned model is deployed, it will be available in the Sandbox for further experimentation and testing. This allows you to integrate and test the specialized model within your specific applications. Learn more about Sandbox.

Access model via Sandbox

Choosing the Right Resources

This section applies to Bring Your Own Key (BYOK) fine-tuning only.

Every fine-tuning job has different resource requirements depending on model size, dataset size, and training configuration (e.g., batch size, sequence length, LoRA rank). Each instance differs in GPU memory (VRAM), number of GPUs, and storage capacity, which affect training performance, duration, and cost.

Example:

  • ml.g5.4xlarge — 1 GPU, 24 GB VRAM, 125 GB storage → not sufficient for full fine-tuning of a 7B model

  • ml.g5.12xlarge — 4 GPUs, 96 GB VRAM (24 GB per GPU), 375 GB storage

For detailed information, see AWS documentation.

Selecting the right instance ensures stable training, optimal performance, and cost efficiency. To choose one, first understand your fine-tuning requirements, then work through the VRAM, storage, and instance count calculations below.

VRAM (GPU Memory)

VRAM determines how large your model and training batch size can be. If VRAM is too small, your training may fail or become extremely slow due to gradient checkpointing or offloading.

Main factors affecting VRAM:

  • Model size (parameters) — Larger models require more memory to store weights.

  • Precision (FP32, FP16, BF16, FP8, etc.) — Lower precision reduces VRAM usage, but might have tradeoffs with accuracy.

  • Optimizer states — Extra memory for momentum, variance, etc. (e.g., Adam/AdamW).

  • Activations — Temporary memory for forward/backward pass.

  • Use of LoRA / PEFT — Fine-tunes only a subset of weights, which reduces memory usage significantly.

Step 1: Identify model size

  • Model has 7B parameters.

  • This is the baseline factor for VRAM calculation.

Step 2: Apply precision

  • Precision determines how much memory each parameter uses.

  • Common precisions:

    • FP32 (default): 4 bytes per parameter (full memory)

    • FP16: 2 bytes per parameter (half the memory of FP32)

    • BF16: 2 bytes per parameter (half the memory of FP32)

  • Example:

    • VRAM = 7B parameters × 2 bytes (FP16) = 14 GB

Step 3: Add gradients

  • Each weight requires a gradient of the same precision.

  • Memory for gradients is approximately the same as weights, so total VRAM doubles compared to Step 2.

  • Example:

    • VRAM = 14 GB (weights) + 14 GB (gradients) = 28 GB

Step 4: Add optimizer states

  • Optimizers like Adam/AdamW keep additional state for momentum and variance (~2× weights). These extra tensors require additional memory for every trainable parameter.

    • Adam / AdamW: 2 extra tensors (momentum + variance) ≈ 2× weights

    • Adafactor: 1 extra tensor (factored variance) ≈ 1× weights (can be less in some implementations)

    • SGD: 0–1 extra tensors (momentum) ≈ 0–1× weights

  • Example (AdamW):

    • VRAM = 28 GB (weights + gradients) + 28 GB (optimizer) = 56 GB

Step 5: Adjust for training method (full vs LoRA)

  • For full fine-tuning, all weights, gradients, and optimizer states are used.

    • Example:

      • VRAM = 56 GB (previous calculation)

  • LoRA fine-tunes only a small fraction of model parameters (typically around 10–20%) instead of all of them. This reduces memory for gradients and optimizer states in proportion to that fraction, but the full model weights still need to be loaded for forward computation.

    • Example (using 20% as the trainable fraction):

      • VRAM = Weights + (Gradients + Optimizer) × fraction of trainable parameters

      • 14 GB + ((14 GB + 28 GB) × 20%) = 14 GB + 8.4 GB = 22.4 GB

Please note that these are rough estimates. Actual VRAM usage can vary depending on factors such as batch size, sequence length, precision, gradient checkpointing, and framework overhead, which can increase memory requirements. At the same time, certain techniques, like LoRA, reduce memory usage for gradients and optimizer states in proportion to the fraction of weights being trained, although the full model weights still need to be loaded.

It is recommended to always leave some extra GPU memory as a buffer to account for unexpected spikes in memory usage, framework overhead, or temporary allocations during training, which helps prevent out-of-memory errors and ensures stable execution.
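
The five steps above can be condensed into a short script. This is a rough, illustrative estimator only; the function name and default values are ours, not part of LLM Labs:

# Rough VRAM estimate following Steps 1-5 above (all figures in GB).
def estimate_vram_gb(params_b, bytes_per_param=2, optimizer_multiplier=2.0,
                     lora_fraction=None):
    # params_b: model size in billions of parameters (e.g., 7 for a 7B model)
    # bytes_per_param: 4 for FP32, 2 for FP16/BF16
    # optimizer_multiplier: ~2 for Adam/AdamW, ~1 for Adafactor, 0-1 for SGD
    # lora_fraction: trainable fraction for LoRA (e.g., 0.2); None = full fine-tuning
    weights = params_b * bytes_per_param        # Step 2: model weights
    gradients = weights                         # Step 3: gradients match weights
    optimizer = weights * optimizer_multiplier  # Step 4: optimizer states
    if lora_fraction is None:
        return weights + gradients + optimizer  # Step 5: full fine-tuning
    # Step 5 (LoRA): full weights are still loaded, but gradients and
    # optimizer states scale with the trainable fraction.
    return weights + (gradients + optimizer) * lora_fraction

print(estimate_vram_gb(7))                     # 56.0 GB (full fine-tuning, FP16, AdamW)
print(estimate_vram_gb(7, lora_fraction=0.2))  # 22.4 GB (LoRA at 20%)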

Training Volume Size

Storage determines how much space you have for the model checkpoints, intermediate artifacts, and dataset.

Step 1: Base model storage

  • This is the size of the pre-trained weights. Most pre-trained models are stored in FP16 or compressed format.

  • Example:

    • 7B model (FP16) ≈ 14 GB

If stored in FP32, size roughly doubles (~28 GB).

Step 2: Checkpoints storage

  • During training, checkpoints are periodically saved to allow recovery or evaluation.

  • Each checkpoint contains model weights + optimizer states + metadata.

  • Typical checkpoint storage ≈ 2–3× the base model size, because:

    • Optimizer states (AdamW, Adafactor) require extra memory (~1–2× weights)

    • Metadata and temporary artifacts add minor overhead

  • Example (1x checkpoint):

    • 7B model (FP16) → 14 GB × 2–3 ≈ 28–42 GB

More frequent checkpoints or larger optimizers increase storage requirements.

Step 3: Dataset size

  • Include the raw dataset size (CSV, JSON, etc.).

  • Example: 1 GB dataset adds 1 GB to volume

  • Storage Size ≈ Base model + Checkpoints + Dataset

  • Example:

    • 14 GB (weights) + 30 GB (checkpoints) + 1 GB (dataset) ≈ 45 GB

  • Recommendation: Round up (e.g., 50 GB) to allow buffer for temporary files and unexpected artifacts.
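
The same back-of-the-envelope approach works for storage. Again, this is illustrative only; the checkpoint multiplier below is an assumption within the 2–3× range above:

# Rough training volume estimate following Steps 1-3 above (all figures in GB).
def estimate_storage_gb(params_b, bytes_per_param=2,
                        checkpoint_multiplier=2.5, dataset_gb=1.0):
    base_model = params_b * bytes_per_param           # Step 1: pre-trained weights
    checkpoints = base_model * checkpoint_multiplier  # Step 2: ~2-3x the base model
    return base_model + checkpoints + dataset_gb      # Step 3: add the dataset

print(estimate_storage_gb(7))  # 14 + 35 + 1 = 50.0 GB -> provision ~50 GB or more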

Instance Count

Instance count determines how many machines are used for distributed training.

  • Single-instance (1) — Simpler setup, lower cost, but slower for large models.

  • Multiple instances (≥2) — Distributes model and data across machines for parallel training, but requires more setup time and inter-node communication.

When to increase instance count:

  • Model is too large to fit in one GPU (e.g., 70B+)

  • Dataset is massive and you want to reduce training time

Common Fine-Tuning Hyperparameters

Hyperparameters control how the model learns and affect training stability, performance, and final model quality. Choosing appropriate values balances training efficiency, accuracy, and resource cost. Common fine-tuning hyperparameters include:

Fundamental Hyperparameters

  • Epochs (num_train_epochs)

    • Number of times the model sees the full dataset. More epochs improve learning but increase training time and risk overfitting.

  • Learning rate

    • Step size for optimizer updates. High learning rates can cause unstable training, low rates may result in slow convergence or underfitting.

Advanced Hyperparameters

  • Batch size (per_device_train_batch_size)

    • Number of examples per batch before updating model parameters. Larger batch sizes increase VRAM usage but improve gradient estimation stability.

  • Learning rate warmup steps

    • Gradually increases learning rate at the start of training. Helps stabilize training and prevents early divergence.

  • Early stopping callback

    • Stops training if validation metrics do not improve after a set number of steps. Helps prevent overfitting and saves compute cost. Only necessary if training uses validation data.

    • early_stopping_patience determines the number of evaluation calls to wait before stopping training when the metric stops improving.

    • early_stopping_threshold determines the minimum improvement required to consider the metric as "improving".

Optional Hyperparameters

  • LoRA Configuration

    • Fine-tunes only a subset of model parameters, reducing memory and optimizer state requirements.

    • Key LoRA parameters and their effects:

      • r: Rank of the low-rank matrices used for adaptation. Higher r can improve learning but uses more memory.

      • lora_alpha: Scaling factor applied to the LoRA updates. Higher values (e.g., lora_alpha = 32) amplify LoRA weight updates; often scaled relative to r.

      • lora_dropout: Dropout applied to LoRA layers. lora_dropout = 0.05 helps prevent overfitting by randomly zeroing some LoRA updates.

      • target_modules: List of model layers to apply LoRA to. ["q_proj", "v_proj"] adapts only these attention projection layers; targeting more layers increases capacity and memory use.

      • fan_in_fan_out: Whether LoRA weights are transposed for some model types. false for most standard models; needed for certain architectures.

      • bias: How bias terms are handled in LoRA layers. Options are "none", "all", or "lora_only"; usually "none" for simplicity.

    • Example:

      {
        "use_lora": true,
        "lora_config": {
          "r": 8,
          "alpha": 8,
          "dropout": 0.05,
          "target_modules": ["q_proj", "v_proj"],
          "fan_in_fan_out": false,
          "bias": "none"
        }
      }
  • Merge Adapters

    • Determines whether the LoRA adapters are merged into the base model weights after fine-tuning.

    • If true, adapters are merged into the main model, so the model can be used independently without loading separate LoRA weights.

    • If false, adapters remain separate, allowing further fine-tuning or swapping in different adapters.

    • Example:

      {
        "use_lora": true,
        "merge_adapters": true
      }
  • Logging Strategy

    • Frequency of recording training metrics, such as training loss and accuracy, to provide more insight into training progress. Frequent logging may slightly slow down training.

    • Options: epoch, steps, or none. If using steps, logging_steps must also be set.

    • Example:

      "sft_config": {
        "logging_strategy": "steps",
        "logging_steps": 50,
      }
  • Evaluation strategy

    • Determines when validation is performed (after every epoch or after a set number of steps).

    • Options: epoch, steps, or none. If using steps, eval_steps must be specified (or defaults to logging_steps).

    • If different from logging strategy, evaluation logs may appear at a different interval.

    • Example:

      "sft_config": {
        "eval_strategy": "steps",
      }
  • Mixed Precision Training

    • Controls the numerical precision used during training to balance memory usage and training speed.

    • Options:

      • fp32 (32-bit floating point): Highest precision but uses the most memory. Used when both fp16 and bf16 are disabled.

      • fp16 (16-bit floating point): ~50% memory reduction and faster training. Used when explicitly enabled.

      • bf16 (Brain Float 16-bit): Better numerical stability than fp16; requires Ampere or newer GPUs. Used when explicitly enabled or when using FlashAttention2.

    • Example:

      "sft_config": {
        "fp16": "false",
        "bf16": "true,
      }
  • Attention Implementation

    • Controls which attention algorithm to use during training. Different implementations offer varying trade-offs between memory usage, speed, and compatibility.

    • Set use_cache: false during training to save memory.

    • Options:

      • eager: Universal compatibility, but slowest and uses the most memory. Precision: fp16/fp32. Training speed: 1× (baseline).

      • sdpa: Good balance of speed and compatibility. Precision: fp16/fp32. Training speed: ~1.5×.

      • flash_attention_2 (default): Fastest and most memory-efficient, but forces bf16 precision. Training speed: ~3×.

    • Example:

      {
        "model_init_config": {
          "attn_implementation": "flash_attention_2",
          "use_cache": false
        }
      }
  • Gradient Accumulation

    • Number of batches to accumulate before a backward pass.

    • Useful when the desired batch size exceeds GPU memory. It allows training with a larger effective batch size (per_device_train_batch_size × number of devices × gradient_accumulation_steps) without running out of VRAM.

    • Example:

      "sft_config": {
          "gradient_accumulation_steps": 4
        }
