# Fine-tuning

## Overview

General-purpose LLMs are trained for broad use cases. Fine-tuning further trains a model on a dataset of input and output examples from a specific domain, so that it gives more accurate answers for that domain or use case. LLM Labs simplifies this process by providing a user-friendly way to fine-tune and deploy open-source models, allowing you to tailor LLMs to your exact needs.

## Fine-Tuning Providers <a href="#fine-tuning-providers" id="fine-tuning-providers"></a>

There are two main options available for fine-tuning:

1. **Served by Datasaur**

   In this option, the fine-tuning process is fully managed and hosted by Datasaur. You don’t need to set up any infrastructure or manage cloud resources — Datasaur will handle everything for you.

   Currently, we support the following providers and pre-trained model(s) for fine-tuning served by Datasaur:

   1. **Amazon Bedrock**
      1. Amazon Nova Micro
      2. Amazon Nova Lite
      3. Amazon Nova Pro
      4. Amazon Titan Text G1 - Express
      5. Amazon Titan Text G1 - Lite
      6. Cohere Command
      7. Cohere Command Light
      8. Meta Llama 3.1 8B
      9. Meta Llama 3.1 70B
2. **Bring Your Own Key (BYOK)**

   In this option, the fine-tuning process runs in your own cloud account. This gives you full control over the environment — you can configure machine types, regions, and other resources as needed. BYOK is ideal for teams that require advanced customization.

   Currently, we support fine-tuning with BYOK through [Amazon SageMaker JumpStart](https://docs.datasaur.ai/llm-projects/models/amazon-sagemaker-jumpstart) for various models from **Meta Llama**\*, **AI Singapore**, **Qwen**, and **DeepSeek**. Only **instruct models** are supported for fine-tuning at this time. If you’d like to fine-tune a model not yet available in LLM Labs, please contact our support team at <support@datasaur.ai>.\
   BYOK models require additional prerequisites before you can fine-tune:

   * Your customer plan type should be **Subscription**, **Enterprise Usage Credit**, or **Internal**
   * Connect to the selected provider through [Models catalog > Manage providers](https://docs.datasaur.ai/llm-projects/models/amazon-sagemaker-jumpstart).
   * Ensure that your account role for the selected provider has the necessary permissions to perform fine-tuning.

{% hint style="info" %}
\*) Under the hood, **AWS SageMaker fine-tuning** uses the **Hugging Face** framework. Some base models, such as **Meta Llama**, are *gated models* — meaning you must log in to your Hugging Face account and accept the model’s license terms before using them.

To fine-tune a gated model, you must store your **Hugging Face access token (`hf_token`)** as a **secret key** in your AWS account. Ensure that your AWS account has the necessary permissions to access this secret key.

🔒 **Datasaur does not have access to your secret key value** — it remains fully secured within your AWS environment.
{% endhint %}

## Create fine-tuned models

This section guides you through the process of fine-tuning your models in LLM Labs.

### Step 1: Set up model

1. Navigate to the **Models** page.
2. On the **Available** tab, go to the **Fine-tuned LLMs** section, and click **Create fine-tuned model**.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-27d2ac6458900252f3f756fd6a8a831222bf89a7%2FModels%20-%20Available%20-%20fine-tuned%20LLMs%20-%20initial.png?alt=media" alt=""><figcaption><p>My models</p></figcaption></figure>
3. Set up your fine-tuning job.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-480829b1d8c7e865d3f47e9f774145b11af4b6f0%2FModels%20-%20fine-tuned%20LLMs%20-%20PCW%20-%20step%201%20-%20initial.png?alt=media" alt=""><figcaption></figcaption></figure>

   1. **Name your fine-tuned model**. This name uniquely identifies your fine-tuned model.
      1. For AWS SageMaker, the name is used as the `trainingJobId` and must comply with [AWS SageMaker’s naming requirements](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#API_CreateTrainingJob_RequestSyntax):
         1. **Uniqueness:** Must be unique within the same AWS Region and AWS account.
         2. **Length:** 1–63 characters.
         3. **Allowed characters:** Letters (`a–z`, `A–Z`), numbers (`0–9`), and hyphens (`-`).
         4. **Format rules:**
            1. Must start and end with an alphanumeric character.
            2. Cannot include spaces, underscores (`_`), or special characters other than hyphens.
      2. **Select a base model that you want to fine-tune**. You can select either:
         1. Pre-trained LLMs provided by Datasaur, or
         2. Existing fine-tuned models
   2. Choose a dataset. You can either upload a `.csv` file with two columns, `prompt` and `expected completion`, or choose an existing dataset from the library. For the validation dataset, you have 3 options:
      1. **Split from selected dataset**: Datasaur will split the selected dataset and use part of it as validation data. You will need to configure the validation size as a percentage.

         <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-fdd95b82a135c6fd916a6f066c2ef509d29cf98a%2FModels%20-%20fine%20tuning%20-%20PCW%20-%20step%201%20-%20validation%20data_%20split%20from%20selected%20dataset.png?alt=media" alt=""><figcaption><p>Split from selected dataset</p></figcaption></figure>
      2. **Use new dataset**: You will need to add a new dataset to use as validation.

         <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-a6319cb833859f6bf8eba5776f1c93593bae690a%2FModels%20-%20fine%20tuning%20-%20PCW%20-%20step%201%20-%20validation%20data_%20use%20new%20dataset.png?alt=media" alt=""><figcaption><p>Use new dataset</p></figcaption></figure>
      3. **None**: Choose this option if you don't want to add a validation dataset.

{% hint style="danger" %}
Important: Validation data is required for several base models (e.g., Cohere Command, Cohere Command Light), while some other models do not support validation data at all.
{% endhint %}

{% hint style="info" %}
[Learn more about Dataset](https://docs.datasaur.ai/llm-projects/dataset)
{% endhint %}
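As a rough illustration of the dataset format described above, the following sketch builds a two-column CSV and holds out a percentage of rows for validation. The column names come from this page; the shuffle-and-split logic, the function names, and the sample rows are assumptions for illustration only and do not describe how Datasaur performs the split internally.

```python
import csv
import io
import random

# Column names taken from the dataset description on this page.
COLUMNS = ["prompt", "expected completion"]


def split_rows(rows, validation_percent, seed=0):
    """Shuffle rows and hold out roughly `validation_percent` % for validation.

    Assumption: a simple seeded shuffle; the platform's own split may differ.
    """
    if not 0 < validation_percent < 100:
        raise ValueError("validation_percent must be between 0 and 100")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_percent / 100))
    return shuffled[n_val:], shuffled[:n_val]


def to_csv(rows):
    """Serialize rows (dicts keyed by COLUMNS) into CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data.
examples = [
    {"prompt": f"Question {i}", "expected completion": f"Answer {i}"}
    for i in range(10)
]
train_rows, validation_rows = split_rows(examples, validation_percent=20)
print(len(train_rows), len(validation_rows))  # 8 2
print(to_csv(train_rows).splitlines()[0])     # prompt,expected completion
```

Checking the header and row counts locally before uploading can catch formatting problems early.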

4. **Configure your resources**. This step applies to SageMaker only and determines the cost and performance of your training job.

   1. **Instance type**: The machine used to fine-tune your model.
   2. **Instance count**: The number of instances for training.
   3. **Training volume size**: The storage capacity allocated to each instance.

   For detailed recommendations, see the [Choosing the Right Resources](#choosing-the-right-resources) section.
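The SageMaker naming rules listed under **Name your fine-tuned model** can be checked locally before you submit a job. The sketch below encodes only the length, character, and start/end rules from that list; the function name is hypothetical, and uniqueness within your AWS Region and account cannot be verified this way.

```python
import re

# 1-63 characters, only letters, digits, and hyphens,
# starting and ending with an alphanumeric character.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_model_name(name: str) -> bool:
    """Return True if `name` satisfies the format rules listed in Step 1."""
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_valid_model_name("support-bot-v2"))  # True
print(is_valid_model_name("support_bot"))     # False (underscore)
print(is_valid_model_name("-support-bot"))    # False (starts with a hyphen)
```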

### Step 2: Adjust hyperparameters

Next, you will need to configure the hyperparameters for your fine-tuning project.

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-39c9cbd61655e030359ddd9055817d361b907799%2FModels%20-%20fine-tuned%20LLMs%20-%20PCW%20-%20step%202.png?alt=media" alt=""><figcaption><p>Adjust hyperparameters</p></figcaption></figure>

The fundamental hyperparameters are **epochs** and **learning rate**.

Some models also support **advanced hyperparameters**, which Datasaur provides with default values. These parameters are always applied alongside the fundamental ones, but you can adjust them for further fine-tuning if desired.

The advanced hyperparameters include: **early stopping callback** (for fine-tuning with validation), **batch size**, and **learning warm-up steps**.

#### **Optional Hyperparameters**

{% hint style="info" %}
This section applies to **AWS SageMaker JumpStart Bring Your Own Key (BYOK)** fine-tuning only.
{% endhint %}

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-00d0b9d27cbd087546179b071144b94e8703045d%2FModels%20-%20fine-tuned%20LLMs%20-%20PCW%20-%20step%202%20-%20Optional%20hyperparameters.png?alt=media" alt=""><figcaption></figcaption></figure>

If you wish to include additional hyperparameters, you can define them in **Optional Hyperparameters** as a **JSON object**. For example, to enable **LoRA**, you can specify:

```json
{
  "use_lora": true,
  "lora_config": {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05
  }
}
```

You **cannot override** the fundamental or advanced hyperparameters in **Optional Hyperparameters**.

If you specify a hyperparameter without providing its full configuration, the **default values** will be applied automatically. For example, if you configure `use_lora` without specifying `lora_config`, the default LoRA configuration will be:

```json
{
  "lora_config": {
    "r": 8,
    "lora_alpha": 8,
    "lora_dropout": 0.0
  }
}
```
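The default-filling behaviour described above can be pictured with a small sketch. The default values are copied from the example; the function name is hypothetical, and the per-key fallback for a partially specified `lora_config` is an assumption rather than documented platform behaviour.

```python
import json

# Defaults copied from the default LoRA configuration shown above.
DEFAULT_LORA_CONFIG = {"r": 8, "lora_alpha": 8, "lora_dropout": 0.0}


def resolve_optional_hyperparameters(raw_json: str) -> dict:
    """Parse the optional hyperparameters and fill in missing LoRA defaults."""
    params = json.loads(raw_json)
    if not isinstance(params, dict):
        raise ValueError("Optional hyperparameters must be a JSON object")
    if params.get("use_lora"):
        # Assumption: user-supplied keys win, missing keys fall back to defaults.
        user_config = params.get("lora_config", {})
        params["lora_config"] = {**DEFAULT_LORA_CONFIG, **user_config}
    return params


print(resolve_optional_hyperparameters('{"use_lora": true}'))
# {'use_lora': True, 'lora_config': {'r': 8, 'lora_alpha': 8, 'lora_dropout': 0.0}}
```

Parsing the JSON locally first also surfaces syntax errors, such as trailing commas, before you paste the value into the form.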

For more recommendations on hyperparameters, see the [Common Fine-Tuning Hyperparameters](#common-fine-tuning-hyperparameters) section.

### Step 3: Review job

1. The last step is to review your fine-tuning job before you start the process.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-9c8f75625652ee37b7037522781fa2ff4e313799%2FModels%20-%20fine-tuned%20LLMs%20-%20PCW%20-%20step%203%20-%20sagemaker%20models.png?alt=media" alt=""><figcaption><p>Review job</p></figcaption></figure>
2. You can review the optional hyperparameters that you added by clicking **View optional hyperparameters**.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-23a2a6469c555b928605e316bf208726f84fd271%2FModels%20-%20fine-tuned%20LLMs%20-%20view%20details%20-%20optional%20hyperparameters.png?alt=media" alt=""><figcaption><p>Review job</p></figcaption></figure>
3. You can also view the predicted cost by clicking the **View total predicted cost** button in the **Costs** section. It calculates and shows the total predicted cost of the fine-tuning process.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-ff2a8249370cdfa96f80a82022da1c85d4f0109b%2FModels%20-%20fine%20tuning%20-%20PCW%20-%20step%203%20-%20total%20predicted%20cost.png?alt=media" alt=""><figcaption><p>Review job</p></figcaption></figure>

{% hint style="info" %}
Please note that this is just a cost prediction. The final cost may be higher or lower, as each model has its own tokenizer.
{% endhint %}

4. Once you have reviewed the configuration, you will need to check the acknowledgement checkbox.
5. Lastly, click **Start fine-tuning job** and the training process will start.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-6c2811301098dd33552419e415f03bb4c4f67bf6%2FModels%20-%20fine-tuned%20LLMs%20-%20training.png?alt=media" alt=""><figcaption><p>Training process</p></figcaption></figure>

{% hint style="info" %}
The training process may take several hours to complete. Datasaur will notify you by email when the training process is complete.
{% endhint %}

6. Once the training process is complete, your model will be available to deploy.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-541c8c81976415323027268a22cc943773bf02ba%2FModels%20-%20fine-tuned%20LLMs%20-%20not%20deployed.png?alt=media" alt=""><figcaption><p>My models page</p></figcaption></figure>

## Model management

### Model status

There are 7 possible statuses for the fine-tuned models.

1. **Training**: The model is currently being trained on your dataset. This status indicates that the training process is in progress, and the model is learning from your data.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-2e475b3c1305f4704e7c9765f7219e0284c850ff%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20training.png?alt=media" alt=""><figcaption><p>Training status</p></figcaption></figure>
2. **Training failed**: The model training process has failed due to an error. This status indicates that the training process was interrupted, and you may need to investigate and resolve the issue.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-770651512c69c227ab60329fcf04b9dde342ea88%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20training%20failed.png?alt=media" alt=""><figcaption><p>Training failed</p></figcaption></figure>
3. **Stopping training**: The model training process is being stopped. This status indicates that someone has chosen to stop the training.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-2adac5ae9d46e8ac09d179e113f518b1b0453271%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20Stopping%20training.png?alt=media" alt=""><figcaption><p>Stopping training</p></figcaption></figure>
4. **Training stopped**: The model training process has been stopped. This status indicates that the training process has been successfully stopped. You can’t resume the training once it has stopped.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-4d341f4d64aa499bb34510837b265ebd6dc21382%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20stopped.png?alt=media" alt=""><figcaption><p>Training stopped</p></figcaption></figure>
5. **Not deployed**: The model has been trained but has not yet been deployed for use. You can deploy the model to use it in Sandbox.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-4fe8a733cb17f4c13ebabafd06ea293e1469bbcf%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20not%20deployed.png?alt=media" alt=""><figcaption><p>Not deployed</p></figcaption></figure>
6. **Deploying**: The model is being deployed for use. This status indicates that the deployment process is in progress, and the model will soon be available for use in Sandbox.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-7b5c8a68be0c5ecdb41a39494c396ca0e5c2abbd%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20deploying.png?alt=media" alt=""><figcaption><p>Deploying</p></figcaption></figure>
7. **Deployed**: The model has been successfully deployed. This status indicates that the model is now available for use in Sandbox, and you can start using it to generate predictions or responses.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-ada3386d1f6a916af8efe2ff660250b2f4f97b0b%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20deployed.png?alt=media" alt=""><figcaption><p>Deployed</p></figcaption></figure>

### Deploy models

To deploy a fine-tuned model:

1. Click **Deploy model** to start the deployment.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-4fe8a733cb17f4c13ebabafd06ea293e1469bbcf%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20not%20deployed.png?alt=media" alt=""><figcaption></figcaption></figure>
2. In the dialog that appears, specify the **auto undeploy schedule**.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-59ead1ce036b7b89e25e1fb46c3f5e2476515057%2FModels%20-%20fine-tuned%20LLMs%20-%20deploy%20confirmation.png?alt=media" alt=""><figcaption></figcaption></figure>
3. Click **Deploy model** in the dialog to confirm and the process will start.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-f05a02826ae3ef952f26a7aeb3edfefbbab601b6%2FModels%20-%20fine-tuned%20LLMs%20-%20deploying.png?alt=media" alt=""><figcaption><p>Deploying process</p></figcaption></figure>
4. Once the process is finished, your model will be available for experiments in Sandbox. [Learn more about Sandbox](https://docs.datasaur.ai/llm-projects/sandbox).

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-9ba1baa9af9f67eeda9ad2aef98ade73694523c5%2FModels%20-%20fine-tuned%20LLMs%20-%20deployed.png?alt=media" alt=""><figcaption><p>Model deployed</p></figcaption></figure>

### Undeploy models

1. Click the more menu (three-dots) in the right corner of the model card and select **Undeploy.**

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-0e0bea28a8a456c93601e1742e01309dd518bad9%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20deployed%20-%20more%20menu.png?alt=media" alt=""><figcaption><p>Undeploy model</p></figcaption></figure>
2. Confirm the process by clicking **Undeploy** in the dialog that appears.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-44fcb136dabcf12fc64b9d43f704701a36db11f6%2FModels%20-%20fine-tuned%20LLMs%20-%20undeploy%20confirmation.png?alt=media" alt=""><figcaption></figcaption></figure>
3. Your model will be undeployed and you will no longer be charged for the hourly cost.

### View model details

To view the model details, click the more menu (three-dots) and select **View details.** The details of the fine-tuned model will be shown.

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-0e0bea28a8a456c93601e1742e01309dd518bad9%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20deployed%20-%20more%20menu.png?alt=media" alt=""><figcaption><p>View details menu</p></figcaption></figure>

In this dialog, you can view the dataset, validation dataset, models used, hyperparameter configurations, the creator, and storage cost information.

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-0d6ba3fd40f54ab43165c85e400d457a6a42218b%2FModels%20-%20fine-tuned%20LLMs%20-%20view%20details.png?alt=media" alt=""><figcaption><p>View details menu</p></figcaption></figure>

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-23a2a6469c555b928605e316bf208726f84fd271%2FModels%20-%20fine-tuned%20LLMs%20-%20view%20details%20-%20optional%20hyperparameters.png?alt=media" alt=""><figcaption><p>View details menu</p></figcaption></figure>

### Download models

{% hint style="info" %}
This section applies to **AWS SageMaker JumpStart Bring Your Own Key (BYOK)** fine-tuning only.
{% endhint %}

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-0e0bea28a8a456c93601e1742e01309dd518bad9%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20deployed%20-%20more%20menu.png?alt=media" alt=""><figcaption><p>Download model</p></figcaption></figure>

You can use your fine-tuned model for further tasks by **downloading it**. The model can then be loaded in your **local environment** for inference or additional fine-tuning, uploaded to your **own SageMaker or other cloud environment**, or backed up for **compliance or reproducibility purposes**.

When you initiate a download, the selected fine-tuned model is **exported as a `.tar.gz` archive** that contains:

* **Model weights:** The core parameters of the fine-tuned model.
* **Configuration files:** Metadata needed to load the model correctly (e.g., tokenizer config, architecture).
* **Optional adapters:** If the model includes LoRA or other PEFT adapters, these are included as separate files.
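After downloading, you may want to check what the archive contains before loading it. The sketch below lists the members of a `.tar.gz` file with Python's standard `tarfile` module. Because no real export is available here, it first builds a tiny stand-in archive; the file names inside it are hypothetical and only illustrate the kind of contents described above.

```python
import io
import os
import tarfile
import tempfile


def list_archive(archive_path):
    """Return the member names of a .tar.gz archive without extracting it."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()


def _make_demo_archive(directory):
    """Build a tiny stand-in archive; the file names are hypothetical."""
    path = os.path.join(directory, "model.tar.gz")
    members = [("config.json", b"{}"), ("tokenizer_config.json", b"{}")]
    with tarfile.open(path, mode="w:gz") as archive:
        for name, payload in members:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return path


with tempfile.TemporaryDirectory() as workdir:
    demo_path = _make_demo_archive(workdir)
    demo_members = list_archive(demo_path)

print(demo_members)  # ['config.json', 'tokenizer_config.json']
```

For a real download, call `list_archive` with the path to the exported file instead of the demo archive.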

### Delete models

To delete a fine-tuned model, click the more menu (three-dots) and select **Delete.**

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-0e0bea28a8a456c93601e1742e01309dd518bad9%2FModels%20-%20fine-tuned%20LLMs%20-%20card%20-%20deployed%20-%20more%20menu.png?alt=media" alt=""><figcaption><p>Delete model</p></figcaption></figure>

In the dialog that appears, check the acknowledgement checkbox, and confirm the deletion by clicking **Delete model.**

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-f96b4a7958f4286699b355f3dfcff3a06e997d1e%2FModels%20-%20fine-tuned%20LLMs%20-%20delete%20confirmation.png?alt=media" alt=""><figcaption></figcaption></figure>

### Use in Sandbox

Once a fine-tuned model is deployed, it will be available in the Sandbox for further experimentation and testing. This allows you to integrate and test the specialized model within your specific applications. [Learn more about Sandbox](https://docs.datasaur.ai/llm-projects/sandbox).

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-a1f3ce69842e58311d4f58d791366ad9eb8f8577%2FSandbox%20-%20fine-tuned%20models%20-%20initial.png?alt=media" alt=""><figcaption><p>Access model via Sandbox</p></figcaption></figure>

## Choosing the Right Resources <a href="#choosing-the-right-resources" id="choosing-the-right-resources"></a>

{% hint style="info" %}
This section applies to **Bring Your Own Key (BYOK)** fine-tuning only.
{% endhint %}

Every fine-tuning job has **different resource requirements** depending on **model size**, **dataset size**, and **training configuration** (e.g., batch size, sequence length, LoRA rank). Each instance differs in **GPU memory (VRAM)**, **number of GPUs**, and **storage capacity**, which affect training performance, duration, and cost.

**Example:**

* `ml.g5.4xlarge`: 1 GPU, 24 GB VRAM, 600 GB instance storage → not sufficient for full fine-tuning of a 7B model
* `ml.g5.12xlarge`: 4 GPUs, 96 GB VRAM (24 GB per GPU), 3,800 GB instance storage

For detailed information, see [AWS documentation](https://aws.amazon.com/ec2/instance-types/).

Selecting the right instance ensures **stable training**, **optimal performance**, and **cost efficiency**. To choose a suitable instance, first estimate the VRAM, storage, and instance count that your fine-tuning job requires, as described in the following sections.

### VRAM (GPU Memory) <a href="#vram-gpu-memory" id="vram-gpu-memory"></a>

VRAM determines how large your model and training batch size can be. If VRAM is too small, your training may fail or become extremely slow due to gradient checkpointing or offloading.

**Main factors affecting VRAM:**

* **Model size (parameters)** — Larger models require more memory to store weights.
* **Precision (FP32, FP16, BF16, FP8, etc.)** — Lower precision reduces VRAM usage, but might have tradeoffs with accuracy.
* **Optimizer states** — Extra memory for momentum, variance, etc. (e.g., Adam/AdamW).
* **Activations** — Temporary memory for forward/backward pass.
* **Use of LoRA / PEFT** — Fine-tunes only a subset of weights, so it will reduce memory usage significantly.

{% hint style="danger" %}
Activations are not included in this simplified estimation, as they are usually small compared to weights, gradients, and optimizer states for typical batch sizes.
{% endhint %}

#### **Step 1: Identify model size**

* Model has **7B parameters**.
* This is the **baseline factor** for VRAM calculation.

#### **Step 2: Apply precision**

* Precision determines how much memory each parameter uses.
* Common precisions:

  | Precision      | Bytes per parameter | Memory effect                      |
  | -------------- | ------------------- | ---------------------------------- |
  | FP32 (default) | 4                   | Full memory                        |
  | FP16           | 2                   | Half memory of FP32 (×2 reduction) |
  | BF16           | 2                   | Half memory of FP32                |
* **Example:**
  * VRAM = 7B parameters × 2 bytes (FP16) = **14 GB**

#### **Step 3: Add gradients**

* Each weight requires a **gradient of the same precision**.
* Memory for gradients is approximately the **same as weights**, so total VRAM doubles compared to Step 2.
* **Example:**
  * VRAM = 14 GB (weights) + 14 GB (gradients) = **28 GB**

#### **Step 4: Add optimizer states**

* Optimizers like Adam/AdamW store **additional memory** for momentum and variance (\~2× weights). These extra tensors require additional memory for every trainable parameter.

  | Optimizer    | Extra tensors           | Approximate multiplier                           |
  | ------------ | ----------------------- | ------------------------------------------------ |
  | Adam / AdamW | 2 (momentum + variance) | 2× weights                                       |
  | Adafactor    | 1 (factored variance)   | 1× weights (can be less in some implementations) |
  | SGD          | 0–1 (momentum)          | 0–1× weights                                     |
* **Example (AdamW):**
  * VRAM = 28 GB (weights + gradients) + 28 GB (optimizer) = **56 GB**

#### **Step 5: Adjust for training method (full vs LoRA)**

* For full fine-tuning, all weights, gradients, and optimizer states are used.
  * **Example:**
    * VRAM = 56 GB (previous calculation)
* LoRA fine-tunes only a **small fraction of model parameters** instead of all of them. This reduces the memory for **gradients + optimizer states** to roughly 10–20% of the full fine-tuning amount, but **the full model weights still need to be loaded** for forward computations.
  * **Example** (using 20% as the fraction):
    * VRAM = Weights + (Gradients + Optimizer states) × Fraction of trainable parameters
    * 14 GB + ((14 GB + 28 GB) × 20%) = 14 GB + 8.4 GB = **22.4 GB**

Please note that these are **rough estimates**. Actual VRAM usage can vary depending on factors such as batch size, sequence length, precision, gradient checkpointing, and framework overhead, which can **increase memory requirements**. At the same time, certain techniques, like LoRA, reduce memory usage for gradients and optimizer states in proportion to the fraction of weights being trained, although the full model weights still need to be loaded.

It is recommended to always leave **some extra GPU memory as a buffer** to account for unexpected spikes in memory usage, framework overhead, or temporary allocations during training, which helps prevent out-of-memory errors and ensures stable execution.
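The five estimation steps above can be condensed into a short calculation. This is a minimal sketch of the same simplified formula, not a profiler: it ignores activations and framework overhead, treats 1 GB as 10^9 bytes, and uses 0 as the multiplier for SGD (no momentum). The function and dictionary names are hypothetical.

```python
GB = 1e9  # the estimate treats 1 GB as 10^9 bytes

# Bytes per parameter, from the precision table above.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

# Optimizer state size relative to the weights, from the optimizer table above.
# Assumption: SGD without momentum, hence 0.
OPTIMIZER_MULTIPLIER = {"adamw": 2.0, "adafactor": 1.0, "sgd": 0.0}


def estimate_vram_gb(num_params, precision="fp16", optimizer="adamw",
                     trainable_fraction=1.0):
    """Rough VRAM estimate in GB.

    trainable_fraction=1.0 models full fine-tuning; a smaller value models
    LoRA, where only part of the gradients and optimizer states is kept.
    """
    weights = num_params * BYTES_PER_PARAM[precision] / GB
    gradients = weights
    optimizer_states = weights * OPTIMIZER_MULTIPLIER[optimizer]
    return weights + (gradients + optimizer_states) * trainable_fraction


print(f"{estimate_vram_gb(7e9):.1f} GB")                          # 56.0 GB
print(f"{estimate_vram_gb(7e9, trainable_fraction=0.2):.1f} GB")  # 22.4 GB
```

The two printed values match the full fine-tuning and LoRA examples worked through in Steps 4 and 5.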

### Training Volume Size <a href="#training-volume-size" id="training-volume-size"></a>

Storage determines how much space you have for the **model checkpoints**, **intermediate artifacts**, and **dataset**.

#### **Step 1: Base model storage**

* This is the size of the pre-trained weights. Most pre-trained models are stored in **FP16 or compressed format**.
* **Example:**
  * 7B model (FP16) ≈ **14 GB**

{% hint style="info" %}
If stored in FP32, size roughly doubles (\~28 GB).
{% endhint %}

#### **Step 2: Checkpoints storage**

* During training, **checkpoints are periodically saved** to allow recovery or evaluation.
* Each checkpoint contains **model weights + optimizer states + metadata**.
* Typical checkpoint storage ≈ **2–3× the base model size**, because:
  * Optimizer states (AdamW, Adafactor) require extra memory (\~1–2× weights)
  * Metadata and temporary artifacts add minor overhead
* **Example (1x checkpoint):**
  * 7B model (FP16) → 14 GB × 2–3 ≈ **28–42 GB**

{% hint style="info" %}
More frequent checkpoints or larger optimizers increase storage requirements.
{% endhint %}

#### **Step 3: Dataset size**

* Include the **raw dataset size** (CSV, JSON, etc.).
* **Example:** 1 GB dataset adds 1 GB to volume

#### **Step 4: Recommended volume size**

* Storage Size ≈ Base model + Checkpoints + Dataset
* **Example:**
  * 14 GB (weights) + 30 GB (checkpoints) + 1 GB (dataset) ≈ **45 GB**
* **Recommendation:** Round up (e.g., 50 GB) to allow buffer for temporary files and unexpected artifacts.
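The storage steps above follow the same pattern and can be sketched as a small helper. The function names, the default checkpoint multiplier, and the 10 GB rounding step are assumptions chosen for illustration; adjust them to your own checkpointing setup.

```python
import math


def estimate_volume_gb(base_model_gb, dataset_gb, checkpoint_multiplier=3.0,
                       num_checkpoints=1):
    """Base model + checkpoints + dataset, in GB (before any buffer)."""
    checkpoints_gb = base_model_gb * checkpoint_multiplier * num_checkpoints
    return base_model_gb + checkpoints_gb + dataset_gb


def round_up_gb(size_gb, step_gb=10):
    """Round up to the next multiple of `step_gb` to leave some buffer."""
    return math.ceil(size_gb / step_gb) * step_gb


low = estimate_volume_gb(14, 1, checkpoint_multiplier=2.0)
high = estimate_volume_gb(14, 1, checkpoint_multiplier=3.0)
print(low, high)                            # 43.0 57.0
print(round_up_gb(low), round_up_gb(high))  # 50 60
```

Keeping more than one checkpoint multiplies the checkpoint term, so `num_checkpoints` is usually the parameter with the largest effect on the result.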

### Instance Count <a href="#instance-count" id="instance-count"></a>

Instance count determines **how many machines** are used for distributed training.

* **Single-instance (1)** — Simpler setup, lower cost, but slower for large models.
* **Multiple instances (≥2)** — Distributes model and data across machines for parallel training, but requires more setup time and inter-node communication.

**When to increase instance count:**

* The model is too large to fit on the GPUs of a single instance (e.g., 70B+)
* The dataset is massive and you want to reduce training time

{% hint style="danger" %}
While using multiple GPUs or instances helps handle large models, be aware of the added overhead (time, VRAM, and compute efficiency).
{% endhint %}
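As a starting point for the instance count, you can compare a VRAM estimate against the total GPU memory of one instance. The sketch below is deliberately optimistic: it assumes memory can be sharded evenly across GPUs and instances, which real distributed training does not guarantee. The function name, the 20% headroom, and the 48 GB per-instance figure are hypothetical.

```python
import math


def min_instance_count(required_vram_gb, vram_per_instance_gb, headroom=0.2):
    """Lower bound on instances, keeping `headroom` of each instance free.

    Assumption: memory shards evenly across all GPUs and instances.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_gb = vram_per_instance_gb * (1 - headroom)
    return max(1, math.ceil(required_vram_gb / usable_gb))


# Hypothetical instance with 48 GB of total GPU memory.
print(min_instance_count(56.0, 48))  # 2
print(min_instance_count(22.4, 48))  # 1
```

Treat the result as a lower bound and validate it with a short trial run, since communication overhead and uneven sharding usually raise the real requirement.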

## Common Fine-Tuning Hyperparameters <a href="#common-fine-tuning-hyperparameters" id="common-fine-tuning-hyperparameters"></a>

Hyperparameters control how the model learns and affect **training stability**, **performance**, and **final model quality**. Choosing appropriate values balances **training efficiency**, **accuracy**, and **resource cost**. Common fine-tuning hyperparameters include:

### **Fundamental Hyperparameters** <a href="#fundamental-hyperparameters" id="fundamental-hyperparameters"></a>

* **Epochs (num\_train\_epochs)**
  * Number of times the model sees the full dataset. More epochs improve learning but increase **training time** and risk **overfitting**.
* **Learning rate**
  * Step size for optimizer updates. High learning rates can cause **unstable training**, while low rates may result in **slow convergence** or underfitting.

### **Advanced Hyperparameters** <a href="#advanced-hyperparameters" id="advanced-hyperparameters"></a>

* **Batch size (per\_device\_train\_batch\_size)**
  * Number of examples per batch before updating model parameters. Larger batch sizes **increase VRAM usage** but improve gradient estimation stability.
* **Learning rate warmup steps**
  * Gradually increases learning rate at the start of training. Helps **stabilize training** and prevents early divergence.
* **Early stopping callback**
  * Stops training if validation metrics do not improve after a set number of evaluations. Helps **prevent overfitting** and saves **compute cost**. Only applicable when training uses validation data.
  * **early\_stopping\_patience** sets the number of evaluation calls to wait before stopping training once the metric stops improving.
  * **early\_stopping\_threshold** sets the minimum improvement required to consider the metric as "improving".
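To make the interaction between patience and threshold concrete, the sketch below simulates early stopping over a list of validation losses. It is a simplified illustration written for this page, not the Hugging Face implementation; the function name and the sample losses are hypothetical, and it assumes a metric where lower is better.

```python
def early_stopping_step(eval_losses, patience, threshold=0.0):
    """Return the index of the evaluation at which training would stop.

    An evaluation counts as an improvement only if the loss drops by more
    than `threshold` compared with the best loss so far. Returns None if
    training runs through all evaluations.
    """
    best = None
    waited = 0
    for index, loss in enumerate(eval_losses):
        if best is None or best - loss > threshold:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return index
    return None


losses = [1.0, 0.8, 0.79, 0.795, 0.80]
print(early_stopping_step(losses, patience=2, threshold=0.05))  # 3
print(early_stopping_step(losses, patience=3, threshold=0.05))  # 4
```

With a threshold of 0.05, the small drop from 0.8 to 0.79 does not count as an improvement, so the patience counter starts there and training stops two evaluations later.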

### **Optional Hyperparameters** <a href="#optional-hyperparameters-1" id="optional-hyperparameters-1"></a>

* **LoRA Configuration**
  * Fine-tunes only a **subset of model parameters**, reducing **memory and optimizer state requirements**.
  * **Key LoRA parameters and their effects:**

    | Parameter        | Description                                              | Example / Effect                                                                                                                     |
    | ---------------- | -------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
    | `r`              | Rank of the low-rank matrices used for adaptation        | Higher `r` can improve learning but uses more memory                                                                                 |
    | `alpha`          | Scaling factor applied to the LoRA updates               | `alpha = 32` → higher values amplify LoRA weight updates; often scaled relative to `r`                                               |
    | `dropout`        | Dropout applied to LoRA layers                           | `dropout = 0.05` → helps prevent overfitting by randomly zeroing some LoRA updates                                                   |
    | `target_modules` | List of model layers to apply LoRA to                    | `["q_proj", "v_proj"]` → only these attention projection layers are adapted; targeting more layers increases capacity and memory use |
    | `fan_in_fan_out` | Whether LoRA weights are transposed for some model types | `false` for most standard models; needed for certain architectures                                                                   |
    | `bias`           | How bias terms are handled in LoRA layers                | `"none"`, `"all"`, `"lora_only"`; usually `"none"` for simplicity                                                                    |
  * Example:

    ```json
    {
      "use_lora": true,
      "lora_config": {
        "r": 8,
        "alpha": 8,
        "dropout": 0.05,
        "target_modules": ["q_proj", "v_proj"],
        "fan_in_fan_out": false,
        "bias": "none"
      }
    }
    ```
* **Merge Adapters**
  * Determines whether the LoRA adapters are **merged into the base model weights** after fine-tuning.
  * If true, adapters are merged into the main model, so the model can be used independently without loading separate LoRA weights.
  * If false, adapters remain separate, allowing further fine-tuning or swapping in different adapters.
  * Example:

    ```json
    {
      "use_lora": true,
      "merge_adapters": true
    }
    ```
* **Logging Strategy**
  * Frequency at which training metrics, such as training loss and accuracy, are recorded. More frequent logging provides more insight but may slightly slow down training.
  * Options: `epoch`, `steps`, or `none`. If using `steps`, `logging_steps` must also be set.
  * Example:

    ```json
    {
      "sft_config": {
        "logging_strategy": "steps",
        "logging_steps": 50
      }
    }
    ```
* **Evaluation strategy**
  * Determines **when validation runs**: after every epoch or after a set number of steps.
  * Options: `epoch`, `steps`, or `none`. If using `steps`, set `eval_steps`; if it is omitted, it defaults to `logging_steps`.
  * If different from logging strategy, evaluation logs may appear at a different interval.
  * Example:

    ```json
    {
      "sft_config": {
        "eval_strategy": "steps"
      }
    }
    ```
* **Mixed Precision Training**
  * Controls the numerical precision used during training to balance **memory usage** and **training speed**.
  * Options:

    | Option                         | Description                                                             | Used when                                                 |
    | ------------------------------ | ----------------------------------------------------------------------- | --------------------------------------------------------- |
    | `fp32` (32-bit floating point) | Highest precision, but uses the most memory                             | Both `fp16` and `bf16` are disabled                       |
    | `fp16` (16-bit floating point) | **\~50% memory reduction** and faster training                          | `fp16` is explicitly enabled                              |
    | `bf16` (Brain Float 16-bit)    | **Better numerical stability** than fp16; requires Ampere or newer GPUs | `bf16` is explicitly enabled, or FlashAttention2 is used  |
  * Example:

    ```json
    {
      "sft_config": {
        "fp16": false,
        "bf16": true
      }
    }
    ```
* **Attention Implementation**
  * Controls which **attention algorithm** to use during training. Different implementations offer varying trade-offs between **memory usage, speed,** and **compatibility**.
  * Set `use_cache` to `false` during training to save memory.
  * Options:

    | Attention Type                | Description                                                   | Precision   | Training Speed |
    | ----------------------------- | ------------------------------------------------------------- | ----------- | -------------- |
    | eager                         | Universal compatibility but slowest and uses most memory      | fp16/fp32   | 1x (baseline)  |
    | sdpa                          | Good balance of speed and compatibility, works with fp16/fp32 | fp16/fp32   | 1.5x           |
    | flash\_attention\_2 (default) | Fastest and most memory-efficient, but forces bf16 precision  | Forced bf16 | 3x             |
  * Example:

    ```json
    {
      "model_init_config": {
        "attn_implementation": "flash_attention_2",
        "use_cache": false
      }
    }
    ```
* **Gradient Accumulation**
  * Number of batches whose gradients are accumulated before each optimizer update.
  * Useful when the **desired batch size does not fit in GPU memory**. Allows training with a larger effective batch size without running out of VRAM.
  * Example:

    ```json
    {
      "sft_config": {
        "gradient_accumulation_steps": 4
      }
    }
    ```
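
To make the idea of an effective batch size concrete, the following Python sketch compares one large batch with several accumulated micro-batches. It uses a hypothetical one-parameter linear model and made-up data, and it assumes equal-sized micro-batches on a single device; it illustrates the arithmetic only and is not the training code used for fine-tuning.

```python
def mse_gradient(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data and current weight.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Reference: gradient computed on all 8 examples in a single batch.
full_batch_gradient = mse_gradient(w, xs, ys)

# Accumulation: process 4 micro-batches of 2 examples, averaging their gradients.
accumulated_gradient = 0.0
for step in range(gradient_accumulation_steps):
    start = step * per_device_train_batch_size
    end = start + per_device_train_batch_size
    micro_gradient = mse_gradient(w, xs[start:end], ys[start:end])
    accumulated_gradient += micro_gradient / gradient_accumulation_steps
# A single optimizer update would be applied here using accumulated_gradient.

print(effective_batch_size)
print(abs(full_batch_gradient - accumulated_gradient) < 1e-9)
```

The accumulated gradient matches the full-batch gradient, while only two examples need to be held in memory at a time. The trade-off is that each optimizer update takes several forward and backward passes.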

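For the LoRA settings described earlier in this section, a rough estimate of the number of trainable parameters shows why `r` and `target_modules` affect memory use. The sketch below assumes the standard LoRA formulation, in which each adapted weight matrix of shape `d_out x d_in` gets two low-rank matrices with `r * (d_in + d_out)` parameters in total and the update is scaled by `alpha / r`. The hidden size and layer count are hypothetical and do not describe any specific supported model.

```python
def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by LoRA to one weight matrix: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Values taken from the example configuration above.
r = 8
alpha = 8
target_modules = ["q_proj", "v_proj"]

# Hypothetical model dimensions, for illustration only.
hidden_size = 4096
num_layers = 32

adapted_matrices = len(target_modules) * num_layers
lora_trainable = lora_params_per_matrix(hidden_size, hidden_size, r) * adapted_matrices
full_trainable = hidden_size * hidden_size * adapted_matrices
scaling = alpha / r  # standard LoRA scaling factor applied to the update

print(f"LoRA trainable parameters: {lora_trainable:,}")
print(f"Same matrices, full fine-tuning: {full_trainable:,}")
print(f"Fraction trained: {lora_trainable / full_trainable:.4%}")
print(f"Update scaling (alpha / r): {scaling}")
```

Doubling `r` or adding more entries to `target_modules` increases the trainable parameter count proportionally, which is the capacity and memory trade-off noted in the parameter table.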