# Supported Formats

This page details all the supported Datasaur formats, provides examples for each format and clarifies expected file structure where appropriate.\
\
Note: Through file transformers we can customize the output format.

## TXT

A TXT file is a simple file format that contains unformatted text and can be opened and edited with any basic text editor. It is commonly used for storing and exchanging data, code, and other textual information.

**This format is supported for data import**, especially for simple span based labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-76f9b47478308ce208fee3df6de88480994355ee%2FSample%20TXT%20(1).txt?alt=media>" %}
TXT Sample
{% endfile %}

## TSV

[**TSV**](https://www.iana.org/assignments/media-types/text/tab-separated-values) (tab-separated values) is a simple text format for storing data in a tabular structure. A TSV file encodes a number of records, each of which may contain multiple fields.

* Each record is represented as a single line.
* Each field value is represented as text.
* Fields in a record are separated from one another by the tab character.
  * *Note that because the tab is a special character for this format, fields that contain tabs are not allowed in this encoding.*
* The header (first) line of this encoding contains the name of each field, separated by tabs.

**This format is supported for data import**, particularly for row based projects.

**Example**

```
Book Title    Author    Genre
Sherlock Holmes: A Study in Scarlet    Sir Arthur Conan Doyle    Fiction
To Kill a Mockingbird    Harper Lee    Fiction
Alan Turing: The Enigma    Andrew Hodges    Non fiction
Humble Pie    Gordon Ramsay    Non fiction
The Little Prince    Antoine de Saint-Exupéry    Fiction
```
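As a rough illustration of the rules above, the sketch below reads TSV text with Python's standard `csv` module. The sample rows and the helper name `read_tsv` are assumptions made for this example; they are not part of Datasaur.

```python
import csv
import io

# Sample data (illustrative only); fields are separated by real tab characters.
TSV_TEXT = (
    "Book Title\tAuthor\tGenre\n"
    "To Kill a Mockingbird\tHarper Lee\tFiction\n"
    "Humble Pie\tGordon Ramsay\tNon fiction\n"
)


def read_tsv(text: str) -> list[dict[str, str]]:
    """Parse TSV text into one dict per record, keyed by the header line."""
    # QUOTE_NONE treats quote characters as plain text, since TSV has no quoting rule.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


records = read_tsv(TSV_TEXT)
print(records[0]["Author"])  # Harper Lee
```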

## IOB (specialized .tsv)

**IOB** (inside, outside, beginning) is a common format for labeling tokens in computational linguistics (e.g., named-entity recognition). An IOB file is also a .tsv file, but it conforms to the following rules:

* The **B- prefix** before a tag indicates that the tag is the beginning of a chunk.
* The **I- prefix** before a tag indicates that the tag is inside a chunk.
* The **B- tag** is used only when a tag is followed by a tag of the same type without O tokens between them.
* The **O tag** indicates that a token does not belong to a chunk.

**This format is supported for data import** in span based labeling projects.

**Example**

```
Sherlock	B-PER
Holmes	I-PER
become	O
widely	O
popular	O
in	O
1891 YEAR	O
.	O
```
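One way to read such a file is to group consecutive B-/I- tokens into chunks. The sketch below is a minimal illustration, not Datasaur's own parser: the function name `iob_to_spans` is hypothetical, and joining tokens with a single space is an assumption.

```python
def iob_to_spans(lines):
    """Group tab-separated "token<TAB>tag" lines into (label, text) chunks."""
    spans = []
    label, tokens = None, []
    for line in lines:
        token, tag = line.rstrip("\n").split("\t")
        prefix, _, tag_label = tag.partition("-")
        # An I- tag continues the open chunk only if the label matches.
        continues = prefix == "I" and tag_label == label
        if not continues and tokens:
            spans.append((label, " ".join(tokens)))  # assumption: join with spaces
            label, tokens = None, []
        if prefix in ("B", "I"):
            label = tag_label
            tokens.append(token)
    if tokens:
        spans.append((label, " ".join(tokens)))
    return spans


sample = ["Sherlock\tB-PER", "Holmes\tI-PER", "become\tO", "widely\tO"]
print(iob_to_spans(sample))  # [('PER', 'Sherlock Holmes')]
```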

## CSV

A [**CSV**](https://en.wikipedia.org/wiki/Comma-separated_values) (comma-separated values) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

**This format is supported for data import**, mainly for row based labeling projects.

**Example**

```
Book Title,Author,Genre
Sherlock Holmes: A Study in Scarlet,Sir Arthur Conan Doyle,Fiction
To Kill a Mockingbird,Harper Lee,Fiction
Alan Turing: The Enigma,Andrew Hodges,Non fiction
Humble Pie,Gordon Ramsay,Non fiction
The Little Prince,Antoine de Saint-Exupéry,Fiction
```
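Because tabular CSV data is expected to have the same number of fields on every line, it can help to check a file before importing it. The following is a small sketch using Python's standard `csv` module; the helper name `rows_with_wrong_field_count` and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Sample data (illustrative only).
CSV_TEXT = (
    "Book Title,Author,Genre\n"
    "Humble Pie,Gordon Ramsay,Non fiction\n"
    "The Little Prince,Antoine de Saint-Exupéry,Fiction\n"
)


def rows_with_wrong_field_count(text: str) -> list[int]:
    """Return 1-based record numbers whose field count differs from the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    expected = len(rows[0])
    return [i for i, row in enumerate(rows, start=1) if len(row) != expected]


print(rows_with_wrong_field_count(CSV_TEXT))  # []
```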

💡 *Currently, a row-based project using the CSV format does not support answers containing <mark style="color:red;">`;`</mark>, because a semicolon is treated as a separator between multiple answers. For example, the answer <mark style="color:red;">`She brings some flowers: rose; sunflower; and daisy.`</mark> is interpreted as three answers: <mark style="color:red;">`She brings some flowers: rose`</mark>, <mark style="color:red;">`sunflower`</mark>, and <mark style="color:red;">`daisy`</mark>.*

## XLS and XLSX

[**XLS**](https://docs.fileformat.com/spreadsheet/xls/) and [**XLSX**](https://wiki.fileformat.com/spreadsheet/xlsx/) are well-known formats for Microsoft Excel documents, both introduced by Microsoft. XLS is the older format used in earlier versions of Excel, while XLSX is the newer format and the default in more recent versions. Both formats allow users to input, organize, and analyze data in rows and columns. They also support features such as formulas, charts, and graphs. XLSX is a more efficient format that offers better data recovery and larger file size limits.

**These formats are supported for data import**, offering a familiar option for row or document based workflows.

**Example**

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-35a711ea0f5f0b096a4adc8f4961a4d8297ec943%2Fbookreview2020%20(1).xlsx?alt=media>" %}
XLSX Sample
{% endfile %}

![](https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-40d7b3481d7103b72fb93e37710b20fc68b4f25b%2Fxlsx%20\(2\).jpg?alt=media)

## JSON

[**JSON**](https://www.json.org/json-en.html) (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value).

A JSON file may contain the following data structures:

* An **object** is an unordered set of name/value pairs.
  * An object begins with `{` (left brace) and ends with `}` (right brace). Each name is followed by `:` (colon), and the name/value pairs are separated by `,` (comma).
* An **array** is an ordered collection of values.
  * An array begins with `[` (left bracket) and ends with `]` (right bracket). Values are separated by `,` (comma).
* A **value** can be a `string` in double quotes, a `number`, `true`, `false`, `null`, an `object`, or an `array`. These structures can be nested.
* A **string** is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. A character is represented as a single character string. A string is very much like a C or Java string.
* A **number** is like a C or Java number, except that the octal and hexadecimal formats are not used.
* **Whitespace** can be inserted between any pair of tokens. Excepting a few encoding details, that completely describes the language.
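To make these structures concrete, the short sketch below parses a JSON string with Python's standard `json` module and shows which Python type each JSON structure becomes. The sample document is made up for this illustration.

```python
import json

# Hypothetical sample document that uses an object, an array, numbers,
# a boolean, and null.
text = '{"name": "sample", "values": [1, 2.5], "ok": true, "missing": null}'

data = json.loads(text)

print(type(data).__name__)            # dict  (JSON object)
print(type(data["values"]).__name__)  # list  (JSON array)
print(data["ok"], data["missing"])    # True None
```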

## Datasaur Schema Format

Datasaur Schema is a customized JSON format designed to fit all available project types in the Datasaur app. This format can be used for a [mixed project type](https://docs.datasaur.ai/data-studio-projects/nlp-task-types/mixed-label-sets), e.g. Token + Document labeling. You will receive all labels and answers combined in one exported file.

**This format is supported for data export**, providing a complete representation of labels, answers, metadata, and user actions in a single file.

A Datasaur Schema contains the following data structures.

1. **version**: version number of Datasaur schema.
2. **Rows Field**
   1. **content:** the text of the sentence.
   2. **tokens:** the tokenized form of the sentence.
   3. **metadata**: contains additional information for a line. You can find the structure and configuration options for metadata [here](https://docs.datasaur.ai/advanced/extensions/metadata).
3. **labelerInfo:** the information about the labeler.
   1. **id:** the unique identifier of a labeler (each labeler has a different id).
   2. **email:** email that labeler used when signing in.
   3. **displayName:** the display name of the email.
4. **labelSets:** contains all the label items that you used for the project.
   1. **index:** the position of the label set in UI
   2. **labelItems**: an array of labelItems for a label set
      1. **id:** id of the label set item
      2. **labelName:** the displayed name of the label set item
      3. **parentId:** id of the parent label set item
      4. **color:** the color of the label set item
5. **labels:** an array of labels for the document. Labels consist of spanLabels, arrowLabels, boundingBoxLabels, and timeLabels.

   1. **spanLabels** are all labels that are applied directly to a token or sentence.
   2. **arrowLabels** are all labels that are applied on top of an arrow.
   3. **boundingBoxLabels** are all labels that are applied on top of OCR documents.
   4. **timeLabels** are all labels that are applied on top of an audio waveform.

   Below are all attributes under labels.

   1. **id:** identifier from the applied label.
   2. **labeledBy:**
      1. `CONFLICT`: if it has not been resolved
      2. `REVIEWER`: if it has been resolved
      3. `AUTO`: if it has been resolved by meeting the consensus
      4. `LABELER`: if it comes from labeler
   3. **labeledByUserId:** the user id of a reviewer
   4. **acceptedByUserId:** the user id of the reviewer who accepted the label. It is null if no user accepted it manually.
   5. **rejectedByUserId:** the user id of the reviewer who rejected the label. It is null if no user rejected it manually.
   6. **status:** the label status. It is `REJECTED` if the label was rejected by a reviewer and `ACCEPTED` if it was accepted by a reviewer.
   7. **hashCode:** Datasaur's code to represent label information.

      For example, `SPAN:gpe:0:0:0:4:0:0:0:4:3:0:undefined:undefined`. Its segments are, in order: `type:label set item id:layer or label set index:start cell line:start cell index:start token index:start char index:end cell line:end cell index:end token index:end char index:counter`.
   8. **textPosition:** the exact location of the labeled text.
      1. **start:** starting text position
         1. **row**: line number
         2. **column**: column number. For token-based projects, this is always 0
         3. **tokenIndex**: token index, relative to the row
         4. **charIndex**: character index, relative to the token
      2. **end**: ending text position
         1. **row**: line number
         2. **column**: column number. For token-based projects, this is always 0
         3. **tokenIndex**: token index, relative to the row
         4. **charIndex**: character index, relative to the token
6. **Arrow label type specific fields**
   1. **originId:** origin id of an arrow label
   2. **destinationId:** destination id of an arrow label
7. **Bounding Box label type specific fields**
   1. **coordinates**: consists of 4 points, each a pair of x and y values.
8. **Timestamp label type specific fields**
   1. **startTimeMillis**: starting timestamp in milliseconds.
   2. **endTimeMillis**: ending timestamp in milliseconds.
9. **comments**: contains all comments that you added to the document.
   1. **id:** the id of the comment
   2. **parentId:** the id of the parent comment - this will be filled if the comment thread has replies.
   3. **hashCode:** Datasaur's code to represent the comment's information, including the value being commented on.
   4. **message:** the content of the comment
   5. **type:** the type of comment, one of `SPAN_LABEL`, `SPAN_TEXT`, `ARROW_LABEL`, or `COMMENT`
   6. **userId**: the id of the user who created the comment
   7. **createdAt:** the time when the user created the comment
10. **documentQuestions**: contains the question set that is used for a document-based project.
    1. **id**: the id of the question
    2. **name**: default name given per question
    3. **description**: question text from the question set.
    4. **type**: type of the question. It can be one of `TEXT`, `DROPDOWN`, `HIERARCHICAL_DROPDOWN`, `NESTED`, `SLIDER`, `DATE`, `TIME`, `CHECKBOX`, or `URL`.
    5. **displayed**: whether the question is shown. True if it’s shown in the extension.
    6. **parentId**: the id of the parent question.
11. **documentAnswerSet**: contains the answers from the question set used.

    The answers consist of pairs of documentQuestion id and answer. Example: `"1": "Good"`, where `1` is the question id and `Good` is the answer to that question.
12. **rowQuestions**: contains the question set that is used for a row-based project.
    1. **id**: the id of the question
    2. **name**: default name given per question
    3. **description**: question text from the question set.
    4. **type**: type of the question. It can be one of `TEXT`, `DROPDOWN`, `HIERARCHICAL_DROPDOWN`, `NESTED`, `SLIDER`, `DATE`, `TIME`, `CHECKBOX`, or `URL`.
    5. **displayed**: whether the question is shown. True if it’s shown in the extension.
    6. **parentId**: the id of the parent question.
13. **rowAnswerSets**: contains all answers from a row-based project. It consists of the row number, rowQuestions ids, and answers.

    Example:

    ```json
    [
      5,
      {
        "1": [
          {
            "2": "A",
            "3": "B"
          }
        ]
      }
    ]
    ```

    Here, `5` is the row number and `1` is the id of the parent question of `2` and `3`. `2` and `3` are question ids, and `A` and `B` are their respective answers.
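The nested shape of a rowAnswerSets entry can be flattened into one tuple per answer. The sketch below is only an illustration: the function name `flatten_row_answers` is hypothetical, and it assumes that each entry is a `[row number, answers]` pair and that non-nested questions map directly to their answer value.

```python
def flatten_row_answers(row_answer_sets):
    """Return (row, parent question id, question id, answer) tuples."""
    flat = []
    for row_number, answers in row_answer_sets:
        for question_id, value in answers.items():
            if isinstance(value, list) and all(isinstance(v, dict) for v in value):
                # Nested question: each dict maps a child question id to its answer.
                for group in value:
                    for child_id, child_answer in group.items():
                        flat.append((row_number, question_id, child_id, child_answer))
            else:
                # Assumption: a top-level question holds its answer directly.
                flat.append((row_number, None, question_id, value))
    return flat


row_answer_sets = [[5, {"1": [{"2": "A", "3": "B"}]}]]
print(flatten_row_answers(row_answer_sets))
# [(5, '1', '2', 'A'), (5, '1', '3', 'B')]
```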

**Examples**

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-68b0858ca9c4f6e27ce7c8fe32fd98adb59be9ac%2FDatasaur%20Schema%20Row%20Labeling?alt=media>" %}
Sample Datasaur Schema Row
{% endfile %}

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-acb649df9660bbc45711f910dd9123fe43b07ec0%2FDatasaur%20sample%20-%20Schema%20(token%20%2B%20document%20labeling).json?alt=media&token=a7466c50-3c5e-4968-9c2e-24154f8f5027>" %}
Sample Datasaur Schema Token + Document
{% endfile %}

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-892bf466ca8ea99a045aaa3dc585c9eacf314676%2FDatasaur%20Schema%20Bounding%20Box%20Labeling%20(1).json?alt=media>" %}
Sample Datasaur Schema Bounding Box
{% endfile %}

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-63e89f6a1eb43a3b89eb11cd088b716366998a71%2FDatasaur%20sample%20-%20Schema%20(bbox%20labeling%20with%20custom%20attributes).json?alt=media>" %}
Sample Datasaur Schema Bounding Box with Custom Attributes
{% endfile %}
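The span `hashCode` layout described in the labels attributes can be split into named fields. The sketch below is a hedged illustration rather than an official parser: the function name and the field names are chosen for this example, and it assumes that a label set item id never contains a colon. Segments after the counter (such as the trailing `undefined` values) are kept unparsed.

```python
# Field names are illustrative; they follow the order given in the description.
SPAN_FIELDS = [
    "type", "labelSetItemId", "layer",
    "startCellLine", "startCellIndex", "startTokenIndex", "startCharIndex",
    "endCellLine", "endCellIndex", "endTokenIndex", "endCharIndex",
    "counter",
]


def parse_span_hash_code(hash_code: str) -> dict:
    """Split a SPAN hashCode into its documented segments."""
    parts = hash_code.split(":")  # assumption: ids contain no ':'
    if parts[0] != "SPAN" or len(parts) < len(SPAN_FIELDS):
        raise ValueError(f"not a span hashCode: {hash_code!r}")
    parsed = dict(zip(SPAN_FIELDS, parts))
    for name in SPAN_FIELDS[2:]:
        parsed[name] = int(parsed[name])
    # Anything after the counter is not described here, so keep it as-is.
    parsed["extra"] = parts[len(SPAN_FIELDS):]
    return parsed


info = parse_span_hash_code("SPAN:gpe:0:0:0:4:0:0:0:4:3:0:undefined:undefined")
print(info["labelSetItemId"], info["startTokenIndex"], info["endCharIndex"])  # gpe 4 3
```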

## Conversational JSON

**Conversational JSON** is a format Datasaur created to support conversational or chat-like data. Each file should represent a conversation or chat log containing multiple <mark style="color:red;">`message`</mark>s.

Each <mark style="color:red;">`message`</mark> object has the following properties:

1. <mark style="color:red;">`content`</mark> : String content of the message. Required.
2. <mark style="color:red;">`speaker`</mark> : Identifies who said or sent the message. Required.
3. <mark style="color:red;">`color`</mark> : Optional. If provided, Datasaur will use the color to render the speaker’s avatar.
4. <mark style="color:red;">`alignment`</mark> : Required. One of <mark style="color:red;">`LEFT`</mark> or <mark style="color:red;">`RIGHT`</mark>. Currently, there is no visual difference between them. In the future, we plan to use this information to place the message boxes accordingly.
5. <mark style="color:red;">`indent`</mark> : Required. Integer between 0 and 4. Currently, there is no visual difference between values. In the future, we plan to use this information to place messages in a thread-like view.

**This format is supported for data import** in conversational labeling projects.

**Example**:

```json
{
  "messages": [
    {
      "content": "So you are actually Victor? How did this happen?",
      "speaker": "Person 1",
      "color": "red",
      "alignment": "LEFT",
      "indent": 0
    },
    {
      "content": "I'm not sure... I started getting memories of Isabella and my past life when I was 10. That's when I started visiting the manor secretly.",
      "speaker": "Person 2",
      "color": "cyan",
      "alignment": "LEFT",
      "indent": 0
    }
  ]
}
```
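Before importing, a file can be checked against the property rules listed above. The following sketch is an assumption-laden illustration, not Datasaur's validation logic: `validate_message` is a hypothetical helper, and it treats the 0 to 4 range for `indent` as inclusive.

```python
def validate_message(message: dict) -> list[str]:
    """Return a list of problems found in one message object (empty if none)."""
    errors = []
    if not isinstance(message.get("content"), str):
        errors.append("content must be a string")
    if "speaker" not in message:
        errors.append("speaker is required")
    if message.get("alignment") not in ("LEFT", "RIGHT"):
        errors.append("alignment must be LEFT or RIGHT")
    indent = message.get("indent")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(indent, int) or isinstance(indent, bool) or not 0 <= indent <= 4:
        errors.append("indent must be an integer between 0 and 4")
    # color is optional, so it is not checked here.
    return errors


message = {"content": "Hello", "speaker": "Person 1", "alignment": "LEFT", "indent": 0}
print(validate_message(message))  # []
```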

## JSON Simplified <a href="#json_simplified" id="json_simplified"></a>

JSON Simplified is a format for [Span labeling](https://docs.datasaur.ai/data-studio-projects/lets-get-labeling/span-based) projects. It contains both the text and the labeled spans, along with character indices. It’s suitable for simpler workflows where each sentence is expected to be self-contained and isolated from the others.

Datasaur recognizes the following fields, illustrated in the example below.

* **text:** the sentence.
* **entities:** an array of the labels applied.
  * **text:** the labeled text.
  * **type:** the label applied.
  * **start\_idx:** the character position in the sentence where the labeled text starts.
    * The character position uses a zero-based index.
  * **end\_idx:** the position of the last character + 1 (because end\_idx does not include the last character).
    * The character position uses a zero-based index.

**This format is supported for both data import and export**, ideal for simple workflows where spans are isolated per sentence.

**JSON (Simplified) export format limitation:**

* Arrow labels can't be exported.
* Labels that span multiple sentences can't be exported.

**Example**

```javascript
[
  {
    "text": "The new series Narcos created by Chris Brancato , Eric Newman and Carlo Bernard , represents a pretty ambitious step for Netflix .",
    "entities": [
      {
        "text": "Narcos",
        "type": "TITLE",
        "start_idx": 15,
        "end_idx": 21
      },
      {
        "text": "Chris Brancato",
        "type": "PER",
        "start_idx": 33,
        "end_idx": 47
      },
      {
        "text": "Carlo Bernard",
        "type": "PER",
        "start_idx": 66,
        "end_idx": 79
      },
      {
        "text": "Eric Newman",
        "type": "PER",
        "start_idx": 50,
        "end_idx": 61
      },
      {
        "text": "Netflix",
        "type": "ORG",
        "start_idx": 121,
        "end_idx": 128
      }
    ]
  }
]
```
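Since `start_idx` is zero-based and `end_idx` is exclusive, slicing the sentence with the two indices should return the entity text. The sketch below checks this for two entities from the example; the helper name `mismatched_entities` is hypothetical.

```python
record = {
    "text": "The new series Narcos created by Chris Brancato , Eric Newman and Carlo Bernard , represents a pretty ambitious step for Netflix .",
    "entities": [
        {"text": "Narcos", "type": "TITLE", "start_idx": 15, "end_idx": 21},
        {"text": "Netflix", "type": "ORG", "start_idx": 121, "end_idx": 128},
    ],
}


def mismatched_entities(rec: dict) -> list[dict]:
    """Return entities whose text differs from rec["text"][start_idx:end_idx]."""
    return [
        entity
        for entity in rec["entities"]
        if rec["text"][entity["start_idx"]:entity["end_idx"]] != entity["text"]
    ]


print(mismatched_entities(record))  # []
```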

## JSON Advanced <a href="#json_advanced" id="json_advanced"></a>

JSON Advanced is a proprietary Datasaur format designed in collaboration with our users to capture all possible data. This format is commonly used for partial token labeling projects. You can also use it when exporting token-based projects with arrows, such as coreference and dependency projects.

**This format is supported for both data import and export** in span based projects (with or without arrows).

A JSON\_ADVANCED file may contain the following data structures:

1. **Sentences field**
   1. **id:** the sentence position.
   2. **content:** the text of the sentence.
   3. **tokens:** the tokenized form of the sentence.
   4. **labels**
      1. **l:** the label applied.
      2. **layer:** the layer position of the label. This field is reserved for projects that label with multiple tag sets at once. For now, you can disregard this field; it is always set to 0.
      3. **id:** the unique identifier of a label.
         1. If the id has 9 segments, it is a **span label**. For example, `INNM0ViFwo8LluMTaTIK9:0:0:14:0:0:18:6:0`, which follows the layout `<label set item id>:<layer>:<sidS>:<s>:<charS>:<sidE>:<e>:<charE>:<index>`.
         2. If the id has 21 segments, it is an **arrow label**. For example, `tfc1FkbbEk9fOLx6haR1s:0:INNM0ViFwo8LluMTaTIK9:0:0:14:0:0:18:6:0:Oq_VuB0s_N7D8ZY0rgYsg:0:0:0:0:0:2:5:0:0`, which follows the layout `<label set item id>:<arrow layer>:<origin id>:<destination id>:<arrow index>`, where the origin id and destination id are each a 9-segment span label id.
      4. **hashCode:** Datasaur's code to represent label information.
         1. Span label. For example, `SPAN:gpe:0:0:0:4:0:0:0:4:3:0:undefined:undefined`. Below is the explanation:
            1. `type:label set item id:layer or label set index:start cell line:start cell index:start token index:start char index:end cell line:end cell index:end token index:end char index:counter`
         2. Arrow label. For example, `ARROW:dyC-o1HBnn49dcqDSphmJ:1:0:0:0:0:0:0:10:6:0:SPAN:geo:0:0:0:0:0:0:0:0:4:0:undefined:undefined:SPAN:geo:0:0:0:10:0:0:0:10:6:0:undefined:undefined`. Below is the explanation:
            1. `type:label set item id:layer or label set index:start cell line:start cell index:start token index:start char index:end cell line:end cell index:end token index:end char index:counter:<span label: origin>:<span label: destination>`
      5. **documentId:** the id of document.
      6. **sidS, sidE:** the sentence starting and ending position of a label in 0-based index. In Datasaur, it is possible that a label spans across sentences.
      7. **s:** the token starting position of a label in the starting sentence in 0-based index.
      8. **e:** the token ending position of a label in the ending sentence in 0-based index.
      9. **charS:** the character starting position of a label in the starting token in 0-based index.
      10. **charE:** the character ending position of a label in the ending token in 0-based index.
   5. **metadata**: additional information for a cell. You can find the structure and configuration options for metadata [here](https://docs.datasaur.ai/advanced/extensions/metadata).
2. **labelerInfo:** the information about the labeler.
   1. **id:** the unique identifier of a labeler (each labeler has a different id).
   2. **email:** email that labeler used when signing in.
   3. **displayName:** the display name of the email.
3. **labelSets:** contains all the label items that you used for the project.
   1. **index:** the position of the label set in UI
   2. **labelItems**: an array of labelItems for a label set
      * **id:** id of the labelSetItem
      * **labelName:** the displayed name of the label set item
      * **parentId:** id of the parent label set item
      * **color:** the color of the label set item
4. **labels:** an array of labels for the document
   1. **labelText:** label content for row-based projects. It is null for all other project types.
   2. **id:** identifier from the applied label.
   3. **documentId:** identifier for the document where the label is applied.
   4. **startCellLine:** starting line sentence position
   5. **startCellIndex:** starting line column position
   6. **startTokenIndex:** starting token index position
   7. **startCharIndex:** starting character index position (relative to tokenIndex, start from 0 again when tokenIndex incremented)
   8. **endCellLine:** ending line sentence position
   9. **endCellIndex:** ending line column position
   10. **endTokenIndex:** ending token index position
   11. **endCharIndex:** ending character index position
   12. **layer:** the layer where the token is positioned
   13. **counter:** distinguishes labels with the same name that are placed multiple times at the same position; starts from 0
   14. **type:** the type of the label: `SPAN`, `ARROW`, or `BOUNDING_BOX`
   15. **createdAt:**
       1. **Labeler:** the time the label was applied
       2. **Reviewer:** the time the label was accepted
   16. **updatedAt:** last update timestamp on the label
   17. **Review related fields**
       1. **acceptedByUserId:** the user id of the reviewer who accepted the label. It is null if no user accepted it manually.
       2. **rejectedByUserId:** the user id of the reviewer who rejected the label. It is null if no user rejected it manually.
       3. **labeledByUserId:** the user id of a reviewer
       4. **labeledBy:**
          * CONFLICT if it has not been resolved
          * REVIEWER if it has been resolved
          * AUTO if it has been resolved by meeting the consensus
   18. **Arrow label type specific fields**
       1. **originId:** origin id of an arrow label
       2. **originNumber:** auto increment ID for origin
       3. **destinationId:** destination id of an arrow label
       4. **destinationNumber:** auto increment ID for destination
   19. **Bounding box label type specific fields**
       1. **pageIndex:** index of the page if the document contains multiple pages
       2. **nodeCount:** total number of the bounding box points
       3. **x0:** x coordinate of top left position of the bounding box
       4. **y0:** y coordinate of top left position of the bounding box
       5. **x1:** x coordinate of top right position of the bounding box
       6. **y1:** y coordinate of top right position of the bounding box
       7. **x2:** x coordinate of bottom right position of the bounding box
       8. **y2:** y coordinate of bottom right position of the bounding box
       9. **x3:** x coordinate of bottom left position of the bounding box
       10. **y3:** y coordinate of bottom left position of the bounding box
   20. **pages:** an array of page information for OCR projects
       1. **pageIndex:** index of the page if the document contains multiple pages
       2. **pageHeight:** original page height in pixels
       3. **pageWidth:** original page width in pixels
5. **comments**
   1. **id:** the id of the comment
   2. **parentId:** the id of the parent comment - this will be filled if the comment thread has replies.
   3. **hashCode:** Datasaur's code to represent the comment's information, including the value being commented on
   4. **message:** the content of the comment
   5. **type:** the type of comment, one of `SPAN_LABEL`, `SPAN_TEXT`, `ARROW_LABEL`, or `CELL_LABEL`
   6. **userId**: the id of the user who created the comment
   7. **createdAt:** the time when the user created the comment
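The 9-segment and 21-segment label id layouts described under the sentence labels can be decoded by counting segments. The sketch below is a minimal illustration under stated assumptions: the function names and the `kind` key are made up for this example, and it assumes that label set item ids never contain a colon.

```python
POSITION_FIELDS = ["sidS", "s", "charS", "sidE", "e", "charE", "index"]


def _parse_span_segments(parts: list[str]) -> dict:
    """Decode the 9 segments of a span label id."""
    span = {"labelSetItemId": parts[0], "layer": int(parts[1])}
    span.update({name: int(value) for name, value in zip(POSITION_FIELDS, parts[2:])})
    return span


def parse_label_id(label_id: str) -> dict:
    """Decode a span (9 segments) or arrow (21 segments) label id."""
    parts = label_id.split(":")  # assumption: ids contain no ':'
    if len(parts) == 9:
        return {"kind": "span", **_parse_span_segments(parts)}
    if len(parts) == 21:
        return {
            "kind": "arrow",
            "labelSetItemId": parts[0],
            "layer": int(parts[1]),
            "origin": _parse_span_segments(parts[2:11]),
            "destination": _parse_span_segments(parts[11:20]),
            "index": int(parts[20]),
        }
    raise ValueError(f"unexpected number of segments: {len(parts)}")


span = parse_label_id("INNM0ViFwo8LluMTaTIK9:0:0:14:0:0:18:6:0")
print(span["s"], span["e"], span["charE"])  # 14 18 6
```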

**Example (token-based with arrow)**

![](https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-607e85d6a9ff15690eab419a31a0d2fd9887a92b%2Fimage%20\(233\)%20\(1\)%20\(1\)%20\(1\).png?alt=media)

```javascript
{
  "sentences": [
    {
      "id": 0,
      "content": "The Little Prince is a novella by French aristocrat , writer , and aviator Antoine de Saint - Exupéry .",
      "tokens": [
        "The",
        "Little",
        "Prince",
        "is",
        "a",
        "novella",
        "by",
        "French",
        "aristocrat",
        ",",
        "writer",
        ",",
        "and",
        "aviator",
        "Antoine",
        "de",
        "Saint",
        "-",
        "Exupéry",
        "."
      ],
      "labels": [
        {
          "id": "vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:2:5:0",
          "l": "vGOy0ZKA-2rqK7netKz9I",
          "layer": 0,
          "deleted": false,
          "labeledBy": "PRELABELED",
          "createdAt": "2024-08-05T20:23:47.284Z",
          "updatedAt": "2024-08-05T20:23:47.606Z",
          "hashCode": "SPAN:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:0:0:2:5:0:undefined:undefined",
          "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
          "confidenceScore": null,
          "sidS": 0,
          "s": 0,
          "charS": 0,
          "sidE": 0,
          "e": 2,
          "charE": 5
        },
        {
          "id": ":0:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:2:5:0:9I-5oYKvnzJRWHgsZrDe_:0:0:14:0:0:18:6:0:0",
          "l": "",
          "layer": 0,
          "deleted": false,
          "labeledBy": "LABELER",
          "labeledByUserId": 4500,
          "createdAt": "2024-08-05T20:26:12.466Z",
          "updatedAt": "2024-08-05T20:26:12.611Z",
          "hashCode": "ARROW::0:0:0:0:0:0:0:18:6:0:SPAN:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:0:0:2:5:0:undefined:undefined:SPAN:9I-5oYKvnzJRWHgsZrDe_:0:0:0:14:0:0:0:18:6:0:undefined:undefined",
          "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
          "confidenceScore": null,
          "sidS": 0,
          "s": 0,
          "charS": 0,
          "sidE": 0,
          "e": 18,
          "charE": 6
        },
        {
          "id": "A-92mMT_WppaawOOBJbjt:1:9I-5oYKvnzJRWHgsZrDe_:0:0:14:0:0:18:6:0:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:2:5:0:0",
          "l": "A-92mMT_WppaawOOBJbjt",
          "layer": 1,
          "deleted": false,
          "labeledBy": "PRELABELED",
          "createdAt": "2024-08-05T20:23:47.291Z",
          "updatedAt": "2024-08-05T20:23:47.622Z",
          "hashCode": "ARROW:A-92mMT_WppaawOOBJbjt:1:0:0:0:0:0:0:18:6:0:SPAN:9I-5oYKvnzJRWHgsZrDe_:0:0:0:14:0:0:0:18:6:0:undefined:undefined:SPAN:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:0:0:2:5:0:undefined:undefined",
          "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
          "confidenceScore": null,
          "sidS": 0,
          "s": 0,
          "charS": 0,
          "sidE": 0,
          "e": 18,
          "charE": 6
        },
        {
          "id": "9I-5oYKvnzJRWHgsZrDe_:0:0:14:0:0:18:6:0",
          "l": "9I-5oYKvnzJRWHgsZrDe_",
          "layer": 0,
          "deleted": false,
          "labeledBy": "PRELABELED",
          "createdAt": "2024-08-05T20:23:47.284Z",
          "updatedAt": "2024-08-05T20:23:47.606Z",
          "hashCode": "SPAN:9I-5oYKvnzJRWHgsZrDe_:0:0:0:14:0:0:0:18:6:0:undefined:undefined",
          "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
          "confidenceScore": null,
          "sidS": 0,
          "s": 14,
          "charS": 0,
          "sidE": 0,
          "e": 18,
          "charE": 6
        }
      ],
      "metadata": []
    }
  ],
  "labelSets": [
    {
      "index": 0,
      "labelItems": [
        {
          "id": "vGOy0ZKA-2rqK7netKz9I",
          "labelName": "Novel"
        },
        {
          "id": "9I-5oYKvnzJRWHgsZrDe_",
          "labelName": "Male"
        },
        {
          "id": "A-92mMT_WppaawOOBJbjt",
          "labelName": "Author"
        }
      ]
    }
  ],
  "labels": [
    {
      "labelText": null,
      "id": "1685710595",
      "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
      "labeledByUserId": null,
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 2,
      "endCharIndex": 5,
      "layer": 0,
      "counter": 0,
      "labeledBy": "PRELABELED",
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "originNumber": "0",
      "destinationNumber": "0",
      "type": "SPAN",
      "labelSetItemId": "vGOy0ZKA-2rqK7netKz9I",
      "status": "ACCEPTED",
      "confidenceScore": null,
      "createdAt": "2024-08-05T20:23:47.284Z",
      "updatedAt": "2024-08-05T20:23:47.606Z",
      "hashCode": "SPAN:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:0:0:2:5:0:undefined:undefined"
    },
    {
      "labelText": null,
      "id": "1685710604",
      "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
      "labeledByUserId": 4500,
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 18,
      "endCharIndex": 6,
      "layer": 0,
      "counter": 0,
      "labeledBy": "LABELER",
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "originId": "1685710595",
      "originNumber": "1685710595",
      "destinationId": "1685710596",
      "destinationNumber": "1685710596",
      "type": "ARROW",
      "labelSetItemId": "",
      "status": "ACCEPTED",
      "confidenceScore": null,
      "createdAt": "2024-08-05T20:26:12.466Z",
      "updatedAt": "2024-08-05T20:26:12.611Z",
      "origin": {
        "labelText": null,
        "id": "1685710595",
        "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
        "labeledByUserId": null,
        "startCellIndex": 0,
        "startCellLine": 0,
        "startTokenIndex": 0,
        "startCharIndex": 0,
        "endCellIndex": 0,
        "endCellLine": 0,
        "endTokenIndex": 2,
        "endCharIndex": 5,
        "layer": 0,
        "counter": 0,
        "labeledBy": "PRELABELED",
        "acceptedByUserId": null,
        "rejectedByUserId": null,
        "originNumber": "0",
        "destinationNumber": "0",
        "type": "SPAN",
        "labelSetItemId": "vGOy0ZKA-2rqK7netKz9I",
        "status": "ACCEPTED",
        "confidenceScore": null,
        "createdAt": "2024-08-05T20:23:47.284Z",
        "updatedAt": "2024-08-05T20:23:47.606Z"
      },
      "destination": {
        "labelText": null,
        "id": "1685710596",
        "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
        "labeledByUserId": null,
        "startCellIndex": 0,
        "startCellLine": 0,
        "startTokenIndex": 14,
        "startCharIndex": 0,
        "endCellIndex": 0,
        "endCellLine": 0,
        "endTokenIndex": 18,
        "endCharIndex": 6,
        "layer": 0,
        "counter": 0,
        "labeledBy": "PRELABELED",
        "acceptedByUserId": null,
        "rejectedByUserId": null,
        "originNumber": "0",
        "destinationNumber": "0",
        "type": "SPAN",
        "labelSetItemId": "9I-5oYKvnzJRWHgsZrDe_",
        "status": "ACCEPTED",
        "confidenceScore": null,
        "createdAt": "2024-08-05T20:23:47.284Z",
        "updatedAt": "2024-08-05T20:23:47.606Z"
      },
      "hashCode": "ARROW::0:0:0:0:0:0:0:18:6:0:SPAN:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:0:0:2:5:0:undefined:undefined:SPAN:9I-5oYKvnzJRWHgsZrDe_:0:0:0:14:0:0:0:18:6:0:undefined:undefined"
    },
    {
      "labelText": null,
      "id": "1685710597",
      "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
      "labeledByUserId": null,
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 18,
      "endCharIndex": 6,
      "layer": 1,
      "counter": 0,
      "labeledBy": "PRELABELED",
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "originId": "1685710596",
      "originNumber": "1685710596",
      "destinationId": "1685710595",
      "destinationNumber": "1685710595",
      "type": "ARROW",
      "labelSetItemId": "A-92mMT_WppaawOOBJbjt",
      "status": "ACCEPTED",
      "confidenceScore": null,
      "createdAt": "2024-08-05T20:23:47.291Z",
      "updatedAt": "2024-08-05T20:23:47.622Z",
      "origin": {
        "labelText": null,
        "id": "1685710596",
        "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
        "labeledByUserId": null,
        "startCellIndex": 0,
        "startCellLine": 0,
        "startTokenIndex": 14,
        "startCharIndex": 0,
        "endCellIndex": 0,
        "endCellLine": 0,
        "endTokenIndex": 18,
        "endCharIndex": 6,
        "layer": 0,
        "counter": 0,
        "labeledBy": "PRELABELED",
        "acceptedByUserId": null,
        "rejectedByUserId": null,
        "originNumber": "0",
        "destinationNumber": "0",
        "type": "SPAN",
        "labelSetItemId": "9I-5oYKvnzJRWHgsZrDe_",
        "status": "ACCEPTED",
        "confidenceScore": null,
        "createdAt": "2024-08-05T20:23:47.284Z",
        "updatedAt": "2024-08-05T20:23:47.606Z"
      },
      "destination": {
        "labelText": null,
        "id": "1685710595",
        "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
        "labeledByUserId": null,
        "startCellIndex": 0,
        "startCellLine": 0,
        "startTokenIndex": 0,
        "startCharIndex": 0,
        "endCellIndex": 0,
        "endCellLine": 0,
        "endTokenIndex": 2,
        "endCharIndex": 5,
        "layer": 0,
        "counter": 0,
        "labeledBy": "PRELABELED",
        "acceptedByUserId": null,
        "rejectedByUserId": null,
        "originNumber": "0",
        "destinationNumber": "0",
        "type": "SPAN",
        "labelSetItemId": "vGOy0ZKA-2rqK7netKz9I",
        "status": "ACCEPTED",
        "confidenceScore": null,
        "createdAt": "2024-08-05T20:23:47.284Z",
        "updatedAt": "2024-08-05T20:23:47.606Z"
      },
      "hashCode": "ARROW:A-92mMT_WppaawOOBJbjt:1:0:0:0:0:0:0:18:6:0:SPAN:9I-5oYKvnzJRWHgsZrDe_:0:0:0:14:0:0:0:18:6:0:undefined:undefined:SPAN:vGOy0ZKA-2rqK7netKz9I:0:0:0:0:0:0:0:2:5:0:undefined:undefined"
    },
    {
      "labelText": null,
      "id": "1685710596",
      "documentId": "e19c1143-a6ea-490b-ab9f-2a0d7bdafa3e",
      "labeledByUserId": null,
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 14,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 18,
      "endCharIndex": 6,
      "layer": 0,
      "counter": 0,
      "labeledBy": "PRELABELED",
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "originNumber": "0",
      "destinationNumber": "0",
      "type": "SPAN",
      "labelSetItemId": "9I-5oYKvnzJRWHgsZrDe_",
      "status": "ACCEPTED",
      "confidenceScore": null,
      "createdAt": "2024-08-05T20:23:47.284Z",
      "updatedAt": "2024-08-05T20:23:47.606Z",
      "hashCode": "SPAN:9I-5oYKvnzJRWHgsZrDe_:0:0:0:14:0:0:0:18:6:0:undefined:undefined"
    }
  ]
}
```

**Example (token-based with character-based labeling)**

![](https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-35e12fb17ac125e68573230b28a749a6f9db356b%2Fimage%20\(233\).png?alt=media)

```javascript
{
    "sentences": [
        {
            "id": 0,
            "content": "The Little Prince is a novella by French aristocrat , writer , and aviator Antoine de Saint - Exupéry .",
            "tokens": ["The","Little","Prince","is","a","novella","by","French","aristocrat",",","writer",",","and","aviator","Antoine","de","Saint","-","Exupéry","."
            ],
            "labels": [
                {
                    "layer": 2,
                    "sidS": 0,
                    "s": 5,
                    "charS": 0,
                    "sidE": 0,
                    "e": 5,
                    "charE": 4,
                    "l": "dKXDeLxSHz1wZdXvA5yQz",
                    "id": "dKXDeLxSHz1wZdXvA5yQz:2:0:5:0:0:5:4:0",
                    "deleted": false,
                    "labeledBy": "LABELER",
                    "labeledByUserId": 752,
                    "hashCode": "dKXDeLxSHz1wZdXvA5yQz:2:0:0:5:0:0:0:5:4:0:SPAN:undefined:undefined",
                    "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
                    "comments": []
                }
            ]
        }
    ],
    "labelSets": [
        {
            "labelItems": [
                {
                    "id": "O1T-l9CGbonHyxj0GOtAo",
                    "labelName": "Noun phrase",
                    "parentId": null,
                    "color": "#ff8000"
                },
                {
                    "id": "dKXDeLxSHz1wZdXvA5yQz",
                    "labelName": "NN",
                    "parentId": "O1T-l9CGbonHyxj0GOtAo",
                    "color": "#ff8000"
                },
                {
                    "id": "pWUd_Sa1bAiFe38MzU8OL",
                    "labelName": "NNP",
                    "parentId": "O1T-l9CGbonHyxj0GOtAo",
                    "color": "#ff8000"
                },
                {
                    "id": "6zUYagMuBuYmal8zITzqZ",
                    "labelName": "Verb phrase",
                    "parentId": null,
                    "color": "#df3920"
                },
                {
                    "id": "-8zD8jA9XRKQJBNL5snmp",
                    "labelName": "VBT",
                    "parentId": "6zUYagMuBuYmal8zITzqZ",
                    "color": "#df3920"
                },
                {
                    "id": "dZ9UplPt07D97EmrV5Dpn",
                    "labelName": "VBD",
                    "parentId": "6zUYagMuBuYmal8zITzqZ",
                    "color": "#df3920"
                },
                {
                    "id": "m59PdwZSqx50OY4K58vCw",
                    "labelName": "VBN",
                    "parentId": "6zUYagMuBuYmal8zITzqZ",
                    "color": "#df3920"
                },
                {
                    "id": "BCn1q1clVI9oAje2boyLX",
                    "labelName": "VBI",
                    "parentId": "6zUYagMuBuYmal8zITzqZ",
                    "color": "#df3920"
                },
                {
                    "id": "BwjMomxD2E_UJoCNuy_IL",
                    "labelName": "VB",
                    "parentId": "6zUYagMuBuYmal8zITzqZ",
                    "color": "#df3920"
                }
            ]
        }
    ],
    "labels": [
        {
            "labelText": null,
            "id": "509007771",
            "documentId": "fc324eb5-3cf4-4a16-baa0-954d1d2e13c8",
            "labeledByUserId": 752,
            "startCellIndex": 0,
            "startCellLine": 0,
            "startTokenIndex": 5,
            "startCharIndex": 0,
            "endCellIndex": 0,
            "endCellLine": 0,
            "endTokenIndex": 5,
            "endCharIndex": 4,
            "layer": 2,
            "counter": 0,
            "labeledBy": "LABELER",
            "acceptedByUserId": null,
            "rejectedByUserId": null,
            "originId": null,
            "originNumber": "0",
            "destinationId": null,
            "destinationNumber": "0",
            "type": "SPAN",
            "labelSetItemId": "dKXDeLxSHz1wZdXvA5yQz",
            "status": "ACCEPTED",
            "createdAt": "2021-09-03T09:22:58.456Z",
            "updatedAt": "2021-09-03T09:22:58.512Z"
        }
    ]
}
```
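The relationship between `tokens`, the `s`/`e` token indices, and the `charS`/`charE` character indices can be illustrated with a minimal Python sketch. It is not part of Datasaur's tooling: the helper names `label_names` and `labeled_text` are hypothetical, and it assumes that `charE` is inclusive (as the sample values suggest), that a label stays within one sentence, and that tokens are joined with single spaces.

```python
# Minimal sketch: recover the text covered by a span label in the schema above.
# Assumptions (not confirmed by the format description): `charE` is inclusive,
# the label stays within one sentence, and tokens are joined with single spaces.


def label_names(label_sets: list) -> dict:
    """Map each label item id to its labelName."""
    return {
        item["id"]: item["labelName"]
        for label_set in label_sets
        for item in label_set["labelItems"]
    }


def labeled_text(sentence: dict, label: dict) -> str:
    """Return the characters covered by one span label."""
    tokens = sentence["tokens"]
    start, end = label["s"], label["e"]
    if start == end:
        # Label starts and ends inside the same token.
        return tokens[start][label["charS"]: label["charE"] + 1]
    first = tokens[start][label["charS"]:]
    last = tokens[end][: label["charE"] + 1]
    return " ".join([first, *tokens[start + 1: end], last])


# Trimmed-down copy of the example above.
sentence = {
    "tokens": ["The", "Little", "Prince", "is", "a", "novella"],
    "labels": [
        {"s": 5, "charS": 0, "e": 5, "charE": 4, "l": "dKXDeLxSHz1wZdXvA5yQz"}
    ],
}
label_sets = [
    {"labelItems": [{"id": "dKXDeLxSHz1wZdXvA5yQz", "labelName": "NN"}]}
]

names = label_names(label_sets)
for label in sentence["labels"]:
    print(names[label["l"]], "->", labeled_text(sentence, label))  # NN -> novel
```

Under these assumptions, the label in the example covers the first five characters of the token `novella`.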

**Example (token-based with bounding-box labeling)**

![](https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-3042164d00b871a4524f7d1f3d878085f0a58589%2FScreen%20Shot%202021-07-15%20at%2010.41.26%20\(1\).png?alt=media)

```javascript
{
  "sentences": [
    {
      "id": 0,
      "content": "73",
      "tokens": ["73"],
      "labels": [
        {
          "layer": 0,
          "sidS": 0,
          "s": 0,
          "charS": 0,
          "sidE": 0,
          "e": 0,
          "charE": 1,
          "l": "fDsCQJFyWy5LnMPtHK4DC",
          "id": "fDsCQJFyWy5LnMPtHK4DC:0:0:0:0:0:0:1:0",
          "deleted": false,
          "labeledBy": "LABELER",
          "labeledByUserId": 1,
          "hashCode": "fDsCQJFyWy5LnMPtHK4DC:0:0:0:0:0:0:0:0:1:0:SPAN:undefined:undefined",
          "documentId": "dac38af2-cfb3-4007-b2ca-302dc8c450fe",
          "comments": []
        }
      ]
    }
  ],
  "labelSets": [
    {
      "labelItems": [
        { "id": "fDsCQJFyWy5LnMPtHK4DC", "labelName": "Queue number" }
      ]
    }
  ],
  "labels": [
    {
      "labeledBy": "PRELABELED",
      "labeledByUserId": null,
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "type": "BOUNDING_BOX",
      "status": "LABELED",
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 0,
      "endCharIndex": 1,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "x0": 228,
      "y0": 114,
      "x1": 286,
      "y1": 114,
      "x2": 286,
      "y2": 158,
      "x3": 228,
      "y3": 158,
      "nodeCount": 4
    },
    {
      "labelText": null,
      "labeledByUserId": 1,
      "startCellIndex": 0,
      "startCellLine": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellIndex": 0,
      "endCellLine": 0,
      "endTokenIndex": 0,
      "endCharIndex": 1,
      "layer": 0,
      "counter": 0,
      "labeledBy": "LABELER",
      "acceptedByUserId": null,
      "rejectedByUserId": null,
      "originId": null,
      "originNumber": "0",
      "destinationId": null,
      "destinationNumber": "0",
      "type": "SPAN",
      "labelSetItemId": "fDsCQJFyWy5LnMPtHK4DC",
      "status": "ACCEPTED"
    }
  ],
  "pages": [
    {
      "pageIndex": 0,
      "pageHeight": 619,
      "pageWidth": 551
    }
  ]
}
```
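As a rough illustration of how the corner coordinates of a `BOUNDING_BOX` label can be consumed, the Python sketch below reduces the `x0`/`y0` … `x3`/`y3` points to an axis-aligned rectangle. The helper name `bounding_box_bounds` is hypothetical, and the sketch assumes that `nodeCount` gives the number of corner points stored as `x{i}`/`y{i}` pairs.

```python
# Minimal sketch: reduce the corner points of a BOUNDING_BOX label to
# (x_min, y_min, x_max, y_max).
# Assumption: `nodeCount` is the number of x{i}/y{i} corner pairs in the label.


def bounding_box_bounds(label: dict) -> tuple:
    """Return the axis-aligned bounds of a bounding-box label, in pixels."""
    count = label["nodeCount"]
    xs = [label[f"x{i}"] for i in range(count)]
    ys = [label[f"y{i}"] for i in range(count)]
    return min(xs), min(ys), max(xs), max(ys)


# Coordinates copied from the example above.
label = {
    "type": "BOUNDING_BOX",
    "pageIndex": 0,
    "x0": 228, "y0": 114,
    "x1": 286, "y1": 114,
    "x2": 286, "y2": 158,
    "x3": 228, "y3": 158,
    "nodeCount": 4,
}

x_min, y_min, x_max, y_max = bounding_box_bounds(label)
print("box size in pixels:", x_max - x_min, "x", y_max - y_min)
```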

## JSON Tabular <a href="#json_tabular" id="json_tabular"></a>

JSON Tabular is a derivative of the JSON format that represents tabular data as an array of objects, where each object is one row and each key is a column name. You can choose this format if you are working on row-based labeling.

**This format is supported for data import and export** in row labeling projects.

**Example**

```javascript
[
  {
    "Book Title": "Sherlock Holmes: A Study in Scarlet",
    "Author": "Sir Arthur Conan Doyle",
    "Genre": "Fiction"
  },
  {
    "Book Title": "To Kill a Mockingbird",
    "Author": "Harper Lee",
    "Genre": "Fiction"
  },
  {
    "Book Title": "Alan Turing: The Enigma",
    "Author": "Andrew Hodges",
    "Genre": "Non fiction"
  },
  {
    "Book Title": "Humble Pie",
    "Author": "Gordon Ramsay",
    "Genre": "Non fiction"
  },
  {
    "Book Title": "The Little Prince",
    "Author": "Antoine de Saint-Exupéry",
    "Genre": "Fiction"
  }
]
```
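If your data is currently stored as TSV, a conversion to this array-of-objects layout could look like the Python sketch below. It only uses the standard library; the function name `tsv_to_json_tabular` is hypothetical, and quoting is disabled on the assumption that the input is plain TSV without quoted fields.

```python
import csv
import io
import json


def tsv_to_json_tabular(tsv_text: str) -> list:
    """Turn TSV text (header line first) into a list of row objects."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [dict(row) for row in reader]


tsv_text = (
    "Book Title\tAuthor\tGenre\n"
    "Humble Pie\tGordon Ramsay\tNon fiction\n"
    "To Kill a Mockingbird\tHarper Lee\tFiction\n"
)

rows = tsv_to_json_tabular(tsv_text)
print(json.dumps(rows, indent=2, ensure_ascii=False))
```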

## JSONL (JSON Lines)

{% hint style="info" %}
As of version 6.43.0, Datasaur supports JSONL natively 🎉
{% endhint %}

JSONL (JSON Lines) - <https://jsonlines.org/> - is a text file format suitable for storing data that can be processed one record at a time. Datasaur supports a subset of valid JSONL files, namely:

1. The file must have the `.jsonl` extension.
2. Every record in a file must have the same structure. If the first record (line) is an array, all following lines must also be arrays. If the first record is an object, all following lines must also be JSON objects.

**This format is supported for both data import and export,** in row labeling projects.

Here are some sample JSONL structures that Datasaur supports:

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-819b3d5500100c3ee477f187619204aa7a850605%2Fjson-lines-as-csv-replacement.jsonl?alt=media&token=2de45854-2127-41f7-b469-edc76e42daf3>" %}

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-326f370c41b90ecc3aaad2412006c8a4fb752826%2Fjson-lines-with-object.jsonl?alt=media&token=e4a4595c-3457-44ea-b3b1-3db295acb991>" %}

For JSONL with objects, you can have nested values, for example:

```json
{ "detail" : { "id": 1, "name": "product-name" }, "key": "value", "another-key": "another-value" }
```

Datasaur stringifies all values before rendering them.

Note that Datasaur relies on the first record (line) to determine the header. Any keys that do not appear in the first line will not be parsed.

Here is an example of how it may affect your workflow:

Let’s take the sample data above and alter it so that no `completed` value is stored for anyone who has not completed a session:

```javascript
{"name": "Gilbert", "session": 2013, "score": 24}
{"name": "Alexa", "session": 2014, "score": 29, "completed": true}
{"name": "May", "session": "2012B", "score": 14}
{"name": "Deloise", "session": "2012A", "score": 19, "completed": true}
```

This file will be parsed just fine, but you will be missing the `completed` column - because there is no `completed` key in the first line.

As such, we highly recommend making your data consistent between each line, to ensure the best compatibility with our parser.
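One way to make the lines consistent before uploading is sketched below in Python. It is a hypothetical pre-processing step, not a Datasaur feature: the function name `normalize_jsonl` is made up for this example, it assumes every line is a JSON object, and it fills missing keys with `null`, which may or may not be the right default for your data.

```python
import json


def normalize_jsonl(lines: list) -> list:
    """Rewrite JSONL object lines so that every line carries the same keys."""
    records = [json.loads(line) for line in lines if line.strip()]

    # Collect every key that appears in any record, in first-seen order.
    keys = []
    for record in records:
        for key in record:
            if key not in keys:
                keys.append(key)

    # Missing keys become null so the first line already lists all columns.
    return [
        json.dumps({key: record.get(key) for key in keys})
        for record in records
    ]


sample_lines = [
    '{"name": "Gilbert", "session": 2013, "score": 24}',
    '{"name": "Alexa", "session": 2014, "score": 29, "completed": true}',
    '{"name": "May", "session": "2012B", "score": 14}',
    '{"name": "Deloise", "session": "2012A", "score": 19, "completed": true}',
]

normalized = normalize_jsonl(sample_lines)
for line in normalized:
    print(line)
```

After this step the first line contains a `completed` key, so the column is no longer dropped.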

## TSV\_NON\_IOB

TSV\_NON\_IOB is a derivative of the TSV format that represents data that does not follow the IOB format - for example, `B-GEO` is just `GEO`. If your project is token-based (with or without arrows), you can choose this format for export.

**This format is supported for data export** in span based labeling projects.

A TSV\_NON\_IOB file contains the following data structure (*this explanation is based on our example below*):

1. **#FORMAT:** the file header.
2. **#Text:** the sentence representation.
3. **1-1:** the sentence-token.
   1. The first `1` indicates the sentence number.
   2. The second `1` indicates the token number.
4. **0-3:** the character index.
5. **TITLE\[1]:** the label applied.
   1. `[1]` uniquely identifies the annotation across lines.
6. **Column 5:** indicates layer 2.
7. **author\[2\_1]:** the label on the arrow.
   1. `2` indicates the arrow’s token origin.
   2. `1` indicates the arrow’s token destination.
8. **Column 7:** indicates layer 4.
9. **Column 8:** indicates layer 5.

*Note: columns 5, 7, and 8 will be filled if you label the token in the mentioned layers.*

{% hint style="info" %}
We built this format to be compatible with [WebAnno](https://webanno.github.io/webanno/releases/3.4.5/docs/user-guide.html#sect_webannotsv)
{% endhint %}

**Example (token-based)**

![](https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-775eb1b22d6e8246f4d78f76d750a916426eef0c%2Flabels.jpg?alt=media)

```
#FORMAT=Datasaur TSV 3

#Text=The Little Prince is a novella by French aristocrat , writer , and aviator Antoine de Saint - Exupéry .
1-1    0-3    The    TITLE[1]    _    _    _    _
1-2    4-10    Little    TITLE[1]    _    _    _    _
1-3    11-17    Prince    TITLE[1]    _    _    _    _
1-4    18-20    is    _    _    _    _    _
1-5    21-22    a    _    _    _    _    _
1-6    23-30    novella    _    _    _    _    _
1-7    31-33    by    _    _    _    _    _
1-8    34-40    French    _    _    _    _    _
1-9    41-51    aristocrat    _    _    _    _    _
1-10    52-53    ,    _    _    _    _    _
1-11    54-60    writer    _    _    _    _    _
1-12    61-62    ,    _    _    _    _    _
1-13    63-66    and    _    _    _    _    _
1-14    67-74    aviator    _    _    _    _    _
1-15    75-82    Antoine    PER[2]    _    _    _    _
1-16    83-85    de    PER[2]    _    _    _    _
1-17    86-91    Saint    PER[2]    _    _    _    _
1-18    92-93    -    PER[2]    _    _    _    _
1-19    94-101    Exupéry    PER[2]    _    _    _    _
1-20    102-103    .    _    _    _    _    _
```

**Example (token-based with arrows)**

![](https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-886788b869842f9b5f5cac944aad9ba5a4af1bc0%2Flabels%20\(2\).jpg?alt=media)

```
#FORMAT=Datasaur TSV 3

#Text=The Little Prince is a novella by French aristocrat , writer , and aviator Antoine de Saint - Exupéry .
1-1    0-3    The    TITLE[1]    _    *    author[2_1]    _
1-2    4-10    Little    TITLE[1]    _    _    _    _
1-3    11-17    Prince    TITLE[1]    _    _    _    _
1-4    18-20    is    _    _    _    _    _
1-5    21-22    a    _    _    _    _    _
1-6    23-30    novella    _    _    _    _    _
1-7    31-33    by    _    _    _    _    _
1-8    34-40    French    _    _    _    _    _
1-9    41-51    aristocrat    _    _    _    _    _
1-10    52-53    ,    _    _    _    _    _
1-11    54-60    writer    _    _    _    _    _
1-12    61-62    ,    _    _    _    _    _
1-13    63-66    and    _    _    _    _    _
1-14    67-74    aviator    _    _    _    _    _
1-15    75-82    Antoine    PER[2]    _    _    _    _
1-16    83-85    de    PER[2]    _    _    _    _
1-17    86-91    Saint    PER[2]    _    _    _    _
1-18    92-93    -    PER[2]    _    _    _    _
1-19    94-101    Exupéry    PER[2]    _    _    _    _
1-20    102-103    .    _    _    _    _    _
```
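To show how the columns described above fit together, here is a minimal Python sketch that groups tokens sharing the same bracketed annotation ID into spans. It is not an official parser: the function name `group_spans` is hypothetical, only the first label column is read, labels without a bracketed ID are ignored, and columns are split on any whitespace on the assumption that tokens never contain whitespace.

```python
import re
from collections import defaultdict

# Matches labels such as "TITLE[1]" -> name "TITLE", span id 1.
LABEL_RE = re.compile(r"^(?P<name>.+)\[(?P<span_id>\d+)\]$")


def group_spans(tsv_text: str) -> dict:
    """Map (label name, span id) to the text of the tokens carrying it."""
    spans = defaultdict(list)
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the #FORMAT / #Text headers
        fields = line.split()
        token, label = fields[2], fields[3]
        match = LABEL_RE.match(label)
        if match:
            key = (match["name"], int(match["span_id"]))
            spans[key].append(token)
    return {key: " ".join(tokens) for key, tokens in spans.items()}


sample = "\n".join([
    "#FORMAT=Datasaur TSV 3",
    "",
    "#Text=The Little Prince is a novella",
    "1-1\t0-3\tThe\tTITLE[1]\t_\t_\t_\t_",
    "1-2\t4-10\tLittle\tTITLE[1]\t_\t_\t_\t_",
    "1-3\t11-17\tPrince\tTITLE[1]\t_\t_\t_\t_",
    "1-4\t18-20\tis\t_\t_\t_\t_\t_",
])

print(group_spans(sample))
```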

## CoNLL-U

Universal Dependencies use a revised version of [the CoNLL-X format](http://anthology.aclweb.org/W/W06/W06-2920.pdf) called CoNLL-U. Sentences consist of one or more word lines, and word lines contain the following fields:

1. **sent\_id**: Sentence id.
2. **text**: Sentence.
3. **ID:** Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
4. **FORM:** Word form or punctuation symbol.
5. **LEMMA**: Lemma or stem of word form.
6. **UPOS:** [Universal part-of-speech tag](https://universaldependencies.org/u/pos/index.html).
7. **XPOS:** Language-specific part-of-speech tag; underscore if not available.
8. **FEATS:** List of morphological features from the [universal feature inventory](https://universaldependencies.org/u/feat/index.html) or from a defined [language-specific extension](https://universaldependencies.org/ext-feat-index.html); underscore if not available.
9. **HEAD:** Head of the current word, which is either a value of ID or zero (0).
10. **DEPREL:** [Universal dependency relation](https://universaldependencies.org/u/dep/index.html) to the HEAD ([root](https://universaldependencies.org/u/dep/root.html) iff HEAD = 0) or a defined language-specific subtype of one.
11. **DEPS:** Enhanced dependency graph in the form of a list of head-deprel pairs.
12. **MISC:** Any other annotation.

**This format is supported for data import** in span based labeling projects.

**Example**

```
# sent_id = 1
# text = Sherlock Holmes become widely popular in 1891 .
1	Sherlock	_	_	NNP	_	3	nsubj	_	_
2	Holmes	_	_	NNP	_	_	_	_	_
3	become	_	_	VBD	_	0	root	_	_
4	widely	_	_	RB	_	5	advmod	_	_
5	popular	_	_	JJ	_	3	xcomp	_	_
6	in	_	_	IN	_	_	_	_	_
7	1891	_	_	CD	_	_	_	_	_
8	.	_	_	.	_	_	_	_	_
```
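The field layout above can be read with a few lines of Python. The sketch below is illustrative only: `parse_conllu` is a hypothetical helper, it assumes the ten word-line fields are tab-separated and sentences are separated by blank lines, and it keeps every value as a string (including `_` for missing values).

```python
# Names of the ten word-line fields, in column order.
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conllu(text: str) -> list:
    """Parse CoNLL-U text into a list of {"meta": ..., "words": ...} dicts."""
    sentences = []
    meta, words = {}, []
    # The extra "" flushes the last sentence if the text has no trailing blank line.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if not line:
            if words:
                sentences.append({"meta": meta, "words": words})
            meta, words = {}, []
        elif line.startswith("#"):
            # Comment lines such as "# sent_id = 1" become meta entries.
            key, _, value = line[1:].partition("=")
            meta[key.strip()] = value.strip()
        else:
            words.append(dict(zip(FIELDS, line.split("\t"))))
    return sentences


sample = (
    "# sent_id = 1\n"
    "# text = Sherlock Holmes\n"
    "1\tSherlock\t_\t_\tNNP\t_\t3\tnsubj\t_\t_\n"
    "2\tHolmes\t_\t_\tNNP\t_\t_\t_\t_\t_\n"
)

parsed = parse_conllu(sample)
for word in parsed[0]["words"]:
    print(word["ID"], word["FORM"], word["XPOS"], word["HEAD"], word["DEPREL"])
```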

## CoNLL\_2003

[CoNLL\_2003](https://www.aclweb.org/anthology/W03-0419.pdf) is usually used for POS tagging and named entity recognition labeling. All data files contain one word per line with empty lines representing sentence boundaries. At the end of each line there is a tag which states whether the current word is inside a named entity or not. The tag also encodes the type of named entity. Each line contains four fields:

1. The word
2. Part-of-speech tag
3. Chunk tag
4. Named entity tag

**This format is supported for data import and export** when project settings allow arrows and multiple layers.

**Note: Importing or exporting files in the conll\_2003 format requires the following** [**task settings**](https://datasaurai.gitbook.io/datasaur/nlp-projects/nlp-task-types/span-based) **to be checked.**

* **Tokens and token spans should have at most one label**.
* **Allow arrows to be drawn between labels**. Checking this setting will activate layer feature.

For example, you could do POS tagging on Layer 0 and NER tagging on Layer 1. If you export the file as conll\_2003, the result will look like the sample file below.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-b035542d6710521d5ffd1fe3d2d05a49dfdabcd4%2FDatasaur%20sample%20-%20CoNLL%202003.conll?alt=media&token=2791275f-0ee6-423e-85ae-9e9bc759377b>" %}
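For readers who want to process such a file programmatically, the four-field layout can be read with a short Python sketch like the one below. The function name `parse_conll_2003` and the sample lines are illustrative assumptions, not taken from the Datasaur sample file; the sketch splits fields on any whitespace and skips the `-DOCSTART-` marker lines that commonly appear in CoNLL-2003 data.

```python
def parse_conll_2003(text: str) -> list:
    """Parse CoNLL-2003 text into sentences of word/pos/chunk/ner dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        # Blank lines (and document markers) end the current sentence.
        if not line.strip() or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append({"word": word, "pos": pos, "chunk": chunk, "ner": ner})
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample lines in the four-field layout described above.
sample = (
    "Sherlock NNP B-NP B-PER\n"
    "Holmes NNP I-NP I-PER\n"
    "\n"
    "London NNP B-NP B-LOC\n"
)

for sentence in parse_conll_2003(sample):
    print([(token["word"], token["ner"]) for token in sentence])
```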

## YOLO (TXT)

[YOLO](https://pjreddie.com/darknet/yolo/) (You Only Look Once) is a popular object detection algorithm known for its speed and accuracy. For that reason, it is often used in real-time object detection in videos and images.

A YOLO file is a text-based format used for storing annotations and labels for object detection tasks. Each line in a YOLO file represents one annotated object/label in an image. One label in a YOLO file is represented with the following format.

```
classId x_center y_center width height
```

1. **Class ID:** An integer representing the object’s label class. The ID starts from `0`. Each Class ID corresponds to a label class’s 0-based index/order in the label set.
2. **Bounding Box:** Four floating-point numbers representing the coordinates of the bounding box in the image. The four numbers are the following.

   1. <mark style="color:red;">`x_center`</mark>: the <mark style="color:red;">`x`</mark> (horizontal) coordinate of the bounding box’s center point.
   2. <mark style="color:red;">`y_center`</mark>: the <mark style="color:red;">`y`</mark> (vertical) coordinate of the bounding box’s center point.
   3. <mark style="color:red;">`width`</mark>: the width of the bounding box.
   4. <mark style="color:red;">`height`</mark>: the height of the bounding box.

   The coordinates are normalized values relative to the image’s width and height.

   The “0, 0” point is the top-left of the image, while the “1, 1” point is the bottom-right of the image.

**This format is supported for data import** **and export** in Bounding Box projects (single image only).

**Example**

```
0 0.80 0.20 0.1 0.2
1 0.60 0.50 0.3 0.5
2 0.25 0.35 0.2 0.4
3 0.45 0.70 0.25 0.35
```
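Because the values are normalized, they must be multiplied by the image size to obtain pixel coordinates. The Python sketch below shows that arithmetic for a single line; the function name `yolo_to_pixels` and the 1000 x 500 image size are assumptions made for illustration.

```python
def yolo_to_pixels(line: str, image_width: int, image_height: int) -> tuple:
    """Convert one YOLO line to (class_id, (left, top, right, bottom)) in pixels."""
    class_id, x_center, y_center, width, height = line.split()
    # Scale the normalized values by the image size.
    center_x = float(x_center) * image_width
    center_y = float(y_center) * image_height
    box_width = float(width) * image_width
    box_height = float(height) * image_height
    box = (
        center_x - box_width / 2,   # left
        center_y - box_height / 2,  # top
        center_x + box_width / 2,   # right
        center_y + box_height / 2,  # bottom
    )
    return int(class_id), box


# First line of the example above, on a hypothetical 1000 x 500 image.
class_id, box = yolo_to_pixels("0 0.80 0.20 0.1 0.2", 1000, 500)
print(class_id, box)
```

Since the origin is the top-left corner of the image, `top` is the smaller `y` value and `bottom` the larger one.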

**Limitations for Import and Export**

A YOLO file can only represent labels in one image. Due to that nature, Datasaur has limitations when importing and exporting labels from a multi-page file (e.g. PDF, TIFF) to YOLO.

* If you wish to create a pre-labeled Bounding Box Labeling project with a multi-page file + a YOLO file, the pre-labeled labels will only be applied to the first page.
* If you wish to export a Bounding Box Labeling project with multi-page files to YOLO format, only labels from the first page will be exported.

## LabelMe (XML)

[LabelMe](https://github.com/CSAILVision/LabelMeAnnotationTool) is an open-source format used for annotating images with labels for object detection and segmentation tasks. Each annotation file contains metadata about the image, a list of labeled objects, and their corresponding shapes and properties.

Objects in LabelMe are represented with polygonal shapes, which are defined by a series of vertices. This format can be used as an annotation file for bounding box labeling.

**This format is supported for data import** **and export** in Bounding Box projects.

A LabelMe file contains the following data structures.

1. <mark style="color:red;">**`filename`**</mark>: The name of the image file being annotated.
2. <mark style="color:red;">**`folder`**</mark>: The directory or folder containing the image.
3. <mark style="color:red;">**`source`**</mark>: Information about the image and annotations origin.
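4. <mark style="color:red;">**`imagesize`**</mark>: The dimensions of the image, given as the number of rows (<mark style="color:red;">**`nrows`**</mark>) and columns (<mark style="color:red;">**`ncols`**</mark>).
5. <mark style="color:red;">**`object`**</mark>: Array of the annotated objects within the image.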

As an annotation file, here are the fields used by Datasaur:

1. <mark style="color:red;">**`object`**</mark>
   1. <mark style="color:red;">**`name`**</mark>: The label or class name of the object.
   2. <mark style="color:red;">**`deleted`**</mark>: Indicates if the object is deleted (<mark style="color:red;">**`0`**</mark> for no, <mark style="color:red;">**`1`**</mark> for yes).
   3. <mark style="color:red;">**`verified`**</mark>: Indicates if the object's annotation has been verified (<mark style="color:red;">**`0`**</mark> for no, <mark style="color:red;">**`1`**</mark> for yes).
   4. <mark style="color:red;">**`occluded`**</mark>: Describes whether the object is occluded (blocked) by another object (<mark style="color:red;">**`yes`**</mark> or <mark style="color:red;">**`no`**</mark>).
   5. <mark style="color:red;">**`date`**</mark>: The date the annotation was made (if provided).
   6. <mark style="color:red;">**`id`**</mark>: A unique identifier for the object within this image.
   7. <mark style="color:red;">**`polygon`**</mark>: Represents the points making up the bounding box surrounding the annotated object.
      1. <mark style="color:red;">**`pt`**</mark>: A list of points making up the polygon.
         1. <mark style="color:red;">**`x`**</mark>: The x-coordinate of the point in pixels.
         2. <mark style="color:red;">**`y`**</mark>: The y-coordinate of the point in pixels.
      2. <mark style="color:red;">**`username`**</mark>: The annotator's username (if provided).
   8. <mark style="color:red;">**`attributes`**</mark>: A string containing additional attributes for the object.

**Example**

```xml
<annotation>
  <filename>IDcard_specimen.jpg</filename>
  <folder></folder>
  <source>
    <sourceImage></sourceImage>
    <sourceAnnotation>Datasaur</sourceAnnotation>
  </source>
  <imagesize>
    <nrows>904</nrows>
    <ncols>548</ncols>
  </imagesize>
  <object>
    <name>document_number</name>
    <deleted>0</deleted>
    <verified>0</verified>
    <occluded>no</occluded>
    <date></date>
    <id>588337</id>
    <parts>
      <hasparts></hasparts>
      <ispartof></ispartof>
    </parts>
    <polygon>
      <pt>
        <x>643</x>
        <y>150</y>
      </pt>
      <pt>
        <x>788</x>
        <y>150</y>
      </pt>
      <pt>
        <x>788</x>
        <y>119</y>
      </pt>
      <pt>
        <x>643</x>
        <y>119</y>
      </pt>
      <username></username>
    </polygon>
    <attributes>text=SPECl2014, type=National ID</attributes>
  </object>
</annotation>
```
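A file like this can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is a hedged illustration rather than Datasaur's own importer: `read_labelme_objects` is a hypothetical helper, and it assumes the `attributes` string is a comma-separated list of `key=value` pairs, as in the example above.

```python
import xml.etree.ElementTree as ET


def read_labelme_objects(xml_text: str) -> list:
    """Extract name, polygon points and the `text` attribute of each object."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        points = [
            (int(pt.findtext("x")), int(pt.findtext("y")))
            for pt in obj.iter("pt")
        ]
        # Assumed layout of <attributes>: "key=value, key=value".
        attributes = {}
        for pair in (obj.findtext("attributes") or "").split(","):
            key, separator, value = pair.partition("=")
            if separator:
                attributes[key.strip()] = value.strip()
        objects.append({
            "name": obj.findtext("name"),
            "points": points,
            "caption": attributes.get("text"),
        })
    return objects


sample = """
<annotation>
  <filename>IDcard_specimen.jpg</filename>
  <object>
    <name>document_number</name>
    <polygon>
      <pt><x>643</x><y>150</y></pt>
      <pt><x>788</x><y>150</y></pt>
      <pt><x>788</x><y>119</y></pt>
      <pt><x>643</x><y>119</y></pt>
    </polygon>
    <attributes>text=SPECl2014, type=National ID</attributes>
  </object>
</annotation>
"""

labelme_objects = read_labelme_objects(sample)
print(labelme_objects)
```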

**Limitations for Import**

Because <mark style="color:red;">`attributes`</mark> is a free-form string, Datasaur currently does not support re-importing all of its contents. Datasaur only reads the attribute with the <mark style="color:red;">`text`</mark> key and sets its value as the label’s caption.

## Pascal VOC (XML)

[Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/) is a widely used format for annotating images with labels for object detection tasks. Each annotation file contains metadata about the image, a list of labeled objects, and their corresponding bounding boxes.

Objects in Pascal VOC are represented with bounding boxes defined by the coordinates of their corners. This format can be used as an annotation file for bounding box labeling.

**This format is supported for data import** **and export** in Bounding Box projects.

A Pascal VOC file may contain the following data structures.

1. <mark style="color:red;">**`filename`**</mark>: The name of the image file being annotated.
2. <mark style="color:red;">**`folder`**</mark>: The directory or folder containing the image.
3. <mark style="color:red;">**`source`**</mark>: Information about the image and annotations origin.
4. <mark style="color:red;">**`size`**</mark>: Dimensions of the image.
   1. <mark style="color:red;">**`width`**</mark>: The width of the image in pixels.
   2. <mark style="color:red;">**`height`**</mark>: The height of the image in pixels.
   3. <mark style="color:red;">**`depth`**</mark>: The number of color channels in the image.
5. <mark style="color:red;">**`segmented`**</mark>: Indicates if the image has been segmented (<mark style="color:red;">**`0`**</mark> for no, <mark style="color:red;">**`1`**</mark> for yes).
6. <mark style="color:red;">**`object`**</mark>: Array of the annotated objects within the image.

As an annotation file, here are the fields used at Datasaur:

1. <mark style="color:red;">**`object`**</mark>
   1. <mark style="color:red;">**`name`**</mark>: The label or class name of the object.
   2. <mark style="color:red;">**`difficult`**</mark>: Indicates if the object is difficult to detect (<mark style="color:red;">**`0`**</mark> for no, <mark style="color:red;">**`1`**</mark> for yes).
   3. <mark style="color:red;">**`occluded`**</mark>: Indicates if the object is occluded (<mark style="color:red;">**`0`**</mark> for no, <mark style="color:red;">**`1`**</mark> for yes).
   4. <mark style="color:red;">**`truncated`**</mark>: Indicates if the object is truncated (<mark style="color:red;">**`0`**</mark> for no, <mark style="color:red;">**`1`**</mark> for yes).
   5. <mark style="color:red;">**`bndbox`**</mark>: The bounding box coordinates for the object.
      1. <mark style="color:red;">**`xmin`**</mark>: The x-coordinate of the top-left corner of the bounding box.
      2. <mark style="color:red;">**`ymin`**</mark>: The y-coordinate of the top-left corner of the bounding box.
      3. <mark style="color:red;">**`xmax`**</mark>: The x-coordinate of the bottom-right corner of the bounding box.
      4. <mark style="color:red;">**`ymax`**</mark>: The y-coordinate of the bottom-right corner of the bounding box.
   6. <mark style="color:red;">**`attributes`**</mark>: Additional attributes for the object (if any).
      1. <mark style="color:red;">**`attribute`**</mark>
         1. <mark style="color:red;">**`name`**</mark>: The name of the attribute.
         2. <mark style="color:red;">**`value`**</mark>: The value of the attribute.

**Example**

```xml
<annotation>
  <filename>IDcard_specimen.jpg</filename>
  <folder>Bounding box labeling</folder>
  <source>
    <database>Datasaur</database>
    <annotation>Datasaur</annotation>
    <image>IDcard_specimen.jpg</image>
  </source>
  <size>
    <width>904</width>
    <height>548</height>
    <depth></depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>document_number</name>
    <difficult>0</difficult>
    <occluded>0</occluded>
    <truncated>0</truncated>
    <bndbox>
      <xmin>643</xmin>
      <ymin>150</ymin>
      <xmax>788</xmax>
      <ymax>119</ymax>
    </bndbox>
    <attributes>
      <attribute>
        <name>text</name>
        <value>SPECl2014</value>
      </attribute>
      <attribute>
        <name>type</name>
        <value>National ID</value>
      </attribute>
    </attributes>
  </object>
</annotation>
```
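
To make the field layout concrete, here is a minimal sketch of reading such a file with Python's standard `xml.etree.ElementTree` module. The helper names (`to_int`, `parse_annotation`) are hypothetical, and the XML string is a trimmed copy of the example above. It is an illustration only, not Datasaur's own parser.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above (one object, one attribute).
VOC_XML = """<annotation>
  <filename>IDcard_specimen.jpg</filename>
  <size>
    <width>904</width>
    <height>548</height>
    <depth></depth>
  </size>
  <object>
    <name>document_number</name>
    <bndbox>
      <xmin>643</xmin>
      <ymin>150</ymin>
      <xmax>788</xmax>
      <ymax>119</ymax>
    </bndbox>
    <attributes>
      <attribute>
        <name>text</name>
        <value>SPECl2014</value>
      </attribute>
    </attributes>
  </object>
</annotation>"""


def to_int(text):
    """Return int(text), or None for an empty or missing element."""
    return int(text) if text and text.strip() else None


def parse_annotation(xml_text):
    """Collect the image size and every annotated object into plain dicts."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        objects.append({
            "name": obj.findtext("name"),
            "bndbox": {k: to_int(box.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax")},
            "attributes": {
                a.findtext("name"): a.findtext("value")
                for a in obj.findall("attributes/attribute")
            },
        })
    return {
        "filename": root.findtext("filename"),
        "width": to_int(size.findtext("width")),
        "height": to_int(size.findtext("height")),
        "depth": to_int(size.findtext("depth")),  # None when the tag is empty
        "objects": objects,
    }


parsed = parse_annotation(VOC_XML)
print(parsed["width"], parsed["height"], parsed["depth"])
print(parsed["objects"][0]["name"], parsed["objects"][0]["attributes"])
```

Returning `None` for an empty element keeps the reader from failing on the empty `depth` tag.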

**Limitations for export**

Datasaur currently does not export the image's color depth. The <mark style="color:red;">`depth`</mark> element is exported as an empty tag.

## MD (Markdown)

Markdown is a lightweight markup language with plain-text-formatting syntax, created in 2004 by John Gruber with Aaron Swartz. Markdown is often used to format readme files, for writing messages in online discussion forums, and to create rich text using a plain text editor.

**This format is supported for data import** in document labeling projects.

**Example**

```
To all appearances The Little Prince is a children’s book. But ever
since its original publication in French in 1943, Antoine de
Saint-Exupéry’s story has enchanted audiences of all ages. The book’s
beloved hero is a small, blond-haired boy from asteroid B-612, which he
leaves to journey across the galaxy. Along his way, he visits a number
of planets each populated by a sole person with an absurd profession
(the little prince ultimately learns that there is no other kind). When
he lands on Earth, in the middle of the desert, he is met by a
mysterious snake. “Where are all the people?” the little prince asks.
“I’m beginning to feel lonely in this desert.” “You can feel lonely
among people, too,” replies the snake.

However elusive the story’s meaning, few have matched the universality
of its appeal. In April 2017, The Little Prince became the most
translated book in the world, excluding religious texts (which enjoyed
significant head starts). It now exists in 300 languages, a sum that
doesn’t even include the range of translations within languages. In
Korean, there are said to be about 50 different versions. Until
recently, English could claim only a meagre six. Now, Michael Morpurgo,
master storyteller and untested translator, has delivered a definitive
seventh.
```

## SVG

Scalable Vector Graphics (SVG) is an Extensible Markup Language (XML)-based vector image format for two-dimensional graphics with support for interactivity and animation. SVG images and their behaviors are defined in XML text files. This means that they can be searched, indexed, scripted, and compressed. As XML files, SVG images can be created and edited with any text editor, as well as with drawing software.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-8bdeaf1670b92e47ad5a12197f28ee2514f45d53%2Falanturing%20(1).svg?alt=media>" %}
SVG Sample
{% endfile %}

## BMP

A bitmap is a type of memory organization or image file format used to store digital images. The term comes from computer programming terminology, where it means a map of bits: a spatially mapped array of bits. Along with pixmap, it now commonly refers to the similar concept of a spatially mapped array of pixels. Raster images in general may be referred to as bitmaps or pixmaps, whether synthetic or photographic, in files or memory.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-3645248c28550cde3bda2a42a6591d706310a9bc%2Falanturing%20(1).bmp?alt=media>" %}
BMP Sample
{% endfile %}

## TIFF and TIF

Tagged Image File Format, abbreviated TIFF or TIF, is a computer file format for storing raster graphics images, popular among graphic artists, the publishing industry, and photographers. TIFF is widely supported by scanning, faxing, word processing, optical character recognition, image manipulation, desktop publishing, and page-layout applications.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-304baf75bdff8d4864b22cc419974c4a13439c0b%2Falanturing%20(1).tiff?alt=media>" %}
TIFF Sample
{% endfile %}

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-3110b16ed4e0bb77de7431054ef1525bdad74a2b%2Falanturing%20(1)%20(1).tif?alt=media>" %}
TIF Sample
{% endfile %}

## WEBP

WebP is an image format employing both lossy and lossless compression. It is currently developed by Google, based on technology acquired with the purchase of On2 Technologies. As a derivative of the VP8 video format, it is a sister project to the WebM multimedia container format. WebP-related software is released under a BSD license.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-260cbb22929c2e94c17babf2624b3bb166d23783%2Falanturing%20(1).webp?alt=media>" %}
WEBP Sample
{% endfile %}

## JPEG and JPG

JPEG is a commonly used method of lossy compression for digital images, particularly those produced by digital photography. The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and image quality. JPEG typically achieves 10:1 compression with little perceptible loss in image quality.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-fa053642ab40492973a2ffd68326e2376dce8b6f%2FDatasaur%20sample%20-%20JPEG.jpeg?alt=media>" %}
JPEG Sample
{% endfile %}

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-c67410ae4336b11f7b31af30f535274f108b4be7%2FDatasaur%20sample%20-%20JPG.jpeg?alt=media>" %}
JPG Sample
{% endfile %}

## PNG

Portable Network Graphics (PNG) is a raster-graphics file format that supports lossless data compression. PNG was developed as an improved, non-patented replacement for Graphics Interchange Format (GIF). PNG supports palette-based images (with palettes of 24-bit RGB or 32-bit RGBA colors), grayscale images (with or without an alpha channel for transparency), and full-color non-palette-based RGB or RGBA images. The PNG working group designed the format for transferring images on the Internet, not for professional-quality print graphics; therefore, non-RGB color spaces such as CMYK are not supported.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-182a826cb62daa9526783f3f9132d83a7e38cd0a%2FDatasaur%20sample%20-%20PNG.png?alt=media>" %}
PNG Sample
{% endfile %}

## GIF

Graphics Interchange Format (GIF) is a bitmap image format. The format supports up to 8 bits per pixel for each image, allowing a single image to reference its own palette of up to 256 different colors chosen from the 24-bit RGB color space. It also supports animations and allows a separate palette of up to 256 colors for each frame. These palette limitations make GIF less suitable for reproducing color photographs and other images with color gradients, but well-suited for simpler images such as graphics or logos with solid areas of color. Unlike video, the GIF file format does not support audio.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-e80a7a05003d6cb19fde0c811f9deb23df02cb84%2FDatasaur%20sample%20-%20GIF.gif?alt=media>" %}
GIF Sample
{% endfile %}

## PDF

The Portable Document Format (PDF) is a file format developed by Adobe in the 1990s to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images and other information needed to display it. PDF was standardized as ISO 32000 in 2008, and no longer requires any royalties for its implementation.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-dcb5f02e111e15c6880bcdee6bc02f4a9fb2992d%2Fcovid-indonesiaflyer%20(1).pdf?alt=media>" %}
PDF Sample
{% endfile %}

## PPTX/PPT

PPTX is a zipped, XML-based file format that is part of the Microsoft Office Open XML (also known as OOXML or OpenXML) specification, introduced with Microsoft Office 2007. PPTX is the default file format for new PowerPoint presentations. Legacy PPT files are also supported.

**This format is supported for data import** in document labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-3c5e327b5fc2a8f633b3e6c6302b6bf613ddfc87%2FDatasaur%20sample%20-%20PPTX.pptx?alt=media&token=7fd71bde-36e2-4c34-b221-e7d57fd28c58>" %}
PPTX Sample
{% endfile %}

## DOCX/DOC

DOCX is part of the Microsoft Office Open XML specification (also known as OOXML or OpenXML) and was introduced with Office 2007. DOCX is a zipped, XML-based file format. Microsoft Word 2007 and later use DOCX as the default file format when creating a new document. Legacy DOC files are also supported.

**This format is supported for data import** in document labeling projects.

## URL/URLS/URI

A URL file contains a list of URLs. A URI, more generally, is a standardized identifier used to identify and locate resources on the internet. This format is used to create a document labeling project with the URL Viewer, so you can label web pages through Datasaur.

**This format is supported for data import** in document labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-aa1c8fdecca1ddbf5f763077e2a59207d0689ab2%2Fsample%20(1).urls?alt=media>" %}
URL Sample
{% endfile %}

## HTML

HTML is a markup language used to create web pages and other types of online content. It provides a standardized way of defining the structure and appearance of web pages, including text, images, and multimedia elements like audio and video.

HTML files are commonly used to create and publish websites, as well as to share content across the internet. To ensure your multimedia elements are rendered properly in Datasaur, make sure they use full URLs instead of relative paths. For example, an `img` tag should set its `src` attribute like this:

```html
<img src="https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png" />
```
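
If an HTML file still uses relative paths, they can be rewritten before upload. The sketch below is one hedged way to do that with Python's standard library. The function name `absolutize_src` and the base URL `https://example.com/portal/` are hypothetical. A regular expression is used for brevity; a real HTML parser would be more robust on unusual markup.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or src='...' inside img, audio, video, and source tags.
SRC_PATTERN = re.compile(
    r'(<(?:img|audio|video|source)\b[^>]*?\ssrc\s*=\s*)(["\'])(.*?)\2',
    re.IGNORECASE | re.DOTALL,
)


def absolutize_src(html, base_url):
    """Rewrite relative src values as full URLs resolved against base_url."""
    def replace(match):
        prefix, quote, src = match.groups()
        # urljoin leaves values that are already full URLs unchanged.
        return f"{prefix}{quote}{urljoin(base_url, src)}{quote}"
    return SRC_PATTERN.sub(replace, html)


page = '<p>Logo:</p><img src="assets/img/logo.png" />'
print(absolutize_src(page, "https://example.com/portal/"))
```

The base URL should be the address the page was originally served from, so that relative paths resolve to the same files.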

**This format is supported for data import** in document labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-5595ce8fd4d40ea45d2a127dafb8a1f7e7800cfc%2FSample%20HTML%20(1).zip?alt=media>" %}
HTML Sample
{% endfile %}

## MP4

MP4 is a video file format that is widely used for streaming and sharing videos online. It provides high-quality video and audio compression, making it a popular choice for digital content creators and viewers alike.

**This format is supported for data import** in audio labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-b4b92e1562d08a514db8b2900fbd4ca0afeb54b1%2FSample%20MP4%20(1).mp4?alt=media>" %}
MP4 Sample
{% endfile %}

## M4A

M4A, short for MPEG-4 Audio, is a file format used to store audio data. It is a part of the MPEG-4 container format, which can hold various types of media like audio, video, and text.

**This format is supported for data import** in audio labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-3b9fb86c75253dfb8ac361fe2c8a96d8c0fe8185%2FDatasaur%20sample%20-%20M4A.m4a?alt=media&token=e40fd72b-6dd5-48c0-abb8-f1ad34b7670d>" %}
M4A Sample
{% endfile %}

## MP3

MP3 is a digital audio file format commonly used for storing music and other audio recordings. It is a compressed format that allows for high-quality sound while minimizing the file size.

**This format is supported for data import** in audio labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-665c116fcacadaf762808889e7601ad9e41d6c5e%2FSample%20MP3%20(1).mp3?alt=media>" %}
MP3 Sample
{% endfile %}

## FLAC

FLAC is a lossless audio file format that preserves the original quality of the recording. It is often used by audiophiles and music producers who require the highest level of audio fidelity.

**This format is supported for data import** in audio labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-858eeda943a9afefa2b9e426ea4a1a0c4ffa3efe%2FSample%20FLAC%20(1).flac?alt=media>" %}
FLAC Sample
{% endfile %}

## AAC

AAC is a file format for storing music or other sounds. It stands for Advanced Audio Coding or Advanced Audio Codec. It is one of the standard formats from the Moving Picture Experts Group (MPEG), the group that also standardized MP3.

**This format is supported for data import** in audio labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-3678124018a8e03545fa78a1cc974f4de2ddb17a%2FDatasaur%20sample%20-%20AAC.aac?alt=media&token=1212a0d8-e543-4836-adc4-3f7d6269c4bc>" %}
AAC Sample
{% endfile %}

## WAV

WAV is a high-quality audio file format that is often used for storing uncompressed audio recordings. It is a popular format in professional audio production and is known for its high level of accuracy and fidelity.

**This format is supported for data import** in audio labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-3822b14b7dd7bac1d89ec09cbea6731bbb2cd655%2FSample%20WAV%20(1).wav?alt=media>" %}
WAV Sample
{% endfile %}

## SRT

SRT is a subtitle file format used for adding subtitles to video content. It contains the text of the subtitles along with timing information to synchronize them with the video.

**This format is supported for data import** in audio labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-b1694f0731befeb3efce54cf940bffbf5f2b142f%2FSample%20SRT%20(1).srt?alt=media>" %}
SRT Sample
{% endfile %}

## VTT

VTT is a subtitle file format that is commonly used for adding captions and subtitles to video content. It is a newer format than SRT and supports more advanced features such as text styling and positioning.

**This format is supported for data import** in audio labeling projects.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-5551623a8edfaff36e73d703a302e1833156ad98%2FSample%20VTT%20(1).vtt?alt=media>" %}
VTT Sample
{% endfile %}

## LayoutLM

LayoutLM (Layout Language Model) is a transformer-based model from Microsoft that is designed to process and understand documents by combining text and layout information.

A LayoutLM file is a `.tsv` file where each row represents a labeled element within a document. Each row includes the following fields:

1. `text`: the content of the label (caption).
2. `xmin`: x coordinate of top left position of the bounding box.
3. `ymin`: y coordinate of top left position of the bounding box.
4. `xmax`: x coordinate of bottom right position of the bounding box.
5. `ymax`: y coordinate of bottom right position of the bounding box.
6. `width`: the width of the bounding box.
7. `height`: the height of the bounding box.
8. `label`: the label class name.
9. `page_index`: index of the page if the document contains multiple pages.

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-4c19394a17429cc5306d69b93ede7ebf20a050a3%2FDatasaur%20Sample%20-%20LayoutLM.tsv?alt=media&token=794f173e-3f37-4341-9c14-f9a6674be556>" %}
LayoutLM Example
{% endfile %}
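
As a rough illustration of the row layout, the sketch below reads LayoutLM-style rows with Python's `csv` module. It assumes the columns appear in the order listed above. Whether an export begins with a header row is not stated here, so the reader tolerates both cases. The sample rows and the `read_layoutlm` helper are made up for this example.

```python
import csv
import io

# Field order taken from the list above.
FIELDS = ["text", "xmin", "ymin", "xmax", "ymax", "width", "height", "label", "page_index"]
NUMERIC = FIELDS[1:7] + ["page_index"]

# Made-up rows for illustration only.
SAMPLE_TSV = (
    "Invoice\t40\t32\t118\t52\t78\t20\tHEADER\t0\n"
    "Total\t40\t400\t90\t420\t50\t20\tKEY\t0\n"
)


def read_layoutlm(tsv_text):
    """Parse tab-separated rows into dicts, converting numeric fields to float."""
    rows = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for values in reader:
        if not values or values == FIELDS:  # skip blank lines and an optional header
            continue
        row = dict(zip(FIELDS, values))
        for key in NUMERIC:
            row[key] = float(row[key])
        rows.append(row)
    return rows


for row in read_layoutlm(SAMPLE_TSV):
    # Sanity check for the sample data: size fields match the corner coordinates.
    assert row["width"] == row["xmax"] - row["xmin"]
    assert row["height"] == row["ymax"] - row["ymin"]
    print(row["label"], row["text"], row["page_index"])
```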

#### **Limitations for Export**

In LayoutLM format, each label is expected to contain a single word. A label with a multi-word or multi-line caption is disregarded by default. The exception is a label that has as many words as associated shapes (e.g., a merged label): such a label is processed as multiple distinct labels, each containing a single word and its corresponding shape.
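
The rule can be sketched as a small function. This is an illustration of the behavior described above, not Datasaur's actual export code. The function name `split_for_layoutlm` and the representation of a shape as an `(xmin, ymin, xmax, ymax)` tuple are hypothetical, and single-word labels are assumed to carry exactly one shape.

```python
def split_for_layoutlm(caption, shapes):
    """Return (word, shape) pairs to export; an empty list means the label is skipped."""
    # split() breaks on any whitespace, so multi-line captions count as multi-word.
    words = caption.split()
    if words and len(words) == len(shapes):
        # Covers a single word with one shape, and merged labels where each
        # word is paired with its own shape in order.
        return list(zip(words, shapes))
    return []


print(split_for_layoutlm("Total", [(40, 400, 90, 420)]))
print(split_for_layoutlm("Grand Total", [(40, 400, 90, 420), (95, 400, 150, 420)]))
print(split_for_layoutlm("Grand Total", [(40, 400, 150, 420)]))
```

The first two calls yield one exported entry per word, while the last call returns an empty list because two words share a single shape.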
