# Importable Format

The `Importable` format is a JSON format used to import data into a Datasaur project.

An `Importable` JSON file may contain the following data structures:

1. **type:** the `Importable` type (identified by the value "BOUNDING\_BOX")
2. **cells:** an array of cells, where each cell is the intersection of a row and a column
   1. **content:** a sentence in the cell
   2. **index:** the column index for the cell
   3. **line:** the row index for the cell
   4. **metadata:** additional information for a cell
   5. **tokens:** array of strings to define custom tokenization
3. **labelSets:** the label sets used by the project
4. **labels:** an array of labels
   1. **Span label type**
      1. **id**: A unique number
      2. **type**: Identified with value "SPAN"
      3. **startCellLine**, **startCellIndex**, **startTokenIndex**, **startCharIndex**. The starting position of the span label. Please refer to [this](/compatibility-and-updates/common-terminology.md#cell) for a thorough explanation of how cells, tokens, and characters are positioned.
      4. **endCellLine**, **endCellIndex**, **endTokenIndex**, **endCharIndex**. The ending position of the span label. Please refer to [this](/compatibility-and-updates/common-terminology.md#cell) for a thorough explanation of how cells, tokens, and characters are positioned.
      5. **layer**: The label set index to which this label belongs.
      6. **counter**: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.
      7. **labelSetItemId:** The ID of this label within the label set.
   2. **Arrow label type**
      1. **id**: A unique number
      2. **type**: Identified with value "ARROW"
      3. **originId**: The id of the label from which this arrow starts.
      4. **destinationId**: The id of the label to which this arrow ends.
      5. **startCellLine**, **startCellIndex**, **startTokenIndex**, **startCharIndex**. This is the same as the origin label's.
      6. **endCellLine**, **endCellIndex**, **endTokenIndex**, **endCharIndex**. This is the same as the destination label's.
      7. **layer**: The label set index to which this label belongs.
      8. **counter**: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.
      9. **labelSetItemId**: The ID of this label within the label set. Use an empty string if this arrow does not have a label.
   3. **Bounding box label type**
      1. **type:** identified with value "BOUNDING\_BOX"
      2. **startCellLine:** row (line) index of the cell where the label starts
      3. **startCellIndex:** column index of the cell where the label starts
      4. **startTokenIndex:** index of the token where the label starts
      5. **startCharIndex:** index of the character where the label starts (relative to the token, so it restarts from 0 whenever the token index is incremented)
      6. **endCellLine:** row (line) index of the cell where the label ends
      7. **endCellIndex:** column index of the cell where the label ends
      8. **endTokenIndex:** index of the token where the label ends
      9. **endCharIndex:** index of the character where the label ends (relative to the token)
      10. **layer:** the layer where the token is positioned
      11. **counter:** The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.
      12. **pageIndex:** index of the page if the document contains multiple pages
      13. **nodeCount:** total number of points in the bounding box
      14. **x0:** x coordinate of top left position of the bounding box
      15. **y0:** y coordinate of top left position of the bounding box
      16. **x1:** x coordinate of top right position of the bounding box
      17. **y1:** y coordinate of top right position of the bounding box
      18. **x2:** x coordinate of bottom right position of the bounding box
      19. **y2:** y coordinate of bottom right position of the bounding box
      20. **x3:** x coordinate of bottom left position of the bounding box
      21. **y3:** y coordinate of bottom left position of the bounding box
5. **pages:** an array of page information
   1. **pageIndex:** index of the page if the document contains multiple pages
   2. **pageHeight:** original page height in pixels
   3. **pageWidth:** original page width in pixels

## Example (with span and arrow label type)

```json
{
  "cells": [
    {
      "content": "The quick brown fox jumps over the lazy dog",
      "tokens": ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
      "line": 0,
      "index": 0,
      "metadata": []
    }
  ],
  "labelSets": [
    {
      "name": "Subjects",
      "index": 0,
      "labelItems": [
        { "id": "FOX_ID", "labelName": "Fox" },
        { "id": "DOG_ID", "labelName": "Dog" }
      ]
    },
    {
      "name": "Verbs",
      "index": 1,
      "labelItems": [
        { "id": "JUMP_ID", "labelName": "Jump" }
      ]
    }
  ],
  "labels": [
    {
      "id": 1,
      "type": "SPAN",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 3,
      "endCharIndex": 2,
      "layer": 0,
      "counter": 0,
      "labelSetItemId": "FOX_ID"
    },
    {
      "id": 2,
      "type": "SPAN",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 6,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 8,
      "endCharIndex": 2,
      "layer": 0,
      "counter": 0,
      "labelSetItemId": "DOG_ID"
    },
    {
      "id": 3,
      "originId": 1,
      "destinationId": 2,
      "type": "ARROW",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 8,
      "endCharIndex": 2,
      "layer": 1,
      "counter": 0,
      "labelSetItemId": "JUMP_ID"
    }
  ],
  "name": "Example"
}
```

The above JSON will produce the following output:

![](/files/UpSeym4PL7iuSgRC0XKL)

### Explanation

The above project contains 1 cell and 3 labels (2 span labels and 1 arrow label) from 2 different label sets.

#### Defining the Cells

Refer to [this](/compatibility-and-updates/common-terminology.md#cell) for a more thorough explanation.

There is only one Cell in the project, and for that Cell we set:

* **content** as `The quick brown fox jumps over the lazy dog`.
* **tokens** as `["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]`, because we use a simple whitespace tokenization. You can define your own tokenization method here.
* **line** as 0, since the cell is on the first row.
* **index** as 0, since the cell is in the leftmost column.
* **metadata** as `[]`, since we don't display any metadata.
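
The cell definition above can be sketched in Python. This is a minimal illustration, not part of any Datasaur library: `make_cell` is a hypothetical helper name, and it assumes the simple whitespace tokenization used in this example.

```python
def make_cell(content: str, line: int, index: int) -> dict:
    """Build one cell dict (hypothetical helper, not a Datasaur API)."""
    return {
        "content": content,
        # Simple whitespace tokenization; swap in your own tokenizer here.
        "tokens": content.split(),
        "line": line,      # row index of the cell
        "index": index,    # column index of the cell
        "metadata": [],    # no metadata is displayed in this example
    }


cell = make_cell("The quick brown fox jumps over the lazy dog", line=0, index=0)
```

Calling `make_cell` once per row and column and collecting the results in a list gives the `cells` array.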

#### Defining the Label Sets

There are 2 label sets, named `Subjects` (index 0) and `Verbs` (index 1). Please refer to [this](/compatibility-and-updates/common-terminology.md#label-set) for a thorough explanation of the Label Set Index. Label Items within each Label Set must have a unique id; for example, the `Subjects` Label Set has `FOX_ID` and `DOG_ID`.
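
Since ids must be unique within a Label Set, it can help to check for duplicates before importing. The following is a rough sketch of such a check; `find_duplicate_item_ids` is a hypothetical helper name and the check itself is only an illustration, not something the import process is documented to run.

```python
from collections import Counter


def find_duplicate_item_ids(label_sets: list) -> dict:
    """Map each label set name to the item ids that appear more than once in it."""
    duplicates = {}
    for label_set in label_sets:
        counts = Counter(item["id"] for item in label_set["labelItems"])
        repeated = sorted(item_id for item_id, n in counts.items() if n > 1)
        if repeated:
            duplicates[label_set["name"]] = repeated
    return duplicates


label_sets = [
    {
        "name": "Subjects",
        "index": 0,
        "labelItems": [
            {"id": "FOX_ID", "labelName": "Fox"},
            {"id": "DOG_ID", "labelName": "Dog"},
        ],
    },
    {
        "name": "Verbs",
        "index": 1,
        "labelItems": [{"id": "JUMP_ID", "labelName": "Jump"}],
    },
]

# An empty dict means every label set has unique item ids.
duplicates = find_duplicate_item_ids(label_sets)
```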

#### Defining the Labels

The first label (`Fox`):

* starts from the first character of the second token in the first cell, hence we set:
  * **startCellLine** as 0, since the first cell's Line is 0.
  * **startCellIndex** as 0, since the first cell's Index is 0.
  * **startTokenIndex** as 1, since the label starts from the second token.
  * **startCharIndex** as 0, since the label starts from the first character of the token.
* ends at the third character of the fourth token in the first cell, hence we set:
  * **endCellLine** as 0
  * **endCellIndex** as 0
  * **endTokenIndex** as 3
  * **endCharIndex** as 2
* comes from the first Label Set, hence we set:
  * **layer** as 0
  * **labelSetItemId** as `FOX_ID`

The JSON for the second label (`Dog`) follows the same idea as the first label (`Fox`).
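
To see how the token and character indices resolve to text, here is a small sketch that recovers the text covered by a span label. It assumes the span stays inside a single cell, that **endCharIndex** is inclusive (as in the walkthrough above, where the third character has index 2), and that tokens are rejoined with a single space. `span_text` is a hypothetical helper name.

```python
def span_text(cells: list, label: dict) -> str:
    """Return the text covered by a span label that stays inside one cell."""
    start_cell = (label["startCellLine"], label["startCellIndex"])
    end_cell = (label["endCellLine"], label["endCellIndex"])
    if start_cell != end_cell:
        raise ValueError("this sketch only handles spans within a single cell")

    # Find the cell by its (line, index) position.
    cell = next(c for c in cells if (c["line"], c["index"]) == start_cell)

    # Token indices are 0-based and the end token is included.
    tokens = list(cell["tokens"][label["startTokenIndex"]: label["endTokenIndex"] + 1])

    # Trim the end first so single-token spans keep correct character offsets.
    tokens[-1] = tokens[-1][: label["endCharIndex"] + 1]
    tokens[0] = tokens[0][label["startCharIndex"]:]
    return " ".join(tokens)


cells = [{
    "content": "The quick brown fox jumps over the lazy dog",
    "tokens": ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    "line": 0, "index": 0, "metadata": [],
}]
fox_label = {
    "id": 1, "type": "SPAN",
    "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 1, "startCharIndex": 0,
    "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 3, "endCharIndex": 2,
    "layer": 0, "counter": 0, "labelSetItemId": "FOX_ID",
}

fox_text = span_text(cells, fox_label)  # "quick brown fox"
```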

The third label (`Jump`):

* It originates from the first label (`Fox`) and ends at the second label (`Dog`), hence we set:
  * **originId** as 1, because the first label has id 1.
  * **destinationId** as 2, because the second label has id 2.
* **startCellLine**, **startCellIndex**, **startTokenIndex**, **startCharIndex** are the same as the first label's (`Fox`).
* **endCellLine**, **endCellIndex**, **endTokenIndex**, **endCharIndex** are the same as the second label's (`Dog`).
* comes from the second Label Set, hence we set:
  * **layer** as 1
  * **labelSetItemId** as `JUMP_ID`
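
The arrow label can be derived from its origin and destination span labels. The sketch below copies the start position from the origin and the end position from the destination, matching the JSON example above (where the arrow's **layer** is 1). `make_arrow` is a hypothetical helper name, not a Datasaur function.

```python
START_KEYS = ("startCellLine", "startCellIndex", "startTokenIndex", "startCharIndex")
END_KEYS = ("endCellLine", "endCellIndex", "endTokenIndex", "endCharIndex")


def make_arrow(arrow_id, origin, destination, layer, label_set_item_id="", counter=0):
    """Build an ARROW label pointing from the origin label to the destination label."""
    arrow = {
        "id": arrow_id,
        "type": "ARROW",
        "originId": origin["id"],
        "destinationId": destination["id"],
    }
    # The start position is the origin label's start position.
    arrow.update({key: origin[key] for key in START_KEYS})
    # The end position is the destination label's end position.
    arrow.update({key: destination[key] for key in END_KEYS})
    # An empty labelSetItemId means the arrow has no label.
    arrow.update({"layer": layer, "counter": counter, "labelSetItemId": label_set_item_id})
    return arrow


fox = {
    "id": 1, "type": "SPAN",
    "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 1, "startCharIndex": 0,
    "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 3, "endCharIndex": 2,
    "layer": 0, "counter": 0, "labelSetItemId": "FOX_ID",
}
dog = {
    "id": 2, "type": "SPAN",
    "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 6, "startCharIndex": 0,
    "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 8, "endCharIndex": 2,
    "layer": 0, "counter": 0, "labelSetItemId": "DOG_ID",
}

jump = make_arrow(3, fox, dog, layer=1, label_set_item_id="JUMP_ID")
```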

## Example (with bounding box label type)

```json
{
  "cells": [
    {
      "content": "SHIHLIN TAIWAN",
      "index": 0,
      "line": 0,
      "metadata": [],
      "tokens": [
        "SHIHLIN",
        "TAIWAN"
      ]
    },
    {
      "content": "STREET SNACKS",
      "index": 0,
      "line": 1,
      "metadata": [],
      "tokens": [
        "STREET",
        "SNACKS"
      ]
    }
  ],
  "labelSets": [],
  "labels": [
    {
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 0,
      "endCharIndex": 6,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "type": "BOUNDING_BOX",
      "nodeCount": 4,
      "x0": 130,
      "y0": 154,
      "x1": 255,
      "y1": 154,
      "x2": 255,
      "y2": 186,
      "x3": 130,
      "y3": 186
    },
    {
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 1,
      "endCharIndex": 5,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "type": "BOUNDING_BOX",
      "nodeCount": 4,
      "x0": 261,
      "y0": 154,
      "x1": 375,
      "y1": 154,
      "x2": 375,
      "y2": 186,
      "x3": 261,
      "y3": 186
    }
  ],
  "name": "receipt.jpg",
  "pages": [
    {
      "pageIndex": 0,
      "pageHeight": 619,
      "pageWidth": 551
    }
  ],
  "type": "BOUNDING_BOX"
}
```
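
The corner ordering of the bounding box fields (top left, top right, bottom right, bottom left) can be generated from an axis-aligned rectangle. This is a minimal sketch under that assumption; `box_corners` and its parameter names are hypothetical and only illustrate how `x0`..`y3` relate to each other.

```python
def box_corners(left: int, top: int, right: int, bottom: int) -> dict:
    """Return the corner fields of an axis-aligned bounding box label."""
    return {
        "nodeCount": 4,            # four corner points
        "x0": left, "y0": top,     # top left
        "x1": right, "y1": top,    # top right
        "x2": right, "y2": bottom, # bottom right
        "x3": left, "y3": bottom,  # bottom left
    }


# Reproduces the corner values of the first label ("SHIHLIN") in the example above.
shihlin_box = box_corners(left=130, top=154, right=255, bottom=186)
```

The returned dict can be merged with the position fields (`startCellLine`, `endCharIndex`, and so on), `pageIndex`, and `"type": "BOUNDING_BOX"` to form a complete label.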

