Importable Format
The Importable format is a JSON format used to import data into a Datasaur project.
An Importable JSON document may contain the following data structures:
type: the Importable type (identified with value "BOUNDING_BOX")
cells: an array of cells, each being the intersection of a row and a column
content: a sentence in the cell
index: the column index for the cell
line: the row index for the cell
metadata: additional information for a cell
tokens: array of strings to define custom tokenization
labelSets: the label set which is used by the project.
labels: an array of labels
Span label type
id: A unique number
type: Identified with value "SPAN"
startCellLine, startCellIndex, startTokenIndex, startCharIndex. The starting position of the span label. Please refer to this for a thorough explanation of how cells, tokens, and characters are positioned.
endCellLine, endCellIndex, endTokenIndex, endCharIndex. The ending position of the span label. Please refer to this for a thorough explanation of how cells, tokens, and characters are positioned.
layer: The label set index to which this label belongs.
counter: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.
labelSetItemId: The ID of this label within the label set.
Arrow label type
id: A unique number
type: Identified with value "ARROW"
originId: The id of the label from which this arrow starts.
destinationId: The id of the label to which this arrow ends.
startCellLine, startCellIndex, startTokenIndex, startCharIndex. This is the same as the origin label's.
endCellLine, endCellIndex, endTokenIndex, endCharIndex. This is the same as the destination label's.
layer: The label set index to which this label belongs.
counter: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.
labelSetItemId: The ID of this label within the label set. Use an empty string if this arrow does not have a label.
Bounding box label type
type: identified with value "BOUNDING_BOX"
startCellLine: row (line) index of the cell where the label starts
startCellIndex: column index of the cell where the label starts
startTokenIndex: index of the token where the label starts
startCharIndex: index of the character where the label starts, relative to its token (counting restarts from 0 for each new token)
endCellLine: row (line) index of the cell where the label ends
endCellIndex: column index of the cell where the label ends
endTokenIndex: index of the token where the label ends
endCharIndex: index of the character where the label ends, relative to its token
layer: the layer where the token is positioned
counter: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.
pageIndex: index of the page, if the document contains multiple pages
nodeCount: total number of the bounding box points
x0: x coordinate of top left position of the bounding box
y0: y coordinate of top left position of the bounding box
x1: x coordinate of top right position of the bounding box
y1: y coordinate of top right position of the bounding box
x2: x coordinate of bottom right position of the bounding box
y2: y coordinate of bottom right position of the bounding box
x3: x coordinate of bottom left position of the bounding box
y3: y coordinate of bottom left position of the bounding box
pages: an array of page information
pageIndex: index of the page, if the document contains multiple pages
pageHeight: original page height in pixels
pageWidth: original page width in pixels
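The corner ordering used by the bounding box fields (top left, top right, bottom right, bottom left) can be sketched with a small Python helper. This is only an illustration: the function name box_corners and its left, top, width, and height parameters are assumptions and not part of the Importable format, and the sketch assumes image-style coordinates where y grows downward. Only the nodeCount and x0 to y3 keys come from the field list above.

```python
def box_corners(left, top, width, height):
    """Return the corner fields for an axis-aligned rectangle.

    Hypothetical helper. Corners follow the documented order:
    0 = top left, 1 = top right, 2 = bottom right, 3 = bottom left.
    Assumes y increases downward, as in image coordinates.
    """
    right = left + width
    bottom = top + height
    return {
        "nodeCount": 4,  # a rectangle has four points
        "x0": left, "y0": top,
        "x1": right, "y1": top,
        "x2": right, "y2": bottom,
        "x3": left, "y3": bottom,
    }

corners = box_corners(left=10, top=20, width=100, height=40)
print(corners)
```

The returned keys could be merged into a bounding box label alongside its position fields, pageIndex, layer, and counter.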
Example (with span and arrow label type)
The above JSON will produce the following output:
Explanation
The above project contains 1 cell and 3 labels (2 span labels and 1 arrow label) from 2 different label sets.
Defining the Cells
Refer to this for a more thorough explanation.
There is only one cell in the project, and for that cell we set:
content as The quick brown fox jumps over the lazy dog.
tokens as ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"], because we use simple whitespace tokenization. You can define your own tokenization method here.
line as 0, since the cell is on the first row.
index as 0, since the cell is in the leftmost column.
metadata as [], since we don't display any metadata.
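The cell described above can be sketched in Python as follows. The helper name make_cell is an assumption made for illustration; the keys and values are the ones listed in this section, and whitespace splitting stands in for whatever tokenization you choose.

```python
import json

def make_cell(content, line, index):
    """Build a cell dict using simple whitespace tokenization.

    Hypothetical helper: any other tokenizer could fill "tokens" instead.
    """
    return {
        "line": line,                # row index of the cell
        "index": index,              # column index of the cell
        "content": content,
        "tokens": content.split(),   # whitespace tokenization
        "metadata": [],              # no metadata displayed
    }

cell = make_cell("The quick brown fox jumps over the lazy dog", line=0, index=0)
print(json.dumps(cell, indent=2))
```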
Defining the Label Sets
There are 2 label sets, named Subjects (index 0) and Verbs (index 1). Please refer to this for a thorough explanation of the label set index. Label items within each label set must have a unique id; for example, the Subjects label set contains FOX_ID and DOG_ID.
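The uniqueness requirement can be checked before importing. The sketch below is an assumption-laden illustration: it represents each label set as a plain list of item IDs ordered by label set index, which is a simplification and not the actual labelSets schema, and the function name duplicate_item_ids is hypothetical.

```python
from collections import Counter

def duplicate_item_ids(item_ids):
    """Return the IDs that appear more than once in one label set, sorted."""
    counts = Counter(item_ids)
    return sorted(item_id for item_id, n in counts.items() if n > 1)

# Simplified stand-in for the two label sets in this example.
label_sets = [
    ["FOX_ID", "DOG_ID"],  # index 0: Subjects
    ["JUMP_ID"],           # index 1: Verbs
]

for layer, item_ids in enumerate(label_sets):
    dupes = duplicate_item_ids(item_ids)
    assert not dupes, f"label set {layer} has duplicate ids: {dupes}"
```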
Defining the Labels
The first label (Fox):
starts from the first character of the second token in the first cell, hence we set:
startCellLine as 0, since the first cell's Line is 0.
startCellIndex as 0, since the first cell's Index is 0.
startTokenIndex as 1, since the label starts from the second token.
startCharIndex as 0, since the label starts from the first character of the token.
ends at the third character of the fourth token in the first cell, hence we set:
endCellLine as 0
endCellIndex as 0
endTokenIndex as 3
endCharIndex as 2
comes from the first Label Set, hence we set:
layer as 0
labelSetItemId as FOX_ID
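The values above can be derived rather than typed by hand. The following Python sketch assumes a label that covers whole tokens within a single cell; the helper name span_label and its parameters are hypothetical, while the keys and the resulting values match the Fox label described above.

```python
def span_label(label_id, tokens, start_token, end_token, layer, item_id,
               line=0, index=0):
    """Build a span label covering tokens start_token..end_token (inclusive).

    Hypothetical helper for labels that start and end in the same cell.
    """
    return {
        "id": label_id,
        "type": "SPAN",
        "startCellLine": line,
        "startCellIndex": index,
        "startTokenIndex": start_token,
        "startCharIndex": 0,  # first character of the start token
        "endCellLine": line,
        "endCellIndex": index,
        "endTokenIndex": end_token,
        # Character indices are 0-based and inclusive, so the last
        # character of a token of length n has index n - 1.
        "endCharIndex": len(tokens[end_token]) - 1,
        "layer": layer,
        "counter": 0,  # only one label at this position
        "labelSetItemId": item_id,
    }

tokens = "The quick brown fox jumps over the lazy dog".split()
fox = span_label(1, tokens, start_token=1, end_token=3, layer=0, item_id="FOX_ID")
print(fox["endTokenIndex"], fox["endCharIndex"])  # prints "3 2"
```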
The JSON for the second label (Dog) follows the same idea as the first label (Fox).
The third label (Jump):
originates from the first label (Fox) and ends at the second label (Dog), hence we set:
originId as 1, because the first label has id 1.
destinationId as 2, because the second label has id 2.
startCellLine, startCellIndex, startTokenIndex, startCharIndex to the same values as the first label's (Fox).
endCellLine, endCellIndex, endTokenIndex, endCharIndex to the same values as the second label's (Dog).
comes from the second label set (Verbs, index 1), hence we set:
layer as 1
labelSetItemId as JUMP_ID
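Because an arrow copies its start position from the origin label and its end position from the destination label, it can be built from the two span labels. The Python sketch below is illustrative only: the helper name arrow_label, the arrow id 3, and the Dog label's position (assumed here to cover only the last token, dog) are assumptions not stated in this section.

```python
START_KEYS = ("startCellLine", "startCellIndex", "startTokenIndex", "startCharIndex")
END_KEYS = ("endCellLine", "endCellIndex", "endTokenIndex", "endCharIndex")

def arrow_label(label_id, origin, destination, layer, item_id=""):
    """Build an arrow label from two span labels (hypothetical helper).

    item_id defaults to an empty string, for an arrow without a label.
    """
    arrow = {
        "id": label_id,
        "type": "ARROW",
        "originId": origin["id"],
        "destinationId": destination["id"],
        "layer": layer,
        "counter": 0,
        "labelSetItemId": item_id,
    }
    # Start position comes from the origin, end position from the destination.
    arrow.update({key: origin[key] for key in START_KEYS})
    arrow.update({key: destination[key] for key in END_KEYS})
    return arrow

fox = {"id": 1,
       "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 1, "startCharIndex": 0,
       "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 3, "endCharIndex": 2}
# Assumed position for Dog: the single token "dog" (token index 8).
dog = {"id": 2,
       "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 8, "startCharIndex": 0,
       "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 8, "endCharIndex": 2}

# Verbs is the label set at index 1; the arrow id 3 is an assumption.
jump = arrow_label(3, fox, dog, layer=1, item_id="JUMP_ID")
print(jump)
```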
Example (with bounding box label type)