Supported Formats
Sources: Wikipedia and IANA
This page details all the supported Datasaur formats, provides examples for each format and clarifies expected file structure where appropriate. Note: Through file transformers we can customize the output format.
A TXT file is a simple file format that contains unformatted text and can be opened and edited with any basic text editor. It is commonly used for storing and exchanging data, code, and other textual information.
A TSV (tab-separated values) file is a simple text format for storing data in a tabular structure. A TSV file encodes a number of records that may contain multiple fields.
Each record is represented as a single line.
Each field value is represented as text.
Fields in a record are separated from one another by the tab character.
Note that because the tab is a special character for this format, fields that contain tabs are not allowed in this encoding.
The header (first) line of this encoding contains the name of each field, separated by tabs.
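The rules above can be sketched as a minimal reader in Python. The function name read_tsv and the sample columns are illustrative assumptions, not part of Datasaur. The sketch treats the first line as the header and splits every line on tabs only.

```python
def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header names."""
    lines = text.splitlines()
    header = lines[0].split("\t")      # the first line holds the field names
    records = []
    for line in lines[1:]:
        if not line:
            continue                   # skip blank lines
        values = line.split("\t")      # fields cannot contain tabs, so a plain split is enough
        records.append(dict(zip(header, values)))
    return records


# Hypothetical sample data, for illustration only
sample = "name\tcity\nAda\tLondon\nLinus\tHelsinki\n"
rows = read_tsv(sample)
print(rows[0]["city"])  # London
```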
Example
IOB (inside, outside, beginning) is a common format for labeling tokens in computational linguistics (e.g., named-entity recognition). An IOB file is also a .tsv file, but it conforms to the following rules:
The B- prefix before a tag indicates that the tag is the beginning of a chunk.
The I- prefix before a tag indicates that the tag is inside a chunk.
The B- tag is used only when a tag is followed by a tag of the same type without O tokens between them.
The O tag indicates that a token does not belong to a chunk.
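As an illustration of how these prefixes group tokens into chunks, here is a minimal Python sketch. The function name iob_to_chunks, the sample sentence, and the PER/LOC tags are assumptions made for this example. The sketch starts a new chunk on every B- tag, or whenever the tag type changes, and closes the open chunk on O.

```python
def iob_to_chunks(tokens, tags):
    """Group tokens into (label, text) chunks based on their IOB tags."""
    chunks = []
    current_label, current_tokens = None, []

    def flush():
        # Close the chunk that is currently open, if any
        nonlocal current_label, current_tokens
        if current_tokens:
            chunks.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, label = tag.partition("-")
        # B- always opens a new chunk; I- opens one only when the label changes
        if prefix == "B" or label != current_label:
            flush()
            current_label = label
        current_tokens.append(token)
    flush()
    return chunks


tokens = ["Alex", "is", "going", "to", "Los", "Angeles"]
tags = ["B-PER", "O", "O", "O", "B-LOC", "I-LOC"]
print(iob_to_chunks(tokens, tags))
# [('PER', 'Alex'), ('LOC', 'Los Angeles')]
```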
Example
A CSV (comma-separated values) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.
Example
💡 For now, a row-based project using the CSV format does not support answers that contain a semicolon (;), because the semicolon is treated as a separator between multiple answers. For example, the answer "She brings some flowers: rose; sunflower; and daisy." is interpreted as three separate answers: "She brings some flowers: rose", "sunflower", and "daisy".
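To see why a semicolon inside an answer is a problem, consider the following minimal Python sketch. It assumes a plain split on the semicolon; Datasaur's actual parsing logic is not shown on this page, and split_answers is a hypothetical helper.

```python
def split_answers(cell):
    """Split one CSV cell into answers on ';', trimming surrounding spaces."""
    # Assumption: every ';' acts as a separator, so it cannot appear inside an answer
    return [part.strip() for part in cell.split(";") if part.strip()]


print(split_answers("red; green; blue"))  # ['red', 'green', 'blue']
```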
XLS and XLSX are well-known formats for Microsoft Excel documents, both introduced by Microsoft. XLS is the older format used in older versions of Excel, while XLSX is the newer format and the default in more recent versions. Both formats allow users to input, organize, and analyze data in rows and columns. They also support features such as formulas, charts, and graphs. XLSX is a more efficient format that offers better data recovery and larger file size limits.
Example
JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value).
A JSON file may contain the following data structures:
An object is an unordered set of name/value pairs.
An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon), and the name/value pairs are separated by , (comma).
An array is an ordered collection of values.
An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).
A value can be a string in double quotes, a number, true, false, null, an object, or an array. These structures can be nested.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. A character is represented as a single character string. A string is very much like a C or Java string.
A number is like a C or Java number, except that the octal and hexadecimal formats are not used.
Whitespace can be inserted between any pair of tokens. Excepting a few encoding details, that completely describes the language.
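The structures described above map directly onto Python's built-in types when a document is parsed with the standard json module. The sample document below is made up for illustration.

```python
import json

# A hypothetical JSON document containing every value type described above
document = """
{
  "name": "Datasaur",
  "tags": ["nlp", "labeling"],
  "count": 3,
  "ratio": 0.5,
  "active": true,
  "parent": null
}
"""

data = json.loads(document)

print(type(data).__name__)             # dict (a JSON object)
print(type(data["tags"]).__name__)     # list (a JSON array)
print(data["active"], data["parent"])  # True None
```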
Datasaur Schema is a customized JSON format designed to fit all available project types in the Datasaur app. This format can be used for mixed project types, e.g. Token + Document labeling. You will receive all labels and answers combined in one exported file.
A Datasaur Schema contains the following data structures.
version: version number of Datasaur schema.
Rows Field
content: the text of the sentence.
tokens: the tokens form of the sentence.
metadata: contains additional information for a line. You can find the structure and configuration options for metadata here.
labelerInfo: the information about the labeler.
id: the unique identifier of a labeler (each labeler has a different id).
email: email that labeler used when signing in.
displayName: the display name of the email.
labelSets: contains all the label items that you used for the project.
index: the position of the label set in UI
labelItems: an array of labelItems for a label set
id: id of the label set item
labelName: the displayed name of the label set item
parentId: id of the parent label set item
color: the color of the label set item
labels: an array of labels for the document. Labels consist of spanLabels, arrowLabels, boundingBoxLabels, and timeLabels.
spanLabels are all labels that are applied directly to a token or sentence.
arrowLabels are all labels that are applied on top of an arrow.
boundingBoxLabels are all labels that are applied on top of OCR documents.
timeLabels are all labels that are applied on top of an audio waveform.
Below are all attributes under labels.
id: identifier of the applied label.
labeledBy:
CONFLICT: if it has not been resolved
REVIEWER: if it has been resolved
AUTO: if it has been resolved by meeting the consensus
LABELER: if it comes from a labeler
labeledByUserId: the user id of a reviewer
acceptedByUserId: the user id of a reviewer who accepts the label. It will be null if no user accepts it manually.
rejectedByUserId: the user id of a reviewer who rejects the label. It will be null if there's no user who rejects it manually.
status: label status. It can be REJECTED if it is rejected by a reviewer, or ACCEPTED if it is accepted by a reviewer.
hashCode: Datasaur's code to represent label information.
For example, SPAN:gpe:0:0:0:4:0:0:0:4:3:0:undefined:undefined.
Below is the explanation: <type>:<label set item id>:<layer or label set index>:<start cell line>:<start cell index>:<start token index>:<start char index>:<end cell line>:<end cell index>:<end token index>:<end char index>:<counter>
textPosition: the exact location of the labeled text.
start: starting text position
row: line number
column: column number. For token-based projects, it is always 0
tokenIndex: token number, relative to the row
charIndex: character index position, relative to the token
end: ending text position
row: line number
column: column number. For token-based projects, it is always 0
tokenIndex: token number, relative to the row
charIndex: character index position, relative to the token
Arrow label type specific fields
originId: origin id of an arrow label
destinationId: destination id of an arrow label
Bounding Box label type specific fields
coordinates: consists of 4 points, each a pair of x and y values.
Timestamp label type specific fields
startTimeMillis: starting timestamp in milliseconds.
endTimeMillis: ending timestamp in milliseconds.
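The span hashCode layout described above can be unpacked with a short Python sketch. The field names in SPAN_FIELDS and the function parse_span_hash_code are assumptions made for this example. The sketch also assumes that a label set item id never contains a colon, and it ignores the two trailing segments (undefined:undefined in the example), which this page does not describe.

```python
# Hypothetical names for the twelve documented segments, in order
SPAN_FIELDS = [
    "type", "label_set_item_id", "layer",
    "start_cell_line", "start_cell_index", "start_token_index", "start_char_index",
    "end_cell_line", "end_cell_index", "end_token_index", "end_char_index",
    "counter",
]


def parse_span_hash_code(hash_code):
    """Map the first twelve colon-separated segments of a span hashCode to names."""
    segments = hash_code.split(":")
    if segments[0] != "SPAN" or len(segments) < len(SPAN_FIELDS):
        raise ValueError(f"not a span hashCode: {hash_code!r}")
    parsed = dict(zip(SPAN_FIELDS, segments))  # zip drops the undocumented trailing segments
    # Every field after the label set item id is numeric in the documented example
    for name in SPAN_FIELDS[2:]:
        parsed[name] = int(parsed[name])
    return parsed


info = parse_span_hash_code("SPAN:gpe:0:0:0:4:0:0:0:4:3:0:undefined:undefined")
print(info["label_set_item_id"], info["start_token_index"], info["end_char_index"])  # gpe 4 3
```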
comments: contains all comments that you insert for the document.
id: the id of the comment
parentId: the id of the parent comment - this will be filled if the comment thread has replies.
hashCode: Datasaur's code to represent the comment's information, including the value being commented on.
message: the content of the comment
type: the type of comment. It can be SPAN_LABEL, SPAN_TEXT, ARROW_LABEL, or COMMENT
userId: the id of the user who created the comment
createdAt: the time when the user created the comment
documentQuestions: contains the question set that is used for a document-based project.
id: the id of the question
name: default name given per question
description: question text from the question set.
type: type of the question. It can be TEXT, DROPDOWN, HIERARCHICAL_DROPDOWN, NESTED, SLIDER, DATE, TIME, CHECKBOX, or URL.
displayed: states whether the question is shown. True if it is shown in the extension.
parentId: the id of the parent question.
documentAnswerSet: contains the answer from the used question set.
The answers consist of pairs of a documentQuestion ID and its answer. Example: "1": "Good", where 1 is the question ID and Good is the answer to that question.
rowQuestions: contains the question set that is used for a row-based project.
id: the id of the question
name: default name given per question
description: question text from the question set.
type: type of the question. It can be TEXT, DROPDOWN, HIERARCHICAL_DROPDOWN, NESTED, SLIDER, DATE, TIME, CHECKBOX, or URL.
displayed: states whether the question is shown. True if it is shown in the extension.
parentId: the id of the parent question.
rowAnswerSets: contains all answers from a row-based project. Each entry consists of the row number, the rowQuestions IDs, and the answers.
Example: 5 is the row number; 1 is the parent question ID of questions 2 and 3; 2 and 3 are the question IDs; A and B are the answers to 2 and 3.
Examples
Conversational JSON is a format Datasaur created to support conversational or chat-like data. Each file should represent a conversation or chat log containing multiple messages.
Each message object must have the following properties:
content: String content of the message. Required.
speaker: Identifies who said or sent the message. Required.
color: Optional. If provided, Datasaur will use the color to render the speaker’s avatar.
alignment: Required. One of LEFT or RIGHT. Currently, there is no visual difference between them. In the future, we plan to use this information to place the message boxes accordingly.
indent: Required. Integer between 0 and 4. Currently, there is no visual difference between the values. In the future, we plan to use this information to place messages in a thread-like view.
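The property rules above can be checked before upload with a small validator. This is a minimal Python sketch, not Datasaur's own validation: the function name validate_message is hypothetical, and the sketch only checks that speaker is present because its type is not specified here.

```python
def validate_message(message):
    """Return a list of problems found in one message object (empty if valid)."""
    problems = []
    if not isinstance(message.get("content"), str):
        problems.append("content is required and must be a string")
    if "speaker" not in message:
        problems.append("speaker is required")
    if message.get("alignment") not in ("LEFT", "RIGHT"):
        problems.append("alignment must be LEFT or RIGHT")
    indent = message.get("indent")
    # bool is a subclass of int in Python, so it is excluded explicitly
    if not isinstance(indent, int) or isinstance(indent, bool) or not 0 <= indent <= 4:
        problems.append("indent must be an integer between 0 and 4")
    return problems  # color is optional, so it is not checked


# Hypothetical message, for illustration only
message = {"content": "Hello!", "speaker": "Agent", "alignment": "LEFT", "indent": 0}
print(validate_message(message))  # []
```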
Example:
JSON Simplified is an export format for Span labeling projects. This format contains both the text and the labeled spans, along with their character indexes. It is suitable for simpler workflows where each sentence is expected to be self-contained and isolated from the others.
The example below contains the following objects recognized by Datasaur.
text: the sentence.
entities: an array of the labels applied
text: the token
type: the label applied.
start_idx: the starting character position of the labeled token.
The character position uses a zero-based index.
end_idx: the last character position + 1 (because end_idx does not include the last character).
The character position uses a zero-based index.
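Because end_idx is exclusive, the two indexes map directly onto a slice. The minimal Python sketch below uses a made-up record with the fields listed above; the sentence and the PER and LOC label names are assumptions, and the sketch assumes the indexes are counted from the start of the text field.

```python
# Hypothetical record, for illustration only
record = {
    "text": "Barack Obama visited Paris.",
    "entities": [
        {"text": "Barack Obama", "type": "PER", "start_idx": 0, "end_idx": 12},
        {"text": "Paris", "type": "LOC", "start_idx": 21, "end_idx": 26},
    ],
}

for entity in record["entities"]:
    # end_idx is exclusive, so it can be used as the slice end without adjustment
    span = record["text"][entity["start_idx"]:entity["end_idx"]]
    print(entity["type"], span)
# PER Barack Obama
# LOC Paris
```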
JSON (Simplified) export format limitations:
Cannot export arrow labels
Cannot export labels that span multiple sentences
Example
JSON Advanced is a proprietary Datasaur format designed in collaboration with our users to capture all possible data. This format is commonly used for partial token labeling projects. You can also use it when exporting token-based projects with arrows, such as coreference and dependency.
A JSON_ADVANCED file may contain the following data structures:
Sentences field
id: the sentence position.
content: the text of the sentence.
tokens: the tokens form of the sentence.
labels
l: the label applied.
layer: the layer position of the labels. This field is reserved for projects that label with multiple tag sets at once. For now, you can disregard this field; it is always set to 0.
id: the unique identifier of a label.
If the id has 9 segments, this indicates a span label. For example, INNM0ViFwo8LluMTaTIK9:0:0:14:0:0:18:6:0, and here's the explanation: <label set item id>:<layer>:<sidS>:<s>:<charS>:<sidE>:<e>:<charE>:<index>.
If the id has 21 segments, this indicates an arrow label. For example, tfc1FkbbEk9fOLx6haR1s:0:INNM0ViFwo8LluMTaTIK9:0:0:14:0:0:18:6:0:Oq_VuB0s_N7D8ZY0rgYsg:0:0:0:0:0:2:5:0:0, and here's the explanation: <label set item id>:<arrow layer>:<….. origin id>:<….destination id>:<arrow index>.
hashCode: Datasaur's code to represent label information.
Span label. For example, SPAN:gpe:0:0:0:4:0:0:0:4:3:0:undefined:undefined. Below is the explanation:
<type>:<label set item id>:<layer or label set index>:<start cell line>:<start cell index>:<start token index>:<start char index>:<end cell line>:<end cell index>:<end token index>:<end char index>:<counter>
Arrow label. For example, ARROW:dyC-o1HBnn49dcqDSphmJ:1:0:0:0:0:0:0:10:6:0:SPAN:geo:0:0:0:0:0:0:0:0:4:0:undefined:undefined:SPAN:geo:0:0:0:10:0:0:0:10:6:0:undefined:undefined. Below is the explanation:
<type>:<label set item id>:<layer or label set index>:<start cell line>:<start cell index>:<start token index>:<start char index>:<end cell line>:<end cell index>:<end token index>:<end char index>:<counter>:<span label: origin>:<span label: destination>
documentId: the id of document.
sidS, sidE: the sentence starting and ending position of a label in 0-based index. In Datasaur, it is possible that a label spans across sentences.
s: the token starting position of a label in the starting sentence in 0-based index.
e: the token ending position of a label in the ending sentence in 0-based index.
charS: the character starting position of a label in the starting token in 0-based index.
charE: the character ending position of a label in the ending token in 0-based index.
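The segmented label id described above can be split into named fields. The Python sketch below is an illustration under two assumptions: that a label set item id never contains a colon, and that the origin and destination parts of an arrow id are themselves nine-segment span ids (which is consistent with 2 + 9 + 9 + 1 = 21 segments, but is not stated explicitly on this page). The function names are hypothetical.

```python
SPAN_ID_FIELDS = ["label_set_item_id", "layer", "sidS", "s", "charS", "sidE", "e", "charE", "index"]


def parse_span_segments(segments):
    """Name the nine segments of a span id; all but the first are integers."""
    parsed = dict(zip(SPAN_ID_FIELDS, segments))
    for name in SPAN_ID_FIELDS[1:]:
        parsed[name] = int(parsed[name])
    return parsed


def parse_label_id(label_id):
    """Classify a label id as a span (9 segments) or an arrow (21 segments)."""
    segments = label_id.split(":")
    if len(segments) == 9:
        return {"kind": "span", **parse_span_segments(segments)}
    if len(segments) == 21:
        return {
            "kind": "arrow",
            "label_set_item_id": segments[0],
            "arrow_layer": int(segments[1]),
            # Assumption: origin and destination are embedded nine-segment span ids
            "origin": parse_span_segments(segments[2:11]),
            "destination": parse_span_segments(segments[11:20]),
            "arrow_index": int(segments[20]),
        }
    raise ValueError(f"unexpected number of segments: {len(segments)}")


span = parse_label_id("INNM0ViFwo8LluMTaTIK9:0:0:14:0:0:18:6:0")
print(span["kind"], span["s"], span["e"], span["charE"])  # span 14 18 6
```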
metadata: additional information for a cell. You can find the structure and configuration options for metadata here.
labelerInfo: the information about the labeler.
id: the unique identifier of a labeler (each labeler has a different id).
email: email that labeler used when signing in.
displayName: the display name of the email.
labelSets: contains all the label items that you used for the project.
index: the position of the label set in UI
labelItems: an array of labelItems for a label set
id: id of the labelSetItem
labelName: the displayed name of the label set item
parentId: id of the parent label set item
color: the color of the label set item
labels: an array of labels for the document
labelText: label content for a row-based project. It is null for projects other than row-based projects.
id: identifier of the applied label.
documentId: identifier of the document where the label is applied.
startCellLine: starting line sentence position
startCellIndex: starting line column position
startTokenIndex: starting token index position
startCharIndex: starting character index position (relative to the token; it restarts from 0 whenever the token index increments)
endCellLine: ending line sentence position
endCellIndex: ending line column position
endTokenIndex: ending token index position
endCharIndex: ending character index position
layer: the layer where the token is positioned
counter: distinguishes labels with the same name that are placed multiple times in the same position; starts from 0
type: the type of label. It can be SPAN, ARROW, or BOUNDING_BOX
createdAt:
Labeler: the time the label was applied
Reviewer: the time the label was accepted
updatedAt: last update timestamp on the label
Review related fields
acceptedByUserId: the user id of a reviewer who accepts the label. It will be null if no user accepts it manually.
rejectedByUserId: the user id of a reviewer who rejects the label. It will be null if no user rejects it manually.
labeledByUserId: the user id of a reviewer
labeledBy:
CONFLICT if it has not been resolved
REVIEWER if it has been resolved
AUTO if it has been resolved by meeting the consensus
Arrow label type specific fields
originId: origin id of an arrow label
originNumber: auto increment ID for origin
destinationId: destination id of an arrow label
destinationNumber: auto increment ID for destination
Bounding box label type specific fields
pageIndex: index of the page if the document contains multiple pages
nodeCount: total number of the bounding box points
x0: x coordinate of top left position of the bounding box
y0: y coordinate of top left position of the bounding box
x1: x coordinate of top right position of the bounding box
y1: y coordinate of top right position of the bounding box
x2: x coordinate of bottom right position of the bounding box
y2: y coordinate of bottom right position of the bounding box
x3: x coordinate of bottom left position of the bounding box
y3: y coordinate of bottom left position of the bounding box
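The four corner points above can be reduced to a position and size when a rectangle is needed. This minimal Python sketch assumes image-style coordinates in which y grows downward, so the smallest y is the top edge; the function name bounding_box_extent and the sample values are made up for illustration.

```python
def bounding_box_extent(label):
    """Return (left, top, width, height) covering the four corner points."""
    xs = [label[f"x{i}"] for i in range(4)]
    ys = [label[f"y{i}"] for i in range(4)]
    left, top = min(xs), min(ys)  # assumption: y grows downward, so min(ys) is the top
    return left, top, max(xs) - left, max(ys) - top


# Hypothetical label: top left, top right, bottom right, bottom left
label = {"x0": 10, "y0": 20, "x1": 110, "y1": 20, "x2": 110, "y2": 60, "x3": 10, "y3": 60}
print(bounding_box_extent(label))  # (10, 20, 100, 40)
```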
pages: an array of page information for OCR project type
pageIndex: index of the page if the document contains multiple pages
pageHeight: original page height in pixels
pageWidth: original page width in pixels
comments
id: the id of the comment
parentId: the id of the parent comment - this will be filled if the comment thread has replies.
hashCode: Datasaur's code to represent the comment's information, including the value being commented on
message: the content of the comment
type: the type of comment. It can be SPAN_LABEL, SPAN_TEXT, ARROW_LABEL, or CELL_LABEL
userId: the id of the user who created the comment
createdAt: the time when the user created the comment
Example (token-based with arrow)