Project Templates
Exploring Datasaur's pre-built project templates
Datasaur has project templates that help you get started quickly with preconfigured settings. Let's explore each one.
Named entity recognition
Named entity recognition (NER), also referred to as named entity extraction, is the process of identifying and extracting entities within text. These entities are classified into predefined categories representing real-world concepts, such as people, organizations, and locations.

Named entities are not always single tokens. In the example above, University of London is an entity that has three tokens. Multi-token entities are common in NER.
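To make the multi-token idea concrete, here is a minimal sketch of how an entity that spans several tokens could be represented. The sentence, the ORG label name, and the spans_to_bio helper are illustrative assumptions, not Datasaur's internal or export format. The sketch stores each entity as a token span and converts it to the widely used BIO tagging scheme.

```python
# Hypothetical example sentence, already split into tokens.
tokens = ["She", "studied", "at", "University", "of", "London", "."]

# Each entity is (start_token, end_token_exclusive, label).
# "ORG" is an illustrative label name, not a required one.
entities = [(3, 6, "ORG")]


def spans_to_bio(n_tokens, spans):
    """Convert token spans to BIO tags: B- opens an entity, I- continues it, O is outside."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


bio_tags = spans_to_bio(len(tokens), entities)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

The three tokens of University of London share one span, so they end up as one B- tag followed by two I- tags.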
Part-of-speech
Part-of-speech (POS) tagging is the process of labeling each word in a text with its grammatical role based on the sentence context. Identifying the role of each word helps models understand sentence structure and meaning.
You can define custom parts of speech for labeling.

Note: A common industry standard for English POS tagging is based on the parts of speech defined by the Penn Treebank Project.
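As a rough illustration, the sketch below pairs each token of a hypothetical sentence with a Penn Treebank tag and checks the labels against a tag set. Only a small subset of the Penn Treebank tags is listed, and the unknown_tags helper is an assumption made for this example; a custom label set could be swapped in the same way.

```python
# A small subset of Penn Treebank tags with short descriptions.
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    ".": "sentence-final punctuation",
}

# Hypothetical labeled sentence: one (token, tag) pair per token.
tagged = [
    ("Sherlock", "NNP"),
    ("became", "VBD"),
    ("a", "DT"),
    ("detective", "NN"),
    (".", "."),
]


def unknown_tags(tagged_tokens, tag_set):
    """Return the (token, tag) pairs whose tag is not in the allowed tag set."""
    return [(tok, tag) for tok, tag in tagged_tokens if tag not in tag_set]


print(unknown_tags(tagged, PENN_TAGS))
```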
Coreference
Coreference resolution is the task of identifying all expressions in a text that refer to the same entity. It is widely used in applications such as information extraction, text summarization, question answering, and machine translation.

Coreferring expressions can include nouns, noun phrases, proper nouns, and pronouns. For example, his is a pronoun that refers to Sherlock, a proper noun. Coreference resolution reduces ambiguity in a document by linking related expressions. It typically involves labeling the relevant phrases first, then drawing arrows to connect them.
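The label-then-link workflow can be sketched as data: mentions are labeled spans, and each arrow links one mention to another. The sentence, the mention ids, and the build_clusters helper below are hypothetical and only illustrate the idea; they do not describe Datasaur's export format. Linked mentions are grouped into clusters, where each cluster stands for one real-world entity.

```python
# Hypothetical example sentence, already split into tokens.
tokens = ["Sherlock", "picked", "up", "his", "violin", "."]

# Step 1: label the mentions as (start_token, end_token_exclusive) spans.
mentions = {"m1": (0, 1), "m2": (3, 4)}

# Step 2: draw arrows between mentions, here from "his" to "Sherlock".
links = [("m2", "m1")]


def build_clusters(mentions, links):
    """Group mentions connected by links into coreference clusters."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers until reaching the cluster representative.
        while parent[m] != m:
            m = parent[m]
        return m

    for source, target in links:
        parent[find(source)] = find(target)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())


clusters = build_clusters(mentions, links)
cluster_text = [
    [" ".join(tokens[start:end]) for start, end in (mentions[m] for m in cluster)]
    for cluster in clusters
]
print(cluster_text)
```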
Dependency
Dependency parsing is the task of identifying and labeling grammatical relationships between words in a sentence. Each relation consists of a head and a dependent. In the example below, Sherlock is the subject of the verb became.

For dependency labeling, common industry standards include Universal Dependencies and the Stanford typed dependencies manual.
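A head-dependent relation can be written down as a small record. The sketch below is a partial, hypothetical annotation of one sentence; the Arc class and the dependents_of helper are assumptions made for this example, while the relation names nsubj (nominal subject) and det (determiner) come from Universal Dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    head: int        # index of the head token
    dependent: int   # index of the dependent token
    relation: str    # relation label, e.g. "nsubj"


# Hypothetical example sentence, already split into tokens.
tokens = ["Sherlock", "became", "a", "detective"]

# Partial annotation: only two of the sentence's relations are shown.
arcs = [
    Arc(head=1, dependent=0, relation="nsubj"),  # Sherlock is the subject of became
    Arc(head=3, dependent=2, relation="det"),    # a is the determiner of detective
]


def dependents_of(head_index, arcs, tokens):
    """List (dependent token, relation) pairs attached to the given head."""
    return [(tokens[a.dependent], a.relation) for a in arcs if a.head == head_index]


print(dependents_of(1, arcs, tokens))
```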
Document labeling
Document labeling is the task of classifying and categorizing data. Unlike span labeling, the labeler answers questions about the text instead of labeling specific spans within it. This approach is commonly used for tasks such as sentiment analysis or applying metadata to a document.
In Datasaur, document labeling can be done at the row level or the document level. The row-level approach uses the row labeling project type.

In document labeling, you can label images, .pdf files, and even .gif files.

When creating projects via the Document labeling project template, the following settings are applied by default.
Any uploaded questions are set as required.
Answer sets can be edited. If a labeler types an answer that doesn't match an existing label, they can click Add <your answer> as a new answer.

Optical Character Recognition (OCR)
Optical Character Recognition (OCR) is the task of converting text in images or scanned documents into machine-readable text. Common OCR use cases include invoices, receipts, and legal documents. Watch this YouTube video for instructions on creating an OCR project.
In the video, the user creates an OCR project by uploading the original document with its corresponding transcription as a .txt file. Alternatively, your workspace can integrate with an OCR provider of your choice.
To let us know which provider you want to use, contact us at [email protected]. Once the provider is integrated, it generates transcriptions automatically during project creation, specifically in step 2.

Note: When uploading OCR document pairs, ensure that each document and its corresponding transcription have the same file name, differing only in the extension. For example, SEC.pdf and SEC.txt.
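Before uploading, it can help to check the pairing locally. The sketch below is not part of Datasaur; find_unpaired is a hypothetical helper that assumes transcriptions always end in .txt and that names must match exactly, including letter case. It returns every document that has no transcription with the same base name.

```python
from pathlib import PurePath


def find_unpaired(filenames):
    """Return documents that lack a .txt transcription with the same base name."""
    paths = [PurePath(name) for name in filenames]
    # Base names (without extension) of all transcription files.
    transcribed = {p.stem for p in paths if p.suffix.lower() == ".txt"}
    return sorted(
        str(p)
        for p in paths
        if p.suffix.lower() != ".txt" and p.stem not in transcribed
    )


files = ["SEC.pdf", "SEC.txt", "invoice.png"]
print(find_unpaired(files))
```

An empty result means every document in the list has a matching transcription.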
Create your own template
You can view the built-in project templates on the Projects page. You can also create your own template by following these steps:
Create a project with all the desired settings.
On the Projects page, click the triple-dot menu on the project and select Save as template.
Rename the template as desired and optionally upload an icon.

If you are in a team workspace and need a custom script for project creation, a saved template can save you time.
