Project Templates

Exploring Datasaur's pre-built project templates

Just like Microsoft Word or PowerPoint have templates, Datasaur has project templates that allow you to quickly get started with pre-built settings. Let's explore each one in turn.

Named Entity Recognition

Named Entity Recognition (NER), also referred to as Named Entity Extraction, is the process of identifying and extracting specific entities in a text. These entities are classified into predefined categories that represent real-world objects, such as names, organizations, and locations.

Named entities are not always single tokens. In the example above, The Strand Magazine is a single entity that spans multiple tokens. Multi-token labels are common in NER.
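To make multi-token entities concrete, here is a minimal sketch of one common way to represent them: BIO tags, where B- marks the first token of an entity, I- marks the following tokens, and O marks everything else. The sentence, the (start, end, label) span format, and the PER/LOC label names are assumptions made for illustration; this is not Datasaur's export format.

```python
from typing import List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert token-level entity spans to BIO tags (one tag per token)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the same entity
    return tags


# Made-up example sentence with one two-token entity and one single-token entity.
tokens = ["Sherlock", "Holmes", "lives", "in", "London", "."]
spans = [(0, 2, "PER"), (4, 5, "LOC")]

bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Here "Sherlock Holmes" is one entity covering two tokens, so it receives a B-PER tag followed by an I-PER tag.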

Part of Speech

Part of Speech (POS) tagging is the process of labeling each word in a text with its part of speech based on the context of the sentence. Knowing the role of each word is useful for training a model to understand the structure and meaning of a sentence.

You are welcome to define your own parts of speech for labeling.

Note: one standard industry practice for English POS is to follow the parts of speech as defined by the Penn Treebank Project.
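As a rough illustration, the sketch below pairs each word of a made-up sentence with a Penn Treebank tag and checks that every tag belongs to the chosen tag set. Only four tags are listed here, the sentence is hypothetical, and the helper name unknown_tags is an assumption for this example rather than anything provided by Datasaur.

```python
# A small, non-exhaustive subset of the Penn Treebank tag set.
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "NN": "noun, singular or mass",
}

# Hypothetical labeled sentence: one (token, tag) pair per word.
tagged = [
    ("Sherlock", "NNP"),
    ("became", "VBD"),
    ("a", "DT"),
    ("detective", "NN"),
]


def unknown_tags(pairs, tagset):
    """Return the tags used in `pairs` that are missing from `tagset`."""
    return sorted({tag for _, tag in pairs if tag not in tagset})


for token, tag in tagged:
    print(f"{token:<10} {tag:<4} {PENN_TAGS[tag]}")

print("Unknown tags:", unknown_tags(tagged, PENN_TAGS))
```

If you define your own parts of speech, the same check works with your custom tag set in place of PENN_TAGS.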

Coreference

Coreference resolution is the task of identifying all expressions that refer to the same entity in a text. This kind of task can be beneficial for many applications, including information extraction, text summarization, question answering, and machine translation.

Coreference resolution usually includes nouns, noun phrases, proper nouns, and pronouns. We can see that his is a pronoun and refers to Sherlock Holmes, which is a noun phrase. Coreference resolution helps eliminate ambiguity in deciphering a document. It often requires labeling phrases first, then drawing arrows from one to another.
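The "label phrases, then draw arrows" workflow can be sketched as a small data structure: each labeled phrase is a mention, each arrow is a link between two mentions, and mentions connected by links form one coreference cluster. The mentions, links, and function name below are hypothetical, and the grouping uses a generic union-find approach rather than any Datasaur-specific logic.

```python
from collections import defaultdict

# Hypothetical mentions: id -> text of a labeled phrase.
mentions = {0: "Sherlock Holmes", 1: "his", 2: "the detective", 3: "Watson", 4: "he"}

# Hypothetical arrows drawn by a labeler: (from_mention, to_mention).
links = [(1, 0), (2, 1), (4, 3)]


def build_clusters(mention_ids, links):
    """Group mentions connected by links into clusters using union-find."""
    parent = {m: m for m in mention_ids}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving keeps lookups short
            m = parent[m]
        return m

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for m in mention_ids:
        clusters[find(m)].append(m)
    return sorted(sorted(c) for c in clusters.values())


clusters = build_clusters(mentions, links)
for cluster in clusters:
    print([mentions[m] for m in cluster])
```

Each printed list is one set of expressions that refer to the same entity, regardless of the order in which the arrows were drawn.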

Dependency

Dependency parsing is the task of labeling relations between words. Each relation connects a head to a dependent. In the example below, Sherlock is the subject of the verb became, so became is the head and Sherlock is the dependent.

Note: for Dependency label sets, common industry practice is to follow Universal Dependencies or the Stanford typed dependencies manual.

Sample file: Datasaur sample - DEP.conllu (3KB)
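For readers unfamiliar with the .conllu extension, here is a minimal sketch of reading head and dependent relations from CoNLL-U text, the tab-separated format used by Universal Dependencies. The sentence and its annotation are made up for illustration and are not the contents of the Datasaur sample file; the function names are also assumptions.

```python
# The ten tab-separated columns of a CoNLL-U token line.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Hypothetical sentence in CoNLL-U; NOT the contents of the Datasaur sample file.
SAMPLE = "\n".join([
    "# text = Sherlock became a detective",
    "1\tSherlock\tSherlock\tPROPN\tNNP\t_\t2\tnsubj\t_\t_",
    "2\tbecame\tbecome\tVERB\tVBD\t_\t0\troot\t_\t_",
    "3\ta\ta\tDET\tDT\t_\t4\tdet\t_\t_",
    "4\tdetective\tdetective\tNOUN\tNN\t_\t2\txcomp\t_\t_",
])


def parse_conllu(text):
    """Parse one CoNLL-U sentence into a list of dicts, one per token."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue  # this sketch ignores multiword-token ranges and empty nodes
        rows.append(dict(zip(CONLLU_FIELDS, cols)))
    return rows


def dependency_arcs(rows):
    """Return (head_form, deprel, dependent_form) triples; head 0 means ROOT."""
    forms = {row["id"]: row["form"] for row in rows}
    return [(forms.get(row["head"], "ROOT"), row["deprel"], row["form"])
            for row in rows]


arcs = dependency_arcs(parse_conllu(SAMPLE))
for head, relation, dependent in arcs:
    print(f"{head} --{relation}--> {dependent}")
```

The first arc links the head became to its dependent Sherlock with the nsubj (nominal subject) relation, which matches the subject example described above.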

Document Labeling

Document labeling is the task of classifying and categorizing data. This type of labeling is different from the types discussed above, because the labeler is answering questions about the text, rather than labeling spans of tokens within the text. It can be beneficial for projects such as sentiment analysis or applying metadata to a document.

In Datasaur, document labeling can be done on a per-row basis or on a per-document basis.

Document labeling is not limited to text: you can also label images, PDF files, and even GIFs.

Note: when you create a project via the DOC project template, the following settings apply by default.

Any uploaded questions will be set as required.
Answer sets can be edited. If the labeler types something in the text box that does not match an existing label, they can click "Add <your answer> as a new answer".

Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is the task of translating text inside images or scanned documents into machine-readable text data. Common applications of OCR include invoices, receipts, and legal documents.

Watch this video on YouTube for instructions on how to create an OCR project. In the video, the user creates an OCR project by uploading the original document together with its corresponding transcription as a .txt file.

Alternatively, your workspace can integrate with an OCR provider. Datasaur is agnostic about which OCR provider you believe is best for your documents. Please let us know which OCR provider you have in mind by contacting us at [email protected]. Once an OCR provider is integrated, it creates the transcriptions during the Project Creation process (specifically in step 2).

Note: when uploading pairs of OCR documents, please make sure each image file and its corresponding transcription have the same file name apart from the extension. For example, SEC.pdf and SEC.txt.
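Before uploading, it can help to check locally that every document has a matching transcription. The sketch below pairs files by name without the extension. The list of document extensions, the exact (case-sensitive) name comparison, and the function name are assumptions for this example, not a description of how Datasaur matches files.

```python
from pathlib import PurePath

# Assumed extensions for this sketch; adjust them to the files you upload.
DOCUMENT_EXTS = {".pdf", ".png", ".jpg", ".jpeg"}
TRANSCRIPT_EXT = ".txt"


def pair_ocr_files(filenames):
    """Match documents to transcriptions that share the same file stem.

    Returns (pairs, unmatched): pairs maps each document to its transcription,
    and unmatched lists the files that have no counterpart.
    """
    docs, transcripts = {}, {}
    for name in filenames:
        path = PurePath(name)
        ext = path.suffix.lower()
        if ext in DOCUMENT_EXTS:
            docs[path.stem] = name
        elif ext == TRANSCRIPT_EXT:
            transcripts[path.stem] = name

    pairs = {docs[stem]: transcripts[stem]
             for stem in sorted(docs.keys() & transcripts.keys())}
    unmatched = sorted(
        [name for stem, name in docs.items() if stem not in transcripts]
        + [name for stem, name in transcripts.items() if stem not in docs]
    )
    return pairs, unmatched


pairs, unmatched = pair_ocr_files(["SEC.pdf", "SEC.txt", "invoice.png", "notes.txt"])
print("Pairs:", pairs)
print("Unmatched:", unmatched)
```

In this made-up file list, SEC.pdf and SEC.txt form a pair, while invoice.png and notes.txt are reported as unmatched so they can be renamed before upload.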

Create your own template!

After signing up, every new user sees Datasaur's built-in project templates in their workspace. You can also create your own template!

The first step is to create a project with all of its settings configured. Then, click the triple-dot menu on the project and choose Save as template. You can rename the template and even upload an avatar.

If you are in a team workspace and need a custom script for project creation, this will save you time!
