Span Based
Span-based
In span-based labeling, you label individual tokens or spans of tokens. Span-based labeling is well suited to projects such as named entity recognition (NER) and part-of-speech (POS) tagging. Here is what you need to know before labeling your project. (See this YouTube video for a visual guide on span-based labeling.)
To begin labeling, use your pointer to select a word/token in your dataset. A list of your labels will appear, and you can select a label with your pointer. However, Datasaur is designed for you to be a power user, so there are more efficient ways to label manually. The following documentation explains how to use hotkeys to apply labels and how to apply multiple labels to the same token span.
The label box will appear when you click on the tokens. You can click manually on the labels or use the corresponding keyboard shortcuts by typing 1, 2, 3, or 4.
Due to a limited number of numerals on the keyboard, keyboard shortcuts are only available for the first 9 labels. Do you have more than 9 labels? Read the next section!
If you have a long list of labels, you can search for a specific label in the label box by typing part of its name. In the following example, you could begin to type "date" and then immediately select the label with the corresponding hotkey: "1".
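The search-then-hotkey flow can be pictured with a small sketch. This is not Datasaur's implementation. The function names and label list are hypothetical, and it assumes that hotkeys 1-9 refer to positions in the currently visible (filtered) list, which is what the "date" example suggests.

```python
def filter_labels(labels, query):
    """Return the labels whose name contains the typed text (case-insensitive)."""
    q = query.strip().lower()
    return [label for label in labels if q in label.lower()]


def label_for_hotkey(visible_labels, key):
    """Map a numeric hotkey ("1" to "9") to the label at that position, if any."""
    if len(key) != 1 or key not in "123456789":
        return None  # only the first 9 positions have a hotkey
    index = int(key) - 1
    return visible_labels[index] if index < len(visible_labels) else None


# Hypothetical label set for illustration only.
labels = ["PERSON", "ORGANIZATION", "LOCATION", "DATE", "DATE RANGE", "BOOK TITLE"]

visible = filter_labels(labels, "date")
print(visible)                         # ['DATE', 'DATE RANGE']
print(label_for_hotkey(visible, "1"))  # DATE
```

Typing narrows the list first, so a label that would otherwise sit beyond position 9 can still be reached with a single keystroke.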
You can apply multiple labels to the same token span. Here are three methods:
1) Apply a label to a span. Select the same span and hold shift while you select an additional label. You can keep applying additional labels as long as you hold down shift when selecting the additional label.
2) The second way is to use keyboard shortcuts. Select the span, use up and down to find the right label, then press shift + enter or shift + return.
3) We have a new feature that enables you to apply multiple labels to the same token without having to hold down shift. See our next discussion below!
We know that sometimes you need to label a token or span with multiple labels. Previously, we supported this capability by having you hold down the SHIFT key while selecting the appropriate labels. But now, we’ve taken things up a notch and made the process even smoother with our new feature!
The “Enable multiple labels selection” setting allows you to select multiple labels and apply them to the same token or span without holding down the SHIFT key. It's a real time-saver and simplifies the labeling process.
As a user, you have the flexibility to choose your preferred method, whether it’s using keyboard shortcuts or directly enabling the setting in the interface.
You can find the setting under File menu > Settings > Personalization.
The Personalization setting can be accessed once the project is created. Please note that each user needs to enable this setting for their own project, as it does not carry over to other users.
Once the setting is enabled, follow these simple steps to apply multiple labels to a token or a span:
Select the desired labels by clicking on them. You can select as many labels as needed. There are other ways to select the labels:
Use keyboard shortcuts (1-9)
Navigate between labels with the arrow keys (up and down) and press Enter
Click the “Apply Labels” button, and all your selected labels will be automatically applied to the token or span.
You can also use the TAB key to navigate between labels until you reach the “Apply labels” button, and then press Enter.
If you ever need to modify the labels applied to a token or a span, follow these steps:
Click on the label you wish to change
Unselect the label you want to change, then select the correct label
Click “Apply labels”
You may also want to add a new label class from the label box. A newly added label class is automatically selected in the label box.
To apply the same labels to several tokens or spans at once, hold CTRL while selecting the tokens or spans. Let’s say there are two spans to which we would like to apply the same labels: Sherlock Holmes and Dr. John Watson.
If Sherlock Holmes and Dr. John Watson don’t have any labels applied, you can simply select the appropriate labels, then click the “Apply labels” button. Those labels will be applied to both spans.
If Sherlock Holmes and Dr. John Watson already have PERSON as the label:
Checkboxes will be reset, so the PERSON label will not be selected
“Reset to mixed labels” button will show, but it’s disabled
Select ORGANIZATION and BOOK TITLE labels as we want to change the current label applied
“Reset to mixed labels” button will be clickable
Clicking this will keep PERSON as the label for both spans
Click “Apply labels”
ORGANIZATION and BOOK TITLE labels will be applied to Sherlock Holmes and Dr. John Watson
If Sherlock Holmes and Dr. John Watson already have PERSON as the label, and you would like to add a new label class:
Checkboxes will be reset, so the PERSON label will not be selected
“Reset to mixed labels” button will show, but it’s disabled
Type a new label, say CHARACTER, then click Add new label
CHARACTER will be automatically selected
“Reset to mixed labels” button will be clickable
Clicking this will keep PERSON as the label for both spans
Click “Apply labels”
CHARACTER will be applied to Sherlock Holmes and Dr. John Watson
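As a rough mental model of the multi-span scenarios above, the sketch below treats each span as holding a set of labels. It is only an illustration, not Datasaur's data model. The function name `apply_labels` is hypothetical, and it assumes that clicking “Apply labels” sets each selected span to exactly the checked labels (so an unchecked PERSON is dropped), which the walkthrough suggests but does not state outright.

```python
def apply_labels(annotations, spans, checked_labels):
    """Set the labels of every selected span to the checked labels."""
    for span in spans:
        # Assumption: unchecked labels are removed when labels are applied.
        annotations[span] = set(checked_labels)
    return annotations


# Spans are keyed by their text here purely for readability.
annotations = {
    "Sherlock Holmes": {"PERSON"},
    "Dr. John Watson": {"PERSON"},
}
selected_spans = ["Sherlock Holmes", "Dr. John Watson"]

apply_labels(annotations, selected_spans, ["ORGANIZATION", "BOOK TITLE"])
print(sorted(annotations["Sherlock Holmes"]))  # ['BOOK TITLE', 'ORGANIZATION']
print(sorted(annotations["Dr. John Watson"]))  # ['BOOK TITLE', 'ORGANIZATION']
```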
A couple of notes on the multiple labels selection feature:
Multiple labels selection is only available for tokens/spans, not for arrows
If you enable multiple labels selection in Personalization while labeling arrows,
the checkbox will still be displayed in the arrow label box
you will not be able to select multiple labels
If you have enabled the “Tokens and token spans should have at most one label” setting in Step 3 (most likely for the Part of Speech use case),
the checkbox will still be displayed in the arrow label box
you will not be able to select multiple labels
You can edit a sentence by double-clicking on its row. While you are editing, the original sentence is shown for reference. Please note that the sentence is tokenized on the server. To apply your changes, do one of the following:
Press shift + enter if you want to use spaces as the token separator instead of Datasaur's default tokenizer.
Click Save after editing.
When you’re making significant edits to sentences or using the Shift+Enter for the tokenizer, particularly at the beginning or end of sentences where labels are already present, it might result in those labels being removed. So, please take extra care when making these kinds of changes to avoid any unintended label removal.
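To see why the choice of tokenizer matters for existing labels, compare a plain space-based split with a punctuation-aware split. The sketch below is illustrative only. Datasaur's default tokenizer is not described here, so `punctuation_aware_tokenize` is a hypothetical stand-in rather than the real behavior.

```python
import re


def whitespace_tokenize(sentence):
    """Split on whitespace only, so punctuation stays attached to words."""
    return sentence.split()


def punctuation_aware_tokenize(sentence):
    """Hypothetical stand-in for a default tokenizer: punctuation becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sentence = "Dr. Watson arrived, finally."

print(whitespace_tokenize(sentence))
# ['Dr.', 'Watson', 'arrived,', 'finally.']
print(punctuation_aware_tokenize(sentence))
# ['Dr', '.', 'Watson', 'arrived', ',', 'finally', '.']
```

The two results differ in both token count and token boundaries, which is one way a label attached to a token position can stop lining up after an edit.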
You can add new lines by right-clicking on the row then choosing Insert Line Above or Insert Line Below.
You can delete lines by right-clicking on the row then choosing Delete Line.
You can delete a label in two ways:
Right-click the label and click Delete label.
Press delete or backspace on your keyboard.
You can delete all the labels on a given sentence by right-clicking anywhere in the sentence and choosing Delete Sentence Labels.
Once you have labeled tokens, you can draw arrows between labels by following these steps:
Click the label
Hold the mouse button down
Drag the pointer to the other label
You can even apply labels to the arrows themselves. In order to do so, double-click the arrow and select the appropriate label.
You can also reverse arrows, delete arrows, and delete labels by right-clicking on the arrow.
You can move to the desired line via the Go menu.
Go to Start will take you to the first line.
Go to End will take you to the last line.
Go to Line will take you to a specific line.
Go to Next Unlabeled Token will take you to the next unlabeled token.
Go to Previous Unlabeled Token will take you to the previous unlabeled token.
Go to Next Unlabeled Line will take you to the next unlabeled line.
Go to Previous Unlabeled Line will take you to the previous unlabeled line.
Go to Next File will take you to the next file.
Go to Previous File will take you to the previous file.
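The “next unlabeled token” navigation can be thought of as a forward scan from the current position. The sketch below is a simplified model, not Datasaur's implementation. The function name, the `(line, token)` position format, and the `labeled` set are all assumptions made for illustration.

```python
def next_unlabeled_token(lines, labeled, current):
    """Return the first unlabeled (line, token) position after `current`, or None.

    lines:   list of token lists, one per line
    labeled: set of (line_index, token_index) positions that already have a label
    current: (line_index, token_index) of the cursor
    """
    current_line, current_token = current
    for line_idx in range(current_line, len(lines)):
        # On the current line, start just after the cursor; otherwise from the first token.
        first = current_token + 1 if line_idx == current_line else 0
        for tok_idx in range(first, len(lines[line_idx])):
            if (line_idx, tok_idx) not in labeled:
                return (line_idx, tok_idx)
    return None


lines = [["Sherlock", "Holmes", "lives"], ["in", "London"]]
labeled = {(0, 0), (0, 1), (1, 1)}

print(next_unlabeled_token(lines, labeled, (0, 0)))  # (0, 2)
print(next_unlabeled_token(lines, labeled, (0, 2)))  # (1, 0)
print(next_unlabeled_token(lines, labeled, (1, 0)))  # None
```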
This menu allows users to customize their view according to their preference. It is accessible through the View menu. From this menu, users can choose between "Show all lines" and "Show lines based on label status".
Here is the preview when the user chooses Show lines based on label status > Labels from Labelers.
This setting allows users to customize their labeling experience according to their preferences. This is accessible through File -> Settings menu -> Personalization tab.
When this setting is enabled, marking a document as complete will automatically move you to the next document. This can be done either from the extension or by using a shortcut (Ctrl + m).
This setting eliminates the need for manual navigation between documents after marking one as complete.
Paragraph/sentence labeling optimizes the interface for when you are applying labels to longer sentences or entire paragraphs. You will have the option to show the label as an index bar on the left-hand side, and hide the label above the text to avoid clutter.
You can enable this by altering the project settings in token-based projects:
Click File on the top left, then click Settings -> Personalization.
Check Show index bar for labels.
Character selection allows you to select and apply labels on a character-level basis, so you don't have to select the entire token.
Click File on the top left, then click Settings -> Task settings.
Open Default text selection.
Choose `Character selection`
You can label characters in two ways:
Select the desired characters using your mouse.
Select the characters using the keyboard shortcut shift + right.
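The difference between character selection and token selection can be illustrated with character offsets. This sketch is an assumption-laden model rather than Datasaur's behavior. It treats tokens as whitespace-delimited, and `snap_to_tokens` is a hypothetical helper showing what a token-level selection would cover for the same starting range.

```python
def snap_to_tokens(text, start, end):
    """Expand the character range [start, end) to whitespace-delimited token boundaries."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


text = "The Datasaur-based workflow"

# Character selection: label only part of a token.
char_start, char_end = 4, 12
print(text[char_start:char_end])    # Datasaur

# Token selection: the same range grows to cover the whole token.
token_start, token_end = snap_to_tokens(text, char_start, char_end)
print(text[token_start:token_end])  # Datasaur-based
```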
If you want to label the entire sentence, you can simply click on the line number.
You can select multiple lines at once by holding shift and clicking the desired line number.
You can also select multiple lines starting with any line number, for example selecting lines 4-8.
This feature allows you to adjust the selection to an already-applied span label, so you don’t have to delete and reapply the label.
Please note that you will need the ability to modify the applied label in order to adjust span selection.
To start adjusting a label selection, right-click on the label that you want to adjust. You should see the “Adjust span selection” option.
The 'Adjust span selection' won't be accessible in Reviewer mode if the label is still conflicted and hasn't been resolved.
After you’ve selected the option, you will enter Adjusting Mode. In this mode, you can move the selection handle to create the new selection.
Shortcut, extension, and title bar functionality will be disabled while the user is in Adjusting Mode.
You can change the selection by dragging and dropping the selection handle to the desired position.
All span labeling settings in Task settings, such as “Spans should have at most one label” and “Limit selection to a span of 1 token,” will also be applied.
Exiting the mode will automatically save the selection. Click anywhere besides the handle to exit the mode. Additionally, the user will see a saving indicator below the label selection, and it will disappear after the save is successful.
In addition to labeling the transcription in the OCR interface, you can also draw bounding boxes on the viewer and bind them to the corresponding text in the text transcription.
Before drawing the bounding boxes, click the icon shown in the screenshot below to enable the drawing capability. Once the icon's color turns blue, you are ready to begin drawing the bounding boxes!
After you create bounding boxes and bind them to the text,
Clicking a bounding box on a specific page will automatically scroll the text editor to find the corresponding text
Clicking a span of text or a label will also automatically scroll the media viewer to find the corresponding bounding box
This feature can be helpful for PDFs with multiple pages.
Synchronized scrolling is a feature available in OCR Span Labeling projects.
This feature reduces the effort of manually scrolling through a PDF file and a transcription file by synchronizing the scroll position between the two viewers. This way, you won’t have to scroll through both viewers to ensure that each viewer is aligned.
This feature is currently only available for OCR labeling projects with PDF files.
Follow these steps to enable this feature:
Make sure you are working in an OCR Span Labeling Project. Read how to create one here.
On the bottom bar, you will see a button to toggle synchronized scrolling.
Click on the button, and wait until the mapping process is finished.
💡 The mapping process is specific to the currently opened document. Processing other documents requires opening each respective document individually.
You can check the mapping progress by hovering on the button or checking on the progress indicator on the top part of the editor.
While the mapping is in progress, you can still interact with the document and label it. However, any action that modifies the sentence content, such as editing a sentence, will be disabled.
When the mapping process is finished, you will be notified by a success snack bar message.
Finally, try scrolling in one of the viewers. The other viewer should scroll automatically.
The mapping result is saved for every user assigned to that document. Therefore, only one assignee needs to wait for the mapping process to finish.
Once that assignee has completed the mapping process, any other assignee can enable synchronized scrolling without waiting for the mapping process again.
To trigger the auto-scroll, you can simply scroll on the transcription or the document viewer. Scrolling can be done with your mouse wheel, touchpad scroll gesture, or using the up-down arrow keys.
You can also click-to-highlight any span on the transcription to trigger the auto-scroll.
You can also temporarily disable this feature by simply clicking again on the Synchronized Scrolling button at the bottom of the page. The good news is that you can immediately re-enable it.
To create scroll points between texts in the PDF and the transcription, the app maps the text content of your file to your transcription using text matching. To enhance the accuracy of our mapping, ensure that you follow these points.
Use native PDFs rather than scanned PDFs. Native PDFs have their text contents embedded in the file, which can simply be extracted and used for the mapping process. Meanwhile, scanned PDFs have images of texts, which the app currently is unable to extract.
For your transcription file, avoid having one line with content that spans across multiple lines from your PDF document. A rule of thumb is to have one line from your PDF file as one line in the transcription file.
The auto-scrolling behavior works best if the Document viewer is at 100% zoom.
Enabling PII Anonymization may break the mapping process, as parts of the transcription will be masked, and the app may fail to find matches for the masked transcription lines.
Synchronized scrolling will be temporarily disabled when the document is rotated.
It will be enabled again when the document is back to its original orientation.
Making any modifications to the transcription (editing, inserting, or deleting sentences) may result in unexpected behavior in synchronized scrolling.
The mapping process uses the original transcription to enable synchronized scrolling and currently does not account for changes made to the transcription text after the initial mapping process.
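The recommendations and limitations above follow naturally from text matching. The sketch below is a deliberately simplified model. The actual matching algorithm is not described in this documentation, so exact matching on whitespace-normalized, lowercased lines is an assumption, and `map_lines` and the sample data are hypothetical.

```python
def normalize(line):
    """Collapse runs of whitespace and lowercase the text before comparing."""
    return " ".join(line.split()).lower()


def map_lines(transcription_lines, pdf_lines):
    """Map each transcription line index to the first matching PDF line index, or None."""
    first_seen = {}
    for pdf_idx, line in enumerate(pdf_lines):
        first_seen.setdefault(normalize(line), pdf_idx)
    return {
        t_idx: first_seen.get(normalize(line))
        for t_idx, line in enumerate(transcription_lines)
    }


pdf_lines = ["Invoice No. 1042", "Bill to: Jane Doe", "Total: $300"]
transcription_lines = [
    "Invoice No. 1042",
    "Bill to: ****",    # masked, as PII anonymization might leave it
    "Total:   $300",    # extra spaces are normalized away
]

print(map_lines(transcription_lines, pdf_lines))
# {0: 0, 1: None, 2: 2}
```

In this model a masked line finds no match and therefore gets no scroll point. Similarly, a transcription line that merges several PDF lines, or one edited after the mapping was built, would no longer equal any single PDF line.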
Once you have finished labeling, click Mark as complete. This will signify to your team you are done with the project, and it is ready for Review or Export.