
Search

Last updated 1 month ago

The Search extension helps users quickly find specific words, phrases, or labeled tokens within their data. It’s useful for navigating both individual documents and entire projects, with features like label-specific searches, regex searches, and exact word matching. Results are clearly displayed in a list, making it easier to analyze and work with large datasets efficiently.

Span Labeling

In a Span Labeling project, two types of searches are available: Standard and Advanced.

Standard Search

The Standard Search allows users to perform simple searches based on text and labels using keywords or regular expressions (regex). This search type is intuitive and provides quick access to relevant data by matching the input with the text or labels in the project.

Search Based on Text

Text-based search allows users to search for specific words or patterns within the data by specifying a word filter and entering a keyword to locate matching text in the project.

Word Filter

This option lets users define how their search keywords are matched to results. The available options are:

  • Contains any word: Matches results that contain any of the specified words.

    • Example: Searching for men matches men, mentioned, and abandonment.

  • Exact word: Displays only exact matches for the search keyword.

    • Example: Searching for men will match men but not mentioned.

  • Regex: Allows users to search using regular expressions for advanced pattern matching.

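    • Example: Searching for \bmen\w* will match words starting with men, such as mentioned. (In regex syntax, men* means me followed by zero or more n characters, so it is not a "starts with" pattern.)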
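
The three word filters can be approximated in Python as follows. This is a minimal sketch and not Datasaur's implementation: the function name `matches_filter`, the mode names, the whitespace splitting of the keyword, and the case-insensitive comparison are all assumptions made for illustration.

```python
import re


def matches_filter(token: str, keyword: str, mode: str) -> bool:
    """Return True if `token` matches `keyword` under the given word filter.

    `mode` is one of "contains", "exact", or "regex" (hypothetical names).
    """
    if mode == "contains":
        # Any of the whitespace-separated words may appear anywhere in the token.
        return any(word.lower() in token.lower() for word in keyword.split())
    if mode == "exact":
        # The whole token must equal the keyword.
        return token.lower() == keyword.lower()
    if mode == "regex":
        # Standard regex search; the pattern may match anywhere in the token.
        return re.search(keyword, token) is not None
    raise ValueError(f"unknown mode: {mode}")


tokens = ["men", "mentioned", "abandonment", "woman"]
for mode, keyword in [("contains", "men"), ("exact", "men"), ("regex", r"\bmen\w*")]:
    hits = [t for t in tokens if matches_filter(t, keyword, mode)]
    print(mode, keyword, hits)
```

With these sample tokens, the "contains" mode also returns abandonment, while the "exact" mode returns only men.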

Search Based on Label

Label-based search allows users to find specific labels or categories in the data, with the word filter set to "Contains any word."

Advanced Search

The Advanced Search provides a more sophisticated way to search by allowing the combination of multiple conditions to refine results. This search type supports complex queries using MongoDB query syntax.

Configure Conditions

There are two ways to configure the search conditions:

  1. Logic Builder: A user-friendly interface to create conditions visually.

  2. Query: Directly input advanced queries for complex conditions.

Configure Conditions via Logic Builder

Users can create searches with multiple conditions, where each condition includes a search target, a filter operation, and a keyword. These conditions can be combined using logical operators such as "OR" or "AND" to define the relationship between the conditions.

  • Search target: The focus of the search.

    • Text: Matches words or content in the spans.

    • Label: Matches the labels applied to the text.

    • Metadata: Matches information attached to the line (as a key-value pair).

  • Filter operation: Determines how the search target is matched.

    • is: The search target exactly matches the specified keyword.

    • is not: The search target does not exactly match the specified keyword.

    • contains: The search target contains the specified keyword.

    • does not contain: The search target does not contain the specified keyword.

    • matches regex: The search target fits the regular expression pattern.

  • Keyword: The value the search will look for.

    • For Text and Label, this is the word or phrase to match.

    • For Metadata, this is the key: value pair used to filter information.

  • Logical operator: Specifies how multiple conditions are connected.

    • OR: Matches results that meet at least one condition.

    • AND: Matches results that meet all conditions.
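
The way conditions and logical operators combine can be sketched in Python. This is only an illustration of the semantics described above, under stated assumptions: the `Condition` class, the `check` and `evaluate` functions, the record layout with one string per target, and the case-insensitive comparison are hypothetical. Metadata is left out to keep the sketch short.

```python
import re
from dataclasses import dataclass


@dataclass
class Condition:
    target: str     # "text" or "label" (metadata is omitted in this sketch)
    operation: str  # "is", "is not", "contains", "does not contain", "matches regex"
    keyword: str


def check(value: str, operation: str, keyword: str) -> bool:
    """Apply one filter operation to a single string value."""
    v, k = value.lower(), keyword.lower()
    if operation == "is":
        return v == k
    if operation == "is not":
        return v != k
    if operation == "contains":
        return k in v
    if operation == "does not contain":
        return k not in v
    if operation == "matches regex":
        return re.search(keyword, value) is not None
    raise ValueError(f"unknown operation: {operation}")


def evaluate(record: dict, conditions: list, operator: str = "AND") -> bool:
    """Combine the per-condition results with AND (all) or OR (any)."""
    if operator not in ("AND", "OR"):
        raise ValueError(f"unknown logical operator: {operator}")
    results = [check(record[c.target], c.operation, c.keyword) for c in conditions]
    return all(results) if operator == "AND" else any(results)


record = {"text": "John moved to France", "label": "GEO"}
conditions = [
    Condition("text", "contains", "France"),
    Condition("label", "is", "PER"),
]
print(evaluate(record, conditions, "OR"))   # only the text condition holds
print(evaluate(record, conditions, "AND"))  # the label condition fails
```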

Configure Conditions via Query

Key Operators

  • $regex: Searches for text patterns. Add "$options": "i" to make the match case-insensitive.

  • $not: Excludes matches.

  • $or: Requires at least one condition to match.

  • $and: Requires all conditions to match.

Note: $or and $and each require exactly two conditions.
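
Before pasting a query into the Query editor, it can help to check it locally against the constraints above. The following is a minimal sketch, not an official tool: `validate_query` and the `SUPPORTED` set are hypothetical, and including `$options` and `$elemMatch` in the set is an assumption based on their use in the examples in this section.

```python
import json

# Assumed supported subset: the key operators plus the helpers used in the examples.
SUPPORTED = {"$regex", "$options", "$not", "$or", "$and", "$elemMatch"}


def validate_query(node) -> None:
    """Raise ValueError for an unsupported operator or a logical operator
    that does not have exactly two conditions."""
    if isinstance(node, list):
        for item in node:
            validate_query(item)
        return
    if not isinstance(node, dict):
        return  # plain values such as regex strings need no checks
    for key, value in node.items():
        if key.startswith("$") and key not in SUPPORTED:
            raise ValueError(f"unsupported operator: {key}")
        if key in ("$or", "$and"):
            if not isinstance(value, list) or len(value) != 2:
                raise ValueError(f"{key} requires exactly 2 conditions")
        validate_query(value)


query = json.loads(
    '{"$or": ['
    '{"cellFragment.content": {"$regex": "France", "$options": "i"}},'
    '{"cellFragment.content": {"$regex": "John", "$options": "i"}}'
    "]}"
)
validate_query(query)
print("query is valid")
```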

Search Condition

  1. Text condition — Searches for text content.

    1. First example: Find text containing "is". This will match text like "This is used to train data."

      {
        "cellFragment.content": {
          "$regex": "is",
          "$options": "i"
        }
      }
    2. Second example: Find text not containing "is". This will match text like "Labeling the data will be done in Datasaur."

      {
        "cellFragment.content": {
          "$not": {
            "$regex": "is",
            "$options": "i"
          }
        }
      }
  2. Label condition — Filters based on labeled spans.

    1. First example: Find spans labeled with a label containing "GEO". This will match labels like “GEO”, “Location Geo”, and “Geospatial Data”.

      {
        "spanLabels": {
          "$elemMatch": {
            "labelClassName": {
              "$regex": "GEO",
              "$options": "i"
            }
          }
        }
      }
    2. Second example: Find spans labeled exactly with "GEO". This will match labels like "GEO", "geo", or any other case variations, but the entire label must be "GEO" with no additional characters.

      {
        "spanLabels": {
          "$elemMatch": {
            "labelClassName": {
              "$regex": "^GEO$",
              "$options": "i"
            }
          }
        }
      }
  3. Metadata condition — Searches for key-value pairs in the metadata attached to each line.

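    1. Example: Find metadata where the key contains "category" and the value contains "education" (case-insensitive). To require exact matches, anchor the patterns as ^category$ and ^education$.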

      {
        "cellFragment.metadata": {
          "$elemMatch": {
            "key": {
              "$regex": "category",
              "$options": "i"
            },
            "value": {
              "$regex": "education",
              "$options": "i"
            }
          }
        }
      }
  4. Logical OR condition — Matches if any of the conditions are true.

    1. Example: Find text containing either "France" or "John".

      {
        "$or": [
          {
            "cellFragment.content": {
              "$regex": "France",
              "$options": "i"
            }
          },
          {
            "cellFragment.content": {
              "$regex": "John",
              "$options": "i"
            }
          }
        ]
      }
  5. Logical AND condition — Matches only if all conditions are true.

    1. Example: Find text containing both "France" and "John".

      {
        "$and": [
          {
            "cellFragment.content": {
              "$regex": "France",
              "$options": "i"
            }
          },
          {
            "cellFragment.content": {
              "$regex": "John",
              "$options": "i"
            }
          }
        ]
      }
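
Queries with the shapes shown above can also be generated programmatically instead of typed by hand. The sketch below assumes only the field names that appear in the examples (`cellFragment.content`, `spanLabels`, `labelClassName`); the helper names are hypothetical, and `re.escape` follows Python's regex dialect, which is assumed here to be compatible with the regex engine behind the search.

```python
import json
import re


def text_contains(keyword: str) -> dict:
    """Text condition: the line content contains `keyword` (case-insensitive)."""
    return {"cellFragment.content": {"$regex": re.escape(keyword), "$options": "i"}}


def text_not_contains(keyword: str) -> dict:
    """Negated text condition, wrapping the regex in $not."""
    return {
        "cellFragment.content": {
            "$not": {"$regex": re.escape(keyword), "$options": "i"}
        }
    }


def label_is(name: str) -> dict:
    """Label condition: a span label equals `name`, using ^...$ anchors."""
    pattern = f"^{re.escape(name)}$"
    return {
        "spanLabels": {
            "$elemMatch": {"labelClassName": {"$regex": pattern, "$options": "i"}}
        }
    }


def either(a: dict, b: dict) -> dict:
    """Logical OR over exactly two conditions."""
    return {"$or": [a, b]}


def both(a: dict, b: dict) -> dict:
    """Logical AND over exactly two conditions."""
    return {"$and": [a, b]}


# Reproduces the "France" and "John" AND example.
query = both(text_contains("France"), text_contains("John"))
print(json.dumps(query, indent=2))
```

Taking exactly two arguments in `either` and `both` mirrors the rule that `$or` and `$and` accept exactly two conditions.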

Search Result

The search operates at the line level, meaning it evaluates each line individually against the list of specified conditions.

💡 For conditions with negative operators (is not, does not contain), a line is displayed only if it satisfies the negated condition. For example, a does not contain "is" condition displays only the lines without "is".

To enhance readability, users can enable the "Show only matching lines in the text viewer" option. When activated, this option hides non-matching lines, allowing users to focus only on relevant results in the text viewer.
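
Line-level evaluation can be illustrated with a small local evaluator for text-only queries. This is a sketch of the semantics described above, not Datasaur's search engine: `line_matches` is a hypothetical function, it handles only `cellFragment.content` conditions, and it uses Python's `re` module in place of the actual regex engine.

```python
import re

FIELD = "cellFragment.content"


def _regex_hit(cond: dict, text: str) -> bool:
    """Run a {"$regex": ..., "$options": ...} condition against one line."""
    flags = re.IGNORECASE if "i" in cond.get("$options", "") else 0
    return re.search(cond["$regex"], text, flags) is not None


def line_matches(query: dict, text: str) -> bool:
    """Evaluate a text-only query against a single line."""
    if "$or" in query:
        return any(line_matches(q, text) for q in query["$or"])
    if "$and" in query:
        return all(line_matches(q, text) for q in query["$and"])
    cond = query[FIELD]
    if "$not" in cond:
        return not _regex_hit(cond["$not"], text)
    return _regex_hit(cond, text)


lines = [
    "This is used to train data.",
    "Labeling the data will be done in Datasaur.",
]
not_is = {FIELD: {"$not": {"$regex": "is", "$options": "i"}}}

# Each line is checked on its own; only lines satisfying the query are kept.
matching = [line for line in lines if line_matches(not_is, line)]
print(matching)
```

Keeping only the entries in `matching` corresponds to what the "Show only matching lines in the text viewer" option does visually.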

Row Labeling

In a Row Labeling project, the Search extension lets users search the table data and quickly find specific information across multiple rows and columns by specifying a search target, a word filter, and a keyword.

Search target

This option allows users to specify the focus of the search. The available options are:

  • Text: Matches the words or content in the data column.

  • Label: Matches the words or content in the answer column.

Word Filter

This option lets users define how their search keywords are matched to results. The available options are:

  • Contains any word: Matches results that contain any of the specified words.

    • Example: Searching for men matches men, mentioned, and abandonment.

  • Exact word: Displays only exact matches for the search keyword.

    • Example: Searching for men will match men but not mentioned.

  • Regex: Allows users to search using regular expressions for advanced pattern matching.

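    • Example: Searching for \bmen\w* will match words starting with men, such as mentioned. (In regex syntax, men* means me followed by zero or more n characters, so it is not a "starts with" pattern.)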
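
The effect of the search target on a row-based search can be sketched as follows. The row layout with a single "data" column and a single "answer" column, the `search_rows` function, and the case-insensitive "contains" matching are assumptions made for illustration only.

```python
# Hypothetical table: one data column and one answer column per row.
rows = [
    {"data": "The men arrived late", "answer": "neutral"},
    {"data": "Great service", "answer": "positive"},
    {"data": "Mentioned twice in the report", "answer": "neutral"},
]


def search_rows(rows: list, target: str, keyword: str) -> list:
    """Return the indexes of rows whose target column contains `keyword`."""
    # "text" searches the data column, "label" searches the answer column.
    column = {"text": "data", "label": "answer"}[target]
    return [i for i, row in enumerate(rows) if keyword.lower() in row[column].lower()]


print(search_rows(rows, "text", "men"))
print(search_rows(rows, "label", "neutral"))
```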

Search All Files

The Search All Files option allows users to search across all files within a project. When this option is checked, the search will include results from every file in the project. If the option is unchecked, the search will be limited to the current file only.

This is useful for users who want to either perform a broad search across all files or focus on a specific file within the project.

Label All

Only available for Span labeling projects.

The Label All feature allows users to quickly label all matching results in the project.

For example, searching for the text james will show the number of instances of james in the document. After selecting PER from the dropdown, pressing the Label All button will apply the PER label to all instances of james in the document.
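
Conceptually, Label All finds every occurrence of the keyword and attaches the chosen label to each one. The sketch below is an assumption-laden illustration: `label_all` is a hypothetical function, spans are expressed as character offsets, and matching is case-insensitive, none of which is confirmed by the description above.

```python
import re


def label_all(text: str, keyword: str, label: str) -> list:
    """Return one (start, end, label) span per occurrence of `keyword`."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


text = "James met james at the office."
spans = label_all(text, "james", "PER")
print(len(spans), spans)
```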

This feature is a useful tool for bulk labeling, making the process faster and more efficient. It is especially beneficial for projects that require detailed text analysis, enhancing accuracy and saving time.

Tips & Tricks

To navigate through the results more easily, use the Up Arrow and Down Arrow keys to move between results.

Advanced Search conditions can be set up using MongoDB queries. Datasaur supports a subset of the MongoDB query selectors, which are listed under Key Operators in the Advanced Search section.