spaCy

Supported Labeling Types: Span Labeling

spaCy provides an NLP pipeline optimized for named entity recognition (NER), dependency parsing, and tokenization. It supports a range of information extraction tasks, which makes it well suited to automated text processing and analysis.

ML Assisted with spaCy

Model Details

  • Suitable for span-based projects, which extract meaningful text spans such as names, dates, and monetary values.

  • Uses the en_core_web_lg model. The "lg" (large) variant ships with 685,000 unique word vectors of 300 dimensions each (the exact count varies by model release), giving the pipeline richer semantic features than the small variant, which has no static word vectors.

  • Trained on written web text such as news articles, blogs, and comments, so it covers a broad range of text types.

  • Hosted locally within the Datasaur Intelligence container, eliminating external dependencies and network latency.
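
As a rough illustration of what the word vectors mentioned above provide, the sketch below compares two word vectors with cosine similarity and prints the shape of the vector table. It assumes spaCy and en_core_web_lg are installed in your own Python environment; it does not describe how Datasaur loads or uses the model internally. The helper names `cosine_similarity` and `run_demo` and the example words are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(dot / (norm_a * norm_b))


def run_demo() -> None:
    # Imported here so the helper above works without spaCy installed.
    import spacy

    # Assumes the model was installed with:
    #   python -m spacy download en_core_web_lg
    nlp = spacy.load("en_core_web_lg")

    # (number of vectors, dimensions) for the installed model release.
    print(nlp.vocab.vectors.shape)

    # Example words are arbitrary; any in-vocabulary words work.
    invoice = nlp.vocab["invoice"].vector
    receipt = nlp.vocab["receipt"].vector
    print(cosine_similarity(invoice, receipt))
```

Calling `run_demo()` in an environment with the model installed prints the vector table shape, which is a quick way to confirm the vector count and dimensionality for the release you have.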

Usage

  • en_core_web_lg is a large pre-trained English pipeline that includes word vectors and components for entity recognition, part-of-speech tagging, and syntactic parsing.

  • The en_core_web_lg model includes the following named entity categories:

    • PERSON – Individuals, including full names.

    • NORP – Nationalities or religious or political groups.

    • FAC – Facilities such as buildings, airports, highways, and bridges.

    • ORG – Organizations such as companies, agencies, institutions.

    • GPE – Geopolitical entities like countries, cities, and states.

    • LOC – Non-GPE locations such as mountain ranges and bodies of water.

    • PRODUCT – Objects, vehicles, foods, etc. (Not services.)

    • EVENT – Named events such as hurricanes, wars, sports events.

    • WORK_OF_ART – Titles of books, songs, paintings, etc.

    • LAW – Named laws, treaties, or legal documents.

    • LANGUAGE – Any named language.

    • DATE – Absolute or relative dates or periods.

    • TIME – Times shorter than a day, such as clock times or durations in hours.

    • PERCENT – Percentage values.

    • MONEY – Monetary values.

    • QUANTITY – Measurements of weight, distance, volume, etc.

    • ORDINAL – First, second, third, etc.

    • CARDINAL – Numerals that do not fall under other categories.
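
To see how these categories surface as labeled spans, here is a minimal sketch that runs the pipeline on a sentence and converts each detected entity into a plain record with its label and character offsets. It assumes spaCy and en_core_web_lg are installed locally; the helper names `extract_spans` and `run_demo`, the record layout, and the sample sentence are illustrative assumptions, not the format Datasaur uses internally.

```python
from typing import Any, Dict, List


def extract_spans(doc: Any) -> List[Dict[str, Any]]:
    """Convert the entities of a spaCy Doc into plain span records."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,  # character offset, inclusive
            "end": ent.end_char,      # character offset, exclusive
        }
        for ent in doc.ents
    ]


def run_demo() -> None:
    # Imported here so extract_spans can be reused without spaCy installed.
    import spacy

    # Assumes the model was installed with:
    #   python -m spacy download en_core_web_lg
    nlp = spacy.load("en_core_web_lg")

    doc = nlp("Apple paid $1 billion for a London startup on Monday.")
    for span in extract_spans(doc):
        print(span)

    # Labels the NER component of the installed model can emit.
    print(nlp.get_pipe("ner").labels)
```

Calling `run_demo()` prints one record per detected entity, followed by the label set of the installed model, which can be checked against the list above.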
