
Custom Named Entity Recognizer (NER)

Color Encoded Entities

  • Typically, pretrained natural language processing (NLP) models are trained to recognise only a limited set of entities.

  • This notebook provides a short and crisp way to recognise a user-defined set of entities for a specific task, where an entity can be a sequence of tokens (words, characters, etc.).

  • The process of annotating data follows the BILUO convention, where the scheme is defined as follows (a minimal tagging example is shown after this list):

    TAG        DESCRIPTION
    B (BEGIN)  The first token of a multi-token entity.
    I (IN)     An inner token of a multi-token entity.
    L (LAST)   The final token of a multi-token entity.
    U (UNIT)   A single-token entity.
    O (OUT)    A non-entity token.
  • One can read this paper by Akbik et al.; it should help in understanding the algorithm behind sequence labelling (i.e. multi-word entities).

  • FAQ: Please do read the FAQ below for more clarification.
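
To make the BILUO scheme concrete, here is a minimal sketch, assuming spaCy v2.x (the version the rest of this notebook uses); the sentence, labels, and character offsets are purely illustrative:

    from spacy.lang.en import English
    from spacy.gold import biluo_tags_from_offsets

    nlp = English()  # tokenizer only; no pretrained model needed here
    text = "Prem works at Acme Corp in Berlin"
    # Hypothetical character-offset annotations: (start_char, end_char, label)
    entities = [(0, 4, "PERSON"), (14, 23, "ORG"), (27, 33, "GPE")]

    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, entities)
    print(list(zip([t.text for t in doc], tags)))
    # [('Prem', 'U-PERSON'), ('works', 'O'), ('at', 'O'),
    #  ('Acme', 'B-ORG'), ('Corp', 'L-ORG'), ('in', 'O'), ('Berlin', 'U-GPE')]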

Installation Process with Error Resolution

  • Using spaCy: Industrial-Strength Natural Language Processing

  • Installation

    • pip install spacy
  • Download a pretrained model

    • python -m spacy download en - this downloads only a simple and lightweight model

    • To download another pretrained model: python -m spacy download en_core_web_md (source). This model is also small and only has a tagger, parser, and NER. More options, including models for other languages, are available at the source.

    • To load the model, run the following (a short usage example is shown after this list):

        import spacy
        nlp = spacy.load('en') # en_core_web_md
        # If this line raises an error (e.g. a missing shortcut link), please see below to resolve it
      

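Once the model is loaded, a minimal usage sketch looks like the following (the sentence is purely illustrative); displacy renders the colour-encoded entities shown at the top of this README:

    import spacy
    from spacy import displacy

    nlp = spacy.load('en')  # or en_core_web_md
    doc = nlp("Google was founded in California in 1998.")

    # Print each recognised entity with its character span and label
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)

    # In a Jupyter notebook this renders the colour-encoded entities inline
    displacy.render(doc, style='ent', jupyter=True)
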
Not required for this notebook:

  • Pretrained Transformer model

    • The prerequisite is to install spaCy transformers: pip install spacy-transformers

    • To install one of the transformer models: python -m spacy download --always-link en_trf_bertbaseuncased_lg --timeout=10000. The --timeout argument helps avoid errors caused by download timeouts. You might need to install the timeout package.

    • To load the model, run the following:

      import spacy
      nlp = spacy.load('en_trf_bertbaseuncased_lg')
      
    • The above line might cause an error like [E050] Can't find model 'en_trf_bertbaseuncased_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory. The solution is as follows:

      from spacy.cli import link
      from spacy.util import get_package_path
      
      model_name = "en_trf_bertbaseuncased_lg" # model_name = 'en_core_web_md'
      model_path = get_package_path(model_name)
      link(model_name, model_name, force=True, model_path=model_path)
      

    Note:

    • Follow the above steps if there is an error loading this model, en_core_web_md, or any other model from here.

    • If there is an error like [E048] Can't import language trf from spacy.lang: No module named 'spacy.lang.trf', make sure you have already installed spacy-transformers using pip install spacy-transformers. Check whether it is installed using pip list; if not, install it, otherwise restarting the project or the IPython console should work.

Q: Can training on new types of entities harm the previously learned entities?

ANS: Yes, most likely, if the older types are not included in the current training data.

# Note: If you're using an existing model, make sure to mix in examples of
# other entity types that spaCy correctly recognized before. Otherwise, your
# model might learn the new type, but "forget" what it previously knew.
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting 
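
A minimal sketch of this mixing, assuming spaCy v2-style training data; the texts and the GADGET label are purely illustrative. The idea is to let the existing model annotate some unrelated texts and add those as "revision" examples next to the new-label examples:

import spacy

nlp = spacy.load('en')  # existing model whose knowledge we want to keep

# Examples for the new (hypothetical) entity type
new_examples = [
    ("I love my new FooPhone", {"entities": [(14, 22, "GADGET")]}),
]

# "Revision" examples: annotate unrelated texts with the existing model
# so its old entity types are rehearsed during training
revision_texts = ["Apple was founded in California", "Sebastian lives in Berlin"]
revision_examples = []
for text in revision_texts:
    doc = nlp(text)
    ents = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    revision_examples.append((text, {"entities": ents}))

train_data = new_examples + revision_examples  # shuffle before each epoch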

Q: How to reproduce results in spaCy?

Ans: Fix the random seed for both spaCy and NumPy when running only on CPU:

import spacy
import numpy as np

s = 999
np.random.seed(s)
spacy.util.fix_random_seed(s)

If running on GPU, you also need to fix the random seed for CuPy:

import cupy
cupy.random.seed(s)

Read more here.

Q: Why is BILUO better than the IOB scheme?

Ans: There are several coding schemes for encoding entity annotations as token tags. These coding schemes are equally expressive, but not necessarily equally learnable. Ratinov and Roth showed that the minimal Begin, In, Out scheme was more difficult to learn than the BILUO scheme that we use, which explicitly marks boundary tokens.
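
As a small illustration (a sketch assuming spaCy v2.x, where iob_to_biluo lives in spacy.gold), here is the same three-token entity in both schemes:

from spacy.gold import iob_to_biluo

iob_tags = ["O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(iob_to_biluo(iob_tags))
# ['O', 'B-ORG', 'I-ORG', 'L-ORG', 'O'] - the entity boundary is marked explicitly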

Q: How to input data to the Model?

Ans: Use GoldParse to convert token-level (BILUO) annotations into the required format directly, or provide the annotations as character-offset indices; a sketch of both is shown below.
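
A minimal sketch of both options, assuming spaCy v2.x (where GoldParse lives in spacy.gold); the sentence and labels are illustrative:

import spacy
from spacy.gold import GoldParse

nlp = spacy.load('en')
doc = nlp.make_doc("Prem lives in Berlin")

# Option 1: character-offset annotations (start_char, end_char, label)
gold_from_offsets = GoldParse(doc, entities=[(0, 4, "PERSON"), (14, 20, "GPE")])

# Option 2: token-level BILUO tags, one tag per token
gold_from_tags = GoldParse(doc, entities=["U-PERSON", "O", "O", "U-GPE"])

print(gold_from_offsets.ner)  # ['U-PERSON', 'O', 'O', 'U-GPE'] in both cases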
