
spacy-lookup's Introduction

spacy-lookup: Named Entity Recognition based on dictionaries

spaCy v2.0 extension and pipeline component for adding Named Entities metadata to Doc objects. Detects Named Entities using dictionaries. The extension sets the custom Doc, Token and Span attributes ._.is_entity, ._.entity_type, ._.has_entities and ._.entities.

Named Entities are matched using the Python module flashtext and looked up in the data provided by different dictionaries.
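As a rough illustration of what flashtext-style matching does, here is a minimal sketch (not the actual flashtext algorithm, which uses a trie and checks word boundaries):

```python
def extract_keywords(text, keywords):
    """Scan the text and report (keyword, start, end) for each dictionary
    term found, preferring the longest term at each position. flashtext
    does this with a trie and word-boundary checks; this sketch omits both."""
    found = []
    terms = sorted(keywords, key=len, reverse=True)  # try longest terms first
    i = 0
    lowered = text.lower()
    while i < len(lowered):
        for term in terms:
            if lowered.startswith(term.lower(), i):
                found.append((text[i:i + len(term)], i, i + len(term)))
                i += len(term) - 1  # skip past the match
                break
        i += 1
    return found
```

For example, `extract_keywords("I am a product manager.", ["product manager"])` returns `[('product manager', 7, 22)]`, which is the shape of data the component attaches to the Doc.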

Installation

spacy-lookup requires spacy v2.0.16 or higher.

pip install spacy-lookup

Usage

First, you need to download a language model.

python -m spacy download en

Import the component and initialise it with the shared nlp object (i.e. an instance of Language), which is used to initialise flashtext with the shared vocab, and create the match patterns. Then add the component anywhere in your pipeline.

import spacy
from spacy_lookup import Entity

nlp = spacy.load('en')
entity = Entity(keywords_list=['python', 'product manager', 'java platform'])
nlp.add_pipe(entity, last=True)

doc = nlp(u"I am a product manager for a java and python.")
assert doc._.has_entities == True
assert doc[0]._.is_entity == False
assert doc[3]._.entity_desc == 'product manager'
assert doc[3]._.is_entity == True

print([(token.text, token._.canonical) for token in doc if token._.is_entity])

spacy-lookup only cares about the token text, so you can use it on a blank Language instance (it should work for all available languages!), or in a pipeline with a loaded model. If you're loading a model and your pipeline includes a tagger, parser and entity recognizer, make sure to add the entity component as last=True, so the spans are merged at the end of the pipeline.

Available attributes

The extension sets attributes on the Doc, Span and Token. You can change the attribute names on initialisation of the extension. For more details on custom components and attributes, see the processing pipelines documentation.

Token._.is_entity

bool

Whether the token is an entity.

Token._.entity_type

unicode

A human-readable description of the entity.

Doc._.has_entities

bool

Whether the document contains entities.

Doc._.entities

list

(entity, index, description) tuples of the document's entities.

Span._.has_entities

bool

Whether the span contains entities.

Span._.entities

list

(entity, index, description) tuples of the span's entities.

Settings

On initialisation of Entity, you can define the following settings:

nlp Language The shared nlp object. Used to initialise the matcher with the shared Vocab, and create Doc match patterns.
attrs tuple Attributes to set on the ._ property. Defaults to ('has_entities', 'is_entity', 'entity_type', 'entity').
keywords_list list Optional list of terms to look for.
keywords_dict dict Optional dictionary of terms to look for.
keywords_file string Optional name of a file with the terms to look for.
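The keywords_file format is presumably the one flashtext's add_keyword_from_file accepts: one term per line, optionally with a `term=>clean name` mapping (an assumption; check the flashtext documentation for the exact format):

```
python
java platform
py=>python
```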

entity = Entity(nlp, keywords_list=['python', 'java platform'], label='ACME')
nlp.add_pipe(entity)
doc = nlp(u"I am a product manager for a java platform and python.")
assert doc[7]._.is_entity  # 'java platform' has been merged into a single token

spacy-lookup's People

Contributors

alvaroabascar, basiledjigg, johnnyleitrim, leonqli, mpuig


spacy-lookup's Issues

Could merging spans after successful lookups be avoided or optional?

Hi, I've been very happy to discover this contribution to the spacy ecosystem.

I wanted to write up an issue I encountered during evaluation of a model that included an Entity component. It's probably the cause of #2.

import spacy
from spacy.scorer import Scorer
from spacy_lookup import Entity

nlp = spacy.blank('en')
text = 'The Chamber of Commerce testified in support.'

For use later as a gold parse, create a Doc from the blank model:

gold = nlp(text)

Run the lookup:

entity = Entity(nlp, keywords_list=['Chamber of Commerce'], label='ORG')
nlp.add_pipe(entity, last=True)
doc = nlp(text)

# Successful
assert doc.ents[0].text == 'Chamber of Commerce'

As the readme mentions, a multi-token entity span will be merged after a successful lookup. This means our two documents (processed and gold) have diverged.

assert doc[1].text == 'Chamber of Commerce'
assert gold[1].text == 'Chamber'

When calling the evaluate() method of a Language instance, this will eventually happen (simplified from the actual source):

scorer = Scorer()
scorer.score(doc, gold, verbose=True)
# ValueError: [E078] Error computing score: number of words in Doc (6) does not equal number of words in GoldParse (8).

There's unfortunately no Span.split() method, so this isn't easily reversible.
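The divergence can be pictured with plain lists, as a toy sketch of what merging does to token counts (note the 6 vs 8 in the E078 message above):

```python
def merge_span(tokens, start, end):
    """Toy illustration of span merging: tokens[start:end] collapse into one
    token, so indices after the span shift and the token count no longer
    matches an unmerged copy of the same text."""
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]

# The gold parse keeps 8 tokens; the processed doc ends up with 6.
gold = "The Chamber of Commerce testified in support .".split()
merged = merge_span(gold, 1, 4)
```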

TypeError: set_extension() got an unexpected keyword argument 'force'

Hi,

Thanks for this library, I am looking at using it in one of my projects.

I think the force argument is obsolete in the latest version of spaCy; I'm getting the error below when importing your library.

Using TensorFlow backend.
Traceback (most recent call last):
  File "C:/Users/atti/Projects/NER/ner_n.py", line 92, in <module>
    main()
  File "C:/Users/atti/Projects/NER/ner_n.py", line 88, in main
    process_news_feeds(5, show_details=False)
  File "C:/Users/baa5uo/Projects/NER/ner_n.py", line 46, in process_news_feeds
    nm.configure()
  File "C:\Users\atti\Projects\NER\ner_models.py", line 70, in configure
    entity_field = Entity(keywords_list=fields, label='FIELD')
  File "C:\Users\atti\AppData\Local\Continuum\anaconda3\envs\nlp2\lib\site-packages\spacy_lookup\__init__.py", line 28, in __init__
    Doc.set_extension(self._has_entities, getter=self.has_entities, force=True)
  File "doc.pyx", line 97, in spacy.tokens.doc.Doc.set_extension
TypeError: set_extension() got an unexpected keyword argument 'force'

Changed the following lines in Entity:

    # Register attribute on the Doc and Span

    Doc.set_extension(self._has_entities, getter=self.has_entities)
    Doc.set_extension(self._entities, getter=self.iter_entities)
    Span.set_extension(self._has_entities, getter=self.has_entities)
    Span.set_extension(self._entities, getter=self.iter_entities)

    # Register attribute on the Token.
    Token.set_extension(self._is_entity, default=False)
    Token.set_extension(self._entity_desc, getter=self.get_entity_desc)

Fuzzy similarity

How can I do fuzzy matching with this library, something like mapping a misspelled word to the correct keyword?

Example: Appple ---> Apple
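spacy-lookup itself does exact dictionary matching, so one workaround (a sketch, not a feature of this library) is to normalise tokens against the keyword list before the lookup, e.g. with the standard library's difflib:

```python
from difflib import get_close_matches

def normalise(word, keywords, cutoff=0.8):
    """Map a (possibly misspelled) word to the closest keyword in the list,
    or return it unchanged when nothing is similar enough."""
    matches = get_close_matches(word, keywords, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

With this, `normalise("Appple", ["Apple", "Google"])` yields `"Apple"`, which could then be fed to the exact matcher. The cutoff needs tuning per use case.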

label value

It's not clear how to access/retrieve the value of label passed into the constructor when instantiating Entity.

Cannot extract all entities which have the same range values

I made an example as below by using spacy with lookup dependencies.

from pprint import pprint

import spacy
from spacy_lookup import Entity

nlp = spacy.load('en')  # assumes the English model is installed

entity = Entity(
    keywords_list=["Japan", "Tokyo", "US"],
    label='from_location',
    case_sensitive=True)
nlp.add_pipe(entity, name='location')

entity2 = Entity(
    keywords_list=["Korea", "Japan", "US", "Tokyo"],
    label='to_location',
    case_sensitive=True)
nlp.add_pipe(entity2, name='to_location')

doc = nlp(
    u"I want to go to Tokyo Japan tomorrow morning from US. Can you book a ticket?")
for token in doc:
    if token._.is_entity:
        pprint([(token.text, token._.canonical, token.ent_type_, token.pos_, token.idx, token.idx + len(token.text))])

Here is the result:

[('Tokyo', 'Tokyo', 'from_location', 'X', 16, 21)]
[('Japan', 'Japan', 'from_location', 'X', 22, 27)]
[('US', 'US', 'from_location', 'X', 50, 52)]

However, my expectation is:

[('Tokyo', 'Tokyo', 'to_location', 'X', 16, 21)]
[('Tokyo', 'Tokyo', 'from_location', 'X', 16, 21)]
[('Japan', 'Japan', 'to_location', 'X', 22, 27)]
[('Japan', 'Japan', 'from_location', 'X', 22, 27)]
[('US', 'US', 'to_location', 'X', 50, 52)]
[('US', 'US', 'from_location', 'X', 50, 52)]

Does anyone know why?

TypeError: object of type 'NoneType' has no len()

Hello,

I tend to run into this error at spacy_lookup/__init__.py line 60,

doc.ents = list(doc.ents) + [entity]

It looks like lines 58 and 60 are incorrectly indented; they should be inside the "if entity:" block.

Please check if this is the case. I can get around the error by indenting the two lines.

Thanks,
Agnes
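The fix described above (moving the two lines inside the `if entity:` guard) can be sketched with a stand-in for Doc.char_span, which returns None when the character offsets don't line up with token boundaries:

```python
def collect_entities(matches, char_span):
    """Corrected loop shape: the span is only used and appended inside the
    `if entity:` guard, because char_span yields None for misaligned offsets."""
    ents = []
    for start, end in matches:
        entity = char_span(start, end)
        if entity:                  # the indentation fix from this issue
            ents.append(entity)
    return ents

# Toy char_span: only offsets present in the table produce a span.
spans = {(0, 5): "hello"}
result = collect_entities([(0, 5), (2, 7)], lambda s, e: spans.get((s, e)))
```

With the guard in place, the misaligned match `(2, 7)` is silently skipped instead of crashing the pipeline.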

Finding keywords on lemmas?

I'm using this library and it works great, but there are occasions when I'm looking for the pluralized version of a word to be recognised, e.g. mouldy green apples. If I have apple as an added keyword, it won't find apple in that example sentence. Is there a way to have the keyword be found on the lemma of apples, i.e. apple?
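spacy-lookup matches on the token text only, so one possible workaround is to run the lookup over lemmas instead. A minimal sketch, with a toy lemma table standing in for spaCy's token.lemma_ (in a real pipeline you would read token.lemma_, and the Entity component would need to be adapted to do so):

```python
# Toy lemma table; in spaCy this information comes from token.lemma_.
LEMMAS = {"apples": "apple"}

def match_on_lemmas(tokens, keywords):
    """Report tokens whose lemma (falling back to the surface form) is in
    the keyword list, so 'apples' matches the keyword 'apple'."""
    kw = set(keywords)
    return [t for t in tokens if LEMMAS.get(t, t) in kw]
```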

TypeError: object of type 'NoneType' has no len()

For some of the data, I keep getting this error. I tried to indent the code as suggested in #7, but it did not work.


flashtext 2.7
spacy 2.0.16
spacy-lookup 0.0.3


Any suggestions please?

TypeError                                 Traceback (most recent call last)
in <module>
----> 1 doc = nlp(data)
      2
      3 print(doc._.entities)
      4 for ent in doc.ents:
      5     print((ent.text, ent.start_char, ent.end_char, ent.label_))

~/.virtualenvs/spacy_lookup/lib/python3.6/site-packages/spacy/language.py in __call__(self, text, disable)
    344             if not hasattr(proc, '__call__'):
    345                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 346             doc = proc(doc)
    347             if doc is None:
    348                 raise ValueError(Errors.E005.format(name=name))

~/.virtualenvs/spacy_lookup/lib/python3.6/site-packages/spacy_lookup/__init__.py in __call__(self, doc)
     53             spans.append(entity)
     54         # Overwrite doc.ents and add entity – be careful not to replace!
---> 55         doc.ents = list(doc.ents) + [entity]
     56

doc.pyx in spacy.tokens.doc.Doc.ents.__set__()

TypeError: object of type 'NoneType' has no len()

iter_entities redundancy

It appears there is some redundancy in what iter_entities returns. For the pattern hello matching inside the doc heLLo world, it returns ('heLLo', 0, 'heLLo'). I would expect the entity description to match the canonical pattern hello rather than the observed token heLLo. Am I misinterpreting your intent for entity_desc?

def iter_entities(self, tokens):
    return [(t.text, i, t._.get(self._entity_desc))
            for i, t in enumerate(tokens)
            if t._.get(self._is_entity)]

def get_entity_desc(self, token):
    return token.text
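One way to get the behaviour the reporter expects, returning the canonical dictionary form rather than the surface form, is to key a case-insensitive table by the lowercased keyword (a sketch; flashtext's "clean name" feature offers something similar):

```python
def find_keywords(text, keywords):
    """Case-insensitive lookup reporting (surface form, canonical form),
    where the canonical form is the keyword as written in the dictionary."""
    canonical = {k.lower(): k for k in keywords}
    hits = []
    for word in text.split():
        match = canonical.get(word.lower())
        if match is not None:
            hits.append((word, match))
    return hits
```

Here `find_keywords("heLLo world", ["hello"])` reports `('heLLo', 'hello')`: the observed token plus the canonical pattern.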

How to retrieve native spacy functionality for type 'Entity'

I do not know a great deal about spaCy, but it appears that you are rewriting the Entity object for spaCy here, which may cause issues when trying to use spaCy functions after the dictionary match.

I have provided an example below of the trouble I'm having.

I was attempting to use the spacy 'GoldParse' to get performance measures after including a dictionary to improve my trained NER model.

Any idea how to fix the fact that spaCy now thinks there is no entity?

      9         if annot is None:
     10             annot = []
---> 11         doc = nlp_model(input_)
     12         gold = GoldParse(doc, entities = annot)
     13         try:

~/.local/lib/python3.6/site-packages/spacy/language.py in __call__(self, text, disable)
    350             if not hasattr(proc, '__call__'):
    351                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 352             doc = proc(doc)
    353             if doc is None:
    354                 raise ValueError(Errors.E005.format(name=name))

~/.local/lib/python3.6/site-packages/spacy_lookup/__init__.py in __call__(self, doc)
     38         for _, start, end in matches:
     39             entity = doc.char_span(start, end, label=self.label)
---> 40             for token in entity:
     41                 token._.set(self._is_entity, True)
     42             spans.append(entity)

TypeError: 'NoneType' object is not iterable

ValueError: [E098] Trying to set conflicting doc.ents: '(0, 1, 'ORG')' and '(0, 1, '')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.


My code is:

import spacy
from spacy_lookup import Entity

nlp = spacy.load('en')
entity = Entity(keywords_list=['Digitoxin'])
nlp.add_pipe(entity, last=True)

doc = nlp(u"Digitoxin metabolism by rat liver microsomes .")

print(doc._.entities)

ValueError: [E090] Extension 'has_entities' already exists on Doc

Hi,

Thanks for a great library. I recently started to get this error; previously it was working. Any ideas on what might be the cause?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-dfa60562b3ab> in <module>()
     10 nlp = spacy.load('en')
---> 12 entity_field = Entity(nlp, keywords_list=df_fields['Field'].values.tolist(), label='FIELD')
     13 nlp.add_pipe(entity_field, last=True,name='ner_field')

~\AppData\Local\Continuum\anaconda3\envs\nlp\lib\site-packages\spacy_lookup\__init__.py in __init__(self, nlp, keywords_list, keywords_dict, keywords_file, label, attrs)
     26         self.label = label
     27         # Add attributes
---> 28         Doc.set_extension(self._has_entities, getter=self.has_entities)
     29         Doc.set_extension(self._entities, getter=self.iter_entities)
     30         Span.set_extension(self._has_entities, getter=self.has_entities)

doc.pyx in spacy.tokens.doc.Doc.set_extension()

ValueError: [E090] Extension 'has_entities' already exists on Doc. To overwrite the existing extension, set `force=True` on `Doc.set_extension`.

thanks,
Attila
