
spacy-lookup's Introduction

spacy-lookup: Named Entity Recognition based on dictionaries

spaCy v2.0 extension and pipeline component for adding Named Entities metadata to Doc objects. Detects Named Entities using dictionaries. The extension sets the custom Doc, Token and Span attributes ._.is_entity, ._.entity_type, ._.has_entities and ._.entities.

Named Entities are matched using the Python module flashtext and looked up in the data provided by different dictionaries.
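As a rough illustration of what flashtext-style matching does, here is a minimal sketch (not the actual flashtext algorithm, which uses a trie and checks word boundaries):

```python
def extract_keywords(text, keywords):
    """Scan the text and report (keyword, start, end) for each dictionary
    term found, preferring the longest term at each position. flashtext
    does this with a trie and word-boundary checks; this sketch omits both."""
    found = []
    terms = sorted(keywords, key=len, reverse=True)  # try longest terms first
    i = 0
    lowered = text.lower()
    while i < len(lowered):
        for term in terms:
            if lowered.startswith(term.lower(), i):
                found.append((text[i:i + len(term)], i, i + len(term)))
                i += len(term) - 1  # skip past the match
                break
        i += 1
    return found
```

For example, `extract_keywords("I am a product manager.", ["product manager"])` returns `[('product manager', 7, 22)]`, which is the shape of data the component attaches to the Doc.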

Installation

spacy-lookup requires spacy v2.0.16 or higher.

pip install spacy-lookup

Usage

First, you need to download a language model.

python -m spacy download en

Import the component and initialise it with the shared nlp object (i.e. an instance of Language), which is used to initialise flashtext with the shared vocab, and create the match patterns. Then add the component anywhere in your pipeline.

import spacy
from spacy_lookup import Entity

nlp = spacy.load('en')
entity = Entity(keywords_list=['python', 'product manager', 'java platform'])
nlp.add_pipe(entity, last=True)

doc = nlp(u"I am a product manager for a java and python.")
assert doc._.has_entities == True
assert doc[0]._.is_entity == False
assert doc[3]._.entity_desc == 'product manager'
assert doc[3]._.is_entity == True

print([(token.text, token._.canonical) for token in doc if token._.is_entity])

spacy-lookup only cares about the token text, so you can use it on a blank Language instance (it should work for all available languages!), or in a pipeline with a loaded model. If you're loading a model and your pipeline includes a tagger, parser and entity recognizer, make sure to add the entity component as last=True, so the spans are merged at the end of the pipeline.

Available attributes

The extension sets attributes on the Doc, Span and Token. You can change the attribute names on initialisation of the extension. For more details on custom components and attributes, see the processing pipelines documentation.

Token._.is_entity

bool

Whether the token is an entity.

Token._.entity_type

unicode

A human-readable description of the entity.

Doc._.has_entities

bool

Whether the document contains entities.

Doc._.entities

list

(entity, index, description) tuples of the document's entities.

Span._.has_entities

bool

Whether the span contains entities.

Span._.entities

list

(entity, index, description) tuples of the span's entities.

Settings

On initialisation of Entity, you can define the following settings:

nlp Language The shared nlp object. Used to initialise the matcher with the shared Vocab, and create Doc match patterns.
attrs tuple Attributes to set on the ._ property. Defaults to ('has_entities', 'is_entity', 'entity_type', 'entity').
keywords_list list Optional list of terms to look for.
keywords_dict dict Optional dictionary of terms to look for.
keywords_file string Optional name of a file with the terms to look for.
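The keywords_file format is presumably the one flashtext's add_keyword_from_file accepts: one term per line, optionally with a `term=>clean name` mapping (an assumption; check the flashtext documentation for the exact format):

```
python
java platform
py=>python
```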

entity = Entity(nlp, keywords_list=['python', 'java platform'], label='ACME')
nlp.add_pipe(entity)
doc = nlp(u"I am a product manager for a java platform and python.")
assert doc[7]._.is_entity  # 'java platform' has been merged into a single token

spacy-lookup's People

Contributors

alvaroabascar, basiledjigg, johnnyleitrim, leonqli, mpuig


spacy-lookup's Issues

Could merging spans after successful lookups be avoided or optional?

Hi, I've been very happy to discover this contribution to the spacy ecosystem.

I wanted to write up an issue I encountered during evaluation of a model that included an Entity component. It's probably the cause of #2.

import spacy
from spacy.scorer import Scorer
from spacy_lookup import Entity

nlp = spacy.blank('en')
text = 'The Chamber of Commerce testified in support.'

For use later as a gold parse, create a Doc from the blank model:

gold = nlp(text)

Run the lookup:

entity = Entity(nlp, keywords_list=['Chamber of Commerce'], label='ORG')
nlp.add_pipe(entity, last=True)
doc = nlp(text)

# Successful
assert doc.ents[0].text == 'Chamber of Commerce'

As the readme mentions, a multi-token entity span will be merged after a successful lookup. This means our two documents (processed and gold) have diverged.

assert doc[1].text == 'Chamber of Commerce'
assert gold[1].text == 'Chamber'

When calling the evaluate() method of a Language instance, this will eventually happen (simplified from the actual source):

scorer = Scorer()
scorer.score(doc, gold, verbose=True)
# ValueError: [E078] Error computing score: number of words in Doc (6) does not equal number of words in GoldParse (8).

There's unfortunately no Span.split() method, so this isn't easily reversible.
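The divergence can be pictured with plain lists, as a toy sketch of what merging does to token counts (note the 6 vs 8 in the E078 message above):

```python
def merge_span(tokens, start, end):
    """Toy illustration of span merging: tokens[start:end] collapse into one
    token, so indices after the span shift and the token count no longer
    matches an unmerged copy of the same text."""
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]

# The gold parse keeps 8 tokens; the processed doc ends up with 6.
gold = "The Chamber of Commerce testified in support .".split()
merged = merge_span(gold, 1, 4)
```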

TypeError: set_extension() got an unexpected keyword argument 'force'

Hi,

Thanks for this library, I am looking at using it in one of my projects.

I think the force argument is obsolete in the latest version of spaCy; I'm getting the error below when importing your library.

Using TensorFlow backend.
Traceback (most recent call last):
  File "C:/Users/atti/Projects/NER/ner_n.py", line 92, in <module>
    main()
  File "C:/Users/atti/Projects/NER/ner_n.py", line 88, in main
    process_news_feeds(5, show_details=False)
  File "C:/Users/baa5uo/Projects/NER/ner_n.py", line 46, in process_news_feeds
    nm.configure()
  File "C:\Users\atti\Projects\NER\ner_models.py", line 70, in configure
    entity_field = Entity(keywords_list=fields, label='FIELD')
  File "C:\Users\atti\AppData\Local\Continuum\anaconda3\envs\nlp2\lib\site-packages\spacy_lookup\__init__.py", line 28, in __init__
    Doc.set_extension(self._has_entities, getter=self.has_entities, force=True)
  File "doc.pyx", line 97, in spacy.tokens.doc.Doc.set_extension
TypeError: set_extension() got an unexpected keyword argument 'force'

Changed the following lines in Entity:

    # Register attribute on the Doc and Span

    Doc.set_extension(self._has_entities, getter=self.has_entities)
    Doc.set_extension(self._entities, getter=self.iter_entities)
    Span.set_extension(self._has_entities, getter=self.has_entities)
    Span.set_extension(self._entities, getter=self.iter_entities)

    # Register attribute on the Token.
    Token.set_extension(self._is_entity, default=False)
    Token.set_extension(self._entity_desc, getter=self.get_entity_desc)

Fuzzy similarity

How can I do fuzzy matching with this library, something like mapping a misspelled word to the correct keyword?

Example: Appple ---> Apple
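spacy-lookup itself does exact dictionary matching, so one workaround (a sketch, not a feature of this library) is to normalise tokens against the keyword list before the lookup, e.g. with the standard library's difflib:

```python
from difflib import get_close_matches

def normalise(word, keywords, cutoff=0.8):
    """Map a (possibly misspelled) word to the closest keyword in the list,
    or return it unchanged when nothing is similar enough."""
    matches = get_close_matches(word, keywords, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

With this, `normalise("Appple", ["Apple", "Google"])` yields `"Apple"`, which could then be fed to the exact matcher. The cutoff needs tuning per use case.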

label value

It's not clear how to access/retrieve the value of label passed into the constructor when instantiating Entity.

Cannot extract all entities which have the same range values

I made an example as below by using spacy with lookup dependencies.

from pprint import pprint

import spacy
from spacy_lookup import Entity

nlp = spacy.load('en')  # assumes the English model is installed

entity = Entity(
    keywords_list=["Japan", "Tokyo", "US"],
    label='from_location',
    case_sensitive=True)
nlp.add_pipe(entity, name='location')

entity2 = Entity(
    keywords_list=["Korea", "Japan", "US", "Tokyo"],
    label='to_location',
    case_sensitive=True)
nlp.add_pipe(entity2, name='to_location')

doc = nlp(
    u"I want to go to Tokyo Japan tomorrow morning from US. Can you book a ticket?")
for token in doc:
    if token._.is_entity:
        pprint([(token.text, token._.canonical, token.ent_type_, token.pos_, token.idx, token.idx + len(token.text))])

Here is the result:

[('Tokyo', 'Tokyo', 'from_location', 'X', 16, 21)]
[('Japan', 'Japan', 'from_location', 'X', 22, 27)]
[('US', 'US', 'from_location', 'X', 50, 52)]

However, my expectation is:

[('Tokyo', 'Tokyo', 'to_location', 'X', 16, 21)]
[('Tokyo', 'Tokyo', 'from_location', 'X', 16, 21)]
[('Japan', 'Japan', 'to_location', 'X', 22, 27)]
[('Japan', 'Japan', 'from_location', 'X', 22, 27)]
[('US', 'US', 'to_location', 'X', 50, 52)]
[('US', 'US', 'from_location', 'X', 50, 52)]

Does anyone know why?

TypeError: object of type 'NoneType' has no len()

Hello,

I tend to run into this error at spacy_lookup/__init__.py line 60,

doc.ents = list(doc.ents) + [entity]

It looks like lines 58 and 60 are incorrectly indented; they should be inside the "if entity:" block.

Please check if this is the case. I can get around the error by indenting the two lines.

Thanks,
Agnes
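The fix described above (moving the two lines inside the `if entity:` guard) can be sketched with a stand-in for Doc.char_span, which returns None when the character offsets don't line up with token boundaries:

```python
def collect_entities(matches, char_span):
    """Corrected loop shape: the span is only used and appended inside the
    `if entity:` guard, because char_span yields None for misaligned offsets."""
    ents = []
    for start, end in matches:
        entity = char_span(start, end)
        if entity:                  # the indentation fix from this issue
            ents.append(entity)
    return ents

# Toy char_span: only offsets present in the table produce a span.
spans = {(0, 5): "hello"}
result = collect_entities([(0, 5), (2, 7)], lambda s, e: spans.get((s, e)))
```

With the guard in place, the misaligned match `(2, 7)` is silently skipped instead of crashing the pipeline.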

Finding keywords on lemmas?

I'm using this library and it works great, but there are occasions when I'm looking for the pluralized version of a word to be recognised, e.g. mouldy green apples. If I have apple as an added keyword, it won't find apple in that example sentence. Is there a way to have the keyword be found on the lemma of apples, i.e. apple?
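spacy-lookup matches on the token text only, so one possible workaround is to run the lookup over lemmas instead. A minimal sketch, with a toy lemma table standing in for spaCy's token.lemma_ (in a real pipeline you would read token.lemma_, and the Entity component would need to be adapted to do so):

```python
# Toy lemma table; in spaCy this information comes from token.lemma_.
LEMMAS = {"apples": "apple"}

def match_on_lemmas(tokens, keywords):
    """Report tokens whose lemma (falling back to the surface form) is in
    the keyword list, so 'apples' matches the keyword 'apple'."""
    kw = set(keywords)
    return [t for t in tokens if LEMMAS.get(t, t) in kw]
```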

TypeError: object of type 'NoneType' has no len()

For some of the data, I keep getting this error. I tried to indent the code as suggested in #7, but it did not work.


flashtext 2.7
spacy 2.0.16
spacy-lookup 0.0.3


Any suggestions please?

TypeError                                 Traceback (most recent call last)
in <module>
----> 1 doc = nlp(data)
      2
      3 print(doc._.entities)
      4 for ent in doc.ents:
      5     print((ent.text, ent.start_char, ent.end_char, ent.label_))

~/.virtualenvs/spacy_lookup/lib/python3.6/site-packages/spacy/language.py in __call__(self, text, disable)
    344             if not hasattr(proc, '__call__'):
    345                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 346             doc = proc(doc)
    347             if doc is None:
    348                 raise ValueError(Errors.E005.format(name=name))

~/.virtualenvs/spacy_lookup/lib/python3.6/site-packages/spacy_lookup/__init__.py in __call__(self, doc)
     53             spans.append(entity)
     54         # Overwrite doc.ents and add entity – be careful not to replace!
---> 55         doc.ents = list(doc.ents) + [entity]
     56

doc.pyx in spacy.tokens.doc.Doc.ents.__set__()

TypeError: object of type 'NoneType' has no len()

iter_entities redundancy

It appears there is some redundancy in what iter_entities returns. For the pattern hello matching inside the doc heLLo world, it returns ('heLLo', 0, 'heLLo'). I would expect the entity description to match the canonical pattern hello rather than the observed token heLLo. Am I misinterpreting your intent for entity_desc?

def iter_entities(self, tokens):
    return [(t.text, i, t._.get(self._entity_desc))
            for i, t in enumerate(tokens)
            if t._.get(self._is_entity)]

def get_entity_desc(self, token):
    return token.text
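One way to get the behaviour the reporter expects, returning the canonical dictionary form rather than the surface form, is to key a case-insensitive table by the lowercased keyword (a sketch; flashtext's "clean name" feature offers something similar):

```python
def find_keywords(text, keywords):
    """Case-insensitive lookup reporting (surface form, canonical form),
    where the canonical form is the keyword as written in the dictionary."""
    canonical = {k.lower(): k for k in keywords}
    hits = []
    for word in text.split():
        match = canonical.get(word.lower())
        if match is not None:
            hits.append((word, match))
    return hits
```

Here `find_keywords("heLLo world", ["hello"])` reports `('heLLo', 'hello')`: the observed token plus the canonical pattern.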

How to retrieve native spacy functionality for type 'Entity'

I do not know a great deal about spaCy, but it appears that you are rewriting the Entity object for spaCy here, which may cause issues when trying to use spaCy functions after the dictionary match.

I have provided an example below of the trouble I'm having.

I was attempting to use the spacy 'GoldParse' to get performance measures after including a dictionary to improve my trained NER model.

Any idea how to fix the fact that spaCy now thinks there is no entity?

      9         if annot is None:
     10             annot = []
---> 11         doc = nlp_model(input_)
     12         gold = GoldParse(doc, entities = annot)
     13         try:

~/.local/lib/python3.6/site-packages/spacy/language.py in __call__(self, text, disable)
    350             if not hasattr(proc, '__call__'):
    351                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 352             doc = proc(doc)
    353             if doc is None:
    354                 raise ValueError(Errors.E005.format(name=name))

~/.local/lib/python3.6/site-packages/spacy_lookup/__init__.py in __call__(self, doc)
     38         for _, start, end in matches:
     39             entity = doc.char_span(start, end, label=self.label)
---> 40             for token in entity:
     41                 token._.set(self._is_entity, True)
     42             spans.append(entity)

TypeError: 'NoneType' object is not iterable

ValueError: [E098] Trying to set conflicting doc.ents: '(0, 1, 'ORG')' and '(0, 1, '')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.


My code is:

import spacy
from spacy_lookup import Entity

nlp = spacy.load('en')
entity = Entity(keywords_list=['Digitoxin'])
nlp.add_pipe(entity, last=True)

doc = nlp(u"Digitoxin metabolism by rat liver microsomes .")

print(doc._.entities)

ValueError: [E090] Extension 'has_entities' already exists on Doc

Hi,

Thanks for a great library. I recently started to get this error; previously it was working. Any ideas on what might be the cause?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-dfa60562b3ab> in <module>()
     10 nlp = spacy.load('en')
---> 12 entity_field = Entity(nlp, keywords_list=df_fields['Field'].values.tolist(), label='FIELD')
     13 nlp.add_pipe(entity_field, last=True,name='ner_field')

~\AppData\Local\Continuum\anaconda3\envs\nlp\lib\site-packages\spacy_lookup\__init__.py in __init__(self, nlp, keywords_list, keywords_dict, keywords_file, label, attrs)
     26         self.label = label
     27         # Add attributes
---> 28         Doc.set_extension(self._has_entities, getter=self.has_entities)
     29         Doc.set_extension(self._entities, getter=self.iter_entities)
     30         Span.set_extension(self._has_entities, getter=self.has_entities)

doc.pyx in spacy.tokens.doc.Doc.set_extension()

ValueError: [E090] Extension 'has_entities' already exists on Doc. To overwrite the existing extension, set `force=True` on `Doc.set_extension`.

thanks,
Attila
