
spacy-wordnet's Introduction

spaCy WordNet

spaCy WordNet is a simple custom component for using WordNet, MultiWordNet, and WordNet Domains with spaCy.

The component combines the NLTK WordNet interface with WordNet Domains to let users:

  • Get all synsets for a processed token. For example, getting all the synsets (word senses) of the word bank.
  • Get and filter synsets by domain. For example, getting synonyms of the verb withdraw in the financial domain.

Getting started

The spaCy WordNet component can be easily integrated into spaCy pipelines. You just need the following:

Prerequisites

  • Python 3.x
  • spaCy

You also need to download the following NLTK WordNet data:

python -m nltk.downloader wordnet
python -m nltk.downloader omw
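
If you prefer to fetch this data from inside Python instead of the shell, the same corpora can be downloaded with nltk.download (a minimal sketch; 'wordnet' and 'omw' are the same corpus identifiers used above):

import nltk

# Download the WordNet corpus and the Open Multilingual Wordnet data
nltk.download('wordnet')
nltk.download('omw')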

Install

pip install spacy-wordnet

Supported languages

Almost all Open Multilingual Wordnet languages are supported.
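
To check which language codes the underlying NLTK interface knows about, you can ask the WordNet corpus reader directly (a minimal sketch; the exact list depends on the omw data you downloaded):

from nltk.corpus import wordnet as wn

# Print the ISO 639-3 codes of the Open Multilingual Wordnet languages
# available in the downloaded data, e.g. 'eng', 'por', 'spa', ...
print(wn.langs())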

Usage

Once you have chosen the desired language (from the supported ones above), you will need to download a spaCy model for it manually. Check the list of available models for each language in the spaCy 2.x or spaCy 3.x documentation.

English example

Download example model:

python -m spacy download en_core_web_sm

Run:

import spacy

from spacy_wordnet.wordnet_annotator import WordnetAnnotator 

# Load a spaCy model
nlp = spacy.load('en_core_web_sm')
# spaCy 3.x
nlp.add_pipe("spacy_wordnet", after='tagger')
# spaCy 2.x
# nlp.add_pipe(WordnetAnnotator(nlp, name="spacy_wordnet"), after='tagger')
token = nlp('prices')[0]

# The wordnet extension links the spaCy token to the NLTK WordNet interface,
# giving access to synsets and lemmas
token._.wordnet.synsets()
token._.wordnet.lemmas()

# The component also tags tokens with WordNet domains automatically
token._.wordnet.wordnet_domains()
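
For instance, printing the synsets for the token above might look like this (illustrative output; the exact senses depend on your WordNet version):

print(token._.wordnet.synsets())
# e.g. [Synset('price.n.01'), Synset('price.n.02'), ...]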

spaCy WordNet lets you find synonyms by domain of interest, for example the economy domain:

economy_domains = ['finance', 'banking']
enriched_sentence = []
sentence = nlp('I want to withdraw 5,000 euros')

# For each token in the sentence
for token in sentence:
    # We get those synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(economy_domains)
    if not synsets:
        enriched_sentence.append(token.text)
    else:
        lemmas_for_synset = [lemma for s in synsets for lemma in s.lemma_names()]
        # If we found a synset in the economy domains
        # we get the variants and add them to the enriched sentence
        enriched_sentence.append('({})'.format('|'.join(set(lemmas_for_synset))))

# Let's see our enriched sentence
print(' '.join(enriched_sentence))
# >> I (need|want|require) to (draw|withdraw|draw_off|take_out) 5,000 euros
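
The same enrichment loop can be wrapped into a small reusable helper (a sketch; enrich_with_synonyms is a hypothetical name, not part of the library):

def enrich_with_synonyms(doc, domains):
    # Replace tokens that have synsets in the given domains with a
    # (synonym|synonym|...) group; leave all other tokens untouched
    enriched = []
    for token in doc:
        synsets = token._.wordnet.wordnet_synsets_for_domain(domains)
        if synsets:
            lemmas = {lemma for s in synsets for lemma in s.lemma_names()}
            enriched.append('({})'.format('|'.join(lemmas)))
        else:
            enriched.append(token.text)
    return ' '.join(enriched)

print(enrich_with_synonyms(nlp('I want to withdraw 5,000 euros'), ['finance', 'banking']))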
    

Portuguese example

Download example model:

python -m spacy download pt_core_news_sm

Run:

import spacy

from spacy_wordnet.wordnet_annotator import WordnetAnnotator 

# Load a spaCy model
nlp = spacy.load('pt_core_news_sm')
# spaCy 3.x
nlp.add_pipe("spacy_wordnet", after='tagger', config={'lang': nlp.lang})
# spaCy 2.x
# nlp.add_pipe(WordnetAnnotator(nlp.lang), after='tagger')
text = "Eu quero retirar 5.000 euros"
economy_domains = ['finance', 'banking']
enriched_sentence = []
sentence = nlp(text)

# For each token in the sentence
for token in sentence:
    # We get those synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(economy_domains)
    if not synsets:
        enriched_sentence.append(token.text)
    else:
        lemmas_for_synset = [lemma for s in synsets for lemma in s.lemma_names('por')]
        # If we found a synset in the economy domains
        # we get the variants and add them to the enriched sentence
        enriched_sentence.append('({})'.format('|'.join(set(lemmas_for_synset))))

# Let's see our enriched sentence
print(' '.join(enriched_sentence))
# >> Eu (querer|desejar|esperar) retirar 5.000 euros
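
The same pattern should carry over to other supported languages, such as Spanish: pass the pipeline language in the component config and request lemmas with the matching ISO 639-3 code (a sketch, assuming es_core_news_sm is installed and the installed spacy-wordnet release supports Spanish):

import spacy

nlp = spacy.load('es_core_news_sm')
nlp.add_pipe("spacy_wordnet", after='tagger', config={'lang': nlp.lang})

# 'retirar' is the second token of the sentence
token = nlp('Quiero retirar 5.000 euros')[1]
for synset in token._.wordnet.wordnet_synsets_for_domain(['finance', 'banking']):
    print(synset.lemma_names('spa'))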


spacy-wordnet's Issues

Question about synsets

I'm using spaCy WordNet to find synonyms for words.

I would like to check whether a synonym "fits" together with the words that come before and after it.

For example: I want to {buy|acquire|bought} a car.

I would like to check whether a sequence was seen during training, or whether spaCy keeps such sequences internally: 'want to buy a', 'want to acquire a' and 'want to bought a', and if possible how many times each was used. For example, in this case the synonym "bought" is bad, so it would probably not be found even once.

Do you know if this is possible with spaCy / spaCy WordNet?

Support latest NLTK?

Is there a chance you might support NLTK 3.5+? I see that the library supports NLTK 3.3 at the latest, at least at the moment.

Portuguese / Spanish not working

I have tested the Portuguese language, but it's not working. Why?

Traceback (most recent call last):
  File "synonyms2.py", line 8, in <module>
    token = nlp('prices')[0]
  File "/home/gleidson/.local/lib/python3.8/site-packages/spacy/language.py", line 439, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/home/gleidson/.local/lib/python3.8/site-packages/spacy_wordnet/wordnet_annotator.py", line 17, in __call__
    wordnet = Wordnet(token=token, lang=self.__lang)
  File "/home/gleidson/.local/lib/python3.8/site-packages/spacy_wordnet/wordnet_domains.py", line 31, in __init__
    self.__lang = fetch_wordnet_lang(lang)
  File "/home/gleidson/.local/lib/python3.8/site-packages/spacy_wordnet/__utils__.py", line 35, in fetch_wordnet_lang
    raise Exception('Language {} not supported'.format(lang))
Exception: Language pt not supported

And for Spanish, the synonyms were in English.
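
For reference, the README's Portuguese example above passes the pipeline language to the component explicitly; whether that resolves this traceback depends on the installed spacy-wordnet release including 'pt' in its language mapping (a sketch for spaCy 3.x):

import spacy

nlp = spacy.load('pt_core_news_sm')
# Pass the pipeline's language code so the component can map it to
# WordNet's 'por'
nlp.add_pipe("spacy_wordnet", after='tagger', config={'lang': nlp.lang})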

Double Factory

Hello, I'm a spaCy core developer. Thank you for creating and maintaining this package.

I was recently updating the example docs for this package in the Universe (explosion/spaCy#11593), and looking at the implementation, I noticed you have two calls to Language.factory with different names. This means the component can be added with nlp.add_pipe("spacy_wordnet") or nlp.add_pipe("wordnet"). Is this intentional?
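
For illustration, both registered names would add the same component in spaCy 3.x (a sketch; the package must be imported first so the factories are registered):

import spacy
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("spacy_wordnet", after='tagger')  # first registered factory name
# nlp.add_pipe("wordnet", after='tagger')      # the second registered name also works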

SpaCy 3 support

I have spaCy 3 installed:

$ spacy info

============================== Info about spaCy ==============================

spaCy version    3.0.3                         
Location         /opt/anaconda3/envs/XXX/lib/python3.8/site-packages/spacy
Platform         Linux-5.4.0-65-generic-x86_64-with-glibc2.10
Python version   3.8.5                         
Pipelines        en_core_web_sm (3.0.0)        

I have installed spacy-wordnet:

$ pip install spacy-wordnet

Collecting spacy-wordnet
  Downloading spacy-wordnet-0.0.4.tar.gz (648 kB)
     |████████████████████████████████| 648 kB 6.5 MB/s 
Collecting nltk<3.4,>=3.3
  Downloading nltk-3.3.0.zip (1.4 MB)
     |████████████████████████████████| 1.4 MB 9.8 MB/s 
Requirement already satisfied: six in /opt/anaconda3/envs/syllabus/lib/python3.8/site-packages (from nltk<3.4,>=3.3->spacy-wordnet) (1.15.0)
Building wheels for collected packages: spacy-wordnet, nltk
  Building wheel for spacy-wordnet (setup.py) ... done
  Created wheel for spacy-wordnet: filename=spacy_wordnet-0.0.4-py2.py3-none-any.whl size=650293 sha256=73e3b3c9921a3a9fb841638b1fd3ab4d6442563a7cab510c5c663fa74849e242
  Stored in directory: /home/XXX/.cache/pip/wheels/d5/4a/26/0311c16a5294b36a6e018c0816f9e61a5377287fdd276e0f5c
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.3-py3-none-any.whl size=1394469 sha256=75e771dc87340388bb5d475c1732827b2987617b059307b5ff6d123bb75bf656
  Stored in directory: /home/XXX/.cache/pip/wheels/19/1d/3a/0a8c14c30132b4f9ffd796efbb6746f15b3d6bcfc1055a9346
Successfully built spacy-wordnet nltk
Installing collected packages: nltk, spacy-wordnet
Successfully installed nltk-3.3 spacy-wordnet-0.0.4

And tried to use it in Python:

import spacy
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('WordnetAnnotator', after="tagger")
---------------------------------------------------------------------------                          
ValueError                                Traceback (most recent call last)
<ipython-input-13-a2750d534e73> in <module>
----> 1 nlp.add_pipe('WordnetAnnotator', after="tagger")

/opt/anaconda3/envs/XXX/lib/python3.8/site-packages/spacy/language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    765                     lang_code=self.lang,
    766                 )                         
--> 767             pipe_component = self.create_pipe(
    768                 factory_name,
    769                 name=name,

/opt/anaconda3/envs/XXX/lib/python3.8/site-packages/spacy/language.py in create_pipe(self, factory_name, name, config, raw_config, validate)
    636                 lang_code=self.lang,
    637             )                             
--> 638             raise ValueError(err)
    639         pipe_meta = self.get_factory_meta(factory_name)
    640         config = config or {}

ValueError: [E002] Can't find factory for 'WordnetAnnotator' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).                                        

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, textcat_multilabel, en.lemmatizer                          

I suppose that spaCy 3 is not yet supported.
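
For what it's worth, the README's spaCy 3.x snippet above registers the factory under the string name "spacy_wordnet" rather than the class name, so with a spacy-wordnet release that supports spaCy 3 the call would look like this (a sketch; the 0.0.4 build installed above may simply predate that support):

import spacy
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

nlp = spacy.load("en_core_web_sm")
# The factory name is "spacy_wordnet", not the class name "WordnetAnnotator"
nlp.add_pipe("spacy_wordnet", after="tagger")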
