Comments (7)
spacy.tokens.Token.repvec loads pretrained word representations. This should have a better API -- currently you need to precompile the word vectors. I agree that it would be very useful to be able to efficiently retrieve similar words...
from spacy.
On Wednesday, May 6, 2015, mfilipov [email protected] wrote:
spacy.tokens.Token.repvec loads pretrained word representations. This should have a better API -- currently you need to precompile the word vectors. I agree that it would be very useful to be able to efficiently retrieve similar words...
This is an open research problem. Consider that there are (n^2-n)/2
combinations, with n of order 10^6 for our vocab.
The folks over at gensim are thinking about this too: piskvorky/gensim#51
One solution is to stick a cache in front of the query system. We could probably serve a majority of similarity queries from a small cache.
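As a rough illustration of the caching idea, here is a minimal sketch with a toy vocabulary and made-up vectors (not spaCy's actual query system): repeated queries for the same word are served from an LRU cache instead of recomputing the dot products.

```python
import numpy as np
from functools import lru_cache

# Toy stand-ins for the real vocabulary and its (normalised) vectors.
rng = np.random.default_rng(0)
vocab = ["space", "query", "cache", "system", "similarity"]
vectors = rng.normal(size=(len(vocab), 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

@lru_cache(maxsize=10_000)
def most_similar(word, topn=3):
    # On unit vectors, cosine similarity is just a dot product.
    dists = vectors @ vectors[vocab.index(word)]
    order = np.argsort(dists)[::-1][:topn]
    return tuple((vocab[i], float(dists[i])) for i in order)

most_similar("query")  # computed once...
most_similar("query")  # ...then served from the cache
```

Since query words follow a Zipfian distribution, a small cache like this should absorb the bulk of real traffic.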
I don't know how Matthew generated the built-in embedding vectors, but the quality seems okay to me.
If you're not happy with them and want to use your own embeddings, there's spacy.vocab.write_binary_vectors you can use: it takes a bzip2-compressed file as input, and you set the output to .../spacy/en/data/vocab/vec.bin. I've never done it and have only read the code, so I might be wrong. (EDIT: apparently there's a script to do that.)
Also remember that spaCy expects 300 as the size of each vector, as it's hardcoded here and here.
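Because the dimensionality is fixed, it's worth validating your embeddings before converting them. A minimal sketch in plain NumPy (the array here is a made-up stand-in for vectors parsed from your own file):

```python
import numpy as np

EXPECTED_DIM = 300  # the size spaCy has hardcoded

# Stand-in for embeddings parsed out of your own vectors file.
embeddings = np.zeros((5, 300), dtype=np.float32)

if embeddings.shape[1] != EXPECTED_DIM:
    raise ValueError("spaCy expects %d-dimensional vectors, got %d"
                     % (EXPECTED_DIM, embeddings.shape[1]))
```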
A most_similar function is easy to implement, but I'd rather do it somewhere outside spaCy. You can also use gensim to load spaCy's vectors!
import numpy as np
import spacy.en
from gensim.models.word2vec import Word2Vec, Vocab

nlu = spacy.en.English()
model = Word2Vec(size=300)
for i, lex in enumerate(nlu.vocab):
    model.vocab[lex.orth_] = Vocab(index=i, count=None)
    model.index2word.append(lex.orth_)
# spaCy's vectors are already normalised, so load them straight in.
model.syn0norm = np.asarray([lex.repvec for lex in nlu.vocab])
The vectors are loaded into model.syn0norm because spaCy's vectors are already normalised, and this avoids gensim's own L2-normalisation, which breaks on all-zeros vectors.
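To see why this matters: on unit-length vectors, cosine similarity reduces to a plain dot product, while naive L2-normalisation divides by zero on an all-zeros vector (which spaCy uses for words without an embedding). A small standalone illustration:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity computed directly...
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# ...equals a plain dot product once both vectors are unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
assert np.isclose(cos, np.dot(a_unit, b_unit))

# Naively normalising an all-zeros vector divides by zero and
# yields NaNs, which is why pre-normalised vectors go straight
# into syn0norm instead of through gensim's normalisation.
with np.errstate(invalid="ignore"):
    broken = np.zeros(2) / np.linalg.norm(np.zeros(2))
assert np.isnan(broken).all()
```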
In [249]: model.most_similar(u'space')
Out[249]:
[(u'Space', 1.0),
(u'SPACE', 1.0),
(u'SPACES', 0.7741692662239075),
(u'spaces', 0.7741692662239075),
(u'Spaces', 0.7741692662239075),
(u'workspace', 0.6425580978393555),
(u'hyperspace', 0.578804612159729),
(u'CYBERSPACE', 0.5667369961738586),
(u'cyberspace', 0.5667369961738586),
(u'Cyberspace', 0.5667369961738586)]
But if you just want a most_similar function, read the docs on how to compute vector similarity, or you can do something like this...
import numpy as np
import spacy.en

nlu = spacy.en.English()
vectors = np.asarray([lex.repvec for lex in nlu.vocab])
vocab = [lex.orth_ for lex in nlu.vocab]

def most_similar(word, topn=5):
    if isinstance(word, str):
        word = unicode(word)
    dists = np.dot(vectors, nlu.vocab[word].repvec)
    return [(vocab[i], dists[i]) for i in np.argsort(dists)[::-1][:topn]]
Example:
In [189]: most_similar(u'query')
Out[189]:
[(u'query', 0.99999994),
(u'Query', 0.99999994),
(u'queries', 0.74670404),
(u'keystroke', 0.65798581),
(u'signup', 0.62638754)]
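With a vocabulary on the order of 10^6 entries, a full np.argsort per query is wasteful; np.argpartition selects the top-n in linear time, and only those n then need sorting. A sketch of the same top-n selection on toy data (random scores standing in for real similarities):

```python
import numpy as np

rng = np.random.default_rng(1)
dists = rng.normal(size=1_000_000)  # one similarity per vocab entry
topn = 5

# O(n) selection of the top-n indices, then sort just those n.
part = np.argpartition(dists, -topn)[-topn:]
top = part[np.argsort(dists[part])[::-1]]

# Same answer as a full sort, without sorting a million scores.
assert np.array_equal(top, np.argsort(dists)[::-1][:topn])
```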
Extra
Maybe you want to find the most similar words within the same sentence. That's also easy; you can have a function like this...
def most_similar_in_sentence(tokens, word, pos_tags=(), topn=5):
    if isinstance(word, str):
        word = unicode(word)
    if pos_tags:
        vocab = [t for t in tokens if t.pos_ in pos_tags]
    else:
        vocab = list(tokens)
    dists = np.dot(np.asarray([t.repvec for t in vocab]), nlu.vocab[word].repvec)
    return [(vocab[i].orth_, dists[i]) for i in np.argsort(dists)[::-1][:topn]]
Example
In [199]: text = u"""One solution is to stick a cache in front of the query system. We could probably serve a majority of similarity queries from a small cache."""
In [200]: tokens = nlu(text)
In [201]: most_similar_in_sentence(tokens, u'query')
Out[201]:
[(u'query', 0.99999994),
(u'queries', 0.74670404),
(u'cache', 0.49115649),
(u'cache', 0.49115646),
(u'solution', 0.36885005)]
Maybe you want to filter based on POS tag.
In [202]: most_similar_in_sentence(tokens, u'query', pos_tags=['VERB'])
Out[202]:
[(u'stick', 0.29887787),
(u'could', 0.23798984),
(u'is', 0.2375059),
(u'serve', 0.23126227)]
In [203]: most_similar_in_sentence(tokens, u'query', pos_tags=['NOUN', 'VERB', 'ADJ', 'NUM'], topn=10)
Out[203]:
[(u'query', 0.99999994),
(u'queries', 0.74670416),
(u'cache', 0.49115649),
(u'cache', 0.49115646),
(u'solution', 0.36885005),
(u'system', 0.32158077),
(u'stick', 0.29887787),
(u'similarity', 0.29796919),
(u'One', 0.24750805),
(u'could', 0.23798984)]
Hope that helps!
The vectors are taken from here: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
These embeddings were computed from dependencies, using a parser that's very similar to spaCy's (the Goldberg and Nivre (2012) model; spaCy adds some updates to this model, most importantly Brown cluster features).
The dependency-based embeddings seem substantially better to me. At least, they agree more closely with my expectations for what sort of regularities these vectors should be capturing.
Once I get around to generating embeddings myself, I'd like to have a set of vectors keyed by (lemma, POS tag) tuples, instead of the current string-keyed vectors, which rely on somewhat arbitrary text pre-processing. I'd also like to do named entity linking, and have vectors in the same space for entities.
This stuff is in the roadmap, but somewhere behind constituency parsing, and the named entity plans.
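The (lemma, POS)-keyed lookup described above could be as simple as a dict keyed by tuples. A toy sketch with hypothetical lemmas and random vectors (not an actual spaCy data structure), showing how it disambiguates homographs:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical table: one vector per (lemma, POS tag) pair, so e.g.
# "duck"/NOUN and "duck"/VERB get distinct representations.
vectors = {
    ("duck", "NOUN"): rng.normal(size=8),
    ("duck", "VERB"): rng.normal(size=8),
    ("space", "NOUN"): rng.normal(size=8),
}

def get_vector(lemma, pos):
    return vectors.get((lemma, pos))

assert not np.array_equal(get_vector("duck", "NOUN"),
                          get_vector("duck", "VERB"))
```

Entities resolved by a named entity linker could live in the same table under their own keys, keeping all vectors in one space.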
See additional documentation here: http://spacy.io/tutorials/load-new-word-vectors/
@honnibal: Can you share the code snippet to load Omer Levy's dependency embeddings?
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.