
pytorch-sgns's Introduction

PyTorch SGNS

Word2Vec's SkipGramNegativeSampling in Python.

Yet another but quite general negative sampling loss implemented in PyTorch.

It can be used with ANY embedding scheme! Pretty fast, I bet.

from torch.optim import Adam

from model import Word2Vec, SGNS  # both classes live in this repo's model.py

vocab_size = 20000
word2vec = Word2Vec(vocab_size=vocab_size, embedding_size=300)  # any embedding module with the same interface works
sgns = SGNS(embedding=word2vec, vocab_size=vocab_size, n_negs=20)  # n_negs: negative samples per context word
optim = Adam(sgns.parameters())
for batch, (iword, owords) in enumerate(dataloader):  # dataloader yields (input word, context words) batches
    loss = sgns(iword, owords)
    optim.zero_grad()
    loss.backward()
    optim.step()

New: supports negative sampling based on the word frequency distribution (raised to the 0.75th power) and subsampling of frequent words (to mitigate word frequency imbalance).
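
For reference, a minimal sketch of how these two quantities are usually computed, following Mikolov et al.; the names (word_counts, the threshold t = 1e-5) are illustrative and not necessarily those used in preprocess.py:

import numpy as np

# illustrative corpus frequencies, indexed by word id (not the repo's actual data)
word_counts = np.array([1200, 800, 50, 3], dtype=np.float64)
freq = word_counts / word_counts.sum()

# negative-sampling weights: unigram distribution raised to the 0.75th power,
# suitable for torch.multinomial when drawing negative samples
weights = freq ** 0.75
weights /= weights.sum()

# subsampling: probability of discarding each occurrence of a word (Mikolov et al.);
# t = 1e-5 is a common default, not necessarily the value used in this repo
t = 1e-5
discard_prob = np.clip(1.0 - np.sqrt(t / freq), 0.0, 1.0)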

To test this repo, place a space-delimited corpus at data/corpus.txt, then run python preprocess.py and python train.py --weights --cuda (use the -h option for help).

pytorch-sgns's People

Contributors

theeluwin


pytorch-sgns's Issues

Use of discard probabilities

As far as I can see, the ws variable, which holds the discard probability of each word, is unused. Should it be applied when calculating the weights?
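
For context, discard probabilities like ws are typically applied while generating training pairs, by randomly dropping occurrences of frequent words. A minimal sketch of that usage (illustrative only, not the repo's implementation):

import random

def subsample(corpus, discard_prob):
    """Randomly drop word occurrences according to their discard probability.

    corpus: list of word ids; discard_prob: mapping from word id to probability.
    Illustrative helper; preprocess.py / train.py may organise this differently.
    """
    return [w for w in corpus if random.random() >= discard_prob[w]]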

applying regularisation

Hi theeluwin!

First of all thanks for the code, it was well written and helped me a ton in building my own word2vec model.

This is not an issue per se, but something I'm potentially adding to the word2vec model using your code. The main idea is to use regularisation on embeddings in a temporal setting. I've run into trouble with the code, and I'm wondering if you'd be so kind as to help out!

The main idea is that I'm training two models (model 0 and model 1) consecutively on two corpora that are temporally adjacent (say, news articles from 01 Jan and 02 Jan). During the training of model 1, I'd like to add a penalty term to the loss/cost function: for all the words in set(vocab_0) & set(vocab_1), I'd like to minimise the distance between the same word's embeddings from periods 0 and 1.

I'm not sure if it makes sense!

So far I'm testing on embeddings of rather small dimension (~20), so I'm using the Euclidean distance as a measure.

Based on your code, I added a forward_r function to the Word2Vec class:

def forward_r(self, data):
    if data is not None:
        v = LT(data)
        v = v.cuda() if self.ivectors.weight.is_cuda else v
        return self.ivectors(v)
    else:
        return None

This function simply extracts the relevant embeddings (words from the intersection of the two vocabularies).

Then, in the SGNS class (for now I'm only testing on one particular embedding), I added the following loss calculation:

rvectors = self.embedding.forward_r(rwords)
rloss = 3 * ((rvectors.squeeze() - self.vector3) ** 2).sum()

Finally, it would return the following total loss:

return -(oloss + nloss).mean() + rloss

However, the problem is that the loss gets stuck: it never updates, and it appears that backpropagation is not working properly.

As you can probably tell, I'm rather new to PyTorch, so I'd really appreciate it if you could lend me a hand with what's happening!

Thank you so much in advance!

Where is the Expectation?

(screenshot of the skip-gram negative-sampling objective from the word2vec paper:)

$$\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right]$$

In this formula we have an expectation over $w_i$. That means that for each pair $(w_I, w_O)$ we should estimate this expectation. But as far as I can see, in your code you sample n_negs negative samples for each pair $(w_I, w_O)$. Wouldn't it be more correct to sample $N$ values of $w_i$ for each of the n_negs terms, take the empirical mean of the expression in square brackets, and only then accumulate the n_negs means?
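
For what it's worth, the common reading (my interpretation, not necessarily the author's) is that drawing the $k$ = n_negs negatives themselves already acts as the Monte Carlo estimate of the summed expectations:

$$\sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma(-{v'_{w_i}}^{\top} v_{w_I})\right] \;\approx\; \sum_{i=1}^{k} \log \sigma(-{v'_{\tilde w_i}}^{\top} v_{w_I}), \qquad \tilde w_i \sim P_n(w),$$

i.e. one sample per term. Drawing $N$ samples per term would reduce the variance of the estimate, but it is not required for an unbiased stochastic gradient.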

An error occurred when testing the repo

Hi,
Thank you for sharing the code. However, when I tried to test the repo with "python preprocess.py" and "python train.py --weights --cuda", the first one worked well and generated the processed data, whereas the second reported the following error:

[Epoch 1]: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 93, in <module>
    train(parse_args())
  File "train.py", line 81, in train
    loss = sgns(iword, owords)
  File "/home/weixin/anaconda2/envs/p3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/weixin/Downloads/pytorch-sgns-master/model.py", line 70, in forward
    ivectors = self.embedding.forward_i(iword).unsqueeze(2)
  File "/home/weixin/Downloads/pytorch-sgns-master/model.py", line 42, in forward_i
    return self.ivectors(v)
  File "/home/weixin/anaconda2/envs/p3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/weixin/anaconda2/envs/p3/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 103, in forward
    self.scale_grad_by_freq, self.sparse
RuntimeError: save_for_backward can only save input or output tensors, but argument 0 doesn't satisfy this condition

I am quite new to PyTorch, so any idea what might be going wrong?
Many thanks.

How to ensure that the negative sampled words are not the target word?

First, thanks for your excellent code :)

In model.py, the following piece of code suggests that we may draw a positive word when we do negative sampling, though the probability is very small:

nwords = t.multinomial(self.weights, batch_size * context_size * self.n_negs, replacement=True).view(batch_size, -1)

I'm wondering why you didn't perform an equality check. Is that because it doesn't affect the quality of the trained word vectors but would slow down training?
Are there other reasons?
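
For anyone who does want to filter such collisions, here is a minimal sketch (illustrative, not part of this repo) that resamples any negatives equal to the input word; the same idea could be applied against context words instead:

import torch as t

def sample_negatives(weights, iword, n_samples, max_retries=5):
    """Draw negatives from `weights`, resampling entries that collide with the input word.

    weights: 1-D multinomial weights over the vocabulary; iword: (batch,) LongTensor of
    input-word ids; n_samples: negatives per example. Illustrative helper, not repo code.
    """
    nwords = t.multinomial(weights, iword.size(0) * n_samples, replacement=True).view(iword.size(0), -1)
    for _ in range(max_retries):
        collision = nwords.eq(iword.unsqueeze(1))   # (batch, n_samples) mask of collisions
        if not collision.any():
            break
        resampled = t.multinomial(weights, int(collision.sum()), replacement=True)
        nwords[collision] = resampled               # replace only the colliding entries
    return nwords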

myConfusion

Why can the code below, used in the project, serve as a "loss"?

oloss = t.bmm(ovectors, ivectors).squeeze().sigmoid().log().mean(1)
nloss = t.bmm(nvectors, ivectors).squeeze().sigmoid().log().view(-1, context_size, self.n_negs).sum(2).mean(1)

In my judgment, a "loss" should compare a "prediction" with an "actual result".
But in the code above, "oloss" is just a prediction, with no operation involving the actual result.
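
One way to see why this still works (my own illustration, not taken from the repo): the labels are implicit. Observed (input, context) pairs carry label 1 and sampled negatives carry label 0, so log sigma(score) and log sigma(-score) are exactly the binary cross-entropy terms for those labels. A minimal sketch of the equivalence:

import torch as t
import torch.nn.functional as F

score_pos = t.randn(8)  # dot products for observed (input, context) pairs, implicit label 1
score_neg = t.randn(8)  # dot products for sampled negatives, implicit label 0

# SGNS-style objective: maximise log sigma(pos) + log sigma(-neg), i.e. minimise the negation
sgns_loss = -(t.sigmoid(score_pos).log() + t.sigmoid(-score_neg).log()).mean()

# the same quantity written as binary cross-entropy against explicit 1/0 labels
bce_loss = (F.binary_cross_entropy_with_logits(score_pos, t.ones(8))
            + F.binary_cross_entropy_with_logits(score_neg, t.zeros(8)))

print(t.allclose(sgns_loss, bce_loss))  # True (up to floating point)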

Purpose of unks in skipgram function

Hi,

Can you please explain what the purpose is of including the <UNK> tokens in the owords vector produced by the skipgram function? What should the model learn by using these as training examples?

Also, what is the purpose of the variable ws in the train function, if it's not used anywhere after its definition?

Confused by the loss function.

In your code, you minimized -(oloss + nloss).mean(),

which means (oloss + nloss) should become large.
So I expected oloss to become large and nloss to become small.

Although -(oloss + nloss) decreases, I observe oloss becoming small and nloss becoming large. How so?

Bug in the Loss Function

The loss function currently implemented is
-(oloss + nloss).mean()

It should be
(-oloss + nloss).mean()

You want to minimize the distance between "positive samples" and maximize the distance between "negative samples".
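
For reference, the quantity usually minimised in SGNS (my summary of the standard formulation, not a statement about where this repo applies each sign) is

$$\mathcal{L} = -\frac{1}{|B|}\sum_{(w_I, w_O) \in B}\Big[\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k}\log \sigma(-{v'_{\tilde w_i}}^{\top} v_{w_I})\Big],$$

so whether the code should read -(oloss + nloss) or (-oloss + nloss) depends on whether the negative-sample scores are already negated before the sigmoid when nloss is computed.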
