
pytorch-sgns's Introduction

PyTorch SGNS

Word2Vec's SkipGramNegativeSampling in Python.

Yet another but quite general negative sampling loss implemented in PyTorch.

It can be used with ANY embedding scheme! Pretty fast, I bet.

from torch.optim import Adam

from model import Word2Vec, SGNS  # both classes live in this repo's model.py

vocab_size = 20000
word2vec = Word2Vec(vocab_size=vocab_size, embedding_size=300)  # any embedding module with the same interface works
sgns = SGNS(embedding=word2vec, vocab_size=vocab_size, n_negs=20)  # n_negs: negative samples per context word
optim = Adam(sgns.parameters())
for batch, (iword, owords) in enumerate(dataloader):  # dataloader yields (input word, context words) batches
    loss = sgns(iword, owords)
    optim.zero_grad()
    loss.backward()
    optim.step()

New: supports negative sampling based on the word frequency distribution (raised to the 0.75th power) and subsampling of frequent words (to mitigate word frequency imbalance).
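
For reference, a minimal sketch of how these two quantities are usually computed, following Mikolov et al.; the names (word_counts, the threshold t = 1e-5) are illustrative and not necessarily those used in preprocess.py:

import numpy as np

# illustrative corpus frequencies, indexed by word id (not the repo's actual data)
word_counts = np.array([1200, 800, 50, 3], dtype=np.float64)
freq = word_counts / word_counts.sum()

# negative-sampling weights: unigram distribution raised to the 0.75th power,
# suitable for torch.multinomial when drawing negative samples
weights = freq ** 0.75
weights /= weights.sum()

# subsampling: probability of discarding each occurrence of a word (Mikolov et al.);
# t = 1e-5 is a common default, not necessarily the value used in this repo
t = 1e-5
discard_prob = np.clip(1.0 - np.sqrt(t / freq), 0.0, 1.0)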

To test this repo, place a space-delimited corpus at data/corpus.txt, then run python preprocess.py and python train.py --weights --cuda (use the -h option for help).

pytorch-sgns's People

Contributors

theeluwin


pytorch-sgns's Issues

Use of discard probabilities

As far as I can see, the ws variable, which holds the discard probability of each word, is unused. Should it be applied when calculating the weights?
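
For context, discard probabilities like ws are typically applied while generating training pairs, by randomly dropping occurrences of frequent words. A minimal sketch of that usage (illustrative only, not the repo's implementation):

import random

def subsample(corpus, discard_prob):
    """Randomly drop word occurrences according to their discard probability.

    corpus: list of word ids; discard_prob: mapping from word id to probability.
    Illustrative helper; preprocess.py / train.py may organise this differently.
    """
    return [w for w in corpus if random.random() >= discard_prob[w]]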

applying regularisation

Hi theeluwin!

First of all thanks for the code, it was well written and helped me a ton in building my own word2vec model.

This is not an issue per se, but something I'm potentially adding to the word2vec model using your code. The main idea is to use regularisation on embeddings in a temporal setting. I've run into trouble with the code, and I'm wondering if you'd be so kind as to help out!

The main idea is that I'm training two models (model 0 and model 1) consecutively on two corpora that are temporally adjacent (say, news articles from 01 Jan and 02 Jan). During the training of model 1, I'd like to add a penalty term to the loss/cost function: for all the words in set(vocab_0) & set(vocab_1), I'd like to minimise the distance between the same word's embeddings from periods 0 and 1.

I'm not sure if it makes sense!

So far I'm testing on embeddings of rather small dimension (~20), so I'm using the Euclidean distance as a measure.

Based on your code, I added a forward_r function to the Word2Vec class:

def forward_r(self, data):
    if data is not None:
        v = LT(data)
        v = v.cuda() if self.ivectors.weight.is_cuda else v
        return self.ivectors(v)
    else:
        return None

This function simply extracts the relevant embeddings (words from the intersection of the two vocabularies).

Then, in the SGNS class (for now I'm only testing on one particular embedding), I added the following loss calculation:

rvectors = self.embedding.forward_r(rwords)
rloss = 3 * ((rvectors.squeeze() - self.vector3) ** 2).sum()

Finally, it would return the following total loss:

return -(oloss + nloss).mean() + rloss

However, the problem is that the loss gets stuck: it never updates, and it appears that backpropagation is not working properly.

As you can probably tell, I'm rather new to PyTorch, so I'd really appreciate it if you could lend me a hand with what's happening!

Thank you so much in advance!

Where is the Expectation?

(screenshot of the skip-gram negative-sampling objective from the word2vec paper:)

$$\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right]$$

In this formula we have an expectation over $w_i$. That means that for each pair $(w_I, w_O)$ we should estimate this expectation. But as far as I can see, in your code you sample n_negs negative samples for each pair $(w_I, w_O)$. Wouldn't it be more correct to sample $N$ values of $w_i$ for each of the n_negs terms, take the empirical mean of the expression in square brackets, and only then accumulate the n_negs means?
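
For what it's worth, the common reading (my interpretation, not necessarily the author's) is that drawing the $k$ = n_negs negatives themselves already acts as the Monte Carlo estimate of the summed expectations:

$$\sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma(-{v'_{w_i}}^{\top} v_{w_I})\right] \;\approx\; \sum_{i=1}^{k} \log \sigma(-{v'_{\tilde w_i}}^{\top} v_{w_I}), \qquad \tilde w_i \sim P_n(w),$$

i.e. one sample per term. Drawing $N$ samples per term would reduce the variance of the estimate, but it is not required for an unbiased stochastic gradient.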

An error occurred when testing the repo

Hi,
Thank you for sharing the code. However, when I tried to test the repo with "python preprocess.py" and "python train.py --weights --cuda", the first one worked well and generated the processed data, whereas the second reported the following error:

[Epoch 1]: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 93, in <module>
    train(parse_args())
  File "train.py", line 81, in train
    loss = sgns(iword, owords)
  File "/home/weixin/anaconda2/envs/p3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/weixin/Downloads/pytorch-sgns-master/model.py", line 70, in forward
    ivectors = self.embedding.forward_i(iword).unsqueeze(2)
  File "/home/weixin/Downloads/pytorch-sgns-master/model.py", line 42, in forward_i
    return self.ivectors(v)
  File "/home/weixin/anaconda2/envs/p3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/weixin/anaconda2/envs/p3/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 103, in forward
    self.scale_grad_by_freq, self.sparse
RuntimeError: save_for_backward can only save input or output tensors, but argument 0 doesn't satisfy this condition

I am quite new to PyTorch, so any idea what might be going wrong?
Many thanks.

How to ensure that the negative sampled words are not the target word?

First, thanks for your excellent code :)

In model.py, the following piece of code suggests that we may draw a positive word when we do negative sampling, though the probability is very small:

nwords = t.multinomial(self.weights, batch_size * context_size * self.n_negs, replacement=True).view(batch_size, -1)

I'm wondering why you didn't perform an equality check. Is that because it doesn't affect the quality of the trained word vectors but would slow down training?
Are there other reasons?
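
For anyone who does want to filter such collisions, here is a minimal sketch (illustrative, not part of this repo) that resamples any negatives equal to the input word; the same idea could be applied against context words instead:

import torch as t

def sample_negatives(weights, iword, n_samples, max_retries=5):
    """Draw negatives from `weights`, resampling entries that collide with the input word.

    weights: 1-D multinomial weights over the vocabulary; iword: (batch,) LongTensor of
    input-word ids; n_samples: negatives per example. Illustrative helper, not repo code.
    """
    nwords = t.multinomial(weights, iword.size(0) * n_samples, replacement=True).view(iword.size(0), -1)
    for _ in range(max_retries):
        collision = nwords.eq(iword.unsqueeze(1))   # (batch, n_samples) mask of collisions
        if not collision.any():
            break
        resampled = t.multinomial(weights, int(collision.sum()), replacement=True)
        nwords[collision] = resampled               # replace only the colliding entries
    return nwords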

myConfusion

Why can the code below, used in the project, serve as a "loss"?

oloss = t.bmm(ovectors, ivectors).squeeze().sigmoid().log().mean(1)
nloss = t.bmm(nvectors, ivectors).squeeze().sigmoid().log().view(-1, context_size, self.n_negs).sum(2).mean(1)

In my judgment, a "loss" should compare a "prediction" with an "actual result".
But in the code above, "oloss" is just a prediction, with no operation involving the actual result.
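
One way to see why this still works (my own illustration, not taken from the repo): the labels are implicit. Observed (input, context) pairs carry label 1 and sampled negatives carry label 0, so log sigma(score) and log sigma(-score) are exactly the binary cross-entropy terms for those labels. A minimal sketch of the equivalence:

import torch as t
import torch.nn.functional as F

score_pos = t.randn(8)  # dot products for observed (input, context) pairs, implicit label 1
score_neg = t.randn(8)  # dot products for sampled negatives, implicit label 0

# SGNS-style objective: maximise log sigma(pos) + log sigma(-neg), i.e. minimise the negation
sgns_loss = -(t.sigmoid(score_pos).log() + t.sigmoid(-score_neg).log()).mean()

# the same quantity written as binary cross-entropy against explicit 1/0 labels
bce_loss = (F.binary_cross_entropy_with_logits(score_pos, t.ones(8))
            + F.binary_cross_entropy_with_logits(score_neg, t.zeros(8)))

print(t.allclose(sgns_loss, bce_loss))  # True (up to floating point)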

Purpose of unks in skipgram function

Hi,

Can you please explain what the purpose is of including the <UNK> tokens in the owords vector produced by the skipgram function? What should the model learn by using these as training examples?

Also, what is the purpose of the variable ws in the train function, if it's not used anywhere after its definition?

Confused by the loss function.

In your code, you minimized -(oloss + nloss).mean(),

which means (oloss + nloss) should become large.
So I expected oloss to become large and nloss to become small.

Although -(oloss + nloss) decreases, I observe oloss becoming small and nloss becoming large. How so?

Bug in the Loss Function

The loss function currently implemented is
-(oloss + nloss).mean()

It should be
(-oloss + nloss).mean()

You want to minimize the distance between "positive samples" and maximize the distance between "negative samples".
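
For reference, the quantity usually minimised in SGNS (my summary of the standard formulation, not a statement about where this repo applies each sign) is

$$\mathcal{L} = -\frac{1}{|B|}\sum_{(w_I, w_O) \in B}\Big[\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k}\log \sigma(-{v'_{\tilde w_i}}^{\top} v_{w_I})\Big],$$

so whether the code should read -(oloss + nloss) or (-oloss + nloss) depends on whether the negative-sample scores are already negated before the sigmoid when nloss is computed.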
