princetonml / sif
Sentence embedding by the Smooth Inverse Frequency (SIF) weighting scheme
License: MIT License
This is one of the worst codebases I have seen from the original authors. Code correctness, runtime efficiency, variable naming, code structure: all of it leaves much to be desired. It is remarkable that a paper whose text is so clear and concise was implemented this badly.
I would like to run this against an nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 Docker image. I have:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 38C P8 17W / 125W | 0MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
and I could eventually install a BLAS such as Intel MKL (which could speed up numpy), etc. Does the current implementation use any low-level GPU routines?
Thank you
Hi,
I downloaded and unzipped the glove.840B.300d.zip file, and then I ran python sif_embedding.py.
After 30 minutes I receive this error:
Traceback (most recent call last):
File "sif_embedding.py", line 15, in <module>
(words, We) = data_io.getWordmap(wordfile)
File "../src/data_io.py", line 22, in getWordmap
return (words, np.array(We))
MemoryError
Have you ever seen this error?
NB: I killed the process after the terminal showed this error, because the process goes into a deadlock/loop (I'm sure it doesn't make progress).
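In case it helps others hitting this: a sketch of a lower-memory loader (my own workaround, not part of the repository). getWordmap accumulates Python lists of floats and only converts with np.array at the end; parsing each line straight into a float32 row cuts peak memory substantially. The helper name and the one-token-per-line assumption are mine.

import numpy as np

def getWordmap_lowmem(textfile, dim=300):
    # Stream the GloVe file, converting each line directly into a float32
    # row instead of accumulating Python floats and calling np.array at
    # the end. Assumes each line is one token followed by `dim` floats.
    words, rows = {}, []
    with open(textfile, 'r') as f:
        for n, line in enumerate(f):
            parts = line.rstrip().split(' ')
            words[parts[0]] = n
            rows.append(np.asarray(parts[-dim:], dtype=np.float32))
    return words, np.vstack(rows)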
Is p(w) the word frequency within a single document, or over the whole corpus?
Hi !
As I was using your library to benchmark SIF embeddings against other representations, I stumbled upon a few issues, which I corrected, and I ended up writing some extra code to simplify the inference of sentence embeddings, as well as adding new functionality.
In the library's current state, the principal components are recomputed every time vectors are requested, which means that the same sentence can get different embeddings across calls. To correct this, I modified the code to pass the principal components as parameters.
Here is the interface I'm suggesting:
import pickle

import data_io
import params
import SIF_embedding

class SIFCtrl(object):
    """Infer SIF embeddings, optionally reusing precomputed principal components."""

    def __init__(self, wordfile, weightfile, pc_path=None, weightpara=1e-3, rmpc=1):
        self.rmpc = rmpc
        self.words, self.We = data_io.getWordmap(wordfile)
        self.word2weight = data_io.getWordWeight(weightfile, weightpara)
        self.weight4ind = data_io.getWeight(self.words, self.word2weight)
        self.params = params.params()
        self.params.rmpc = self.rmpc
        if pc_path is not None:
            # load previously saved principal components
            with open(pc_path, 'rb') as fs:
                self.pc = pickle.load(fs)
        else:
            self.pc = None

    def compute_pc(self, sentences):
        # estimate the principal component(s) once, from a reference corpus
        x, m = data_io.sentences2idx(sentences, self.words)
        w = data_io.seq2weight(x, m, self.weight4ind)
        emb = SIF_embedding.get_weighted_average(self.We, x, w)
        self.pc = SIF_embedding.compute_pc(emb, self.params.rmpc)

    def save_pc(self, filename):
        if self.pc is not None:
            with open(filename, 'wb') as fs:
                pickle.dump(self.pc, fs)

    def get_embeddings(self, sentences):
        x, m = data_io.sentences2idx(sentences, self.words)
        w = data_io.seq2weight(x, m, self.weight4ind)
        return SIF_embedding.SIF_embedding(self.We, x, w, self.params, self.pc)
How to use:

WORD_FILE_PATH = "/path/to/embs/glove.840B.300d.txt"
WEIGHT_FILE_PATH = "/path/to/freqs/enwiki_vocab_min200.txt"

sentences = [...]  # some sentences used to compute the pcs

SIF_ctrl = SIFCtrl(WORD_FILE_PATH, WEIGHT_FILE_PATH)
SIF_ctrl.compute_pc(sentences)
SIF_ctrl.save_pc('my_pc.pkl')  # save the pcs so we can load them directly later

embs1 = SIF_ctrl.get_embeddings(['once upon a time .', 'the northern wind is cold .'])
embs2 = SIF_ctrl.get_embeddings(['once upon a time .', 'and now for something completely different .'])
Now the embedding of 'once upon a time .' from the two calls above will be the same. The default behavior (computing the pcs from the input) is obtained simply by not calling compute_pc.
For SIF embeddings, the principal components are computed to capture a representation of the syntax, which can be approximated from any general English corpus, so I would suggest shipping the library with some precomputed pcs.
If you're interested, I can bundle all of this into a PR. Let me know, and thanks for the nice work.
Hi, I want to know whether I can use other pre-trained word vectors instead of GloVe. The code runs fine with GloVe 50d and 300d, but I want to use it with PubMed vectors, and it shows me an error when I try. How can I do that?
Are there any pre-trained SIF models available to be used out of the box?
I get this error when I try to run sif_embedding, but I think the issue is with data_io.
How are the supplied files meant to be used, besides running the demo? I'd like to use SIF to evaluate the similarity of sentences I supply. There is no training needed if I just use the GloVe embeddings, correct? What are the neural nets in src used for, then?
Thanks!
File "C:\Users\gdev\git\SIF\examples\sif_embedding.py", line 13, in <module>
(words, We) = data_io.getWordmap(wordfile)
File "../src\data_io.py", line 12, in getWordmap
lines = f.readlines()
File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>
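For anyone else hitting this on Windows: Python 3 falls back to the platform default codec (cp1252 here) when no encoding is given, while the GloVe file is UTF-8. A likely fix, though not an official patch, is to pass the encoding explicitly where data_io.getWordmap opens the file (parameter name assumed):

# In data_io.getWordmap, open the embedding file with an explicit encoding
# instead of the platform default (cp1252 on Windows):
with open(textfile, 'r', encoding='utf-8') as f:
    lines = f.readlines()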
In Algorithm 1, u is the singular vector of the matrix whose columns are the v_s, yet later you mention:
"In other words, the final sentence embedding is obtained by subtracting the projection of the c̃_s's to their first principal component. This is summarized in Algorithm 1."
If one looks only at Algorithm 1, the vector c̃_s does not appear to be used at all; or is u in fact obtained from the c̃_s's?
The train.sh script is only for the supervised tasks, right?
I am getting the following error:
File "sif_embedding.py", line 13, in
(words, We) = data_io.getWordmap(wordfile)
File "../src/data_io.py", line 18, in getWordmap
v.append(float(i[j]))
ValueError: could not convert string to float: '.'
Is this a bug, or do we always expect numerical data at the line v.append(float(i[j]))?
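As far as I can tell, the fields are expected to be numerical; the failure usually comes from glove.840B.300d.txt containing keys that themselves contain spaces (e.g. ". . ."), so reading fields left to right puts a '.' where a float should be. A sketch of a right-anchored parse that tolerates such keys (my workaround, not the repository's code):

import numpy as np

DIM = 300  # dimensionality of glove.840B.300d

def parse_glove_line(line):
    # Take the last DIM fields as the vector and rejoin whatever precedes
    # them into the key, so multi-token keys like ". . ." stay intact.
    parts = line.rstrip().split(' ')
    word = ' '.join(parts[:-DIM])
    vec = np.asarray(parts[-DIM:], dtype=np.float32)
    return word, vec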
In your code, XX = X - X.dot(pc.transpose()) * pc represents subtracting u·uᵀ·v_s, but how can u·uᵀ·v_s represent the projection of v_s onto u? The projection should be v_s·u/|u|, or |v_s|·cos(θ).
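For reference, a short derivation (standard linear algebra, assuming the component returned by the SVD is unit-norm, which is the usual convention): for $\|u\| = 1$,

$$u u^{\top} v_s = (u^{\top} v_s)\,u = \frac{v_s \cdot u}{\|u\|^{2}}\,u,$$

which is exactly the vector projection of $v_s$ onto $u$; the quantity $v_s \cdot u / \|u\|$ is only its scalar length.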
I'm just wondering what kind of preprocessing of the sentences I need to do for SIF embedding. For example,
I am following your framework and extending your work by adding "attention". However, when reviewing your code, I am confused about whether you computed the first principal component on the training data or not. If you compute the first principal component on the current dataset (the test data), it seems that you leak information into the sentence embeddings of the test set; in that case, your results may not be acceptable.
In the example, MSRpar2012 has only 750 lines of sentences.
The method works fine on a small volume of data. But for big data, for example when the number of sentences is near 400,000, the PCA computation could be a big problem.
Can SIF handle big data?
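If I read src/SIF_embedding.py correctly, compute_pc already uses sklearn's TruncatedSVD, whose randomized solver scales roughly linearly in the number of rows, so ~400,000 sentences should be tractable. If memory is still a problem, a sketch of estimating the component on a random subsample (my own workaround; the first component is a corpus-level direction, so a subsample usually suffices):

import numpy as np
from sklearn.decomposition import TruncatedSVD

def estimate_pc_subsample(emb, npc=1, n_sample=20000, seed=0):
    # Fit the principal component(s) on a random subsample of the
    # weighted-average sentence vectors instead of the full matrix.
    rng = np.random.RandomState(seed)
    n = emb.shape[0]
    idx = rng.choice(n, size=min(n_sample, n), replace=False)
    svd = TruncatedSVD(n_components=npc, n_iter=7, random_state=seed)
    svd.fit(emb[idx])
    return svd.components_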
As this code is no longer maintained, I have re-implemented SIF, heavily reusing this implementation, as I needed it for a project. I focused on generating embeddings using SIF. I thank the authors of SIF for such a wonderful paper.
I am sharing my code here.
I have run the three tasks in the demo on the full GloVe word embeddings:
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
remove the first 0 principal components
MSRpar2012 0.454328
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
remove the first 1 principal components
MSRpar2012 0.364071
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=0.001000
remove the first 0 principal components
MSRpar2012 0.436383
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=0.001000
remove the first 1 principal components
MSRpar2012 0.356372
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from idf
remove the first 0 principal components
MSRpar2012 0.531127
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from idf
remove the first 1 principal components
MSRpar2012 0.416723
not updating word vectors, removing 0 pc, layersize 2400
GloVe vectors
and for the three tasks:
similarity
Epoch 20 Cost 0.0195494797081
total time: 219.104348898
entailment
total time: 116.860952854
sentiment
Epoch 20 Cost 0.188378542662
total time: 931.339022875
How should I interpret the MSRpar values from the logs, considering the STS benchmarks in the results table here?
Hi everyone,
I cannot find the Twitter'15 dataset. Could someone send me a download link?
Best wishes!
I am trying the code below:
# set parameters
params = params.params()
params.rmpc = rmpc
# get SIF embedding
embedding = SIF_embedding.SIF_embedding(We, x, w, params)
and I get this error:
AttributeError: 'params' object has no attribute 'params'
I verified that the params module is there. Please help me figure out how to deal with this.
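A hedged guess at the cause: the assignment params = params.params() rebinds the name params from the module to the instance, so running the snippet a second time (e.g. in the same notebook or shell session) calls .params() on the instance and raises exactly this AttributeError. Keeping the module and the instance under different names avoids it:

import params as params_module  # avoid shadowing the module with the instance
import SIF_embedding

p = params_module.params()
p.rmpc = rmpc  # rmpc as defined earlier in your script
embedding = SIF_embedding.SIF_embedding(We, x, w, p)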
Hi,
I have read your work and have a question. If I have completed the embedding training on a corpus, how can I get the embedding of a new sentence outside the corpus? From Algorithm 1 in the paper, I think we can set the size of the sentence set S to 1, but I don't know whether that is right. Looking forward to your answer, thanks.
Hi there, I am trying to get the sentence embeddings for a bunch of sentences, but the error above is reported. Can anyone give me any hints on it? Thank you.
Does this project only support Python 2 now?
I obtain vectors like this:
[ -13501.50134185, 777983.48873463, 436946.65958192, ...,
-384546.65110645, 296474.44854087, 412937.7170413 ]
The scale of the values spans a large range.
How can I normalize the values? I need to normalize them for a downstream application.
Does normalization have an impact on the performance of SIF?
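In case it is useful, a common choice for downstream use (not something the authors prescribe) is row-wise L2 normalization, which leaves the cosine similarities that the SIF evaluations are based on unchanged:

import numpy as np

def l2_normalize(emb, eps=1e-12):
    # Scale every sentence vector to unit length; the cosine similarity
    # between any two rows is invariant under this rescaling.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, eps)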
Is it suitable to embed phrases with this model?
Hi,
Thanks for sharing the code! I have some questions about the handling of OOV tokens:
Thanks in advance for your time :)
Best,
Bhuwan
Is there functionality within the current scripts for saving and restoring models?
Hi,
I downloaded the STS2012 test data, but each line of the file contains only the two sentences; I didn't find a score after the two sentences, and I can't compute the Pearson correlation without the gold scores.
I just wonder what the meaning of the score is, and where I can download the STS files that include the scores?
How did you get the file enwiki_vocab_min200.txt?
Hi, thank you for sharing your code!
When running ./demo.sh I get the following error:
error [glove.840B.300d.zip]: reported length of central directory is
-76 bytes too long
Any pointers in the right direction would be appreciated.
Detailed logs are below.
Thanks,
FTK
mac-os-x$ ./demo.sh
--2017-05-20 18:28:48-- http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu... 171.64.67.140
Connecting to nlp.stanford.edu|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2017-05-20 18:28:48-- https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip]
Saving to: ‘glove.840B.300d.zip’
glove.840B.300d.zip 100%[=================================================>] 2.03G 1.82MB/s in 22m 58s
2017-05-20 18:51:46 (1.51 MB/s) - ‘glove.840B.300d.zip’ saved [2176768927/2176768927]
Archive: glove.840B.300d.zip
warning [glove.840B.300d.zip]: 76 extra bytes at beginning or within zipfile
(attempting to process anyway)
error [glove.840B.300d.zip]: reported length of central directory is
-76 bytes too long (Atari STZip zipfile? J.H.Holm ZIPSPLIT 1.1
zipfile?). Compensating...
skipping: glove.840B.300d.txt need PK compat. v4.5 (can do v2.1)
note: didn't find end-of-central-dir signature at end of central dir.
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
Using cuDNN version 6021 on context None
Mapped name None to device cuda: Tesla K80 (1489:00:00.0)
['train.py', '-wordfile', '../data/glove.840B.300d.txt', '-npc', '1', '-dim', '300', '-traindata', '../data/sentiment-train', '-devdata', '../data/sentiment-dev', '-testdata', '../data/sentiment-test', '-layersize', '300', '-nntype', 'proj_sentiment', '-epochs', '10', '-batchsize', '25', '-LW', '1e-06', '-LC', '1e-06', '-memsize', '300', '-learner', 'adam', '-eta', '0.001', '-task', 'sentiment']
Traceback (most recent call last):
File "train.py", line 237, in
model = proj_model_sentiment(We, params)
File "~iclr2017/SIF/src/proj_model_sentiment.py", line 37, in init
l_out = lasagne.layers.DenseLayer(l_average, params.layersize, nonlinearity=params.nonlinearity)
AttributeError: 'params' object has no attribute 'nonlinearity'
Could you add a license file saying what license it is possible to use this code under? If you have no strong opinion, may I suggest Apache 2.0?
Hi,
Could you please also mention the versions of the dependencies? Sometimes methods conflict across versions, and I am facing exactly that with theano. It would be really helpful.
Secondly, it would be good if you also mentioned the minimum GPU memory required to run the basic demo script.
I tried the mini demo for SIF: https://github.com/YingyuLiang/SIF_mini_demo
The sentence similarity score is -1 with one principal component removed.
PrincetonML/SIF_mini_demo#1
Does anyone have an idea about that?
Thanks
Do you have the scripts for preprocessing / converting the STS data to the same format as MSRpar2012? I have the original STS 2012 files, but not in the correct format. I asked here: jwieting/iclr2016#4 but haven't gotten a reply. If not, do you have any details on how the data was tokenized?
I tried pip install tree and pip install dm-tree; it still doesn't work.
I have installed all the dependencies and ran:
cd examples/
./demo.sh
The script downloaded the GloVe file glove.840B.300d.txt, but then it seems to hang: no files are written to the log folder and no python process shows up in top. Any hint?
When I run demo.sh in the examples directory, this error occurs:
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
remove the first 0 principal components
Traceback (most recent call last):
File "sim_sif.py", line 28, in <module>
parr, sarr = eval.sim_evaluate_all(We, words, weight4ind, sim_algo.weighted_average_sim_rmpc, params)
File "../src/eval.py", line 64, in sim_evaluate_all
p,s = sim_getCorrelation(We, words, prefix+i, weight4ind, scoring_function, params)
File "../src/eval.py", line 13, in sim_getCorrelation
f = open(f,'r')
IOError: [Errno 2] No such file or directory: '../data/MSRvid2012'
Could you please tell me:
Thanks a lot !
Hi, thanks for sharing the code.
Would you please point out which part of the code corresponds to the "PSL" in your paper? I only managed to find two weighting functions:
def getWeight(words, word2weight):
...
def getIDFWeight(wordfile, save_file=''):
The dataset involved presents, at line 52343, what appears to be ". . .", but it is not.
At this line, the example sif_embedding.py breaks, because the split() at line 15 of auxiliary_data/data_io.py splits the word and its embedding incorrectly.
After debugging that line, it turned out that the dots of ". . ." are actual dots, while the spaces are character 160 (no-break space) of the extended ASCII table.
This file is probably not encoded in ASCII but in Unicode; however, for practical reasons the test was made with ord(), so the output is an ASCII code, but the problem remains the same.
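A small demonstration of the difference (Python 3, where '\xa0' is the character 160 mentioned above):

# str.split() with no argument treats U+00A0 (no-break space) as whitespace
# and tears ". . ." apart; splitting on the plain space ' ' keeps it intact.
token = '.\xa0.\xa0.'
print(token.split())     # ['.', '.', '.']
print(token.split(' '))  # ['.\xa0.\xa0.']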