princetonml / sif
Sentence embedding by the Smooth Inverse Frequency (SIF) weighting scheme
License: MIT License
This is one of the worst codebases I have seen from the original authors. Code correctness, runtime efficiency, variable naming, code structure: all of it leaves much to be desired. It is remarkable that a paper whose text is so clear and concise was implemented this badly.
I would like to run this against an nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 Docker image. I have:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 38C P8 17W / 125W | 0MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
and I could eventually install a BLAS such as Intel MKL (which could speed up numpy), etc. Does the current implementation use any low-level GPU routines?
Thank you
Hi,
I downloaded and unzipped the glove.840B.300d.zip file, and then I ran python sif_embedding.py.
After 30 minutes I receive this error:
Traceback (most recent call last):
File "sif_embedding.py", line 15, in <module>
(words, We) = data_io.getWordmap(wordfile)
File "../src/data_io.py", line 22, in getWordmap
return (words, np.array(We))
MemoryError
Have you ever seen this error?
NB: I killed the process after the terminal showed this error, because the process goes into a deadlock/loop (I'm sure it doesn't make progress).
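In case it helps others hitting this: a sketch of a lower-memory loader (my own workaround, not part of the repository). getWordmap accumulates Python lists of floats and only converts with np.array at the end; parsing each line straight into a float32 row cuts peak memory substantially. The helper name and the one-token-per-line assumption are mine.

import numpy as np

def getWordmap_lowmem(textfile, dim=300):
    # Stream the GloVe file, converting each line directly into a float32
    # row instead of accumulating Python floats and calling np.array at
    # the end. Assumes each line is one token followed by `dim` floats.
    words, rows = {}, []
    with open(textfile, 'r') as f:
        for n, line in enumerate(f):
            parts = line.rstrip().split(' ')
            words[parts[0]] = n
            rows.append(np.asarray(parts[-dim:], dtype=np.float32))
    return words, np.vstack(rows)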
Is p(w) the word frequency within a single document, or over the whole corpus?
Hi !
As I was using your library to benchmark SIF embeddings against other representations, I stumbled upon a few issues, which I corrected, and I ended up writing some extra code to simplify the inference of sentence embeddings, as well as adding new functionality.
In the library's current state, the principal components are recomputed every time vectors are requested, which means that the same sentence can get different embeddings across calls. To correct this, I modified the code to pass the principal components as parameters.
Here is the interface I'm suggesting:
import pickle

import data_io
import params
import SIF_embedding

class SIFCtrl(object):
    """Infer SIF embeddings, optionally reusing precomputed principal components."""

    def __init__(self, wordfile, weightfile, pc_path=None, weightpara=1e-3, rmpc=1):
        self.rmpc = rmpc
        self.words, self.We = data_io.getWordmap(wordfile)
        self.word2weight = data_io.getWordWeight(weightfile, weightpara)
        self.weight4ind = data_io.getWeight(self.words, self.word2weight)
        self.params = params.params()
        self.params.rmpc = self.rmpc
        if pc_path is not None:
            # load previously saved principal components
            with open(pc_path, 'rb') as fs:
                self.pc = pickle.load(fs)
        else:
            self.pc = None

    def compute_pc(self, sentences):
        # estimate the principal component(s) once, from a reference corpus
        x, m = data_io.sentences2idx(sentences, self.words)
        w = data_io.seq2weight(x, m, self.weight4ind)
        emb = SIF_embedding.get_weighted_average(self.We, x, w)
        self.pc = SIF_embedding.compute_pc(emb, self.params.rmpc)

    def save_pc(self, filename):
        if self.pc is not None:
            with open(filename, 'wb') as fs:
                pickle.dump(self.pc, fs)

    def get_embeddings(self, sentences):
        x, m = data_io.sentences2idx(sentences, self.words)
        w = data_io.seq2weight(x, m, self.weight4ind)
        return SIF_embedding.SIF_embedding(self.We, x, w, self.params, self.pc)
How to use:

WORD_FILE_PATH = "/path/to/embs/glove.840B.300d.txt"
WEIGHT_FILE_PATH = "/path/to/freqs/enwiki_vocab_min200.txt"

sentences = [...]  # some sentences used to compute the pcs

SIF_ctrl = SIFCtrl(WORD_FILE_PATH, WEIGHT_FILE_PATH)
SIF_ctrl.compute_pc(sentences)
SIF_ctrl.save_pc('my_pc.pkl')  # save the pcs so we can load them directly later

embs1 = SIF_ctrl.get_embeddings(['once upon a time .', 'the northern wind is cold .'])
embs2 = SIF_ctrl.get_embeddings(['once upon a time .', 'and now for something completely different .'])
Now the embedding of 'once upon a time .' from the two calls above will be the same. The default behavior (computing the pcs from the input) is obtained simply by not calling compute_pc.
For SIF embeddings, the principal components are computed to capture a representation of the syntax, which can be approximated from any general English corpus, so I would suggest shipping the library with some precomputed pcs.
If you're interested, I can bundle all of this into a PR. Let me know, and thanks for the nice work.
Hi, I want to know whether I can use other pre-trained word vectors instead of GloVe. The code runs fine with GloVe 50d and 300d, but I want to use it with PubMed vectors, and it shows me an error when I try. How can I do that?
Are there any pre-trained SIF models available to be used out of the box?
I get this error when I try to run sif_embedding, but I think the issue is with data_io.
How are the supplied files meant to be used, besides running the demo? I'd like to use SIF to evaluate the similarity of sentences I supply. There is no training needed if I just use the GloVe embeddings, correct? What are the neural nets in src used for, then?
Thanks!
File "C:\Users\gdev\git\SIF\examples\sif_embedding.py", line 13, in <module>
(words, We) = data_io.getWordmap(wordfile)
File "../src\data_io.py", line 12, in getWordmap
lines = f.readlines()
File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>
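For anyone else hitting this on Windows: Python 3 falls back to the platform default codec (cp1252 here) when no encoding is given, while the GloVe file is UTF-8. A likely fix, though not an official patch, is to pass the encoding explicitly where data_io.getWordmap opens the file (parameter name assumed):

# In data_io.getWordmap, open the embedding file with an explicit encoding
# instead of the platform default (cp1252 on Windows):
with open(textfile, 'r', encoding='utf-8') as f:
    lines = f.readlines()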
In Algorithm 1, u is the singular vector of the matrix whose columns are the v_s, yet later you mention:
"In other words, the final sentence embedding is obtained by subtracting the projection of the c̃_s's to their first principal component. This is summarized in Algorithm 1."
If one looks only at Algorithm 1, the vector c̃_s does not appear to be used at all; or is u in fact obtained from the c̃_s's?
The train.sh script is only for the supervised tasks, right?
I am getting the following error:
File "sif_embedding.py", line 13, in
(words, We) = data_io.getWordmap(wordfile)
File "../src/data_io.py", line 18, in getWordmap
v.append(float(i[j]))
ValueError: could not convert string to float: '.'
Is this a bug, or do we always expect numerical data at the line v.append(float(i[j]))?
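As far as I can tell, the fields are expected to be numerical; the failure usually comes from glove.840B.300d.txt containing keys that themselves contain spaces (e.g. ". . ."), so reading fields left to right puts a '.' where a float should be. A sketch of a right-anchored parse that tolerates such keys (my workaround, not the repository's code):

import numpy as np

DIM = 300  # dimensionality of glove.840B.300d

def parse_glove_line(line):
    # Take the last DIM fields as the vector and rejoin whatever precedes
    # them into the key, so multi-token keys like ". . ." stay intact.
    parts = line.rstrip().split(' ')
    word = ' '.join(parts[:-DIM])
    vec = np.asarray(parts[-DIM:], dtype=np.float32)
    return word, vec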
In your code, XX = X - X.dot(pc.transpose()) * pc represents subtracting u·uᵀ·v_s, but how can u·uᵀ·v_s represent the projection of v_s onto u? The projection should be v_s·u/|u|, or |v_s|·cos(θ).
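For reference, a short derivation (standard linear algebra, assuming the component returned by the SVD is unit-norm, which is the usual convention): for $\|u\| = 1$,

$$u u^{\top} v_s = (u^{\top} v_s)\,u = \frac{v_s \cdot u}{\|u\|^{2}}\,u,$$

which is exactly the vector projection of $v_s$ onto $u$; the quantity $v_s \cdot u / \|u\|$ is only its scalar length.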
I'm just wondering what kind of preprocessing of the sentences I need to do for SIF embedding. For example,
I am following your framework and extending your work by adding "attention". However, when reviewing your code, I am confused about whether you computed the first principal component on the training data or not. If you compute the first principal component on the current dataset (the test data), it seems that you leak information into the sentence embeddings of the test set; in that case, your results may not be acceptable.
In the example, MSRpar2012 has only 750 lines of sentences.
The method works fine on a small volume of data. But for big data, for example when the number of sentences is near 400,000, the PCA computation could be a big problem.
Can SIF handle big data?
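If I read src/SIF_embedding.py correctly, compute_pc already uses sklearn's TruncatedSVD, whose randomized solver scales roughly linearly in the number of rows, so ~400,000 sentences should be tractable. If memory is still a problem, a sketch of estimating the component on a random subsample (my own workaround; the first component is a corpus-level direction, so a subsample usually suffices):

import numpy as np
from sklearn.decomposition import TruncatedSVD

def estimate_pc_subsample(emb, npc=1, n_sample=20000, seed=0):
    # Fit the principal component(s) on a random subsample of the
    # weighted-average sentence vectors instead of the full matrix.
    rng = np.random.RandomState(seed)
    n = emb.shape[0]
    idx = rng.choice(n, size=min(n_sample, n), replace=False)
    svd = TruncatedSVD(n_components=npc, n_iter=7, random_state=seed)
    svd.fit(emb[idx])
    return svd.components_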
As this code is no longer maintained, I have re-implemented SIF, heavily reusing this implementation, as I needed it for a project. I focused on generating embeddings using SIF. I thank the authors of SIF for such a wonderful paper.
I am sharing my code here.
I have run the three tasks in the demo on the full GloVe word embeddings:
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
remove the first 0 principal components
MSRpar2012 0.454328
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
remove the first 1 principal components
MSRpar2012 0.364071
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=0.001000
remove the first 0 principal components
MSRpar2012 0.436383
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=0.001000
remove the first 1 principal components
MSRpar2012 0.356372
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from idf
remove the first 0 principal components
MSRpar2012 0.531127
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from idf
remove the first 1 principal components
MSRpar2012 0.416723
not updating word vectors, removing 0 pc, layersize 2400
GloVe vectors
and for the three tasks:
similarity
Epoch 20 Cost 0.0195494797081
total time: 219.104348898
entailment
total time: 116.860952854
sentiment
Epoch 20 Cost 0.188378542662
total time: 931.339022875
How should I interpret the MSRpar values from the logs, considering the STS benchmarks in the results table here?
Hi everyone,
I cannot find the Twitter'15 dataset. Could someone send me a download link?
Best wishes!
I am trying the code below:
# set parameters
params = params.params()
params.rmpc = rmpc
# get SIF embedding
embedding = SIF_embedding.SIF_embedding(We, x, w, params)
and I get this error:
AttributeError: 'params' object has no attribute 'params'
I verified that the params module is there. Please help me figure out how to deal with this.
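A hedged guess at the cause: the assignment params = params.params() rebinds the name params from the module to the instance, so running the snippet a second time (e.g. in the same notebook or shell session) calls .params() on the instance and raises exactly this AttributeError. Keeping the module and the instance under different names avoids it:

import params as params_module  # avoid shadowing the module with the instance
import SIF_embedding

p = params_module.params()
p.rmpc = rmpc  # rmpc as defined earlier in your script
embedding = SIF_embedding.SIF_embedding(We, x, w, p)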
Hi,
I have read your work and have a question. If I have completed the embedding training on a corpus, how can I get the embedding of a new sentence outside the corpus? From Algorithm 1 in the paper, I think we can set the size of the sentence set S to 1, but I don't know whether that is right. Looking forward to your answer, thanks.
Hi there, I am trying to get the sentence embeddings for a bunch of sentences, but the error above is reported. Can anyone give me any hints on it? Thank you.
Does this project only support Python 2 now?
I obtain vectors like this:
[ -13501.50134185, 777983.48873463, 436946.65958192, ...,
-384546.65110645, 296474.44854087, 412937.7170413 ]
The scale of the values spans a large range.
How can I normalize the values? I need to normalize them for a downstream application.
Does normalization have an impact on the performance of SIF?
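In case it is useful, a common choice for downstream use (not something the authors prescribe) is row-wise L2 normalization, which leaves the cosine similarities that the SIF evaluations are based on unchanged:

import numpy as np

def l2_normalize(emb, eps=1e-12):
    # Scale every sentence vector to unit length; the cosine similarity
    # between any two rows is invariant under this rescaling.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, eps)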
Is it suitable to embed phrases with this model?
Hi,
Thanks for sharing the code! I have some questions about the handling of OOV tokens:
Thanks in advance for your time :)
Best,
Bhuwan
Is there functionality within the current scripts for saving and restoring models?
Hi,
I downloaded the STS2012 test data, but each line of the file contains only the two sentences; I didn't find a score after the two sentences, and I can't compute the Pearson correlation without the gold scores.
I just wonder what the meaning of the score is, and where I can download the STS files that include the scores?
How did you get the file enwiki_vocab_min200.txt?
Hi, thank you for sharing your code!
When running ./demo.sh I get the following error:
error [glove.840B.300d.zip]: reported length of central directory is
-76 bytes too long
Any pointers in the right direction would be appreciated.
Detailed logs are below.
Thanks,
FTK
mac-os-x$ ./demo.sh
--2017-05-20 18:28:48-- http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu... 171.64.67.140
Connecting to nlp.stanford.edu|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2017-05-20 18:28:48-- https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip]
Saving to: ‘glove.840B.300d.zip’
glove.840B.300d.zip 100%[=================================================>] 2.03G 1.82MB/s in 22m 58s
2017-05-20 18:51:46 (1.51 MB/s) - ‘glove.840B.300d.zip’ saved [2176768927/2176768927]
Archive: glove.840B.300d.zip
warning [glove.840B.300d.zip]: 76 extra bytes at beginning or within zipfile
(attempting to process anyway)
error [glove.840B.300d.zip]: reported length of central directory is
-76 bytes too long (Atari STZip zipfile? J.H.Holm ZIPSPLIT 1.1
zipfile?). Compensating...
skipping: glove.840B.300d.txt need PK compat. v4.5 (can do v2.1)
note: didn't find end-of-central-dir signature at end of central dir.
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
Using cuDNN version 6021 on context None
Mapped name None to device cuda: Tesla K80 (1489:00:00.0)
['train.py', '-wordfile', '../data/glove.840B.300d.txt', '-npc', '1', '-dim', '300', '-traindata', '../data/sentiment-train', '-devdata', '../data/sentiment-dev', '-testdata', '../data/sentiment-test', '-layersize', '300', '-nntype', 'proj_sentiment', '-epochs', '10', '-batchsize', '25', '-LW', '1e-06', '-LC', '1e-06', '-memsize', '300', '-learner', 'adam', '-eta', '0.001', '-task', 'sentiment']
Traceback (most recent call last):
File "train.py", line 237, in
model = proj_model_sentiment(We, params)
File "~iclr2017/SIF/src/proj_model_sentiment.py", line 37, in init
l_out = lasagne.layers.DenseLayer(l_average, params.layersize, nonlinearity=params.nonlinearity)
AttributeError: 'params' object has no attribute 'nonlinearity'
Could you add a license file saying what license it is possible to use this code under? If you have no strong opinion, may I suggest Apache 2.0?
Hi,
Could you please also mention the versions of the dependencies? Sometimes methods conflict across versions, and I am facing exactly that with theano. It would be really helpful.
Secondly, it would be good if you also mentioned the minimum GPU memory required to run the basic demo script.
I tried the mini demo for SIF: https://github.com/YingyuLiang/SIF_mini_demo
The sentence similarity score is -1 with one principal component removed.
PrincetonML/SIF_mini_demo#1
Does anyone have an idea about that?
Thanks
Do you have the scripts for preprocessing / converting the STS data to the same format as MSRpar2012? I have the original STS 2012 files, but not in the correct format. I asked here: jwieting/iclr2016#4 but haven't gotten a reply. If not, do you have any details on how the data was tokenized?
I tried pip install tree and pip install dm-tree; it still doesn't work.
I have installed all the dependencies and ran:
cd examples/
./demo.sh
The script downloaded the GloVe file glove.840B.300d.txt, but then it seems to hang: no files are written to the log folder and no python process shows up in top. Any hint?
When I run demo.sh in the examples directory, this error occurs:
word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
remove the first 0 principal components
Traceback (most recent call last):
File "sim_sif.py", line 28, in <module>
parr, sarr = eval.sim_evaluate_all(We, words, weight4ind, sim_algo.weighted_average_sim_rmpc, params)
File "../src/eval.py", line 64, in sim_evaluate_all
p,s = sim_getCorrelation(We, words, prefix+i, weight4ind, scoring_function, params)
File "../src/eval.py", line 13, in sim_getCorrelation
f = open(f,'r')
IOError: [Errno 2] No such file or directory: '../data/MSRvid2012'
Could you please tell me:
Thanks a lot !
Hi, thanks for sharing the code.
Would you please point out which part of the code corresponds to the "PSL" in your paper? I only managed to find two weighting functions:
def getWeight(words, word2weight):
...
def getIDFWeight(wordfile, save_file=''):
The dataset involved presents, at line 52343, what appears to be ". . .", but it is not.
At this line, the example sif_embedding.py breaks, because the split() at line 15 of auxiliary_data/data_io.py splits the word and its embedding incorrectly.
After debugging that line, it turned out that the dots of ". . ." are actual dots, while the spaces are character 160 (no-break space) of the extended ASCII table.
This file is probably not encoded in ASCII but in Unicode; however, for practical reasons the test was made with ord(), so the output is an ASCII code, but the problem remains the same.
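A small demonstration of the difference (Python 3, where '\xa0' is the character 160 mentioned above):

# str.split() with no argument treats U+00A0 (no-break space) as whitespace
# and tears ". . ." apart; splitting on the plain space ' ' keeps it intact.
token = '.\xa0.\xa0.'
print(token.split())     # ['.', '.', '.']
print(token.split(' '))  # ['.\xa0.\xa0.']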