sif's People

Contributors

loretoparisi, yingyuliang


sif's Issues

About GPU and BLAS

I would like to run this against an nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 Docker image. I have:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   38C    P8    17W / 125W |      0MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

and I could eventually install a BLAS such as Intel MKL (which can speed up numpy), etc. Does the current implementation use any low-level GPU routines?

Thank you
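For context (my reading of the repo from the other issues on this page, not an authoritative answer): the example embedding code is plain numpy, so it only benefits from whatever BLAS numpy links against, while the Theano-based training scripts under src are what can target the GPU. A quick way to check which BLAS numpy is using, as a minimal sketch:

import time
import numpy as np

# Show which BLAS/LAPACK numpy was built against (look for "mkl" or "openblas").
np.show_config()

# Rough sanity check: a large matmul should take well under a second with a tuned BLAS.
a = np.random.rand(2000, 2000).astype(np.float32)
t0 = time.time()
a @ a
print('2000x2000 float32 matmul: %.2fs' % (time.time() - t0))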

MemoryError in sif_embedding.py

Hi,

I downloaded and unzipped the glove.840B.300d.zip file and then ran python sif_embedding.py.
After 30 minutes I receive this error:

Traceback (most recent call last):
  File "sif_embedding.py", line 15, in <module>
    (words, We) = data_io.getWordmap(wordfile)
  File "../src/data_io.py", line 22, in getWordmap
    return (words, np.array(We))
MemoryError

Have you ever seen this error?

NB: I killed the process after the terminal showed this error, because it seemed stuck in a deadlock/loop (I'm sure it wasn't making progress).
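For reference, glove.840B.300d.txt holds roughly 2.2 million vectors of dimension 300; materialising them as a float64 numpy array takes around 5 GB, so getWordmap can exhaust RAM on smaller machines. A minimal sketch of a lower-memory loader (a hypothetical replacement for data_io.getWordmap, assuming each line is "word v1 ... v300"; see a later issue for lines whose "word" itself contains spaces):

import numpy as np

def get_wordmap_float32(textfile, dim=300):
    """Load GloVe-style vectors into a preallocated float32 matrix.

    Two passes over the file: the first counts rows, the second fills the
    matrix in place, so peak memory stays close to the final array size
    (about 2.6 GB for glove.840B.300d instead of about 5.3 GB in float64).
    """
    with open(textfile, 'r', encoding='utf-8') as f:
        n_rows = sum(1 for _ in f)
    words = {}
    We = np.empty((n_rows, dim), dtype=np.float32)
    with open(textfile, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            parts = line.rstrip().split(' ')
            words[parts[0]] = i
            We[i] = np.asarray(parts[1:], dtype=np.float32)
    return words, We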

Simpler interface to infer SIF vectors and passing principal components as parameter

Hi !

As I was using your library to benchmark SIF embeddings against other representations, I stumbled upon a few issues, which I corrected, and ended up writing some extra code to simplify the inference of sentence embeddings, as well as to add new functionality.

In the current state of the library, the principal components are computed each time vectors are requested, which means that the same sentence can have different embeddings. To correct this, I modified the code so that the principal components can be passed as a parameter.

Here is the interface I'm suggesting:

import pickle

import data_io, params, SIF_embedding  # modules from this repo's src/ directory


class SIFCtrl():
    """Infer SIF embeddings."""

    def __init__(self, wordfile, weightfile, pc_path=None, weightpara=1e-3, rmpc=1):
        self.rmpc = rmpc
        self.words, self.We = data_io.getWordmap(wordfile)
        self.word2weight = data_io.getWordWeight(weightfile, weightpara)
        self.weight4ind = data_io.getWeight(self.words, self.word2weight)
        self.params = params.params()
        self.params.rmpc = self.rmpc
        # optionally load previously saved principal components
        if pc_path is not None:
            with open(pc_path, 'rb') as fs:
                self.pc = pickle.load(fs)
        else:
            self.pc = None

    def compute_pc(self, sentences):
        """Estimate the principal components from a reference set of sentences."""
        x, m = data_io.sentences2idx(sentences, self.words)
        w = data_io.seq2weight(x, m, self.weight4ind)
        emb = SIF_embedding.get_weighted_average(self.We, x, w)
        self.pc = SIF_embedding.compute_pc(emb, self.params.rmpc)

    def save_pc(self, filename):
        """Persist the estimated principal components for later reuse."""
        if self.pc is not None:
            with open(filename, 'wb') as fs:
                pickle.dump(self.pc, fs)

    def get_embeddings(self, sentences):
        x, m = data_io.sentences2idx(sentences, self.words)
        w = data_io.seq2weight(x, m, self.weight4ind)
        return SIF_embedding.SIF_embedding(self.We, x, w, self.params, self.pc)

How to use:

WORD_FILE_PATH = "/path/to/embs/glove.840B.300d.txt"
WEIGHT_FILE_PATH = "/path/to/freqs/enwiki_vocab_min200.txt"

sentences = [...]  # some sentences used to compute the pcs

SIF_ctrl = SIFCtrl(WORD_FILE_PATH, WEIGHT_FILE_PATH)
SIF_ctrl.compute_pc(sentences)
SIF_ctrl.save_pc('my_pc.pkl')  # save the pcs so we can load them directly later
embs1 = SIF_ctrl.get_embeddings(['once upon a time .', 'the northern wind is cold .'])
embs2 = SIF_ctrl.get_embeddings(['once upon a time .', 'and now for something completely different .'])

Now the embedding of 'once upon a time .' from the two calls above will be the same. The default behavior (computing the pcs from the input) is obtained simply by not calling compute_pc.

For SIF embeddings, the principal components are computed to obtain a representation of the syntax, which can be approximated from any general English corpus, so I would suggest shipping the library with some precomputed pcs.

If you're interested, I can bundle all of this into a PR. Let me know, and thanks for the nice work.

Other pre-trained vectors rather than GloVe

Hi, I want to know whether I can use other pre-trained word vectors rather than GloVe. It runs fine with GloVe 50d and 300d, but I want to use these embeddings with PubMed vectors. How can I do that? It shows me an error when I try to use the PubMed vectors.
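A hedged guess at the cause, plus a workaround (not confirmed by the maintainers): data_io.getWordmap expects a GloVe-style plain-text file, one "word v1 v2 ... vd" line with no header, so a binary file or a word2vec-format file (which starts with a "vocab_size dim" header line) will fail to parse. A sketch that converts word2vec-format vectors to that layout with gensim; the PubMed file names are placeholders:

from gensim.models import KeyedVectors

# Hypothetical path to word2vec-format PubMed vectors.
kv = KeyedVectors.load_word2vec_format('pubmed_vectors.bin', binary=True)

# Write GloVe-style text: one "word v1 v2 ... vd" line per word, no header.
with open('pubmed_vectors.txt', 'w', encoding='utf-8') as out:
    for word in kv.index_to_key:  # use kv.index2word with gensim < 4.0
        vec = ' '.join('%.6f' % x for x in kv[word])
        out.write('%s %s\n' % (word, vec))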

pre-trained SIF!

Are there any pre-trained SIF embeddings available to be used out of the box?

Encoding Error

I get this error when I try to run sif_embedding.py, but I think the issue is with data_io.
How are the supplied files meant to be used, besides running the demo? I'd like to use SIF to evaluate the similarity of sentences I supply. There is no training needed if I just use the GloVe embeddings, correct? What are the neural nets in src used for, then?

Thanks!

File "C:\Users\gdev\git\SIF\examples\sif_embedding.py", line 13, in <module>
    (words, We) = data_io.getWordmap(wordfile)
  File "../src\data_io.py", line 12, in getWordmap
    lines = f.readlines()
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>
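On Windows, Python's default text encoding is the locale codec (cp1252 here), while glove.840B.300d.txt is UTF-8, which is what produces this decode error. A minimal sketch of the usual fix (my assumption about the relevant spot in src/data_io.py, which currently opens the vector file without an explicit encoding):

# in getWordmap (src/data_io.py): open the vector file with an explicit encoding
with open(textfile, 'r', encoding='utf-8') as f:
    lines = f.readlines()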

Hello, you seem to be a Chinese speaker, so I'll ask my question in Chinese.

In Algorithm 1, u is the (first) singular vector of the matrix obtained by taking the v_s as column vectors, but later you also write:
"In other words, the final sentence embedding is obtained by subtracting the projection of the c̃_s's to their first principal component. This is summarized in Algorithm 1."
If I look only at Algorithm 1, the vector c̃_s does not seem to be used anywhere; or is u in fact obtained from the c̃_s?
Is the train.sh file only for the supervised tasks?
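For reference, here is how I read that step (a restatement under my own interpretation, not an authoritative answer): u is the first singular vector (first principal component) of the matrix whose columns are the weighted-average sentence vectors, and each sentence vector then has its projection onto u subtracted:

u \leftarrow \text{first singular vector of } X = [\, v_{s_1}\; v_{s_2}\; \cdots\; v_{s_n} \,], \qquad
v_s \leftarrow v_s - u u^{\top} v_s \quad \text{for every } s \in S.

So, as I read it, u is computed from the sentence vectors themselves; the c̃_s in the quoted passage are those same weighted averages before the projection is removed.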

Is a check for numeric type needed in getWordmap method?

I am getting the following error:
File "sif_embedding.py", line 13, in
(words, We) = data_io.getWordmap(wordfile)
File "../src/data_io.py", line 18, in getWordmap
v.append(float(i[j]))
ValueError: could not convert string to float: '.'

Is this a bug, or do we always expect numerical data at the line v.append(float(i[j]))?
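For context, a few entries in glove.840B.300d.txt contain spaces inside the "word" field (for example the ". . ." token discussed in a later issue), so splitting on whitespace and converting every field after the first to float eventually hits a non-numeric token. A hedged sketch of a more tolerant parse (a hypothetical helper, assuming the vector dimension is known):

import numpy as np

def parse_glove_line(line, dim=300):
    """Take the last `dim` fields as the vector; whatever precedes them is the word."""
    parts = line.rstrip().split(' ')
    word = ' '.join(parts[:-dim])
    vec = np.asarray(parts[-dim:], dtype=np.float32)
    return word, vec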

data preprocessing

I'm just wondering what kind of preprocessing I need to apply to the sentences before computing SIF embeddings. For example (see the sketch after this list):

  1. do I need to remove punctuation? In the example, the sentences don't have punctuation.
  2. should I tokenize negations?
  3. what other preprocessing needs to be done?

Thanks a lot!!
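A minimal sketch of what I would try (my own assumption, not the authors' pipeline): the main requirement is that tokens match the vocabulary of the word-vector file and the word-frequency file, so splitting punctuation into its own tokens, as in the demo sentences ('once upon a time .'), is the essential step. glove.840B.300d is cased, so lowercasing is optional:

import re

def simple_preprocess(sentence, lowercase=False):
    """Hypothetical minimal preprocessing: separate punctuation into its own
    tokens and normalise whitespace. Negations like "wasn't" are not split."""
    if lowercase:
        sentence = sentence.lower()
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return ' '.join(tokens)

print(simple_preprocess("Once upon a time, a dragon appeared."))
# Once upon a time , a dragon appeared .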

A potential information reveal problem

I am following your framework and extending your work by adding "attention". However, when reviewing your code, I am confused about whether you calculated the first principal component on the training data or not. If you compute the first principal component on the current (test) dataset, it seems that you leak some information into the sentence embeddings of the test set. In that case, your results may not be acceptable.

number of sentences

In the example, MSRpar2012 has only 750 lines of sentences.
The approach works fine with a small volume of data.
But for big data, for example when the number of sentences is near 400,000, the PCA computation could become a big problem.
Can SIF handle big data?
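For a rough sense of scale (a sketch under my own assumptions, not a description of this repo's exact code path): only the first one or two principal components are needed, and a truncated/randomized SVD on the n × d matrix of weighted-average sentence vectors is roughly linear in the number of sentences. 400,000 × 300 float32 vectors occupy about 0.5 GB, so this should be tractable on an ordinary machine:

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical stack of 400k weighted-average sentence vectors, dimension 300.
emb = np.random.rand(400000, 300).astype(np.float32)

svd = TruncatedSVD(n_components=1, n_iter=7, random_state=0)
svd.fit(emb)          # randomized SVD, roughly linear in the number of rows
pc = svd.components_  # shape (1, 300): the first principal component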

Simple re-implementation of SIF

As this code is not maintained anymore, I have re-implemented SIF, heavily reusing this implementation, as I needed it for a project. I focused on generating embeddings with SIF. I thank the authors of SIF for such a wonderful paper.

I am sharing my code here.

How to interpret results on the MSRpar

I have run the three tasks in the demo with the full GloVe word embeddings:

word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
remove the first 0 principal components
                    MSRpar2012   0.454328

word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
remove the first 1 principal components
                    MSRpar2012   0.364071

word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=0.001000
remove the first 0 principal components
                    MSRpar2012   0.436383

word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=0.001000
remove the first 1 principal components
                    MSRpar2012   0.356372

word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from idf
remove the first 0 principal components
                    MSRpar2012   0.531127

word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from idf
remove the first 1 principal components
                    MSRpar2012   0.416723

not updating word vectors, removing 0 pc, layersize 2400
GloVe vectors

and for the three tasks:

similarity
Epoch  20 Cost  0.0195494797081
total time: 219.104348898


entailment
total time: 116.860952854

sentiment
Epoch  20 Cost  0.188378542662
total time: 931.339022875

How should I interpret the MSRpar values from the logs, considering the STS benchmarks in the results table here?

Dataset problem

Hi everyone,
I cannot find the Twitter'15 datasets. Can someone send me a download link?
Best wishes!

AttributeError: 'params' object has no attribute 'params'

I am trying the code below

# set parameters
params = params.params()
params.rmpc = rmpc
# get SIF embedding
embedding = SIF_embedding.SIF_embedding(We, x, w, params) 

and get this error

AttributeError: 'params' object has no attribute 'params'

and I verified the params module and it does exist. Please help me figure out how to deal with this.
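A hedged guess at the cause (not confirmed by the maintainers): in the snippet above, params = params.params() rebinds the name params from the module to an instance, so any later call to params.params() in the same session (for example, when re-running the cell or the script body twice) raises exactly this AttributeError. A sketch that avoids the shadowing by renaming the local variable:

import params as params_module  # the repo's src/params.py (with src/ on sys.path)

# set parameters (use a name that does not shadow the module)
sif_params = params_module.params()
sif_params.rmpc = rmpc
# get SIF embedding
embedding = SIF_embedding.SIF_embedding(We, x, w, sif_params)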

How to get embedding for a new sentence(Outside the sample)?

hi,
I have read your work and I have a question. If I have finished computing embeddings on a corpus, how do I get the embedding of a new sentence outside that corpus? From Algorithm 1 in the paper, I think we could make the size of the sentence set S equal to 1, but I don't know whether that is right. Looking forward to your answer, thanks.
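One common way to handle this (my reading, not an official answer from the authors; see also the SIFCtrl suggestion in an earlier issue above): do not re-estimate the principal component from a single sentence. Keep the component estimated on the original corpus and, for the new sentence, compute only the weighted average and subtract its projection onto that saved component. A sketch using the helpers that appear elsewhere in these issues (words, We, weight4ind loaded as in examples/sif_embedding.py; pc is the saved component, shape (npc, dim)):

x, m = data_io.sentences2idx(['a brand new sentence .'], words)
w = data_io.seq2weight(x, m, weight4ind)
emb = SIF_embedding.get_weighted_average(We, x, w)  # shape (1, dim)
emb = emb - emb.dot(pc.transpose()).dot(pc)         # remove the projection onto the saved pc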

How to normalize SIF vectors?

I obtain vectors like this:
[ -13501.50134185, 777983.48873463, 436946.65958192, ...,
-384546.65110645, 296474.44854087, 412937.7170413 ]
The scale of the values spans a large range.
How should I normalize the values? I need to normalize them for a downstream application.
Does normalization have an impact on the performance of SIF?
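A minimal sketch of the usual choice, L2 (unit-length) normalization (my own note, not a recommendation from the authors): cosine similarity between vectors is invariant to per-vector rescaling, so downstream tasks that compare embeddings by cosine are unaffected; tasks that feed the raw values into a model may behave differently and would need to be checked empirically.

import numpy as np

def l2_normalize(emb, eps=1e-12):
    # Scale each row (sentence vector) to unit L2 norm.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, eps)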

OOV tokens

Hi,

Thanks for sharing the code! I have some questions about the handling of OOV tokens:

  1. How is the sentence embedding computed if any of its words does not have a pretrained vector in GloVe?
  2. Say we encounter a word at test time which was not present in the corpus used for estimating word frequencies. How is the weight for this word computed?
  3. Do you remove infrequent words when computing the word frequencies from the corpus? I ask because the vocab file name (enwiki_vocab_min200.txt) seems to suggest so. If yes, how are the removed tokens weighted?

Thanks in advance for your time :)

Best,
Bhuwan

something about STS dataset: score in the file

Hi,

I downloaded the STS2012 test data, but each line of the file only contains the two sentences; I didn't find a score after the two sentences, so I can't compute the Pearson correlation without the gold scores.

I just wonder what the score means, and where I can download an STS file that includes the scores.

error [glove.840B.300d.zip]: reported length of central directory is -76 bytes too long

Hi, thank you for sharing your code!
When running ./demo.sh I get the following error:

error [glove.840B.300d.zip]:  reported length of central directory is
  -76 bytes too long

Any steps in the right direction would be appreciated.
Detailed logs are below.
Thanks,
FTK


mac-os-x$ ./demo.sh
--2017-05-20 18:28:48--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu... 171.64.67.140
Connecting to nlp.stanford.edu|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2017-05-20 18:28:48--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip]
Saving to: ‘glove.840B.300d.zip’

glove.840B.300d.zip           100%[=================================================>]   2.03G  1.82MB/s    in 22m 58s 

2017-05-20 18:51:46 (1.51 MB/s) - ‘glove.840B.300d.zip’ saved [2176768927/2176768927]

Archive:  glove.840B.300d.zip
warning [glove.840B.300d.zip]:  76 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [glove.840B.300d.zip]:  reported length of central directory is
  -76 bytes too long (Atari STZip zipfile?  J.H.Holm ZIPSPLIT 1.1
  zipfile?).  Compensating...
   skipping: glove.840B.300d.txt     need PK compat. v4.5 (can do v2.1)

note:  didn't find end-of-central-dir signature at end of central dir.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
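For what it's worth (a workaround I would try, not an official fix): the "need PK compat. v4.5 (can do v2.1)" line indicates the archive uses Zip64, which the stock macOS unzip (Info-ZIP 5.x) cannot read, even though the download itself completed (2176768927 bytes). Python's zipfile module does support Zip64, so the file can be extracted without unzip:

import zipfile

# zipfile handles Zip64 archives, unlike the stock macOS unzip.
with zipfile.ZipFile('glove.840B.300d.zip') as zf:
    zf.extract('glove.840B.300d.txt', path='../data')  # ../data is where the other scripts look for it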

AttributeError: 'params' object has no attribute 'nonlinearity'

Using cuDNN version 6021 on context None
Mapped name None to device cuda: Tesla K80 (1489:00:00.0)
['train.py', '-wordfile', '../data/glove.840B.300d.txt', '-npc', '1', '-dim', '300', '-traindata', '../data/sentiment-train', '-devdata', '../data/sentiment-dev', '-testdata', '../data/sentiment-test', '-layersize', '300', '-nntype', 'proj_sentiment', '-epochs', '10', '-batchsize', '25', '-LW', '1e-06', '-LC', '1e-06', '-memsize', '300', '-learner', 'adam', '-eta', '0.001', '-task', 'sentiment']
Traceback (most recent call last):
  File "train.py", line 237, in <module>
    model = proj_model_sentiment(We, params)
  File "~iclr2017/SIF/src/proj_model_sentiment.py", line 37, in __init__
    l_out = lasagne.layers.DenseLayer(l_average, params.layersize, nonlinearity=params.nonlinearity)
AttributeError: 'params' object has no attribute 'nonlinearity'

Paper and Code disparity. Columns or rows for SVD?

In the paper's algorithm, we can see on line 4 that each v_s should be a column vector of the matrix X.
[screenshot of Algorithm 1]
In the code, we can see on line 22 that each data point is a row vector of the matrix X.
[screenshot of the code]
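For what it's worth, the two conventions give the same direction (a quick numerical check, not a statement about the paper's intent): the first right singular vector of the rows-as-sentences matrix X equals, up to sign, the first left singular vector of its transpose, whose columns are the sentences.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))  # 10 sentence vectors stored as rows

# first right singular vector of X (rows = sentences, the code's convention)
u_rows = np.linalg.svd(X, full_matrices=False)[2][0]
# first left singular vector of X.T (columns = sentences, the paper's convention)
u_cols = np.linalg.svd(X.T, full_matrices=False)[0][:, 0]

assert np.allclose(np.abs(u_rows), np.abs(u_cols))  # same direction up to sign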

License?

Could you add a license file saying what license it is possible to use this code under? If you have no strong opinion, may I suggest Apache 2.0?

Mention proper versions of dependencies in the requirements.txt file

Hi,
Can you please also mention the versions of the dependencies? Sometimes methods conflict between versions, and I'm facing exactly that with Theano. It would be really helpful.

Secondly, it would be good if you also mentioned the minimum GPU memory required to run the basic demo script.

STS preprocessing script

Do you have the scripts for preprocessing/converting the STS data to the same format as MSRpar2012? I have the original STS 2012 files, but not in the correct format. I asked at jwieting/iclr2016#4 but haven't gotten a reply. If not, do you have any details on how the data was tokenized?

Example demo hanging

I have installed all dependencies and run:

cd examples/
./demo.sh

The script has downloaded glove.840B.300d.txt, but it seems to hang. No files are written in the log folder and no python process shows up in top. Any hint?

No such file Error: ../data/MSRvid2012

When I run demo.sh in the examples directory, this error occurs:

word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
remove the first 0 principal components
Traceback (most recent call last):
  File "sim_sif.py", line 28, in <module>
    parr, sarr = eval.sim_evaluate_all(We, words, weight4ind, sim_algo.weighted_average_sim_rmpc, params)
  File "../src/eval.py", line 64, in sim_evaluate_all
    p,s = sim_getCorrelation(We, words, prefix+i, weight4ind, scoring_function, params)
  File "../src/eval.py", line 13, in sim_getCorrelation
    f = open(f,'r')
IOError: [Errno 2] No such file or directory: '../data/MSRvid2012'

Could you please tell me:

  1. Is this error detrimental to the model training?
  2. Where can I download the missing data file?

Thanks a lot !

PSL?

Hi, thanks for sharing the code.

Could you please point out which part of the code corresponds to the "PSL" weighting in your paper? I only managed to find two weighting functions:

def getWeight(words, word2weight):
...
def getIDFWeight(wordfile, save_file=''):

Dataset glove.840B.300d.txt character issue

The dataset in question, at line 52343, contains what appears to be ". . .", but it is not.
On this line, the example sif_embedding.py breaks because the split() at line 15 of auxiliary_data/data_io.py splits the word and its embedding incorrectly.
After debugging that line, it turned out that the dots in ". . ." are ordinary dots, while the spaces are character code 160 (a non-breaking space).
This file is probably not encoded in ASCII but in Unicode; for practical reasons the check was done with ord(), so the output is reported as a character code, but the problem is the same.
