eeap-examples's People

Contributors

sujitpal


eeap-examples's Issues

Some suggestions on attention and document similarity

Hey,

First, thanks for the kind words in various places :). I came across your posts, which led me here.

I also spent quite some time working on similarity models. I think they're surprisingly difficult to implement correctly in most deep learning toolkits. There are two problems:

  1. It's pretty hard to maintain the symmetry. We'd like to guarantee that the Siamese network always maps a pair of identical sentences to identical vectors. Intuitively, identical inputs should give a similarity of 1.0, right? But lots of things can go wrong and prevent this. Here are some of the problems I've had in different implementations:

i. Dropout needs to be synchronised across the two 'halves' of the network. If we draw different dropout masks for the two sentences, we'll end up with different vectors for the same input. This makes the model converge very slowly.

ii. Batch normalization. I don't remember exact results, but I do remember I ended up not wanting to use batch norm in Siamese networks, because I found it too difficult to reason about.

iii. In one of my models, I assigned random vectors to OOV words, without ensuring the same word always mapped to the same vector.
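Problem (i) above can be illustrated with a small numpy sketch: draw one dropout mask and apply it to both halves of the pair, so identical inputs are guaranteed to produce identical outputs during training. (This is an illustrative sketch, not code from the repository; the name `siamese_dropout` is made up.)

```python
import numpy as np

def siamese_dropout(x1, x2, rate=0.5, rng=None):
    """Apply the *same* inverted-dropout mask to both halves of a
    Siamese pair. Reusing one mask keeps the network symmetric:
    identical inputs still map to identical vectors."""
    rng = np.random.default_rng(rng)
    # One mask, scaled so the expected activation is unchanged.
    mask = (rng.random(x1.shape) >= rate) / (1.0 - rate)
    return x1 * mask, x2 * mask

x = np.ones((2, 4))
a, b = siamese_dropout(x, x.copy(), rate=0.5, rng=0)
assert np.array_equal(a, b)  # identical inputs stay identical
```

If the two halves instead drew independent masks, the assertion would fail for most draws, which is exactly the slow-convergence problem described above.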

  2. Most libraries make you pad your inputs with zeros. Most attention layers then do some sort of pooling operation. If you're averaging, you need to normalize by the number of real input tokens and exclude the padding tokens. We also want to make sure we're not sending gradients through the padding tokens. I could never get this correct in Keras. You can find my effort to replicate Parikh et al.'s decomposable attention model here: https://github.com/explosion/spaCy/tree/master/examples/keras_parikh_entailment . Other people have worked on the code since, but as far as I know it's still not correct.
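The masked averaging described in point 2 can be sketched in numpy as follows (`masked_mean` is a hypothetical helper, not code from either repository; in a real framework you would build the mask into the graph so no gradient flows through the padding positions):

```python
import numpy as np

def masked_mean(embeddings, mask):
    """Average token vectors, ignoring zero-padded positions.

    embeddings: (batch, seq_len, dim)
    mask:       (batch, seq_len), 1 for real tokens, 0 for padding.

    Normalising by the count of real tokens, not seq_len, keeps the
    padding out of the pooled value.
    """
    mask = mask[:, :, None].astype(embeddings.dtype)
    summed = (embeddings * mask).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1.0)  # avoid divide-by-zero
    return summed / counts

# Padding rows do not affect the result, whatever garbage they contain.
emb = np.array([[[1.0, 1.0], [3.0, 3.0], [99.0, 99.0]]])
mask = np.array([[1, 1, 0]])
print(masked_mean(emb, mask))  # [[2. 2.]], not pulled toward 99
```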

Finally, an extra tip that should help your similarity models :). I notice you're using pre-trained embeddings, and are using a fixed-size vocabulary. This means that all words outside your vocabulary will be mapped to the same representation. If you think about it, this is pretty bad: if our input sentences match on some rare word, that's a great feature! I think the best solution is to augment the static vectors with a learned component. Here's an example network that does this: https://github.com/explosion/thinc/blob/master/examples/text-pair/glove_mwe_multipool_siamese.py#L162

The network in that example uses a trickier "Embed" step, which sums the static vectors with learned vectors from my HashEmbed class. HashEmbed uses the "hashing trick" that has been popular in sparse linear models. The insight is similar to Bloom filters, so a recent paper has called this "Bloom embeddings". Basically, you mod the key into a fixed-size table, and compute multiple conditionally independent keys per word. This allows the table to map a very large number of vocabulary items to (mostly) distinct representations with relatively few rows.
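The hashing idea can be sketched in a few lines of numpy. This is a toy illustration of the "Bloom embedding" scheme, not thinc's actual HashEmbed implementation; Python's built-in `hash` stands in for proper conditionally independent hash functions:

```python
import numpy as np

def bloom_embed(word, table, num_hashes=4):
    """Map a word to a vector using several hashed lookups into a
    small fixed-size table, summing the selected rows.

    Distinct words collide on *some* rows, but rarely on all of them,
    so most words still get (mostly) distinct representations even
    when the vocabulary is far larger than the table."""
    rows, dim = table.shape
    # One key per seed; each key is modded into the table.
    keys = [hash((seed, word)) % rows for seed in range(num_hashes)]
    return table[keys].sum(axis=0)

rng = np.random.default_rng(0)
table = rng.normal(size=(16, 8))  # tiny table, unbounded vocabulary
vec = bloom_embed("aardvark", table)  # OOV words get vectors too
```

The key property for the similarity setting is determinism: the same word always hashes to the same keys, so it always gets the same vector, avoiding problem (iii) above.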

This hash-embedding trick isn't the only solution to the OOV problem. Using a character LSTM to create the OOV word features would probably work well too, but much more slowly.

ng-vocab.tsv file not found

Hi,

I am running this notebook: https://github.com/sujitpal/eeap-examples/blob/master/src/04c-ng-clf-eeap.ipynb

When I run the "Load Vocabulary" cell, it gives this error:

<ipython-input-4-e0b7b98e6728> in <module>()
      1 word2id = {"PAD": 0, "UNK": 1}
----> 2 fvocab = open(VOCAB_FILE, "rb")
      3 for i, line in enumerate(fvocab):
      4     word, count = line.strip().split("\t")
      5     if int(count) <= MIN_OCCURS:

FileNotFoundError: [Errno 2] No such file or directory: '../data/ng-vocab.tsv'

Missing data file creating problem with result verification

The ng-vocab.tsv file is missing. I checked this issue and added this code to generate the file and verify the result. It gave me an accuracy score of roughly 55%. Then, instead of using this, I used this dataset, and the result improved to 59%. I think the dataset is playing an important role here. So I would request that you post the original file, or at least the process of obtaining or generating it, so that we can run it and get the result you mentioned (71% accuracy). I ran this code.

ng-sim-datagen.py throwing an error

Python3 : TypeError: a bytes-like object is required, not 'str'
Python2 : IOError: [Errno socket error] [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:590)
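For the Python 3 TypeError, one common cause (an assumption here, since the failing line isn't shown) is opening a file in binary mode and then using str operations on its lines, as in the `open(VOCAB_FILE, "rb")` pattern from the traceback above. A minimal sketch of the failure and the text-mode fix, using a throwaway file rather than the actual dataset:

```python
import os
import tempfile

# Write a tiny stand-in vocabulary file (word<TAB>count per line).
path = os.path.join(tempfile.mkdtemp(), "vocab.tsv")
with open(path, "w", encoding="utf-8") as f:
    f.write("hello\t3\nworld\t1\n")

# In binary mode ("rb"), iterating yields bytes, so a str argument to
# split() raises: TypeError: a bytes-like object is required, not 'str'.
# Opening in text mode makes each line a str and the split just works.
word2count = {}
with open(path, "r", encoding="utf-8") as f:
    for line in f:
        word, count = line.strip().split("\t")
        word2count[word] = int(count)
# word2count == {"hello": 3, "world": 1}
```

An alternative is to keep binary mode and decode each line (`line.decode("utf-8")`) before splitting, which matters if the file mixes encodings.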

Can someone who is able to run this script put the dataset on a shared drive?
