
Comments (4)

ni9elf commented on May 14, 2024

Isn't line 174, eij = K.tanh(K.dot(x, self.W)), in textClassifierHATT.py implementing the one-layer MLP that produces u_it as the hidden representation of h_it before the attention weights are computed?
If yes, then why is another one-layer MLP used via TimeDistributed(Dense())? Please elaborate on the use of Dense() - there is no equivalent logic for the Dense layer described in the paper. Is this your own addition, or is it from the paper?
Thanks.
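For readers without the file open, the attention layer being discussed looks roughly like the following. This is a paraphrased, Keras 1.x / Theano-style sketch reconstructed from the line references in this thread, not a verbatim copy of textClassifierHATT.py:

```python
from keras import backend as K
from keras import initializations
from keras.engine.topology import Layer

class AttLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializations.get('normal')
        super(AttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # input: (batch, MAX_WORDS, 2*GRU_UNITS), e.g. (batch, 100, 200)
        assert len(input_shape) == 3
        self.W = self.init((input_shape[-1],))       # one weight vector shared by all words
        self.trainable_weights = [self.W]
        super(AttLayer, self).build(input_shape)

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))                     # ~line 174: one score per word
        ai = K.exp(eij)                                    # ~line 176
        weights = ai / K.expand_dims(K.sum(ai, axis=1))    # ~line 177: softmax over words
        weighted_input = x * K.expand_dims(weights)        # ~line 179: re-weight word vectors
        return K.sum(weighted_input, axis=1)               # ~line 180: sum -> sentence vector

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[-1])
```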


richliao commented on May 14, 2024

I used a Keras TimeDistributed(Dense()) layer to implement a vertical interaction of neurons on the LSTM output before feeding it to the attention layer. However, I really doubt whether it's useful.
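Concretely, the sentence-encoder wiring being described is roughly the following. This is a minimal sketch; the constants and layer names are assumptions made for illustration, not a verbatim copy of the repository code (AttLayer as sketched above):

```python
from keras.layers import Input, Embedding, Dense, GRU, Bidirectional, TimeDistributed
from keras.models import Model

MAX_SENT_LENGTH = 100   # assumed; matches the MAX_WORDS=100 discussed below
VOCAB_SIZE = 20000      # assumed
EMBEDDING_DIM = 100     # assumed

sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(sentence_input)
l_lstm = Bidirectional(GRU(100, return_sequences=True))(embedded)  # words in context: (100, 200)
l_dense = TimeDistributed(Dense(200))(l_lstm)                      # the extra per-word Dense in question
l_att = AttLayer()(l_dense)                                        # attention pools words -> sentence vector
sent_encoder = Model(sentence_input, l_att)
```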


philippschaefer4 commented on May 14, 2024

I have problems at the same point.

I can see that the "one-layer MLP" is split in your code into the Dense layer and line 174 of the attention layer. The MLP in the paper you mention in your blog consists of one input layer (the 200 nodes for the 2*100 values of each per-word output of the bidirectional GRUs, 100 units each, in the layer before), one hidden layer (the 200 nodes of the Dense layer), and one final layer. The final layer has ONE node, and I think that is the single value per word computed at the beginning of the attention layer in the code, isn't it? The attention layer of course receives all MAX_WORDS=100 words as 200-length vectors at once, i.e. a 100x200 matrix. So self.W in line 174 would be a 200-length weight vector that is applied to all words equally. So there are independent MLP weights from input to hidden (via TimeDistributed) and a common weight vector from hidden to output.
Line 174 reduces all 100 200-length word vectors to 100 scalars; the activation function is tanh. The resulting values are the e's in the paper. The MLP is the function a(h), with e = a(h) and h as the "word vector in context" from the bidirectional GRUs.
The rest of the attention layer just applies exp (line 176) and a softmax normalization (line 177), which gives 100 percentage values as per-word "attentions" (the alphas in the paper). Each word vector is then weighted by its word's attention (line 179), and the weighted vectors are summed into the "sentence vector" of length 200 (line 180).
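To make the shape bookkeeping concrete, here is a plain-NumPy sketch of the computation just described (the sizes and random values are illustrative only, assuming MAX_WORDS=100 and 200-dim word-in-context vectors):

```python
import numpy as np

h = np.random.randn(100, 200)          # 100 words, each a 200-length vector from TimeDistributed(Dense)
W = np.random.randn(200)               # the shared weight vector self.W

e = np.tanh(h @ W)                     # line 174: 100 scalars, one score per word
a = np.exp(e)
alpha = a / a.sum()                    # lines 176-177: softmax -> 100 attention weights summing to 1
s = (h * alpha[:, None]).sum(axis=0)   # lines 179-180: weighted sum -> sentence vector

print(s.shape)                         # (200,)
```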

Now the questions are:
First: Am I right? Is this what happens here? Is there only ONE attention layer with a shared weight vector, or are the 100 TimeDistributed Dense layers connected to 100 implicitly TimeDistributed attention layers?
Second: If I am right, I can't see in the paper(s) you refer to that the weights from hidden to output are shared among all words. Is the code just different in that respect, or did I misunderstand the papers?
Third: It also seems that it is not the outputs of the bidirectional GRUs that get weighted as word-in-context vectors (as shown in the figure from the paper that is also on your blog), but the hidden layers of the MLPs. Shouldn't the attention layer somehow have access to l_lstm so it could weight the original words-in-context?


philippschaefer4 commented on May 14, 2024

I just looked up the TimeDistributed layer wrapper again and realized that it means the same weights are also shared across the input-to-hidden connections of the MLP. I also think I remember reading something about "shared weights" in the paper. So questions 1 and 2 are answered for me.
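For what it's worth, a quick way to convince yourself of that sharing is to count parameters: TimeDistributed(Dense(k)) has exactly the parameters of a single Dense(k), regardless of the number of timesteps (an illustrative check, not code from the repository):

```python
from keras.layers import Input, Dense, TimeDistributed
from keras.models import Model

inp = Input(shape=(100, 200))               # 100 words, 200 features each
out = TimeDistributed(Dense(200))(inp)      # the same Dense weights applied at every timestep
model = Model(inp, out)
print(model.count_params())                 # 200*200 + 200 = 40200, independent of the 100 timesteps
```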

It is just the third one that is left: why use the hidden layer of the MLP as the word context to be weighted, and not the output of l_lstm?

