
Comments (4)

ni9elf commented on May 14, 2024

Isn't line 174, eij = K.tanh(K.dot(x, self.W)), in textClassifierHATT.py implementing the one-layer MLP that produces u_it as the hidden representation of h_it before the attention weights are computed?
If yes, then why is another one-layer MLP used via TimeDistributed(Dense())? Please elaborate on the use of Dense() - there is no equivalent logic for the Dense layer described in the paper. Is this your own addition, or is it from the paper?
Thanks.
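For readers without the file open, the attention layer being discussed looks roughly like the following. This is a paraphrased, Keras 1.x / Theano-style sketch reconstructed from the line references in this thread, not a verbatim copy of textClassifierHATT.py:

```python
from keras import backend as K
from keras import initializations
from keras.engine.topology import Layer

class AttLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializations.get('normal')
        super(AttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # input: (batch, MAX_WORDS, 2*GRU_UNITS), e.g. (batch, 100, 200)
        assert len(input_shape) == 3
        self.W = self.init((input_shape[-1],))       # one weight vector shared by all words
        self.trainable_weights = [self.W]
        super(AttLayer, self).build(input_shape)

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))                     # ~line 174: one score per word
        ai = K.exp(eij)                                    # ~line 176
        weights = ai / K.expand_dims(K.sum(ai, axis=1))    # ~line 177: softmax over words
        weighted_input = x * K.expand_dims(weights)        # ~line 179: re-weight word vectors
        return K.sum(weighted_input, axis=1)               # ~line 180: sum -> sentence vector

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[-1])
```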


richliao commented on May 14, 2024

I used a Keras TimeDistributed(Dense()) layer to implement a vertical interaction of neurons on the LSTM output before feeding it to the attention layer. However, I really doubt whether it's useful.
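Concretely, the sentence-encoder wiring being described is roughly the following. This is a minimal sketch; the constants and layer names are assumptions made for illustration, not a verbatim copy of the repository code (AttLayer as sketched above):

```python
from keras.layers import Input, Embedding, Dense, GRU, Bidirectional, TimeDistributed
from keras.models import Model

MAX_SENT_LENGTH = 100   # assumed; matches the MAX_WORDS=100 discussed below
VOCAB_SIZE = 20000      # assumed
EMBEDDING_DIM = 100     # assumed

sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(sentence_input)
l_lstm = Bidirectional(GRU(100, return_sequences=True))(embedded)  # words in context: (100, 200)
l_dense = TimeDistributed(Dense(200))(l_lstm)                      # the extra per-word Dense in question
l_att = AttLayer()(l_dense)                                        # attention pools words -> sentence vector
sent_encoder = Model(sentence_input, l_att)
```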


philippschaefer4 commented on May 14, 2024

I have problems at the same point.

I can see that the "one-layer MLP" is split in your code into the Dense layer and line 174 of the attention layer. The MLP in the paper you mention in your blog consists of one input layer (the 200 nodes for the 2*100 values of each per-word output of the bidirectional GRUs, 100 units each, in the layer before), one hidden layer (the 200 nodes of the Dense layer), and one final layer. The final layer has ONE node, and I think that is the single value per word computed at the beginning of the attention layer in the code, isn't it? The attention layer of course receives all MAX_WORDS=100 words as 200-length vectors at once, i.e. a 100x200 matrix. So self.W in line 174 would be a 200-length weight vector that is applied to all words equally. So there are independent MLP weights from input to hidden (via TimeDistributed) and a common weight vector from hidden to output.
Line 174 reduces all 100 200-length word vectors to 100 scalars; the activation function is tanh. The resulting values are the e's in the paper. The MLP is the function a(h), with e = a(h) and h as the "word vector in context" from the bidirectional GRUs.
The rest of the attention layer just applies exp (line 176) and a softmax normalization (line 177), which gives 100 percentage values as per-word "attentions" (the alphas in the paper). Each word vector is then weighted by its word's attention (line 179), and the weighted vectors are summed into the "sentence vector" of length 200 (line 180).
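To make the shape bookkeeping concrete, here is a plain-NumPy sketch of the computation just described (the sizes and random values are illustrative only, assuming MAX_WORDS=100 and 200-dim word-in-context vectors):

```python
import numpy as np

h = np.random.randn(100, 200)          # 100 words, each a 200-length vector from TimeDistributed(Dense)
W = np.random.randn(200)               # the shared weight vector self.W

e = np.tanh(h @ W)                     # line 174: 100 scalars, one score per word
a = np.exp(e)
alpha = a / a.sum()                    # lines 176-177: softmax -> 100 attention weights summing to 1
s = (h * alpha[:, None]).sum(axis=0)   # lines 179-180: weighted sum -> sentence vector

print(s.shape)                         # (200,)
```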

Now the questions are:
First: Am I right? Is this what happens here? Is there only ONE attention layer with a shared weight vector, or are the 100 TimeDistributed Dense layers connected to 100 implicitly TimeDistributed attention layers?
Second: If I am right, I can't see in the paper(s) you refer to that the weights from hidden to output are shared among all words. Is the code just different in that respect, or did I misunderstand the papers?
Third: It also seems that it is not the outputs of the bidirectional GRUs that get weighted as word-in-context vectors (as shown in the figure from the paper that is also on your blog), but the hidden layers of the MLPs. Shouldn't the attention layer somehow have access to l_lstm so it could weight the original words-in-context?


philippschaefer4 commented on May 14, 2024

I just looked up the TimeDistributed layer wrapper again and realized that it means the same weights are also shared across the input-to-hidden connections of the MLP. I also think I remember reading something about "shared weights" in the paper. So questions 1 and 2 are answered for me.
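For what it's worth, a quick way to convince yourself of that sharing is to count parameters: TimeDistributed(Dense(k)) has exactly the parameters of a single Dense(k), regardless of the number of timesteps (an illustrative check, not code from the repository):

```python
from keras.layers import Input, Dense, TimeDistributed
from keras.models import Model

inp = Input(shape=(100, 200))               # 100 words, 200 features each
out = TimeDistributed(Dense(200))(inp)      # the same Dense weights applied at every timestep
model = Model(inp, out)
print(model.count_params())                 # 200*200 + 200 = 40200, independent of the 100 timesteps
```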

It is just the third one that is left: why use the hidden layer of the MLP as the word context to be weighted, and not the output of l_lstm?

