Comments (53)
Being confused about why attention can learn information about a specific index in the input sequence, I went and read the code of the official TensorFlow implementation. I was wrong about the attention_score_vec dense layer, which is known as the "memory layer" in the TF implementation. The weight matrix W is not sized (time_steps, time_steps) but rather (hidden_size, hidden_size), as shown here. The correct implementation should be:
def attention_3d_block(hidden_states):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # Inside dense layer
    #              hidden_states            dot               W            =>           score_first_part
    # (batch_size, time_steps, hidden_size) dot (hidden_size, hidden_size) => (batch_size, time_steps, hidden_size)
    # W is the trainable weight matrix of attention
    # Luong's multiplicative style score
    score_first_part = Dense(hidden_size, use_bias=False, name='attention_score_vec')(hidden_states)
    #            score_first_part           dot        last_hidden_state     => attention_weights
    # (batch_size, time_steps, hidden_size) dot   (batch_size, hidden_size)  => (batch_size, time_steps)
    h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
    score = dot([score_first_part, h_t], [2, 1], name='attention_score')
    attention_weights = Activation('softmax', name='attention_weight')(score)
    # (batch_size, time_steps, hidden_size) dot (batch_size, time_steps) => (batch_size, hidden_size)
    context_vector = dot([hidden_states, attention_weights], [1, 1], name='context_vector')
    pre_activation = concatenate([context_vector, h_t], name='attention_output')
    attention_vector = Dense(128, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)
    return attention_vector
score_first_part stands for $W h_s$, as part of $\mathrm{score}(h_t, h_s) = h_t^\top W h_s$.
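The shape bookkeeping in the comments above can be checked with a small NumPy sketch. This is not the Keras code itself: the function name luong_attention, the random W and the batch size of 4 are assumptions made only for illustration. It suggests that, because W is sized (hidden_size, hidden_size), the same weights can score sequences of any length.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def luong_attention(hidden_states, W):
    """hidden_states: (batch, time_steps, hidden); W: (hidden, hidden)."""
    h_t = hidden_states[:, -1, :]                            # last hidden state, (batch, hidden)
    score_first_part = hidden_states @ W                     # (batch, time_steps, hidden)
    score = np.einsum('bth,bh->bt', score_first_part, h_t)   # (batch, time_steps)
    attention_weights = softmax(score, axis=1)               # normalised over time steps
    context_vector = np.einsum('bth,bt->bh', hidden_states, attention_weights)
    return attention_weights, context_vector

rng = np.random.default_rng(0)
hidden_size = 32
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # hypothetical weights
for time_steps in (20, 35):                                  # same W, two sequence lengths
    states = rng.normal(size=(4, time_steps, hidden_size))
    weights, context = luong_attention(states, W)
    print(weights.shape, context.shape)
```

Nothing in W refers to a position in the sequence, so any positional preference would have to come from the hidden states themselves.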
Surprisingly, even without any hard information on the index within the sequence, the attention model still managed to learn the importance of the 10th element. Now I am super confused.
My guess is that the LSTM somehow learned to "count" to 10 in its hidden state, and that this "count" is captured by attention. I will need to visualize the inner parameters of the LSTM to be sure.
An interesting finding I made is how attention is learnt through time:
Full code (except attention_3d_block), shown here just for reference:
import numpy as np
from keras.layers import concatenate, dot
from keras.layers.core import *
from keras.layers.recurrent import LSTM
from keras.models import *
from attention_utils import get_activations, get_data_recurrent

INPUT_DIM = 100
TIME_STEPS = 20
# if True, the attention vector is shared across the input_dimensions where the attention is applied.
SINGLE_ATTENTION_VECTOR = True
APPLY_ATTENTION_BEFORE_LSTM = False

def attention_3d_block(hidden_states):
    # same as above

def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    # attention_mul = Flatten()(attention_mul)
    output = Dense(INPUT_DIM, activation='sigmoid', name='output')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

if __name__ == '__main__':
    N = 300000
    # N = 300 -> too few = no training
    inputs_1, outputs = get_data_recurrent(N, TIME_STEPS, INPUT_DIM)
    if APPLY_ATTENTION_BEFORE_LSTM:
        m = model_attention_applied_before_lstm()
    else:
        m = model_attention_applied_after_lstm()
    m.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    print(m.summary())
    m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)
    attention_vectors = []
    for i in range(10):
        testing_inputs_1, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
        activations = get_activations(m, testing_inputs_1, print_shape_only=True, layer_name='attention_weight')
        attention_vec = np.mean(activations[0], axis=0).squeeze()
        print('attention =', attention_vec)
        # abs() so that a sum far below 1 also fails the check
        assert abs(np.sum(attention_vec) - 1.0) < 1e-5
        attention_vectors.append(attention_vec)
    attention_vector_final = np.mean(np.array(attention_vectors), axis=0)
    # plot part.
    import matplotlib.pyplot as plt
    import pandas as pd
    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.show()
from keras-attention.
@Wangzihaooooo That is because attention was first introduced in a sequence-to-sequence model, where the attention score is computed from both h_t and all h_s. In a language/classification model (sequence-to-one), we don't have an h_t that represents the information of the Y currently being output. Therefore I just used the last hidden state as h_t.
To be fair, you can totally remove h_t from the score computation, which then just becomes score = W * h_s. That is essentially self-attention. It differs from traditional attention in that self-attention only scores how important a hidden state is globally, without the information of the current state of the LSTM.
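A minimal NumPy sketch of the query-free variant, score = W * h_s, might look like the following. It assumes W collapses to a single vector w of size hidden_size so that every hidden state gets one scalar score; the comment above does not specify this, and the name self_attention_pool is hypothetical.

```python
import numpy as np

def self_attention_pool(hidden_states, w):
    """hidden_states: (batch, time_steps, hidden); w: (hidden,) score vector (assumed shape)."""
    score = hidden_states @ w                          # (batch, time_steps), no h_t involved
    score = score - score.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(score)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over time steps
    # Weighted sum of the hidden states gives the context vector.
    context = (hidden_states * weights[:, :, None]).sum(axis=1)  # (batch, hidden)
    return weights, context

rng = np.random.default_rng(1)
states = rng.normal(size=(2, 20, 32))
w = rng.normal(size=32)
weights, context = self_attention_pool(states, w)
print(weights.shape, context.shape)
```

Here the weight of a time step depends only on its own hidden state (and the normalisation), which matches the "globally important" reading given above.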
from keras-attention.
you are passing the raw input without making them pass through the LSTM first
The input does pass through the LSTM first. A layer is an abstract description of how a tensor should be calculated, not the actual tensor to be calculated. The relationship is more like "class" and "instance", if you are familiar with OOP.
The outputs is the actual tensor (instance) of the attention_weight layer, which has already been connected to the previous tensors (the computational graph) by attention_weights = Activation('softmax', name='attention_weight')(score). It is not this specific tensor that takes testing_inputs_1; it is the computational graph, which begins at inputs = Input(shape=(TIME_STEPS, INPUT_DIM,)).
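The class/instance analogy can be made concrete with a toy graph in plain Python. This is only an illustration of the idea, not how Keras is implemented; Node, Layer and toy_lstm are hypothetical names.

```python
class Node:
    """A symbolic tensor: remembers which function produced it, and from which node."""
    def __init__(self, fn=None, parent=None, name=None):
        self.fn, self.parent, self.name = fn, parent, name

    def evaluate(self, input_value):
        # Walk back to the graph input, then apply every function on the way forward.
        if self.parent is None:
            return input_value
        return self.fn(self.parent.evaluate(input_value))


class Layer:
    """A reusable recipe; calling it on a Node creates a new Node connected to it."""
    def __init__(self, fn, name):
        self.fn, self.name = fn, name

    def __call__(self, node):
        return Node(fn=self.fn, parent=node, name=self.name)


inputs = Node(name='inputs')
lstm_out = Layer(lambda x: [v * 2 for v in x], 'toy_lstm')(inputs)
attention_weight = Layer(lambda x: [v / sum(x) for v in x], 'attention_weight')(lstm_out)

# Asking for the intermediate node still starts from the raw input
# and runs through toy_lstm first.
print(attention_weight.evaluate([1.0, 3.0]))
```

Fetching the value of attention_weight for a raw input therefore replays everything upstream of it, which is the behaviour described above for get_activations.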
from keras-attention.
@farahshamout Here is a rather complete explanation of attention in sequence-to-sequence models. The original idea of attention uses the output of the decoder as h_t, representing the "current decoding state". If you think of the "many-to-one" problem as a special case of the "many-to-many" problem, h_t becomes the last hidden state of the encoder.
from keras-attention.
OK, I'm figuring something out. One last question: I tried something like this:
att_weights = []
for i in range(10):
    activations = get_activations(mymodel, np.reshape(x, (1, 100, 30)), print_shape_only=True, layer_name='attention_weight')
    attention_vec = np.mean(activations[0], axis=0).squeeze()
    print('attention =', attention_vec)
    assert (np.sum(attention_vec) - 1.0) < 1e-5
    att_weights.append(attention_vec)
attention_vector_final = np.mean(np.array(att_weights), axis=0)
where x is my input. I do get an attention vector, but it is filled with ones. Maybe I'm still doing something wrong. Also, why is there a 10 in the for loop?
EDIT: sorry, I had underestimated the relevance of return_sequences=True in the LSTM. Now I'm able to plot the attention map. @felixhao28 thank you!
from keras-attention.
@OmniaZayed My implementation is similar to AttentionMV in Sujit Pal's code, except that ctx is the last hidden state.
from keras-attention.
@patebel The shape of h_t should be (batch_size, hidden_size, 1); you are missing the final "1" dimension. Keras used to reshape the output of a Lambda layer to your output_shape. Maybe adding h_t = Reshape((hidden_size, 1))(h_t) will fix it.
from keras-attention.
Hi @patebel, try squeezing the score vector with a Lambda layer:
from keras import backend as K
from keras.layers import Lambda
score = dot([score_first_part, h_t], [2, 1], name='attention_score')
score = Lambda(lambda x: K.squeeze(x, -1))(score)
It may be that the dimension of your score before applying the softmax function is (None, time_steps, 1) when it should be (None, time_steps).
from keras-attention.
@Labaien96 Thanks for the super fast reply, you were right!
If anyone else is experiencing the same issue: after squeezing the score and feeding it to attention_weights, you need to reshape attention_weights as follows to be able to compute the context_vector:
attention_weights = Reshape((attention_weights.shape[1], 1))(attention_weights)
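The effect of the trailing "1" dimension can be reproduced in NumPy. The sketch below uses random scores and made-up sizes, and its own softmax helper rather than Keras. A softmax taken over an axis of length 1 normalises each score against itself, so every weight comes out as 1; this may also be what happened in the reports of all-ones attention weights elsewhere in this thread, although that is only a guess.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

batch_size, time_steps, hidden_size = 2, 5, 3
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(batch_size, time_steps, hidden_size))
score_3d = rng.normal(size=(batch_size, time_steps, 1))

# Softmax over a last axis of length 1: each entry is divided by itself.
bad_weights = softmax(score_3d, axis=-1)

# Squeeze first, so the softmax runs across the time_steps axis instead.
score_2d = np.squeeze(score_3d, axis=-1)            # (batch_size, time_steps)
attention_weights = softmax(score_2d, axis=-1)

# Restore the trailing axis before the weighted sum over time steps.
weights_3d = attention_weights.reshape(batch_size, time_steps, 1)
context_vector = (hidden_states * weights_3d).sum(axis=1)   # (batch_size, hidden_size)

print(bad_weights.ravel())
print(attention_weights.sum(axis=-1))
```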
from keras-attention.
@fmehralian Try to use gradient clipping. You can use clipnorm and clipvalue. I experienced exploding gradients and was able to solve it using the clipping.
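For readers unfamiliar with the two options, here is a plain-Python sketch of what clipping by value and clipping by norm do to a single gradient vector. It is a conceptual illustration with hypothetical helper names, not the Keras implementation; in Keras, clipnorm and clipvalue are passed as optimizer arguments.

```python
import math

def clip_by_value(grad, clip_value):
    # Clamp every component into [-clip_value, clip_value].
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    # Rescale the whole vector if its L2 norm exceeds clip_norm; direction is kept.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    scale = clip_norm / norm
    return [g * scale for g in grad]

exploding = [30.0, -40.0]            # L2 norm is 50
print(clip_by_value(exploding, 0.5))
print(clip_by_norm(exploding, 1.0))
```

Clipping by norm preserves the direction of the update, while clipping by value can change it, which is one reason to try clipnorm first.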
from keras-attention.
- Yes, 128 is just a hyper-parameter you can fine-tune later.
- attention_mul has the size BATCH_SIZE * ATTENTION_SIZE, so the output has the size BATCH_SIZE * INPUT_DIM. It means that for each sample in the batch, the output generates a probability for each of the INPUT_DIM categories. The sigmoid is applied element-wise along the last dimension, so each category gets its own probability in (0, 1); note that these probabilities sum to 1 per sample only with a softmax activation, not with a sigmoid. This is not a part of the attention mechanism but rather a typical category-classification output network.
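A quick numeric check of how sigmoid and softmax treat the same output logits; the logit values are made up for illustration.

```python
import math

def sigmoid(logits):
    # Element-wise: each category is squashed independently into (0, 1).
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    # Normalised across categories: the outputs always sum to 1.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.0, -1.0]
print(sum(sigmoid(logits)))   # not constrained to 1
print(sum(softmax(logits)))   # 1 up to rounding
```

With one-hot targets and a categorical cross-entropy loss, softmax is the usual pairing; sigmoid outputs suit multi-label targets where several categories can be active at once.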
from keras-attention.
@junhuang-ifast yes
from keras-attention.
I updated the repo with all the comments of this thread. Thank you all!
from keras-attention.
Hi, thanks for all of the users' comments; I have learned a lot from them. May I ask a question? If we use an RNN (or some variant of it), we get the hidden state of each time step, which can then be used to compute the score. But if I use a 1D CNN instead of an LSTM as the encoder, what should I do when I want to apply attention? For example, to handle textual messages, I first use an embedding layer and then a 1D convolution layer. Is there a method I can use to apply the attention mechanism to my model? Thanks so much.
from keras-attention.
I implemented my own version of attention + LSTM. Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine.
INPUT_DIM = 100
TIME_STEPS = 20
# if True, the attention vector is shared across the input_dimensions where the attention is applied.
SINGLE_ATTENTION_VECTOR = True
APPLY_ATTENTION_BEFORE_LSTM = False
ATTENTION_SIZE = 128

def attention_3d_block(hidden_states):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # _t stands for transpose
    hidden_states_t = Permute((2, 1), name='attention_input_t')(hidden_states)
    # hidden_states_t.shape = (batch_size, hidden_size, time_steps)
    # this line is not useful. It's just to know which dimension is what.
    hidden_states_t = Reshape((hidden_size, TIME_STEPS), name='attention_input_reshape')(hidden_states_t)
    # Inside dense layer
    # a (batch_size, hidden_size, time_steps) dot W (time_steps, time_steps) => (batch_size, hidden_size, time_steps)
    # W is the trainable weight matrix of attention
    # Luong's multiplicative style score
    score_first_part = Dense(TIME_STEPS, use_bias=False, name='attention_score_vec')(hidden_states_t)
    score_first_part_t = Permute((2, 1), name='attention_score_vec_t')(score_first_part)
    #            score_first_part_t         dot        last_hidden_state     => attention_weights
    # (batch_size, time_steps, hidden_size) dot (batch_size, hidden_size, 1) => (batch_size, time_steps, 1)
    h_t = Lambda(lambda x: x[:, :, -1], output_shape=(hidden_size, 1), name='last_hidden_state')(hidden_states_t)
    score = dot([score_first_part_t, h_t], [2, 1], name='attention_score')
    attention_weights = Activation('softmax', name='attention_weight')(score)
    # if SINGLE_ATTENTION_VECTOR:
    #     a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
    #     a = RepeatVector(hidden_size)(a)
    # (batch_size, hidden_size, time_steps) dot (batch_size, time_steps, 1) => (batch_size, hidden_size, 1)
    context_vector = dot([hidden_states_t, attention_weights], [2, 1], name='context_vector')
    context_vector = Reshape((hidden_size,))(context_vector)
    h_t = Reshape((hidden_size,))(h_t)
    pre_activation = concatenate([context_vector, h_t], name='attention_output')
    attention_vector = Dense(ATTENTION_SIZE, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)
    return attention_vector
The interface remains the same, except that you don't need the Flatten layer anymore:
def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    # attention_mul = Flatten()(attention_mul)
    output = Dense(INPUT_DIM, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model
The results seem even better than your original implementation:
The process of building attention myself has brought me more questions than answers:
- What is SINGLE_ATTENTION_VECTOR? And how could you use K.mean as dimension reduction when all parameters in a are defined in a Dense layer? Doesn't that just mean all weight parameters have the same gradient for each batch, behave just like one parameter vector, and waste GPU memory by storing the full matrix?
- I understand your intuition behind APPLY_ATTENTION_BEFORE_LSTM, but that is not what attention is for, and you can pretty much achieve the same results by sending fixed-length input into a fully-connected layer and using the output of that layer as the input of an LSTM layer. "The data at index 10 being important" is not a good feature to learn through attention. The exact index of a time step should be transparent to the attention mechanism.
P.S. I have modified the get_data_recurrent function a little bit to produce one-hot data, as it is more similar to my actual needs.
import numpy as np

def get_data_recurrent(n, time_steps, input_dim, attention_column=10):
    """
    Data generation. x is purely random except that its value at attention_column equals the target y.
    In practice, the network should learn that the target = x[attention_column].
    Therefore, most of its attention should be focused on the value addressed by attention_column.
    :param n: the number of samples to retrieve.
    :param time_steps: the number of time steps of your series.
    :param input_dim: the number of dimensions of each element in the series.
    :param attention_column: the column linked to the target. Everything else is purely random.
    :return: x: model inputs, y: model targets
    """
    # random category index per time step, then one-hot encode it
    x = np.random.randint(input_dim, size=(n, time_steps))
    x = np.eye(input_dim)[x]
    y = x[:, attention_column, :]
    return x, y
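A short usage check of the one-hot generator, repeated here without its docstring so the snippet runs on its own. The sample count of 4 and the fixed seed are arbitrary choices for the example.

```python
import numpy as np

def get_data_recurrent(n, time_steps, input_dim, attention_column=10):
    # Random category index per time step, one-hot encoded along the last axis.
    x = np.random.randint(input_dim, size=(n, time_steps))
    x = np.eye(input_dim)[x]
    # The target is simply the one-hot vector found at attention_column.
    y = x[:, attention_column, :]
    return x, y

np.random.seed(0)
x, y = get_data_recurrent(n=4, time_steps=20, input_dim=100)
print(x.shape, y.shape)                    # (4, 20, 100) (4, 100)
print(np.array_equal(y, x[:, 10, :]))      # True: the target copies time step 10
```

Every time step carries exactly one active category, and only the one at index 10 determines the target, so a model that solves the task has to locate that step.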
from keras-attention.
I am actually thinking you were trying to implement self attention, which is used in text classification. But nonetheless the weight parameters should be sized (hidden_size, hidden_size) instead of (time_steps, time_steps).
from keras-attention.
@felixhao28 why do you use the layer that was named “last_hidden_state”?
from keras-attention.
@felixhao28 thank you ,I learned a lot from your code.
from keras-attention.
@felixhao28 thank you very much. This is very well explained and removes the complexity around the attention layer. I implemented the code inline for a Seq2Seq model and was able to grab the attention matrix directly. Thanks once again for your help.
Regards
Rajeev
from keras-attention.
@felixhao28 I'm a bit confused by this part of the code:
attention_vectors = []
for i in range(10):
    testing_inputs_1, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
    activations = get_activations(m, testing_inputs_1, print_shape_only=True, layer_name='attention_weight')
get_activations() effectively passes testing_inputs_1 through the layer 'attention_weight' and outputs the softmax probabilities for each. However, you are passing the raw input without making it pass through the LSTM first; is that on purpose? If so, can you explain why? Since in the model the inputs to the attention layers are the outputs of the LSTM layer(s), I would expect to have to do the same here.
Thanks!
from keras-attention.
@felixhao28 I see, thanks for the explanation!
from keras-attention.
Hi, can you clarify what you mean by "Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine."
from keras-attention.
@felixhao28 I see, thanks!
from keras-attention.
Hi, I was trying to use your implementation, but I would like to save an attention heat map during training (once per epoch). I tried adding return attention_vector, attention_weights but it is not what I wanted.
Do you have any suggestions?
from keras-attention.
@Bertorob I assume you added attention_weights to the outputs of the model. Sadly there is a limitation in Keras that every output needs to be paired with a "ground-truth y" and evaluated by a loss function. So if you intend to collect attention_weights for every batch, you need to provide an empty but same-sized numpy array as the second "ground-truth y" in model.fit, and a custom loss function for attention_weights that always returns 0.
If you only need the attention heat map once per epoch instead of once per batch, model.train_on_batch is what you need to replace model.fit with.
from keras-attention.
@felixhao28 Thank you for the answer. However, if I want to plot the attention after training, I suppose I don't need to add the second "ground-truth y", but I don't get how you are able to do it. Could you please explain how to do that?
from keras-attention.
This part of the code calculates the attention heat map:
attention_vectors = []
for i in range(10):
    ...  # lines omitted
pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                     title='Attention Mechanism as '
                                                                           'a function of input'
                                                                           ' dimensions.')
plt.show()
The attention_weights are not fetched directly during training. This code isn't run until after model.fit.
m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)
The line above runs just one epoch. If you create a loop around it and change plt.show to plt.savefig, you get a series of images of the attention weights. Ultimately the code looks like this:
for epoch_i in range(n_epochs):
    m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)
    attention_vectors = []
    for i in range(10):
        ...  # lines omitted
    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.savefig(f'attention-weights-{epoch_i}.png')
Edit: here I am still using model.fit instead of model.train_on_batch because the data here is really small and constant within each epoch. In reality, though, you might want to use model.train_on_batch for better flexibility.
from keras-attention.
@felixhao28 Both this repo and my version of attention are intended for sequence-to-one networks (although it can be easily tweaked for seq2seq by replacing h_t with current state of the decoder step).
Could you please show the details of implementing a seq2seq network? I would really appreciate that. Is it just a matter of setting return_sequences=True?
from keras-attention.
@LZQthePlane No, it is more complicated than that. The basic idea is to replace h_t with the current state of the decoder step. You might want to find other ready-to-use seq2seq attention code.
from keras-attention.
Hi @felixhao28, Thank you so much for your code and explanations above.
I am new to learning attention and I want to use it after LSTM for a classification problem. I understood the concepts of attention from this presentation [1] by Sujit Pal :
[1] https://www.slideshare.net/PyData/sujit-pal-applying-the-fourstep-embed-encode-attend-predict-framework-to-predict-document-similarity
After reading your code, I got confused about the type of attention (the theory behind it and what it is called in papers). Does it compute an attention vector on an incoming matrix using a learned context vector?
Hope you can help!
from keras-attention.
@felixhao28
Thank you so much for your code and explanation. I think it is quite right except for a slight problem. In my opinion, score_first_part shouldn't be related to h_t, which means the inputs of the attention_score_vec layer shouldn't include h_t. What do you think?
from keras-attention.
@Goofy321 How do you calculate the attention score then?
from keras-attention.
@Goofy321 How do you calculate the attention score then?
I mean that the input of the attention_score_vec layer changes to hidden_states[:, :-1, :], and the calculation of the attention score stays the same as yours.
from keras-attention.
@Goofy321 I think that works too.
from keras-attention.
@felixhao28: When I try to run your code, I get the following error when calculating the score:
score = dot([score_first_part, h_t], axes=[2, 1], name='attention_score')
ValueError: Shape must be rank 2 but is rank 3 for 'attention_score/MatMul' (op: 'MatMul') with input shapes: [?,20,32], [?,32]
Currently I can't figure out why the dimensions don't match, any idea? Did anyone else experience the same issues?
from keras-attention.
@felixhao28 Oh yes, I didn't recognize, thank you!
from keras-attention.
Hi @felixhao28, thanks for your insights and helpfulness in this issue! Reading the original paper by Bahdanau et al. and comparing the operations to this repository, I was really confused until I saw this.
I have a question for you and other people on this thread. I have a language model that is fed sequences of length 50 in batches of 32, and tries to predict the next token, where the vocabulary size is 35. Hence, it is an application of many-to-one for text generation. Below is the version that generates logical output.
However, when I apply the attention layer as you have suggested, before the final dense layer for prediction and with an attention size of 256, I get complete gibberish as output, with certain letters repeated back to back in a nonsensical way. Below is that version.
Any ideas why this approach fails? I have also tried without stacking LSTM layers, and it still fails. The only thing I can think of is that the token level for this language model is characters, whereas I have seen attention applied mostly to word-level language models. Any help will be appreciated!
UPDATE: Solved it, turns out I didn't set one of the Dense layers to be trainable.
from keras-attention.
@felixhao28 and the others: When I'm running the example, the attention weights are either all 1 (which I don't understand, because it's not possible by definition) or NaN (which I don't understand either :D). Did anyone else experience this behavior?
from keras-attention.
@felixhao28 Thanks for your well-documented code and clarifications on the theory and implementation of the attention mechanism.
I'm using this code for a similar problem and, obviously, the model requires training for more than 1 epoch. However, when the training passes ~25 epochs, the loss changes to NaN. Since there is no problem in the data, I think it might be about the model architecture. I followed the solutions recommended online for similar issues but couldn't solve it. Did anyone else experience this behavior?
from keras-attention.
Hi, I have two questions:
- From attention_3d_block():
attention_vector = Dense(128, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)
For this line, the number of output units is 128. Is this based on something, or just arbitrary (based on intuition)?
- From model_attention_applied_after_lstm():
output = Dense(INPUT_DIM, activation='sigmoid', name='output')(attention_mul)
For this line, must the number of output units be the same as INPUT_DIM? Doesn't this defeat the purpose of activation='sigmoid'?
If @felixhao28 or anyone else could help, it would be much appreciated.
from keras-attention.
@felixhao28 thanks for the quick response. I have one other question regarding
I implemented my own version of attention + LSTM. Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine.
which many have already asked you about.
If we take only the last hidden state, would that in a way be saying that we focus on one specific part (the last part in this case) of the LSTM output to solve the many-to-one problem? What if, however, the intuition is that the whole input sequence is important for predicting the one output? Would it be more suitable to use the mean along the time axis instead?
So something like
h_t = Lambda(lambda x: tf.reduce_mean(x, axis=1), output_shape=(unit,), name='mean_hidden_state')(hidden_states)
PS: using the mean is just an example; it could be any other function depending on the problem.
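To compare the two choices of query side by side, here is a small NumPy sketch in which the query is a plain argument. The function name attend, the random W and the tensor sizes are assumptions for illustration only, and the sketch says nothing about which query trains better.

```python
import numpy as np

def attend(hidden_states, W, query):
    """Multiplicative score h_s W query, then softmax over time steps."""
    score = np.einsum('bth,hk,bk->bt', hidden_states, W, query)   # (batch, time_steps)
    score -= score.max(axis=1, keepdims=True)                     # numerical stability
    weights = np.exp(score)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights

rng = np.random.default_rng(0)
states = rng.normal(size=(2, 6, 4))          # (batch, time_steps, hidden)
W = rng.normal(size=(4, 4))                  # hypothetical trained weights

last_query = states[:, -1, :]                # last hidden state as h_t
mean_query = states.mean(axis=1)             # mean over the time axis as h_t

w_last = attend(states, W, last_query)
w_mean = attend(states, W, mean_query)
print(w_last.round(3))
print(w_mean.round(3))
```

Only the query changes between the two calls; the scoring, softmax and downstream context computation stay the same, so swapping in another summary function is a one-line change.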
from keras-attention.
@felixhao28 thanks a ton for your useful comments! I haven't had time to work on this repo since then. I was pretty new to deep learning when I wrote it. I'm going to invest some time to integrate some of your suggestions and fix the things that need to be fixed :)
from keras-attention.
@junhuang-ifast In my application I was using attention in a sequence prediction model, which just focuses on the very next token in the sequence. Taking only the last hidden state worked fine due to the local nature of sequences.
I am not an expert on applications other than sequence prediction. But if I had to guess, you can omit h_t altogether (for example h_t = I, the identity matrix). This will produce a self-attention vector.
Averaging all hidden states feels strange because by using attention, you are assuming that not all elements in the sequence are equal. It is attention's job to figure out which ones are more important and by how much. Using the mean of all states erases that difference. Unless there is some global information, differing for each sequence, hidden in each element that you want to sum up, I don't feel averaging is the way to go. I might be wrong though.
from keras-attention.
@philipperemy No problem. We are all learning it as we discuss it.
from keras-attention.
@felixhao28 Just to be clear, when you say
h_t = I, identity matrix
would that be equivalent to not calculating h_t or the first dot product, i.e.
h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
score = dot([score_first_part, h_t], [2, 1], name='attention_score')
and just letting score = score_first_part?
from keras-attention.
@felixhao28 Do you have the link to the paper of this attention that was described in the TensorFlow tutorial?
from keras-attention.
@philipperemy the original link is gone but I think they are:
https://arxiv.org/abs/1409.0473
and
https://arxiv.org/abs/1508.04025
from keras-attention.
Actually, there are three different versions of attention. felixhao28's version is called global attention and philipperemy's version is called self-attention. The remaining one is called local attention, which is a little different from global attention.
from keras-attention.
Actually, there are three different versions of attention. felixhao28's version is called global attention and philipperemy's version is called self-attention. The remaining one is called local attention, which is a little different from global attention.
Do you know a good implementation for local attention?
from keras-attention.
Do you know how I can apply the attention module to a 2D-shaped input? I would like to apply attention after the LSTM layer:
Layer (type)                    Output Shape         Param #     Connected to
features (InputLayer)           (None, 16, 1816)     0
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 2048)         31662080    features[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1024)         2098176     lstm_1[0][0]
__________________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU)       (None, 1024)         0           dense_2[0][0]
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 120)          123000      leaky_re_lu_2[0][0]
__________________________________________________________________________________________________
feature_weights (InputLayer)    (None, 120)          0
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 120)          0           dense_3[0][0]
                                                                 feature_weights[0][0]
Total params: 33,883,256
Trainable params: 33,883,256
Non-trainable params: 0
__________________________________________________________________________________________________
I would really appreciate your suggestions on how to modify attention_3d_block to make it work for a 2D input as well. Thanks.
from keras-attention.
@raghavgurbaxani I answered you in your thread.
from keras-attention.
Hi @philipperemy and @felixhao28. I am trying to apply the attention model on top of an LSTM, where my input training data is an nd array. How should I fit my model in this case? I get the following error, apparently because my data is an nd array:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
What changes should I make? Would appreciate your help! Thank you
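One possible cause, offered here as an assumption about the data rather than a diagnosis, is an array with dtype=object, for example an array whose elements are themselves arrays of different lengths. A minimal NumPy sketch of turning such data into one dense float array follows; the helper name to_dense_batch and the zero-padding strategy are illustrative choices.

```python
import numpy as np

def to_dense_batch(sequences, time_steps, dtype=np.float32):
    """Pad or truncate (steps, features) arrays into one (n, time_steps, features) array."""
    n_features = sequences[0].shape[1]
    batch = np.zeros((len(sequences), time_steps, n_features), dtype=dtype)
    for i, seq in enumerate(sequences):
        steps = min(len(seq), time_steps)
        batch[i, :steps] = seq[:steps]       # remaining steps stay zero-padded
    return batch

# An array of arrays has dtype=object; such arrays cannot be converted to a tensor.
ragged = np.empty(2, dtype=object)
ragged[0] = np.ones((3, 4))
ragged[1] = np.ones((5, 4))
print(ragged.dtype)

dense = to_dense_batch(ragged, time_steps=5)
print(dense.shape, dense.dtype)
```

Checking x.dtype and x.shape before calling model.fit is a cheap way to see whether this applies: a dense numeric array reports a float or int dtype and a full 3D shape.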
from keras-attention.
@AnanyaO did you have a look at the examples here: https://github.com/philipperemy/keras-attention-mechanism/tree/master/examples?
from keras-attention.