lsdefine / attention-is-all-you-need-keras
A Keras+TensorFlow Implementation of the Transformer: Attention Is All You Need
Do you think this model is suitable for timeseries forecasting?
I tried translating different English sentences, but got the same translation every time.
I tried both decode_sequence and decode_sequence_fast.
I trained for only 2 epochs; is that a problem?
In
def get_loss(args):
    y_pred, y_true = args
    y_true = tf.cast(y_true, 'int32')
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
    mask = tf.cast(tf.not_equal(y_true, 0), 'float32')
    loss = tf.reduce_sum(loss * mask, -1) / tf.reduce_sum(mask, -1)
    loss = K.mean(loss)
    return loss
the line
    loss = tf.reduce_sum(loss * mask, -1) / tf.reduce_sum(mask, -1)
seems to produce a single element, so taking its mean would not make a difference.
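Whether the mean matters depends on the shape that reaches it. A small NumPy sketch with made-up values (not the repository's data), assuming the per-token loss has shape [batch, length], suggests that reducing over the last axis leaves one value per sequence, so the final mean would still average over the batch:

```python
import numpy as np

# Hypothetical per-token losses for a batch of 2 sequences of length 4.
token_loss = np.array([[1.0, 2.0, 3.0, 4.0],
                       [2.0, 4.0, 3.0, 0.5]])
# Target ids; 0 marks padding, as in get_loss.
y_true = np.array([[5, 7, 0, 0],
                   [3, 9, 4, 0]])

mask = (y_true != 0).astype("float32")
# Summing over the last axis leaves one masked average per sequence.
per_sequence = (token_loss * mask).sum(-1) / mask.sum(-1)
# The final mean then averages over the batch axis.
batch_loss = per_sequence.mean()

print(per_sequence.shape, per_sequence, batch_loss)
```

With a batch of one sequence the two would coincide, which may be where the impression comes from.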
self.target_layer = TimeDistributed(Dense(o_tokens.num(), use_bias=False))
should be changed to:
self.target_layer = TimeDistributed(Dense(o_tokens.num(), activation='softmax', use_bias=False))
Hi,
I am using only the LayerNormalization layer from your code in my own model. I didn't change anything in the code apart from overriding the compute_mask function, because my input is an Embedding with mask_zero=True.
Code:
class LayerNormalization(Layer):
    def __init__(self, eps=1e-6, **kwargs):
        self.eps = eps
        super(LayerNormalization, self).__init__(**kwargs)
    def build(self, input_shape):
        self.gamma = self.add_weight(name='gamma', shape=input_shape[-1:],
                                     initializer=Ones(), trainable=True)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:],
                                    initializer=Zeros(), trainable=True)
        super(LayerNormalization, self).build(input_shape)
    def call(self, x):
        mean = K.mean(x, axis=-1, keepdims=True)
        std = K.std(x, axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta
    def compute_output_shape(self, input_shape):
        return input_shape
    def compute_mask(self, inputs, input_mask=None):
        return input_mask
but strangely I get NaN for all the measurements I take while training and tuning (the loss function and others). I tried other implementations of the LayerNormalization layer (e.g. https://github.com/CyberZHG/keras-layer-normalization), and everything works without problems. Do you have any clue what might cause this?
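One possible cause, offered as a guess rather than a diagnosis: if any position ever has a constant feature vector, its standard deviation is exactly 0, and the derivative of the square root is unbounded there, which can turn into NaN during backpropagation even though the forward pass looks fine. Variants that add eps to the variance inside the square root avoid that point. A forward-only NumPy sketch of such a variant (the function name is illustrative, not from the repository):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize over the last axis, with eps added to the variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    # sqrt(var + eps) stays differentiable when var == 0, whereas for
    # std + eps the factor d sqrt(var) / d var = 0.5 / sqrt(var) blows up.
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[0.0, 0.0, 0.0, 0.0],   # a constant row: variance is exactly 0
              [1.0, 2.0, 3.0, 4.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out)
```

Checking whether the NaN appears first in the loss or in the gradients would help confirm or rule this out.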
Hello, I suspect that the mask you use for the decoder is not correct.
In the decoder, the mask is a matrix whose upper-right triangle is all ones:
mask = K.cumsum(tf.eye(len_s, batch_shape=bs), 1)
In [4]: np.cumsum(np.eye(5), 1)
Out[4]:
array([[1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1.],
       [0., 0., 1., 1., 1.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.]])
That means that, when self-attention is computed, the first word takes the entire output sequence into account through W * mask * V. That is not correct during training, and it could also affect prediction.
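One thing worth checking before concluding this: the NumPy test above uses a 2-D identity, while tf.eye with batch_shape returns a 3-D tensor, so axis 1 refers to a different dimension. A sketch that mimics the batched case in NumPy (assuming a [batch, len, len] layout) gives a lower-triangular mask instead:

```python
import numpy as np

len_s, batch = 4, 2
# Batched identity, analogous to tf.eye(len_s, batch_shape=[batch]).
eye = np.broadcast_to(np.eye(len_s), (batch, len_s, len_s))
# Axis 1 of a [batch, len, len] array is the row (query) axis.
mask = np.cumsum(eye, axis=1)

print(mask[0])
# For comparison, the 2-D case accumulates along columns instead.
print(np.cumsum(np.eye(len_s), 1))
```

In the batched result, row i has ones only in columns 0..i, so each position can see itself and earlier positions.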
I'm running into a lot of errors attempting to run the Transformer.py file for testing purposes.
The issue begins with:
(100000, 7) (100000, 9)
X: [[ 2 11 12 ... 7 4 3]
[ 2 10 11 ... 12 12 3]
[ 2 5 5 ... 13 6 3]
...
[ 2 13 11 ... 12 5 3]
[ 2 7 12 ... 6 11 3]
[ 2 6 4 ... 7 13 3]]
Y: [[ 2 4 20 ... 19 14 3]
[ 2 4 20 ... 11 8 3]
[ 2 4 20 ... 8 3 0]
...
[ 2 4 20 ... 19 9 3]
[ 2 4 20 ... 19 3 0]
[ 2 4 20 ... 15 3 0]]
2018-07-10 18:46:13.676502: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Traceback (most recent call last):
File "transformer.py", line 586, in <module>
s2s.compile('adam')
File "transformer.py", line 396, in compile
enc_output = self.encoder(src_seq, src_pos, active_layers=active_layers)
File "transformer.py", line 306, in __call__
mask = Lambda(lambda x:GetPadMask(emb, emb))(src_seq)
File "/Users/user/anaconda2/envs/tfdeeplearning/lib/python3.6/site-packages/keras/engine/base_layer.py", line 460, in __call__
output = self.call(inputs, **kwargs)
File "/Users/user/anaconda2/envs/tfdeeplearning/lib/python3.6/site-packages/keras/layers/core.py", line 693, in call
return self.function(inputs, **arguments)
File "transformer.py", line 306, in <lambda>
mask = Lambda(lambda x:GetPadMask(emb, emb))(src_seq)
File "transformer.py", line 255, in GetPadMask
ones = K.expand_dims(K.ones_like(Q, 'float32'), -1)
AttributeError: 'Tensor' object has no attribute 'expand_dims'
What versions of Keras and TensorFlow are you using for development?
Could you add that information to a requirements.txt file, or to the README?
I am wondering whether this is an issue of conflicting versions.
I am using:
tensorflow 1.8.0 Keras 2.2.0
I've tried wrapping the operations in Lambda layers, which works for the first two lines of the GetPadMask function, but I run into issues again with the K.batch_dot operation.
Any ideas? I am relatively new to the Keras framework.
Hi, I'm a beginner, and I found that
attn = Lambda(lambda x:K.batch_dot(x[0],x[1],axes=[2,2])/self.temper)([q, k])
is equivalent to
attn = Lambda(lambda x:tf.matmul(x[0],x[1],transpose_b=True)/self.temper)([q, k])
Sorry to disturb you.
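The equivalence can be checked numerically. The sketch below uses NumPy stand-ins rather than the Keras calls themselves: an einsum that contracts the last axis of both inputs (the role of axes=[2,2]) and a batched matmul with the second operand transposed (the role of transpose_b=True). Shapes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 8))   # [batch, len_q, d_k]
k = rng.normal(size=(2, 5, 8))   # [batch, len_k, d_k]

# Contract the last axis of q with the last axis of k.
scores_dot = np.einsum("bqd,bkd->bqk", q, k)
# Same product written as a batched matmul with k transposed.
scores_matmul = q @ np.swapaxes(k, 1, 2)

print(scores_dot.shape, np.allclose(scores_dot, scores_matmul))
```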
Hello,
After the embedding layer, each token embedding is the sum of the learned token embedding and the static positional embedding. The padded positions therefore also carry a positional embedding value. So, before the embeddings enter the encoder or decoder, do the embedding sequences need to be multiplied by the padding mask to remove the influence of the padded positions?
Thanks,
Looking forward to your reply.
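If zeroing the padded positions turns out to be needed, it could look like the NumPy sketch below. All names and values are illustrative. Whether it is needed at all is a separate question: if the attention mask already hides padded keys and the loss ignores padded targets, the padded embeddings may have no effect on the real positions.

```python
import numpy as np

token_ids = np.array([[4, 9, 2, 0, 0]])   # 0 is the padding id
d_model = 3
# Stand-in for token embedding + positional embedding.
emb = np.ones((1, 5, d_model))

# [batch, len, 1] so that it broadcasts over the embedding axis.
pad_mask = (token_ids != 0).astype(emb.dtype)[..., None]
# Padded positions become zero vectors; real positions are unchanged.
masked_emb = emb * pad_mask

print(masked_emb[0])
```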
I want to play around with the transformer, but I'm confused by the shapes.
print(train[0]) [ 2 4 1 283 51 283 986 6 284 8 226 227 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
train.shape is (1000, 57)
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 57) 0
__________________________________________________________________________________________________
embedding_2 (Embedding) (None, 57, 300) 865200 input_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 57, 300) 90000 embedding_2[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 57, 300) 90000 embedding_2[0][0]
__________________________________________________________________________________________________
lambda_3 (Lambda) (None, 57) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_4 (Lambda) (None, None, None) 0 dense_1[0][0]
__________________________________________________________________________________________________
lambda_5 (Lambda) (None, None, None) 0 dense_2[0][0]
__________________________________________________________________________________________________
lambda_7 (Lambda) (None, 57) 0 lambda_3[0][0]
__________________________________________________________________________________________________
input_2 (InputLayer) (None, 57) 0
__________________________________________________________________________________________________
lambda_8 (Lambda) (None, None, None) 0 lambda_4[0][0]
lambda_5[0][0]
__________________________________________________________________________________________________
lambda_9 (Lambda) (None, 57) 0 lambda_7[0][0]
__________________________________________________________________________________________________
lambda_1 (Lambda) (None, 56) 0 input_2[0][0]
__________________________________________________________________________________________________
add_1 (Add) (None, None, None) 0 lambda_8[0][0]
lambda_9[0][0]
__________________________________________________________________________________________________
embedding_3 (Embedding) (None, 56, 300) 865200 lambda_1[0][0]
__________________________________________________________________________________________________
lambda_12 (Lambda) (None, 56, 56) 0 lambda_1[0][0]
__________________________________________________________________________________________________
lambda_13 (Lambda) (None, None, None) 0 lambda_1[0][0]
__________________________________________________________________________________________________
activation_1 (Activation) (None, None, None) 0 add_1[0][0]
__________________________________________________________________________________________________
dense_3 (Dense) (None, 57, 300) 90000 embedding_2[0][0]
__________________________________________________________________________________________________
dense_5 (Dense) (None, 56, 300) 90000 embedding_3[0][0]
__________________________________________________________________________________________________
dense_6 (Dense) (None, 56, 300) 90000 embedding_3[0][0]
__________________________________________________________________________________________________
lambda_14 (Lambda) (None, 56, 56) 0 lambda_12[0][0]
lambda_13[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout) (None, None, None) 0 activation_1[0][0]
__________________________________________________________________________________________________
lambda_6 (Lambda) (None, None, None) 0 dense_3[0][0]
__________________________________________________________________________________________________
lambda_16 (Lambda) (None, None, None) 0 dense_5[0][0]
__________________________________________________________________________________________________
lambda_17 (Lambda) (None, None, None) 0 dense_6[0][0]
__________________________________________________________________________________________________
lambda_19 (Lambda) (None, 56, 56) 0 lambda_14[0][0]
__________________________________________________________________________________________________
lambda_10 (Lambda) (None, None, None) 0 dropout_1[0][0]
lambda_6[0][0]
__________________________________________________________________________________________________
lambda_20 (Lambda) (None, None, None) 0 lambda_16[0][0]
lambda_17[0][0]
__________________________________________________________________________________________________
lambda_21 (Lambda) (None, 56, 56) 0 lambda_19[0][0]
__________________________________________________________________________________________________
lambda_11 (Lambda) (None, None, 300) 0 lambda_10[0][0]
__________________________________________________________________________________________________
add_4 (Add) (None, None, None) 0 lambda_20[0][0]
lambda_21[0][0]
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, None, 300) 90300 lambda_11[0][0]
__________________________________________________________________________________________________
activation_2 (Activation) (None, None, None) 0 add_4[0][0]
__________________________________________________________________________________________________
dense_7 (Dense) (None, 56, 300) 90000 embedding_3[0][0]
__________________________________________________________________________________________________
dropout_6 (Dropout) (None, None, 300) 0 time_distributed_1[0][0]
__________________________________________________________________________________________________
dropout_3 (Dropout) (None, None, None) 0 activation_2[0][0]
__________________________________________________________________________________________________
lambda_18 (Lambda) (None, None, None) 0 dense_7[0][0]
__________________________________________________________________________________________________
add_2 (Add) (None, None, 300) 0 embedding_2[0][0]
dropout_6[0][0]
__________________________________________________________________________________________________
lambda_22 (Lambda) (None, None, None) 0 dropout_3[0][0]
lambda_18[0][0]
__________________________________________________________________________________________________
layer_normalization_2 (LayerNor (None, None, 300) 600 add_2[0][0]
__________________________________________________________________________________________________
lambda_23 (Lambda) (None, None, 300) 0 lambda_22[0][0]
__________________________________________________________________________________________________
conv1d_1 (Conv1D) (None, None, 512) 154112 layer_normalization_2[0][0]
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, None, 300) 90300 lambda_23[0][0]
__________________________________________________________________________________________________
conv1d_2 (Conv1D) (None, None, 300) 153900 conv1d_1[0][0]
__________________________________________________________________________________________________
dropout_7 (Dropout) (None, None, 300) 0 time_distributed_2[0][0]
__________________________________________________________________________________________________
dropout_2 (Dropout) (None, None, 300) 0 conv1d_2[0][0]
__________________________________________________________________________________________________
add_5 (Add) (None, None, 300) 0 embedding_3[0][0]
dropout_7[0][0]
__________________________________________________________________________________________________
add_3 (Add) (None, None, 300) 0 dropout_2[0][0]
layer_normalization_2[0][0]
__________________________________________________________________________________________________
layer_normalization_4 (LayerNor (None, None, 300) 600 add_5[0][0]
__________________________________________________________________________________________________
layer_normalization_1 (LayerNor (None, None, 300) 600 add_3[0][0]
__________________________________________________________________________________________________
dense_9 (Dense) (None, None, 300) 90000 layer_normalization_4[0][0]
__________________________________________________________________________________________________
dense_10 (Dense) (None, None, 300) 90000 layer_normalization_1[0][0]
__________________________________________________________________________________________________
lambda_15 (Lambda) (None, 56, 57) 0 lambda_1[0][0]
input_1[0][0]
__________________________________________________________________________________________________
lambda_24 (Lambda) (None, None, None) 0 dense_9[0][0]
__________________________________________________________________________________________________
lambda_25 (Lambda) (None, None, None) 0 dense_10[0][0]
__________________________________________________________________________________________________
lambda_27 (Lambda) (None, 56, 57) 0 lambda_15[0][0]
__________________________________________________________________________________________________
lambda_28 (Lambda) (None, None, None) 0 lambda_24[0][0]
lambda_25[0][0]
__________________________________________________________________________________________________
lambda_29 (Lambda) (None, 56, 57) 0 lambda_27[0][0]
__________________________________________________________________________________________________
add_6 (Add) (None, None, None) 0 lambda_28[0][0]
lambda_29[0][0]
__________________________________________________________________________________________________
activation_3 (Activation) (None, None, None) 0 add_6[0][0]
__________________________________________________________________________________________________
dense_11 (Dense) (None, None, 300) 90000 layer_normalization_1[0][0]
__________________________________________________________________________________________________
dropout_4 (Dropout) (None, None, None) 0 activation_3[0][0]
__________________________________________________________________________________________________
lambda_26 (Lambda) (None, None, None) 0 dense_11[0][0]
__________________________________________________________________________________________________
lambda_30 (Lambda) (None, None, None) 0 dropout_4[0][0]
lambda_26[0][0]
__________________________________________________________________________________________________
lambda_31 (Lambda) (None, None, 300) 0 lambda_30[0][0]
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, None, 300) 90300 lambda_31[0][0]
__________________________________________________________________________________________________
dropout_8 (Dropout) (None, None, 300) 0 time_distributed_3[0][0]
__________________________________________________________________________________________________
add_7 (Add) (None, None, 300) 0 layer_normalization_4[0][0]
dropout_8[0][0]
__________________________________________________________________________________________________
layer_normalization_5 (LayerNor (None, None, 300) 600 add_7[0][0]
__________________________________________________________________________________________________
conv1d_3 (Conv1D) (None, None, 512) 154112 layer_normalization_5[0][0]
__________________________________________________________________________________________________
conv1d_4 (Conv1D) (None, None, 300) 153900 conv1d_3[0][0]
__________________________________________________________________________________________________
dropout_5 (Dropout) (None, None, 300) 0 conv1d_4[0][0]
__________________________________________________________________________________________________
add_8 (Add) (None, None, 300) 0 dropout_5[0][0]
layer_normalization_5[0][0]
__________________________________________________________________________________________________
layer_normalization_3 (LayerNor (None, None, 300) 600 add_8[0][0]
__________________________________________________________________________________________________
time_distributed_4 (TimeDistrib (None, None, 57) 17100 layer_normalization_3[0][0]
==================================================================================================
Total params: 3,447,424
Trainable params: 3,447,424
Non-trainable params: 0
__________________________________________________________________________________________________
I want to feed in the training data and have the model output exactly the same sentence as the input.
How do I do that?
To run under Python 2.7, you need to change
super().__init__(**kwargs)
to
super(LayerNormalization, self).__init__(**kwargs)
and
super().build(input_shape)
to
super(LayerNormalization, self).build(input_shape)
in the class
class LayerNormalization(Layer):
def GetSubMask(s):
    len_s = tf.shape(s)[1]
    bs = tf.shape(s)[:1]
    mask = K.cumsum(tf.eye(len_s, batch_shape=bs), 1)
    return mask
If the input has shape (5, 4, 3), wouldn't tf.eye here create a lower-triangular tensor of shape (5, 4, 4) instead of (5, 4, 3), because of the [:1]?
After running "python pinyin.py train" and "python pinyin.py test", I get a result like this:
" 天中,方后人将化的面要给物业,以确保学生安全。"
Why do I get this wrong answer?
First of all, thanks for this awesome repository.
This is not really an issue, but rather a question.
Can anyone tell me the difference between decode_sequence_readout and decode_sequence_fast?
def reshape1(x):
    s = tf.shape(x)   # [batch_size, len_q, n_head * d_k]
    x = tf.reshape(x, [s[0], s[1], n_head, d_k])
    x = tf.transpose(x, [2, 0, 1, 3])
    x = tf.reshape(x, [-1, s[1], d_k])   # [n_head * batch_size, len_q, d_k]
    return x
This function is also used for the value vectors, which have shape [batch_size, len_q, n_head * d_v].
It raises an error if d_k and d_v are not the same.
The code above is from MultiHeadAttention in transformer.py.
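One way around the d_k / d_v mismatch is to infer the per-head width from the tensor itself instead of hard-coding d_k. A NumPy sketch of the idea (the helper name split_heads is made up for this example):

```python
import numpy as np

def split_heads(x, n_head):
    """[batch, len, n_head * d] -> [n_head * batch, len, d], with d inferred from x."""
    batch, length, width = x.shape
    d = width // n_head
    x = x.reshape(batch, length, n_head, d)
    x = x.transpose(2, 0, 1, 3)          # head axis first, as in reshape1
    return x.reshape(-1, length, d)

n_head, d_k, d_v = 2, 4, 6
keys = np.zeros((3, 5, n_head * d_k))
values = np.zeros((3, 5, n_head * d_v))

print(split_heads(keys, n_head).shape)     # per-head width d_k
print(split_heads(values, n_head).shape)   # per-head width d_v
```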
Thank you for your excellent work. When will the test demo be available?
I want to replace the Keras bidirectional LSTM layer below with a Transformer encoder:
lstmLayer = keras.layers.Bidirectional(keras.layers.CuDNNLSTM(args.rnnSize, return_sequences=True, recurrent_initializer='glorot_uniform'))(inputLayer)
Can this be accomplished using your library? The rest of the code stays the same; I just want to replace the bidirectional LSTM layers with a Transformer.
I would really appreciate your help. Thanks.
The paper specifies that it is not Z (the encoder output) that gets passed on; it is actually K and V that are passed to the decoder. The code simply takes the encoder output as input.
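As I read the paper, the two descriptions may not conflict: in encoder-decoder attention the keys and values are projections of the encoder output, computed inside the decoder's attention layer. A single-head NumPy sketch of that reading, with random weights and shapes chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4
enc_output = rng.normal(size=(5, d_model))   # encoder output, [len_src, d_model]
dec_state = rng.normal(size=(3, d_model))    # decoder-side input, [len_tgt, d_model]

# Hypothetical projection weights owned by the decoder's attention layer.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q = dec_state @ w_q
k = enc_output @ w_k     # keys are a projection of the encoder output
v = enc_output @ w_v     # values are a projection of the same tensor

scores = q @ k.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ v    # [len_tgt, d_model]

print(context.shape)
```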
What is the license for your shared code?
When I run the code I get this error on line 100 of transformer.py:
ValueError: Axis 0 of input tensor should have a defined dimension, but is None. Full tensor shape: (None, None, None). Typically you need to pass a fully-defined input_shape argument to your first layer.
Could you specify the versions of Keras and TensorFlow that you used for your tests?
Hi
I have managed to train the network using your dataset, but I don't know how to use the trained model to perform translation. Please advise, thanks.
Hi!
It is unusual to have n_head == 1, but that case does not work in the MultiHeadAttention class (mode=1).
To fix it, it is enough to change
head = Concatenate()(heads)
attn = Concatenate()(attns)
to
if n_head == 1:
    head = heads[0]
    attn = attns[0]
else:
    head = Concatenate()(heads)
    attn = Concatenate()(attns)
because
A `Concatenate` layer should be called on a list of at least 2 inputs
@lsdefine Thanks for sharing. I use the transformer for a seq2seq task: the input is an article and the model predicts the abstract. After training, I get almost the same output for different inputs. The code is the same as your example, and the data should be right, because with the same data and an LSTM block as the seq2seq model I get proper output.
Hoping for your answer, thanks.
Hi, thanks a lot for your code. I think I have found a bug.
In the reshape1 function of the MultiHeadAttention layer:
x = tf.reshape(x, [s[0], s[1], n_head, s[2]//n_head])
x = tf.transpose(x, [2, 0, 1, 3])
x = tf.reshape(x, [-1, s[1], s[2]//n_head])
the transpose puts the head axis before the batch axis. After reshaping, the first axis is ordered like this (suppose N samples and only 2 heads):
sample_0_head_0
sample_1_head_0
sample_2_head_0
...
sample_N-1_head_0
sample_0_head_1
sample_1_head_1
sample_2_head_1
...
sample_N-1_head_1
But repeating mask with
mask = Lambda(lambda x:K.repeat_elements(x, n_head, 0))(mask)
returns mask ordered like this:
mask_0,
mask_0,
mask_1,
mask_1,
...
mask_N-1,
mask_N-1,
(see the documentation of repeat_elements for its usage)
However, what we actually want is mask ordered like this:
mask_0,
mask_1,
...
mask_N-1,
mask_0,
mask_1,
...
mask_N-1
So I think reshape1 should change x = tf.transpose(x, [2, 0, 1, 3]) into x = tf.transpose(x, [0, 2, 1, 3]), and reshape2 should change accordingly.
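The two orderings can be compared with a small NumPy sketch. It uses integer sample ids in place of real tensors and assumes that K.repeat_elements behaves like np.repeat along the given axis:

```python
import numpy as np

n_head, batch = 2, 3
sample_ids = np.arange(batch)                  # stand-in for one mask per sample

repeated = np.repeat(sample_ids, n_head)       # like repeat_elements: 0 0 1 1 2 2
tiled = np.tile(sample_ids, n_head)            # 0 1 2 0 1 2

# [batch, n_head] grid where every entry records which sample it came from.
grid = np.broadcast_to(sample_ids[:, None], (batch, n_head))
# Head axis first, then flatten: the order produced by transpose [2, 0, 1, 3].
head_major = grid.T.reshape(-1)
# Batch axis first, then flatten: the order produced by transpose [0, 2, 1, 3].
sample_major = grid.reshape(-1)

print(head_major, tiled, np.array_equal(head_major, tiled))
print(sample_major, repeated, np.array_equal(sample_major, repeated))
```

So either change should line the mask up with the heads: tiling the mask while keeping the current transpose, or keeping the repeat and changing the transpose as proposed above.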
Should there be a layer norm at the end of the encoder layer, as below? In the original paper there is a norm layer after the position-wise feed-forward network.
class EncoderLayer():
    def __init__(self, d_model, d_inner_hid, n_head, dropout=0.1):
        self.self_att_layer = MultiHeadAttention(n_head, d_model, dropout=dropout)
        self.pos_ffn_layer = PositionwiseFeedForward(d_model, d_inner_hid, dropout=dropout)
        self.norm_layer = LayerNormalization()
    def __call__(self, enc_input, mask=None):
        output, slf_attn = self.self_att_layer(enc_input, enc_input, enc_input, mask=mask)
        output1 = self.norm_layer(Add()([enc_input, output]))
        output = self.pos_ffn_layer(output1)
        output = self.norm_layer(Add()([output1, output]))
        return output, slf_attn
Have you ever tried to save or pickle your trained model? It does not seem to work on my side; I get an error when I use model.to_json.
@lsdefine could you please tell me how I can use the transformer instead of an LSTM layer in a simple encoder, as in this small example?
model = Sequential()
model.add(Embedding(top_words, 100, input_length=max_words, trainable=True))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
Hello. I tried to run your script and got the following error message:
(base) C:\Users\cp\Python\attention-is-all-you-need-keras>python en2de_main.py
Using TensorFlow backend.
loading data/en2de_word.txt
loading data/en2de.h5
loading data/en2de.valid.h5
seq 1 words: 3369
seq 2 words: 3665
train shapes: (29000, 43) (29000, 47)
valid shapes: (1014, 34) (1014, 39)
2020-03-11 13:08:06.384900: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default
inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Traceback (most recent call last):
File "en2de_main.py", line 33, in <module>
s2s.compile(Adam(0.001, 0.9, 0.98, epsilon=1e-9))
File "C:\Users\cp\Python\attention-is-all-you-need-keras\transformer.py", line 452, in compile
loss = get_loss(final_output, tgt_true)
File "C:\Users\cp\Python\attention-is-all-you-need-keras\transformer.py", line 440, in get_loss
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 3537, in sparse_softmax_cros
s_entropy_with_logits_v2
labels=labels, logits=logits, name=name)
File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 3470, in sparse_softmax_cros
s_entropy_with_logits
array_ops.shape(logits)[:-1]))
File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\check_ops.py", line 658, in assert_equal
data, summarize, message, name)
File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\check_ops.py", line 333, in _binary_assert
if condition:
File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 757, in __bool__
self._disallow_bool_casting()
File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 526, in _disallow_bool_ca
sting
self._disallow_in_graph_mode("using a `tf.Tensor` as a Python `bool`")
File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 515, in _disallow_in_grap
h_mode
" this function with @tf.function.".format(task))
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not al
lowed in Graph execution. Use Eager execution or decorate this function with @tf.function.
I use Keras 2+ and TensorFlow 2+, without any GPU.
Hello,
Thanks for a great project, which helped me build a model on top of it.
I was wondering about one thing: it seems that you do not implement skip connections (residual connections) in the Transformer?
Is that because you implemented them and didn't observe an improvement?
Or is it just because you didn't implement them?
I ask because when I use more layers, I actually get worse performance. I am not sure whether that is simply how it is (i.e. more layers do not help), or whether it is because I don't have skip connections, which usually help when building a deeper model.
Best,
UserWarning: Output "lambda_83" missing from loss dictionary. We assume this was done on purpose, and we will not be expecting any data to be passed to "lambda_83" during training.
self.model.compile(optimizer, None)
In transformer.py, line 87,
mask = Lambda(lambda x:K.repeat_elements(x, n_head, 0))(mask)
this line gives the mask (in readout_model) a shape like (batch_size * n_head, x, x), whereas the result of reshape1 has a shape like (n_head * batch_size, x, x). The shapes look the same, but the order of the elements differs.
Maybe repeat_elements could be changed to tile?
Hello, I checked the source code and found that the mask is implemented as follows:
class ScaledDotProductAttention():
    def __init__(self, d_model, attn_dropout=0.1):
        self.temper = np.sqrt(d_model)
        self.dropout = Dropout(attn_dropout)
    def __call__(self, q, k, v, mask):
        attn = Lambda(lambda x:K.batch_dot(x[0],x[1],axes=[2,2])/self.temper)([q, k])
        if mask is not None:
            mmask = Lambda(lambda x:(-1e+10)*(1-x))(mask)
            attn = Add()([attn, mmask])
        attn = Activation('softmax')(attn)
        attn = self.dropout(attn)
        output = Lambda(lambda x:K.batch_dot(x[0], x[1]))([attn, v])
        return output, attn
As far as I can tell, the "Add()([attn, mmask])" operation broadcasts "mmask" to the shape of "attn", which masks whole rows of "attn". But this may make the following softmax meaningless, since softmax operates on each row. To be clearer:
'''
## we neglect the batch dimension
attn = [
a_11, a_12, a_13
a_21, a_22, a_23
a_31, a_32, a_33
](q=3, k=3)
mmask = [
0.0,
0.0,
-inf,
](q=3)
## after broadcasting:
attn += mmask
== [
a_11, a_12, a_13
a_21, a_22, a_23
-inf, -inf, -inf
](q=3, k=3)
attn = softmax(attn)
== [
softmax(a_11, a_12, a_13)
softmax(a_21, a_22, a_23)
1/3, 1/3, 1/3 <-------- is not what we want
](q=3, k=3)
'''
Am I missing something? Or should the mask be applied after the softmax layer?
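The outcome depends on which axis the mask covers. The NumPy sketch below assumes the mask already has the full [len_q, len_k] shape and marks padded keys (columns); I have not verified this against GetPadMask. In that case each row's softmax simply gives the padded key zero weight. A row that is masked entirely does come out uniform, as described above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))                  # [len_q, len_k], batch axis dropped
key_is_real = np.array([1.0, 1.0, 0.0])    # third key is padding
mask = np.ones((3, 1)) * key_is_real       # every query hides the third key

# Same additive trick as mmask: a large negative number where mask == 0.
attn = softmax(scores + (-1e10) * (1 - mask))
# A fully masked row degenerates to a uniform distribution.
fully_masked = softmax(np.full(3, -1e10))

print(attn)
print(fully_masked)
```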
Hello, inspired by openai/finetune-transformer-lm, I am trying to build a language model based on your code. A question came up during implementation.
self.model = Model([src_seq_input, tgt_seq_input], loss)
self.model.add_loss([loss])
self.model.compile(optimizer, None)
Why don't you add the loss function through the compile API? I am not quite sure about the effect of the add_loss API.
By the way, I made a language-model encoder based on your Encoder, but I added GetSubMask as you did in the Decoder. I would then like to add a CRF layer after the encoder (for sequence labelling, whereas OpenAI's model is for text classification), and finally train the model on the language-model loss + CRF loss. Do you have any implementation suggestions? Especially any ideas for verifying the correctness of the code...
I saw your example data about pinyin and Chinese; are you Chinese?
Keras implements a BatchNormalization layer. Isn't the LayerNormalization class the same thing?
Ref: https://keras.io/layers/normalization/
(Or is the code for a version of Keras where BN was not implemented?)
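The two layers differ in the axis over which the statistics are computed. A minimal NumPy sketch, leaving out the learned scale and shift, the epsilon term, and the running averages that BatchNormalization keeps for inference:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 8.0, 12.0]])   # [batch, features]

# Batch normalization: statistics per feature, computed across the batch axis.
batch_normed = (x - x.mean(axis=0)) / x.std(axis=0)
# Layer normalization: statistics per sample, computed across the feature axis.
layer_normed = (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

print(batch_normed)
print(layer_normed)
```

Layer normalization therefore does not depend on the other samples in the batch, which is convenient for variable-length sequences.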
I am trying to implement and test the approach for video encoding. I would like to feed the system sets of image frames from videos and encode them using only the encoder part. I am therefore trying to comment out the decoder and figure out what modifications are needed to make it work. I am a bit puzzled by lines 30 and 34 in pinyin_main.py:
gen = dd.S2SDataGenerator('data/pinyin.corpus.txt', itokens, otokens, batch_size=32, max_len=120)
s2s.model.fit_generator(gen, steps_per_epoch=2000, epochs=5, callbacks=[lr_scheduler, model_saver])
Could I easily replace the gen object with a tensor? What exactly does gen stand for?
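In general, the object passed to fit_generator is a Python generator (or similar) that keeps yielding batches. I have not checked what S2SDataGenerator yields exactly, so the sketch below only shows the general pattern with made-up arrays and a hypothetical helper name:

```python
import numpy as np

def batch_generator(x, y, batch_size):
    """Yield (inputs, targets) batches forever, looping over the data."""
    n = len(x)
    while True:
        for start in range(0, n, batch_size):
            yield x[start:start + batch_size], y[start:start + batch_size]

frames = np.zeros((10, 4))   # hypothetical pre-extracted frame features
labels = np.zeros((10, 1))
gen = batch_generator(frames, labels, batch_size=4)
first_x, first_y = next(gen)

print(first_x.shape, first_y.shape)
```

If the whole dataset fits in memory as arrays, calling fit with those arrays directly is usually the simpler route.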