hello, I suspect that the mask you used for decoder is not correct. In decoder, th

mask for decoder about attention-is-all-you-need-keras HOT 6 CLOSED

lsdefine commented on August 14, 2024

mask for decoder

from attention-is-all-you-need-keras.

Comments (6)

XiaoLiuAI commented on August 14, 2024

What if I build a matrix whose upper-right part is zero and multiply it element-wise with the attention matrix? For example:

mask = tf.matrix_band_part(tf.ones((q.shape[1], k.shape[1])), -1, 0)  # lower-triangular ones
...
attn = Multiply()([attn, mask])

Would that have an equivalent effect?
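To make the comparison in this question concrete, here is a minimal NumPy sketch (not the repo's code) that contrasts an element-wise multiplicative mask with the additive `-1e10` mask used in the attention block of this thread. The `softmax` helper, the 4x4 size, and the random scores are assumptions made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))     # hypothetical raw scores, shape (query, key)
tril = np.tril(np.ones((4, 4)))      # 1 = allowed, 0 = future position

# Additive mask before softmax, as in the attention block quoted in this thread.
additive = softmax(scores + (-1e10) * (1 - tril))

# Multiplying the raw scores before softmax: masked scores become 0,
# and exp(0) = 1, so future keys still receive weight.
mult_before = softmax(scores * tril)

# Multiplying after softmax: future keys are zeroed, but rows no longer
# sum to 1 unless they are renormalized.
mult_after = softmax(scores) * tril
renormalized = mult_after / mult_after.sum(axis=-1, keepdims=True)
```

Under these assumptions, the multiplicative mask only matches the additive one when it is applied after the softmax and the rows are renormalized afterwards.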


XiaoLiuAI commented on August 14, 2024

Hi, thank you for your response. But I still want to make sure that I understand correctly. Let me put the attention block below:

        attn = Lambda(lambda x: K.batch_dot(x[0], x[1], axes=[2, 2]) / self.temper)([q, k])
        if mask is not None:
            mmask = Lambda(lambda x: (-1e+10) * (1 - x))(mask)
            attn = Add()([attn, mmask])
        attn = Activation('softmax')(attn)
        attn = self.dropout(attn)
        output = Lambda(lambda x: K.batch_dot(x[0], x[1]))([attn, v])

Does the Activation layer ensure that the mask is "column-based"? It applies softmax over the last dimension; is that the column dimension of the attention matrix?
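A small NumPy sketch of the same block may help with the axis question. It assumes `q` and `k` have shape (batch, length, depth), uses `np.einsum` as a stand-in for `K.batch_dot(..., axes=[2, 2])`, and uses `sqrt(d)` as a placeholder for `self.temper`; all sizes are made up for illustration.

```python
import numpy as np

bs, len_q, len_k, d = 2, 3, 3, 4     # hypothetical sizes
rng = np.random.default_rng(0)
q = rng.normal(size=(bs, len_q, d))
k = rng.normal(size=(bs, len_k, d))

# Contract the depth axis of q and k: result has shape (batch, query, key).
attn = np.einsum('bqd,bkd->bqk', q, k) / np.sqrt(d)

# mask[b, i, j] = 1 means query i may attend to key j.
mask = np.tril(np.ones((len_q, len_k)))[None, :, :]   # broadcast over batch
attn = attn + (-1e10) * (1 - mask)

# Softmax over the last axis, i.e. over the keys of each query row.
attn = attn - attn.max(axis=-1, keepdims=True)
weights = np.exp(attn)
weights /= weights.sum(axis=-1, keepdims=True)
```

In this layout each row of the (query, key) matrix is normalized on its own, so the first query puts all of its weight on the first key and every later query spreads its weight over the keys at or before its own position.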


lsdefine commented on August 14, 2024

Sorry, my previous answer was wrong; I have now found the right one.
In an experiment using 1-mask+eye, the training and dev accuracy quickly went to nearly 100%, but the model could not process any user input. This means the model was using future information.
The key point is that axis 1 is not the column axis, because there is a "Batch" axis in front of it.

>>> K.eval(GetSubMask(q))   # mask = K.cumsum(tf.eye(len_s, batch_shape=bs), 1)
array([[[1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1.]]], dtype=float32)
>>> np.cumsum(np.eye(5), 1)   # Your question
array([[1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1.],
       [0., 0., 1., 1., 1.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.]])
>>> np.cumsum(np.eye(5), 0)   # If no "Batch" axis, the cum axis is 0
array([[1., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0.],
       [1., 1., 1., 0., 0.],
       [1., 1., 1., 1., 0.],
       [1., 1., 1., 1., 1.]])
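The axis shift shown in the session above can be checked directly in NumPy. This is only a sketch: `np.broadcast_to` stands in for `tf.eye(len_s, batch_shape=bs)`, and the sizes are arbitrary.

```python
import numpy as np

bs, len_s = 2, 5                     # hypothetical batch size and sequence length
# NumPy stand-in for a batched identity of shape (batch, len_s, len_s).
eye = np.broadcast_to(np.eye(len_s), (bs, len_s, len_s))

# With a leading batch axis, axis 1 is the row axis of each matrix.
batched = np.cumsum(eye, axis=1)
# Without the batch axis, the same row axis is axis 0.
unbatched = np.cumsum(np.eye(len_s), axis=0)
# Accumulating along the last axis gives the upper triangle instead.
along_last = np.cumsum(eye, axis=2)

tril = np.tril(np.ones((len_s, len_s)))
```

Here `batched[b]` and `unbatched` both equal `tril`, while `along_last[b]` is its transpose.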


lsdefine commented on August 14, 2024

We do need a lower-left triangular mask, as expected.
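One way to check which triangle is correct is to perturb a future position and see whether earlier outputs change. The sketch below is a simplified single-head self-attention in NumPy with q = k = v = x; the function name `masked_self_attention` and all sizes are hypothetical, and this is not the repo's implementation.

```python
import numpy as np

def masked_self_attention(x, mask):
    """Single-head self-attention with q = k = v = x (simplified for illustration)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores + (-1e10) * (1 - mask)           # additive mask, as in this thread
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
x_future_changed = x.copy()
x_future_changed[-1] += 1.0          # change only the last position

lower = np.tril(np.ones((5, 5)))     # position i sees positions <= i
upper = lower.T                      # position i sees positions >= i (wrong for a decoder)

# With the lower-triangular mask, earlier outputs ignore the changed last token.
causal_ok = np.allclose(masked_self_attention(x, lower)[:-1],
                        masked_self_attention(x_future_changed, lower)[:-1])

# With the upper-triangular mask, earlier outputs change: future information leaks.
leaks = not np.allclose(masked_self_attention(x, upper)[:-1],
                        masked_self_attention(x_future_changed, upper)[:-1])
```

A leak of this kind is consistent with the symptom described earlier in the thread, where accuracy looked near perfect during training but the model failed on real input.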


XiaoLiuAI commented on August 14, 2024

Thanks, that is clear.


lsdefine commented on August 14, 2024

Hi, the training corpus is too large, so the corpus in the repo is only there to show the format. You may need to generate a large corpus to get a good result. On Tuesday, August 21, 2018 at 3:51 PM, Xiao.Liu wrote:

…

Hello, I tried the pinyin example with all the configuration untouched. But the final test result is very bad. The result for 'ji zhi hu die zai yang guang xia fei wu 。' is '斯对管道的资不仅能加中考。'
