Comments (6)
The layer's documentation for the forward pass says:

> (mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])
> ...
> mask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size).
> The mask is applied to the attention scores just before the softmax.
> See NNlib.make_causal_mask for creating causal masks. Default nothing.

So I think you should reshape as either `reshape(mask, (seq_len, 1, 1, batch_size))` or `reshape(mask, (1, seq_len, 1, batch_size))`. I'm not sure which of the two is correct.
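For concreteness, here is a minimal sketch of the two candidate shapes built from a per-token padding indicator (the names `valid`, `key_mask`, and `query_mask` are hypothetical, not part of the Flux API). Per the docstring, the first mask dimension is kv_len and the second is q_len:

```julia
seq_len, batch_size = 5, 2

# hypothetical per-token validity flags: true = real token, false = padding
valid = trues(seq_len, batch_size)
valid[4:5, :] .= false

# first mask dimension is kv_len: this variant masks the keys
key_mask = reshape(valid, (seq_len, 1, 1, batch_size))

# second mask dimension is q_len: this variant masks the queries
query_mask = reshape(valid, (1, seq_len, 1, batch_size))
```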
> So I think you should reshape as either `reshape(mask, (seq_len, 1, 1, batch_size))` or `reshape(mask, (1, seq_len, 1, batch_size))`. I'm not sure which of the two is correct.

Thanks, now it's working.
@alerem18 which of the two reshapes is correct in your case?
> @alerem18 which of the two reshapes is correct in your case?

`reshape(mask, (seq_len, 1, 1, batch_size))`
> @alerem18 which of the two reshapes is correct in your case?
>
> `reshape(mask, (seq_len, 1, 1, batch_size))`

However, the masking is still wrong: the mask should have shape (seq_len, seq_len, 1, batch_size). And with (1, seq_len, 1, batch_size) the layer returns NaN, so pad masking is not currently supported by the layer; I've already tried that:
```julia
using Flux

# two length-5 sequences; tokens 4:5 are padding (id 1)
l = reduce(hcat, [[5, 2, 3, 1, 1], [4, 5, 6, 1, 1]])  # 5×2 Matrix{Int}

# full (kv_len, q_len, 1, batch) mask: drop padded keys and padded queries
mask = fill(true, 5, 5, 1, 2)
mask[4:5, :, :, :] .= false  # mask padded key positions
mask[:, 4:5, :, :] .= false  # mask padded query positions

emb_layer = Embedding(10 => 128)
emb = emb_layer(l)  # 128×5×2
attn = MultiHeadAttention(128, nheads=2)
attn(emb, mask=mask)[2]  # second return value is the attention scores
```
Result:

```
5×5×2×2 Array{Float32, 4}:
[:, :, 1, 1] =
 0.326395   0.362849  0.343025   NaN  NaN
 0.0660359  0.402627  0.0637925  NaN  NaN
 0.60757    0.234524  0.593183   NaN  NaN
 0.0        0.0       0.0        NaN  NaN
 0.0        0.0       0.0        NaN  NaN

[:, :, 2, 1] =
 0.486156  0.144888  0.532702   NaN  NaN
 0.2133    0.422068  0.0270071  NaN  NaN
 0.300544  0.433044  0.440291   NaN  NaN
 0.0       0.0       0.0        NaN  NaN
 0.0       0.0       0.0        NaN  NaN

[:, :, 1, 2] =
 0.0449472  0.396037  0.347837   NaN  NaN
 0.198215   0.455466  0.0415825  NaN  NaN
 0.756838   0.148497  0.610581   NaN  NaN
 0.0        0.0       0.0        NaN  NaN
 0.0        0.0       0.0        NaN  NaN

[:, :, 2, 2] =
 0.778366   0.164352  0.220597   NaN  NaN
 0.0780623  0.445108  0.702782   NaN  NaN
 0.143571   0.39054   0.0766214  NaN  NaN
 0.0        0.0       0.0        NaN  NaN
 0.0        0.0       0.0        NaN  NaN
```
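The pattern in this output is telling: the zero rows come from the masked keys, while the NaN columns are the padded query positions, for which every key has been masked out. A possible workaround sketch, continuing the example above (hypothetical usage, not an official API pattern): mask only the keys, so every softmax column keeps at least one unmasked score, and ignore the outputs at padded query positions downstream.

```julia
# mask only the padded keys, shape (kv_len, 1, 1, batch): broadcasts over queries
key_mask = fill(true, 5, 1, 1, 2)
key_mask[4:5, :, :, :] .= false

α = attn(emb, mask=key_mask)[2]
any(isnan, α)  # false: padded queries still attend to the real keys
```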
Masking with shape (seq_len, 1, 1, batch_size) is OK, but with shape (1, seq_len, 1, batch_size) it returns NaN.
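The NaN behaviour follows from how the mask is applied: masked scores are effectively replaced with -Inf before the softmax (per the docstring, the mask acts on the attention scores just before the softmax), so a column whose scores are all masked softmaxes to NaN. A minimal demonstration, assuming NNlib's `softmax` (which Flux re-exports):

```julia
using NNlib

scores = fill(-Inf32, 5)  # a fully masked column of attention logits
softmax(scores)           # Float32[NaN, NaN, NaN, NaN, NaN]
```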