Comments (6)
The layer's documentation for the forward pass says:

> (mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])
> ...
> mask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size).
> The mask is applied to the attention scores just before the softmax.
> See NNlib.make_causal_mask for creating causal masks. Default nothing.

So I think you should reshape as either `reshape(mask, (seq_len, 1, 1, batch_size))` or `reshape(mask, (1, seq_len, 1, batch_size))`. I'm not sure which of the two is correct.
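For concreteness, here is a minimal sketch of the two candidate shapes built from a per-token padding indicator (the names `valid`, `key_mask`, and `query_mask` are hypothetical, not part of the Flux API). Per the docstring, the first mask dimension is kv_len and the second is q_len:

```julia
seq_len, batch_size = 5, 2

# hypothetical per-token validity flags: true = real token, false = padding
valid = trues(seq_len, batch_size)
valid[4:5, :] .= false

# first mask dimension is kv_len: this variant masks the keys
key_mask = reshape(valid, (seq_len, 1, 1, batch_size))

# second mask dimension is q_len: this variant masks the queries
query_mask = reshape(valid, (1, seq_len, 1, batch_size))
```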
> So I think you should reshape as either `reshape(mask, (seq_len, 1, 1, batch_size))` or `reshape(mask, (1, seq_len, 1, batch_size))`. I'm not sure which of the two is correct.

Thanks, now it's working.
@alerem18 which of the two reshapes is correct in your case?
> @alerem18 which of the two reshapes is correct in your case?

`reshape(mask, (seq_len, 1, 1, batch_size))`
> @alerem18 which of the two reshapes is correct in your case?
>
> `reshape(mask, (seq_len, 1, 1, batch_size))`

However, the masking is still wrong: the mask should have shape (seq_len, seq_len, 1, batch_size). And with (1, seq_len, 1, batch_size) the layer returns NaN, so pad masking is not currently supported by the layer; I've already tried that:
```julia
using Flux

# two length-5 sequences; tokens 4:5 are padding (id 1)
l = reduce(hcat, [[5, 2, 3, 1, 1], [4, 5, 6, 1, 1]])  # 5×2 Matrix{Int}

# full (kv_len, q_len, 1, batch) mask: drop padded keys and padded queries
mask = fill(true, 5, 5, 1, 2)
mask[4:5, :, :, :] .= false  # mask padded key positions
mask[:, 4:5, :, :] .= false  # mask padded query positions

emb_layer = Embedding(10 => 128)
emb = emb_layer(l)  # 128×5×2
attn = MultiHeadAttention(128, nheads=2)
attn(emb, mask=mask)[2]  # second return value is the attention scores
```
Result:

```
5×5×2×2 Array{Float32, 4}:
[:, :, 1, 1] =
 0.326395   0.362849  0.343025   NaN  NaN
 0.0660359  0.402627  0.0637925  NaN  NaN
 0.60757    0.234524  0.593183   NaN  NaN
 0.0        0.0       0.0        NaN  NaN
 0.0        0.0       0.0        NaN  NaN

[:, :, 2, 1] =
 0.486156  0.144888  0.532702   NaN  NaN
 0.2133    0.422068  0.0270071  NaN  NaN
 0.300544  0.433044  0.440291   NaN  NaN
 0.0       0.0       0.0        NaN  NaN
 0.0       0.0       0.0        NaN  NaN

[:, :, 1, 2] =
 0.0449472  0.396037  0.347837   NaN  NaN
 0.198215   0.455466  0.0415825  NaN  NaN
 0.756838   0.148497  0.610581   NaN  NaN
 0.0        0.0       0.0        NaN  NaN
 0.0        0.0       0.0        NaN  NaN

[:, :, 2, 2] =
 0.778366   0.164352  0.220597   NaN  NaN
 0.0780623  0.445108  0.702782   NaN  NaN
 0.143571   0.39054   0.0766214  NaN  NaN
 0.0        0.0       0.0        NaN  NaN
 0.0        0.0       0.0        NaN  NaN
```
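The pattern in this output is telling: the zero rows come from the masked keys, while the NaN columns are the padded query positions, for which every key has been masked out. A possible workaround sketch, continuing the example above (hypothetical usage, not an official API pattern): mask only the keys, so every softmax column keeps at least one unmasked score, and ignore the outputs at padded query positions downstream.

```julia
# mask only the padded keys, shape (kv_len, 1, 1, batch): broadcasts over queries
key_mask = fill(true, 5, 1, 1, 2)
key_mask[4:5, :, :, :] .= false

α = attn(emb, mask=key_mask)[2]
any(isnan, α)  # false: padded queries still attend to the real keys
```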
Masking with shape (seq_len, 1, 1, batch_size) is OK, but with shape (1, seq_len, 1, batch_size) it returns NaN.
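The NaN behaviour follows from how the mask is applied: masked scores are effectively replaced with -Inf before the softmax (per the docstring, the mask acts on the attention scores just before the softmax), so a column whose scores are all masked softmaxes to NaN. A minimal demonstration, assuming NNlib's `softmax` (which Flux re-exports):

```julia
using NNlib

scores = fill(-Inf32, 5)  # a fully masked column of attention logits
softmax(scores)           # Float32[NaN, NaN, NaN, NaN, NaN]
```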