
Comments (6)

pabloppp commented on August 28, 2024

> Just read this. That's what I have!

Okay, so maybe I was not able to understand the implementation... You're saying that if the user passes both an input and a keys parameter to the model, like decoder(yi, keys = enc_keys), then your implementation will apply masked self-attention to yi and then non-masked regular attention over enc_keys?

If that's so, it's really awesome, and then this issue makes no sense and you could close it.

Again, thanks for your awesome work!! I am currently using your library for a couple of projects and it definitely works like a charm.
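For reference, the call pattern I'm describing is roughly the encoder-decoder example from the reformer-pytorch README; the hyperparameter values below are only illustrative:

```python
import torch
from reformer_pytorch import ReformerLM

# Encoder: returns its final hidden states (not logits) so they can be
# handed to the decoder as keys.
encoder = ReformerLM(
    num_tokens = 20000,
    dim = 512,
    depth = 6,
    heads = 8,
    max_seq_len = 1024,
    return_embeddings = True
)

# Decoder: causal, and accepts the encoder output through the `keys` kwarg.
decoder = ReformerLM(
    num_tokens = 20000,
    dim = 512,
    depth = 6,
    heads = 8,
    max_seq_len = 1024,
    causal = True
)

x  = torch.randint(0, 20000, (1, 1024))   # source tokens
yi = torch.randint(0, 20000, (1, 1024))   # target tokens (shifted right)

enc_keys = encoder(x)                     # (1, 1024, 512) hidden states
yo = decoder(yi, keys = enc_keys)         # (1, 1024, 20000) logits
```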


pabloppp commented on August 28, 2024

Oh, seems like a clever approach 😮 thank you!


lucidrains commented on August 28, 2024

@pabloppp Good observations! In the past, for simplicity's sake, I have combined the self-attention and "regular" attention into one layer, which has worked for me for a number of tasks, although I have not done a head-to-head comparison with the original architecture. To keep the readme concise, and to make the reversible architecture work (it only accepts two functions F and G), I kept it the simplified way. However, I wouldn't oppose trying to be more faithful to the original architecture, but it would require some decisions about the reversible net, namely which two of the now three components get slotted into F and G.
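For context, a reversible residual block couples exactly two sub-functions, conventionally F (the attention) and G (the feed-forward), which is why a third component has no obvious slot. A minimal sketch of the coupling (ignoring the memory-saving backward pass the real implementation uses):

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Couples two sub-layers F and G so the inputs can be reconstructed
    from the outputs, which is what lets activations be freed during training."""
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f  # e.g. the (LSH) attention sub-layer
        self.g = g  # e.g. the feed-forward sub-layer

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # recover the inputs without ever having stored them
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```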


lucidrains commented on August 28, 2024

(The most faithful reproduction would be to do self-attention + feedforward and then regular-attention + feedforward, alternating in that manner.)
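Concretely, that layout would alternate which attention occupies the F slot of each reversible pair; a purely hypothetical schedule (not the library's actual wiring) would look like:

```python
# Hypothetical schedule for the "faithful" layout: every reversible pair holds
# one attention (F) and one feed-forward (G), with the attention alternating
# between causal self-attention and unmasked attention over enc_keys.
def faithful_schedule(depth):
    pairs = []
    for i in range(depth):
        attention = 'causal self-attention' if i % 2 == 0 else 'regular attention over enc_keys'
        pairs.append((attention, 'feed-forward'))
    return pairs

for f, g in faithful_schedule(4):
    print(f'F = {f:35} G = {g}')
```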


lucidrains commented on August 28, 2024

> Would it make sense to allow the choice to combine self-attention (k == v == q), where the mask should be applied if we want it to be causal, with regular attention (q != k == v) using the passed keys, where causality no longer makes much sense, because we might want to be able to focus on a word at the end of the sentence if the language has a different word ordering?

Just read this. That's what I have!


lucidrains commented on August 28, 2024

@pabloppp Yes, exactly, except they are done in one layer, in the same attention matrix: each token attends to all tokens of the past as well as to the enc_keys.
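A minimal sketch of that combined attention (not the library's actual code): the decoder's own key/value projections are concatenated with enc_keys, and the causal mask only covers the self portion, so every position can also attend to the whole encoder output:

```python
import torch

def combined_attention(q, k_self, v_self, k_ctx, v_ctx):
    """One attention over [self tokens ; encoder context], causal only on the self part.

    q, k_self, v_self: (batch, n, dim) projections of the decoder sequence
    k_ctx, v_ctx:      (batch, m, dim) projections of the encoder output (enc_keys)
    """
    n, d = q.shape[-2], q.shape[-1]
    m = k_ctx.shape[-2]

    k = torch.cat((k_self, k_ctx), dim = 1)        # (batch, n + m, d)
    v = torch.cat((v_self, v_ctx), dim = 1)

    scores = q @ k.transpose(-2, -1) / d ** 0.5    # (batch, n, n + m)

    # causal mask over the self portion only; the encoder portion is always visible
    causal  = torch.ones(n, n).triu(1).bool()      # True above the diagonal -> masked out
    context = torch.zeros(n, m).bool()             # never masked
    mask = torch.cat((causal, context), dim = 1)   # (n, n + m), broadcast over the batch

    scores = scores.masked_fill(mask, float('-inf'))
    return scores.softmax(dim = -1) @ v            # (batch, n, d)
```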

