
keras-multi-head's Issues

When running MultiHeadAttention I get "AttributeError: module 'keras' has no attribute 'applications'"

When running the code I get the error: AttributeError: module 'keras' has no attribute 'applications'

Version Info

I am using Keras version 2.4.3

Minimal Codes To Reproduce

import keras
from keras.layers import Input
from keras_multi_head import MultiHeadAttention

print(keras.__version__)

input_layer = Input(
    shape=(2, 3),
    name='Input',
)
att_layer = MultiHeadAttention(
    head_num=3,
    name='Multi-Head',
)(input_layer)
model = keras.models.Model(inputs=input_layer, outputs=att_layer)
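One possible cause (an assumption, not confirmed by the report): Keras 2.4.x is a thin wrapper around tf.keras, and mixing imports from the standalone keras package with a library that resolves to tensorflow.keras can produce attribute errors like this one. If I remember the keras-multi-head README correctly, setting the TF_KERAS environment variable makes the library use tensorflow.keras, so importing everything consistently from TensorFlow may avoid the mismatch. A sketch:

```python
# Untested sketch: import layers consistently from tensorflow.keras and ask
# keras-multi-head to do the same via TF_KERAS. The variable must be set
# before keras_multi_head is imported.
import os
os.environ['TF_KERAS'] = '1'

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from keras_multi_head import MultiHeadAttention

input_layer = Input(shape=(2, 3), name='Input')
att_layer = MultiHeadAttention(head_num=3, name='Multi-Head')(input_layer)
model = Model(inputs=input_layer, outputs=att_layer)
```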

Fitting MultiHeadAttention in memory for long sequences

I am trying to train my own sequence-tagging model based on this repository's implementation of MultiHeadAttention.

import keras.layers as ll
from keras import Model
from keras_pos_embd import TrigPosEmbedding
from keras_multi_head import MultiHeadAttention

inputs = ll.Input(shape=(None,))
x = ll.Embedding(10000, 1024)(inputs)
x = TrigPosEmbedding(mode='add')(x)
x = MultiHeadAttention(head_num=8)(x)
x = ll.Dense(units=512, activation='relu')(x)
x = ll.Dense(units=4, activation='softmax')(x)
outputs = x
model = Model(inputs, outputs)
model.summary()

I have one big problem: the sequences in my training set are quite long (up to 20,000 time steps), and when I attempt to train, I get an OOM error.

The OOM happens when TensorFlow tries to allocate a [16, 20000, 20000] tensor. Storing that tensor alone in float32 takes roughly 25 GB of RAM, and the real footprint is several times that once gradients and temporary copies are counted.
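A back-of-the-envelope check of that allocation (the shape comes from the error above; 4 bytes per float32 element):

```python
# Rough memory estimate for the attention score tensor that triggers the OOM.
# Attention scores have shape (batch * heads, seq_len, seq_len), so memory
# grows quadratically with sequence length.
batch_heads, seq_len = 16, 20000
n_elements = batch_heads * seq_len * seq_len   # 6,400,000,000 elements
bytes_fp32 = n_elements * 4                    # 25,600,000,000 bytes

print(bytes_fp32 / 1e9)   # 25.6 -- GB for a single float32 copy
```

A single copy is already ~25.6 GB, and training keeps several copies alive (activations, gradients, optimizer temporaries), so full self-attention over 20,000-step sequences will not fit on a typical GPU regardless of batch size.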

I was wondering if you have any suggestions on how to modify the code to process attention in chunks, keeping only a context window of a user-specified length in memory at a time.

I tried going to a lower level with keras_self_attention.SeqSelfAttention and its configurable attention width, but in the end it would still try to allocate a very large tensor on my GPU.

PS: Awesome repo!

How to load a keras_multi_head model?

I have trained a model using MultiHead layer.
When I try to load it, it raises ValueError: Unknown layer: MultiHead.
I guess I have to pass custom_objects, but I am not sure what it should contain.
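A sketch of what custom_objects usually looks like for this case, assuming the model was saved with model.save() (the path 'my_model.h5' is hypothetical): load_model needs a mapping from the saved layer's class name to the Python class that implements it.

```python
# Untested sketch: map the custom layer's class name to its class so the
# deserializer can reconstruct it. Include whichever of the two layers the
# model actually uses.
from keras.models import load_model
from keras_multi_head import MultiHead, MultiHeadAttention

model = load_model(
    'my_model.h5',
    custom_objects={
        'MultiHead': MultiHead,
        'MultiHeadAttention': MultiHeadAttention,
    },
)
```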

feature_dim in multi_head_attention

I wonder if 'feature_dim' could be assigned manually. In your code, 'feature_dim' is fixed by the input shape, so the shapes of 'Wq', 'Wk', and 'Wv' are fixed as well.

def build(self, input_shape):
    if isinstance(input_shape, list):
        q, k, v = input_shape
    else:
        q = k = v = input_shape
    feature_dim = int(v[-1])
    self.Wq = self.add_weight(shape=(int(q[-1]), feature_dim), ...)
    self.Wk = self.add_weight(shape=(int(k[-1]), feature_dim), ...)
    self.Wv = self.add_weight(shape=(int(v[-1]), feature_dim), ...)
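Since feature_dim is inferred from the input's last axis, one workaround (a sketch, not a library feature; the layer sizes here are illustrative) is to project the input to the desired dimension before the attention layer:

```python
# Untested sketch: projecting the input with a Dense layer first effectively
# lets you choose feature_dim, since the attention layer reads it from the
# last axis of its input. desired_dim must be divisible by head_num.
import keras.layers as ll
from keras_multi_head import MultiHeadAttention

desired_dim = 64
inputs = ll.Input(shape=(None, 300))
projected = ll.Dense(units=desired_dim)(inputs)   # last axis becomes 64
attended = MultiHeadAttention(head_num=8)(projected)
```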

Multi-head Attention with 2 Input Layers

Currently, if I use two layers as input to the Multi-Head Attention layer like so:
csr_atten = MultiHeadAttention(head_num=2)([csr_doc_layer, csr_intents_layer])
it throws the following error:

ValueError: not enough values to unpack (expected 3, got 2)

Is there a workaround for using two input layers? Or is an alternative under development?
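The error comes from the layer's build method, which unpacks a list input into exactly three tensors (query, key, value). So a list of three should be accepted; for example, repeating the second layer as both key and value, as in MultiHeadAttention(head_num=2)([csr_doc_layer, csr_intents_layer, csr_intents_layer]). A minimal sketch of the unpacking logic that produces the error:

```python
# Mirrors the unpacking in MultiHeadAttention.build: a list input must
# contain exactly three tensors (query, key, value).
def unpack_qkv(inputs):
    if isinstance(inputs, list):
        q, k, v = inputs  # raises ValueError for a 2-element list
    else:
        q = k = v = inputs  # self-attention: one tensor plays all roles
    return q, k, v

# Two inputs fail, three succeed:
try:
    unpack_qkv(['doc', 'intents'])
except ValueError as e:
    print(e)  # not enough values to unpack (expected 3, got 2)

q, k, v = unpack_qkv(['doc', 'intents', 'intents'])  # doc attends to intents
```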

Example in simple time-series

Hello! This layer is the heart of the Transformer, GPT, and BERT architectures. I have been trying to see how to apply it directly to time-series problems (not NLP problems): predicting the next value in a sequence and/or classifying a sequence of values.

It would be nice if you could provide a simple example of how to apply this block in a multivariate time-series scenario (raw values in a sequence, no embeddings, etc.), if possible.
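In the meantime, here is a minimal, untested sketch of what such a model could look like: raw float features go straight into the attention layer with no embedding, and a final Dense head predicts the next value of each variable. All sizes (n_features, layer widths) are illustrative.

```python
# Untested sketch for multivariate next-step prediction. Input shape is
# (batch, time, features); head_num must divide the feature dimension of
# the tensor entering the attention layer.
import keras.layers as ll
from keras import Model
from keras_multi_head import MultiHeadAttention

n_features = 5                                # variables per time step
inputs = ll.Input(shape=(None, n_features))   # raw values, no embedding
x = MultiHeadAttention(head_num=1)(inputs)    # 1 divides n_features
x = ll.Dense(units=32, activation='relu')(x)
outputs = ll.Dense(units=n_features)(x)       # next value of each variable
model = Model(inputs, outputs)
```

For sequence classification instead of next-step prediction, a pooling layer (e.g. GlobalAveragePooling1D) before a softmax Dense head would be the usual variation.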

Thanks in advance!
