cyberzhg / keras-multi-head
A wrapper layer for stacking layers horizontally
Home Page: https://pypi.org/project/keras-multi-head/
License: MIT License
I am trying to train my own sequence tagging model based on this repository's implementation of MultiHeadAttention:
import keras.layers as ll
from keras import Model
from keras_pos_embd import TrigPosEmbedding
from keras_multi_head import MultiHeadAttention

inputs = ll.Input(shape=(None,))                 # variable-length token id sequence
x = ll.Embedding(10000, 1024)(inputs)            # token embeddings
x = TrigPosEmbedding(mode='add')(x)              # add sinusoidal positional encoding
x = MultiHeadAttention(head_num=8)(x)            # self-attention over the whole sequence
x = ll.Dense(units=512, activation='relu')(x)
x = ll.Dense(units=4, activation='softmax')(x)   # per-token probabilities over 4 tags
outputs = x
model = Model(inputs, outputs)
model.summary()
I have one big problem. The sequences in my training set are quite long (length upper-bounded by 20,000), and when I attempt to train I get an OOM error.
The OOM happens when trying to allocate a [16, 20000, 20000] tensor. If my calculations are correct, just storing this tensor would already take tens of gigabytes of memory!
I was wondering if you have any suggestions on how to modify the code to make it work in a more serialized way, only loading into memory a context of the length specified by a custom parameter.
I tried going to a lower level with keras_self_attention.SeqSelfAttention and its configurable attention width, but in the end it would still try to allocate a very large tensor on my GPU.
PS: Awesome repo!
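For context, the [heads, seq_len, seq_len] attention score tensor is what makes full self-attention quadratic in sequence length, which is why 20,000-step sequences exhaust memory. A minimal sketch of one common workaround is shown below: splitting each long tagged sequence into fixed-length windows before training. The helper and its window/stride parameters are hypothetical preprocessing code, not options of this library.

def chunk_sequence(token_ids, tag_ids, window=512, stride=512):
    # Split one long tagged sequence into fixed-length windows so that the
    # per-sample attention score tensor is at most (heads, window, window)
    # instead of (heads, 20000, 20000).
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append((token_ids[start:start + window],
                       tag_ids[start:start + window]))
    return chunks

Each window then becomes an independent training example, at the cost of losing attention across window boundaries.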
Is it possible to use cross-attention with multi-head attention? I could not find it in the code.
Or do I need to write a separate multi-head self-attention class that calls this multi-head attention layer? I am still learning and don't fully understand the code; I would really appreciate an explanation.
Currently, if I use two layers as input to the Multi-Head Attention layer like so:
csr_atten = MultiHeadAttention(head_num=2)([csr_doc_layer, csr_intents_layer])
it throws the following error:
ValueError: not enough values to unpack (expected 3, got 2)
Is there a workaround for using two input layers? Or is an alternative under development?
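Judging by the error message and the build() method quoted further down (which unpacks three shapes as q, k, v), the layer appears to accept either a single tensor or a list of three tensors. A minimal sketch of cross-attention under that assumption, reusing the context tensor as both key and value (the input shapes are illustrative):

import keras.layers as ll
from keras_multi_head import MultiHeadAttention

query = ll.Input(shape=(None, 64))     # e.g. document representation
context = ll.Input(shape=(None, 64))   # e.g. intent representation
# pass [query, key, value]; using the same tensor for key and value is an
# assumption here, not something documented by the repository
csr_atten = MultiHeadAttention(head_num=2)([query, context, context])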
I have trained a model using the MultiHead layer.
When I tried to load it, it raised ValueError: Unknown layer: MultiHead.
I guess I have to pass custom_objects, but I am not sure what it should contain.
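A minimal sketch of the usual Keras fix, mapping the layer name to its class through custom_objects (the file name here is hypothetical):

import keras
from keras_multi_head import MultiHead

# tell the deserializer which class the name 'MultiHead' refers to
model = keras.models.load_model('tagger.h5',
                                custom_objects={'MultiHead': MultiHead})

If the wrapped layer or an attention layer from keras_self_attention is also custom, it would need to be added to the same dictionary.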
When running the code I get the error: AttributeError: module 'keras' has no attribute 'applications'
Version Info
I am using Keras version 2.4.3
Minimal Code To Reproduce
import keras
from keras.layers import Input
from keras_multi_head import MultiHeadAttention

print(keras.__version__)

input_layer = Input(
    shape=(2, 3),
    name='Input',
)
att_layer = MultiHeadAttention(
    head_num=3,
    name='Multi-Head',
)(input_layer)
model = keras.models.Model(inputs=input_layer, outputs=att_layer)
I wonder if 'feature_dim' could be set manually by the user. In your code, given the input, 'feature_dim' is fixed, so the shapes of 'Wq', 'Wk', and 'Wv' are fixed.
def build(self, input_shape):
    if isinstance(input_shape, list):
        q, k, v = input_shape
    else:
        q = k = v = input_shape
    feature_dim = int(v[-1])

    self.Wq = self.add_weight(shape=(int(q[-1]), feature_dim), ...)
    self.Wk = self.add_weight(shape=(int(k[-1]), feature_dim), ...)
    self.Wv = self.add_weight(shape=(int(v[-1]), feature_dim), ...)
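If a user-chosen projection size is wanted, a hedged sketch of one possible change is to accept an optional feature_dim in the constructor and fall back to the input size when it is not given (the parameter name is hypothetical and not part of the library):

def __init__(self, head_num, feature_dim=None, **kwargs):
    super(MultiHeadAttention, self).__init__(**kwargs)
    self.head_num = head_num
    self.feature_dim = feature_dim   # optional user-chosen projection size

def build(self, input_shape):
    if isinstance(input_shape, list):
        q, k, v = input_shape
    else:
        q = k = v = input_shape
    # use the supplied size when given, otherwise keep the current behaviour
    feature_dim = self.feature_dim if self.feature_dim is not None else int(v[-1])
    ...

The chosen size would still need to be divisible by head_num, and compute_output_shape and the output projection would presumably need the same adjustment.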
Is your feature request related to a problem? Please describe.
Unable to load the model which includes 'MultiHeadAttention' layer.
Describe the solution you'd like
A similar static method, get_custom_objects(), should be implemented, as is done in SeqSelfAttention.
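A hedged sketch of what that helper might look like, mirroring the pattern used by SeqSelfAttention:

import keras

class MultiHeadAttention(keras.layers.Layer):
    ...

    @staticmethod
    def get_custom_objects():
        # mapping expected by keras.models.load_model(..., custom_objects=...)
        return {'MultiHeadAttention': MultiHeadAttention}

Loading would then be model = keras.models.load_model(path, custom_objects=MultiHeadAttention.get_custom_objects()).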
Hello, I got an error while using MultiHead; the code is as follows:
l_att2 = MultiHead(keras.layers.Dot((1, 1))([encoder_embed, attention_weightn]), layer_num=2, name='Multi-Dot')
How can I solve this problem?
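For reference, the wrapper's documented usage passes a layer instance (not the tensor produced by calling a layer) as the first argument. The sketch below follows the pattern from the project README and uses an LSTM purely for illustration; whether MultiHead supports multi-input layers such as Dot is not clear from the code quoted here.

import keras
from keras_multi_head import MultiHead

model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=100, output_dim=20, name='Embedding'))
# MultiHead duplicates the wrapped layer layer_num times and stacks the outputs
model.add(MultiHead(keras.layers.LSTM(units=32), layer_num=5, name='Multi-LSTMs'))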
Hello, this is the heart of the Transformer, GPT, and BERT architectures. I have been trying to see how to apply these architectures directly to time-series problems (not NLP problems): just predicting the next value in a sequence and/or classifying a sequence of values.
It would be nice if you could provide a simple example of how to apply this block in a simple multivariate time-series scenario (plain values in a sequence, no embeddings, etc.), if possible.
Thanks in advance!
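A minimal sketch of what such a model might look like, replacing the token embedding with a dense projection of the raw feature vectors (n_features, the projection width, and the regression head are illustrative choices, not recommendations from this repository):

import keras.layers as ll
from keras import Model
from keras_multi_head import MultiHeadAttention

n_features = 6                                   # hypothetical number of variables per time step
inputs = ll.Input(shape=(None, n_features))      # (batch, time, features), no embedding needed
x = ll.Dense(64)(inputs)                         # project raw features to a width divisible by head_num
x = MultiHeadAttention(head_num=8)(x)            # self-attention over the time axis
x = ll.GlobalAveragePooling1D()(x)               # pool over time for a sequence-level prediction
outputs = ll.Dense(1)(x)                         # next-value regression (swap for softmax to classify)
model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='mse')

A positional embedding such as TrigPosEmbedding(mode='add') could be inserted after the projection if ordering information matters.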
By default the activation function for the MultiHeadAttention layer is ReLU. This makes the projections non-linear, but the paper describes the projections as linear, as can be seen in figure 2 (right) of Attention Is All You Need. Shouldn't the default value for the activation be None, so that the projections are by default linear?
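If linear projections are needed today, they can presumably be requested explicitly (assuming the activation keyword matches the layer's constructor, which the ReLU default suggests):

att = MultiHeadAttention(head_num=8, activation=None)(x)   # linear Q/K/V projections, as in the paper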