
transformer's Introduction

[UPDATED] A TensorFlow Implementation of Attention Is All You Need

When I opened this repository in 2017, there was no official code yet. I tried to implement the paper as I understood it, and, unsurprisingly, my code had several bugs. I became aware of most of them thanks to the people who filed issues here, so I'm very grateful to all of them. Though there is now an official implementation as well as several other unofficial GitHub repos, I decided to update my own. This update focuses on:

  • readable / understandable code
  • modularization (but not too much)
  • fixing known bugs (masking, positional encoding, ...)
  • updating to TF1.12 (tf.data, ...)
  • adding some missing components (BPE, shared weight matrix, ...)
  • including useful comments in the code

I still stick to IWSLT 2016 de-en. I guess if you'd like to test on a big dataset such as WMT, you would rely on the official implementation. After all, it's pleasant to be able to check quickly whether your model works. The initial code for TF1.2 has been moved to the tf1.2_legacy folder for the record.

Requirements

  • python==3.x (Let's move on to python 3 if you still use python 2)
  • tensorflow==1.12.0
  • numpy>=1.15.4
  • sentencepiece==0.1.8
  • tqdm>=4.28.1

Training

  • STEP 1. Run the command below to download the IWSLT 2016 de-en parallel corpus.
bash download.sh

It should be extracted to the iwslt2016/de-en folder automatically.

  • STEP 2. Run the command below to create preprocessed train/eval/test data.
python prepro.py

If you want to change the vocabulary size (default: 32000), do this:

python prepro.py --vocab_size 8000

It should create two folders iwslt2016/prepro and iwslt2016/segmented.

  • STEP 3. Run the following command.
python train.py

Check hparams.py to see which parameters are possible. For example,

python train.py --logdir myLog --batch_size 256 --dropout_rate 0.5

  • STEP 3 (alternative). Or download the pretrained models.
wget https://dl.dropbox.com/s/4lom1czy5xfzr4q/log.zip; unzip log.zip; rm log.zip

Training plots (omitted here): training loss curve, learning rate schedule, and BLEU score on the dev set.

Inference (=test)

  • Run
python test.py --ckpt log/1/iwslt2016_E19L2.64-29146 (OR yourCkptFile OR yourCkptFileDirectory)

Results

  • Typically, machine translation is evaluated with the BLEU score.
  • All evaluation results are available in eval/1 and test/1.

tst2013 (dev): 28.06
tst2014 (test): 23.88
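
One way to compute a BLEU score like the numbers above (not necessarily the exact script used here) is NLTK's corpus_bleu, which one of the issues further below also imports; a toy, made-up example:

    from nltk.translate.bleu_score import corpus_bleu

    # Toy, made-up data: a list of reference lists (one per hypothesis) and hypotheses.
    references = [[["wir", "erzaehlen", "ihnen", "einige", "geschichten"]]]
    hypotheses = [["wir", "erzaehlen", "ihnen", "einige", "geschichten", "heute"]]

    # corpus_bleu expects tokenized sentences and returns a score in [0, 1];
    # it is conventionally reported multiplied by 100.
    print(100 * corpus_bleu(references, hypotheses))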

Notes

  • Beam decoding will be added soon.
  • I'm going to update the code when TF2.0 comes out if possible.

transformer's People

Contributors

andy-yangz, eternalfeather, kimdwkimdw, kyubyong, maximedb, xu-song, yinnxinn


transformer's Issues

hello

I have a small problem: I think you do a little less, and twice, in each sublayer of the encoder and decoder, and I do not know whether my understanding is right or wrong.

batch() got an unexpected keyword argument

Hi, I ran the code directly and it fails as shown below:
File "C:\Users\hp\Desktop\transformer-master0\transformer-master\data_load.py", line 104, in get_batch_data
allow_smaller_final_batch=False)
TypeError: batch() got an unexpected keyword argument 'min_after_dequeue'

All my packages meet the requirements and I can't figure out why. Thank you.

Training process killed

I tried to train the Transformer model on my own parallel corpus (about 250MB).

But after the graph is constructed, the process is killed before the session starts.

Graph loaded
WARNING:tensorflow:From train.py:171: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-11-27 12:32:22.021904: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-27 12:32:22.279206: I tensorflow/compiler/xla/service/service.cc:149] XLA service 0x5607d324dc90 executing computations on platform CUDA. Devices:
2018-11-27 12:32:22.279319: I tensorflow/compiler/xla/service/service.cc:157]   StreamExecutor device (0): Tesla P100-PCIE-12GB, Compute Capability 6.0
2018-11-27 12:32:22.286826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:04:00.0
totalMemory: 11.91GiB freeMemory: 10.98GiB
2018-11-27 12:32:22.286958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2018-11-27 12:32:22.288905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 12:32:22.288978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2018-11-27 12:32:22.289007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2018-11-27 12:32:22.289527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10682 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0)
Killed

Any ideas?

Saver error occurs

Caused by op u'save/Assign_136', defined at:
File "eval.py", line 81, in
eval()
File "eval.py", line 35, in eval
sv = tf.train.Supervisor()
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 300, in init
self._init_saver(saver=saver)
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 446, in _init_saver
saver = saver_mod.Saver()
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1056, in init
self.build()
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1086, in build
restore_sequentially=self._restore_sequentially)
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
restore_sequentially, reshape)
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 419, in _AddRestoreOps
assign_ops.append(saveable.restore(tensors, shapes))
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 155, in restore
self.op.get_shape().is_fully_defined())
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 270, in assign
validate_shape=validate_shape)
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
use_locking=use_locking, name=name)
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/yinxiaoyi/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in init
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [9786,512] rhs shape= [9796,512]
[[Node: save/Assign_136 = Assign[T=DT_FLOAT, _class=["loc:@encoder/enc_embed/lookup_table"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](encoder/enc_embed/lookup_table, save/RestoreV2_136/_901)]]

Question on maskings

Hi @Kyubyong,

Can you help explain a bit the following masking code (the Key Masking and Query Masking) in modules.py? Why do we need them? We only need the causality mask, right?

# Key Masking
        key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1))) # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1]) # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1]) # (h*N, T_q, T_k)
        
        paddings = tf.ones_like(outputs)*(-2**32+1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs) # (h*N, T_q, T_k)
# Query Masking
        query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis=-1))) # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1]) # (h*N, T_q)
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]]) # (h*N, T_q, T_k)
        outputs *= query_masks # broadcasting. (N, T_q, C)

Thanks!
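
For what it's worth, my own reading (not the author's wording): the key mask pushes attention logits at padded key positions to a large negative number so the softmax gives them near-zero weight, and the query mask zeroes the outputs at padded query positions; the causality mask only hides future positions and does nothing about padding. A minimal NumPy sketch of the key-masking idea, with made-up shapes:

    import numpy as np

    # Toy scores for one head and one sequence: 2 query positions x 4 key
    # positions, where the last two key positions are padding.
    scores = np.random.randn(2, 4)
    key_is_pad = np.array([False, False, True, True])

    # Replace scores at padded keys with a large negative number so that
    # softmax assigns them (almost) zero probability.
    masked = np.where(key_is_pad, -1e9, scores)
    weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
    print(weights)  # the last two columns are ~0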

why normalization variables are trainable

In the function normalize (in modules.py), beta and gamma are created as variables. I don't know why they should be trainable. Couldn't I just use 0. and 1.?

def normalize(inputs,
              epsilon=1e-8,
              scope="ln",
              reuse=None):
    """Applies layer normalization.

    Args:
      inputs: A tensor with 2 or more dimensions, where the first dimension has
        `batch_size`.
      epsilon: A floating number. A very small number for preventing ZeroDivision Error.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A tensor with the same shape and data dtype as `inputs`.
    """
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]

        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta = tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ((variance + epsilon) ** (.5))
        outputs = gamma * normalized + beta

    return outputs
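
Not the author's answer, just context: a trainable scale and shift are standard in layer normalization (they let the network undo the forced zero-mean/unit-variance where that helps). For comparison, TF 1.x's built-in layer norm also creates trainable beta and gamma by default:

    import tensorflow as tf

    x = tf.random_normal([8, 10, 512])
    # center=True / scale=True create trainable beta and gamma, mirroring the
    # hand-rolled normalize() above; setting them to False would fix beta=0, gamma=1.
    y = tf.contrib.layers.layer_norm(x, center=True, scale=True, begin_norm_axis=-1)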

eval.py: the number of items differs between the test and result files

File eval.py line 48

for i in range(len(X) // hp.batch_size):

I am modifying some of your code to fit my data.
I don't understand what this line means, but I noticed that the number of items in the result file is not equal to the number of items in the test file. The same holds for your result and test data.
Can you help me understand this?
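
Not an official answer, but the likely cause: len(X) // hp.batch_size floors the division, so the last partial batch is silently dropped and the output file ends up shorter than the test file. A sketch (not the repo's code) of counting batches with ceiling division instead:

    import math

    def num_batches(num_examples, batch_size):
        # Ceiling division keeps the final partial batch instead of dropping it.
        return math.ceil(num_examples / batch_size)

    # Example: 1003 test sentences, batch size 32.
    print(1003 // 32)             # 31 batches -> 992 outputs, 11 sentences lost
    print(num_batches(1003, 32))  # 32 batches -> all 1003 sentences covered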

This model cannot handle extremely large dataset

Just to point out that using
tf.convert_to_tensor -> tf.train.slice_input_producer -> tf.train.shuffle_batch
will raise the following error if the dataset is too large:

ValueError: Cannot create a tensor proto whose content is larger than 2GB.
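
A common workaround, sketched here under the assumption that the corpus is stored as plain-text files with one sentence per line: stream the data with tf.data instead of embedding it in the graph as a constant (the TF1.12 update of this repo also moves to tf.data).

    import tensorflow as tf

    # Hypothetical file names; each line is one sentence.
    src_file, tgt_file = "train.de", "train.en"

    src = tf.data.TextLineDataset(src_file)
    tgt = tf.data.TextLineDataset(tgt_file)

    # Zip source/target lines, shuffle with a bounded buffer, and batch.
    # Nothing is baked into the graph as a giant constant, so the
    # 2GB tensor-proto limit no longer applies.
    dataset = (tf.data.Dataset.zip((src, tgt))
               .shuffle(buffer_size=100000)
               .repeat()
               .batch(128)
               .prefetch(1))
    src_batch, tgt_batch = dataset.make_one_shot_iterator().get_next()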

Shape mismatch error in eval

I just downloaded the corpora and the trained model, and then ran the eval.py script. I'm getting the following error:

$ python eval.py 
Graph loaded
2017-07-31 22:29:04.547163: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-31 22:29:04.547205: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-31 22:29:04.547210: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-31 22:29:04.547215: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
WARNING:tensorflow:Standard services need a 'logdir' passed to the SessionManager
Traceback (most recent call last):
  File "eval.py", line 81, in <module>
    eval()
  File "eval.py", line 38, in eval
    sv.saver.restore(sess, tf.train.latest_checkpoint(hp.logdir))
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1548, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [9786,512] rhs shape= [9796,512]
	 [[Node: save/Assign_136 = Assign[T=DT_FLOAT, _class=["loc:@encoder/enc_embed/lookup_table"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](encoder/enc_embed/lookup_table, save/RestoreV2_136)]]

Caused by op u'save/Assign_136', defined at:
  File "eval.py", line 81, in <module>
    eval()
  File "eval.py", line 35, in eval
    sv = tf.train.Supervisor()
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 300, in __init__
    self._init_saver(saver=saver)
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 448, in _init_saver
    saver = saver_mod.Saver()
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 419, in _AddRestoreOps
    assign_ops.append(saveable.restore(tensors, shapes))
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 155, in restore
    self.op.get_shape().is_fully_defined())
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 271, in assign
    validate_shape=validate_shape)
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 45, in assign
    use_locking=use_locking, name=name)
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/Users/erick/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [9786,512] rhs shape= [9796,512]
	 [[Node: save/Assign_136 = Assign[T=DT_FLOAT, _class=["loc:@encoder/enc_embed/lookup_table"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](encoder/enc_embed/lookup_table, save/RestoreV2_136)]]

Any idea what might be wrong?

why "split" to get multi-head?

As the paper describes, and as in some other implementations:
self.w_qs = nn.Linear(d_model, n_head * d_k)
the projected size is larger. But in this project, it is

       Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, C/h) 
       K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 
       V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, C/h)

It looks like each head is formed from only part of Q/K/V.
Can anyone help explain why "split" and "concat" are used to get the multiple heads?

Thanks!
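
Not the author's explanation, but the way I read it: the dense layer already produces num_units = h * d_k features, so splitting along the channel axis and stacking along the batch axis is just a reshaping trick in which each head attends over its own C/h slice; that matches the paper's d_model = h * d_k. A quick NumPy check of the equivalence with a reshape/transpose formulation:

    import numpy as np

    N, T, C, h = 2, 5, 8, 4          # batch, time, model dim, number of heads
    Q = np.random.randn(N, T, C)

    # Repo-style: split along the channel axis, concatenate along the batch axis.
    split_concat = np.concatenate(np.split(Q, h, axis=2), axis=0)        # (h*N, T, C/h)

    # Equivalent reshape/transpose formulation.
    reshaped = Q.reshape(N, T, h, C // h).transpose(2, 0, 1, 3).reshape(h * N, T, C // h)

    print(np.allclose(split_concat, reshaped))  # True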

tqdm dependency

Not really a big issue: I believe the tqdm module also belongs in the requirements.txt file?

PS: I really like your implementation! It was quite "painless" to get this working, compared to many other seq2seq repos out there.

👍 🥇

Are the projection layers among multiple blocks shared?

Hi, I have a question about the codes.

        # Linear projections
        Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu) # (N, T_q, C)
        K = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
        V = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)

Is there a mechanism that ties these three layers across the multiple blocks? It seems their parameters are not shared between different blocks. What should I do to tie them?

Thanks!
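
My reading of the code (not confirmed by the author): each block builds its dense layers under its own variable scope, so the projections are not shared. If you did want to tie them across blocks, one way in TF 1.x is to reuse a single variable scope, for example (a sketch with a made-up scope name):

    import tensorflow as tf

    def shared_projections(queries, keys, num_units):
        # AUTO_REUSE creates the kernels on the first call and reuses the same
        # variables on every later call, tying the layers across blocks.
        with tf.variable_scope("shared_qkv", reuse=tf.AUTO_REUSE):
            Q = tf.layers.dense(queries, num_units, name="q")  # (N, T_q, C)
            K = tf.layers.dense(keys, num_units, name="k")     # (N, T_k, C)
            V = tf.layers.dense(keys, num_units, name="v")     # (N, T_k, C)
        return Q, K, V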

error running code

I followed all the steps but get the following error.

File "/home/ashishkr/projects/python/transformer/modules.py", line 227, in multihead_attention
tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense() # (T_q, T_k)
AttributeError: 'module' object has no attribute 'LinearOperatorTriL'

TF versions 1.5.0 and 1.7.0.
It would be great if you could point out a solution.

Now my nltk, numpy, regex and tensorflow all meet the requirements, and I don't know why this happens. I changed tf.contrib.linalg.LinearOperatorTriL to tf.linalg.LinearOperatorLowerTriangular and the project runs, but I don't know whether that is right or not.
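
For what it's worth, that swap looks right: tf.contrib.linalg.LinearOperatorTriL was removed in later TF 1.x releases, and tf.linalg.LinearOperatorLowerTriangular is its successor, so the behaviour should be the same. A sketch of the causal mask with the newer name (the outputs tensor here is a made-up stand-in for the attention logits in modules.py):

    import tensorflow as tf

    # Stand-in attention logits of shape (h*N, T_q, T_k).
    outputs = tf.random_normal([8, 5, 5])

    diag_vals = tf.ones_like(outputs[0, :, :])                               # (T_q, T_k)
    # Lower-triangular matrix of ones: query position q may only attend to keys <= q.
    tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()
    masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])   # (h*N, T_q, T_k)

    paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
    outputs = tf.where(tf.equal(masks, 0), paddings, outputs)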

train.py error

python train.py


Traceback (most recent call last):
File "train.py", line 147, in
g = Graph("train"); print("Graph loaded")
File "train.py", line 28, in init
self.decoder_inputs = tf.concat((tf.ones_like(self.y[:, :1])*2, self.y[:, :-1]), -1) # 2:
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 867, in concat
dtype=dtypes.int32).get_shape(
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 657, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 743, in _autopacking_conversion_function
return _autopacking_helper(v, inferred_dtype, name or "packed")
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 706, in _autopacking_helper
return gen_array_ops._pack(elems_as_tensors, name=scope)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1680, in _pack
result = _op_def_lib.apply_op("Pack", values=values, axis=axis, name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2382, in create_op
set_shapes_for_outputs(ret)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1783, in set_shapes_for_outputs
shapes = shape_func(op)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 596, in call_cpp_shape_fn
raise ValueError(err.message)
ValueError: Dimension 1 in both shapes must be equal, but are 1 and 9
From merging shape 0 with other shapes.


How do I fix this error?

Wrong Batch Normalization

In the function normalize():

    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]
        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        print ('mean.get_shape()',mean.get_shape())
        beta= tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ( (variance + epsilon) ** (.5) )
        outputs = gamma * normalized + beta

but I think the second parameter of tf.nn.moments() should not be [-1], since we need to take the batch dimension into account.
After modification, the code is shown below:


 with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]
        axis = list(range(len(inputs_shape) - 1))
        mean, variance = tf.nn.moments(inputs, axis, keep_dims=True)
        print ('mean.get_shape()',mean.get_shape())
        beta= tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ( (variance + epsilon) ** (.5) )
        outputs = gamma * normalized + beta


Training Problem

When I trained this model, something went wrong with a tensor:

'Tensor' object has no attribute 'to_proto'
WARNING:tensorflow:Error encountered when serializing global_step.
Type is unsupported, or the types of the items don't match field type in CollectionDef.

How can I solve this?

beam search implementation

I really love the results of training on my custom dataset. It's simple and doesn't consume a lot of GPU memory, as no extra GPU is needed for the dev and validation datasets.

Read the train sequences

How do you deal with training sequences of more than 10 words? In your code, you seem to throw them away. I mean, if I set the sentence length to N and a training sentence has N+3 words, do the extra 3 words have to be thrown away?

A problem about decoder input

At training time, the input of the decoder is the right-shifted gold output embeddings. But at inference time, the input of the decoder is zero embeddings. Is that right?

Wrong positional encoding

position_enc = np.array([
      [pos / np.power(10000, 2.*i/num_units) for i in range(num_units)]
      for pos in range(T)])

In the paper, Section 3.5, the angle (before the sine or cosine is applied) at the even channel 2*i is the same as at the corresponding odd channel 2*i+1, for channels in [0..num_units).
The correct code should be

position_enc = np.array([
      [pos / np.power(10000, (i-i%2)/num_units) for i in range(num_units)]
      for pos in range(T)])
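
For comparison, a full sinusoidal table along the lines of the paper (a NumPy sketch; the function name is mine, not the repo's): even channels take the sine and odd channels the cosine of the same rate, which is exactly what the (i - i%2) exponent above produces.

    import numpy as np

    def sinusoidal_position_encoding(max_len, num_units):
        # Same exponent for the (sin, cos) pair at channels 2i and 2i+1.
        pos = np.arange(max_len)[:, None]                  # (T, 1)
        i = np.arange(num_units)[None, :]                  # (1, C)
        angle = pos / np.power(10000, (i - i % 2) / num_units)
        enc = np.zeros((max_len, num_units))
        enc[:, 0::2] = np.sin(angle[:, 0::2])              # even channels
        enc[:, 1::2] = np.cos(angle[:, 1::2])              # odd channels
        return enc

    print(sinusoidal_position_encoding(4, 8).shape)        # (4, 8)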

Why can't I run this file

The error says there is no tf.contrib.linalg.LinearOperatorTriL function. I'm now using TF 1.8, where it seems to have been removed.

Understanding load_train_data.

Hey,

I'm new to NMT. This is not an issue, just a newbie question to understand the function load_train_data().
I wanted to understand this preprocessing step:

de_sents = [re.sub("[^\s\p{Latin}']", "", line) for line in codecs.open(hp.source_train, 'r', 'utf-8').read().split("\n") if line and line[0] != "<"]

printing de_sents[1] gives me this "i n nn ini itn a in i n"

if I don't use the reg ex while creating de_sents,
de_sents = [line for line in codecs.open(hp.source_train, 'r', 'utf-8').read().split("\n") if line and line[0] != "<"]

de_sents[1] now becomes "Wir werden Ihnen einige Geschichten über das Meer in Videoform erzählen."

I wanted to know why we use the regex substitution step while creating de_sents or en_sents, since the resulting content "i n nn ini itn a in i n" doesn't make much sense.

Thanks,

Evaluation

If I understood correctly, at evaluation you run

preds = np.zeros((hp.batch_size, hp.maxlen), np.int32)
for j in range(hp.maxlen):
    _preds = sess.run(g.preds, {g.x: x, g.y: preds})
    preds[:, j] = _preds[:, j]

Does that mean that the encoding part runs at every timestep of the decoding process?

Thanks for the great work 👍

Embedding()

In the paper, the authors have specifically mentioned that they used learned embeddings to convert the input tokens and output tokens to vectors. Why did you learn these embeddings as opposed to using learned embeddings?

corpus_bleu module errors

the error occurs when running:
from nltk.translate.bleu_score import corpus_bleu

The log is shown below:

File "C:\Users\10649\AppData\Roaming\Python\Python36\site-packages\sklearn\datasets\mldata.py", line 12, in <module>
    from urllib2 import HTTPError
File "E:\python36\lib\site-packages\urllib2.py", line 220
    raise AttributeError, attr
                        ^
SyntaxError: invalid syntax

Is this because of the Python version or something else?

The way feed data in training

Hello. In your code, 'batch_size' is the number of sentences, but in the paper 'batch_size' means the number of tokens in a batch. Have you tried the approach described in the paper?
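
Not something this code does, but for illustration, batching by token count usually means packing sentences until a token budget is reached. A minimal sketch in plain Python (the sentences and the budget are made up):

    def batch_by_tokens(sentences, max_tokens=4096):
        """Group tokenized sentences so each batch holds at most max_tokens tokens."""
        batch, batch_tokens = [], 0
        for sent in sentences:
            if batch and batch_tokens + len(sent) > max_tokens:
                yield batch
                batch, batch_tokens = [], 0
            batch.append(sent)
            batch_tokens += len(sent)
        if batch:
            yield batch

    # Example with a tiny budget of 8 tokens per batch.
    sents = [["a"] * 3, ["b"] * 4, ["c"] * 2, ["d"] * 5]
    print([len(b) for b in batch_by_tokens(sents, max_tokens=8)])  # [2, 2]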

Add validation set

I can see that the model trains for a fixed number of epochs and there is no validation set. How do you know when to stop training, and how would I add a validation set? Thank you!

Error for positional encoding

I am trying to run with the sinusoid positional encoding, but it throws the following error.

File "train.py", line 51, in __init__ scope="enc_pe") File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 885, in binary_op_wrapper y = ops.convert_to_tensor(y, dtype=x.dtype.base_dtype, name="y") File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 836, in convert_to_tensor as_ref=False) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 926, in internal_convert_to_tensor ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 774, in _TensorTensorConversionFunction (dtype.name, t.dtype.name, str(t))) ValueError: Tensor conversion requested dtype float32 for Tensor with dtype float64: 'Tensor("encoder/enc_pe/embedding_lookup:0", shape=(32, 49, 512), dtype=float64)'

About Train Time

Hi, when I run train.py it seems to work, but the log stays at
0%| | 0/1703 [00:00<?, ?b/s] and does not change.

I think the training process is not going normally...
I don't know where it went wrong, and I want to know how long one training step takes.
I used the given dataset.
But my environment is TensorFlow 1.8.

Can anyone who runs the training code normally tell me?

PS: the GPU is a 1070.

does the key masking work?

Hi @Kyubyong
as you can see in the following key-masking code:

# Key Masking
key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1))) # (N, T_k)
key_masks = tf.tile(key_masks, [num_heads, 1]) # (h*N, T_k)
key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1]) # (h*N, T_q, T_k)

The keys parameter is the sum of the word embedding and the position embedding. That means even if a word in a sentence is padding (id 0), adding the position embedding makes the final embedding a non-zero vector. Therefore key_masks would be all ones, with no zeros, so I'm confused about whether this code works.
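
I think the concern is valid for code that derives the mask from the embedded keys after the positional encoding has been added, since those sums are no longer zero at padded positions. A common fix is to build the padding mask from the raw token ids instead; a sketch with made-up inputs:

    import tensorflow as tf

    # Hypothetical integer inputs, where id 0 is <pad>.
    x = tf.constant([[5, 7, 9, 0, 0],
                     [3, 2, 0, 0, 0]])

    # Padding mask computed from the token ids, before any embedding or
    # positional encoding is added: 1.0 for real tokens, 0.0 for padding.
    key_masks = tf.to_float(tf.not_equal(x, 0))   # (N, T_k)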

Linear transform with bias at multi-head attention

In the paper, Attention Is All You Need, the query, key and value are linearly transformed without a bias in the multi-head attention.
However, the variables in your code are transformed with a bias. Is there any reason for using a bias? Or is there something I don't know...?

Thanks.

transformer/modules.py

Lines 201 to 203 in 6672f93

Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu) # (N, T_q, C)
K = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
V = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
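
For reference, projections matching the paper (purely linear, no bias, no activation) would look like the following in TF 1.x; whether the extra bias and ReLU in these lines were intentional is something only the author can say.

    import tensorflow as tf

    queries = tf.random_normal([2, 5, 512])   # hypothetical (N, T_q, C)
    keys = tf.random_normal([2, 7, 512])      # hypothetical (N, T_k, C)
    num_units = 512

    # Plain linear projections as in "Attention Is All You Need":
    # no bias term and no activation function.
    Q = tf.layers.dense(queries, num_units, use_bias=False)   # (N, T_q, C)
    K = tf.layers.dense(keys, num_units, use_bias=False)      # (N, T_k, C)
    V = tf.layers.dense(keys, num_units, use_bias=False)      # (N, T_k, C)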

possible error in positional encoding computation

Hi, I was just looking through the positional encoding code, and I see this line:

rad_block = tf.pow(tf.div(position_block, tf.multiply(10000, 1)), tf.div(unit_block, num_units // 2))

It looks wrong to me. Shouldn't it be something like the following?

 rad_block = tf.div(position_block, tf.pow(10000, tf.div(unit_block, num_units // 2))) 
