
finetune-transformer-lm's Introduction

Status: Archive (code is provided as-is, no updates expected)

finetune-transformer-lm

Code and model for the paper "Improving Language Understanding by Generative Pre-Training"

Currently this code implements the ROCStories Cloze Test result reported in the paper by running: python train.py --dataset rocstories --desc rocstories --submit --analysis --data_dir [path to data here]

Note: The code is currently non-deterministic due to various GPU ops. The median accuracy of 10 runs with this codebase (using default hyperparameters) is 85.8% - slightly lower than the reported single run of 86.5% from the paper.

The ROCStories dataset can be downloaded from the associated website.

finetune-transformer-lm's People

Contributors

cberner, christopherhesse, newmu


finetune-transformer-lm's Issues

ResourceExhaustedError

I ran this model on a 1080 Ti with the command python train.py --dataset rocstories --desc rocstories --submit --analysis and got a ResourceExhaustedError.
Do I need to make changes to the code before running?
Thanks.
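A common workaround (not an official fix) is to shrink the memory footprint: train.py exposes its hyperparameters through argparse, so lowering the per-GPU batch size (the --n_batch argument, if your copy has it) and building the graph for a single GPU (--n_gpu 1) should reduce memory pressure on an 11 GB card. Treat the exact flag names as assumptions and check the argparse block in train.py.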

Question about "rf" parameter in conv1d() function

    if rf == 1:  # faster 1x1 conv
        c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf])) + b, shape_list(x)[:-1] + [nf])
    else:  # was used to train LM
        c = tf.nn.conv1d(x, w, stride=1, padding=pad) + b

When training the LM, what value does rf take, and why?

Error!! :(

Hi,
How can I solve this error?

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Can anybody help me?
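For reference, the tokenizer in this repo loads spaCy through the old 'en' shortcut, which spaCy only resolves after a model has been downloaded and linked (under spaCy 2.x, running python -m spacy download en does both). A hedged sketch of a defensive load, assuming en_core_web_sm is an acceptable substitute and that the disabled pipeline components mirror what the repo's text encoder needs:

import spacy

try:
    nlp = spacy.load('en', disable=['parser', 'tagger', 'ner', 'textcat'])
except OSError:
    # E050 is raised when the 'en' shortcut link does not exist; this fallback
    # assumes `python -m spacy download en_core_web_sm` has been run first.
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner', 'textcat'])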

Why do we need to apply mask while fine tuning?

In the attention class, you have the following code for masking. I understand the logic for pre-training, but in fine-tuning, if we don't include the language-model loss, shouldn't there be a check here to skip the mask? Do we always have to apply the masking because the model was trained that way? Is there an intuitive reason for this? I don't see an experimental necessity for it.
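The referenced snippet is not reproduced here, but for context, a minimal numpy sketch of the causal (lower-triangular) mask the attention code applies; shapes and the -1e9 constant are illustrative, though the idea matches the repo's TensorFlow implementation. One intuition, hedged, is that the mask keeps the fine-tuned model consistent with how the LM was pre-trained and is required whenever the auxiliary LM loss is kept:

import numpy as np

def causal_mask(n):
    # Lower-triangular [n, n] matrix: entry (i, j) is 1 iff j <= i,
    # i.e. position i may attend to itself and earlier positions only.
    return np.tril(np.ones((n, n), dtype=np.float32))

def masked_attention_logits(w):
    # w: raw attention logits of shape [..., n, n].
    b = causal_mask(w.shape[-1])
    # Keep allowed positions and push disallowed ones to a large negative
    # value so they contribute ~0 probability after the softmax.
    return w * b - 1e9 * (1.0 - b)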

Is the LM pre-trained with a _start_ symbol?

Hi,

I was wondering whether, during the pre-training of the LM alone, the sentences were prepended with a start symbol, just as they are during fine-tuning. If so, could you please mention the name of that token in the learned vocab? If not, wouldn't it introduce a bit of a mismatch with respect to the pre-trained LM? Of course, the model is being fine-tuned, so it will adapt, but then why would the start symbol be necessary for fine-tuning? Thanks!
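For reference, a hedged sketch of how the fine-tuning inputs are assembled. Token names follow train.py; the toy vocabulary and ids are illustrative. The special tokens are appended after the regular BPE vocabulary, and each ROCStories example is wrapped as start, story, delimiter, candidate ending, classify:

# encoder stands in for the BPE vocabulary dict loaded by the text encoder
# (assumption); the three special tokens are appended after the regular vocab.
encoder = {'hello</w>': 0, 'world</w>': 1}
encoder['_start_'] = len(encoder)
encoder['_delimiter_'] = len(encoder)
encoder['_classify_'] = len(encoder)

def build_input(story_ids, ending_ids, max_len):
    # [start] + story + [delimiter] + candidate ending + [classify]
    return ([encoder['_start_']] + story_ids[:max_len]
            + [encoder['_delimiter_']] + ending_ids[:max_len]
            + [encoder['_classify_']])

print(build_input([0, 1], [1, 0], max_len=64))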

How to deal with logits from position indices in the output layer?

Dear guys,

I found that the position embeddings are concatenated with the word embeddings in the embedding layer.

init_params[0] = np.concatenate([init_params[1], (np.random.randn(n_special, n_embd)*0.02).astype(np.float32), init_params[0]], 0)

and the output layer also shares weights with this embedding layer, so it outputs logits for both word indices and position indices.
lm_logits = tf.matmul(lm_h, we, transpose_b=True)

My questions are:

  1. During LM pre-training, did you mask out the logits from those position indices when computing the loss?
  2. If I use the pre-trained model as an LM to generate text, do I need to mask out the logits at these position indices before the softmax when sampling the next word?
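Whatever was done during pre-training, for question 2 a sampling-time workaround is straightforward: since the position rows occupy the end of the shared embedding matrix (ids n_vocab+n_special and up, matching how the position ids are generated in train.py), their logits can be suppressed before the softmax. A minimal numpy sketch, with illustrative shapes:

import numpy as np

def mask_position_logits(lm_logits, n_vocab, n_special):
    # lm_logits: [..., n_vocab + n_special + n_ctx], because the output layer
    # reuses the combined word/special/position embedding matrix `we`.
    # Only the first n_vocab + n_special columns are real tokens, so the
    # trailing position columns are pushed to a large negative value.
    masked = lm_logits.copy()
    masked[..., n_vocab + n_special:] = -1e9
    return masked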

BTW, I used the pytorch code ported by huggingface:
https://github.com/huggingface/pytorch-openai-transformer-lm
FYI, I also posted an issue there describing some details of my experiments:
huggingface/pytorch-openai-transformer-lm#36

  • Da Xiao

Using it as a Language model

I was trying to use it as a language model to assign a score (e.g., a perplexity score) to a given sentence. Something like:
P("He is go to school")=0.008
P("He is going to school")=0.08
This would indicate that the probability of the second sentence is higher than that of the first. Is there a way to get a score like this?

Thanks
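There is no scoring utility in this repo, but a sketch of the usual recipe is below, assuming you can pull the per-position lm_logits out of the model (teacher-forced, so the logits at step t score the token at step t+1). The average token log-probability, or equivalently the perplexity, gives a comparable score per sentence:

import numpy as np

def sentence_score(token_ids, token_logits):
    # token_ids:    [T] integer ids of the sentence (including any special tokens).
    # token_logits: [T, vocab] logits produced by the LM at each position.
    log_probs = []
    for t in range(len(token_ids) - 1):
        logits = token_logits[t] - token_logits[t].max()    # numerical stability
        log_z = np.log(np.exp(logits).sum())
        log_probs.append(logits[token_ids[t + 1]] - log_z)  # log P(next token | prefix)
    avg_log_prob = float(np.mean(log_probs))
    perplexity = float(np.exp(-avg_log_prob))
    return avg_log_prob, perplexity

A higher average log-probability (lower perplexity) for "He is going to school" than for "He is go to school" is exactly the comparison asked for above.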

great dataset

Hi,

Can anybody help me, please?
How can I load a large amount of data (for example, about 2 GB) for LM training?

Thanks

Where can I get these two datasets?

def rocstories(data_dir, n_train=1497, n_valid=374):
    storys, comps1, comps2, ys = _rocstories(os.path.join(data_dir, 'cloze_test_val__spring2016 - cloze_test_ALL_val.csv'))
    teX1, teX2, teX3, _ = _rocstories(os.path.join(data_dir, 'cloze_test_test__spring2016 - cloze_test_ALL_test.csv'))

zip argument error

In train.py, I get the following error:

    File "train.py", line 201, in mgpu_train
      for i, xs in enumerate(zip(*xs)):
    TypeError: zip argument #1 must support iteration

I couldn't resolve this issue; any help would be really appreciated.

Supported languages

Probably running ahead of things, but the generic model(s) you plan to release are for English only I suppose?

I would really be interested in ones for more languages, kind of like fasttext has done.

Otherwise I'd have to resort to training them myself on my single GPU :(

Any plans for more languages?

Concatenating context and embeddings?

Hi,

Congratulations on the paper! Those of us who actually worked on ROCStories know how difficult it is!!

I have a small question on how embeddings are handled in the code.

we = tf.get_variable("we", [n_vocab+n_special+n_ctx, n_embd], initializer=tf.random_normal_initializer(stddev=0.02))
e = tf.gather(we, X)
h = tf.reduce_sum(e, 2)

I believe this is equivalent to the embedding_lookup() that people normally use, so we is the word embedding matrix. My question is: what is n_ctx (the context embedding)? May I ask how it is used in the model?

Thank you very much!


Now that I have looked at the code more closely, is it an artifact of the Transformer decoder?
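A hedged reading of those three lines, as a self-contained numpy sketch with toy sizes (the real model uses the learned BPE vocabulary and n_embd = 768): X carries two ids per position, a token id and a position id, the position ids index the last n_ctx rows of we, and the reduce_sum simply adds the word embedding and the learned position embedding.

import numpy as np

n_vocab, n_special, n_ctx, n_embd = 10, 3, 4, 8      # toy sizes for illustration
we = np.random.randn(n_vocab + n_special + n_ctx, n_embd).astype(np.float32)

# X[..., 0] = token ids, X[..., 1] = position ids offset past the vocabulary,
# i.e. ids n_vocab+n_special .. n_vocab+n_special+n_ctx-1.
X = np.stack([np.array([1, 5, 2, 7]),
              np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)],
             axis=-1)                                 # [n_ctx, 2]

e = we[X]               # gather: [n_ctx, 2, n_embd]
h = e.sum(axis=1)       # word embedding + position embedding: [n_ctx, n_embd]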

Did you ever try your model on the ROCStories training set?

I used the training data to train this model (generating the wrong endings at random) and used the test data for evaluation.
The result is only about 60%, while a common embedding model can reach 65%+.
I'm not sure whether I'm using this model the right way.
Did you ever try this?

The Conv1d over Linear?

def conv1d(x, scope, nf, rf, w_init=tf.random_normal_initializer(stddev=0.02), b_init=tf.constant_initializer(0), pad='VALID', train=False):

I see that you use 1-D convolutions throughout the code, which should technically perform the same as dense layers. The main difference I found online was a shorter computation time for the dense layer, so I am wondering why you use 1-D convolutions here.

Cheers
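For what it's worth, a width-1 convolution over the sequence axis is exactly a dense layer applied independently at every position; a small numeric sanity check (bias omitted, shapes illustrative):

import numpy as np

batch, n_ctx, nx, nf = 2, 5, 4, 3
x = np.random.randn(batch, n_ctx, nx).astype(np.float32)
w = np.random.randn(nx, nf).astype(np.float32)

# The rf == 1 branch: flatten, matmul, reshape back.
dense_out = (x.reshape(-1, nx) @ w).reshape(batch, n_ctx, nf)

# A width-1 "convolution": the same linear map applied at each position.
conv_out = np.einsum('bti,io->bto', x, w)

assert np.allclose(dense_out, conv_out, atol=1e-5)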

About the non-determinism due to GPU ops

Hi,

I understand that there is non-determinism due to GPU ops, and I observed this as well: running the same code twice on the same GPU gave significantly different results. However, I was wondering why the PyTorch re-implementation https://github.com/huggingface/pytorch-openai-transformer-lm actually gives the same results when run twice in a row. Could it be that I am using a "wrong" version of TF? I have tensorflow-gpu 1.4.0, python 3.6, cuda 8.0 and cudnn 6.0. Thanks!

implementation of ExponentialMovingAverage is not correct

def get_ema_if_exists(v, gvs):
    name = v.name.split(':')[0]
    ema_name = name+'/ExponentialMovingAverage:0'
    ema_v = [v for v in gvs if v.name == ema_name]
    if len(ema_v) == 0:
        ema_v = [v]
    return ema_v[0]

def get_ema_vars(*vs):
    if tf.get_variable_scope().reuse:
        gvs = tf.global_variables()
        vs = [get_ema_if_exists(v, gvs) for v in vs]
    if len(vs) == 1:
        return vs[0]
    else:
        return vs

In g, b = get_ema_vars(g, b), I think g and b end up being the original tensors, not the EMA versions.

Using conv1d with kernel size 1

Hi!
I've noticed that the training code uses a 1-D convolution with kernel size 1 in all invocations. Do we need convolution at all here? Why not replace it with a fully_connected layer?

Unable to adapt language model

Thank you for the great research and code! Regarding this section in the research paper:

"For CoLA (linguistic acceptability), examples are scored as the average token log-probability the
generative model assigns and predictions are made by thresholding. For SST-2 (sentiment analysis),
we append the token very to each example and restrict the language model’s output distribution to only
the words positive and negative and guess the token it assigns higher probability to as the prediction.
For RACE (question answering), we pick the answer the generative model assigns the highest average
token log-probability when conditioned on the document and question. For DPRD [46] (winograd
schemas), we replace the definite pronoun with the two possible referrents and predict the resolution
that the generative model assigns higher average token log-probability to the rest of the sequence
after the substitution."

I have tried to adapt the language-model part of the code to perform the above-mentioned tasks. For instance, for CoLA, I fed in the encoded sentences and evaluated the results by thresholding the lm_losses output from the language model. However, the best Matthews correlation coefficient obtained is 0.015, far short of the 0.479 achieved in the paper. How exactly was the model configured to perform the task? Was it purely through the language-model output, or was the supervised classification head used?
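For the SST-2 heuristic specifically, a hedged sketch of the comparison described in the quote, assuming each label word maps to a single BPE token (if either splits into several tokens, compare summed log-probabilities instead). pos_id and neg_id are the encoder ids of "positive" and "negative", looked up by the caller:

import numpy as np

def sst2_predict(next_token_logits, pos_id, neg_id):
    # next_token_logits: the LM's logits for the token that would follow
    # "<example> very"; the output distribution is restricted to the two
    # label words and the higher-scoring one is returned.
    return 'positive' if next_token_logits[pos_id] > next_token_logits[neg_id] else 'negative'

The CoLA thresholding follows the same average-token-log-probability recipe sketched under the language-model scoring question above.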

Have you ever tried a bigger corpus?

I see that BERT uses BookCorpus (800M words) and Wikipedia (2,500M words), while GPT only uses BookCorpus. Even though BERT has a more complex model structure, which may affect its representation ability, the difference in evaluation results may also come from the training corpus. Have you ever tried a bigger corpus such as Wikipedia?

The comparison results could also reflect the influence of BERT's Task #2 (next-sentence prediction).

What does the variable n_ctx mean?

Could you explain why it is computed like this?
n_ctx = min(max(max(len(x[:max_len]) for x in X) for X in [trX, vaX]) + 3, args.n_ctx)
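One hedged reading of that line, rewritten with names and toy data (illustrative only): the "+ 3" presumably leaves room for the three special tokens wrapped around each example.

# A toy re-derivation of the line above (names and data are illustrative).
max_len = 77
trX = [[1, 2, 3], [4, 5, 6, 7]]          # stand-ins for the encoded sequences
vaX = [[8, 9]]
args_n_ctx = 512                         # the --n_ctx command-line cap

# Longest sequence after truncation to max_len, across train and validation...
longest = max(max(len(x[:max_len]) for x in X) for X in [trX, vaX])
# ...plus 3 positions, presumably for the _start_, _delimiter_ and _classify_
# tokens, capped at the model's maximum context length.
n_ctx = min(longest + 3, args_n_ctx)
print(n_ctx)                             # 7 for this toy data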

What is training accuracy of the language model?

I trained this architecture on Japanese and got only 50% training accuracy. Is that normal?

I suspect it is not 100% because this model predicts over the entire sequence, not just an n-gram as in Word2Vec.

Thank you very much.

What is the specific formula for the learning rate used in the Adam optimizer during pre-training?

In your paper, the learning rate used in the Adam optimizer during pre-training is described as follows:
'We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule.'
But what is the specific formula for this learning rate?
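The exact implementation lives in the repo's optimizer code (opt.py), so treat the following only as a hedged reconstruction of the sentence quoted above: linear warmup from 0 to 2.5e-4 over the first 2000 updates, then cosine annealing to 0 over the remaining updates.

import math

def lr_at(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    # Linear warmup: 0 -> max_lr over the first `warmup_steps` updates.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Cosine annealing: max_lr -> 0 over the remaining updates.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))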

BUG in utils.py

def get_ema_if_exists(v, gvs):      
    name = v.name.split(':')[0]  
    ema_name = name+'/ExponentialMovingAverage:0'  
    ema_v = [v for v in gvs if v.name == ema_name]  
    if len(ema_v) == 0:  
        ema_v = [v]  
    return ema_v[0]

The code above is from utils.py, lines 117-123.
When the list comprehension runs, the variable v overwrites the argument v of the function, so whenever ema_v is empty, what gets returned is the last tensor in gvs rather than the original argument.

This can cause a problem: in that case, the same variable is used for every normalization function.
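Note that in Python 3 a list comprehension has its own scope, so the shadowing described above only bites under Python 2; renaming the comprehension variable sidesteps it either way. A defensive rewrite (a sketch, not the authors' fix):

def get_ema_if_exists(v, gvs):
    name = v.name.split(':')[0]
    ema_name = name + '/ExponentialMovingAverage:0'
    # Rename the loop variable so it can never shadow the argument `v`.
    ema_v = [gv for gv in gvs if gv.name == ema_name]
    if len(ema_v) == 0:
        # No EMA shadow variable exists; fall back to the original variable.
        ema_v = [v]
    return ema_v[0]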

Convolution layer filter width

In the code that does the convolution, there is a separate implementation for convolution with filter width of 1 and convolution with other filter widths.

Convolution with a filter width of 1 is special-cased with a faster implementation, and it is the only filter width used by the current code. The other branch carries the comment # was used to train LM.

Does this mean that a different convolution filter width was used for language modeling, or was it just a less efficient implementation with the same filter width?

Question about the shape of `X_train`

X_train = tf.placeholder(tf.int32, [n_batch_train, 2, n_ctx, 2])
xmb[:, :, :, 1] = np.arange(n_vocab+n_special, n_vocab+n_special+n_ctx)
Why is there a channel of additional tokens?

Why are the "wrong" sentences learned during training via the LM?

Maybe I am not interpreting the model(...) function correctly, but I see the following:

During training, you feed both the correct and the wrong ROCStories endings into the decoder. Both go through the embedding + decoder and then into the sparse_softmax_cross_entropy function.

This means, though, that the model also learns to generate the wrong sentences, or am I missing something?

My intuition would be to set all masks to 0 for the wrong sentences.

Thanks and regards

Universal Transformer as base architecture

Hello,

First, I would like to thank the authors of this paper for releasing their source code.

Is there a plan to use the same approach using a Universal Transformer as base architecture? Would the adaptive computation time (ACT) mechanism transfer to other tasks?

And more importantly, if this new transformer can be used, do you think the gain would be noticeable?

Cannot reproduce this experiment

Hi,

I tried to run this code to reproduce the reported ~85% accuracy, but I have run it three times and got only 53% accuracy each time.
Could you please share any tips on how to run this code or set the parameters?
I didn't change any code and used the command python train.py --dataset rocstories --desc rocstories --submit --analysis --data_dir [path to data here] from the README.

Thanks

Cannot reproduce RACE score

Hi,

We tried several settings based on your code and paper, but unfortunately we cannot reproduce the RACE score (the training loss decreases, but the dev accuracy only reaches 0.26).
Could you share any tips on the parameters or code modifications needed to reach the reported performance?

Thanks
