
finetune-transformer-lm's Introduction

Status: Archive (code is provided as-is, no updates expected)

finetune-transformer-lm

Code and model for the paper "Improving Language Understanding by Generative Pre-Training"

Currently this code implements the ROCStories Cloze Test result reported in the paper by running: python train.py --dataset rocstories --desc rocstories --submit --analysis --data_dir [path to data here]

Note: The code is currently non-deterministic due to various GPU ops. The median accuracy of 10 runs with this codebase (using default hyperparameters) is 85.8% - slightly lower than the reported single run of 86.5% from the paper.

The ROCStories dataset can be downloaded from the associated website.

finetune-transformer-lm's People

Contributors

cberner, christopherhesse, newmu


finetune-transformer-lm's Issues

ResourceExhaustedError

I ran this model on a 1080 Ti with the command python train.py --dataset rocstories --desc rocstories --submit --analysis and got a ResourceExhaustedError.
Do I need to make changes to the code before running?
Thanks.
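A common workaround (not an official fix) is to shrink the memory footprint: train.py exposes its hyperparameters through argparse, so lowering the per-GPU batch size (the --n_batch argument, if your copy has it) and building the graph for a single GPU (--n_gpu 1) should reduce memory pressure on an 11 GB card. Treat the exact flag names as assumptions and check the argparse block in train.py.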

Question about "rf" parameter in conv1d() function

    if rf == 1:  # faster 1x1 conv
        c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf])) + b, shape_list(x)[:-1] + [nf])
    else:  # was used to train LM
        c = tf.nn.conv1d(x, w, stride=1, padding=pad) + b

When training the LM, what value does rf take, and why?

Error!! :(

Hi,
How can I solve this error?

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Can anybody help me?
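For reference, the tokenizer in this repo loads spaCy through the old 'en' shortcut, which spaCy only resolves after a model has been downloaded and linked (under spaCy 2.x, running python -m spacy download en does both). A hedged sketch of a defensive load, assuming en_core_web_sm is an acceptable substitute and that the disabled pipeline components mirror what the repo's text encoder needs:

import spacy

try:
    nlp = spacy.load('en', disable=['parser', 'tagger', 'ner', 'textcat'])
except OSError:
    # E050 is raised when the 'en' shortcut link does not exist; this fallback
    # assumes `python -m spacy download en_core_web_sm` has been run first.
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner', 'textcat'])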

Why do we need to apply mask while fine tuning?

In the attention class, you have the following code for masking. I understand the logic for pre-training, but in fine-tuning, if we don't include the language-model loss, shouldn't there be a check here to skip the mask? Do we always have to apply the masking because the model was trained that way? Is there an intuitive reason for this? I don't see an experimental necessity for it.
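The referenced snippet is not reproduced here, but for context, a minimal numpy sketch of the causal (lower-triangular) mask the attention code applies; shapes and the -1e9 constant are illustrative, though the idea matches the repo's TensorFlow implementation. One intuition, hedged, is that the mask keeps the fine-tuned model consistent with how the LM was pre-trained and is required whenever the auxiliary LM loss is kept:

import numpy as np

def causal_mask(n):
    # Lower-triangular [n, n] matrix: entry (i, j) is 1 iff j <= i,
    # i.e. position i may attend to itself and earlier positions only.
    return np.tril(np.ones((n, n), dtype=np.float32))

def masked_attention_logits(w):
    # w: raw attention logits of shape [..., n, n].
    b = causal_mask(w.shape[-1])
    # Keep allowed positions and push disallowed ones to a large negative
    # value so they contribute ~0 probability after the softmax.
    return w * b - 1e9 * (1.0 - b)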

Is the LM pre-trained with a _start_ symbol?

Hi,

I was wondering whether, during the pre-training of the LM alone, the sentences were prepended with a start symbol, just as they are during fine-tuning. If so, could you please mention the name of that token in the learned vocab? If not, wouldn't it introduce a bit of a mismatch with respect to the pre-trained LM? Of course, the model is being fine-tuned, so it will adapt, but then why would the start symbol be necessary for fine-tuning? Thanks!
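For reference, a hedged sketch of how the fine-tuning inputs are assembled. Token names follow train.py; the toy vocabulary and ids are illustrative. The special tokens are appended after the regular BPE vocabulary, and each ROCStories example is wrapped as start, story, delimiter, candidate ending, classify:

# encoder stands in for the BPE vocabulary dict loaded by the text encoder
# (assumption); the three special tokens are appended after the regular vocab.
encoder = {'hello</w>': 0, 'world</w>': 1}
encoder['_start_'] = len(encoder)
encoder['_delimiter_'] = len(encoder)
encoder['_classify_'] = len(encoder)

def build_input(story_ids, ending_ids, max_len):
    # [start] + story + [delimiter] + candidate ending + [classify]
    return ([encoder['_start_']] + story_ids[:max_len]
            + [encoder['_delimiter_']] + ending_ids[:max_len]
            + [encoder['_classify_']])

print(build_input([0, 1], [1, 0], max_len=64))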

How to deal with logits from position indices in the output layer?

Dear guys,

I found that the position embeddings are concatenated with the word embeddings in the embedding layer.

init_params[0] = np.concatenate([init_params[1], (np.random.randn(n_special, n_embd)*0.02).astype(np.float32), init_params[0]], 0)

and the output layer also shares weights with this embedding layer, so it outputs logits for both word indices and position indices.
lm_logits = tf.matmul(lm_h, we, transpose_b=True)

My questions are:

  1. During LM pre-training, did you mask out the logits from those position indices when computing the loss?
  2. If I use the pre-trained model as an LM to generate text, do I need to mask out the logits at these position indices before the softmax when sampling the next word?
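Whatever was done during pre-training, for question 2 a sampling-time workaround is straightforward: since the position rows occupy the end of the shared embedding matrix (ids n_vocab+n_special and up, matching how the position ids are generated in train.py), their logits can be suppressed before the softmax. A minimal numpy sketch, with illustrative shapes:

import numpy as np

def mask_position_logits(lm_logits, n_vocab, n_special):
    # lm_logits: [..., n_vocab + n_special + n_ctx], because the output layer
    # reuses the combined word/special/position embedding matrix `we`.
    # Only the first n_vocab + n_special columns are real tokens, so the
    # trailing position columns are pushed to a large negative value.
    masked = lm_logits.copy()
    masked[..., n_vocab + n_special:] = -1e9
    return masked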

BTW, I used the pytorch code ported by huggingface:
https://github.com/huggingface/pytorch-openai-transformer-lm
FYI, I also posted an issue there describing some details of my experiments:
huggingface/pytorch-openai-transformer-lm#36

  • Da Xiao

Using it as a Language model

I was trying to use it as a language model to assign a score (e.g., a perplexity score) to a given sentence. Something like:
P("He is go to school")=0.008
P("He is going to school")=0.08
This would indicate that the probability of the second sentence is higher than that of the first. Is there a way to get a score like this?

Thanks
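There is no scoring utility in this repo, but a sketch of the usual recipe is below, assuming you can pull the per-position lm_logits out of the model (teacher-forced, so the logits at step t score the token at step t+1). The average token log-probability, or equivalently the perplexity, gives a comparable score per sentence:

import numpy as np

def sentence_score(token_ids, token_logits):
    # token_ids:    [T] integer ids of the sentence (including any special tokens).
    # token_logits: [T, vocab] logits produced by the LM at each position.
    log_probs = []
    for t in range(len(token_ids) - 1):
        logits = token_logits[t] - token_logits[t].max()    # numerical stability
        log_z = np.log(np.exp(logits).sum())
        log_probs.append(logits[token_ids[t + 1]] - log_z)  # log P(next token | prefix)
    avg_log_prob = float(np.mean(log_probs))
    perplexity = float(np.exp(-avg_log_prob))
    return avg_log_prob, perplexity

A higher average log-probability (lower perplexity) for "He is going to school" than for "He is go to school" is exactly the comparison asked for above.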

great dataset

Hi,

Can anybody help me, please?
How can I load a large amount of data (for example, about 2 GB) for LM training?

Thanks

Where can I get these two datasets?

def rocstories(data_dir, n_train=1497, n_valid=374):
    storys, comps1, comps2, ys = _rocstories(os.path.join(data_dir, 'cloze_test_val__spring2016 - cloze_test_ALL_val.csv'))
    teX1, teX2, teX3, _ = _rocstories(os.path.join(data_dir, 'cloze_test_test__spring2016 - cloze_test_ALL_test.csv'))

zip argument error

In train.py, I get the following error:

    File "train.py", line 201, in mgpu_train
      for i, xs in enumerate(zip(*xs)):
    TypeError: zip argument #1 must support iteration

I couldn't resolve this issue; any help would be really appreciated.

Supported languages

Probably running ahead of things, but the generic model(s) you plan to release are for English only I suppose?

I would really be interested in ones for more languages, kind of like fasttext has done.

Otherwise I'd have to resort to training them myself on my single GPU :(

Any plans for more languages?

Concatenating context and embeddings?

Hi,

Congratulations on the paper! Those of us who actually worked on ROCStories know how difficult it is!!

I have a small question on how embeddings are handled in the code.

we = tf.get_variable("we", [n_vocab+n_special+n_ctx, n_embd], initializer=tf.random_normal_initializer(stddev=0.02))
e = tf.gather(we, X)
h = tf.reduce_sum(e, 2)

I believe this is equivalent to the embedding_lookup() that people normally use, so we is the word embedding matrix. My question is: what is n_ctx (the context embedding)? May I ask how it is used in the model?

Thank you very much!


Now that I have looked at the code more closely, is it an artifact of the Transformer decoder?
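A hedged reading of those three lines, as a self-contained numpy sketch with toy sizes (the real model uses the learned BPE vocabulary and n_embd = 768): X carries two ids per position, a token id and a position id, the position ids index the last n_ctx rows of we, and the reduce_sum simply adds the word embedding and the learned position embedding.

import numpy as np

n_vocab, n_special, n_ctx, n_embd = 10, 3, 4, 8      # toy sizes for illustration
we = np.random.randn(n_vocab + n_special + n_ctx, n_embd).astype(np.float32)

# X[..., 0] = token ids, X[..., 1] = position ids offset past the vocabulary,
# i.e. ids n_vocab+n_special .. n_vocab+n_special+n_ctx-1.
X = np.stack([np.array([1, 5, 2, 7]),
              np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)],
             axis=-1)                                 # [n_ctx, 2]

e = we[X]               # gather: [n_ctx, 2, n_embd]
h = e.sum(axis=1)       # word embedding + position embedding: [n_ctx, n_embd]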

Did you ever try your model on the ROCStories training set?

I used the training data to train this model (generating the wrong endings at random) and used the test data for evaluation.
The result is only about 60%, while a common embedding model can reach 65%+.
I'm not sure whether I'm using this model the right way.
Did you ever try this?

The Conv1d over Linear?

def conv1d(x, scope, nf, rf, w_init=tf.random_normal_initializer(stddev=0.02), b_init=tf.constant_initializer(0), pad='VALID', train=False):

I see that you use 1-D convolutions throughout the code, which should technically perform the same as dense layers. The main difference I found online was a shorter computation time for the dense layer, so I am wondering why you use 1-D convolutions here.

Cheers
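For what it's worth, a width-1 convolution over the sequence axis is exactly a dense layer applied independently at every position; a small numeric sanity check (bias omitted, shapes illustrative):

import numpy as np

batch, n_ctx, nx, nf = 2, 5, 4, 3
x = np.random.randn(batch, n_ctx, nx).astype(np.float32)
w = np.random.randn(nx, nf).astype(np.float32)

# The rf == 1 branch: flatten, matmul, reshape back.
dense_out = (x.reshape(-1, nx) @ w).reshape(batch, n_ctx, nf)

# A width-1 "convolution": the same linear map applied at each position.
conv_out = np.einsum('bti,io->bto', x, w)

assert np.allclose(dense_out, conv_out, atol=1e-5)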

About the non-determinism due to GPU ops

Hi,

I understand that there is non-determinism due to GPU ops, and I observed this as well: running the same code twice on the same GPU gave significantly different results. However, I was wondering why the PyTorch re-implementation https://github.com/huggingface/pytorch-openai-transformer-lm actually gives the same results when run twice in a row. Could it be that I am using a "wrong" version of TF? I have tensorflow-gpu 1.4.0, python 3.6, cuda 8.0 and cudnn 6.0. Thanks!

implementation of ExponentialMovingAverage is not correct

def get_ema_if_exists(v, gvs):
    name = v.name.split(':')[0]
    ema_name = name+'/ExponentialMovingAverage:0'
    ema_v = [v for v in gvs if v.name == ema_name]
    if len(ema_v) == 0:
        ema_v = [v]
    return ema_v[0]

def get_ema_vars(*vs):
    if tf.get_variable_scope().reuse:
        gvs = tf.global_variables()
        vs = [get_ema_if_exists(v, gvs) for v in vs]
    if len(vs) == 1:
        return vs[0]
    else:
        return vs

In g, b = get_ema_vars(g, b), I think g and b end up being the original tensors, not the EMA versions.

Using conv1d with kernel size 1

Hi!
I've noticed that the training code uses a 1-D convolution with kernel size 1 in all invocations. Do we need convolution at all here? Why not replace it with a fully_connected layer?

Unable to adapt language model

Thank you for the great research and code! Regarding this section in the research paper:

"For CoLA (linguistic acceptability), examples are scored as the average token log-probability the
generative model assigns and predictions are made by thresholding. For SST-2 (sentiment analysis),
we append the token very to each example and restrict the language model’s output distribution to only
the words positive and negative and guess the token it assigns higher probability to as the prediction.
For RACE (question answering), we pick the answer the generative model assigns the highest average
token log-probability when conditioned on the document and question. For DPRD [46] (winograd
schemas), we replace the definite pronoun with the two possible referrents and predict the resolution
that the generative model assigns higher average token log-probability to the rest of the sequence
after the substitution."

I have tried to adapt the language-model part of the code to perform the above-mentioned tasks. For instance, for CoLA, I fed in the encoded sentences and evaluated the results by thresholding the lm_losses output from the language model. However, the best Matthews correlation coefficient obtained is 0.015, far short of the 0.479 achieved in the paper. How exactly was the model configured to perform the task? Was it purely through the language-model output, or was the supervised classification head used?
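For the SST-2 heuristic specifically, a hedged sketch of the comparison described in the quote, assuming each label word maps to a single BPE token (if either splits into several tokens, compare summed log-probabilities instead). pos_id and neg_id are the encoder ids of "positive" and "negative", looked up by the caller:

import numpy as np

def sst2_predict(next_token_logits, pos_id, neg_id):
    # next_token_logits: the LM's logits for the token that would follow
    # "<example> very"; the output distribution is restricted to the two
    # label words and the higher-scoring one is returned.
    return 'positive' if next_token_logits[pos_id] > next_token_logits[neg_id] else 'negative'

The CoLA thresholding follows the same average-token-log-probability recipe sketched under the language-model scoring question above.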

Have you ever tried a bigger corpus?

I see that BERT uses BookCorpus (800M words) and Wikipedia (2,500M words), while GPT only uses BookCorpus. Even though BERT has a more complex model structure, which may affect its representation ability, the difference in evaluation results may also come from the training corpus. Have you ever tried a bigger corpus such as Wikipedia?

The comparison results could also reflect the influence of BERT's Task #2 (next-sentence prediction).

What does the variable n_ctx mean?

Could you explain why it is computed like this?
n_ctx = min(max(max(len(x[:max_len]) for x in X) for X in [trX, vaX]) + 3, args.n_ctx)
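One hedged reading of that line, rewritten with names and toy data (illustrative only): the "+ 3" presumably leaves room for the three special tokens wrapped around each example.

# A toy re-derivation of the line above (names and data are illustrative).
max_len = 77
trX = [[1, 2, 3], [4, 5, 6, 7]]          # stand-ins for the encoded sequences
vaX = [[8, 9]]
args_n_ctx = 512                         # the --n_ctx command-line cap

# Longest sequence after truncation to max_len, across train and validation...
longest = max(max(len(x[:max_len]) for x in X) for X in [trX, vaX])
# ...plus 3 positions, presumably for the _start_, _delimiter_ and _classify_
# tokens, capped at the model's maximum context length.
n_ctx = min(longest + 3, args_n_ctx)
print(n_ctx)                             # 7 for this toy data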

What is training accuracy of the language model?

I trained this architecture on Japanese and got only 50% training accuracy. Is that normal?

I suspect it is not 100% because this model predicts over the entire sequence, not just an n-gram as in Word2Vec.

Thank you very much.

What is the specific formula for the learning rate used in the Adam optimizer during pre-training?

In your paper, the learning rate used in the Adam optimizer during pre-training is described as follows:
'We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule.'
But what is the specific formula for this learning rate?
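The exact implementation lives in the repo's optimizer code (opt.py), so treat the following only as a hedged reconstruction of the sentence quoted above: linear warmup from 0 to 2.5e-4 over the first 2000 updates, then cosine annealing to 0 over the remaining updates.

import math

def lr_at(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    # Linear warmup: 0 -> max_lr over the first `warmup_steps` updates.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Cosine annealing: max_lr -> 0 over the remaining updates.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))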

BUG in utils.py

def get_ema_if_exists(v, gvs):      
    name = v.name.split(':')[0]  
    ema_name = name+'/ExponentialMovingAverage:0'  
    ema_v = [v for v in gvs if v.name == ema_name]  
    if len(ema_v) == 0:  
        ema_v = [v]  
    return ema_v[0]

The code above is from utils.py, lines 117-123.
When the list comprehension runs, the variable v overwrites the argument v of the function, so whenever ema_v is empty, what gets returned is the last tensor in gvs rather than the original argument.

This can cause a problem: in that case, the same variable is used for every normalization function.
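Note that in Python 3 a list comprehension has its own scope, so the shadowing described above only bites under Python 2; renaming the comprehension variable sidesteps it either way. A defensive rewrite (a sketch, not the authors' fix):

def get_ema_if_exists(v, gvs):
    name = v.name.split(':')[0]
    ema_name = name + '/ExponentialMovingAverage:0'
    # Rename the loop variable so it can never shadow the argument `v`.
    ema_v = [gv for gv in gvs if gv.name == ema_name]
    if len(ema_v) == 0:
        # No EMA shadow variable exists; fall back to the original variable.
        ema_v = [v]
    return ema_v[0]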

Convolution layer filter width

In the code that does the convolution, there is a separate implementation for convolution with filter width of 1 and convolution with other filter widths.

Convolution with a filter width of 1 is special-cased with a faster implementation, and it is the only filter width used by the current code. The other branch carries the comment # was used to train LM.

Does this mean that a different convolution filter width was used for language modeling, or was it just a less efficient implementation with the same filter width?

Question about the shape of `X_train`

X_train = tf.placeholder(tf.int32, [n_batch_train, 2, n_ctx, 2])
xmb[:, :, :, 1] = np.arange(n_vocab+n_special, n_vocab+n_special+n_ctx)
Why is there a channel of additional tokens?

Why are the "wrong" sentences learned during training via the LM?

Maybe I am not interpreting the model(...) function correctly, but I see the following:

During training, you feed both the correct and the wrong ROCStories endings into the decoder. Both go through the embedding + decoder and then into the sparse_softmax_cross_entropy function.

This means, though, that the model also learns to generate the wrong sentences, or am I missing something?

My intuition would be to set all masks to 0 for the wrong sentences.

Thanks and regards

Universal Transformer as base architecture

Hello,

First, I would like to thank the authors of this paper for releasing their source code.

Is there a plan to use the same approach using a Universal Transformer as base architecture? Would the adaptive computation time (ACT) mechanism transfer to other tasks?

And more importantly, if this new transformer can be used, do you think the gain would be noticeable?

Cannot reproduce this experiment

Hi,

I tried to run this code to reproduce the reported ~85% accuracy, but I have run it three times and got only 53% accuracy each time.
Could you please share any tips on how to run this code or set the parameters?
I didn't change any code and used the command python train.py --dataset rocstories --desc rocstories --submit --analysis --data_dir [path to data here] from the README.

Thanks

Cannot reproduce RACE score

Hi,

We tried several settings based on your code and paper, but unfortunately we cannot reproduce the RACE score (the training loss decreases, but the dev accuracy only reaches 0.26).
Could you share any tips on the parameters or code modifications needed to reach the reported performance?

Thanks
