princeton-nlp / DinkyTrain
Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration
License: MIT License
Hi,
This is great work that makes BERT pre-training more transparent. I have a question about your architecture.
In https://github.com/princeton-nlp/DinkyTrain/blob/main/run_efficient_mlm_recipe.sh, you set the following two flags:
--arch roberta_large
--encoder-normalize-before
I understand you want to use a pre-norm BERT. However, the default setting for roberta_large is --layernorm-embedding=True, which means two LayerNorm layers are applied consecutively right after the word embedding layer. I think you also need to set --layernorm-embedding=False.
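To see why two consecutive LayerNorms are redundant, here is a minimal NumPy sketch (not DinkyTrain's actual code) showing that, at initialization (weight=1, bias=0), applying a second LayerNorm directly after another is essentially a no-op:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension, as nn.LayerNorm does
    # at initialization, i.e. with weight=1 and bias=0.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(4, 16)   # a batch of 4 embedding vectors, dim 16
once = layer_norm(x)         # embedding LayerNorm
twice = layer_norm(once)     # first pre-norm LayerNorm right after it

# The second normalization changes almost nothing: its input already
# has zero mean and near-unit variance along the feature dimension.
print(np.abs(twice - once).max())  # tiny, on the order of eps
```

With learned affine parameters the two layers are no longer exactly equivalent, but at initialization the second one contributes nothing, which is why the questioner suggests disabling one of them.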
Hi,
Thanks for the wonderful work. I was wondering if you have the PPL for all the masking rates reported in Table 3?
Hello, I noticed that you gave a search space of hyperparameters for the GLUE datasets, and I am confused about how you searched over them. Did you train each task on each hyperparameter combination with different seeds? There are about fifty combinations of parameters; did you fine-tune GLUE with each of them? Thank you.
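For reference, the kind of exhaustive sweep the question asks about is just a Cartesian product over the grid. The values below are hypothetical placeholders, not the paper's actual search space:

```python
from itertools import product

# Hypothetical GLUE fine-tuning grid -- the real search space is exactly
# what the question is asking about, so these values are illustrative only.
learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
seeds = [0, 1, 2]

configs = [
    {"lr": lr, "batch_size": bs, "seed": seed}
    for lr, bs, seed in product(learning_rates, batch_sizes, seeds)
]

# Each config corresponds to one fine-tuning run per GLUE task; typically
# the run with the best dev-set score is kept.
print(len(configs))  # 4 * 2 * 3 = 24 runs per task
```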
Hi,
in Table 8 of https://arxiv.org/pdf/2202.08005.pdf, the recipe differs from the original RoBERTa recipe: RoBERTa-large uses a batch size of 8196, a peak learning rate of 4e-4, and trains for 100K steps. Your parameter setting seems to come from Table 3 of https://arxiv.org/pdf/1907.11692.pdf, but that recipe is for RoBERTa-base, not RoBERTa-large.
Hi,
As we know, the STS-B task is a regression task where the targets are in [0, 5]. The .csv file submitted to the GLUE leaderboard is also required to be in [0, 5]. Otherwise, errors appear in the GLUE submission system.
During data preprocessing for GLUE, the fairseq script normalizes the target values to [0, 1], and an MSE loss is computed between the logits and the normalized targets. For prediction, we need to multiply by 5.
However, during prediction, how can we make sure the predicted values are restricted to [0, 1], since there is no activation function (such as a sigmoid) applied to the logits?
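One common workaround (an assumption on my part, not necessarily what fairseq or DinkyTrain does) is simply to clip the raw logit to [0, 1] before undoing the normalization, so every submitted prediction lands in [0, 5]:

```python
def sts_b_prediction(logit: float) -> float:
    """Map a raw regression logit to the STS-B label range [0, 5].

    The model was trained against targets normalized to [0, 1], so we
    clip to that range first and then rescale by 5.
    """
    clipped = min(max(logit, 0.0), 1.0)
    return clipped * 5.0

print(sts_b_prediction(0.5))    # 2.5
print(sts_b_prediction(1.3))    # clipped to 1.0 -> 5.0
print(sts_b_prediction(-0.2))   # clipped to 0.0 -> 0.0
```

Since the MSE loss already pushes the logits toward [0, 1] on in-distribution inputs, clipping only affects the small fraction of predictions that overshoot the range.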
Hello, thank you for your code. I am trying to reproduce your results with "GPU=8 DATA_DIR=/dev/gbert/dataset DEEPSPEED=1 bash run_efficient_mlm_recipe.sh", but I got an error:
fairseq-train: error: argument --arch/-a: invalid choice: 'deepspeed_roberta_large'
I noticed that you prepend "deepspeed_" to the architecture name ("deepspeed_${ARCH}") in run_efficient_mlm_recipe.sh, and I don't think this arch is registered in fairseq. Could you please tell me why you add the prefix and how to solve this problem? Thank you!
Using scripts/convert_fs_ckpt_to_hf_ckpt.py with a source model trained using DeepSpeed fails with mismatched model keys.
Hi,
in your script https://github.com/princeton-nlp/DinkyTrain/blob/main/finetune_glue.sh, it seems you use the last checkpoint to validate. May I ask why you don't use the best checkpoint?
In addition, if you use the last checkpoint to validate, does that mean you also use it to test?