
Adapter-BERT

Introduction

This repository contains a version of BERT that can be trained using adapters. Our ICML 2019 paper contains a full description of this technique: Parameter-Efficient Transfer Learning for NLP.

Adapters allow one to train a model to solve new tasks, but adjust only a few parameters per task. This technique yields compact models that share many parameters across tasks, whilst performing similarly to fine-tuning the entire model independently for every task.
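
For intuition, each adapter is a small bottleneck network with a skip connection, inserted inside every transformer layer. A minimal PyTorch-style sketch of the idea (layer sizes and names are illustrative; the repository itself implements this in TensorFlow):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, skip."""

    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        # Near-zero initialization keeps the module close to the identity
        # at the start of training.
        for layer in (self.down, self.up):
            nn.init.normal_(layer.weight, std=1e-3)
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        # Skip connection: the output is approximately x when the
        # projections are near zero, so training starts from the
        # behavior of the pre-trained model.
        return x + self.up(torch.relu(self.down(x)))

Per task, only the adapter parameters, the layer-norm parameters, and the classification head are trained; the rest of BERT stays frozen and shared across tasks.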

The code here is forked from the original BERT repo. It provides our version of BERT with adapters, and the capability to train it on the GLUE tasks. For additional details on BERT, and support for additional tasks, see the original repo.

Tuning BERT with Adapters

The following command provides an example of tuning with adapters on GLUE.

Fine-tuning may be run on a GPU with at least 12GB of RAM, or a Cloud TPU. The same constraints apply as for full fine-tuning of BERT. For additional details, and instructions on downloading a pre-trained checkpoint and the GLUE tasks, see https://github.com/google-research/bert.

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=3e-4 \
  --num_train_epochs=5.0 \
  --output_dir=/tmp/adapter_bert_mrpc/

You should see an output like this:

***** Eval results *****
  eval_accuracy = 0.85784316
  eval_loss = 0.48347527
  global_step = 573
  loss = 0.48347527

This means that the Dev set accuracy was 85.78%. Small sets like MRPC have a high variance in the Dev set accuracy, even when starting from the same pre-training checkpoint. Therefore, results may deviate from this figure by up to 2%.

Citation

Please use the following citation for this work:

@inproceedings{houlsby2019parameter,
  title = {Parameter-Efficient Transfer Learning for {NLP}},
  author = {Houlsby, Neil and Giurgiu, Andrei and Jastrzebski, Stanislaw and Morrone, Bruna and De Laroussilhe, Quentin and Gesmundo, Andrea and Attariyan, Mona and Gelly, Sylvain},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  year = {2019},
}

The paper is available on arXiv: https://arxiv.org/abs/1902.00751.

Disclaimer

This is not an official Google product.

Contact information

For personal communication, please contact Neil Houlsby ([email protected]).


adapter-bert's Issues

How a near-identity initialization is implemented

Dear authors,

After reading the code, I see that the default behavior is to initialize the adapter parameters (w1, w2) with a small standard deviation.

Does this guarantee that the adapter starts as a near-identity mapping?
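
For concreteness, here is a small numeric check of what I mean (illustrative shapes):

import torch

hidden, bottleneck = 768, 64
x = torch.randn(4, hidden)

# Small-std initialization, as in the repo's default.
w1 = torch.randn(hidden, bottleneck) * 1e-3   # down-projection
w2 = torch.randn(bottleneck, hidden) * 1e-3   # up-projection

out = x + torch.relu(x @ w1) @ w2   # adapter with skip connection
print((out - x).abs().max())        # tiny, i.e. near-identity at init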

missing processors

Congratulations on the great paper!

One question: do you have additional processor classes?
At the moment, the code reads:

processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
}

and later:

if task_name not in processors:
    raise ValueError("Task not found: %s" % (task_name))

meaning that only 3 datasets can be used for training.

It would be useful for replication purposes to have all processors available. Let me know if this is possible or if I have misunderstood!
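
If it helps, here is roughly what I would expect such a processor to look like for RTE, following the pattern of the existing classes in run_classifier.py (the column indices are my guess at the GLUE TSV layout):

class RteProcessor(DataProcessor):
  """Sketch of a processor for RTE, following the existing classes."""

  def get_train_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_labels(self):
    return ["entailment", "not_entailment"]

  def _create_examples(self, lines, set_type):
    examples = []
    for (i, line) in enumerate(lines):
      if i == 0:  # skip the header row
        continue
      guid = "%s-%s" % (set_type, line[0])
      # Assumed column layout: index, sentence1, sentence2, label.
      examples.append(InputExample(
          guid=guid, text_a=line[1], text_b=line[2], label=line[-1]))
    return examples

# and register it in the dict above:
# processors["rte"] = RteProcessor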

regarding training speed and data requirements

Thanks for the great work here. I have a question: reading through the paper, I understood that training fewer parameters should bring a speed benefit (please correct me if that is wrong, since otherwise there would be little value in training fewer parameters). If so, reporting the training time cost would be very attractive.

My next question: can adapter-BERT also do the transfer with much less training data? When we have little data (~10k contexts) we probably want to freeze all of the BERT layers and just train customized layers on top (e.g., an MLP for text categorization). If adapter-BERT can achieve good performance when trained on a small dataset, that would be awesome.
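
For reference, the freezing baseline I have in mind looks something like this (PyTorch-style sketch; names and sizes are illustrative):

import torch.nn as nn

def frozen_bert_classifier(encoder, hidden_size=768, num_labels=2):
    """Freeze the whole encoder; train only a small MLP head on top."""
    for p in encoder.parameters():
        p.requires_grad = False          # no gradients into BERT itself
    head = nn.Sequential(
        nn.Linear(hidden_size, 256),
        nn.ReLU(),
        nn.Linear(256, num_labels),
    )
    return head                          # optimize head.parameters() only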

Adapters on large GLUE datasets do not reach the same results

Hi,
I am trying adapters on BERT-base and evaluating on GLUE. On smaller datasets like MRPC, RTE, and CoLA I see good results, but on the large GLUE datasets like MNLI, QNLI, and SST-2 I am struggling: results fall well below BERT-base.

I have a deadline soon and need to compare fairly with your method, so I would very much appreciate your feedback. Any suggestions that could help on the large-scale datasets?

Thanks!

freezing "layer_norm" and "head"

Hi,
Could you confirm: in the implementation of adapters, should the layer_norm of the original model be unfrozen, or only the layer_norm inside the adapter?
How about the classifier's head: does it need to be frozen? Thanks.
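
To make the question concrete, this is the kind of parameter selection I mean, in PyTorch-style pseudocode (the name patterns are my own, not the repo's):

def select_trainable(model, patterns=("adapter", "layer_norm", "classifier")):
    """Mark only parameters whose names match the patterns as trainable."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(p in name for p in patterns)
        if param.requires_grad:
            trainable.append(param)
    return trainable   # pass this list to the optimizer

# e.g. torch.optim.Adam(select_trainable(model), lr=3e-4)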

Hyperparameters of GLUE datasets

Hi, thanks for your great work!

I have failed to reproduce the high results on the GLUE datasets.
Could you provide the hyperparameters (training epochs, learning rate, etc.) for the 9 GLUE tasks corresponding to the best results claimed in the paper?

Efficient/low-latency serving with multiple adapters?

Thanks for releasing this paper. I must say that I've had similar results to you, where adapters beat linear combinations, conv heads, and every other head I can think of. As a bonus it nearly matches BERT's fine-tuned performance!

A couple of broad questions if that's ok:

The only downside is that if you use adapter models you can't use a single BERT service to provide features to multiple heads (e.g. bert-as-server). Do you have any ideas about efficiently serving multiple adapters? The best thing I can think of is keeping the main model in memory and quickly switching the adapter parameters based on the request.
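
The switching scheme I have in mind looks roughly like this (illustrative PyTorch sketch; it assumes adapter parameter names contain "adapter" and that shapes match across tasks):

class AdapterSwitcher:
    """One frozen backbone in memory; hot-swap per-task adapter weights."""

    def __init__(self, model):
        self.model = model
        self.adapters = {}   # task name -> state_dict of adapter tensors

    def save_task(self, task):
        self.adapters[task] = {
            k: v.detach().clone()
            for k, v in self.model.state_dict().items() if "adapter" in k
        }

    def load_task(self, task):
        # strict=False leaves the shared backbone weights untouched.
        self.model.load_state_dict(self.adapters[task], strict=False)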

Also, have you experimented with online/active learning with adapters? It seems like a fruitful area: we don't know how to do online learning well with transformers, but adapters let you train with high LRs and fewer parameters.

How to implement adapters in the pre-norm case

Hi,
I have a model in which the normalization happens first and the add operation comes after. The paper discusses the post-norm case; could you tell me how to implement adapters in the pre-norm case? Thank you.

I mark the lines where the normalization and the add operation happen with ** to describe the model better:

class LayerFF(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.feed_forward_proj == "relu":
            self.DenseReluDense = DenseReluDense(config)
        elif config.feed_forward_proj == "gated-gelu":
            self.DenseReluDense = DenseGatedGeluDense(config)
        else:
            raise ValueError(
                f"{self.config.feed_forward_proj} is not supported. Choose between `relu` and `gated-gelu`"
            )

        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(self, hidden_states):
        **forwarded_states = self.layer_norm(hidden_states)**
        forwarded_states = self.DenseReluDense(forwarded_states)
        **hidden_states = hidden_states + self.dropout(forwarded_states)**
        return hidden_states
class LayerSelfAttention(nn.Module):
    def __init__(self, config, has_relative_attention_bias=False):
        super().__init__()
        self.SelfAttention = Attention(config, has_relative_attention_bias=has_relative_attention_bias)
        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        position_bias=None,
        head_mask=None,
        past_key_value=None,
        use_cache=False,
        output_attentions=False,
    ):
        **normed_hidden_states = self.layer_norm(hidden_states)**
        attention_output = self.SelfAttention(
            normed_hidden_states,
            mask=attention_mask,
            position_bias=position_bias,
            head_mask=head_mask,
            past_key_value=past_key_value,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        **hidden_states = hidden_states + self.dropout(attention_output[0])**
        outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output them
        return outputs
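
My current guess is to insert the adapter on the residual branch, after the sublayer output and before the add, e.g. for the feed-forward block above. Is this the right placement?

def forward(self, hidden_states):
    forwarded_states = self.layer_norm(hidden_states)         # pre-norm
    forwarded_states = self.DenseReluDense(forwarded_states)
    # Proposed: adapt the residual branch before the add. self.adapter
    # would be a bottleneck module (with its own skip connection and
    # near-identity init) created in __init__.
    forwarded_states = self.adapter(forwarded_states)
    hidden_states = hidden_states + self.dropout(forwarded_states)
    return hidden_states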

ValueError: Tensor not found in checkpoint

Hi, I am trying to add adapter modules to my copy of BERT; however, I am running into problems when adding the layers. As far as I can tell, the model gets built correctly, but when I run run_pretraining.py I get the following error:

[screenshots of the stack trace: ValueError, the adapter tensors are not found in the checkpoint]

The problem is that it does not know what to do with the adapter layers, since they are not found in the checkpoint file. How can I work around this, or get BERT to recognize that I want to add them?

For reference, this is how I am running the script (I've modified it slightly to include the adapter modules as a flag):

python run_pretraining.py \
  --adapter=True \
  --input_file=/path/to/tfrecord/pretrained_iob2.tfrecord \
  --output_dir=/usr/bert/adapter \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=/path/to/bert/multi_cased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=/path/to/bert/multi_cased_L-12_H-768_A-12/bert_model.ckpt \
  --train_batch_size=16
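
The workaround I am currently considering is to build the assignment map myself, so that only variables actually present in the checkpoint are restored and the new adapter layers keep their fresh initialization (TF 1.x utilities, as used by the BERT codebase; does this look right?):

import tensorflow as tf

def init_from_checkpoint_skipping_new_vars(init_checkpoint):
  """Restore only variables that exist in the checkpoint; new variables
  (e.g. adapters) keep their initializers instead of raising an error."""
  ckpt_vars = {name for name, _ in tf.train.list_variables(init_checkpoint)}
  assignment_map = {}
  for var in tf.trainable_variables():
    name = var.name.split(":")[0]   # strip the ":0" suffix
    if name in ckpt_vars:
      assignment_map[name] = name
  tf.train.init_from_checkpoint(init_checkpoint, assignment_map)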
