
Comments (10)

rabeehkarimimahabadi commented on May 18, 2024

Hi,
I would appreciate some help with adapters for custom models. Thanks.


arueckle commented on May 18, 2024

Happy that you are interested in AdapterHub. Could you elaborate a bit on what you are trying to achieve and where the blockers are? Which example are you referring to and how do you want to extend it?


rabeehkarimimahabadi commented on May 18, 2024


arueckle commented on May 18, 2024
  1. In AdapterConfig, you have defined original_ln_before: bool, original_ln_after: bool, ln_before: bool, ln_after: bool. Could you tell me what the difference is between original_ln_* and ln_*, and which of them are necessary for a minimal Houlsby adapter implementation?

Conveniently, the HoulsbyConfig is available as a pre-defined config.

ln refers to the LayerNorm: it is either the one from BERT (which we call original here) or a newly trained one that is part of the adapter. Either can be configured to be applied before or after the adapter (hence the naming).
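
For illustration, using the pre-defined config could look roughly like the sketch below. This is only a sketch; import paths and method names vary between adapter-transformers versions, so treat it as an assumption, not exact API.

# Sketch only: adjust imports/method names to your installed adapter-transformers version.
from transformers import BertModel
from transformers.adapters import HoulsbyConfig  # pre-defined Houlsby-style config

model = BertModel.from_pretrained("bert-base-uncased")
model.add_adapter("my_task", config=HoulsbyConfig())  # bottleneck adapters with Houlsby placement/LayerNorm settings
model.train_adapter("my_task")         # freeze the base model, train only the adapter weights
model.set_active_adapters("my_task")   # use the adapter in forward passes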

  2. In adapter_modeling.py, you defined init_bert_weights. Is this necessary for the minimal Houlsby implementation?

I believe this is not necessary.

  3. In the same file, in the Adapter class, which of the residuals (before/after) and layer norms are necessary for a minimal correct implementation?

I don't exactly know what you mean by minimal implementations. I assume you are searching for the right parametrization; see the answer to Q1.

  4. In the Adapter class's forward function, there is a TODO about applying multiple layer norms. Could you explain it?

I am not sure what you are referring to. Could you provide a link to the exact line? If you mean L81, please see Activation_Function_Class.

  5. Assuming a neural model uses pre-norm, as in Figure 1 of https://openreview.net/pdf?id=B1x8anVFPr, could you tell me how the adapter can be applied in this case? In the provided examples with BERT we have post-norm, which is different.

Probably you want to either learn a new layer norm before the adapter (as part of the adapter) or deactivate all of them (re-using LN after the adapter is not possible because there is apparently no LN to re-use).
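
For example, starting from the Houlsby defaults and overriding the LayerNorm flags from Q1 could look like the sketch below. Again, this is only an assumption-laden sketch; whether this parametrization actually helps a pre-norm model has to be checked experimentally.

from transformers.adapters import HoulsbyConfig  # import path varies by library version

# Sketch: learn a new LayerNorm inside the adapter, don't re-use the model's own ones.
pre_norm_config = HoulsbyConfig(
    original_ln_before=False,  # no suitable original LayerNorm to re-use before the adapter
    original_ln_after=False,   # no original LayerNorm to re-use after the adapter
    ln_before=True,            # train a new LayerNorm as part of the adapter
    ln_after=False,
)
model.add_adapter("my_task", config=pre_norm_config)  # as in the sketch above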

  6. In the DistilBERT implementation, I see you have added adapters to the attention outputs too, while in the original Houlsby paper the adapter is always applied after the feed-forward layer. Could you explain this?

This is a new feature, maybe @calpt can comment on this.


rabeehkarimimahabadi commented on May 18, 2024

Hi,
thank you for the reply, it was helpful. Regarding question 5: why should the layer norm be placed before the adapter? I am currently getting very low scores with models that use pre-norm. Do you have any suggestions? Thanks.


rabeehkarimimahabadi commented on May 18, 2024

Hi,
to give some context: I am using a seq2seq model for classification. I have added adapter layers between the layers of the encoder and decoder, and I am getting 0 accuracy (the decoder output is used for classification). In the documentation you have something called add_classification_head; in the case where classification is done with a decoder, as in a seq2seq model, could you assist me in solving this issue and tell me which components might be missing? Thanks.


rabeehkarimimahabadi commented on May 18, 2024

Hi, regarding question 5, here is my model to explain better. I have marked the lines where the normalization and the add operation happen with ** comments to describe the model better. Could you tell me where to add the adapter layers, and where to add the layer norms inside the adapter? I am getting low performance, and I would appreciate knowing the correct positions for the adapter and layer norm. Thanks.

from torch import nn

# DenseReluDense, DenseGatedGeluDense, Attention and LayerNorm are defined
# elsewhere in my (T5-style) code.

class LayerFF(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.feed_forward_proj == "relu":
            self.DenseReluDense = DenseReluDense(config)
        elif config.feed_forward_proj == "gated-gelu":
            self.DenseReluDense = DenseGatedGeluDense(config)
        else:
            raise ValueError(
                f"{config.feed_forward_proj} is not supported. Choose between `relu` and `gated-gelu`"
            )

        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)  # ** pre-norm: LayerNorm applied before the sublayer
        forwarded_states = self.DenseReluDense(forwarded_states)
        hidden_states = hidden_states + self.dropout(forwarded_states)  # ** residual add
        return hidden_states


class LayerSelfAttention(nn.Module):
    def __init__(self, config, has_relative_attention_bias=False):
        super().__init__()
        self.SelfAttention = Attention(config, has_relative_attention_bias=has_relative_attention_bias)
        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        position_bias=None,
        head_mask=None,
        past_key_value=None,
        use_cache=False,
        output_attentions=False,
    ):
        normed_hidden_states = self.layer_norm(hidden_states)  # ** pre-norm: LayerNorm applied before the sublayer
        attention_output = self.SelfAttention(
            normed_hidden_states,
            mask=attention_mask,
            position_bias=position_bias,
            head_mask=head_mask,
            past_key_value=past_key_value,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        hidden_states = hidden_states + self.dropout(attention_output[0])  # ** residual add
        outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output them
        return outputs


arueckle commented on May 18, 2024

We re-use the original layer norm of BERT, and with that the output distribution of the adapter is rescaled to exactly the mean and variance expected by the next BERT layer. If you now use an entirely different model architecture with pre-layer norm, you perhaps need a different adapter architecture (e.g. one found by experimentation). You may want to have a look at the AdapterFusion paper; it provides some insight into where to place the adapter and which components have more influence in the standard BERT.
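
As a purely illustrative starting point for such experimentation (hypothetical class names, not the library's implementation): a minimal bottleneck adapter with its own LayerNorm, inserted on the residual branch of the LayerFF shown above, could look like the sketch below; the same pattern would apply to LayerSelfAttention.

class BottleneckAdapter(nn.Module):
    """Minimal Houlsby-style bottleneck: LayerNorm -> down-project -> non-linearity -> up-project -> residual."""

    def __init__(self, d_model, bottleneck_dim=64):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)        # newly trained LN (the ln_before case)
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.non_linearity = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, d_model)

    def forward(self, x):
        return x + self.up(self.non_linearity(self.down(self.layer_norm(x))))


class AdaptedLayerFF(LayerFF):
    """LayerFF from the snippet above, with an adapter on the residual branch."""

    def __init__(self, config, bottleneck_dim=64):
        super().__init__(config)
        self.adapter = BottleneckAdapter(config.d_model, bottleneck_dim)

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)      # original pre-norm LayerNorm
        forwarded_states = self.DenseReluDense(forwarded_states)
        forwarded_states = self.adapter(forwarded_states)      # adapter before the residual add
        hidden_states = hidden_states + self.dropout(forwarded_states)
        return hidden_states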


calpt commented on May 18, 2024
  6. In the DistilBERT implementation, I see you have added adapters to the
    attention outputs too, while in the original Houlsby paper the adapter is
    always applied after the feed-forward layer. Could you explain this?

@rabeehkarimimahabadi Houlsby et al. add adapters in two places in every Transformer layer: once after the self-attention and once after the FF layer (see Figure 2 of their paper). As far as I'm aware, this is how it is implemented in DistilBERT, once for the self-attention output (sa_output) and once for the FF output (ffn_output). Please provide the specific code lines you're referring to if this is not what you meant.
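
Schematically, the two insertion points per block look like the sketch below. This is only an illustration with hypothetical module names, not the actual DistilBERT adapter code; adapter_cls could be, for example, the BottleneckAdapter sketched earlier in the thread.

from torch import nn

class PostNormBlockWithAdapters(nn.Module):
    """Post-norm (DistilBERT-style) block with the two Houlsby insertion points."""

    def __init__(self, d_model, n_heads, d_ff, adapter_cls):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sa_layer_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.output_layer_norm = nn.LayerNorm(d_model)
        self.adapter_attention = adapter_cls(d_model)  # adapter 1: after self-attention
        self.adapter_ffn = adapter_cls(d_model)        # adapter 2: after the feed-forward layer

    def forward(self, x):
        sa_output, _ = self.attention(x, x, x)
        sa_output = self.adapter_attention(sa_output)
        sa_output = self.sa_layer_norm(sa_output + x)              # post-norm residual
        ffn_output = self.ffn(sa_output)
        ffn_output = self.adapter_ffn(ffn_output)
        ffn_output = self.output_layer_norm(ffn_output + sa_output)
        return ffn_output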


rabeehkarimimahabadi commented on May 18, 2024

Thank you for the explanations.

