
Comments (10)

rabeehkarimimahabadi commented on May 18, 2024

Hi,
I would appreciate some help with adapters for custom models. Thanks.


arueckle commented on May 18, 2024

Happy that you are interested in AdapterHub. Could you elaborate a bit on what you are trying to achieve and where the blockers are? Which example are you referring to and how do you want to extend it?


rabeehkarimimahabadi commented on May 18, 2024


arueckle commented on May 18, 2024
  1. In AdapterConfig, you have defined original_ln_before: bool, original_ln_after: bool, ln_before: bool, ln_after: bool. Could you tell me what the difference is between original_ln_* and ln_*, and which of them are necessary for a minimal Houlsby adapter implementation?

Conveniently, the HoulsbyConfig is available as a pre-defined config.

ln refers to the LayerNorm: it is either the one from BERT (which we call original here) or a newly trained one that is part of the adapter. Either can be configured to be applied before or after the adapter (hence the naming).
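
For illustration, using the pre-defined config could look roughly like the sketch below. This is only a sketch; import paths and method names vary between adapter-transformers versions, so treat it as an assumption, not exact API.

# Sketch only: adjust imports/method names to your installed adapter-transformers version.
from transformers import BertModel
from transformers.adapters import HoulsbyConfig  # pre-defined Houlsby-style config

model = BertModel.from_pretrained("bert-base-uncased")
model.add_adapter("my_task", config=HoulsbyConfig())  # bottleneck adapters with Houlsby placement/LayerNorm settings
model.train_adapter("my_task")         # freeze the base model, train only the adapter weights
model.set_active_adapters("my_task")   # use the adapter in forward passes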

  2. In adapter_modeling.py, you defined init_bert_weights. Is this necessary for the minimal Houlsby implementation?

I believe this is not necessary.

  3. In the same file, in the Adapter class, which of the residuals (before/after) and layer norms are necessary for a minimal correct implementation?

I don't exactly know what you mean by minimal implementations. I assume you are searching for the right parametrization; see the answer to Q1.

  4. In the Adapter class's forward function, there is a TODO about applying multiple layer norms. Could you explain it?

I am not sure what you are referring to. Could you provide a link to the exact line? If you mean L81, please see Activation_Function_Class.

  5. Assuming a neural model uses pre-norm, as in Figure 1 of https://openreview.net/pdf?id=B1x8anVFPr, could you tell me how the adapter can be applied in this case? In the provided examples with BERT we have post-norm, which is different.

Probably you want to either learn a new layer norm before the adapter (as part of the adapter) or deactivate all of them (re-using LN after the adapter is not possible because there is apparently no LN to re-use).
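
For example, starting from the Houlsby defaults and overriding the LayerNorm flags from Q1 could look like the sketch below. Again, this is only an assumption-laden sketch; whether this parametrization actually helps a pre-norm model has to be checked experimentally.

from transformers.adapters import HoulsbyConfig  # import path varies by library version

# Sketch: learn a new LayerNorm inside the adapter, don't re-use the model's own ones.
pre_norm_config = HoulsbyConfig(
    original_ln_before=False,  # no suitable original LayerNorm to re-use before the adapter
    original_ln_after=False,   # no original LayerNorm to re-use after the adapter
    ln_before=True,            # train a new LayerNorm as part of the adapter
    ln_after=False,
)
model.add_adapter("my_task", config=pre_norm_config)  # as in the sketch above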

  6. In the DistilBERT implementation, I see you have added adapters to the attention outputs too, while in the original Houlsby paper the adapter is always applied after the feed-forward layer. Could you explain this?

This is a new feature, maybe @calpt can comment on this.


rabeehkarimimahabadi commented on May 18, 2024

Hi,
thank you for the reply, it was helpful. Regarding question 5: why should the layer norm be placed before the adapter? I am currently getting very low scores with models that use pre-norm. Do you have any suggestions? Thanks.


rabeehkarimimahabadi commented on May 18, 2024

Hi,
to give some context: I am using a seq2seq model for classification. I have added adapter layers between the layers of the encoder and decoder, and I am getting 0 accuracy (the decoder output is used for classification). In the documentation you have something called add_classification_head; in the case where classification is done with a decoder, as in a seq2seq model, could you assist me in solving this issue and tell me which components might be missing? Thanks.


rabeehkarimimahabadi commented on May 18, 2024

Hi, regarding question 5, here is my model to explain better. I have marked the lines where the normalization and the add operation happen with ** comments to describe the model better. Could you tell me where to add the adapter layers, and where to add the layer norms inside the adapter? I am getting low performance, and I would appreciate knowing the correct positions for the adapter and layer norm. Thanks.

from torch import nn

# DenseReluDense, DenseGatedGeluDense, Attention and LayerNorm are defined
# elsewhere in my (T5-style) code.

class LayerFF(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.feed_forward_proj == "relu":
            self.DenseReluDense = DenseReluDense(config)
        elif config.feed_forward_proj == "gated-gelu":
            self.DenseReluDense = DenseGatedGeluDense(config)
        else:
            raise ValueError(
                f"{config.feed_forward_proj} is not supported. Choose between `relu` and `gated-gelu`"
            )

        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)  # ** pre-norm: LayerNorm applied before the sublayer
        forwarded_states = self.DenseReluDense(forwarded_states)
        hidden_states = hidden_states + self.dropout(forwarded_states)  # ** residual add
        return hidden_states


class LayerSelfAttention(nn.Module):
    def __init__(self, config, has_relative_attention_bias=False):
        super().__init__()
        self.SelfAttention = Attention(config, has_relative_attention_bias=has_relative_attention_bias)
        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        position_bias=None,
        head_mask=None,
        past_key_value=None,
        use_cache=False,
        output_attentions=False,
    ):
        normed_hidden_states = self.layer_norm(hidden_states)  # ** pre-norm: LayerNorm applied before the sublayer
        attention_output = self.SelfAttention(
            normed_hidden_states,
            mask=attention_mask,
            position_bias=position_bias,
            head_mask=head_mask,
            past_key_value=past_key_value,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        hidden_states = hidden_states + self.dropout(attention_output[0])  # ** residual add
        outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output them
        return outputs


arueckle commented on May 18, 2024

We re-use the original layer norm of BERT, and with that the output distribution of the adapter is rescaled to exactly the mean and variance expected by the next BERT layer. If you now use an entirely different model architecture with pre-layer norm, you perhaps need a different adapter architecture (e.g. one found by experimentation). You may want to have a look at the AdapterFusion paper; it provides some insight into where to place the adapter and which components have more influence in the standard BERT.
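
As a purely illustrative starting point for such experimentation (hypothetical class names, not the library's implementation): a minimal bottleneck adapter with its own LayerNorm, inserted on the residual branch of the LayerFF shown above, could look like the sketch below; the same pattern would apply to LayerSelfAttention.

class BottleneckAdapter(nn.Module):
    """Minimal Houlsby-style bottleneck: LayerNorm -> down-project -> non-linearity -> up-project -> residual."""

    def __init__(self, d_model, bottleneck_dim=64):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)        # newly trained LN (the ln_before case)
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.non_linearity = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, d_model)

    def forward(self, x):
        return x + self.up(self.non_linearity(self.down(self.layer_norm(x))))


class AdaptedLayerFF(LayerFF):
    """LayerFF from the snippet above, with an adapter on the residual branch."""

    def __init__(self, config, bottleneck_dim=64):
        super().__init__(config)
        self.adapter = BottleneckAdapter(config.d_model, bottleneck_dim)

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)      # original pre-norm LayerNorm
        forwarded_states = self.DenseReluDense(forwarded_states)
        forwarded_states = self.adapter(forwarded_states)      # adapter before the residual add
        hidden_states = hidden_states + self.dropout(forwarded_states)
        return hidden_states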


calpt commented on May 18, 2024
  6. In the DistilBERT implementation, I see you have added adapters to the
    attention outputs too, while in the original Houlsby paper the adapter is
    always applied after the feed-forward layer. Could you explain this?

@rabeehkarimimahabadi Houlsby et al. add adapters in two places in every Transformer layer: once after the self-attention and once after the FF layer (see Figure 2 of their paper). As far as I'm aware, this is how it is implemented in DistilBERT, once for the self-attention output (sa_output) and once for the FF output (ffn_output). Please provide the specific code lines you're referring to if this is not what you meant.
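
Schematically, the two insertion points per block look like the sketch below. This is only an illustration with hypothetical module names, not the actual DistilBERT adapter code; adapter_cls could be, for example, the BottleneckAdapter sketched earlier in the thread.

from torch import nn

class PostNormBlockWithAdapters(nn.Module):
    """Post-norm (DistilBERT-style) block with the two Houlsby insertion points."""

    def __init__(self, d_model, n_heads, d_ff, adapter_cls):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sa_layer_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.output_layer_norm = nn.LayerNorm(d_model)
        self.adapter_attention = adapter_cls(d_model)  # adapter 1: after self-attention
        self.adapter_ffn = adapter_cls(d_model)        # adapter 2: after the feed-forward layer

    def forward(self, x):
        sa_output, _ = self.attention(x, x, x)
        sa_output = self.adapter_attention(sa_output)
        sa_output = self.sa_layer_norm(sa_output + x)              # post-norm residual
        ffn_output = self.ffn(sa_output)
        ffn_output = self.adapter_ffn(ffn_output)
        ffn_output = self.output_layer_norm(ffn_output + sa_output)
        return ffn_output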


rabeehkarimimahabadi commented on May 18, 2024

Thank you for the explanations.

