Comments (10)
Hi,
I would appreciate some help with adapters for custom models. Thanks.
from adapters.
Happy that you are interested in AdapterHub. Could you elaborate a bit on what you are trying to achieve and where the blockers are? Which example are you referring to and how do you want to extend it?
- In AdapterConfig, you have defined original_ln_before: bool, original_ln_after: bool, ln_before: bool, ln_after: bool. Could you tell me what the difference is between original_ln_* and ln_*, and which ones are necessary for a minimal Houlsby adapter implementation?
Conveniently, the HoulsbyConfig is available as a pre-defined config.
ln refers to the LayerNorm: either it is the one from BERT (which we call original here) or a newly trained one that is part of the adapter. Each can be configured to apply either before or after the adapter (hence the naming).
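To make the naming concrete, here is a minimal, illustrative sketch of a Houlsby-style bottleneck where the ln_* flags toggle newly trained LayerNorms inside the adapter. The original_ln_* flags would instead control whether the model's own pretrained LayerNorm runs before/after the adapter, outside this module. All names and sizes below are my own, not the library's code:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative Houlsby-style bottleneck; flag names mirror AdapterConfig."""
    def __init__(self, hidden_size, bottleneck_size, ln_before=False, ln_after=False):
        super().__init__()
        # newly trained LayerNorms, part of the adapter (the ln_* flags)
        self.ln_before = nn.LayerNorm(hidden_size) if ln_before else None
        self.ln_after = nn.LayerNorm(hidden_size) if ln_after else None
        self.down = nn.Linear(hidden_size, bottleneck_size)   # down-projection
        self.up = nn.Linear(bottleneck_size, hidden_size)     # up-projection
        self.act = nn.ReLU()

    def forward(self, hidden_states, residual):
        if self.ln_before is not None:
            hidden_states = self.ln_before(hidden_states)
        hidden_states = self.up(self.act(self.down(hidden_states)))
        hidden_states = hidden_states + residual              # adapter-internal skip
        if self.ln_after is not None:
            hidden_states = self.ln_after(hidden_states)
        return hidden_states
```

The original_ln_* variants would wrap a call to the host model's existing LayerNorm around this module instead of allocating a new one.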
- In adapter_modeling.py, you defined init_bert_weights; is this necessary for a minimal Houlsby implementation?
I believe this is not necessary.
- In the same file, in the Adapter class, which of the residual connections (before/after) and layer norms are necessary for a minimal correct implementation?
I don't know exactly what you mean by a minimal implementation. I assume you are looking for the right parametrization; see the answer to Q1.
- In the Adapter class, in the forward function, there is a TODO about multiple layer norm applications; could you explain it?
I am not sure what you are referring to. Could you provide a link to the exact line? If you mean L81, please see Activation_Function_Class.
- Assuming a neural model uses pre-norm, as in Figure 1 of https://openreview.net/pdf?id=B1x8anVFPr , could you tell me how the adapter can be applied in this case? The provided BERT examples use post-norm, which is different.
Probably you want to either learn a new layer norm before the adapter (as part of the adapter) or deactivate all of them (re-using LN after the adapter is not possible because there is apparently no LN to re-use).
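The two options above can be sketched as one small module: a bottleneck on the residual stream that either trains a fresh LayerNorm of its own or runs with no LayerNorm at all. This is a hypothetical sketch for a pre-norm model, not code from the library:

```python
import torch
import torch.nn as nn

class PreNormAdapter(nn.Module):
    """Sketch of the two options for a pre-norm model:
    use_new_ln=True trains a fresh LayerNorm inside the adapter;
    use_new_ln=False runs the bottleneck with no LayerNorm at all
    (there is no post-sublayer LN to re-use in a pre-norm block)."""
    def __init__(self, d_model, bottleneck, use_new_ln=True):
        super().__init__()
        self.ln = nn.LayerNorm(d_model) if use_new_ln else nn.Identity()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h):
        # bottleneck with internal skip connection
        return h + self.up(torch.relu(self.down(self.ln(h))))
```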
- In the DistilBERT implementation, I see you have added adapters to the attention outputs too; in the original Houlsby paper, the adapter is always applied after the feed-forward layer. Could you explain this?
This is a new feature, maybe @calpt can comment on this.
Hi,
Thank you for the reply, it was helpful. About question 5: why should the layer norm be placed before the adapter? I am currently getting very low scores with models that use pre-norm; do you have any suggestions? Thanks.
Hi,
To give some context: I am using a seq2seq model for classification. I have added adapter layers between the layers of the encoder and decoder, and I am getting 0 accuracy; the decoder output is used for classification. The documentation mentions add_classification_head. In the case where classification is done with a decoder, as in a seq2seq model, could you assist me in solving this issue, and tell me which components might be missing? Thanks.
Hi, about question 5: here is my model for a better explanation. I mark the lines where normalization and the add operation happen with ** to describe the model better. Could you tell me where to add the adapter layers, and where inside the adapter to add layer norms? I am getting low performance and would appreciate knowing the correct positions for the adapter and layer norm. Thanks.
```python
class LayerFF(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.feed_forward_proj == "relu":
            self.DenseReluDense = DenseReluDense(config)
        elif config.feed_forward_proj == "gated-gelu":
            self.DenseReluDense = DenseGatedGeluDense(config)
        else:
            raise ValueError(
                f"{config.feed_forward_proj} is not supported. Choose between `relu` and `gated-gelu`"
            )
        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)  # ** normalization **
        forwarded_states = self.DenseReluDense(forwarded_states)
        hidden_states = hidden_states + self.dropout(forwarded_states)  # ** add **
        return hidden_states


class LayerSelfAttention(nn.Module):
    def __init__(self, config, has_relative_attention_bias=False):
        super().__init__()
        self.SelfAttention = Attention(config, has_relative_attention_bias=has_relative_attention_bias)
        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        position_bias=None,
        head_mask=None,
        past_key_value=None,
        use_cache=False,
        output_attentions=False,
    ):
        normed_hidden_states = self.layer_norm(hidden_states)  # ** normalization **
        attention_output = self.SelfAttention(
            normed_hidden_states,
            mask=attention_mask,
            position_bias=position_bias,
            head_mask=head_mask,
            past_key_value=past_key_value,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        hidden_states = hidden_states + self.dropout(attention_output[0])  # ** add **
        outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output them
        return outputs
```
We re-use the original layer norm of BERT, and with that the output distribution of the adapter is rescaled to exactly the mean and variance expected by the next BERT layer. If you now use an entirely different model architecture with pre-layer norm, you perhaps need a different adapter architecture (e.g. found by experimentation). You may want to have a look at the AdapterFusion paper; it provides some insight into where to place the adapter and which components have more influence in the standard BERT setting.
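As one untested sketch of how this could look for the pre-norm LayerFF posted above: the adapter sits on the sublayer branch, after DenseReluDense and before the residual add, with its own newly trained LayerNorm. DenseReluDense is replaced by a plain two-layer FF here, and all sizes and names are illustrative, not a recommendation from the library:

```python
import torch
import torch.nn as nn

class LayerFFWithAdapter(nn.Module):
    """Pre-norm FF sublayer with a bottleneck adapter on the branch.
    The adapter trains its own LayerNorm because there is no
    post-sublayer LayerNorm to re-use in a pre-norm block."""
    def __init__(self, d_model, d_ff, bottleneck, dropout_rate=0.1):
        super().__init__()
        # stand-in for DenseReluDense
        self.dense = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                   nn.Linear(d_ff, d_model))
        self.layer_norm = nn.LayerNorm(d_model)
        self.adapter_ln = nn.LayerNorm(d_model)       # new LN, part of the adapter
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, hidden_states):
        forwarded = self.dense(self.layer_norm(hidden_states))   # original branch
        adapted = self.up(torch.relu(self.down(self.adapter_ln(forwarded))))
        forwarded = forwarded + adapted                          # adapter-internal skip
        return hidden_states + self.dropout(forwarded)           # original residual add
```

Whether the adapter LayerNorm helps here is exactly the kind of thing that likely needs to be found by experimentation, as noted above.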
- In the DistilBERT implementation, I see you have added adapters to the attention outputs too; in the original Houlsby paper, the adapter is always applied after the feed-forward layer. Could you explain this?
@rabeehkarimimahabadi Houlsby et al. add adapters in two places in every Transformer layer, once after the self-attention and once after the FF layer (see Figure 2 of their paper). As far as I'm aware, this is how it is implemented in DistilBERT, once for the self-attention output (sa_output) and once for the FF output (ffn_output). Please provide the specific code lines you're referring to if this is not what you meant.
thank you for the explanations.