

unipelt's People

Contributors

morningmoni


unipelt's Issues

Too annoying

Your code is hard to read. Could you provide a demo that fine-tunes a single Transformer layer?

Differences between code and paper?

Hey,
First of all, thanks a lot for your great work and especially for open-sourcing your code!
While studying your implementation in more detail, we believe we have found some discrepancies between the description of the architecture in the paper and the code, and we would love to get some clarification on the following points:

  1. In Figure 1 of your paper, you show which modules of the Transformer layer are injected with trainable parameters. For LoRA, it is shown that the Query and Key matrices of the attention are adapted. However, from the code, it seems you actually adapt the Query and Value matrices:
    if config.add_lora:
        self.query = LoRA_Linear(config.hidden_size, self.all_head_size, config.lora_r,
                                 lora_alpha=config.lora_alpha, add_gate=config.add_lora_gate,
                                 add_central_gate=config.add_central_gate)
    else:
        self.query = nn.Linear(config.hidden_size, self.all_head_size)
    self.key = nn.Linear(config.hidden_size, self.all_head_size)
    if config.add_lora:
        self.value = LoRA_Linear(config.hidden_size, self.all_head_size, config.lora_r,
                                 lora_alpha=config.lora_alpha, add_gate=config.add_lora_gate,
                                 add_central_gate=config.add_central_gate)
    else:
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

    Do you have any insights on which of the variants worked better in your setup?
  2. As you describe in the paper (and show in Figure 1), each trainable submodule is fitted with a single gate G, i.e. the LoRA layer has one gate that is shared between the parameters adapting the attention matrices. In the code, however, the attention matrices are adapted by separate instances of the LoRA_Linear module, each of which (by default) adds its own gating layer, resulting (from our understanding) in two gates per LoRA layer:

    UniPELT/lora_layers.py

    Lines 127 to 128 in 53d3c35

    if add_gate:
        self.lora_gate = nn.Linear(in_features, 1)

    As these variants differ slightly in parameter budget, we would be interested to know which variant you used in your experiments.
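For context, the two-gate behavior under discussion can be sketched with a minimal, hypothetical LoRA-adapted linear layer. All names and defaults here are illustrative assumptions, not the repository's actual `LoRA_Linear`:

```python
import torch
import torch.nn as nn


class LoRALinearSketch(nn.Module):
    """Hypothetical sketch of a LoRA-adapted linear layer with an
    optional per-module gate (illustrative, not the UniPELT code)."""

    def __init__(self, in_features, out_features, r=8, lora_alpha=16, add_gate=True):
        super().__init__()
        # Frozen pretrained projection (stands in for the original weight).
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Low-rank update: base(x) + scaling * B(A(x)).
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # zero init: no-op update at start
        self.scaling = lora_alpha / r
        # Per-instance gate, as in the quoted code path; a shared (central)
        # gate would instead be created once and reused across Q and V.
        self.gate = nn.Linear(in_features, 1) if add_gate else None

    def forward(self, x):
        update = self.lora_B(self.lora_A(x)) * self.scaling
        if self.gate is not None:
            update = torch.sigmoid(self.gate(x)) * update  # gate value in (0, 1)
        return self.base(x) + update


# Instantiating query and value separately gives each its own gate,
# which is the two-gates-per-layer reading described in the issue.
query = LoRALinearSketch(768, 768)
value = LoRALinearSketch(768, 768)
x = torch.randn(2, 16, 768)
print(query(x).shape)  # torch.Size([2, 16, 768])
```

A single shared gate would instead be constructed outside both instances and passed in, reducing the gating parameter count from two `in_features × 1` layers to one per Transformer layer.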

It would be great to get some insights from your side on these points. And again, thanks for sharing your code!
