Comments (9)
It seems like you could replace both self.mlp and self.mlpf with a single nn.Sequential.
It would look something like this:
class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            NewGELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.resid_pdrop),
        )

    def forward(self, x):
        block_in = x
        # x = x + self.attn(self.ln_1(x))
        attn_out = block_in + self.attn(self.ln_1(block_in))
        # x = x + self.mlp(self.ln_2(x))
        block_out = attn_out + self.mlp(self.ln_2(attn_out))
        return block_out
If you do it this way, note that there's no need to distinguish between mlp and mlpf, which I view as a positive. I also tried to refactor forward so that the variable names are clearer -- constantly mutating x makes it harder to tell what the actual connections are.
By the same token, maybe there are clearer names for ln_1 and ln_2? Numerical indices tell us nothing about where/how they are applied.
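For what it's worth, here is a quick sanity check that the two parameterizations compute the same thing when they share the same submodules (a minimal sketch: NewGELU is redefined here with minGPT's tanh approximation, and dropout is set to p=0 so both paths are deterministic):

import math

import torch
import torch.nn as nn


class NewGELU(nn.Module):
    """ minGPT's tanh approximation of the GELU activation """
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))


n_embd = 8
mlp = nn.ModuleDict(dict(
    c_fc = nn.Linear(n_embd, 4 * n_embd),
    c_proj = nn.Linear(4 * n_embd, n_embd),
    act = NewGELU(),
    dropout = nn.Dropout(0.0),  # p=0 so both paths are deterministic
))
mlpf = lambda x: mlp.dropout(mlp.c_proj(mlp.act(mlp.c_fc(x))))  # current minGPT style

# the proposed Sequential, built from the *same* submodules so weights are shared
mlp_seq = nn.Sequential(mlp.c_fc, mlp.act, mlp.c_proj, mlp.dropout)

x = torch.randn(2, 5, n_embd)
assert torch.allclose(mlpf(x), mlp_seq(x))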
This part here is not very clear:
https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L88
The use of a lambda is a bit confusing. In general, once we have stuff running, we should change the short names to long, self-descriptive names.
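One concrete drawback of the lambda, beyond readability: since self.mlpf is a plain function attribute, the whole module can no longer be pickled, so torch.save(model) on the full module (as opposed to model.state_dict()) fails. A minimal illustration of that failure mode:

import pickle

import torch.nn as nn


class WithLambda(nn.Module):
    """ mirrors the self.mlpf pattern with a lambda attribute """
    def __init__(self):
        super().__init__()
        self.f = lambda x: x + 1


try:
    pickle.dumps(WithLambda())
except Exception as e:
    # pickle refuses local lambdas: "Can't pickle local object ..."
    print(type(e).__name__, e)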
I would convert this
class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc = nn.Linear(config.n_embd, 4 * config.n_embd),
            c_proj = nn.Linear(4 * config.n_embd, config.n_embd),
            act = NewGELU(),
            dropout = nn.Dropout(config.resid_pdrop),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x))))  # MLP forward

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlpf(self.ln_2(x))
        return x
into
class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.n_embd)
        self.dropout_1 = nn.Dropout(config.resid_pdrop)
        self.causal_multi_head_attention = CausalSelfAttention(config)
        self.layer_norm_2 = nn.LayerNorm(config.n_embd)
        self.dropout_2 = nn.Dropout(config.attn_pdrop)
        self.mlp = nn.ModuleDict(dict(
            c_fc = nn.Linear(config.n_embd, 4 * config.n_embd),
            c_proj = nn.Linear(4 * config.n_embd, config.n_embd),
            act = NewGELU(),
        ))

    def feed_forward(self, x):
        return self.mlp.c_proj(self.mlp.act(self.mlp.c_fc(x)))

    def forward(self, x):
        # NOTE: We need to remove dropout from CausalSelfAttention
        x = x + self.dropout_1(self.causal_multi_head_attention(self.layer_norm_1(x)))
        x = x + self.dropout_2(self.feed_forward(self.layer_norm_2(x)))
        return x
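About the NOTE in forward: in the current mingpt/model.py the attention module already applies a residual dropout after its output projection, so keeping it while also applying dropout_1 here would drop out twice. A sketch of the removal, assuming the attention code still looks like the linked file:

# in CausalSelfAttention.__init__, this line becomes redundant and can go:
#     self.resid_dropout = nn.Dropout(config.resid_pdrop)
# and the end of CausalSelfAttention.forward returns the projection directly:
#     y = self.resid_dropout(self.c_proj(y))   # before
#     y = self.c_proj(y)                       # after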
Can you update with the Sequential() version, @bpopeters?
The naming convention of @ramon-astudillo looks good. If we want to avoid numerical indices we can use pre/post_attention, e.g. replace self.layer_norm_1 with self.layer_norm_pre_att and self.layer_norm_2 with self.layer_norm_post_att.
I will start making changes locally, then we can merge after testing with the notebooks.
This one integrates both mine and @bpopeters's suggestions:
class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        # Causal Multi Head Attention
        # NOTE: We need to remove dropout from CausalSelfAttention
        self.causal_multi_head_attention = CausalSelfAttention(config)
        # Feed Forward
        self.feed_forward = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            NewGELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
        )
        # Dropout and Layer normalization
        self.layer_norm_1 = nn.LayerNorm(config.n_embd)
        self.dropout_1 = nn.Dropout(config.resid_pdrop)
        self.layer_norm_2 = nn.LayerNorm(config.n_embd)
        self.dropout_2 = nn.Dropout(config.attn_pdrop)

    def forward(self, x):
        block_in = x
        attn_out = block_in + self.dropout_1(self.causal_multi_head_attention(self.layer_norm_1(block_in)))
        block_out = attn_out + self.dropout_2(self.feed_forward(self.layer_norm_2(attn_out)))
        return block_out
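For reference, a quick shape check of this version (a sketch: it assumes Block, CausalSelfAttention and NewGELU are in scope, e.g. imported from mingpt.model, and that these are the only config fields the attention needs):

from types import SimpleNamespace

import torch

config = SimpleNamespace(
    n_embd=48, n_head=3, block_size=16,
    attn_pdrop=0.1, resid_pdrop=0.1,
)
block = Block(config)
x = torch.randn(2, config.block_size, config.n_embd)
print(block(x).shape)  # expected: torch.Size([2, 16, 48])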
@grigvardanyan you can keep the changes in a branch and then, once we have a running version from @venelink and @gonmelo, we can start moving parts there piece by piece to ensure nothing breaks.
I tested the changed version of the CausalSelfAttention layer; it works fine and I can push it now.
The only problem is the naming convention, which we also want to change. The from_pretrained function maps the Hugging Face model's layers to the custom GPT model's layers using the name of each layer. One solution would be to change the mapping part so that it maps our names to the HF model's names (a sketch of that idea follows the snippet below), but for the exercise we also need to guide the students toward our preferred layer names, otherwise loading the weights will break.
E.g.:
from collections import OrderedDict  # needed for the named Sequential below

# TODO: Complete the Transformer layer
self.causal_multi_head_attention = #
self.feed_forward = nn.Sequential(OrderedDict([
    ("fully_connected_first", ),
    ("gelu", ),
    ("fully_connected_last", ),
]))
self.layer_norm_1 = #
self.dropout_1 = #
self.layer_norm_2 = #
self.dropout_2 = #
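And a minimal sketch of the mapping idea mentioned above. The HF names come from GPT-2's state_dict, while our names are illustrative and depend on what we finally settle on; the real from_pretrained would apply this per transformer.h.<i>. block prefix (and would still have to transpose the Conv1D weights, which the renaming doesn't change):

# hypothetical Hugging Face name -> our name, for one Block's state_dict keys
RENAMES = {
    "mlp.c_fc.": "feed_forward.fully_connected_first.",
    "mlp.c_proj.": "feed_forward.fully_connected_last.",
    "attn.": "causal_multi_head_attention.",
    "ln_1.": "layer_norm_1.",
    "ln_2.": "layer_norm_2.",
}

def to_our_key(hf_key):
    # rewrite the prefix of one HF state_dict key into our naming scheme
    for hf_prefix, our_prefix in RENAMES.items():
        if hf_key.startswith(hf_prefix):
            return our_prefix + hf_key[len(hf_prefix):]
    return hf_key

print(to_our_key("mlp.c_fc.weight"))  # feed_forward.fully_connected_first.weight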