
Comments (9)

bpopeters commented on August 16, 2024

It seems like you could replace both self.mlp and self.mlpf with a single nn.Sequential.


bpopeters commented on August 16, 2024

It would look something like this:

class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)

        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            NewGELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.resid_pdrop)
        )

    def forward(self, x):
        block_in = x
        # x = x + self.attn(self.ln_1(x))
        attn_out = block_in + self.attn(self.ln_1(block_in))

        # x = x + self.mlp(self.ln_2(x))
        block_out = attn_out + self.mlp(self.ln_2(attn_out))
        return block_out

If you do it this way, note that there's no need to distinguish between mlp and mlpf, which I view as a positive. I also tried to refactor forward so that the variable names are clearer -- constantly mutating x makes it harder to tell what the actual connections are.

By the same token, maybe there are clearer names for ln_1 and ln_2? Numerical indices tell us nothing about where/how they are applied.


ramon-astudillo commented on August 16, 2024

This part here is not very clear:

https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L88

The use of a lambda is a bit confusing. In general, once we have things running, we should replace the short names with longer, self-descriptive names.
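
For reference, the line in question builds the MLP forward pass as a lambda over the ModuleDict. One low-effort way to make it self-descriptive (the suggestions below go further) would be a named method instead of the lambda; the method name here is just an example:

# as a method on Block, instead of:
#     m = self.mlp
#     self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x)))) # MLP forward
def mlp_forward(self, x):
    """Feed-forward sub-layer: expand, apply GELU, project back down, then dropout."""
    m = self.mlp
    return m.dropout(m.c_proj(m.act(m.c_fc(x))))

forward would then call self.mlp_forward(self.ln_2(x)) rather than self.mlpf(self.ln_2(x)).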


ramon-astudillo commented on August 16, 2024

I would convert this

class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
            act     = NewGELU(),
            dropout = nn.Dropout(config.resid_pdrop),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x)))) # MLP forward

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlpf(self.ln_2(x))
        return x

into

class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.n_embd)
        self.dropout_1 = nn.Dropout(config.resid_pdrop)
        self.causal_multi_head_attention = CausalSelfAttention(config)
        self.layer_norm_2 = nn.LayerNorm(config.n_embd)
        self.dropout_2 = nn.Dropout(config.resid_pdrop)  # note: minGPT's config has no att_pdrop; the original MLP dropout uses resid_pdrop
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
            act     = NewGELU()
        ))

    def feed_forward(self, x):
        return self.mlp.c_proj(self.mlp.act(self.mlp.c_fc(x)))

    def forward(self, x):
        # NOTE: We need to remove dropout from CausalSelfAttention
        x = x + self.dropout_1(self.causal_multi_head_attention(self.layer_norm_1(x)))
        x = x + self.dropout_2(self.feed_forward(self.layer_norm_2(x)))
        return x
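
To make the NOTE concrete: in the current minGPT CausalSelfAttention, the output projection is wrapped in a residual dropout (self.resid_dropout). With this refactor that dropout moves into Block as dropout_1, so the tail of the attention forward would change roughly like this (a sketch against the current code, not a tested change):

# end of CausalSelfAttention.forward, once y has been reassembled to (B, T, C)

# before: residual dropout applied inside the attention module
# y = self.resid_dropout(self.c_proj(y))

# after: return the projection only; Block.dropout_1 now applies that dropout
y = self.c_proj(y)
return y

The attention-weight dropout (attn_pdrop, applied to the softmaxed scores) still has to stay inside CausalSelfAttention, since it acts before the weighted sum over the values.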


ramon-astudillo commented on August 16, 2024

Can you update this with the Sequential() version, @bpopeters?


GrigVardanyanHS commented on August 16, 2024

The naming convention from @ramon-astudillo looks good. If we want to avoid numerical indices, we can use pre/post-attention names, e.g. replace self.layer_norm_1 with self.layer_norm_pre_att and self.layer_norm_2 with self.layer_norm_post_att.

I will start making the changes locally; then we can merge after testing with the notebooks.


ramon-astudillo commented on August 16, 2024

This one integrates both my version and @bpopeters's:

class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()

        # Causal Multi Head Attention
        # NOTE: We need to remove dropout from CausalSelfAttention
        self.causal_multi_head_attention = CausalSelfAttention(config)

        # Feed Forward
        self.feed_forward = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            NewGELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
        )

        # Dropout and Layer normalization
        self.layer_norm_1 = nn.LayerNorm(config.n_embd)
        self.dropout_1 = nn.Dropout(config.resid_pdrop)
        self.layer_norm_2 = nn.LayerNorm(config.n_embd)
        self.dropout_2 = nn.Dropout(config.resid_pdrop)  # as above: use resid_pdrop here (att_pdrop does not exist in the config)

    def forward(self, x):
        block_in = x
        attn_out = block_in + self.dropout_1(self.causal_multi_head_attention(self.layer_norm_1(block_in)))
        block_out = attn_out + self.dropout_2(self.feed_forward(self.layer_norm_2(attn_out)))
        return block_out
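
As a quick sanity check, a forward pass through this combined version should be shape-preserving. The snippet below is only a sketch: it assumes Block and CausalSelfAttention as defined above are importable, and it fakes the config with a SimpleNamespace carrying the usual minGPT field names.

import torch
from types import SimpleNamespace

# stand-in config; field names follow mingpt/model.py (attn_pdrop is still
# needed inside CausalSelfAttention for the attention-weight dropout)
config = SimpleNamespace(
    n_embd=48, n_head=3, block_size=16,
    resid_pdrop=0.1, attn_pdrop=0.1,
)

block = Block(config)                    # the refactored Block above
x = torch.randn(2, 16, config.n_embd)    # (batch, sequence, embedding)
y = block(x)
assert y.shape == x.shape                # the block preserves the input shape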


ramon-astudillo commented on August 16, 2024

@grigvardanyan you can keep the changes in a branch and then, once we have a running version from @venelink and @gonmelo, we can start moving parts there piece by piece to ensure nothing breaks.


grigvardanyan commented on August 16, 2024

I tested the changed version of the CausalSelfAttention layer; it works fine and I can push it now.
The only remaining problem is the naming convention, which we also want to change. The from_pretrained function maps the Hugging Face model's layers to the custom GPT model's layers by layer name. One solution is to change the mapping so that it translates our names to the HF model's names (see the sketch after the example below), but for the exercise we also need to guide students to use our preferred layer names, otherwise loading the weights will break.

e.g.:

# TODO: Complete the Transformer layer

self.causal_multi_head_attention = #

self.feed_forward = nn.Sequential(OrderedDict([
    ("fully_connected_first", ),
    ("gelu", ),
    ("fully_connected_last", ),
]))

self.layer_norm_1 = #
self.dropout_1 = #
self.layer_norm_2 = #
self.dropout_2 = #
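
As a sketch of the mapping idea mentioned above: the prefix table below is hypothetical and would need one entry per renamed sub-module (parameter-free layers such as the GELU and the dropouts need none), and minGPT's from_pretrained additionally transposes the Conv1D-style attention/MLP weights, which is not shown here.

# hypothetical map from our layer names to the names used by the current
# minGPT Block / HF GPT-2 checkpoint, which from_pretrained already handles
OUR_TO_HF = {
    "causal_multi_head_attention.": "attn.",
    "feed_forward.fully_connected_first.": "mlp.c_fc.",
    "feed_forward.fully_connected_last.": "mlp.c_proj.",
    "layer_norm_1.": "ln_1.",
    "layer_norm_2.": "ln_2.",
}

def to_hf_name(key):
    """Translate one of our parameter names into the checkpoint's name."""
    for ours, theirs in OUR_TO_HF.items():
        key = key.replace(ours, theirs)
    return key

# inside from_pretrained, roughly: for each parameter name k of our model,
# copy from sd_hf[to_hf_name(k)] instead of sd_hf[k]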

