
Comments (4)

epwalsh commented on August 12, 2024

Here's my attempt to list them, let me know if I'm missing anything:

  • SwiGLU Activation instead of ReLU / GeLU. #29
    This is defined as Swish(xW) * xV, where * is an elementwise product and W and V are two separate projection matrices.
  • Parallel formulation of the transformer block. #18
    Instead of y = x + MLP(LN(x + Attention(LN(x)))) they do y = x + MLP(LN(x)) + Attention(LN(x))
  • Multi-Query Attention. #30
    For k attention heads with head size h, we usually project the input into query, key, and value tensors, each of shape (k, h). In PaLM the key and value projections are shape (1, h), i.e. all attention heads share the same key/value projection, while the query is still projected to shape (k, h). This results in more efficient inference during decoding, since the cached keys and values are k times smaller.
  • RoPE embeddings. #63
    ⁉️ I'm not sure why they went with RoPE instead of ALiBi. Maybe because implementing flash attn with support for ALiBi was too complex? (PyTorch's built-in flash attn doesn't work with ALiBi)
  • Shared input-output embeddings.
    We're already doing this.
  • No bias terms. #27
  • They use a SentencePiece tokenizer with a vocabulary of 256k tokens.
    They construct the vocabulary in such a way that tokenization is completely reversible, as it is with byte-level BPE tokenizers like GPT's. That's quite a large vocab size though.
  • Scaling "logits" (pre-softmax outputs) by $1 / \sqrt{d_{model}}$. See Section 5 under "Weight initialization."
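
To make the first two items concrete, here is a rough pure-Python sketch of the SwiGLU formula and of the two block formulations exactly as written in the list above. The helper names (`matvec`, `add`, `serial_block`, `parallel_block`) and the toy callables passed in for attention, MLP, and layer norm are made up for illustration; this is not OLMo or PaLM code.

```python
import math


def swish(z):
    # Swish (SiLU): z * sigmoid(z)
    return z / (1.0 + math.exp(-z))


def matvec(x, W):
    # x has d_in entries; W is a list of d_in rows, each with d_out entries.
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]


def swiglu(x, W, V):
    # Swish(xW) * xV, with * taken elementwise.
    return [swish(a) * b for a, b in zip(matvec(x, W), matvec(x, V))]


def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]


def serial_block(x, attention, mlp, ln):
    # y = x + MLP(LN(x + Attention(LN(x)))), as written in the list above.
    return add(x, mlp(ln(add(x, attention(ln(x))))))


def parallel_block(x, attention, mlp, ln):
    # y = x + MLP(LN(x)) + Attention(LN(x)); LN(x) is computed once and reused.
    normed = ln(x)
    return add(x, add(mlp(normed), attention(normed)))
```

In the parallel form the MLP no longer depends on the attention output, so both branches can read the same normalized input.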

There's a PyTorch implementation of PaLM here: https://github.com/lucidrains/PaLM-pytorch
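
Independent of that repo, the multi-query attention item can be sketched for a single decoding step in plain Python. Assumptions: `queries` holds one vector per head, while `keys` and `values` are the cached vectors shared by every head; the function names are hypothetical and there is no batching, masking, or output projection.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_query_attention(queries, keys, values):
    """queries: k vectors of size h (one per head).
    keys, values: t cached vectors of size h, shared by all heads."""
    h = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(h)
                  for key in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(h)])
    return outputs


def kv_cache_entries_per_token(k, h, multi_query):
    # Keys plus values: one (h,) pair when shared, k pairs otherwise.
    return 2 * h if multi_query else 2 * k * h
```

The cache-size helper is where the decoding benefit shows up: with shared keys and values, each cached token stores k times fewer numbers.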

from olmo.

dirkgr commented on August 12, 2024

@epwalsh, is this done? Scaling logits, do we care?


epwalsh commented on August 12, 2024

Scaling logits got overlooked for bigger things, but we should at least have the option implemented. I'll take care of it.
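
For reference, a minimal sketch of what such an option could look like, assuming the scaling factor is $1 / \sqrt{d_{model}}$ as in the list above. The function name and the `enabled` flag are hypothetical and are not taken from the OLMo codebase.

```python
import math


def scale_logits(logits, d_model, enabled=True):
    # Multiply pre-softmax output logits by 1 / sqrt(d_model) when enabled.
    if not enabled:
        return list(logits)
    scale = 1.0 / math.sqrt(d_model)
    return [z * scale for z in logits]
```

Smaller logits give a flatter softmax distribution, so this mainly matters together with the choice of weight initialization.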


epwalsh commented on August 12, 2024

Scaling logits implemented in #239
