Comments (4)
Here's my attempt to list them, let me know if I'm missing anything:
- SwiGLU activation instead of ReLU / GeLU. #29
  This is defined as `Swish(xW) * xV` (see the sketch after this list).
- Parallel formulation of the transformer block. #18
  Instead of `y = x + MLP(LN(x + Attention(LN(x))))` they do `y = x + MLP(LN(x)) + Attention(LN(x))` (see the sketch after this list).
- Multi-Query Attention. #30
  For `k` attention heads with head size `h`, we usually project the input into `query`, `key`, and `value`, each of shape `(k, h)`. In PaLM the `key` and `value` projections have shape `(1, h)`, i.e. they use the same projection for every attention head. This makes inference during decoding more efficient (see the sketch after this list).
- RoPE embeddings. #63
  ⁉️ I'm not sure why they went with RoPE instead of ALiBi. Maybe because implementing flash attention with support for ALiBi was too complex? (PyTorch's built-in flash attention doesn't work with ALiBi.)
- Shared input-output embeddings.
  We're already doing this.
- No bias terms. #27
- They use a SentencePiece tokenizer with a vocabulary of 256k tokens.
  They construct the vocabulary in such a way that tokenization is completely reversible, as it is with byte-level BPE tokenizers like GPT's. That's quite a large vocab size, though.
- Scaling "logits" (pre-softmax outputs) by $1 / \sqrt{d_{model}}$. See Section 5 under "Weight initialization."

There's a PyTorch implementation of PaLM here: https://github.com/lucidrains/PaLM-pytorch
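
To make the SwiGLU item concrete, here's a minimal PyTorch sketch of a `Swish(xW) * xV` feed-forward layer. The module and dimension names are mine, not PaLM's or OLMo's; `F.silu` is PyTorch's Swish with beta = 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Sketch of a SwiGLU MLP: Swish(xW) * xV followed by a down projection.
    Hyperparameters are illustrative; bias-free per the "no bias terms" item."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)    # gate projection W
        self.v = nn.Linear(d_model, d_hidden, bias=False)    # value projection V
        self.proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(F.silu(self.w(x)) * self.v(x))      # Swish(xW) * xV
```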
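Likewise, a sketch of the parallel block, assuming `attn` and `mlp` are any modules mapping `(B, T, d_model)` to the same shape. Whether the two branches share one LayerNorm (as here) or use separate ones is an implementation choice.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Sketch of the parallel formulation: both branches read LN(x) and are
    summed with the residual, rather than being applied sequentially."""

    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = attn
        self.mlp = mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln(x)
        # y = x + Attention(LN(x)) + MLP(LN(x))
        return x + self.attn(h) + self.mlp(h)
```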
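And a sketch of multi-query attention. Everything here (names, the missing causal mask, no dropout) is illustrative; the point is that the query keeps `k` heads while a single `(1, h)` key/value projection is broadcast across all of them.

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Sketch of multi-query attention: k query heads of size h, but a single
    shared key/value head broadcast across all query heads."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads                            # h
        self.q = nn.Linear(d_model, d_model, bias=False)              # k heads of size h
        self.kv = nn.Linear(d_model, 2 * self.head_dim, bias=False)   # one (1, h) key and value
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)  # (B, k, T, h)
        k, v = self.kv(x).split(self.head_dim, dim=-1)       # each (B, T, h)
        k = k.unsqueeze(1)                                   # (B, 1, T, h), broadcast over heads
        v = v.unsqueeze(1)
        att = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (B, k, T, T)
        y = att.softmax(dim=-1) @ v                              # (B, k, T, h)
        return self.out(y.transpose(1, 2).reshape(B, T, -1))
```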
@epwalsh, is this done? Scaling logits, do we care?
Scaling logits got overlooked for bigger things, but we should at least have the option implemented. I'll take care of it.
Scaling logits implemented in #239
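For reference, a minimal sketch of what such an option can look like. The names and structure here are hypothetical, not the code from #239: with shared input-output embeddings, compute logits via the embedding matrix and scale them by $1 / \sqrt{d_{model}}$ before the softmax.

```python
import math
import torch
import torch.nn as nn

# Hypothetical sketch of a logit-scaling option, not the actual code from #239.
d_model, vocab_size = 512, 256_000
wte = nn.Embedding(vocab_size, d_model)  # input embedding, reused as the output head

def lm_logits(hidden: torch.Tensor, scale_logits: bool = True) -> torch.Tensor:
    logits = hidden @ wte.weight.t()     # shared input-output embeddings: (B, T, vocab_size)
    # Scale pre-softmax logits by 1 / sqrt(d_model), as in PaLM Section 5.
    return logits / math.sqrt(d_model) if scale_logits else logits
```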