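Compute the arithmetic intensity of the phi-1 model

This is a case study of how to compute the arithmetic intensity of the phi-1 model for training and inference. Arithmetic intensity is the ratio of floating-point operations to memory traffic (FLOPS / IO in the tables below).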

Model architecture

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2048)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (dense): Linear(in_features=2048, out_features=2048, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2048, out_features=8192, bias=True)
          (fc2): Linear(in_features=8192, out_features=2048, bias=True)
        )
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2048, out_features=51200, bias=True)
)
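
As a cross-check on the printout above, the parameter count can be tallied directly from the layer shapes. The sketch below is plain Python; the helper name `linear_params` is ours and not part of transformers. It assumes that every Linear has a bias and every LayerNorm has a weight and a bias, as the printout indicates, and that the rotary embedding holds no trainable parameters.

```python
# Shapes copied from the PhiForCausalLM printout above.
VOCAB, HIDDEN, FFN, LAYERS = 51200, 2048, 8192, 24


def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of one Linear layer (hypothetical helper)."""
    return in_features * out_features + (out_features if bias else 0)


layernorm = 2 * HIDDEN  # elementwise_affine=True: weight + bias
attention = 4 * linear_params(HIDDEN, HIDDEN)  # q_proj, k_proj, v_proj, dense
mlp = linear_params(HIDDEN, FFN) + linear_params(FFN, HIDDEN)  # fc1, fc2
per_layer = attention + mlp + layernorm

total = (
    VOCAB * HIDDEN                  # embed_tokens
    + LAYERS * per_layer            # 24 decoder layers
    + layernorm                     # final_layernorm
    + linear_params(HIDDEN, VOCAB)  # lm_head (separate from the embedding)
)
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the tally comes to 1,418,270,720 parameters, about 1.42 billion.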

Model config

{
  "_name_or_path": "microsoft/phi-1",
  "architectures": [
    "PhiForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_phi.PhiConfig",
    "AutoModelForCausalLM": "modeling_phi.PhiForCausalLM"
  },
  "attention_dropout": 0.0,
  "bos_token_id": null,
  "embd_pdrop": 0.0,
  "eos_token_id": null,
  "hidden_act": "gelu_new",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "phi",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "num_key_value_heads": null,
  "partial_rotary_factor": 0.5,
  "qk_layernorm": false,
  "resid_pdrop": 0.0,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.37.0",
  "use_cache": true,
  "vocab_size": 51200
}

Table of the model setting

Symbol	Definition	Shape	value
Scalars
Input Shape
b	Batch size	1
s	Sequence length	1
M	Size of SRAM	1
Model Hyper-parameter
n	Number of attention heads	1	32
d	Hidden state size of one head	1	64
h	Hidden state size (h = n*d)	1	2048
Parameters
W_Q, W_K, W_V	Projection for Q, K, V	(h, h)	(2048, 2048)
W_O	Projection for self-attention output	(h, h)	(2048, 2048)
W1	First layer in the FFN	(h, 4h)	(2048, 8192)
W2	Second layer in the FFN	(4h, h)	(8192, 2048)

Table of self-attention

Symbol	Definition	Shape	value	FLOPS	IO
Input (training: full sequence of s tokens)
X	Input for self attention	(b,s,h)	(b,s,2048)
Self-Attention
Q^{^^}, K^{^^}, V^{^^}	XW_Q, XW_K, XW_V	(b, s, h)	(b, s, 2048)	3*2bsh²	3(2bsh + h²)
Q^{^}, K^{^}, V^{^}	Reshape Q^{^^}, K^{^^}, V^{^^}	(b,s,n,d)
Q, K, V	Transpose Q^{^}, K^{^}, V^{^}	(b,n,s,d)
K^T	Transpose K	(b,n,d,s)
P	Softmax(QK^T/sqrt(d))	(b,n,s,s)	(b,32,s,s)	3nbs²	2bsnd+bs²n
A^{^^}	PV	(b,n,s,d)	(b,32,s,64)	2bs²nd	2bsnd+bs²n
A^{^}	Transpose A^{^^}	(b,s,n,d)
A	Reshape A^{^}	(b,s,h)
Y	AW_O	(b,s,h)	(b,s,2048)	2bsh²	2bsh+h²
MLP
Z	GELU(YW₁)W₂	(b,s,h)	(b,s,2048)	64bsh²	4bsh+16h²
Input (inference: one new token with KV cache)
x	Input for self-attention	(b,1,h)
K_s, V_s	Key/Value cache from past tokens	(b,n,s,d)
Self-Attention
q^{^^}, k^{^^}, v^{^^}	xW_Q, xW_K, xW_V	(b, 1, h)	(b, 1, 2048)	3*2bh²	3(2bh + h²)
q^{^}, k^{^}, v^{^}	Reshape q^{^^}, k^{^^}, v^{^^}	(b,1,n,d)
q, k, v	Transpose q^{^}, k^{^}, v^{^}	(b,n,1,d)
K, V	concat(K_s,k), concat(V_s, v)	(b,n,s+1,d)
K^T	Transpose K	(b,n,d,s+1)
p	Softmax(qK^T/sqrt(d))	(b,n,1,s+1)	(b,32,1,s+1)	3bsnd	bsn+bsnd+bnd
a^{^^}	pV	(b,n,1,d)	(b,32,1,64)	2bsnd	bsn+bsnd+bnd
a^{^}	Transpose a^{^^}	(b,1,n,d)	(b,1,32,64)
a	Reshape a^{^}	(b,1,h)
y	aW_O	(b,1,h)	(b,1,2048)	2bh²	2bh+h²
MLP
z	GELU(yW₁)W₂	(b,1,h)	(b,1,2048)	64bh²	4bh+16h²
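
The projection rows (XW_Q, XW_K, XW_V, AW_O) all follow the same matmul rule, which can be sketched as below. This is a minimal illustration, assuming IO is counted in elements and that each operand is read once and the result written once; the function names are ours.

```python
def matmul_cost(rows, inner, cols):
    """FLOPS and element IO for a (rows, inner) @ (inner, cols) matmul."""
    flops = 2 * rows * inner * cols                   # one multiply + one add per term
    io = rows * inner + inner * cols + rows * cols    # read both operands, write result
    return flops, io


def projection_intensity(b, s, h=2048):
    """Arithmetic intensity of one (b*s, h) @ (h, h) projection."""
    flops, io = matmul_cost(b * s, h, h)
    return flops / io


# The rule reproduces the table entries 2bsh² and 2bsh + h².
b, s, h = 4, 128, 2048
assert matmul_cost(b * s, h, h) == (2 * b * s * h * h, 2 * b * s * h + h * h)

full_sequence = projection_intensity(b=1, s=2048)  # training-style row
single_token = projection_intensity(b=1, s=1)      # inference row with s = 1
print(f"full sequence: {full_sequence:.1f}  single token: {single_token:.3f}")
```

With one token per step the weight matrix dominates the IO, so the intensity of a projection stays just below 2 FLOPS per element, while a full 2048-token sequence reaches roughly 1365.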

With FlashAttention

This table is based on the implementation at https://huggingface.co/microsoft/phi-1/blob/main/modeling_phi.py, where FlashAttention is applied after Q, K, and V are transposed.

Symbol	Definition	Shape	value	FLOPS	IO
Input (training: full sequence of s tokens)
X	Input for self attention	(b,s,h)	(b,s,2048)
Flash-Attention
Q^{^^}, K^{^^}, V^{^^}	XW_Q, XW_K, XW_V	(b, s, h)	(b, s, 2048)	3*2bsh²	3(2bsh + h²)
Q^{^}, K^{^}, V^{^}	Reshape Q^{^^}, K^{^^}, V^{^^}	(b,s,n,d)
Q, K, V	Transpose Q^{^}, K^{^}, V^{^}	(b,n,s,d)
K^T	Transpose K	(b,n,d,s)
A=flash_attention(Q,K,V)	FlashAttention	(b, s, h)	(b, s, 2048)	3nbs² + 2bs²nd	s²d²/M
Y	AW_O	(b,s,h)	(b,s,2048)	2bsh²	2bsh+h²
MLP
Z	GELU(YW₁)W₂	(b,s,h)	(b,s,2048)	64bsh²	4bsh+16h²
Input (inference: one new token with KV cache)
x	Input for self-attention	(b,1,h)
K_s, V_s	Key/Value cache from past tokens	(b,n,s,d)
Flash-Attention
q^{^^}, k^{^^}, v^{^^}	xW_Q, xW_K, xW_V	(b, 1, h)	(b, 1, 2048)	3*2bh²	3(2bh + h²)
q^{^}, k^{^}, v^{^}	Reshape q^{^^}, k^{^^}, v^{^^}	(b,1,n,d)
q, k, v	Transpose q^{^}, k^{^}, v^{^}	(b,n,1,d)
K, V	concat(K_s,k), concat(V_s, v)	(b,n,s+1,d)
K^T	Transpose K	(b,n,d,s+1)
a=flash_attention(q,K,V)	FlashAttention	(b, 1, h)	(b, 1, 2048)	5bsnd	s²d²/M
y	aW_O	(b,1,h)	(b,1,2048)	2bh²	2bh+h²
MLP
z	GELU(yW₁)W₂	(b,1,h)	(b,1,2048)	64bh²	4bh+16h²
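
To make the flash attention row more concrete, here is a minimal pure-Python sketch of the idea behind it for a single query and a single head: K and V are visited one block at a time and the softmax is accumulated online, so the full row of scores is never stored. It is an illustration only, not the CUDA kernel used by the linked implementation; the function names and the block size are assumptions, and masking and dropout are left out.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_reference(q, keys, values):
    """Standard attention for one query: materializes all s scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(x - top) for x in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def attention_tiled(q, keys, values, block=4):
    """Same result, reading K/V block by block with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        weights = [math.exp(x - new_max) for x in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


random.seed(0)
s, d = 10, 8  # toy sizes; phi-1 uses d = 64 per head
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(s)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(s)]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block=4)
print(max(abs(x - y) for x, y in zip(ref, tiled)))
```

Both functions perform the same multiplications and additions up to the rescaling step, which is why the FLOPS column matches standard attention while only the IO column changes.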

Result analysis

Only the forward pass is considered in the tables. In the backward pass, FlashAttention recomputes the attention scores S = QK^T/sqrt(d) and the probabilities P instead of reading them back from memory, so its FLOPS may be higher than those of standard self-attention. In big-O notation, however, the results are consistent.
Comparing the two tables, every step other than the attention itself is exactly the same, so the FLOPS of the two algorithms are at the same big-O level. The IO differs: the IO of FlashAttention is s²d²/M, so it shrinks in proportion to 1/M (M is the size of the SRAM), whereas the IO of standard self-attention does not depend on M.
Given the phi-1 model structure, we can compare the arithmetic intensity of standard self-attention and FlashAttention. The following table considers inference only (auto-regressive decoding).

Algorithm	FLOPS	IO	Arithmetic intensity
FlashAttention	5bsnd	s²d²/M	5bnM/(sd)
Standard self-attention	5bsnd	2bsn+2bsnd+2bnd	5sd/(2s+2sd+2d)

So the arithmetic intensity of FlashAttention grows with the size of the SRAM, while that of standard self-attention does not depend on it. Based on the FlashAttention paper, we can expect a speedup of around 3x.
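
The comparison can be evaluated numerically by dividing the FLOPS column by the IO column. The sketch below uses the inference rows of the two tables (p and a for standard attention, the flash attention row for FlashAttention). The value chosen for M is an assumption: 192 KB of SRAM on one SM holding float32 elements. The sequence length s = 2048 is likewise just an example.

```python
def decode_attention_intensity(b, s, n, d, sram_elements):
    """Arithmetic intensity of the attention step during decoding."""
    flops = 5 * b * s * n * d  # FLOPS column, same for both algorithms
    # Standard attention: the p and a rows each move bsn + bsnd + bnd elements.
    standard_io = 2 * (b * s * n + b * s * n * d + b * n * d)
    # FlashAttention: IO column of the FlashAttention table.
    flash_io = s ** 2 * d ** 2 / sram_elements
    return {"standard": flops / standard_io, "flash": flops / flash_io}


# Assumption: M = 192 KB per SM, 4 bytes per float32 element.
SRAM_ELEMENTS = 192 * 1024 // 4

result = decode_attention_intensity(b=1, s=2048, n=32, d=64,
                                    sram_elements=SRAM_ELEMENTS)
print(result)
```

The absolute numbers are only indicative, since s²d²/M is an asymptotic count with constants dropped. What the sketch does show is that doubling `sram_elements` doubles the FlashAttention intensity and leaves the standard one unchanged.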

DRAM vs SRAM

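SRAM (static random-access memory) and DRAM (dynamic random-access memory) are both types of random-access memory (RAM) hardware. On a GPU, SRAM is the small, fast on-chip memory, while DRAM is the larger, slower off-chip memory (the HBM in the FlashAttention paper).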

What's on-chip SRAM in the paper?

on-chip SRAM = L1/shared memory per SM * number of SMs

The paper uses an A100, so this number is 192 KB * 108 SMs ≈ 20 MB.
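
The arithmetic can be checked in a few lines. The element count at the end assumes float32 values (4 bytes each), matching the torch_dtype in the model config; the variable names are ours.

```python
SRAM_PER_SM_KB = 192  # L1/shared memory per SM on the A100
NUM_SMS = 108         # number of SMs on the A100

total_kb = SRAM_PER_SM_KB * NUM_SMS   # total on-chip SRAM in KB
total_mb = total_kb / 1024            # same amount in MB
fp32_elements = total_kb * 1024 // 4  # assumption: 4 bytes per element

print(f"{total_kb} KB = {total_mb} MB = {fp32_elements:,} float32 elements")
```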
