System Info text-generation-inference version: 2.0.4 using sta

GPU memory not saturated using microsoft/Phi-3-small-128k-instruct about text-generation-inference HOT 1 OPEN

calwoo commented on June 30, 2024

GPU memory not saturated using microsoft/Phi-3-small-128k-instruct

Comments (1)

OlivierDehaene commented on June 30, 2024

The difference in behaviour arises because phi3 small is not natively supported, whereas mini and medium are. Phi3 small therefore falls back to the AutoModel implementation, which requires a lot more memory and is not at feature parity with native models (no flash/paged attention, padding...).

Phi3 small is really different from the other flavours, from what I could find so far:

Uses block sparse attention with dense attention every n layers.
Does not use custom softmax scaling (not sqrt(head_size)).
Does not use regular gating in the MLP layers; it uses the gegelu activation instead, which gates internally with a linear gate. A big difference, however, is that the gate and non-gate parameters are interleaved (a_gelu, a_linear = input[..., ::2], input[..., 1::2]).
Residual + layernorm (not RMS) is arranged differently than in Llama.
Column-packed QKV with GQA
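
To illustrate the interleaving point above, here is a minimal sketch in plain Python of how an interleaved split differs from the more common "two halves" split of a fused MLP projection. The helper names (split_interleaved, split_halves, gated_gelu) are hypothetical, and the exact GELU variant, clamping, or offsets used by the real gegelu are not reproduced here; only the parameter layout is shown.

```python
import math


def gelu(x: float) -> float:
    # Exact (erf-based) GELU; the real gegelu may use a different approximation.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def split_interleaved(row):
    # Interleaved layout: even positions feed the GELU branch, odd positions the linear branch.
    # Mirrors: a_gelu, a_linear = input[..., ::2], input[..., 1::2]
    return row[0::2], row[1::2]


def split_halves(row):
    # Chunked layout: first half is one branch, second half is the other.
    mid = len(row) // 2
    return row[:mid], row[mid:]


def gated_gelu(a_gelu, a_linear):
    # Simplified gating: GELU branch multiplied elementwise by the linear branch.
    return [gelu(g) * l for g, l in zip(a_gelu, a_linear)]


# One row of a fused projection output (hypothetical values).
row = [1.0, 10.0, 2.0, 20.0, 3.0, 30.0]

a_gelu, a_linear = split_interleaved(row)
h_first, h_second = split_halves(row)

print("interleaved:", a_gelu, a_linear)
print("halves:     ", h_first, h_second)
print("gated:      ", gated_gelu(a_gelu, a_linear))
```

The two splits pair up different elements, so weights stored in the interleaved layout cannot be loaded into an implementation that expects the chunked layout without reordering them first.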

We are considering adding native support for it but have not done it yet.

On a side note, phi3 mini is affected by #2055, which is solved in #2060.
