Describe the bug If we keep kv cache as list of tensors, there ha


Keep kv cache as list of tensors maybe better than one tensor about keras-nlp HOT 3 OPEN

lingzhi98 commented on June 10, 2024

Keep kv cache as list of tensors maybe better than one tensor

from keras-nlp.

Comments (3)

lingzhi98 commented on June 10, 2024

Splitting the kv cache into a key cache and a value cache is also important (https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/gemma/gemma_attention.py#L166).


mattdangerw commented on June 10, 2024

@lingzhi98 thanks! We are planning some generation improvements, so we will definitely check this out. Agreed that we can let performance be our guide, probably JAX compiled performance in particular.

Were you thinking of a specific backend/compiled with XLA/not compiled? What's motivating the suggestion?


lingzhi98 commented on June 10, 2024

I use JAX as the Keras backend. I have seen the concatenation become the main overhead as the batch size increases. Because the kv caches are kept as one tensor, we need to slice the kv cache to get the corresponding key/value cache, compute the attention output, and then update the cache. Dynamic update slice fusion will be blocked by this slice op (https://github.com/openxla/xla/blob/main/xla/service/gpu/ir_emission_utils.cc#L472), which hurts performance again.
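To make the two layouts concrete, here is a minimal JAX sketch of one cache update step. It is not the keras-nlp implementation: the function names `update_stacked` and `update_split`, the tensor sizes, and the `[batch, 2, max_len, heads, head_dim]` layout are assumptions chosen for illustration. The stacked version has to slice the combined tensor and re-stack it afterwards, while the split version only issues `lax.dynamic_update_slice` on each cache.

```python
import jax
import jax.numpy as jnp
from jax import lax

# Hypothetical sizes, chosen only for illustration.
BATCH, MAX_LEN, HEADS, HEAD_DIM = 2, 8, 4, 16


def update_stacked(cache, new_key, new_value, index):
    """One-tensor layout (assumed): cache is [batch, 2, max_len, heads, head_dim]."""
    # Slice the key and value halves out of the combined tensor.
    key_cache = cache[:, 0]
    value_cache = cache[:, 1]
    start = (0, index, 0, 0)
    key_cache = lax.dynamic_update_slice(key_cache, new_key, start)
    value_cache = lax.dynamic_update_slice(value_cache, new_value, start)
    # Re-stacking joins both halves into a new combined tensor.
    return jnp.stack([key_cache, value_cache], axis=1)


def update_split(key_cache, value_cache, new_key, new_value, index):
    """Split layout (assumed): each cache is [batch, max_len, heads, head_dim]."""
    start = (0, index, 0, 0)
    # No slice before the update and no stack after it.
    key_cache = lax.dynamic_update_slice(key_cache, new_key, start)
    value_cache = lax.dynamic_update_slice(value_cache, new_value, start)
    return key_cache, value_cache


cache = jnp.zeros((BATCH, 2, MAX_LEN, HEADS, HEAD_DIM))
new_key = jnp.ones((BATCH, 1, HEADS, HEAD_DIM))
new_value = 2.0 * jnp.ones((BATCH, 1, HEADS, HEAD_DIM))

stacked = jax.jit(update_stacked)(cache, new_key, new_value, 3)
key_cache, value_cache = jax.jit(update_split)(
    cache[:, 0], cache[:, 1], new_key, new_value, 3
)
```

Both functions write the same values at position `index`, so they can be swapped for a benchmark. Whether the compiler fuses the update in place depends on the XLA version and backend, so the difference should be measured rather than assumed.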
