Comments (4)
Hi!
As you can see from the following lines, the KV cache belongs to the SelfAttention module, so each Encoder layer in the Transformer will have its own KV cache.
self.cache_k = torch.zeros((args.max_batch_size, args.max_seq_len, self.n_kv_heads, self.head_dim))
self.cache_v = torch.zeros((args.max_batch_size, args.max_seq_len, self.n_kv_heads, self.head_dim))
The first thing we do when we evaluate the attention is to update the KV cache with the input received at each Encoder layer, as in the following lines:
# Replace the entry in the cache
self.cache_k[:batch_size, start_pos : start_pos + seq_len] = xk
self.cache_v[:batch_size, start_pos : start_pos + seq_len] = xv
and then we use the cache to perform the subsequent operations.
This means that each layer first inserts the output of the previous layer into its own cache. So basically we do not have only one cache, but one cache for every Encoder layer, and they are all different from each other, because each one receives its input from the previous layer.
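To make the per-layer aspect concrete, here is a minimal sketch (not the repo's exact code; the `CachedAttention` class and its `update` method are illustrative names): each layer owns its own K/V cache tensors and fills them incrementally at `start_pos`.

```python
import torch

class CachedAttention:
    """Simplified stand-in for a SelfAttention module that owns a KV cache."""

    def __init__(self, max_batch, max_seq, n_kv_heads, head_dim):
        # Each instance (i.e., each Encoder layer) gets its own cache tensors.
        self.cache_k = torch.zeros((max_batch, max_seq, n_kv_heads, head_dim))
        self.cache_v = torch.zeros((max_batch, max_seq, n_kv_heads, head_dim))

    def update(self, xk, xv, start_pos):
        bsz, seq_len = xk.shape[0], xk.shape[1]
        # Write the new tokens' keys/values into this layer's cache ...
        self.cache_k[:bsz, start_pos:start_pos + seq_len] = xk
        self.cache_v[:bsz, start_pos:start_pos + seq_len] = xv
        # ... then read back everything up to the current position.
        keys = self.cache_k[:bsz, :start_pos + seq_len]
        values = self.cache_v[:bsz, :start_pos + seq_len]
        return keys, values

# Two layers -> two independent caches, each fed by the previous layer's output.
layers = [CachedAttention(max_batch=1, max_seq=8, n_kv_heads=2, head_dim=4)
          for _ in range(2)]
xk = torch.randn(1, 1, 2, 4)  # one new token
xv = torch.randn(1, 1, 2, 4)
k0, v0 = layers[0].update(xk, xv, start_pos=0)
print(k0.shape)  # torch.Size([1, 1, 2, 4])
```

The point of the sketch is that `layers[0].cache_k` and `layers[1].cache_k` are separate tensors: updating one never touches the other.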
Hope this clarifies your doubt.
Umar
from pytorch-llama-notes.
Hi Umar,
Thank you for taking the time to help, I really appreciate it.
I do understand that each layer has its own KV cache, but that doesn't solve my problem. In the original Transformer, each layer would use the full context to generate Q, K, and V to calculate its self-attention. With the KV cache, each layer only generates Q, K, and V for the new tokens. It seems to suggest that by appending a new token, the relative semantic structure of the previous context won't change. If that is the case, could you kindly point out references in any paper, because the materials I have read so far all claim the KV cache doesn't change the algorithm. If it's not the case, what have I missed?
best,
Linan
Hi Linan
I answer to this part:
it seems to suggest that by appending a new token, the relative semantic structure of previous context won't change.
Yes, the relative semantic structure of the previous context won't change, because each token only attends to the tokens to its left (the left context). So when a new token is added to K and V, the previous tokens' attention scores don't change, because a previous token should not be able to attend to the new token; the new token, conversely, is able to attend to the previous ones. This is due to the causal mask we apply to the attention scores to make the model autoregressive.
In our case, we do not need to apply the causal mask during inference, because it is implicitly applied by the way we compute the attention using the KV cache. During training, on the other hand, we apply the causal mask explicitly.
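This invariance is easy to verify numerically. The sketch below (illustrative, not the repo's code) computes causal attention over 3 tokens, then over 4 tokens with an explicit mask, and checks that the first 3 rows of output are identical; it also shows that the new token's output can be computed from just its own query against all keys, with no mask, which is exactly what KV-cache inference does.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 4
q = torch.randn(4, d)  # queries for 4 tokens
k = torch.randn(4, d)  # keys for 4 tokens
v = torch.randn(4, d)  # values for 4 tokens

def causal_attention(q, k, v):
    # Training-style attention with an explicit causal mask.
    scores = q @ k.T / d ** 0.5
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Outputs for the first 3 tokens, computed before the 4th token exists ...
out3 = causal_attention(q[:3], k[:3], v[:3])
# ... equal the first 3 rows once the 4th token is included:
out4 = causal_attention(q, k, v)
print(torch.allclose(out3, out4[:3]))  # True

# KV-cache style: only the new token's query, against all cached K/V,
# with no mask at all. It matches the masked computation's last row.
out_new = F.softmax(q[3:] @ k.T / d ** 0.5, dim=-1) @ v
print(torch.allclose(out_new, out4[3:]))  # True
```

The second check is why no explicit mask is needed at inference time: the only row of the attention matrix being computed is the last one, and the causal mask never masks anything in that row.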
I hope it clarified your doubt.
Umar
Yes, now it all clicks. Thanks for the patience!