
Comments (4)

hkproj avatar hkproj commented on May 27, 2024

Hi!

As you can see from the following lines, the KV cache belongs to the SelfAttention module, so each Encoder layer in the Transformer will have its own KV cache.

self.cache_k = torch.zeros((args.max_batch_size, args.max_seq_len, self.n_kv_heads, self.head_dim))
self.cache_v = torch.zeros((args.max_batch_size, args.max_seq_len, self.n_kv_heads, self.head_dim))

The first thing we do when we evaluate the attention is to update the KV cache with the input received at each Encoder layer, as in the following lines:

# Replace the entry in the cache
self.cache_k[:batch_size, start_pos : start_pos + seq_len] = xk
self.cache_v[:batch_size, start_pos : start_pos + seq_len] = xv

and then we use the cache to perform the subsequent operations.

This means that each layer first inserts the output of the previous layer into its own cache. In other words, we do not have just one cache, but one cache for every Encoder layer, and they are all different from each other, because each one receives its input from the previous layer.
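As a minimal sketch of the mechanism described above (the class name `SimpleAttention` and the toy shapes are my own illustrative assumptions, not the repo's code):

```python
import torch

class SimpleAttention:
    """Toy attention module: each instance (one per layer) owns its own KV cache."""

    def __init__(self, max_batch_size, max_seq_len, n_kv_heads, head_dim):
        # One cache per layer, exactly as in the lines quoted above.
        self.cache_k = torch.zeros((max_batch_size, max_seq_len, n_kv_heads, head_dim))
        self.cache_v = torch.zeros((max_batch_size, max_seq_len, n_kv_heads, head_dim))

    def update(self, xk, xv, start_pos):
        batch_size, seq_len = xk.shape[0], xk.shape[1]
        # Write the new keys/values into this layer's cache...
        self.cache_k[:batch_size, start_pos : start_pos + seq_len] = xk
        self.cache_v[:batch_size, start_pos : start_pos + seq_len] = xv
        # ...then read back everything cached so far for the attention computation.
        keys = self.cache_k[:batch_size, : start_pos + seq_len]
        values = self.cache_v[:batch_size, : start_pos + seq_len]
        return keys, values

# Two layers -> two independent caches, each fed by a different input
# (layer 1 sees the output of layer 0, so its cache holds different values).
layer0 = SimpleAttention(max_batch_size=1, max_seq_len=8, n_kv_heads=2, head_dim=4)
layer1 = SimpleAttention(max_batch_size=1, max_seq_len=8, n_kv_heads=2, head_dim=4)

xk = torch.ones(1, 1, 2, 4)              # one new token for layer 0
keys0, _ = layer0.update(xk, xk, start_pos=0)
keys1, _ = layer1.update(2 * xk, 2 * xk, start_pos=0)  # layer 1 gets a different input

print(keys0.shape)  # grows with start_pos + seq_len
print(torch.equal(layer0.cache_k, layer1.cache_k))  # the caches differ
```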

Hope this clarifies your doubt.

Umar

from pytorch-llama-notes.

wangii avatar wangii commented on May 27, 2024

Hi Umar,

Thank you for taking the time to help, I really appreciate it.

I do understand that each layer has its own KV cache, but that doesn't solve my problem. In the original Transformer, each layer would use the full context to generate Q, K, and V to compute its self-attention. With the KV cache, each layer only generates Q, K, and V for the new tokens. It seems to suggest that by appending a new token, the relative semantic structure of previous context won't change. If that's the case, could you kindly point out the references in any paper, because the materials I have read so far all claim the KV cache doesn't change the algorithm. If it's not the case, what have I missed?

best,
Linan


hkproj avatar hkproj commented on May 27, 2024

Hi Linan

Let me answer this part:

it seems to suggest that by appending a new token, the relative semantic structure of previous context won't change.

Yes, the relative semantic structure of the previous context won't change, because each token only attends to the tokens to its left (the left context). So when a new token is added to K and V, the previous tokens' attention scores don't change: a previous token should not be able to attend to the new token, while, vice versa, the new token can attend to all the previous ones. This is due to the causal mask we apply to the attention scores to make the model autoregressive.

In our case, we do not need to apply the causal mask during inference, because it is implicitly applied by the way we compute the attention using the KV cache. During training, on the other hand, we apply the causal mask explicitly.
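To illustrate the two cases (a toy sketch with made-up dimensions, not the repo's code): during inference the single new query only scores against the keys already in the cache, so the future simply isn't there to attend to; during training the full score matrix exists and must be masked explicitly.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8  # toy head dimension

# Inference with a KV cache: one new query, keys for positions 0..4 in the cache.
keys = torch.randn(5, d)                 # cached keys (left context only)
q_new = torch.randn(1, d)                # query of the newest token
scores = (q_new @ keys.T) / d ** 0.5     # shape (1, 5): no future positions exist
weights = F.softmax(scores, dim=-1)      # causality is implicit, no mask needed

# Training: all 5 queries at once produce a (5, 5) score matrix,
# so the upper triangle (future positions) must be masked explicitly.
q_all = torch.randn(5, d)
full_scores = (q_all @ keys.T) / d ** 0.5
mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)
masked_scores = full_scores.masked_fill(mask, float("-inf"))
full_weights = F.softmax(masked_scores, dim=-1)
# Row i now has nonzero weight only on positions <= i, matching the causal rule.
```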

I hope this clarifies your doubt.

Umar


wangii avatar wangii commented on May 27, 2024

yes, now it all clicks. thanks for the patience!

