Comments (4)
Hi!
As you can see from the following lines, the KV cache belongs to the SelfAttention module, so each Encoder layer in the Transformer will have its own KV cache.
self.cache_k = torch.zeros((args.max_batch_size, args.max_seq_len, self.n_kv_heads, self.head_dim))
self.cache_v = torch.zeros((args.max_batch_size, args.max_seq_len, self.n_kv_heads, self.head_dim))
The first thing we do when we evaluate the attention is to update the KV cache with the input received at each Encoder layer, as in the following lines:
# Replace the entry in the cache
self.cache_k[:batch_size, start_pos : start_pos + seq_len] = xk
self.cache_v[:batch_size, start_pos : start_pos + seq_len] = xv
and then we use the cache to perform the subsequent operations.
This means that each layer first inserts the output of the previous layer into its own cache. So basically we do not have only one cache, but one cache for every Encoder layer, and they are all different from each other, because each one receives its input from the previous layer.
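To make the per-layer aspect concrete, here is a minimal sketch (not the repo's exact code; the `CachedAttention` class and its `update` method are illustrative names): each layer owns its own K/V cache tensors and fills them incrementally at `start_pos`.

```python
import torch

class CachedAttention:
    """Simplified stand-in for a SelfAttention module that owns a KV cache."""

    def __init__(self, max_batch, max_seq, n_kv_heads, head_dim):
        # Each instance (i.e., each Encoder layer) gets its own cache tensors.
        self.cache_k = torch.zeros((max_batch, max_seq, n_kv_heads, head_dim))
        self.cache_v = torch.zeros((max_batch, max_seq, n_kv_heads, head_dim))

    def update(self, xk, xv, start_pos):
        bsz, seq_len = xk.shape[0], xk.shape[1]
        # Write the new tokens' keys/values into this layer's cache ...
        self.cache_k[:bsz, start_pos:start_pos + seq_len] = xk
        self.cache_v[:bsz, start_pos:start_pos + seq_len] = xv
        # ... then read back everything up to the current position.
        keys = self.cache_k[:bsz, :start_pos + seq_len]
        values = self.cache_v[:bsz, :start_pos + seq_len]
        return keys, values

# Two layers -> two independent caches, each fed by the previous layer's output.
layers = [CachedAttention(max_batch=1, max_seq=8, n_kv_heads=2, head_dim=4)
          for _ in range(2)]
xk = torch.randn(1, 1, 2, 4)  # one new token
xv = torch.randn(1, 1, 2, 4)
k0, v0 = layers[0].update(xk, xv, start_pos=0)
print(k0.shape)  # torch.Size([1, 1, 2, 4])
```

The point of the sketch is that `layers[0].cache_k` and `layers[1].cache_k` are separate tensors: updating one never touches the other.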
Hope this clarifies your doubt.
Umar
from pytorch-llama-notes.
Hi Umar,
Thank you for taking the time to help, I really appreciate it.
I do understand that each layer has its own KV cache, but that doesn't solve my problem. In the original Transformer, each layer would use the full context to generate Q, K, and V to calculate its self-attention. With the KV cache, each layer only generates Q, K, and V for the new tokens. It seems to suggest that by appending a new token, the relative semantic structure of the previous context won't change. If that is the case, could you kindly point out references in any paper, because the materials I have read so far all claim the KV cache doesn't change the algorithm. If it's not the case, what have I missed?
best,
Linan
Hi Linan
I answer to this part:
it seems to suggest that by appending a new token, the relative semantic structure of previous context won't change.
Yes, the relative semantic structure of the previous context won't change, because each token only attends to the tokens to its left (the left context). So when a new token is added to K and V, the previous tokens' attention scores don't change, because a previous token should not be able to attend to the new token; the new token, conversely, is able to attend to the previous ones. This is due to the causal mask we apply to the attention scores to make the model autoregressive.
In our case, we do not need to apply the causal mask during inference, because it is implicitly applied by the way we compute the attention using the KV cache. During training, on the other hand, we apply the causal mask explicitly.
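This invariance is easy to verify numerically. The sketch below (illustrative, not the repo's code) computes causal attention over 3 tokens, then over 4 tokens with an explicit mask, and checks that the first 3 rows of output are identical; it also shows that the new token's output can be computed from just its own query against all keys, with no mask, which is exactly what KV-cache inference does.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 4
q = torch.randn(4, d)  # queries for 4 tokens
k = torch.randn(4, d)  # keys for 4 tokens
v = torch.randn(4, d)  # values for 4 tokens

def causal_attention(q, k, v):
    # Training-style attention with an explicit causal mask.
    scores = q @ k.T / d ** 0.5
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Outputs for the first 3 tokens, computed before the 4th token exists ...
out3 = causal_attention(q[:3], k[:3], v[:3])
# ... equal the first 3 rows once the 4th token is included:
out4 = causal_attention(q, k, v)
print(torch.allclose(out3, out4[:3]))  # True

# KV-cache style: only the new token's query, against all cached K/V,
# with no mask at all. It matches the masked computation's last row.
out_new = F.softmax(q[3:] @ k.T / d ** 0.5, dim=-1) @ v
print(torch.allclose(out_new, out4[3:]))  # True
```

The second check is why no explicit mask is needed at inference time: the only row of the attention matrix being computed is the last one, and the causal mask never masks anything in that row.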
I hope it clarified your doubt.
Umar
Yes, now it all clicks. Thanks for the patience!