Comments (7)
We found that starting with transformers 4.36, the default attention class of Llama changed from "LlamaAttention" to "LlamaSdpaAttention", so the replacement call no longer matches anything. Instead, you may try:
modify_method_of_instance(base_model, "LlamaAttention", "forward", self_extend_forward)
--> modify_method_of_instance(base_model, "LlamaSdpaAttention", "forward", self_extend_forward)
This might be the reason for the failure.
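For context, here is a minimal sketch of the instance-wise replacement in the style of the repo's example.py; the model path, group sizes, and import paths are illustrative assumptions and may differ in your copy:

from functools import partial
from transformers import AutoModelForCausalLM
import llama_self_extend_patch_4_36 as LlamaSE        # the patch file discussed in this thread
from modify_utils import modify_method_of_instance    # helper from the LongLM repo (assumed path)

# illustrative model path and group sizes
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
self_extend_forward = partial(LlamaSE.self_extend_forward, group_size_1=4, group_size_2=1024)

# transformers >= 4.36 builds LlamaSdpaAttention modules by default,
# so the class name passed here must match that, not "LlamaAttention".
modify_method_of_instance(base_model, "LlamaSdpaAttention", "forward", self_extend_forward)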
It works, thank you.
I have another question: I want to add it here, but it still only runs normally on 4.32, and the result on 4.36 is still wrong.
I just added the following three pieces of code and ran them with this command: CUDA_VISIBLE_DEVICES=0,1 python eval/passkey.py --model /data/supry/models/llama-2/llama2-7b-hf --min-tokens 4096 --max-tokens 8192 --tokens-step 4096 --length-step 1024 --iterations 20 --serope
Hi YL-9! Could you please test whether self-extend works with instance-wise modification, like the example we provide? Sometimes a direct modification to the transformers class does not take effect, and the cause of failure is case by case. That's why we chose to modify the forward function of a model instance rather than its class. (Of course, this also avoids unexpected behavior, since the modification only happens to the specific model instance.)
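To make the distinction concrete, here is a minimal sketch (not the repo's modify_utils code) of what instance-wise modification means: forward is rebound on the attention sub-modules of one model object, while the transformers class itself stays untouched.

import types

def patch_instance_forward(model, class_name, new_forward):
    # walk the sub-modules of this specific model instance
    patched = 0
    for module in model.modules():
        if module.__class__.__name__ == class_name:
            # MethodType binds new_forward to this module only; the class is unchanged
            module.forward = types.MethodType(new_forward, module)
            patched += 1
    if patched == 0:
        raise ValueError(f"no sub-module of class {class_name} found; check the attention class name for your transformers version")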
ok, thank you!
Hi, thanks for the nice work! I see the current implementation in llama_self_extend_patch_4_36.py is regular PyTorch. I wonder if you plan to implement Flash Attention for transformers==4.36?
Hi, thank you for your interest. The main difference between transformers==4.36 and transformers==4.38.2 is how RoPE is applied to the keys and values; you may check that. The computation of self-attention is nearly the same, which means you can follow our 4.38.2 implementation to get a Flash Attention implementation for 4.36 with minor modifications.
One possible issue is the flash_attn version used by 4.36. In that case, you may use our Triton flash attention implementation instead of flash_attn. It's 10~20% slower than flash_attn.
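As an illustration of the fallback idea (a hypothetical wrapper, not the repo's Triton kernel), one way to sidestep a flash_attn version mismatch is to call flash_attn when it is available and otherwise drop back to PyTorch's scaled_dot_product_attention:

import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

def attention(q, k, v, causal=True):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    if HAS_FLASH_ATTN and q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        # flash_attn expects (batch, seq_len, num_heads, head_dim)
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), causal=causal)
        return out.transpose(1, 2)
    # fallback: regular PyTorch scaled dot-product attention
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)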
Related Issues (20)
- Flash Attention implementation is coming
- Can it be implemented on qwen1.5? HOT 2
- Example for phi2? HOT 7
- Example for gemma & use with Ollama HOT 5
- OOM on LongBench HOT 1
- FlashAttention does not work for Batch size > 1
- What effect on qwen1.5 will be if i use self-extend trick? HOT 3
- Something wrong in modify_method_of_instance function HOT 2
- Question | Has anyone tried this with GGUF models? HOT 1
- Long context
- Questions regarding group query/key positional index HOT 2
- Question about equation 4 and Table 5 caption in paper HOT 3
- llama3 is not working. HOT 1
- Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when resuming training
- Passkey retrieval (needle in a haystack) HOT 2
- Run example.py Error: Failed to modify the attention method of LlamaForCausalLM HOT 2
- Differences with ReRoPE HOT 1
- Is there example code to enable SelfExtend on LLMs in safetensors format? HOT 5
- Cohere command r HOT 1