datamllab / longlm
[ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
Home Page: https://arxiv.org/pdf/2401.01325.pdf
License: MIT License
The paper mentions a SEext-SOLAR-10.5B model?
Hi,
Thank you for releasing this! Can I save Phi 2 weights with self-extend to finetune on?
Thank you!
Hi, how did you evaluate on LongBench?
I tried to map your Llama to the extended version with
Line 40 in 6b84193
and I used the generation flow from LongBench: https://github.com/THUDM/LongBench/blob/main/pred.py
Without extendedforward, one model takes 60GB of memory.
Is there an example of how the extension can be done on the Gemma models? Also, once the models are extended, how can they be exported for use with Ollama? Ollama uses the GGUF format.
#Tokens of Prompt: 5144 Passkey target: 89427
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Llama2: [What is the pass key? The pass key is 。。。. 。]
SelfExtend: [What is the pass key? The pass key is 。。。. 。]
The error is shown below:
Token indices sequence length is longer than the specified maximum sequence length for this model (9556 > 4096). Running this sequence through the model will result in indexing errors
Hi! I have a question that may seem simple, but I think I'm overlooking something.
Assume Phi-2's context window is 2K. When we apply a group size (
However, in the paper, the caption of Table 5 states:
The vanilla Phi-2 has a 2k context window. SelfExtend extends Phi-2 to 4K ($G_s=4, w_n=512$), 6K ($G_s=8, w_n=512$).
Considering Phi-2's maximum token length is 2K, I am puzzled as to how the extended lengths of 4K and 6K are derived. What am I missing here?
Any insights would be greatly appreciated.
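For reference, a minimal sketch of the arithmetic behind this question. It assumes the upper bound stated in the paper, $(L - w_n) \cdot G_s + w_n$, where $L$ is the pretraining context length; the function name is hypothetical. Under that assumption the bound for Phi-2 comes out larger than the 4K and 6K in the caption, so those figures may simply be the lengths that were evaluated rather than the maximum (this reading is an assumption, not something the caption confirms).

```python
def max_extended_length(pretrain_len: int, group_size: int, neighbor_window: int) -> int:
    """Upper bound on the extended context window: (L - w_n) * G_s + w_n.

    Assumes the formula from the paper; the function name is illustrative only.
    """
    if not 0 <= neighbor_window <= pretrain_len:
        raise ValueError("neighbor_window must lie in [0, pretrain_len]")
    return (pretrain_len - neighbor_window) * group_size + neighbor_window


# Phi-2 with a 2K pretraining window and w_n = 512, as in the Table 5 caption.
for g in (4, 8):
    print(f"G_s={g}: at most {max_extended_length(2048, g, 512)} tokens")
```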
Great work!!
It would be super to have deeper support for Phi2 / Mixformer,
e.g. https://huggingface.co/amgadhasan/phi-2
Edit:
Tried the existing phi patches with some modifications, but it seems like some core assumptions are pretty different. e.g.
AttributeError: 'MixFormerSequentialForCausalLM' object has no attribute 'q_proj'
I followed your instructions below to apply SelfExtend to Llama 3:
"""
[04/19/2024]:💡 We added the support for LLama-3 with transformers==4.40. To use it with transformers==4.40, you may change the file name of Llama_4_40.py to Llama.py to replace the existing patch file.
"""
Exception Traceback (most recent call last)
Cell In[12], line 4
2 group_size = 5
3 window_size = 1024
----> 4 SelfExtend.apply(model, group_size, window_size, enable_flash_attention=True)#, flash_attention_impl='flash_attn')
5 model.eval()
File /home/ubuntu/reports/SelfExtend.py:109, in apply(loaded_model, group_size, window_size, enable_flash_attention, scale_base, flash_attention_impl)
107 print("Using triton flash self_extend!!")
108 if (not modifed):
--> 109 raise Exception(f"Failed to modify the attention method of {arch_name}")
110 else:
111 raise Exception(f"Need to set the flash_attention_impl to 'flash_attn' or 'triton'.")
Exception: Failed to modify the attention method of LlamaForCausalLM
"""
How can I fix it?
Your model seems to work really well on long context. Do you have any plans to integrate this into vLLM in the future?
Could you please release the example script for Phi-2? Thanks.
Hey love your work and thanks for releasing the code. I saw on r/LocalLLaMA that you plan on adding flash-attention support. Do you have a timeline in mind for it?
Thanks for your contribution on accommodating Qwen in Self-Extend.
Qwen1.5 already has a 32k context length. I'm wondering if I can use Self-Extend to extend it to about 100K?
Have you ever tested the effect of Self-Extend on Qwen1.5?
When trying to run predictions from a SelfExtended model, I'm getting the above error, TypeError: 'NoneType' object is not subscriptable. Without applying selfextend(), I don't get any error and am able to run predictions.
Thanks to @dwzhu-pku for pointing this out.
Can it be implemented on qwen1.5?
| | Self-Extend | ReRoPE |
|---|---|---|
| | tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [2, 1, 0, 0, 0, 0, 0, 0, 0, 0], [3, 2, 1, 0, 0, 0, 0, 0, 0, 0], [4, 3, 2, 1, 0, 0, 0, 0, 0, 0], [4, 4, 3, 2, 1, 0, 0, 0, 0, 0], [5, 5, 4, 3, 2, 1, 0, 0, 0, 0], [5, 5, 4, 4, 3, 2, 1, 0, 0, 0], [6, 6, 5, 5, 4, 3, 2, 1, 0, 0], [6, 6, 5, 5, 4, 4, 3, 2, 1, 0]]) | tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [2, 1, 0, 0, 0, 0, 0, 0, 0, 0], [3, 2, 1, 0, 0, 0, 0, 0, 0, 0], [4, 3, 2, 1, 0, 0, 0, 0, 0, 0], [5, 4, 3, 2, 1, 0, 0, 0, 0, 0], [6, 5, 4, 3, 2, 1, 0, 0, 0, 0], [6, 6, 5, 4, 3, 2, 1, 0, 0, 0], [6, 6, 6, 5, 4, 3, 2, 1, 0, 0], [6, 6, 6, 6, 5, 4, 3, 2, 1, 0]]) |
| base | 256 : 1.0464 512 : 2.2302 1024 : 2.4575 | 256 : 1.0464 512 : 0.8970 1024 : 2.1020 |
| +logn | 256 : 1.0464 512 : 2.2816 1024 : 2.8442 | 256 : 1.0464 512 : 0.8832 1024 : 1.8993 |
| +keynorm | 256 : 2.1632 512 : 2.5168 1024 : 3.2525 | 256 : 2.1632 512 : 3.1556 1024 : 4.6036 |
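The two position matrices in the table can be reproduced with a short sketch. The parameters here are inferred from the printed tensors, not stated in the issue: a group size of 2 and neighbor window of 4 for Self-Extend, and a clipping window of 6 for ReRoPE. Function names are hypothetical, and entries above the diagonal (future tokens) are set to 0 only to match the printed output.

```python
def self_extend_positions(seq_len, group_size, window_size):
    """Relative positions: exact inside the neighbor window, grouped (floor-divided) outside."""
    shift = window_size - window_size // group_size
    matrix = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            if k > q:
                row.append(0)  # future token, masked by causal attention
            elif q - k < window_size:
                row.append(q - k)  # neighbor attention keeps the true distance
            else:
                row.append(q // group_size + shift - k // group_size)
        matrix.append(row)
    return matrix


def rerope_positions(seq_len, window_size):
    """Relative positions clipped at window_size."""
    return [
        [min(q - k, window_size) if k <= q else 0 for k in range(seq_len)]
        for q in range(seq_len)
    ]


for row in self_extend_positions(10, group_size=2, window_size=4):
    print(row)
print()
for row in rerope_positions(10, window_size=6):
    print(row)
```

With these settings both schemes cap the largest relative position at 6; they differ in how distances beyond the window are mapped (shared positions per group versus a single clipped value).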
Is it possible to adapt this to cohere command-r models ?
I asked in Chinese because I guessed you can read Chinese based on the author list. If you need to ask in English, please contact me!
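As the title says, I would like to enable SelfExtend on a model downloaded from HF to support a long context window. Are there any example scripts for enabling it at inference time and running needle-in-a-haystack tests (4k-256k)? I noticed that the paper modifies the attention computation; is this a plug-and-play method for any LLM?
The format in question contains files such as: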
├── config.json
├── configuration.json
├── generation_config.json
├── model-00001-of-00003.safetensors
├── model-00002-of-00003.safetensors
├── model-00003-of-00003.safetensors
├── model.safetensors.index.json
├── sft_args.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
If you could provide one, I would be very grateful!
Hello!
Thank you for your great work; it's amazing how much hard work you put into this algorithm. I just had one question: is it possible to integrate this with vLLM serving?
This would really improve inference time in limited-resource settings once you cross the 8192-token mark. Is there a way? Thank you in advance for your help!
When I use 4.36.2, it does not work. But with 4.32.0, it works.
I only changed "import llama_self_extend_patch as LlamaSE" in "llama_example.py" to "import llama_self_extend_patch_4_36 as LlamaSE"
Hello, congratulations on your acceptance to ICML! I think it is well deserved!
I have a question regarding the passkey retrieval task you posted.
Could you briefly explain how you created the text file for this task? (passkey_examples.jsonl)
Thank you.
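As a rough illustration of how such a file could be built, here is a sketch of a passkey-retrieval generator in the general style of the task: a random number is hidden at a random position inside repeated filler text, and the prompt ends with the question seen in the outputs above. This is not the authors' procedure; the filler sentences, the instruction wording, and the JSON field names ("input", "target") are all assumptions made for illustration.

```python
import json
import random

# Hypothetical filler text; the real file may use different sentences.
FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. "


def make_passkey_example(n_filler: int, rng: random.Random) -> dict:
    """Build one prompt with a 5-digit passkey hidden among n_filler filler blocks."""
    passkey = rng.randint(10000, 99999)
    insert_at = rng.randint(0, n_filler)  # number of filler blocks before the passkey
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    body = FILLER * insert_at + needle + FILLER * (n_filler - insert_at)
    prompt = (
        "There is important information hidden inside a lot of irrelevant text. "
        "Find it and memorize it.\n"
        + body
        + "\nWhat is the pass key? The pass key is"
    )
    return {"input": prompt, "target": str(passkey)}


def make_jsonl(n_examples: int, n_filler: int, seed: int = 0) -> str:
    """Return n_examples JSON lines; write the string to a .jsonl file as needed."""
    rng = random.Random(seed)
    return "\n".join(json.dumps(make_passkey_example(n_filler, rng)) for _ in range(n_examples))


print(make_jsonl(n_examples=2, n_filler=3))
```

Raising `n_filler` lengthens the prompt, so the token count can be pushed past the model's pretraining window to probe retrieval at longer contexts.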
I'm applying this to the Ghost 8B Beta (128k) chat version online here and it seems to work.
In general, I have not yet fine-tuned and tested the parameters against the original model (even the current version is online) but I have actually noticed that the context is long but still ensures very good quality, for example here.
This is a quick issue to share this joy with your research team. Thank you very much~
Hello. I simply ran example.py and hit this error in the "=====SelfExtend using Torch======" part:
Traceback (most recent call last):
File "./LongLM/example.py", line 112, in <module>
SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
File "./LongLM/SelfExtend.py", line 123, in apply
raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of LlamaForCausalLM
transformers=4.41, flash_attn=2.5.8
Meanwhile, I noticed a similar problem in https://github.com/datamllab/LongLM/issues/31, so I also tried disabling flash attention:
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16, use_flash_attention_2=False)
SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
I printed the model, which shows it's not the flash attention:
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
So where is the problem?
We already have the implementation. Actually, the new results we released on X (formerly Twitter) with Google's Gemma are based on this implementation (otherwise we could not do it on sequences > 30k). However, with the current implementation, we cannot reach the same results (on LongBench) as those we reported in the paper (based on the version without flash attention). There is a minor performance gap between the two versions.
We are still trying to figure out the reason.
Thanks for your code. I encountered the following issue while trying to extend the context length of Qwen1.5-14B-Chat. Do you know how I can fix this Exception? Many THX!
Traceback (most recent call last):
File "/***/envs/qwen/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/***/envs/qwen/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/***/Qwen1_5/longlm/LongLM/pred.py", line 63, in get_pred
SelfExtend.apply(model, group_size, window_size, enable_flash_attention=use_flash, flash_attention_impl="flash_attn")
File "/***/Qwen1_5/longlm/LongLM/SelfExtend.py", line 179, in apply
raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of Qwen2ForCausalLM
Under the setting of:
I use 2 GPUs. During inference (with llama-2-7B and the 10k questions in the demo), each GPU requires about 50GB of additional memory, excluding the loaded model. But when I use 8 GPUs, inference still requires about 50GB of memory per card. Why is that?
Also, I cannot test the llama-2-70B model, because even with 8 GPUs used for simultaneous inference, the memory required by each GPU during inference is huge.
Are there any ways to optimize resource usage during inference? Have you tried testing your method on the 70B model with 10k-token QA?
Hi, very impressive work! Would the code to reproduce results on LongBench be released anytime soon? Specifically, without flash attention, how to run the experiments with 16k input length and 7B/10B model? The memory consumption seems impossible to run on gpus with 40G memory and difficult even for gpus with 80G memory.
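A back-of-the-envelope estimate helps show why memory grows so quickly without flash attention. The sketch below assumes the attention scores are fully materialized as a (heads x seq_len x seq_len) tensor per layer in bf16, and that the non-flash SelfExtend path builds two such matrices (neighbor and grouped) before merging them; both are assumptions about the implementation, and the helper name is hypothetical. Llama-2-7B has 32 attention heads.

```python
GIB = 1024 ** 3


def attention_score_bytes(seq_len, num_heads, bytes_per_elem=2, batch_size=1, num_matrices=1):
    """Memory for fully materialized attention scores of one layer (quadratic in seq_len)."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_elem * num_matrices


seq_len = 16 * 1024  # 16k tokens
single = attention_score_bytes(seq_len, num_heads=32)
double = attention_score_bytes(seq_len, num_heads=32, num_matrices=2)
print(f"one score matrix:   {single / GIB:.1f} GiB")  # 16.0 GiB
print(f"two score matrices: {double / GIB:.1f} GiB")  # 32.0 GiB
```

Because this cost is quadratic in sequence length and is paid inside a single layer, splitting layers across more GPUs would not be expected to lower the per-GPU peak; intermediate copies (for example an fp32 softmax) can push it higher still.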
Hi,
Loved this paper and implementation. I implemented this for Phi2 with transformers==4.36.2 without caching. Within the context window, the outputs are even better at following instructions than the original model. However, when going beyond the context window, I am seeing repetition. This might be due to extending the context window a bit too much. Do you have suggestions for experimenting with different group and neighbor sizes, or any other insights?
Here is my implementation:
https://github.com/agokrani/LongLM/tree/phi2
I haven't implemented KV caching yet due to the change in the KV cache format in transformers. I will try to implement it soon. Would love to hear your thoughts.
Has anyone successfully run cases with a length of 16k? What kind of machine resources support the experiment?
Hi! I love your work and code implementation. Learned a lot.
I have couple questions regarding code implementation.
LongLM/self_extend_patch/Llama.py
Lines 294 to 295 in 6e25a31
I understand that group_query_position is generated according to the formula shown in figure 3 of the paper. However, I am curious why group_key_position is simply determined by dividing by group_size (without neighbor attention), unlike the query. Could you please clarify if I am missing something here?
Thank you in advance for your help.
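One way to look at this question: rotary embeddings only see the difference between query and key positions, so a constant offset needs to be applied on one side only. The sketch below assumes the query formula is `q // G + w_n - w_n // G` and the key formula is `k // G`, as described in the question; the function names are hypothetical and this is not the repository's code.

```python
def grouped_query_position(q_pos: int, group_size: int, window_size: int) -> int:
    # Floor-divided position plus a constant shift that lines the grouped
    # region up with the edge of the neighbor window.
    return q_pos // group_size + window_size - window_size // group_size


def grouped_key_position(k_pos: int, group_size: int) -> int:
    # No shift here: only the query-key difference matters.
    return k_pos // group_size


def grouped_relative_position(q_pos, k_pos, group_size, window_size):
    return grouped_query_position(q_pos, group_size, window_size) - grouped_key_position(k_pos, group_size)


G, W = 2, 4
shift = W - W // G
for q in range(W, 12):
    k = q - W  # first key that falls outside the neighbor window
    rel = grouped_relative_position(q, k, G, W)
    # Subtracting the shift from the key instead gives the same relative position.
    alt = q // G - (k // G - shift)
    print(q, k, rel, alt)
```

In this setting the grouped relative position at distance `W` is exactly `W`, so it continues from the largest neighbor-window distance `W - 1` without a jump.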
I tried to run example.py on an A100 (80GB) GPU. It seems there is a bug at line [41]
Line 41 in ee92c84
The current implementation doesn't load the input_ids tensors onto the device, which causes an error. I replaced the above code, and it's now working. Fixed the issue by adding: input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
Great job! However, I have encountered an issue. When the length of the input increases, the GPU memory consumption grows rapidly, and it quickly leads to an out-of-memory (OOM) error. Could you please let me know if this is a bug?