datamllab / longlm Goto Github PK


570.0 570.0 56.0 13.16 MB

[ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

Home Page: https://arxiv.org/pdf/2401.01325.pdf

License: MIT License

Python 100.00%

context-window large-language-models llm longlm self-extend selfextend

longlm's Issues

Release the SOLAR model

The paper mentions a SEext-SOLAR-10.5B model. Do you plan to release it?

Export weights

Hi,
Thank you for releasing this! Can I save Phi 2 weights with self-extend to finetune on?
Thank you!

OOM on LongBench

Hi, how did you evaluate on LongBench?

I tried to patch Llama to the extended version with

LongLM/llama_example.py

Line 40 in 6b84193

 modify_method_of_instance(model, "LlamaAttention", "forward", self_extend_forward) 

but it runs out of memory (OOM) on 2 A100-80GB GPUs with DataParallel.

I used the generation flow from LongBench (https://github.com/THUDM/LongBench/blob/main/pred.py).
Without the extended forward, one model takes 60GB of memory.

Question | Has anyone tried this with GGUF models?

Example for gemma & use with Ollama

Is there an example of how the extension can be applied to the Gemma models? Also, once the models are extended, how can they be exported for use with Ollama? Ollama uses the GGUF format.

#Tokens of Prompt: 5144 Passkey target: 89427
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Llama2: [What is the pass key? The pass key is 。。。. 。]
SelfExtend: [What is the pass key? The pass key is 。。。. 。]

After running the code, the model still cannot answer questions about long text

The error shown is as follows:
Token indices sequence length is longer than the specified maximum sequence length for this model (9556 > 4096). Running this sequence through the model will result in indexing errors

Question about equation 4 and Table 5 caption in paper

Hi! I have a question that may seem simple, but I think I'm overlooking something.

Assume Phi-2's context window is 2K. When we apply a group size ($G_s$) of 4 and neighbor tokens ($w_n$) of 512, according to Equation 4: $(L - w_n) \times G_s + w_n$, the calculated extended pre-training length is approximately $(2K - 0.5K) \times 4 + 0.5K = 6.5K$.

However, in the paper, the caption of Table 5 states:

The vanilla Phi-2 has a 2k context window. SelfExtend extends Phi-2 to 4K ($G_s=4, w_n=512$), 6K ($G_s=8, w_n=512$).

Considering Phi-2's maximum token length is 2K, I am puzzled as to how the extended lengths of 4K and 6K are derived. What am I missing here?
Any insights would be greatly appreciated.
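
The arithmetic in the question can be checked with a short sketch. It only evaluates Equation 4 as quoted above; the function name `extended_length` and the reading of "2K" as 2048 tokens are assumptions made here for illustration, and the sketch does not explain how the 4K and 6K figures in the Table 5 caption were chosen.

```python
def extended_length(pretrain_len: int, group_size: int, neighbor_window: int) -> int:
    """Equation 4 as quoted in the question: (L - w_n) * G_s + w_n."""
    return (pretrain_len - neighbor_window) * group_size + neighbor_window


if __name__ == "__main__":
    # Assumption: "2K" means 2048 tokens; w_n = 512 as in the question.
    for group_size in (4, 8):
        print(group_size, extended_length(2048, group_size, 512))
```

Under these assumptions the formula gives 6656 tokens (6.5K) for G_s = 4 and 12800 tokens (12.5K) for G_s = 8, so both lengths named in the caption are below what Equation 4 yields for those settings.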

Support for Phi2 / Mixformer

Great work!!

It would be super to have deeper support for Phi2 / Mixformer,
e.g. https://huggingface.co/amgadhasan/phi-2

Edit:
Tried the existing phi patches with some modifications, but it seems like some core assumptions are pretty different. e.g.
AttributeError: 'MixFormerSequentialForCausalLM' object has no attribute 'q_proj'

llama3 is not working.

I followed your instructions below to apply SelfExtend to Llama 3:
"""
[04/19/2024]:💡 We added the support for LLama-3 with transformers==4.40. To use it with transformers==4.40, you may change the file name of Llama_4_40.py to Llama.py to replace the existing patch file.
"""

I got this error.
"""

Exception Traceback (most recent call last)
Cell In[12], line 4
2 group_size = 5
3 window_size = 1024
----> 4 SelfExtend.apply(model, group_size, window_size, enable_flash_attention=True)#, flash_attention_impl='flash_attn')
5 model.eval()

File /home/ubuntu/reports/SelfExtend.py:109, in apply(loaded_model, group_size, window_size, enable_flash_attention, scale_base, flash_attention_impl)
107 print("Using triton flash self_extend!!")
108 if (not modifed):
--> 109 raise Exception(f"Failed to modify the attention method of {arch_name}")
110 else:
111 raise Exception(f"Need to set the flash_attention_impl to 'flash_attn' or 'triton'.")

Exception: Failed to modify the attention method of LlamaForCausalLM
"""

How can I fix it?

vllm integration

Your model seems to work really well on long contexts. Do you have any plans to integrate this into vLLM in the future?

Example for phi2?

Could you please release the example script for Phi-2? Thanks.

Flash Attention Support?

Hey love your work and thanks for releasing the code. I saw on r/LocalLLaMA that you plan on adding flash-attention support. Do you have a timeline in mind for it?

What effect will the self-extend trick have on Qwen1.5?

Thanks for your contribution in accommodating Qwen with self-extend.
Qwen1.5 already has a 32k context length. I'm wondering if I can use self-extend to extend it to about 100K?
Have you tested the effect of self-extend on Qwen1.5?

TypeError: 'NoneType' object is not subscriptable

When trying to run predictions from a self-extended model, I get the above error (TypeError: 'NoneType' object is not subscriptable). Without applying selfextend(), I get no error and am able to run predictions.

FlashAttention does not work for Batch size > 1

Thanks to @dwzhu-pku for pointing this out.

Can it be implemented on qwen1.5?

Differences with ReRoPE

Self-Extend (window_size = training_size/2) does not work at all
ReRoPE is Self-Extend (group PE) with window_size = training_size - 1

Self-Extend:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [3, 2, 1, 0, 0, 0, 0, 0, 0, 0],
        [4, 3, 2, 1, 0, 0, 0, 0, 0, 0],
        [4, 4, 3, 2, 1, 0, 0, 0, 0, 0],
        [5, 5, 4, 3, 2, 1, 0, 0, 0, 0],
        [5, 5, 4, 4, 3, 2, 1, 0, 0, 0],
        [6, 6, 5, 5, 4, 3, 2, 1, 0, 0],
        [6, 6, 5, 5, 4, 4, 3, 2, 1, 0]])

ReRoPE:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [3, 2, 1, 0, 0, 0, 0, 0, 0, 0],
        [4, 3, 2, 1, 0, 0, 0, 0, 0, 0],
        [5, 4, 3, 2, 1, 0, 0, 0, 0, 0],
        [6, 5, 4, 3, 2, 1, 0, 0, 0, 0],
        [6, 6, 5, 4, 3, 2, 1, 0, 0, 0],
        [6, 6, 6, 5, 4, 3, 2, 1, 0, 0],
        [6, 6, 6, 6, 5, 4, 3, 2, 1, 0]])

Variant    Method        256      512      1024
base       Self-Extend   1.0464   2.2302   2.4575
base       ReRoPE        1.0464   0.8970   2.1020
+logn      Self-Extend   1.0464   2.2816   2.8442
+logn      ReRoPE        1.0464   0.8832   1.8993
+keynorm   Self-Extend   2.1632   2.5168   3.2525
+keynorm   ReRoPE        2.1632   3.1556   4.6036
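
The two tensors in this issue read like relative-position matrices for a 10-token sequence, and a small sketch can regenerate them. The parameter values (group size 2 and neighbor window 4 for Self-Extend, clamp value 6 for ReRoPE) were inferred here by matching the printed tensors and are not stated in the thread. The function names are hypothetical, and integer division is used throughout as a simplifying assumption.

```python
def self_extend_rel_pos(q: int, k: int, group_size: int, window: int) -> int:
    """Relative position with normal attention inside the window, grouped outside."""
    if k > q:
        return 0  # future (masked) position, printed as 0 in the tensors above
    if q - k < window:
        return q - k  # neighbor tokens keep their exact distance
    # Grouped positions, shifted so they continue where the window ends.
    return q // group_size - k // group_size + window - window // group_size


def rerope_rel_pos(q: int, k: int, window: int) -> int:
    """Relative position clamped at `window` (assumed reading of ReRoPE)."""
    if k > q:
        return 0
    return min(q - k, window)


def build_matrix(rel_pos, n: int):
    """Evaluate rel_pos(q, k) for every query/key pair of an n-token sequence."""
    return [[rel_pos(q, k) for k in range(n)] for q in range(n)]


if __name__ == "__main__":
    for row in build_matrix(lambda q, k: self_extend_rel_pos(q, k, 2, 4), 10):
        print(row)
    print()
    for row in build_matrix(lambda q, k: rerope_rel_pos(q, k, 6), 10):
        print(row)
```

Both matrices top out at 6. They differ in how distant keys are treated: the clamped version assigns every distant key the same position, while the grouped version keeps a coarse ordering among them.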

Cohere command r

Is it possible to adapt this to cohere command-r models ?

Is there example code for enabling SelfExtend on an LLM in safetensors format?

I asked in Chinese because I guessed you can read Chinese based on the author list. If you need to ask in English, please contact me!

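As the title says, I would like to enable SelfExtend on a model downloaded from HF to support a long context window. Is there an example script for enabling it at inference time and running a needle-in-a-haystack test (4k-256k)? I noticed that the paper changes how attention is computed. Is this a plug-and-play method for any LLM?

The format I mean contains files such as: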
├── config.json
├── configuration.json
├── generation_config.json
├── model-00001-of-00003.safetensors
├── model-00002-of-00003.safetensors
├── model-00003-of-00003.safetensors
├── model.safetensors.index.json
├── sft_args.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

If you can provide one, I would be very grateful!

Support with vLLM

Hello!
Thank you for your great work; it's amazing how much hard work you put into this algorithm. I have just one question: is it possible to integrate this with vLLM serving?

This would really improve inference time in limited-resource settings once you cross the 8192-token mark. Is there a way? Thank you in advance for your help!!

llama_self_extend_patch_4_36 does not work

When I use 4.36.2, it does not work. But if I use 4.32.0, it works.
I only changed "import llama_self_extend_patch as LlamaSE" in "llama_example.py" to "import llama_self_extend_patch_4_36 as LlamaSE"

Passkey retrieval (needle in a haystack)

Hello, congratulations on your acceptance to ICML! I think it is well deserved!
I have a question regarding the passkey retrieval task you posted.

Could you briefly explain how you created the text file for this task? (passkey_examples.jsonl)
Thank you.

LongLM really has great potential.

I'm applying this to the Ghost 8B Beta (128k) chat version online here and it seems to work.
In general, I have not yet fine-tuned or tested the parameters against the original model (the current version is already online), but I have noticed that even with a long context the quality remains very good, for example here.

This is a quick issue to share this joy with your research team. Thank you very much~

Run example.py Error: Failed to modify the attention method of LlamaForCausalLM

Hello. I simply ran example.py and hit the error in the "=====SelfExtend using Torch======" part:

Traceback (most recent call last):
  File "./LongLM/example.py", line 112, in <module>
    SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
  File "./LongLM/SelfExtend.py", line 123, in apply
    raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of LlamaForCausalLM

transformers=4.41, flash_attn=2.5.8

Meanwhile, I noticed a similar problem in https://github.com/datamllab/LongLM/issues/31, so I also tried disabling flash attention when loading the model:
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16, use_flash_attention_2=False)

SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)

I printed the model, which shows it is not using flash attention:

  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)

So where is the problem?
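
One possible explanation, offered as a guess and not confirmed in this thread: the `modify_method_of_instance(model, "LlamaAttention", "forward", ...)` call quoted in an earlier issue suggests the patch selects modules by class name, while the printed model uses `LlamaSdpaAttention`. If the match is on the exact name, no module would be patched. The sketch below illustrates that failure mode with stand-in classes and a hypothetical helper `patch_by_class_name`; it is not the repository's code.

```python
import types


class LlamaAttention:
    """Stand-in for an eager attention class (not the real transformers class)."""

    def forward(self):
        return "original"


class LlamaSdpaAttention(LlamaAttention):
    """Stand-in for the SDPA variant that appears in the printed model."""


def patched_forward(self):
    """Placeholder for a replacement forward method."""
    return "self_extend"


def patch_by_class_name(modules, target_class_name, new_forward):
    """Rebind `forward` on modules whose class name matches exactly.

    Returns the number of modules that were patched.
    """
    patched = 0
    for module in modules:
        if module.__class__.__name__ == target_class_name:
            module.forward = types.MethodType(new_forward, module)
            patched += 1
    return patched


if __name__ == "__main__":
    layers = [LlamaSdpaAttention() for _ in range(2)]
    # Exact-name matching ignores subclasses, so nothing is patched here.
    print(patch_by_class_name(layers, "LlamaAttention", patched_forward))
    print(patch_by_class_name(layers, "LlamaSdpaAttention", patched_forward))
```

If this is indeed the cause, loading the model with `attn_implementation="eager"` (a `from_pretrained` argument in recent transformers releases) may produce `LlamaAttention` modules that the patch can find. This is untested here.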

Flash Attention implementation is coming

We already have an implementation. Actually, the new results we released on X (formerly Twitter) with Google's Gemma are based on this implementation (otherwise we could not run on sequences > 30k). However, with the current implementation, we cannot reach the same results (on LongBench) as those we reported in the paper (which are based on the version without flash attention). There is a minor performance gap between the two versions.

We are still trying to figure out the reason.

Something wrong in modify_method_of_instance function

Thanks for your code. I encountered the following issue while trying to extend the context length of Qwen1.5-14B-Chat. Do you know how I can fix this Exception? Many THX!

Traceback (most recent call last):
  File "/***/envs/qwen/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/***/envs/qwen/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/***/Qwen1_5/longlm/LongLM/pred.py", line 63, in get_pred
    SelfExtend.apply(model, group_size, window_size, enable_flash_attention=use_flash, flash_attention_impl="flash_attn")
  File "/***/Qwen1_5/longlm/LongLM/SelfExtend.py", line 179, in apply
    raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of Qwen2ForCausalLM

Under the setting of:

transformers 4.38.2
flash-attn 2.5.5

Requires excessive computing resources when inference

I use 2 GPUs. During inference (with llama-2-7B and the 10k questions in the demo), each GPU requires about 50GB of additional memory, not counting the loaded model. But when I use 8 GPUs, inference still requires about 50GB of memory per card. Why is that?

Also, I cannot test the llama-2-70B model, because even with 8 GPUs used for inference, the memory required by each GPU is huge.

Are there any ways to optimize resource usage during inference? Have you tried testing your method on the 70B model with 10k-token QA?

How to reproduce the results on LongBench

Hi, very impressive work! Will the code to reproduce the results on LongBench be released anytime soon? Specifically, without flash attention, how do you run the experiments with a 16k input length and a 7B/10B model? Given the memory consumption, this seems impossible on GPUs with 40G of memory and difficult even on GPUs with 80G.
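
A rough back-of-the-envelope estimate shows why a 16k input is demanding without flash attention. The sketch assumes the attention scores are materialized in full as a (heads x seq_len x seq_len) tensor, 32 attention heads (as in Llama-2-7B), 16-bit elements, and batch size 1. It counts a single score tensor for a single layer and ignores weights, KV cache, and any extra tensors an implementation might build.

```python
def attention_scores_bytes(seq_len: int, num_heads: int,
                           bytes_per_element: int, batch_size: int = 1) -> int:
    """Bytes for one fully materialized attention score tensor of one layer."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    # Assumptions: 16k tokens, 32 heads, 2 bytes per element (fp16/bf16).
    size = attention_scores_bytes(seq_len=16 * 1024, num_heads=32, bytes_per_element=2)
    print(f"{size / 1024**3:.1f} GiB")
```

Under these assumptions a single score tensor already takes 16 GiB, and the cost grows quadratically with sequence length. A method that needs more than one such tensor at a time would multiply that figure accordingly.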

Phi2 implementation and Suggestions

Hi,

Loved this paper and implementation. I implemented this for Phi2 with transformers==4.36.2, without caching. Within the context size, the outputs are even better at following instructions than the original model. However, when going beyond the context window, I am seeing repetition. This might be due to extending the context window a bit too much. Do you have suggestions for experimenting with different group and neighbor sizes, or any other insights?

Here is my implementation:
https://github.com/agokrani/LongLM/tree/phi2

I haven't implemented KV caching for now because the KV cache format in transformers changed. I will try to implement it soon. Would love to hear your thoughts.

OOM when length is 16k

Has anyone successfully run cases with a length of 16k? What kind of machine resources support the experiment?

Questions regarding group query/key positional index

Hi! I love your work and code implementation. Learned a lot.
I have a couple of questions regarding the code implementation.

LongLM/self_extend_patch/Llama.py

Lines 294 to 295 in 6e25a31

 group_query_position = query_position // group_size_1 + _re_group_size_2 - _re_group_size_2 / group_size_1 

 group_key_position = key_position // group_size_1

I understand that group_query_position is generated according to the formula shown in Figure 3 of the paper. However, I am curious why group_key_position is simply determined by dividing by group_size (without the neighbor-attention offset), unlike the query. Could you please clarify if I am missing something here?

Thank you in advance for your help.
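
One possible reading, offered as a sketch and not as the authors' answer: with rotary position embeddings the attention score depends only on the difference between query and key positions, so a constant offset has to be applied to one side only. Adding it to the query is then enough to make grouped positions continue where the neighbor window ends. The function name is hypothetical, and integer division replaces the `/` in the quoted code as a simplifying assumption.

```python
def grouped_relative_position(q_pos: int, k_pos: int,
                              group_size: int, window: int) -> int:
    """Relative position under grouped attention, offset applied to the query only."""
    group_query = q_pos // group_size + window - window // group_size
    group_key = k_pos // group_size
    return group_query - group_key


if __name__ == "__main__":
    group_size, window = 2, 4
    # A key exactly `window` tokens behind the query gets relative position
    # `window`, directly after the largest neighbor distance (window - 1).
    for k_pos in range(6):
        print(k_pos, grouped_relative_position(k_pos + window, k_pos, group_size, window))
```

Shifting the key by the negative of the same constant would give an identical difference, which may be why the code leaves the key position as a plain floor division.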

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when resuming training

I tried to run example.py on an A100 (80GB) GPU. It seems there is a bug at line 41:

LongLM/example.py

Line 41 in ee92c84

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

The current implementation doesn't move the input_ids tensor onto the device, which causes an error. I fixed the issue by replacing the line above with: input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

Long context

Long input series makes oom

Great job! However, I have encountered an issue. When the length of the input increases, the GPU memory consumption grows rapidly, and it quickly leads to an out-of-memory (OOM) error. Could you please let me know if this is a bug?

