datamllab / longlm


[ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

Home Page: https://arxiv.org/pdf/2401.01325.pdf

License: MIT License

Python 100.00%

context-window large-language-models llm longlm self-extend selfextend

longlm's Issues

llama_self_extend_patch_4_36 does not work

With 4.36.2 it does not work, but with 4.32.0 it works.
The only change I made was in "llama_example.py": "import llama_self_extend_patch as LlamaSE" became "import llama_self_extend_patch_4_36 as LlamaSE"

OOM when length is 16k

Has anyone successfully run cases with a length of 16k? What kind of machine resources support the experiment?

How to reproduce the results on LongBench

Hi, very impressive work! Will the code to reproduce the LongBench results be released anytime soon? Specifically, without flash attention, how can the experiments be run with a 16k input length and a 7B/10B model? The memory consumption seems to make this impossible on GPUs with 40G memory and difficult even on GPUs with 80G memory.

Support for Phi2 / Mixformer

Great work!!

It would be super to have deeper support for Phi2 / Mixformer,
e.g. https://huggingface.co/amgadhasan/phi-2

Edit:
Tried the existing phi patches with some modifications, but it seems like some core assumptions are pretty different. e.g.
AttributeError: 'MixFormerSequentialForCausalLM' object has no attribute 'q_proj'

About GPU memory usage

Dear author,

I'm trying to run LongLM on a single A10 with 24G memory. I tried 'meta-llama/Llama-2-7b-chat-hf' and it failed with a CUDA out-of-memory error (attached).

I realized that your example.py is running on 4 RTX3090s, 24GB memory each. So I wonder whether an A10 is something worth a shot or not even close?

I also want to ask whether compressed models, for example Unsloth's model, can be used in LongLM or not?

FlashAttention does not work for Batch size > 1

Thanks to @dwzhu-pku for pointing this out.

llama3 is not working.

I followed your instructions below to apply SelfExtend to Llama 3:
"""
[04/19/2024]:💡 We added the support for LLama-3 with transformers==4.40. To use it with transformers==4.40, you may change the file name of Llama_4_40.py to Llama.py to replace the existing patch file.
"""

I got this error.
"""

Exception Traceback (most recent call last)
Cell In[12], line 4
2 group_size = 5
3 window_size = 1024
----> 4 SelfExtend.apply(model, group_size, window_size, enable_flash_attention=True)#, flash_attention_impl='flash_attn')
5 model.eval()

File /home/ubuntu/reports/SelfExtend.py:109, in apply(loaded_model, group_size, window_size, enable_flash_attention, scale_base, flash_attention_impl)
107 print("Using triton flash self_extend!!")
108 if (not modifed):
--> 109 raise Exception(f"Failed to modify the attention method of {arch_name}")
110 else:
111 raise Exception(f"Need to set the flash_attention_impl to 'flash_attn' or 'triton'.")

Exception: Failed to modify the attention method of LlamaForCausalLM
"""

How can I fix it?

Still unable to answer long texts after running the code

The error is shown below:
Token indices sequence length is longer than the specified maximum sequence length for this model (9556 > 4096). Running this sequence through the model will result in indexing errors

LongLM isn't compatible with gemma-2-27b-it or gemma-2b-it

I found that the current version of LongLM cannot load Gemma 1 or Gemma 2 models successfully. I wrote a minimal test to help reproduce the issue:

# transformers version 4.38.2
# this example is tested with 4 RTX3090s, 24GB memory each
import warnings
warnings.filterwarnings("ignore")

import torch 
import json
import time
from transformers.models.llama.modeling_llama import LlamaAttention
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig

import SelfExtend 

window_size = 1024
group_size = 32

model_name = '/tmp/gemma-2b-it/'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()
SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
prompt = "How are you?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

start_time = time.time()
tokens = model.generate(input_ids, max_new_tokens=4096)
answer = tokenizer.decode(tokens[0].tolist()[input_ids.shape[1]:], skip_special_tokens=True)
print( answer )

While trying to load the model, it fails with the error message below:

$ python3 test.py 
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.07it/s]
Traceback (most recent call last):
  File "/var/lib/condor/execute/slot1/dir_2652801/test.py", line 22, in <module>
    SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
  File "/var/lib/condor/execute/slot1/dir_2652801/SelfExtend.py", line 160, in apply
    raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of GemmaForCausalLM

I found that it fails in the duplicate check at line 24 (L24) of SelfExtend.py. When it fails, instance = False.

Below is a conda env export dump including package details in my Python environment:

channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.7.2=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_1
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.14=h5eee18b_0
  - pip=24.0=py310h06a4308_0
  - python=3.10.14=h955ad1f_1
  - readline=8.2=h5eee18b_0
  - setuptools=69.5.1=py310h06a4308_0
  - sqlite=3.45.3=h5eee18b_0
  - tk=8.6.14=h39e8969_0
  - wheel=0.43.0=py310h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zlib=1.2.13=h5eee18b_1
  - pip:
      - accelerate==0.33.0
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - annotated-types==0.7.0
      - anyio==4.4.0
      - async-timeout==4.0.3
      - attrs==23.2.0
      - certifi==2024.7.4
      - charset-normalizer==3.3.2
      - click==8.1.7
      - cloudpickle==3.0.0
      - cmake==3.30.1
      - datasets==2.20.0
      - dill==0.3.8
      - diskcache==5.6.3
      - distro==1.9.0
      - dnspython==2.6.1
      - einops==0.8.0
      - email-validator==2.2.0
      - exceptiongroup==1.2.2
      - fastapi==0.111.1
      - fastapi-cli==0.0.4
      - filelock==3.15.4
      - flash-attn==2.6.3
      - frozenlist==1.4.1
      - fsspec==2024.5.0
      - h11==0.14.0
      - httpcore==1.0.5
      - httptools==0.6.1
      - httpx==0.27.0
      - huggingface-hub==0.24.2
      - idna==3.7
      - interegular==0.3.3
      - jinja2==3.1.4
      - jsonschema==4.23.0
      - jsonschema-specifications==2023.12.1
      - lark==1.1.9
      - llvmlite==0.43.0
      - lm-format-enforcer==0.10.3
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - mdurl==0.1.2
      - mpmath==1.3.0
      - msgpack==1.0.8
      - multidict==6.0.5
      - multiprocess==0.70.16
      - nest-asyncio==1.6.0
      - networkx==3.3
      - ninja==1.11.1.1
      - numba==0.60.0
      - numpy==1.26.4
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==8.9.2.26
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-ml-py==12.555.43
      - nvidia-nccl-cu12==2.20.5
      - nvidia-nvjitlink-cu12==12.5.82
      - nvidia-nvtx-cu12==12.1.105
      - openai==1.37.1
      - outlines==0.0.46
      - packaging==24.1
      - pandas==2.2.2
      - pillow==10.4.0
      - prometheus-client==0.20.0
      - prometheus-fastapi-instrumentator==7.0.0
      - protobuf==5.27.2
      - psutil==6.0.0
      - py-cpuinfo==9.0.0
      - pyairports==2.1.1
      - pyarrow==17.0.0
      - pyarrow-hotfix==0.6
      - pycountry==24.6.1
      - pydantic==2.8.2
      - pydantic-core==2.20.1
      - pygments==2.18.0
      - python-dateutil==2.9.0.post0
      - python-dotenv==1.0.1
      - python-multipart==0.0.9
      - pytz==2024.1
      - pyyaml==6.0.1
      - pyzmq==26.0.3
      - ray==2.33.0
      - referencing==0.35.1
      - regex==2024.7.24
      - requests==2.32.3
      - rich==13.7.1
      - rpds-py==0.19.1
      - safetensors==0.4.3
      - sentencepiece==0.2.0
      - shellingham==1.5.4
      - six==1.16.0
      - sniffio==1.3.1
      - starlette==0.37.2
      - sympy==1.13.1
      - tiktoken==0.7.0
      - tokenizers==0.19.1
      - torch==2.3.1
      - torchvision==0.18.1
      - tqdm==4.66.4
      - transformers==4.43.3
      - triton==2.3.1
      - typer==0.12.3
      - typing-extensions==4.12.2
      - tzdata==2024.1
      - urllib3==2.2.2
      - uvicorn==0.30.3
      - uvloop==0.19.0
      - vllm==0.5.3.post1
      - vllm-flash-attn==2.5.9.post1
      - watchfiles==0.22.0
      - websockets==12.0
      - xformers==0.0.27
      - xxhash==3.4.1
      - yarl==1.9.4

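Is there example code for enabling SelfExtend on an LLM in safetensors format?

I asked in Chinese because I guessed you can read Chinese based on the author list. If you need to ask in English, please contact me!

As the title says, I would like to enable SelfExtend on a model downloaded from HF to support a long context window. Is there an example script that enables it at inference time and runs a needle-in-a-haystack test (4k-256k)? I noticed that the paper changes how attention is computed. Is this a plug-and-play method for any LLM?

The format in question contains files such as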
├── config.json
├── configuration.json
├── generation_config.json
├── model-00001-of-00003.safetensors
├── model-00002-of-00003.safetensors
├── model-00003-of-00003.safetensors
├── model.safetensors.index.json
├── sft_args.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

If you can provide one, I would be very grateful!

Requires excessive computing resources when inference

I use 2 GPUs. During inference (with llama-2-7B and the 10k questions in the demo), each GPU requires about 50GB of additional memory, not counting the memory for loading the model. But when I use 8 GPUs, inference still requires about 50GB of memory per card. Why is that?

Also, I can not test llama-2-70B model, because even if 8 GPUs are used for simultaneous inference, the GPU memory required by each GPU during the inference process is quite huge.

Are there any ways to optimize resource usage during inference? Have you tried testing your method on the 70B model with 10k tokens QA?

Can it be implemented on qwen1.5?

OOM on LongBench

Hi, how did you evaluate on LongBench?

I tried to switch your Llama to the extended version with

LongLM/llama_example.py

Line 40 in 6b84193

 modify_method_of_instance(model, "LlamaAttention", "forward", self_extend_forward) 

This runs out of memory (OOM) even on 2 A100-80GB GPUs with DataParallel?

I used the generation flow from LongBench: https://github.com/THUDM/LongBench/blob/main/pred.py
Without the extended forward, one model takes 60GB of memory.

Phi2 implementation and Suggestions

Hi,

Loved this paper and implementation. I implemented this for Phi2 with transformers==4.36.2, without caching. Within the context size, the outputs follow instructions even better than the original model. However, when going beyond the context window, I am seeing repetition. This might be due to extending the context window a bit too much. Do you have suggestions for experimenting with different group and neighbour sizes, or any other insights?

Here is my implementation:
https://github.com/agokrani/LongLM/tree/phi2

I haven't implemented KV caching yet because of the change in the KV caching format in transformers. I will try to implement it soon. Would love to hear your thoughts.

Example for gemma & use with Ollama

Is there an example of how the extension can be applied to the Gemma models? Also, once the models are extended, how can they be exported for use with Ollama? Ollama uses the GGUF format.

Question about equation 4 and Table 5 caption in paper

Hi! I have a question that may seem simple, but I think I'm overlooking something.

Assume Phi-2's context window is 2K. When we apply a group size ($G_s$) of 4 and neighbor tokens ($w_n$) of 512, according to Equation 4: $(L - w_n) \times G_s + w_n$, the calculated extended pre-training length is approximately $(2K - 0.5K) \times 4 + 0.5K = 6.5K$.

However, in the paper, the caption of Table 5 states:

The vanilla Phi-2 has a 2k context window. SelfExtend extends Phi-2 to 4K ($G_s=4, w_n=512$), 6K ($G_s=8, w_n=512$).

Considering Phi-2's maximum token length is 2K, I am puzzled as to how the extended lengths of 4K and 6K are derived. What am I missing here?
Any insights would be greatly appreciated.
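
The arithmetic in this question can be checked with a few lines of Python. This is only a sketch of the formula as quoted above, $(L - w_n) \times G_s + w_n$; the function name `extended_context_length` and its parameter names are made up for illustration and do not come from the repository.

```python
def extended_context_length(pretrain_len: int, group_size: int, neighbor_window: int) -> int:
    """Evaluate (L - w_n) * G_s + w_n, the formula quoted from Equation 4."""
    return (pretrain_len - neighbor_window) * group_size + neighbor_window


# The two settings named in the Table 5 caption, with a 2K (2048-token) window.
print(extended_context_length(2048, 4, 512))  # 6656, i.e. 6.5K
print(extended_context_length(2048, 8, 512))  # 12800, i.e. 12.5K
```

Both values are larger than the 4K and 6K stated in the caption, which is the gap the question is asking about.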

No error is raised, but the experimental results are not displayed

#Tokens of Prompt: 5144 Passkey target: 89427
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Llama2: [What is the pass key? The pass key is 。。。. 。]
SelfExtend: [What is the pass key? The pass key is 。。。. 。]

Something wrong in modify_method_of_instance function

Thanks for your code. I encountered the following issue while trying to extend the context length of Qwen1.5-14B-Chat. Do you know how I can fix this Exception? Many THX!

Traceback (most recent call last):
  File "/***/envs/qwen/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/***/envs/qwen/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/***/Qwen1_5/longlm/LongLM/pred.py", line 63, in get_pred
    SelfExtend.apply(model, group_size, window_size, enable_flash_attention=use_flash, flash_attention_impl="flash_attn")
  File "/***/Qwen1_5/longlm/LongLM/SelfExtend.py", line 179, in apply
    raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of Qwen2ForCausalLM

Under the setting of:

transformers 4.38.2
flash-attn 2.5.5

vllm integration

Your model seems to work really well on long contexts. Do you have any plans to integrate this into vllm in the future?

Cohere command r

Is it possible to adapt this to Cohere Command-R models?

Run example.py Error: Failed to modify the attention method of LlamaForCausalLM

Hello. I simply ran example.py and hit this error in the "=====SelfExtend using Torch======" part:

Traceback (most recent call last):
  File "./LongLM/example.py", line 112, in <module>
    SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
  File "./LongLM/SelfExtend.py", line 123, in apply
    raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of LlamaForCausalLM

transformers=4.41, flash_attn=2.5.8

I also noticed the similar problem in https://github.com/datamllab/LongLM/issues/31, so I tried disabling flash attention when loading the model as well:
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16, use_flash_attention_2=False)

SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)

I printed the model, which shows it's not the flash attention:

  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)

So where is the problem?
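
The printout above shows `LlamaSdpaAttention` modules, while another issue on this page patches by the class name "LlamaAttention". Whether that mismatch is the cause here is not confirmed anywhere in the thread. As a diagnostic aid only, the sketch below lists which attention class names a loaded model actually contains. It relies only on `named_modules()`, which `torch.nn.Module` provides; `attention_class_counts` and the stand-in classes are hypothetical and exist just so the example runs without downloading a model.

```python
from collections import Counter


def attention_class_counts(model) -> Counter:
    """Count module class names that contain 'Attention'.

    `model` only needs a `named_modules()` method yielding (name, module)
    pairs, which is what `torch.nn.Module` offers.
    """
    return Counter(
        type(module).__name__
        for _, module in model.named_modules()
        if "Attention" in type(module).__name__
    )


# Hypothetical stand-ins (not the real transformers classes).
class LlamaSdpaAttention:
    pass


class LlamaMLP:
    pass


class FakeModel:
    def named_modules(self):
        for i in range(2):
            yield f"model.layers.{i}.self_attn", LlamaSdpaAttention()
            yield f"model.layers.{i}.mlp", LlamaMLP()


counts = attention_class_counts(FakeModel())
print(counts)  # Counter({'LlamaSdpaAttention': 2})
```

With a real model, calling `attention_class_counts(model)` right after `from_pretrained` shows the class names that any name-based patching would have to match.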

Long context

Export weights

Hi,
Thank you for releasing this! Can I save Phi 2 weights with self-extend to finetune on?
Thank you!

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when resuming training

I tried to run example.py on an A100 (80GB) GPU. It seems there is a bug at line [41]

LongLM/example.py

Line 41 in ee92c84

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

The current implementation doesn't load the input_ids tensors onto the device, which causes an error. I replaced the above code, and it's now working. Fixed the issue by adding: input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

Support with vLLM

Hello!
Thank you for your great work; it's amazing how much hard work you put into this algorithm. I just have one question: is it possible to integrate this with vLLM serving?

This will really boost the inference time in limited resources setting once you cross the 8192 token mark, is there a way ? Thank you in advance for your help!!

Example for phi2?

Could you please release the example script for Phi-2? Thanks.

Flash Attention Support?

Hey love your work and thanks for releasing the code. I saw on r/LocalLLaMA that you plan on adding flash-attention support. Do you have a timeline in mind for it?

Question | Has anyone tried this with GGUF models?

Flash Attention implementation is coming

We already have the implementation. In fact, the new results we released on X (previously Twitter) with Google's Gemma are based on this implementation (otherwise we could not run it on sequences > 30k). However, with the current implementation we cannot reach the same results (on LongBench) as those reported in the paper, which are based on the version without flash attention. There is a minor performance gap between the two versions.

We are still trying to figure out the reason.

TypeError: 'NoneType' object is not subscriptable

When trying to run predictions with a SelfExtend-ed model, I get the above error, TypeError: 'NoneType' object is not subscriptable. Without applying selfextend() I get no error and can run predictions.

What effect will the self-extend trick have on Qwen1.5?

Thanks for your contribution in accommodating Qwen in self-extend.
Qwen1.5 already has a 32k context length. I'm wondering if I can use self-extend to take it to about 100K?
Have you ever tested the effect on qwen1.5 using self-extend?
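
For a rough sense of the group size such an extension would need, the formula quoted from Equation 4 in another issue on this page, $(L - w_n) \times G_s + w_n$, can be inverted. The sketch below does only that arithmetic; it says nothing about the quality Qwen1.5 would reach. The helper name `min_group_size` and the neighbor window of 1024 are assumptions chosen for illustration.

```python
def min_group_size(pretrain_len: int, neighbor_window: int, target_len: int) -> int:
    """Smallest integer G_s with (L - w_n) * G_s + w_n >= target_len."""
    needed = target_len - neighbor_window
    per_group = pretrain_len - neighbor_window
    # -(-a // b) is ceiling division for positive b; never return less than 1.
    return max(1, -(-needed // per_group))


# 32k pretraining window, assumed neighbor window of 1024, target of 100K tokens.
group_size = min_group_size(32 * 1024, 1024, 100_000)
print(group_size)  # 4
print((32 * 1024 - 1024) * group_size + 1024)  # 128000
```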

Passkey retrieval (needle in a haystack)

Hello, congratulations on your acceptance to ICML! I think it is well deserved!
I have a question regarding the passkey retrieval task you posted.

Could you briefly explain how you created the text file for this task? (passkey_examples.jsonl)
Thank you.
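
How passkey_examples.jsonl was actually built is not stated on this page. For readers who want something to experiment with, here is a minimal sketch of a common way to construct such a task: repeat filler sentences, hide one sentence containing a random number, and end with the question that appears in the logs elsewhere on this page ("What is the pass key? The pass key is"). The filler text, the function name `make_passkey_example`, and the `input`/`target` field names are all assumptions, not the authors' format.

```python
import json
import random

# Assumed filler text; any irrelevant sentences would do.
FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. "


def make_passkey_example(n_filler: int, seed: int) -> dict:
    """Build one prompt with a 5-digit passkey hidden among filler chunks."""
    rng = random.Random(seed)
    passkey = rng.randint(10000, 99999)
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    chunks = [FILLER] * n_filler
    chunks.insert(rng.randint(0, n_filler), needle)  # random depth in the haystack
    prompt = (
        "There is an important piece of information hidden in a lot of "
        "irrelevant text. Find it and memorize it.\n"
        + "".join(chunks)
        + "\nWhat is the pass key? The pass key is"
    )
    return {"input": prompt, "target": str(passkey)}


example = make_passkey_example(n_filler=50, seed=0)
line = json.dumps(example)  # one line of a .jsonl file
```

Increasing `n_filler` lengthens the prompt, and the fixed seed keeps each example reproducible.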

Long inputs cause OOM

Great job! However, I have encountered an issue. When the length of the input increases, the GPU memory consumption grows rapidly, and it quickly leads to an out-of-memory (OOM) error. Could you please let me know if this is a bug?
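
Whether this growth is a bug is for the maintainers to say. As a point of reference, attention without a flash-style kernel materializes a score matrix of size heads × length × length, so memory grows quadratically with input length. The back-of-the-envelope sketch below counts only that one matrix for a single layer at batch size 1. The 32 heads (the usual Llama-2-7B configuration) and 2 bytes per element (bf16) are assumptions, and everything else (weights, KV cache, softmax upcasts, any additional score matrices) is ignored.

```python
def attention_score_bytes(seq_len: int, num_heads: int, bytes_per_element: int) -> int:
    """Bytes for one num_heads x seq_len x seq_len score matrix (one layer, batch 1)."""
    return num_heads * seq_len * seq_len * bytes_per_element


GIB = 1024 ** 3
for seq_len in (4096, 8192, 16384):
    # Assumed: 32 attention heads, bf16 scores (2 bytes each).
    size = attention_score_bytes(seq_len, num_heads=32, bytes_per_element=2)
    print(f"{seq_len:>6} tokens: {size / GIB:.0f} GiB")
```

Under these assumptions the loop prints 1 GiB, 4 GiB and 16 GiB: doubling the input length quadruples the size of the score matrix.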

Questions regarding group query/key positional index

Hi! I love your work and code implementation. I learned a lot.
I have a couple of questions regarding the code implementation.

LongLM/self_extend_patch/Llama.py

Lines 294 to 295 in 6e25a31

 group_query_position = query_position // group_size_1 + _re_group_size_2 - _re_group_size_2 / group_size_1 

 group_key_position = key_position // group_size_1

I understand that group_query_position is generated according to the formula shown in Figure 3 of the paper. However, I am curious why group_key_position is determined simply by dividing by group_size (without the neighbor-attention term), unlike the query. Could you please clarify if I am missing something here?

Thank you in advance for your help.
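
One way to look at the two quoted lines: rotary position embeddings make the attention score depend on the difference between query and key positions, so a constant offset only needs to be applied on one side. The sketch below is an illustration, not the repository's code. It assumes that `_re_group_size_2` is the neighbor window size, that the window is a multiple of the group size, and it uses integer division where the original line uses `/`. All function names are made up.

```python
def grouped_relative_position(q: int, k: int, group_size: int, window: int) -> int:
    """Grouped relative position with the shift applied to the query only."""
    shift = window - window // group_size
    group_query_position = q // group_size + shift
    group_key_position = k // group_size
    return group_query_position - group_key_position


def grouped_relative_position_key_shift(q: int, k: int, group_size: int, window: int) -> int:
    """Same quantity with the shift moved to the key side (sign flipped)."""
    shift = window - window // group_size
    return q // group_size - (k // group_size - shift)


G, W = 4, 8  # illustrative values; W is a multiple of G
for q in range(W, 64):
    # At distance exactly W the grouped position equals W, so it continues
    # the normal positions 0..W-1 used inside the neighbor window.
    assert grouped_relative_position(q, q - W, G, W) == W
    for k in range(q + 1):
        # Shifting the query or the key yields the same difference.
        assert grouped_relative_position(q, k, G, W) == grouped_relative_position_key_shift(q, k, G, W)
```

Under these assumptions, shifting the key as well would add nothing: only the difference matters, and the query-side shift already makes the grouped positions start where the neighbor window ends.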

Differences with ReRoPE

Self-Extend (window_size = training_size/2) does not work at all
ReRoPE is Self-Extend (group PE) with window_size = training_size - 1

Relative position matrices (10 tokens):

Self-Extend:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [3, 2, 1, 0, 0, 0, 0, 0, 0, 0],
        [4, 3, 2, 1, 0, 0, 0, 0, 0, 0],
        [4, 4, 3, 2, 1, 0, 0, 0, 0, 0],
        [5, 5, 4, 3, 2, 1, 0, 0, 0, 0],
        [5, 5, 4, 4, 3, 2, 1, 0, 0, 0],
        [6, 6, 5, 5, 4, 3, 2, 1, 0, 0],
        [6, 6, 5, 5, 4, 4, 3, 2, 1, 0]])

ReRoPE:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [3, 2, 1, 0, 0, 0, 0, 0, 0, 0],
        [4, 3, 2, 1, 0, 0, 0, 0, 0, 0],
        [5, 4, 3, 2, 1, 0, 0, 0, 0, 0],
        [6, 5, 4, 3, 2, 1, 0, 0, 0, 0],
        [6, 6, 5, 4, 3, 2, 1, 0, 0, 0],
        [6, 6, 6, 5, 4, 3, 2, 1, 0, 0],
        [6, 6, 6, 6, 5, 4, 3, 2, 1, 0]])

            Self-Extend                   ReRoPE
            256     512     1024          256     512     1024
base        1.0464  2.2302  2.4575        1.0464  0.8970  2.1020
+logn       1.0464  2.2816  2.8442        1.0464  0.8832  1.8993
+keynorm    2.1632  2.5168  3.2525        2.1632  3.1556  4.6036
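
The two matrices above can be reproduced with a short pure-Python sketch. The parameters are inferred from the printed values, not stated in the issue: group size 2 and neighbor window 4 for Self-Extend, and a clamp at distance 6 for ReRoPE. The merge rule (normal positions when the distance is below the window, grouped positions otherwise) and the function names are likewise assumptions made for illustration.

```python
def self_extend_matrix(n: int, group_size: int, window: int) -> list:
    """Relative positions: normal inside the window, grouped outside it."""
    shift = window - window // group_size
    rows = []
    for q in range(n):
        row = []
        for k in range(n):
            if k > q:
                row.append(0)  # future positions are masked; shown as 0
            elif q - k < window:
                row.append(q - k)  # neighbor attention: ordinary distance
            else:
                row.append(q // group_size + shift - k // group_size)
        rows.append(row)
    return rows


def rerope_matrix(n: int, window: int) -> list:
    """Relative positions clamped at `window`."""
    return [[min(q - k, window) if k <= q else 0 for k in range(n)] for q in range(n)]


se = self_extend_matrix(10, group_size=2, window=4)
rr = rerope_matrix(10, window=6)
print(se[9])  # [6, 6, 5, 5, 4, 4, 3, 2, 1, 0]
print(rr[9])  # [6, 6, 6, 6, 5, 4, 3, 2, 1, 0]
```

In this sketch the only difference beyond the window is that Self-Extend keeps increasing the position once per group, while the clamped variant stays flat.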

LongLM really has great potential.

I'm applying this to the Ghost 8B Beta (128k) chat version online here and it seems to work.
I have not yet fine-tuned the parameters or tested them against the original model (the current version is already online), but I have noticed that even with a long context the quality remains very good, for example here.

This is a quick issue to share this joy with your research team. Thank you very much~

Release the SOLAR model

The paper mentions a SEext-SOLAR-10.5B model?
