datamllab / longlm
[ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
Home Page: https://arxiv.org/pdf/2401.01325.pdf
License: MIT License
With transformers 4.36.2 it does not work, but with 4.32.0 it does.
The only change I made was replacing "import llama_self_extend_patch as LlamaSE" in "llama_example.py" with "import llama_self_extend_patch_4_36 as LlamaSE".
Has anyone successfully run cases with a length of 16k? What kind of machine resources support the experiment?
Hi, very impressive work! Will the code to reproduce the results on LongBench be released anytime soon? Specifically, without flash attention, how can the experiments be run with a 16k input length and a 7B/10B model? The memory consumption seems to make this impossible on GPUs with 40G of memory and difficult even on GPUs with 80G.
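A rough back-of-the-envelope sketch of why the non-flash path is so memory-hungry at 16k tokens. It only counts one fully materialized attention-score tensor for a single layer. The assumptions are 32 attention heads (as in a Llama-2-7B-sized model), 2 bytes per element (bf16) and batch size 1; activations, weights and the KV cache are not counted.

```python
def attention_scores_bytes(seq_len, num_heads, bytes_per_elem=2, batch_size=1):
    """Bytes for one materialized (seq_len x seq_len) score tensor in one layer."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_elem


GIB = 1024 ** 3

# Assumed values: 16k tokens, 32 heads, bf16 (2 bytes per element).
one_matrix = attention_scores_bytes(seq_len=16 * 1024, num_heads=32)
print(f"one score tensor: {one_matrix / GIB:.1f} GiB")  # one score tensor: 16.0 GiB
```

If the neighbor and grouped attention scores are both materialized before being merged, as the paper's description suggests, the peak for this step would be roughly twice that figure, which is in line with the difficulty reported on 40G and 80G cards.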
Great work!!
It would be super to have deeper support for Phi2 / Mixformer,
e.g. https://huggingface.co/amgadhasan/phi-2
Edit:
Tried the existing phi patches with some modifications, but it seems like some core assumptions are pretty different. e.g.
AttributeError: 'MixFormerSequentialForCausalLM' object has no attribute 'q_proj'
Dear author,
I'm trying to run LongLM on a single A10 with 24G memory. I tried 'meta-llama/Llama-2-7b-chat-hf' and it failed with a CUDA out-of-memory error (attached).
I realized that your example.py runs on 4 RTX3090s with 24GB memory each, so I wonder whether an A10 is worth a shot or not even close?
I also want to ask whether compressed models, for example Unsloth's model, can be used in LongLM or not?
Thanks to @dwzhu-pku for pointing this out.
I followed your instructions below to apply SelfExtend to Llama-3:
"""
[04/19/2024]:💡 We added the support for LLama-3 with transformers==4.40. To use it with transformers==4.40, you may change the file name of Llama_4_40.py to Llama.py to replace the existing patch file.
"""
Exception Traceback (most recent call last)
Cell In[12], line 4
2 group_size = 5
3 window_size = 1024
----> 4 SelfExtend.apply(model, group_size, window_size, enable_flash_attention=True)#, flash_attention_impl='flash_attn')
5 model.eval()
File /home/ubuntu/reports/SelfExtend.py:109, in apply(loaded_model, group_size, window_size, enable_flash_attention, scale_base, flash_attention_impl)
107 print("Using triton flash self_extend!!")
108 if (not modifed):
--> 109 raise Exception(f"Failed to modify the attention method of {arch_name}")
110 else:
111 raise Exception(f"Need to set the flash_attention_impl to 'flash_attn' or 'triton'.")
Exception: Failed to modify the attention method of LlamaForCausalLM
"""
How can I fix this?
The error is shown below:
Token indices sequence length is longer than the specified maximum sequence length for this model (9556 > 4096). Running this sequence through the model will result in indexing errors
I found that the current version of LongLM cannot load Gemma 1 or Gemma 2 models successfully. I wrote a minimal test to help reproduce the issue:
# transformers version 4.38.2
# this example is tested with 4 RTX3090s, 24GB memory each
import warnings
warnings.filterwarnings("ignore")
import torch
import json
import time
from transformers.models.llama.modeling_llama import LlamaAttention
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import SelfExtend
window_size = 1024
group_size = 32
model_name = '/tmp/gemma-2b-it/'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()
SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
prompt = "How are you?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
start_time = time.time()
tokens = model.generate(input_ids, max_new_tokens=4096)
answer = tokenizer.decode(tokens[0].tolist()[input_ids.shape[1]:], skip_special_tokens=True)
print( answer )
While trying to load the model, it fails with the error message below:
$ python3 test.py
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.07it/s]
Traceback (most recent call last):
File "/var/lib/condor/execute/slot1/dir_2652801/test.py", line 22, in <module>
SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
File "/var/lib/condor/execute/slot1/dir_2652801/SelfExtend.py", line 160, in apply
raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of GemmaForCausalLM
I found that it fails in the duplicate check at L24 of SelfExtend.py. When it fails, instance = False.
Below is a conda env export dump including package details in my Python environment:
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- bzip2=1.0.8=h5eee18b_6
- ca-certificates=2024.7.2=h06a4308_0
- ld_impl_linux-64=2.38=h1181459_1
- libffi=3.4.4=h6a678d5_1
- libgcc-ng=11.2.0=h1234567_1
- libgomp=11.2.0=h1234567_1
- libstdcxx-ng=11.2.0=h1234567_1
- libuuid=1.41.5=h5eee18b_0
- ncurses=6.4=h6a678d5_0
- openssl=3.0.14=h5eee18b_0
- pip=24.0=py310h06a4308_0
- python=3.10.14=h955ad1f_1
- readline=8.2=h5eee18b_0
- setuptools=69.5.1=py310h06a4308_0
- sqlite=3.45.3=h5eee18b_0
- tk=8.6.14=h39e8969_0
- wheel=0.43.0=py310h06a4308_0
- xz=5.4.6=h5eee18b_1
- zlib=1.2.13=h5eee18b_1
- pip:
- accelerate==0.33.0
- aiohttp==3.9.5
- aiosignal==1.3.1
- annotated-types==0.7.0
- anyio==4.4.0
- async-timeout==4.0.3
- attrs==23.2.0
- certifi==2024.7.4
- charset-normalizer==3.3.2
- click==8.1.7
- cloudpickle==3.0.0
- cmake==3.30.1
- datasets==2.20.0
- dill==0.3.8
- diskcache==5.6.3
- distro==1.9.0
- dnspython==2.6.1
- einops==0.8.0
- email-validator==2.2.0
- exceptiongroup==1.2.2
- fastapi==0.111.1
- fastapi-cli==0.0.4
- filelock==3.15.4
- flash-attn==2.6.3
- frozenlist==1.4.1
- fsspec==2024.5.0
- h11==0.14.0
- httpcore==1.0.5
- httptools==0.6.1
- httpx==0.27.0
- huggingface-hub==0.24.2
- idna==3.7
- interegular==0.3.3
- jinja2==3.1.4
- jsonschema==4.23.0
- jsonschema-specifications==2023.12.1
- lark==1.1.9
- llvmlite==0.43.0
- lm-format-enforcer==0.10.3
- markdown-it-py==3.0.0
- markupsafe==2.1.5
- mdurl==0.1.2
- mpmath==1.3.0
- msgpack==1.0.8
- multidict==6.0.5
- multiprocess==0.70.16
- nest-asyncio==1.6.0
- networkx==3.3
- ninja==1.11.1.1
- numba==0.60.0
- numpy==1.26.4
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cuda-cupti-cu12==12.1.105
- nvidia-cuda-nvrtc-cu12==12.1.105
- nvidia-cuda-runtime-cu12==12.1.105
- nvidia-cudnn-cu12==8.9.2.26
- nvidia-cufft-cu12==11.0.2.54
- nvidia-curand-cu12==10.3.2.106
- nvidia-cusolver-cu12==11.4.5.107
- nvidia-cusparse-cu12==12.1.0.106
- nvidia-ml-py==12.555.43
- nvidia-nccl-cu12==2.20.5
- nvidia-nvjitlink-cu12==12.5.82
- nvidia-nvtx-cu12==12.1.105
- openai==1.37.1
- outlines==0.0.46
- packaging==24.1
- pandas==2.2.2
- pillow==10.4.0
- prometheus-client==0.20.0
- prometheus-fastapi-instrumentator==7.0.0
- protobuf==5.27.2
- psutil==6.0.0
- py-cpuinfo==9.0.0
- pyairports==2.1.1
- pyarrow==17.0.0
- pyarrow-hotfix==0.6
- pycountry==24.6.1
- pydantic==2.8.2
- pydantic-core==2.20.1
- pygments==2.18.0
- python-dateutil==2.9.0.post0
- python-dotenv==1.0.1
- python-multipart==0.0.9
- pytz==2024.1
- pyyaml==6.0.1
- pyzmq==26.0.3
- ray==2.33.0
- referencing==0.35.1
- regex==2024.7.24
- requests==2.32.3
- rich==13.7.1
- rpds-py==0.19.1
- safetensors==0.4.3
- sentencepiece==0.2.0
- shellingham==1.5.4
- six==1.16.0
- sniffio==1.3.1
- starlette==0.37.2
- sympy==1.13.1
- tiktoken==0.7.0
- tokenizers==0.19.1
- torch==2.3.1
- torchvision==0.18.1
- tqdm==4.66.4
- transformers==4.43.3
- triton==2.3.1
- typer==0.12.3
- typing-extensions==4.12.2
- tzdata==2024.1
- urllib3==2.2.2
- uvicorn==0.30.3
- uvloop==0.19.0
- vllm==0.5.3.post1
- vllm-flash-attn==2.5.9.post1
- watchfiles==0.22.0
- websockets==12.0
- xformers==0.0.27
- xxhash==3.4.1
- yarl==1.9.4
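One detail in the report above: the comment at the top of the test script names transformers 4.38.2, while the pip section of the export lists transformers==4.43.3. Whether this is related to the failure is not established here. A small guard such as the following sketch could at least surface the mismatch; the helper names are hypothetical and not part of LongLM.

```python
import re


def release_tuple(version):
    """Return the leading numeric components, e.g. '4.43.3' -> (4, 43, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def same_minor(installed, expected):
    """True when major and minor components agree (patch level is ignored)."""
    return release_tuple(installed)[:2] == release_tuple(expected)[:2]


EXPECTED = "4.38.2"   # from the comment in the test script
INSTALLED = "4.43.3"  # from the pip section of the environment export

if not same_minor(INSTALLED, EXPECTED):
    print(f"transformers {INSTALLED} differs from {EXPECTED}, which the script was written for")
```

In a live environment the installed version string could be read with `importlib.metadata.version("transformers")` instead of being hard-coded.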
I originally asked in Chinese because I guessed from the author list that you can read it. If you need the question in English, please contact me!
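As the title says, I would like to enable SelfExtend on a model downloaded from HF to support a long context window. Is there a related example script for enabling it at inference time and running a needle-in-a-haystack test (4k-256k)? I noticed that the paper changes how attention is computed; is this a plug-and-play method for any LLM?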
The format and included files are as follows:
├── config.json
├── configuration.json
├── generation_config.json
├── model-00001-of-00003.safetensors
├── model-00002-of-00003.safetensors
├── model-00003-of-00003.safetensors
├── model.safetensors.index.json
├── sft_args.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
If you could provide this, I would be very grateful!
I use 2 GPUs. During inference (with llama-2-7B and the 10k questions in the demo), each GPU requires about 50GB of additional memory, not counting the loaded model. But when I use 8 GPUs, inference still requires about 50GB of memory per card. Why is that?
Also, I cannot test the llama-2-70B model, because even with 8 GPUs used for inference simultaneously, the memory required on each GPU during inference is huge.
Are there any ways to optimize resource usage during inference? Have you tried testing your method on the 70B model with 10k tokens QA?
Can it be implemented on qwen1.5?
Hi, how did you evaluate on LongBench?
I tried to map your Llama to the extended version with
Line 40 in 6b84193
and I used the generation flow from LongBench: https://github.com/THUDM/LongBench/blob/main/pred.py
Even without the extended forward, one model takes 60GB of memory.
Hi,
Loved this paper and implementation. I implemented this for Phi2 with transformers==4.36.2, without caching. Within the context size, the outputs follow instructions even better than the original model. However, beyond the context window, I am seeing repetition. This might be due to extending the context window a bit too much. Do you have suggestions for experimenting with different group and neighbour sizes, or any other insights?
Here is my implementation:
https://github.com/agokrani/LongLM/tree/phi2
I haven't implemented KV caching yet because of the change in the KV cache format in transformers. I will try to implement it soon. Would love to hear your thoughts.
Is there an example of how the extension can be done on the Gemma models? Also, once the models are extended, how can they be exported for use with Ollama? Ollama uses the gguf format.
Hi! I have a question that may seem simple, but I think I'm overlooking something.
Assume Phi-2's context window is 2K. When we apply a group size (
However, in the paper, the caption of Table 5 states:
The vanilla Phi-2 has a 2k context window. SelfExtend extends Phi-2 to 4K ($G_s=4, w_n=512$), 6K ($G_s=8, w_n=512$).
Considering Phi-2's maximum token length is 2K, I am puzzled as to how the extended lengths of 4K and 6K are derived. What am I missing here?
Any insights would be greatly appreciated.
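A small worked calculation may help here. It assumes the upper bound given in the SelfExtend paper, (L - w_n) * G_s + w_n, where L is the pretraining window, G_s the group size and w_n the neighbor window. It also assumes L = 2048 for Phi-2. The function name is only illustrative.

```python
def max_extended_context(pretrain_window, group_size, neighbor_window):
    """Upper bound on the extended window: (L - w_n) * G_s + w_n."""
    return (pretrain_window - neighbor_window) * group_size + neighbor_window


# Assumed: Phi-2 pretraining window L = 2048, neighbor window w_n = 512.
for group_size in (4, 8):
    print(group_size, max_extended_context(2048, group_size, 512))
# 4 6656
# 8 12800
```

Under these assumptions the 4K and 6K settings in the caption would sit below the respective upper bounds (6656 and 12800 tokens), so they would describe the lengths evaluated rather than the maximum reachable length. This reading is an inference from the formula, not a statement from the authors.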
#Tokens of Prompt: 5144 Passkey target: 89427
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Llama2: [What is the pass key? The pass key is ใใใ. ใ]
SelfExtend: [What is the pass key? The pass key is ใใใ. ใ]
Thanks for your code. I encountered the following issue while trying to extend the context length of Qwen1.5-14B-Chat. Do you know how I can fix this Exception? Many THX!
Traceback (most recent call last):
File "/***/envs/qwen/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/***/envs/qwen/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/***/Qwen1_5/longlm/LongLM/pred.py", line 63, in get_pred
SelfExtend.apply(model, group_size, window_size, enable_flash_attention=use_flash, flash_attention_impl="flash_attn")
File "/***/Qwen1_5/longlm/LongLM/SelfExtend.py", line 179, in apply
raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of Qwen2ForCausalLM
Under the setting of:
Your model seems to work really well on long context. Do you have any plans to integrate this into vLLM in the future?
Is it possible to adapt this to cohere command-r models ?
Hello. I simply ran example.py and hit this error in the "=====SelfExtend using Torch======" part:
Traceback (most recent call last):
File "./LongLM/example.py", line 112, in <module>
SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
File "./LongLM/SelfExtend.py", line 123, in apply
raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of LlamaForCausalLM
transformers=4.41, flash_attn=2.5.8
Meanwhile, I noticed the similar problem in https://github.com/datamllab/LongLM/issues/31, so I also tried disabling flash attention when loading the model:
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16, use_flash_attention_2=False)
SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
I printed the model, which shows it's not the flash attention:
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
So where is the problem?
Hi,
Thank you for releasing this! Can I save Phi 2 weights with self-extend to finetune on?
Thank you!
I tried to run example.py on an A100 (80GB) GPU. It seems there is a bug at line [41]
Line 41 in ee92c84
The current implementation doesn't move the input_ids tensor onto the device, which causes an error. Replacing that line with the following fixed it: input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
Hello!
Thank you for your great work; it's amazing how much hard work you put into this algorithm. I have just one question: is it possible to integrate this with vLLM serving?
This would really speed up inference in limited-resource settings once you cross the 8192-token mark. Is there a way? Thank you in advance for your help!!
Could you please release the example script for Phi-2? Thanks.
Hey love your work and thanks for releasing the code. I saw on r/LocalLLaMA that you plan on adding flash-attention support. Do you have a timeline in mind for it?
We already have an implementation. In fact, the new results we released on X (formerly Twitter) with Google's Gemma are based on it (otherwise we could not handle sequences > 30k). However, with the current implementation we cannot reach the same results (on LongBench) as those reported in the paper (which are based on the version without flash attention). There is a minor performance gap between the two versions.
We are still trying to figure out the reason.
When I run predictions with a SelfExtend-patched model, I get the above error, TypeError: 'NoneType' object is not subscriptable. Without applying SelfExtend, I get no error and can run predictions.
Thanks for your contribution in accommodating Qwen in SelfExtend.
Qwen1.5 already has a 32k context length. I'm wondering if I can use SelfExtend to extend it to about 100K?
Have you tested the effect of SelfExtend on Qwen1.5?
Hello, congratulations on your acceptance to ICML! I think it is well deserved!
I have a question regarding the passkey retrieval task you posted.
Could you briefly explain how you created the text file for this task? (passkey_examples.jsonl)
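For reference, passkey-retrieval prompts are commonly built by hiding a random number inside repeated filler text and ending with a question. The sketch below follows that general recipe. It is not the authors' script, and the filler sentences, instruction wording and JSON field names (`input`, `target`) are assumptions. Only the closing question mirrors the prompt tail shown elsewhere on this page.

```python
import json
import random

# Hypothetical filler and wording; only QUESTION mirrors the prompt tail quoted on this page.
FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. "
INSTRUCTION = "There is important information hidden in a lot of irrelevant text. Find it and remember it. "
QUESTION = "What is the pass key? The pass key is"


def make_passkey_example(n_filler_repeats, rng):
    """Build one prompt with a 5-digit passkey inserted at a random depth."""
    passkey = rng.randint(10000, 99999)
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    insert_at = rng.randint(0, n_filler_repeats)
    before = FILLER * insert_at
    after = FILLER * (n_filler_repeats - insert_at)
    prompt = INSTRUCTION + before + needle + after + QUESTION
    return {"input": prompt, "target": str(passkey)}


rng = random.Random(0)  # fixed seed so the examples are reproducible
lines = [json.dumps(make_passkey_example(200, rng)) for _ in range(3)]
for line in lines:
    print(len(line), "characters")
```

Each element of `lines` is one JSON object and could be written to a `.jsonl` file one per line. Prompt length is controlled by `n_filler_repeats`.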
Thank you.
Great job! However, I have encountered an issue. When the length of the input increases, the GPU memory consumption grows rapidly, and it quickly leads to an out-of-memory (OOM) error. Could you please let me know if this is a bug?
Hi! I love your work and code implementation. I learned a lot.
I have a couple of questions regarding the code implementation.
LongLM/self_extend_patch/Llama.py
Lines 294 to 295 in 6e25a31
I understand that group_query_position is generated according to the formula shown in Figure 3 of the paper. However, I am curious why group_key_position is simply determined by dividing by group_size (without neighbor attention), unlike the query. Could you please clarify if I am missing something here?
Thank you in advance for your help.
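One possible reading, offered as a sketch rather than the authors' explanation: with rotary embeddings only the difference between the query position and the key position matters, so the constant shift `w_n - w_n // G_s` needs to be added on one side only. The toy code below applies the shift to the query and plain floor division to the key, then builds the resulting relative-position matrix. The function name and the toy values `group_size=2` and `neighbor_window=4` are chosen here for illustration.

```python
def self_extend_relative_positions(seq_len, group_size, neighbor_window):
    """Relative position used for each (query, key) pair under causal attention."""
    shift = neighbor_window - neighbor_window // group_size
    rows = []
    for q in range(seq_len):
        group_q = q // group_size + shift  # shift applied on the query side only
        row = []
        for k in range(seq_len):
            if k > q:
                row.append(0)              # future keys are masked out
            elif q - k < neighbor_window:
                row.append(q - k)          # neighbor attention: exact distance
            else:
                group_k = k // group_size  # key side: plain floor division
                row.append(group_q - group_k)
        rows.append(row)
    return rows


for row in self_extend_relative_positions(10, group_size=2, neighbor_window=4):
    print(row)
# last row printed: [6, 6, 5, 5, 4, 4, 3, 2, 1, 0]
```

Because `group_q - group_k` already contains the shift, adding it to the key as well would cancel it out.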
ย | Self-Extend | ReRoPE |
---|---|---|
ย | tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [2, 1, 0, 0, 0, 0, 0, 0, 0, 0], [3, 2, 1, 0, 0, 0, 0, 0, 0, 0], [4, 3, 2, 1, 0, 0, 0, 0, 0, 0], [4, 4, 3, 2, 1, 0, 0, 0, 0, 0], [5, 5, 4, 3, 2, 1, 0, 0, 0, 0], [5, 5, 4, 4, 3, 2, 1, 0, 0, 0], [6, 6, 5, 5, 4, 3, 2, 1, 0, 0], [6, 6, 5, 5, 4, 4, 3, 2, 1, 0]]) | tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [2, 1, 0, 0, 0, 0, 0, 0, 0, 0], [3, 2, 1, 0, 0, 0, 0, 0, 0, 0], [4, 3, 2, 1, 0, 0, 0, 0, 0, 0], [5, 4, 3, 2, 1, 0, 0, 0, 0, 0], [6, 5, 4, 3, 2, 1, 0, 0, 0, 0], [6, 6, 5, 4, 3, 2, 1, 0, 0, 0], [6, 6, 6, 5, 4, 3, 2, 1, 0, 0], [6, 6, 6, 6, 5, 4, 3, 2, 1, 0]]) |
base | 256 : 1.0464 512 : 2.2302 1024 : 2.4575 | 256 : 1.0464 512 : 0.8970 1024 : 2.1020 |
+logn | 256 : 1.0464 512 : 2.2816 1024 : 2.8442 | 256 : 1.0464 512 : 0.8832 1024 : 1.8993 |
+keynorm | 256 : 2.1632 512 : 2.5168 1024 : 3.2525 | 256 : 2.1632 512 : 3.1556 1024 : 4.6036 |
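The ReRoPE tensor in the table is consistent with simply clamping the causal distance at a fixed window. The following sketch reproduces it with a window of 6. That value is inferred from the printed tensor and is not stated in the source; the function name is illustrative.

```python
def clamped_relative_positions(seq_len, window):
    """Causal relative distance q - k, clamped at `window`; masked entries are 0."""
    return [
        [min(q - k, window) if k <= q else 0 for k in range(seq_len)]
        for q in range(seq_len)
    ]


for row in clamped_relative_positions(10, window=6):
    print(row)
# last row printed: [6, 6, 6, 6, 5, 4, 3, 2, 1, 0]
```

The contrast with the Self-Extend column is that distant keys here all collapse onto one position, whereas grouped positions keep a coarse ordering among them.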
I'm applying this to the Ghost 8B Beta (128k) chat version online here and it seems to work.
In general, I have not yet fine-tuned and tested the parameters against the original model (even the current version is online) but I have actually noticed that the context is long but still ensures very good quality, for example here.
This is a quick issue to share this joy with your research team. Thank you very much~
The paper mentions a SEext-SOLAR-10.5B model?