longlm's Issues

llama_self_extend_patch_4_36 does not work

With transformers 4.36.2 it doesn't work, but with 4.32.0 it works.
The only change I made was replacing "import llama_self_extend_patch as LlamaSE" with "import llama_self_extend_patch_4_36 as LlamaSE" in llama_example.py.

OOM when length is 16k

Has anyone successfully run cases with a length of 16k? What kind of machine resources are needed to support the experiment?

How to reproduce the results on LongBench

Hi, very impressive work! Will the code to reproduce the results on LongBench be released anytime soon? Specifically, without flash attention, how can one run the experiments with 16k input length and a 7B/10B model? The memory consumption makes it seem impossible on GPUs with 40GB of memory, and difficult even on GPUs with 80GB.

Support for Phi2 / Mixformer

Great work!!

It would be super to have deeper support for Phi2 / Mixformer,
e.g. https://huggingface.co/amgadhasan/phi-2

Edit:
Tried the existing Phi patches with some modifications, but it seems like some core assumptions are quite different, e.g.:
AttributeError: 'MixFormerSequentialForCausalLM' object has no attribute 'q_proj'

About GPU memory usage

Dear author,

I'm trying to run LongLM on a single A10 with 24GB of memory. I tried 'meta-llama/Llama-2-7b-chat-hf' and failed with a CUDA out-of-memory error (screenshot attached).

I realize that your example.py runs on 4 RTX 3090s with 24GB of memory each, so I wonder whether an A10 is worth a shot or not even close.

I also want to ask whether compressed models, for example Unsloth's models, can be used with LongLM.

Llama-3 is not working.

I followed your instructions below to apply SelfExtend to Llama-3:
"""
[04/19/2024]: 💡 We added the support for LLama-3 with transformers==4.40. To use it with transformers==4.40, you may change the file name of Llama_4_40.py to Llama.py to replace the existing patch file.
"""

I got this error.
"""

Exception Traceback (most recent call last)
Cell In[12], line 4
2 group_size = 5
3 window_size = 1024
----> 4 SelfExtend.apply(model, group_size, window_size, enable_flash_attention=True)#, flash_attention_impl='flash_attn')
5 model.eval()

File /home/ubuntu/reports/SelfExtend.py:109, in apply(loaded_model, group_size, window_size, enable_flash_attention, scale_base, flash_attention_impl)
107 print("Using triton flash self_extend!!")
108 if (not modifed):
--> 109 raise Exception(f"Failed to modify the attention method of {arch_name}")
110 else:
111 raise Exception(f"Need to set the flash_attention_impl to 'flash_attn' or 'triton'.")

Exception: Failed to modify the attention method of LlamaForCausalLM
"""

How can I fix it?

LongLM isn't compatible with gemma-2-27b-it or gemma-2b-it

I found that the current version of LongLM cannot load Gemma 1 or Gemma 2 models successfully. I wrote a minimal test to help reproduce the issue:

# transformers version 4.38.2
# this example is tested with 4 RTX3090s, 24GB memory each
import warnings
warnings.filterwarnings("ignore")

import torch 
import json
import time
from transformers.models.llama.modeling_llama import LlamaAttention
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig

import SelfExtend 

window_size = 1024
group_size = 32

model_name = '/tmp/gemma-2b-it/'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()
SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
prompt = "How are you?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

start_time = time.time()
tokens = model.generate(input_ids, max_new_tokens=4096)
answer = tokenizer.decode(tokens[0].tolist()[input_ids.shape[1]:], skip_special_tokens=True)
print( answer )

While trying to load the model, it fails with the error message below:

$ python3 test.py 
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.07it/s]
Traceback (most recent call last):
  File "/var/lib/condor/execute/slot1/dir_2652801/test.py", line 22, in <module>
    SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
  File "/var/lib/condor/execute/slot1/dir_2652801/SelfExtend.py", line 160, in apply
    raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of GemmaForCausalLM

I found that it fails in the duplicate check at line 24 of SelfExtend.py; when it fails, instance = False.

Below is a conda env export dump including package details in my Python environment:

channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.7.2=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_1
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.14=h5eee18b_0
  - pip=24.0=py310h06a4308_0
  - python=3.10.14=h955ad1f_1
  - readline=8.2=h5eee18b_0
  - setuptools=69.5.1=py310h06a4308_0
  - sqlite=3.45.3=h5eee18b_0
  - tk=8.6.14=h39e8969_0
  - wheel=0.43.0=py310h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zlib=1.2.13=h5eee18b_1
  - pip:
      - accelerate==0.33.0
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - annotated-types==0.7.0
      - anyio==4.4.0
      - async-timeout==4.0.3
      - attrs==23.2.0
      - certifi==2024.7.4
      - charset-normalizer==3.3.2
      - click==8.1.7
      - cloudpickle==3.0.0
      - cmake==3.30.1
      - datasets==2.20.0
      - dill==0.3.8
      - diskcache==5.6.3
      - distro==1.9.0
      - dnspython==2.6.1
      - einops==0.8.0
      - email-validator==2.2.0
      - exceptiongroup==1.2.2
      - fastapi==0.111.1
      - fastapi-cli==0.0.4
      - filelock==3.15.4
      - flash-attn==2.6.3
      - frozenlist==1.4.1
      - fsspec==2024.5.0
      - h11==0.14.0
      - httpcore==1.0.5
      - httptools==0.6.1
      - httpx==0.27.0
      - huggingface-hub==0.24.2
      - idna==3.7
      - interegular==0.3.3
      - jinja2==3.1.4
      - jsonschema==4.23.0
      - jsonschema-specifications==2023.12.1
      - lark==1.1.9
      - llvmlite==0.43.0
      - lm-format-enforcer==0.10.3
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - mdurl==0.1.2
      - mpmath==1.3.0
      - msgpack==1.0.8
      - multidict==6.0.5
      - multiprocess==0.70.16
      - nest-asyncio==1.6.0
      - networkx==3.3
      - ninja==1.11.1.1
      - numba==0.60.0
      - numpy==1.26.4
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==8.9.2.26
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-ml-py==12.555.43
      - nvidia-nccl-cu12==2.20.5
      - nvidia-nvjitlink-cu12==12.5.82
      - nvidia-nvtx-cu12==12.1.105
      - openai==1.37.1
      - outlines==0.0.46
      - packaging==24.1
      - pandas==2.2.2
      - pillow==10.4.0
      - prometheus-client==0.20.0
      - prometheus-fastapi-instrumentator==7.0.0
      - protobuf==5.27.2
      - psutil==6.0.0
      - py-cpuinfo==9.0.0
      - pyairports==2.1.1
      - pyarrow==17.0.0
      - pyarrow-hotfix==0.6
      - pycountry==24.6.1
      - pydantic==2.8.2
      - pydantic-core==2.20.1
      - pygments==2.18.0
      - python-dateutil==2.9.0.post0
      - python-dotenv==1.0.1
      - python-multipart==0.0.9
      - pytz==2024.1
      - pyyaml==6.0.1
      - pyzmq==26.0.3
      - ray==2.33.0
      - referencing==0.35.1
      - regex==2024.7.24
      - requests==2.32.3
      - rich==13.7.1
      - rpds-py==0.19.1
      - safetensors==0.4.3
      - sentencepiece==0.2.0
      - shellingham==1.5.4
      - six==1.16.0
      - sniffio==1.3.1
      - starlette==0.37.2
      - sympy==1.13.1
      - tiktoken==0.7.0
      - tokenizers==0.19.1
      - torch==2.3.1
      - torchvision==0.18.1
      - tqdm==4.66.4
      - transformers==4.43.3
      - triton==2.3.1
      - typer==0.12.3
      - typing-extensions==4.12.2
      - tzdata==2024.1
      - urllib3==2.2.2
      - uvicorn==0.30.3
      - uvloop==0.19.0
      - vllm==0.5.3.post1
      - vllm-flash-attn==2.5.9.post1
      - watchfiles==0.22.0
      - websockets==12.0
      - xformers==0.0.27
      - xxhash==3.4.1
      - yarl==1.9.4

ๆ˜ฏๅฆๆœ‰็คบไพ‹ไปฃ็ ๆ”ฏๆŒๅฏนsafetensorsๆ ผๅผLLMๅฏ็”จSelfExtend

I asked in Chinese because, judging from the author list, I guessed you can read Chinese. If you need the question in English, please let me know!

As the title says: I would like to enable SelfExtend on a model downloaded from HF to support a long context window. Is there a related example script for turning it on at inference time and running needle-in-a-haystack tests (4k-256k)? I noticed that the paper modifies the attention computation; is this a plug-and-play method for any LLM?

The format I am referring to contains files such as:
├── config.json
├── configuration.json
├── generation_config.json
├── model-00001-of-00003.safetensors
├── model-00002-of-00003.safetensors
├── model-00003-of-00003.safetensors
├── model.safetensors.index.json
├── sft_args.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model

If you can provide this, I would be very grateful!
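
In the meantime, here is a minimal sketch of what I imagine such a script would look like, adapted from the Gemma repro posted in another issue above. The local path and the group_size / window_size values are placeholders of mine, not recommendations from the authors:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import SelfExtend

# hypothetical local directory containing the safetensors shards listed above
model_dir = "/path/to/model"

model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model.eval()

# illustrative values; tune group_size / window_size for the target context length
SelfExtend.apply(model, group_size=16, window_size=1024, enable_flash_attention=False)

input_ids = tokenizer("How are you?", return_tensors="pt").input_ids.to(model.device)
tokens = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(tokens[0][input_ids.shape[1]:], skip_special_tokens=True))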

Requires excessive computing resources during inference

I use 2 GPUs. During inference (with Llama-2-7B and the 10k-token questions in the demo), each GPU requires about 50GB of additional memory, excluding the loaded model. But when I use 8 GPUs, the inference still requires about 50GB of memory per card. Why is that?

Also, I cannot test the Llama-2-70B model, because even with 8 GPUs running inference simultaneously, the memory required on each GPU is still huge.

Are there any ways to optimize resource usage during inference? Have you tried your method on the 70B model with 10k-token QA?

Phi2 implementation and Suggestions

Hi,

Loved this paper and implementation. I implemented it for Phi-2 with transformers==4.36.2, without caching. Within the original context size, the outputs follow instructions even better than the base model. However, when going beyond the context window, I see repetition. This might be due to extending the context window a bit too much. Do you have any suggestions or insights for experimenting with different group and neighbor sizes?

Here is my implementation:
https://github.com/agokrani/LongLM/tree/phi2

I haven't implemented KV caching yet because of the change in the KV cache format in transformers. I will try to implement it soon. Would love to hear your thoughts.

Example for gemma & use with Ollama

Is there any example of how the extension can be done on the Gemma models? Also, once the models are extended, how can they be exported for use with Ollama? Ollama uses the GGUF format.

Question about equation 4 and Table 5 caption in paper

Hi! I have a question that may seem simple, but I think I'm overlooking something.

Assume Phi-2's context window is 2K. When we apply a group size ($G_s$) of 4 and a neighbor token count ($w_n$) of 512, then according to Equation 4, $(L - w_n) \times G_s + w_n$, the calculated maximum extended length is approximately $(2K - 0.5K) \times 4 + 0.5K = 6.5K$.

However, in the paper, the caption of Table 5 states:

The vanilla Phi-2 has a 2k context window. SelfExtend extends Phi-2 to 4K ($G_s=4, w_n=512$), 6K ($G_s=8, w_n=512$).

Considering Phi-2's maximum token length is 2K, I am puzzled as to how the extended lengths of 4K and 6K are derived. What am I missing here?
Any insights would be greatly appreciated.
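
For reference, here is the arithmetic behind my reading of Equation 4, so it is easy to point out where I go wrong (the helper function name is just mine):

def max_extended_length(L, group_size, neighbor_window):
    # Equation 4: (L - w_n) * G_s + w_n, with L the pre-training context window
    return (L - neighbor_window) * group_size + neighbor_window

# Phi-2's 2K (2048-token) window with the two settings from the Table 5 caption
print(max_extended_length(2048, group_size=4, neighbor_window=512))   # 6656  (~6.5K)
print(max_extended_length(2048, group_size=8, neighbor_window=512))   # 12800 (~12.5K)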

No error is raised, but the experiment results are not displayed

#Tokens of Prompt: 5144 Passkey target: 89427
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Llama2: [What is the pass key? The pass key is 。。。. 。]
SelfExtend: [What is the pass key? The pass key is 。。。. 。]

Something wrong in modify_method_of_instance function

Thanks for your code. I encountered the following issue while trying to extend the context length of Qwen1.5-14B-Chat. Do you know how I can fix this exception? Many thanks!

Traceback (most recent call last):
  File "/***/envs/qwen/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/***/envs/qwen/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/***/Qwen1_5/longlm/LongLM/pred.py", line 63, in get_pred
    SelfExtend.apply(model, group_size, window_size, enable_flash_attention=use_flash, flash_attention_impl="flash_attn")
  File "/***/Qwen1_5/longlm/LongLM/SelfExtend.py", line 179, in apply
    raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of Qwen2ForCausalLM

Under the setting of:

  • transformers 4.38.2
  • flash-attn 2.5.5

vllm integration

Your model seems to work really well on long contexts. Do you have any plans to integrate this into vLLM in the future?

Run example.py Error: Failed to modify the attention method of LlamaForCausalLM

Hello. I simply ran example.py and hit this error in the "=====SelfExtend using Torch======" part:

Traceback (most recent call last):
  File "./LongLM/example.py", line 112, in <module>
    SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)
  File "./LongLM/SelfExtend.py", line 123, in apply
    raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of LlamaForCausalLM

transformers=4.41, flash_attn=2.5.8

Meanwhile, I noticed the similar problem https://github.com/datamllab/LongLM/issues/31, so I also tried disabling flash attention at the same time:
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16, use_flash_attention_2=False)

SelfExtend.apply(model, group_size, window_size, enable_flash_attention=False)

I printed the model, which shows it is not using flash attention:

  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)

So where is the problem?

Export weights

Hi,
Thank you for releasing this! Can I save the Phi-2 weights with SelfExtend applied and fine-tune on them?
Thank you!

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when resuming training

I tried to run example.py on an A100 (80GB) GPU. It seems there is a bug at line 41:

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

The current implementation doesn't move the input_ids tensor onto the GPU, which causes this error. I fixed it by replacing the line above with: input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

Support with vLLM

Hello!
Thank you for your great work; it's amazing how much effort you put into this algorithm. I just had one question: is it possible to integrate this with vLLM serving?

This would really boost inference time in limited-resource settings once you cross the 8192-token mark. Is there a way? Thank you in advance for your help!

Flash Attention Support?

Hey, love your work and thanks for releasing the code. I saw on r/LocalLLaMA that you plan on adding flash-attention support. Do you have a timeline in mind for it?

Flash Attention implementation is coming

We already have the implementation. In fact, the new results we released on X (previously Twitter) with Google's Gemma are based on it (otherwise we could not handle sequences longer than 30k). However, with the current implementation we cannot reach the same results (on LongBench) as those reported in the paper (which used the non-flash-attention version). There is a minor performance gap between the two versions.

We are still trying to figure out the reason.

TypeError: 'NoneType' object is not subscriptable

When trying to run predictions with a self-extended model, I get the error above: TypeError: 'NoneType' object is not subscriptable. Without applying SelfExtend, I don't get any error and am able to run predictions.

What effect will self-extend have on Qwen1.5?

Thanks for your contribution on supporting Qwen with self-extend.
Qwen1.5 already has a 32k context length. I'm wondering whether I can use self-extend to push it to about 100K.
Have you tested the effect of self-extend on Qwen1.5?
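
As a rough feasibility check, here is my own arithmetic with Equation 4 from the paper, assuming a neighbor window of 1024; these numbers are an assumption, not a tested configuration:

group_size, neighbor_window = 4, 1024
base_window = 32 * 1024                     # Qwen1.5's advertised context length
# Equation 4: (L - w_n) * G_s + w_n
max_len = (base_window - neighbor_window) * group_size + neighbor_window
print(max_len)                              # 128000 tokens, i.e. roughly 125K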

Passkey retrieval (needle in a haystack)

Hello, congratulations on your acceptance to ICML! I think it is well deserved!
I have a question regarding the passkey retrieval task you posted.

Could you briefly explain how you created the text file for this task? (passkey_examples.jsonl)
Thank you.
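
For context, my current guess is that the file is built from filler sentences with a random five-digit passkey embedded, roughly like the sketch below; the field names and filler text are my assumptions, not necessarily what passkey_examples.jsonl actually uses:

import json
import random

def make_passkey_example(n_filler, passkey):
    # Filler sentences with the passkey hidden in the middle, followed by the question.
    filler = "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. "
    prompt = (
        "There is important information hidden inside a lot of irrelevant text. "
        "Find it and memorize it. I will quiz you about it afterwards.\n"
        + filler * (n_filler // 2)
        + f"The pass key is {passkey}. Remember it. {passkey} is the pass key.\n"
        + filler * (n_filler - n_filler // 2)
        + "What is the pass key? The pass key is"
    )
    return {"prompt": prompt, "target": str(passkey)}

with open("passkey_examples.jsonl", "w") as f:
    for n_filler in (100, 300, 600):
        f.write(json.dumps(make_passkey_example(n_filler, random.randint(10000, 99999))) + "\n")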

Long input series makes oom

Great job! However, I have encountered an issue: when the length of the input increases, the GPU memory consumption grows rapidly and quickly leads to an out-of-memory (OOM) error. Could you please let me know whether this is a bug?

Questions regarding group query/key positional index

Hi! I love your work and code implementation; I've learned a lot.
I have a couple of questions regarding the code implementation.

group_query_position = query_position // group_size_1 + _re_group_size_2 - _re_group_size_2 / group_size_1
group_key_position = key_position // group_size_1

I understand that group_query_position is generated according to the formula shown in Figure 3 of the paper. However, I am curious why group_key_position is simply determined by dividing by the group size (without the neighbor-attention handling), unlike the query. Could you please clarify if I am missing something here?

Thank you in advance for your help.
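
To make the question concrete, here is a small sketch of what those two lines produce, assuming group_size_1 = 4 and that _re_group_size_2 plays the role of the neighbor window (both values are illustrative, and I use integer division throughout):

import torch

group_size_1 = 4          # G_s
_re_group_size_2 = 8      # neighbor window w_n (assumed)

positions = torch.arange(16)
group_query_position = positions // group_size_1 + _re_group_size_2 - _re_group_size_2 // group_size_1
group_key_position = positions // group_size_1

print(group_key_position)    # tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
print(group_query_position)  # tensor([6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9])

# The constant offset on the query side makes the grouped relative position
# (group_query_position - group_key_position) equal the exact relative position
# right at the window boundary, which seems to be why only the query carries it.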

Differences with ReRoPE

  • Self-Extend (window_size = training_size/2) does not work at all
  • ReRoPE is Self-Extend (group PE) with window_size = training_size - 1

Self-Extend:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [3, 2, 1, 0, 0, 0, 0, 0, 0, 0],
        [4, 3, 2, 1, 0, 0, 0, 0, 0, 0],
        [4, 4, 3, 2, 1, 0, 0, 0, 0, 0],
        [5, 5, 4, 3, 2, 1, 0, 0, 0, 0],
        [5, 5, 4, 4, 3, 2, 1, 0, 0, 0],
        [6, 6, 5, 5, 4, 3, 2, 1, 0, 0],
        [6, 6, 5, 5, 4, 4, 3, 2, 1, 0]])

ReRoPE:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [3, 2, 1, 0, 0, 0, 0, 0, 0, 0],
        [4, 3, 2, 1, 0, 0, 0, 0, 0, 0],
        [5, 4, 3, 2, 1, 0, 0, 0, 0, 0],
        [6, 5, 4, 3, 2, 1, 0, 0, 0, 0],
        [6, 6, 5, 4, 3, 2, 1, 0, 0, 0],
        [6, 6, 6, 5, 4, 3, 2, 1, 0, 0],
        [6, 6, 6, 6, 5, 4, 3, 2, 1, 0]])

Results (256 / 512 / 1024):

            Self-Extend                               ReRoPE
base        256: 1.0464  512: 2.2302  1024: 2.4575    256: 1.0464  512: 0.8970  1024: 2.1020
+logn       256: 1.0464  512: 2.2816  1024: 2.8442    256: 1.0464  512: 0.8832  1024: 1.8993
+keynorm    256: 2.1632  512: 2.5168  1024: 3.2525    256: 2.1632  512: 3.1556  1024: 4.6036
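
For anyone comparing the two, below is a small sketch that reproduces relative-position matrices like those above. From the printed tensors, the Self-Extend column appears to use group_size=2 and window_size=4, and the ReRoPE column caps every distance at window_size=6, which the same formula gives when the group size goes to infinity; those parameter values are my inference, not stated by the poster:

import torch

def grouped_relative_positions(seq_len, group_size, window_size):
    # Exact relative positions inside the neighbor window, grouped (floor-divided)
    # positions outside it; lower triangle only (causal attention).
    q = torch.arange(seq_len).unsqueeze(1)  # query index i
    k = torch.arange(seq_len).unsqueeze(0)  # key index j
    exact = q - k
    grouped = (q // group_size + window_size - window_size // group_size) - k // group_size
    return torch.tril(torch.where(exact < window_size, exact, grouped))

print(grouped_relative_positions(10, group_size=2, window_size=4))       # Self-Extend-style matrix
print(grouped_relative_positions(10, group_size=10_000, window_size=6))  # ReRoPE-style matrix (capped at 6)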

LongLM really has great potential.

I'm applying this to the Ghost 8B Beta (128k) chat version online here, and it seems to work.
In general, I haven't yet fine-tuned or benchmarked the parameters against the original model (even though the current version is already online), but I have noticed that even with a long context the quality remains very good, for example here.

This is a quick issue to share this joy with your research team. Thank you very much~
