
lm-infinite's Issues

The passkey code fails to run

llama.py#L144 ends up calling into transformers' models/llama/modeling_llama.py, where the check if seq_len > self.max_seq_len_cached raises:

RuntimeError: Boolean value of Tensor with more than one value is ambiguous

The arguments are being passed incorrectly here. The transformers version is 4.32.1.
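This error is reproducible outside the repo whenever a multi-element tensor is used where Python expects a single boolean. A minimal sketch (assuming seq_len arrives as a tensor instead of the int the cache check expects):

import torch

seq_len = torch.arange(8)           # a tensor passed by mistake
max_seq_len_cached = 4
if seq_len > max_seq_len_cached:    # RuntimeError: Boolean value of Tensor
    pass                            # with more than one value is ambiguous

The likely fix is passing a plain int (e.g. key_states.shape[-2]) down to the rotary embedding, though where the tensor originates in this call chain is an assumption.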

GPTNeoX or Transformers support?

I'm trying to integrate LM-Infinite into GPTNeoX (pythia-dedup). I managed to get lambda_attn working, but GPTNeoX's rotary implementation is a bit different, and its attention forms QKV through a single fused 3 * hidden_size projection, whereas the other model has three independent hidden_size-wide Q/K/V layers (see the sketch below). Training works, but during inference or evaluation (single batch) I get stuck on a shape mismatch.
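For reference, a minimal sketch of that layout difference (based on a reading of the Hugging Face implementations; all dimensions are illustrative):

import torch

batch, seq, hidden, n_heads = 2, 16, 64, 4
head_size = hidden // n_heads
hidden_states = torch.randn(batch, seq, hidden)

# GPTNeoX-style: one fused 3 * hidden_size projection, split per head
query_key_value = torch.nn.Linear(hidden, 3 * hidden)
qkv = query_key_value(hidden_states).view(batch, seq, n_heads, 3 * head_size)
q = qkv[..., :head_size].permute(0, 2, 1, 3)                # [b, h, s, d]
k = qkv[..., head_size:2 * head_size].permute(0, 2, 1, 3)
v = qkv[..., 2 * head_size:].permute(0, 2, 1, 3)

# Llama-style: three independent hidden_size-wide projections
q_proj = torch.nn.Linear(hidden, hidden)
q_sep = q_proj(hidden_states).view(batch, seq, n_heads, head_size).transpose(1, 2)

assert q.shape == q_sep.shape == (batch, n_heads, seq, head_size)

Inference-time shape mismatches often come from the KV-cache path, where the fused layout gets sliced differently from the separated one, but that is a guess without the traceback.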

I did manage to see the training benefit of lambda_attn, with a higher it/s. The GPU metrics are smoother and stay at high throughput. The CPU also shows higher compute demand than with traditional training, but it doesn't appear to contend with the training itself. As a test, I managed to train a larger context on the same hardware at higher performance, so this clearly works.

I was wondering whether a folder or a separate repo with these modeling_$model.py files, ready to drop into transformers, would help simplify setup and adoption?

Improved GPU memory usage but slower inference speed?

Hi, thanks for the nice work! I tried the following code to enable LM-Infinite for Llama, following the README:

import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-chat-hf',
    torch_dtype=torch.bfloat16, device_map="cuda", low_cpu_mem_usage=True,
)

from models.llama import convert_llama_model
model = convert_llama_model(model, 4096, 10)

and then do inference as usual. GPU memory usage is lower than with regular attention, but inference becomes much slower (roughly 10x). I'm using an A100 GPU and GPU-Util is very low, around 10%. I wonder if you have any idea why this happens? Many thanks.
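One way to pin down the slowdown is to time decoding directly. A sketch (assumes a tokenizer loaded alongside the model; the prompt and token counts are arbitrary):

import time
import torch

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
new_tokens = out.shape[-1] - inputs.input_ids.shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")

Low GPU utilization alongside lower memory use is consistent with an attention path that launches many small kernels rather than one fused kernel, though that is a guess about the converted model's internals.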

Some errors.

Hi,

When I run the code, I encounter two errors:

1. An error occurs when running the evaluation on the passkey retrieval task:
Traceback (most recent call last):
File "scripts/eval_downstream_tasks.py", line 121, in
main(args)
File "scripts/eval_downstream_tasks.py", line 71, in main
output, output_ids = model.generate(
TypeError: generate() missing 1 required positional argument: 'do_sample'

2. An error occurs when running generation:
Traceback (most recent call last):
File "scripts/eval_generation.py", line 107, in
main(args)
File "scripts/eval_generation.py", line 94, in main
scores = generation_overall_metric(
File "LM-Infinite/data/generation_metrics.py", line 6, in generation_overall_metric
rouge = evaluate.load("rouge")
File "python3.8/dist-packages/evaluate/loading.py", line 731, in load
evaluation_module = evaluation_module_factory(
File "python3.8/dist-packages/evaluate/loading.py", line 681, in evaluation_module_factory
raise FileNotFoundError(
FileNotFoundError: Couldn't find a module script at LM-Infinite/rouge/rouge.py. Module 'rouge' doesn't exist on the Hugging Face Hub either.

Looking forward to your reply!
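Two notes on these. For error 1, a possible fix, assuming from the message alone that the script's generate() wrapper declares do_sample as a required argument:

output, output_ids = model.generate(
    input_ids,          # stand-in for whatever the script already passes
    do_sample=False,    # the suggested change: pass it explicitly
)

For error 2, evaluate.load("rouge") resolves the metric script via the Hugging Face Hub and also needs the rouge_score package, so a missing dependency or a lack of network access can surface as this FileNotFoundError. The happy path (assuming pip install evaluate rouge_score and network access):

import evaluate

rouge = evaluate.load("rouge")  # fetches the metric script from the Hub
print(rouge.compute(predictions=["hello there"], references=["hello there"]))

If the machine is offline, downloading the metric on a connected machine and pointing evaluate.load at a local copy of the script is a common workaround.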

limited_distance_forward() got an unexpected keyword argument 'padding_mask'

I'm trying to run the eval script.

PYTHONPATH=. deepspeed --include localhost:$CUDA_VISIBLE_DEVICES --master_port $MASTER_PORT \
    scripts/eval_downstream_tasks.py \
    --deepspeed_config configs/zero3_efficient_config.json \
    --model meta-llama/Llama-2-7b-hf --tokenizer_path meta-llama/Llama-2-7b-hf \
    --use_lambda_attention --local_branch 4096 --global_branch 100 --limit_distance 4096 \
    --dataset passkey_retrieval --dataset_dir ${PASSKEY_DATA} --dataset_group ${MAX_LENGTH} \
    --max_generation_length 10 --evaluate_metrics \
    --log_dir $LOG_DIR/$TRIAL
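A likely cause, as a hedged guess: transformers releases around 4.34 began passing an extra padding_mask keyword into each attention layer's forward, which a monkey-patched forward that doesn't declare it will reject. Options are pinning transformers to the version the repo targets, or shimming the patched forward to absorb the kwarg. A sketch of the shim (assumes the model has already been converted):

import functools

def drop_padding_mask(forward_fn):
    @functools.wraps(forward_fn)
    def wrapped(*args, padding_mask=None, **kwargs):  # absorb the new kwarg
        return forward_fn(*args, **kwargs)
    return wrapped

for layer in model.model.layers:
    layer.self_attn.forward = drop_padding_mask(layer.self_attn.forward)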

How to run inference?

The documentation does not make it clear how to perform inference using the lambda attention.
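A minimal end-to-end sketch, assembled from the README snippet quoted in the memory-usage issue above (the model name and branch sizes come from that snippet; the prompt is arbitrary):

import torch
from transformers import AutoTokenizer, LlamaForCausalLM
from models.llama import convert_llama_model

name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = LlamaForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="cuda", low_cpu_mem_usage=True,
)
model = convert_llama_model(model, 4096, 10)  # arguments as in the README snippet

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

After conversion, generation goes through the standard transformers API; convert_llama_model only swaps each layer's attention forward.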

Should the llama model be fine-tuned?

Hello! I am new to LLMs and I want to reproduce your nice work with the LLaMA model (not Llama 2).
Should I fine-tune the LLaMA model on ArXiv or OpenWebText2 before evaluating it?
As I understand it, these two datasets are both part of LLaMA's pre-training data, so maybe the raw LLaMA weights just work?
Thank you so much for your reply!

kv_seq_len bug?

if kv_seq_len > local_branch + global_branch and use_lambda_mask:
    past_key_value = (
        torch.cat([
            key_states[..., :global_branch, :],
            key_states[..., -local_branch:, :],
        ], dim=-2),
        torch.cat([
            value_states[..., :global_branch, :],
            value_states[..., -local_branch:, :],
        ], dim=-2),
        key_position_ids[..., :local_branch + global_branch],
    ) if use_cache else None

The code in models/llama.py lines 144-155 truncates past_key_value to the global and local branches but does not update kv_seq_len to match. Is that intentional?
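If the concern is that later code in the same forward still sees the pre-truncation length, the update the question implies would be a single line after the truncation (a sketch; names are taken from the snippet above):

kv_seq_len = past_key_value[0].shape[-2]  # == global_branch + local_branch

Whether anything downstream actually consumes kv_seq_len after this point is something only the full models/llama.py can answer.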

TypeError: attn_forward_factory() missing 5 required positional arguments: 'top_k_attention', 'top_k_insert_at', 'top_k_from_layer', 'top_k_to_layer', and 'layer_i'

After loading the llama-7b-chat model and running model = convert_llama_model(model, 4096, 10), this error occurred:

for layer_i, hidden_layer in enumerate(model.model.layers):
    attn = hidden_layer.self_attn
    attn.forward = attn_forward_factory(
        attn, True, local_branch, global_branch, local_branch, 0
    )
return model
TypeError: attn_forward_factory() missing 5 required positional arguments: 'top_k_attention', 'top_k_insert_at', 'top_k_from_layer', 'top_k_to_layer', and 'layer_i'
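A hedged guess at a patch, using only the argument names from the error message (the values meant to disable top-k attention are assumptions; check attn_forward_factory's definition in models/llama.py for the real defaults):

for layer_i, hidden_layer in enumerate(model.model.layers):
    attn = hidden_layer.self_attn
    attn.forward = attn_forward_factory(
        attn, True, local_branch, global_branch, local_branch, 0,
        top_k_attention=None, top_k_insert_at=None,
        top_k_from_layer=None, top_k_to_layer=None,
        layer_i=layer_i,
    )

The mismatch suggests convert_llama_model's internal call is out of sync with the factory's current signature, so pulling the repo's latest version may also resolve it.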

Implementation with RoPE

Hi, thanks for sharing this nice work!
I am a little confused about why all k vectors are kept unrotated while all q vectors are rotated on the global branch. Any explanation would be appreciated!
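For context, the RoPE property this likely relies on, as a sketch rather than the authors' answer: the attention score between a query rotated to position m and a key rotated to position n depends only on m - n, so leaving keys unrotated (n = 0) and rotating every query by a fixed distance d makes each global-branch score behave as if the key were exactly d tokens away. A minimal single-head demonstration (a toy RoPE, not the repo's code):

import torch

def rope(x, pos, theta=10000.0):
    # rotate consecutive (even, odd) channel pairs by pos * freq
    d = x.shape[-1]
    freqs = theta ** (-torch.arange(0, d, 2, dtype=torch.float64) / d)
    ang = pos * freqs
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * torch.cos(ang) - x2 * torch.sin(ang)
    out[..., 1::2] = x1 * torch.sin(ang) + x2 * torch.cos(ang)
    return out

q = torch.randn(8, dtype=torch.float64)
k = torch.randn(8, dtype=torch.float64)
s_true = rope(q, 100.0) @ rope(k, 97.0)   # true positions, relative distance 3
s_capped = rope(q, 3.0) @ rope(k, 0.0)    # q rotated by 3, k left unrotated
print(torch.allclose(s_true, s_capped))   # True: only the difference matters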
