
lm-infinite's Issues

The passkey code fails to run

llama.py#L144 ends up calling into transformers' models/llama/modeling_llama.py, where the check if seq_len > self.max_seq_len_cached raises:

RuntimeError: Boolean value of Tensor with more than one value is ambiguous

The arguments are being passed incorrectly here. The transformers version is 4.32.1.
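This error is reproducible outside the repo whenever a multi-element tensor is used where Python expects a single boolean. A minimal sketch (assuming seq_len arrives as a tensor instead of the int the cache check expects):

import torch

seq_len = torch.arange(8)           # a tensor passed by mistake
max_seq_len_cached = 4
if seq_len > max_seq_len_cached:    # RuntimeError: Boolean value of Tensor
    pass                            # with more than one value is ambiguous

The likely fix is passing a plain int (e.g. key_states.shape[-2]) down to the rotary embedding, though where the tensor originates in this call chain is an assumption.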

GPTNeoX or Transformers support?

I'm trying to integrate LM-Infinite into GPTNeoX (pythia-dedup). I managed to get lambda_attn working, but GPTNeoX's rotary implementation is a bit different, and its attention forms QKV through a single fused 3 * hidden_size projection, whereas the other model has three independent hidden_size-wide Q/K/V layers (see the sketch below). Training works, but during inference or evaluation (single batch) I get stuck on a shape mismatch.
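For reference, a minimal sketch of that layout difference (based on a reading of the Hugging Face implementations; all dimensions are illustrative):

import torch

batch, seq, hidden, n_heads = 2, 16, 64, 4
head_size = hidden // n_heads
hidden_states = torch.randn(batch, seq, hidden)

# GPTNeoX-style: one fused 3 * hidden_size projection, split per head
query_key_value = torch.nn.Linear(hidden, 3 * hidden)
qkv = query_key_value(hidden_states).view(batch, seq, n_heads, 3 * head_size)
q = qkv[..., :head_size].permute(0, 2, 1, 3)                # [b, h, s, d]
k = qkv[..., head_size:2 * head_size].permute(0, 2, 1, 3)
v = qkv[..., 2 * head_size:].permute(0, 2, 1, 3)

# Llama-style: three independent hidden_size-wide projections
q_proj = torch.nn.Linear(hidden, hidden)
q_sep = q_proj(hidden_states).view(batch, seq, n_heads, head_size).transpose(1, 2)

assert q.shape == q_sep.shape == (batch, n_heads, seq, head_size)

Inference-time shape mismatches often come from the KV-cache path, where the fused layout gets sliced differently from the separated one, but that is a guess without the traceback.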

I did manage to see the training benefit of lambda_attn, with a higher it/s. The GPU metrics are smoother and stay at high throughput. The CPU also shows higher compute demand than with traditional training, but it doesn't appear to contend with the training itself. As a test, I managed to train a larger context on the same hardware at higher performance, so this clearly works.

I was wondering whether a folder or a separate repo with these modeling_$model.py files, ready to drop into transformers, would help simplify setup and adoption?

Improved GPU memory usage but slower inference speed?

Hi, thanks for the nice work! I tried the following code to enable LM-Infinite for Llama, following the README:

import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-chat-hf',
    torch_dtype=torch.bfloat16, device_map="cuda", low_cpu_mem_usage=True,
)

from models.llama import convert_llama_model
model = convert_llama_model(model, 4096, 10)

and then do inference as usual. GPU memory usage is lower than with regular attention, but inference becomes much slower (roughly 10x). I'm using an A100 GPU and GPU-Util is very low, around 10%. I wonder if you have any idea why this happens? Many thanks.
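One way to pin down the slowdown is to time decoding directly. A sketch (assumes a tokenizer loaded alongside the model; the prompt and token counts are arbitrary):

import time
import torch

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
new_tokens = out.shape[-1] - inputs.input_ids.shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")

Low GPU utilization alongside lower memory use is consistent with an attention path that launches many small kernels rather than one fused kernel, though that is a guess about the converted model's internals.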

Some errors.

Hi,

When I run the code, I encounter two errors:

1. An error occurs when running the evaluation on the passkey retrieval task:
Traceback (most recent call last):
File "scripts/eval_downstream_tasks.py", line 121, in
main(args)
File "scripts/eval_downstream_tasks.py", line 71, in main
output, output_ids = model.generate(
TypeError: generate() missing 1 required positional argument: 'do_sample'

2. An error occurs when running generation:
Traceback (most recent call last):
File "scripts/eval_generation.py", line 107, in
main(args)
File "scripts/eval_generation.py", line 94, in main
scores = generation_overall_metric(
File "LM-Infinite/data/generation_metrics.py", line 6, in generation_overall_metric
rouge = evaluate.load("rouge")
File "python3.8/dist-packages/evaluate/loading.py", line 731, in load
evaluation_module = evaluation_module_factory(
File "python3.8/dist-packages/evaluate/loading.py", line 681, in evaluation_module_factory
raise FileNotFoundError(
FileNotFoundError: Couldn't find a module script at LM-Infinite/rouge/rouge.py. Module 'rouge' doesn't exist on the Hugging Face Hub either.

Looking forward to your reply!
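Two notes on these. For error 1, a possible fix, assuming from the message alone that the script's generate() wrapper declares do_sample as a required argument:

output, output_ids = model.generate(
    input_ids,          # stand-in for whatever the script already passes
    do_sample=False,    # the suggested change: pass it explicitly
)

For error 2, evaluate.load("rouge") resolves the metric script via the Hugging Face Hub and also needs the rouge_score package, so a missing dependency or a lack of network access can surface as this FileNotFoundError. The happy path (assuming pip install evaluate rouge_score and network access):

import evaluate

rouge = evaluate.load("rouge")  # fetches the metric script from the Hub
print(rouge.compute(predictions=["hello there"], references=["hello there"]))

If the machine is offline, downloading the metric on a connected machine and pointing evaluate.load at a local copy of the script is a common workaround.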

limited_distance_forward() got an unexpected keyword argument 'padding_mask'

I'm trying to run the eval script.

PYTHONPATH=. deepspeed --include localhost:$CUDA_VISIBLE_DEVICES --master_port $MASTER_PORT \
    scripts/eval_downstream_tasks.py \
    --deepspeed_config configs/zero3_efficient_config.json \
    --model meta-llama/Llama-2-7b-hf --tokenizer_path meta-llama/Llama-2-7b-hf \
    --use_lambda_attention --local_branch 4096 --global_branch 100 --limit_distance 4096 \
    --dataset passkey_retrieval --dataset_dir ${PASSKEY_DATA} --dataset_group ${MAX_LENGTH} \
    --max_generation_length 10 --evaluate_metrics \
    --log_dir $LOG_DIR/$TRIAL
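A likely cause, as a hedged guess: transformers releases around 4.34 began passing an extra padding_mask keyword into each attention layer's forward, which a monkey-patched forward that doesn't declare it will reject. Options are pinning transformers to the version the repo targets, or shimming the patched forward to absorb the kwarg. A sketch of the shim (assumes the model has already been converted):

import functools

def drop_padding_mask(forward_fn):
    @functools.wraps(forward_fn)
    def wrapped(*args, padding_mask=None, **kwargs):  # absorb the new kwarg
        return forward_fn(*args, **kwargs)
    return wrapped

for layer in model.model.layers:
    layer.self_attn.forward = drop_padding_mask(layer.self_attn.forward)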

How to run inference?

The documentation does not make it clear how to perform inference using the lambda attention.
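A minimal end-to-end sketch, assembled from the README snippet quoted in the memory-usage issue above (the model name and branch sizes come from that snippet; the prompt is arbitrary):

import torch
from transformers import AutoTokenizer, LlamaForCausalLM
from models.llama import convert_llama_model

name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = LlamaForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="cuda", low_cpu_mem_usage=True,
)
model = convert_llama_model(model, 4096, 10)  # arguments as in the README snippet

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

After conversion, generation goes through the standard transformers API; convert_llama_model only swaps each layer's attention forward.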

Should the llama model be fine-tuned?

Hello! I am new to LLMs and I want to reproduce your nice work with the LLaMA model (not Llama 2).
Should I fine-tune the LLaMA model on ArXiv or OpenWebText2 before evaluating it?
As I understand it, these two datasets are both part of LLaMA's pre-training data, so maybe the raw LLaMA weights just work?
Thank you so much for your reply!

kv_seq_len bug?

if kv_seq_len > local_branch + global_branch and use_lambda_mask:
    past_key_value = (
        torch.cat([
            key_states[..., :global_branch, :],
            key_states[..., -local_branch:, :],
        ], dim=-2),
        torch.cat([
            value_states[..., :global_branch, :],
            value_states[..., -local_branch:, :],
        ], dim=-2),
        key_position_ids[..., :local_branch + global_branch],
    ) if use_cache else None

The code in models/llama.py lines 144-155 truncates past_key_value to the global and local branches but does not update kv_seq_len to match. Is that intentional?
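If the concern is that later code in the same forward still sees the pre-truncation length, the update the question implies would be a single line after the truncation (a sketch; names are taken from the snippet above):

kv_seq_len = past_key_value[0].shape[-2]  # == global_branch + local_branch

Whether anything downstream actually consumes kv_seq_len after this point is something only the full models/llama.py can answer.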

TypeError: attn_forward_factory() missing 5 required positional arguments: 'top_k_attention', 'top_k_insert_at', 'top_k_from_layer', 'top_k_to_layer', and 'layer_i'

After loading the llama-7b-chat model and running model = convert_llama_model(model, 4096, 10), this error occurred:

for layer_i, hidden_layer in enumerate(model.model.layers):
    attn = hidden_layer.self_attn
    attn.forward = attn_forward_factory(
        attn, True, local_branch, global_branch, local_branch, 0
    )
return model
TypeError: attn_forward_factory() missing 5 required positional arguments: 'top_k_attention', 'top_k_insert_at', 'top_k_from_layer', 'top_k_to_layer', and 'layer_i'
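A hedged guess at a patch, using only the argument names from the error message (the values meant to disable top-k attention are assumptions; check attn_forward_factory's definition in models/llama.py for the real defaults):

for layer_i, hidden_layer in enumerate(model.model.layers):
    attn = hidden_layer.self_attn
    attn.forward = attn_forward_factory(
        attn, True, local_branch, global_branch, local_branch, 0,
        top_k_attention=None, top_k_insert_at=None,
        top_k_from_layer=None, top_k_to_layer=None,
        layer_i=layer_i,
    )

The mismatch suggests convert_llama_model's internal call is out of sync with the factory's current signature, so pulling the repo's latest version may also resolve it.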

Implementation with RoPE

Hi, thanks for sharing this nice work!
I am a little confused about why all k vectors are kept unrotated while all q vectors are rotated on the global branch. Any explanation would be appreciated!
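For context, the RoPE property this likely relies on, as a sketch rather than the authors' answer: the attention score between a query rotated to position m and a key rotated to position n depends only on m - n, so leaving keys unrotated (n = 0) and rotating every query by a fixed distance d makes each global-branch score behave as if the key were exactly d tokens away. A minimal single-head demonstration (a toy RoPE, not the repo's code):

import torch

def rope(x, pos, theta=10000.0):
    # rotate consecutive (even, odd) channel pairs by pos * freq
    d = x.shape[-1]
    freqs = theta ** (-torch.arange(0, d, 2, dtype=torch.float64) / d)
    ang = pos * freqs
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * torch.cos(ang) - x2 * torch.sin(ang)
    out[..., 1::2] = x1 * torch.sin(ang) + x2 * torch.cos(ang)
    return out

q = torch.randn(8, dtype=torch.float64)
k = torch.randn(8, dtype=torch.float64)
s_true = rope(q, 100.0) @ rope(k, 97.0)   # true positions, relative distance 3
s_capped = rope(q, 3.0) @ rope(k, 0.0)    # q rotated by 3, k left unrotated
print(torch.allclose(s_true, s_capped))   # True: only the difference matters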
