
infinitransformer's Issues

About the memory missing position information

I noticed that the memory retrieval and update happen before 'apply_rotary_pos_emb'. I'm wondering whether the memory lacking position information might confuse the model's perception of the order of historical information?
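
For what it's worth, here is a tiny standalone check of the concern (my own sketch, not code from this repo): the update M += K^T V is permutation-invariant within a segment, so if RoPE is applied only afterwards, the memory cannot tell in which order the tokens arrived.

import torch

torch.manual_seed(0)
K = torch.randn(16, 64)           # [segment_len, head_dim], made-up sizes
V = torch.randn(16, 64)
perm = torch.randperm(16)

M_ordered = K.T @ V               # memory built in the original token order
M_shuffled = K[perm].T @ V[perm]  # memory built from the same segment, shuffled
print(torch.allclose(M_ordered, M_shuffled, atol=1e-5))   # True -> order is not encoded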

Code not running on GPU

It seems like the code is forced to run on the CPU (which sends my computer out of RAM). If I check whether a GPU is available, torch says True, but the model still loads into CPU RAM. Looking into the code, it seems that any additional config (such as device_map={"GPU": 0} inside GemmaConfig.from_pretrained) is ignored and never used by modeling_gemma.py... Any advice?
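
Not the author, but as far as I can tell device_map is an argument of the model's from_pretrained (and needs accelerate installed), not of GemmaConfig.from_pretrained, which is why it gets ignored. A sketch of the two usual ways to get the weights onto the GPU, assuming the patched GemmaForCausalLM is importable:

import torch
from transformers import GemmaForCausalLM   # or the repo's patched class

model = GemmaForCausalLM.from_pretrained(
    "google/gemma-2b",
    torch_dtype=torch.bfloat16,
    device_map="auto",           # dispatches the weights to the available GPU(s)
)
# or, without accelerate:
# model = GemmaForCausalLM.from_pretrained("google/gemma-2b", torch_dtype=torch.bfloat16).to("cuda")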

Support ZeRO-3?

I used accelerate launch with ZeRO-3 to run train.llama.infini.noclm.1Mseq.sh, but I got this:
RuntimeError: Function 'LinearFunctionForZeroStage3Backward' returned nan values in its 0th output
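
Not a fix, just how I would narrow it down (a sketch): check whether the NaN already shows up with ZeRO-2 (or no ZeRO) at a short sequence length, and turn on autograd anomaly detection in the training script so the first op that produces NaN is reported with a stack trace:

import torch
torch.autograd.set_detect_anomaly(True)   # slows training, enable only for debugging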

Config has no attn_implementation = "eager"

I printed the config, but there is no attn_implementation = "eager". Here is the output:
GemmaConfig {
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "memory_size": 2048,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "segment_size": 16,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.3",
  "use_cache": false,
  "vocab_size": 256000
}
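
As far as I can tell, attn_implementation is not serialized into config.json (it lives in the private config._attn_implementation), so it will never appear in the printed config. A sketch of forcing eager attention at load time:

from transformers import GemmaForCausalLM   # or the repo's patched class

model = GemmaForCausalLM.from_pretrained(
    "google/gemma-2b",
    attn_implementation="eager",             # selects GEMMA_ATTENTION_CLASSES["eager"]
)
print(model.config._attn_implementation)     # "eager"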

Model generating random sequence

By saving the model and reloading it I managed to get the model working, both quantized and in full precision (it still uses at most 10 GB of GPU RAM).
However, the model generates random characters. Here's the output:


GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaInfiniAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
      )
    )
    (norm): GemmaRMSNorm()
  )
  (lm_head): Linear(in_features=2048, out_features=256000, bias=False)
)
Input:
This work introduces an efficient method to scale Transformer-based
generated_sequence:
<bos>fetchone’mittlungPublicado:, (хьтан cobr " about mattino
relenting ? Alamofire theyallclose"conio
Generated:
<bos>fetchonekjø  professeurs { AssemblyCulture for disagre ‘ Compañ ‘…GraphicsUnit youXmlEnum atpaddingVertical such. nakalista .enumi,stdarg It Caleb including autunno ifwithIdentifierഛ“ Muitos for якостіبسم  relenting

When the model is printed it correctly says "GemmaInfiniAttention" for the self_attn layers, but it still generates random characters. What am I doing wrong?

What is the min GPU memory required to fine-tune the model?

First of all, thank you very much for your work.

I'm trying to train Gemma-2B with a 32K sequence length and a 2K segment size on a single A6000 Ada (48 GB).
But even if I adjust the parameters in train.gemma.infini.noclm.sh as shown below, it still runs out of GPU memory.
Is this normal?

accelerate launch --mixed_precision='bf16' \
    train.gemma.infini.noclm.py \
    --model_name_or_path='google/gemma-2b' \
    --segment_length=2 \
    --block_size=32 \
    --dataset_name='wikitext' \
    --dataset_config_name='wikitext-2-raw-v1' \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --weight_decay=1.0 \
    --output_dir='./models/gemma-2b-infini-noclm-wikitext' \
    --checkpointing_steps=10 \
    --num_train_epochs=1 \
    --learning_rate=5e-5 \
    --seed=42 \
    --low_cpu_mem_usage \
    --report_to='wandb' \
    --preprocessing_num_workers=64 \
    --with_tracking \
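
For what it's worth, a rough back-of-the-envelope suggests this is expected for full fine-tuning with Adam (my sketch; the Gemma-2B parameter count is approximate):

params = 2.5e9         # ~2.5B parameters for Gemma-2B (approximate)
bytes_per_param = 16   # roughly: weights + grads + Adam m/v (exact split depends on the mixed-precision setup)
print(f"{params * bytes_per_param / 1e9:.0f} GB")   # ~40 GB before any activations

So on a single 48 GB card, the weights plus optimizer state already leave almost nothing for the 32K-token activations; LoRA/QLoRA, gradient checkpointing, or more GPUs seem unavoidable here.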

mem and norm_term are nan?

I loaded the model and ran inference; I found that mem and norm_term grow very large, and by the second round they are inf.
"""
[Update] self.norm_term 7444.0 1181.0
[Update] self.memory 11912.0 -11120.0
[Update] self.norm_term 6420.0 1126.0
[Update] self.memory 13560.0 -14040.0
[Update] self.norm_term 7524.0 1179.0
[Update] self.memory 13808.0 -12528.0
[Update] self.norm_term 6344.0 1184.0
[Update] self.memory 18416.0 -15440.0
[Update] self.norm_term 8456.0 613.5
[Update] self.memory 23856.0 -26608.0
[Update] self.norm_term 8648.0 964.5
[Update] self.memory 25968.0 -26096.0
[Update] self.norm_term 12440.0 175.875
Loss @ segment 0: 13.347726821899414
--------------------------------------------- Round 0 ----------------------------------------------
[Update] self.memory inf -inf
[Update] self.norm_term 46080.0 586.0
[Update] self.memory nan nan
[Update] self.norm_term nan nan
"""
The values shown are the max and min of the tensor; this looks like a pure numerical (value) error.
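
One thing I would check first (a sketch): which dtype the inference actually runs in. fp16 tops out at 65504, so maxima around 46000 are already close to overflow, while bf16 has a much larger range; and without the ELU+1 nonlinearity from the paper (see the activation-function issue further down), the accumulated memory grows without bound anyway.

import torch
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e+38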

load model failed

I downloaded the gemma model and got an error when loading it locally.

code:
pretrained_model = GemmaForCausalLM.from_pretrained("./model")

error:
Traceback (most recent call last):
File "test_basic.py", line 36, in
pretrained_model = GemmaForCausalLM.from_pretrained("./model")
File "/root/data/vjuicefs_translation_002/11120000/InfiniTransformer/src/transformers/src/transformers/modeling_utils.py", line 3550, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/root/data/vjuicefs_translation_002/11120000/InfiniTransformer/src/transformers/src/transformers/models/gemma/modeling_gemma.py", line 1368, in init
self.model = GemmaModel(config)
File "/root/data/vjuicefs_translation_002/11120000/InfiniTransformer/src/transformers/src/transformers/models/gemma/modeling_gemma.py", line 1119, in init
[
File "/root/data/vjuicefs_translation_002/11120000/InfiniTransformer/src/transformers/src/transformers/models/gemma/modeling_gemma.py", line 1120, in
GemmaDecoderLayer(config, layer_idx)
File "/root/data/vjuicefs_translation_002/11120000/InfiniTransformer/src/transformers/src/transformers/models/gemma/modeling_gemma.py", line 877, in init
self.self_attn = GEMMA_ATTENTION_CLASSES[config._attn_implementation](
File "/root/data/vjuicefs_translation_002/11120000/InfiniTransformer/src/transformers/src/transformers/models/gemma/modeling_gemma.py", line 709, in init
self.segment_size = config.segment_size
File "/root/data/vjuicefs_translation_002/11120000/InfiniTransformer/src/transformers/src/transformers/configuration_utils.py", line 263, in getattribute
return super().getattribute(key)
AttributeError: 'GemmaConfig' object has no attribute 'segment_size'
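
A possible workaround, as a sketch (untested, and assuming the repo's patched Gemma classes are the ones being imported): the locally saved config.json apparently lacks the extra Infini fields, so set them on the config before rebuilding the model.

from transformers import GemmaConfig, GemmaForCausalLM   # or the repo's patched classes

config = GemmaConfig.from_pretrained("./model")
config.segment_size = 16      # value from the config dump in the issue above; adjust to your run
config.memory_size = 2048     # ditto, in case the attention class reads it
model = GemmaForCausalLM.from_pretrained("./model", config=config)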

Discord server for this?

Hi! How about someone creates a Discord server to discuss this implementation? (I'm not good enough to help much myself, but I'm sure you would find a lot of help in a server, and I'd like to follow the development!)

Issue while running test_train.small.gemma.infini.py

Did anyone face this issue?

warnings.warn(
Traceback (most recent call last):
File "test_train.small.gemma.infini.py", line 150, in
trainer.train()
File "/transformers/src/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/transformers/src/transformers/trainer.py", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/transformers/src/transformers/trainer.py", line 3238, in training_step
loss = self.compute_loss(model, inputs)
File "/transformers/src/transformers/trainer.py", line 3264, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 822, in forward
return model_forward(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 810, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/jupyter_workspace/Raghav/InfiniTransformer/infini_gemma/modeling_infini_gemma.py", line 1613, in forward
outputs = self.model(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/jupyter_workspace/Raghav/InfiniTransformer/infini_gemma/modeling_infini_gemma.py", line 1371, in forward
layer_outputs = self._gradient_checkpointing_func(
File "/usr/local/lib/python3.8/dist-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/external_utils.py", line 36, in inner
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 487, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 598, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 262, in forward
outputs = run_function(*args)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/jupyter_workspace/Raghav/InfiniTransformer/infini_gemma/modeling_infini_gemma.py", line 1056, in forward
_attended = self.self_attn(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/jupyter_workspace/Raghav/InfiniTransformer/infini_gemma/modeling_infini_gemma.py", line 868, in forward
query_states, key_states = apply_rotary_pos_emb(
File "/jupyter_workspace/Raghav/InfiniTransformer/infini_gemma/modeling_infini_gemma.py", line 272, in apply_rotary_pos_emb
q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: The size of tensor a (65536) must match the size of tensor b (8192) at non-singleton dimension 2
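
In case it helps, a quick sanity check of the numbers (my sketch): 8192 is gemma-2b's max_position_embeddings, i.e. the length of the rotary cos/sin cache, and 65536 looks like the full training sequence, so the whole block rather than a single segment seems to reach apply_rotary_pos_emb.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/gemma-2b")
print(config.max_position_embeddings)            # 8192
print(65536 / config.max_position_embeddings)    # 8.0 -> eight RoPE caches' worth of tokens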

Segment and block size error

Hi! After many, many hours I've successfully run this code with LoRA, PEFT and BnB quantization. I can train Gemma-2B on my 3060 12 GB GPU, and the output becomes decent (or rather: not so random anymore) after just a few minutes of training on some Wikipedia data. Loss starts at 50 and goes down to 3.5, which seems way, way better than before (I was stuck at a loss of 9 for a long time, even with hours of training).

However, due to the low-memory setup, I had to greatly decrease the segment size. My plan was to decrease it from 8192 (using "test_model_to_hf", the default value is 8192) to 256, and train with a block size of 1024 (in theory, with 4-bit quantization I can get up to 1600 tokens of block size, while 8-bit crashes frequently with 1024 - if I set up a checkpoint system I might train the model in 8-bit, restarting and loading from the checkpoint when it crashes). But the code crashes in the trainer, saying:
"RuntimeError: The size of tensor a (1024) must match the size of tensor b (256) at non-singleton dimension 2".
Any clue why? Here's my modified test_train.small.gemma.infini.py code:

print("Torch Version:", torch.__version__)
print("CUDA:", torch.cuda.is_available())

if torch.cuda.is_available():
    device = "cuda:0"  # set GPU device using CUDA_VISIBLE_DEVICES
else:
    device = "cpu"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    #bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True
)


model = GemmaForCausalLM.from_pretrained(
    "./models/gemma-2b",
    attn_implementation="eager",
    torch_dtype="auto", device_map="auto",
    quantization_config=bnb_config,
    low_cpu_mem_usage=True
)
config = model.config
print(config)
print(model)
l_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "lm_head",
        "up_proj",
        "down_proj"
    ],
    lora_dropout=0.05,
    modules_to_save=["gate"],
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, l_config)
load = False  # Change this if you need to load the adapters
if load:
    model = PeftModel.from_pretrained(model, "models/gemma-2b_adapters",
                              attn_implementation="eager",
                              torch_dtype="auto", device_map="auto",
                              quantization_config=bnb_config,
                              low_cpu_mem_usage=True,
                                    is_trainable=True)


model.enable_input_require_grads()

tokenizer = AutoTokenizer.from_pretrained("models/gemma-2b")
wiki = load_dataset("wikipedia", "20220301.simple", split='train[:200]')


def tokenize_function(examples):
    return tokenizer(examples["text"])


try:
    column_names = list(wiki["train"].features)
except KeyError:
    column_names = list(wiki.features)
tokenized_datasets = wiki.map(
    tokenize_function, remove_columns=column_names, batched=True
)


block_size = config.segment_size * 4  # 4 * segment_size (32768 with the default 8192; 1024 with segment_size=256)
print("block_size:", block_size)


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, and if the total_length < block_size  we exclude this batch and return an empty dict.
    # We could add padding if the model supported it instead of this drop, you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
)

print(lm_datasets)

training_args = TrainingArguments(
    output_dir="./models/gemma-2b-wikitext",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,  # to test batch dim
    gradient_accumulation_steps=4,
    save_total_limit=1,
    report_to="none",  # "none" if you don't want to report to wandb
    run_name="gemma-2b-wikitext",
    optim="adamw_bnb_8bit",
    learning_rate=1e-4,
    bf16=True,
    logging_first_step=True,
    logging_steps=1,
    save_strategy="no",
    evaluation_strategy="no",
    # warmup_ratio=0.1,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
)

try:
    train_dataset = lm_datasets["train"]
except KeyError:
    train_dataset = lm_datasets

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    # eval_dataset=lm_datasets["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

The error is thrown by the trainer.train() row.

Edit: the error is raised from the trainer, but it originates in modeling_infini_gemma.py, at line 272:
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
--> q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed

Also, this error happens even if I load everything in full precision: even before going OOM, it says "RuntimeError: The size of tensor a (32768) must match the size of tensor b (8192) at non-singleton dimension 2". This happens with test_train.small.gemma.infini.py without any changes and without my locally saved model (it downloads the google/gemma-2b model).

Segment-Wise Attention

From what I understand from the paper, the sequence is split into smaller segments which are fed to the attention layers. Is this not implemented yet in this repository?
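
For reference, this is roughly how I read the paper's segment-wise flow (my own sketch, not this repo's code): the long sequence is chopped into fixed-size segments, each segment goes through ordinary local attention, and the compressive memory carries information across segments.

import torch

def segment_wise_forward(hidden_states, segment_size, attend_segment):
    # hidden_states: [batch, seq_len, dim]; attend_segment: a callable that does local
    # attention plus the memory retrieve/update for one segment
    outputs = []
    for start in range(0, hidden_states.size(1), segment_size):
        segment = hidden_states[:, start : start + segment_size]
        outputs.append(attend_segment(segment))
    return torch.cat(outputs, dim=1)

# e.g. with a dummy per-segment op:
x = torch.randn(1, 8192, 2048)
print(segment_wise_forward(x, 2048, lambda seg: seg).shape)   # torch.Size([1, 8192, 2048])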

Model loses information very quickly

Hi! I trained the model with LoRA and 8-bit precision down to a training loss of 1.5/2.5. The generation is segment-wise, but the model does not seem to generate correct text. It cannot pass a needle-in-a-haystack test even in small settings (fewer tokens than the segment size, i.e. 400 for me). It starts to spit out nonsense very quickly. For example:
I've tried a NIAH test with this pattern:
"There is an important info hidden inside a lot of irrelevant text. Find it and memorize it. I will quiz you about the important information there."
Then the filler "\nThe grass is green. The sky is blue. The sun is yellow. Here we go. There and back again." is repeated many times (enough to reach 400 tokens, 3,600 tokens and 10k tokens).
Inside the loop, at a random position, there is "\nThe pass key is 72498. Remember it. 72498 is the pass key.". At the end of the prompt there is "What is the pass key? The pass key is ", and the base model completes it correctly with 72498 up to 3,600 tokens (beyond that my GPU goes OOM).
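
For reproducibility, this is roughly how I assemble the passkey prompt (a sketch; n_repeats controls the 400 / 3,600 / 10k token variants):

import random

header = ("There is an important info hidden inside a lot of irrelevant text. "
          "Find it and memorize it. I will quiz you about the important information there.")
filler = ("\nThe grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again.")
needle = "\nThe pass key is 72498. Remember it. 72498 is the pass key."
question = "\nWhat is the pass key? The pass key is "

n_repeats = 40                                     # tune for the desired token count
parts = [filler] * n_repeats
parts.insert(random.randrange(len(parts)), needle)
prompt = header + "".join(parts) + question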

With Infini-attention, the model can't complete it correctly even once. Moreover, the repeated pattern gets "broken"; here's a completion example:
" The sun is yellow. Here we go. There and back again.\nThe grass is green. The sky is bluer. The sun is yellow. Here we go. There and back again.\nThe grass is green. The sky is bluer. The sun is yellow. Here we go. There and back again.\nThe grass is green. The sky is blue. The sun is yellow. Here we will. They will be a bit of the distance, at least we"

It behaves as if the model can't retain information at all, or only for a very short time. Has anyone tested how well these models perform? I sadly noticed that the repo has not been updated in a month :-(

Limitations of the method

Just wondering whether Infini-attention has any limitations, such as inference speed or model performance. There isn't much discussion of this in the paper.

question about activation function

def _retrieve_from_memory(self, query_states):
    # Retrieve context from compressive memory using linear attention (Eq. 3)
    if self.memory is None:
        debug_print("[Retrieve] No memory found")
        return torch.zeros_like(query_states)
    debug_print("[Retrieve] query_states.shape", query_states.shape)
    debug_print("[Retrieve] self.memory.shape", self.memory.shape)
    memory_output = torch.matmul(query_states, self.memory) / self.norm_term
    return memory_output

def _update_memory(self, key_states, value_states):
    # Update compressive memory with new key-value states (Eq. 4)
    if self.memory is not None:
        self.memory = self.memory + torch.matmul(
            key_states.transpose(-2, -1), value_states
        )
        debug_print("[Update] self.memory.shape", self.memory.shape)
    else:
        self.memory = torch.matmul(key_states.transpose(-2, -1), value_states)
        debug_print("[Update] self.memory.shape", self.memory.shape)
    if self.norm_term is not None:
        self.norm_term = self.norm_term + key_states.sum(dim=-2)
    else:
        self.norm_term = key_states.sum(dim=-2)

In the paper, the retrieval (Eq. 3) and the update (Eq. 4) apply a nonlinearity σ(·) = ELU(·) + 1 to the query and key states before they touch the memory:

[Screenshot of the retrieval and update equations from the paper]

Did you forget to add the activation function?
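
A sketch of what I mean (untested; it just adds σ(x) = ELU(x) + 1 where I believe Eq. (3)/(4) expect it, everything else kept as in the current code):

import torch
import torch.nn.functional as F

def _retrieve_from_memory(self, query_states):
    # Retrieve context from compressive memory using linear attention (Eq. 3)
    if self.memory is None:
        return torch.zeros_like(query_states)
    sigma_q = F.elu(query_states) + 1.0
    return torch.matmul(sigma_q, self.memory) / self.norm_term

def _update_memory(self, key_states, value_states):
    # Update compressive memory and norm term with new key-value states (Eq. 4)
    sigma_k = F.elu(key_states) + 1.0
    update = torch.matmul(sigma_k.transpose(-2, -1), value_states)
    self.memory = update if self.memory is None else self.memory + update
    norm_update = sigma_k.sum(dim=-2)
    self.norm_term = norm_update if self.norm_term is None else self.norm_term + norm_update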

BitLinear

What are your thoughts on adding BitLinear?
