Hi! After many, many hours I've successfully run this code with LoRA, PEFT, and BnB quantization. I can train Gemma-2B on my 3060 12GB GPU, and the output becomes decent (or rather: not so random anymore) after just a few minutes of training on some Wikipedia data. Loss starts at 50 and goes down to 3.5, which seems way, way better than before (I was stuck at a loss of 9 for a long time, even after hours of training).
However, due to the low-memory setup, I had to greatly decrease the segment size. My plan was to decrease it from 8192 (the default value used by "test_model_to_hf") to 256, and train with a block size of 1024 (in theory, with 4-bit quantization I can fit a block size of up to 1600 tokens, while 8-bit crashes frequently at 1024; if I set up a checkpoint system I might train the model in 8-bit, restarting and loading from the checkpoint whenever it crashes, see the resume sketch after the error below). But the code crashes in the trainer, saying:
"RuntimeError: The size of tensor a (1024) must match the size of tensor b (256) at non-singleton dimension 2".
Any clue why? Here is my modified test_train.small.gemma.infini.py code:
print("Torch Version:", torch.__version__)
print("CUDA:", torch.cuda.is_available())
if torch.cuda.is_available():
device = "cuda:0" # set GPU device using CUDA_VISIBLE_DEVICES
else:
device = "cpu"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    #bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True,
)
model = GemmaForCausalLM.from_pretrained(
    "./models/gemma-2b",
    attn_implementation="eager",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
config = model.config
print(config)
print(model)
l_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "lm_head",
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,
    modules_to_save=["gate"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, l_config)
load = False  # Change this if you need to load the adapters
if load:
    model = PeftModel.from_pretrained(
        model,
        "models/gemma-2b_adapters",
        attn_implementation="eager",
        torch_dtype="auto",
        device_map="auto",
        quantization_config=bnb_config,
        low_cpu_mem_usage=True,
        is_trainable=True,
    )
model.enable_input_require_grads()
tokenizer = AutoTokenizer.from_pretrained("models/gemma-2b")
wiki = load_dataset("wikipedia", "20220301.simple", split='train[:200]')
def tokenize_function(examples):
    return tokenizer(examples["text"])

try:
    column_names = list(wiki["train"].features)
except KeyError:
    column_names = list(wiki.features)

tokenized_datasets = wiki.map(
    tokenize_function, remove_columns=column_names, batched=True
)
block_size = config.segment_size * 4  # 1024 with segment_size=256 (would be 32768 with the default 8192)
print("block_size:", block_size)
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; if total_length < block_size we exclude this
    # batch and return an empty dict. We could add padding instead of this drop
    # if the model supported it; customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
)
print(lm_datasets)
training_args = TrainingArguments(
    output_dir="./models/gemma-2b-wikitext",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,  # to test batch dim
    gradient_accumulation_steps=4,
    save_total_limit=1,
    report_to="none",  # "none" if you don't want to report to wandb
    run_name="gemma-2b-wikitext",
    optim="adamw_bnb_8bit",
    learning_rate=1e-4,
    bf16=True,
    logging_first_step=True,
    logging_steps=1,
    save_strategy="no",
    evaluation_strategy="no",
    # warmup_ratio=0.1,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
)
try:
    train_dataset = lm_datasets["train"]
except KeyError:
    train_dataset = lm_datasets

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    # eval_dataset=lm_datasets["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
The error is thrown by the trainer.train() line.
Edit: the error is raised from the trainer, but it originates in modeling_infini_gemma, at line 272:
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
--> q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
Also, this error happens even if I load everything in full precision: even before going OOM, it says "RuntimeError: The size of tensor a (32768) must match the size of tensor b (8192) at non-singleton dimension 2". This happens with the unmodified test_train.small.gemma.infini, without the preloaded model (it downloads google/gemma-2b).
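In case it helps, here is a minimal repro of the shape clash I think is happening (hypothetical shapes; I'm assuming the rotary cos/sin caches are built per segment while q and k span the whole block):

import torch

# Hypothetical shapes: the rotary cos/sin cache covers one segment
# (segment_size positions) while q spans the whole block, so the
# element-wise multiply q * cos fails at dimension 2.
segment_size, block_size, num_heads, head_dim = 256, 1024, 8, 64
q = torch.randn(1, num_heads, block_size, head_dim)  # queries for the full block
cos = torch.randn(1, 1, segment_size, head_dim)      # per-segment rotary cache
q * cos  # RuntimeError: The size of tensor a (1024) must match the size of tensor b (256) at non-singleton dimension 2

This matches both failures: 1024 vs 256 with my settings, and 32768 vs 8192 with the defaults (block_size = 4 * segment_size in both cases).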