Comments (99)
我这边也实现了baichuan-7b 的lora微调,baichuan模型的结构跟llama一致,它的SFT微调方法跟bloom/llama基本一致的。
支持baichuan-7b微调项目地址:https://github.com/shibing624/MedicalGPT
该项目还实现了GPT模型训练,包括二次预训练、有监督微调、奖励建模、强化学习训练。
运行以下指令即可实现 belle 数据集指令微调(instruction-tuning):
python3 supervised_finetuning.py \
--model_type auto \
--model_name_or_path baichuan-inc/baichuan-7B \
--train_file_dir ./data/finetune \
--validation_file_dir ./data/finetune \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 1 \
--do_train \
--do_eval \
--use_peft True \
--max_train_samples 1000 \
--max_eval_samples 10 \
--num_train_epochs 1 \
--learning_rate 2e-5 \
--warmup_ratio 0.05 \
--weight_decay 0.05 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 1 \
--max_source_length 256 \
--max_target_length 256 \
--output_dir outputs-sft-baichuan-v1 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--fp16 \
--torch_dtype float16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True
欢迎大家测试,验证效果。
from baichuan-7b.
@hiyouga 你好,我在跑您这个代码时遇到以下错误,怎么解决呢?
Traceback (most recent call last):
File "/home/huchangyou/workspace/2023/chatgpt/llama-efficient-tuning/src/train_sft.py", line 97, in
main()
File "/home/huchangyou/workspace/2023/chatgpt/llama-efficient-tuning/src/train_sft.py", line 69, in main
train_result = trainer.train()
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/transformers/trainer.py", line 2759, in training_step
loss = self.compute_loss(model, inputs)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/transformers/trainer.py", line 2784, in compute_loss
outputs = model(**inputs)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/peft/peft_model.py", line 678, in forward
return self.base_model(
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/huchangyou/.cache/huggingface/modules/transformers_modules/baichuan-7B/modeling_baichuan.py", line 596, in forward
outputs = self.model(
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/huchangyou/.cache/huggingface/modules/transformers_modules/baichuan-7B/modeling_baichuan.py", line 480, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/huchangyou/.cache/huggingface/modules/transformers_modules/baichuan-7B/modeling_baichuan.py", line 476, in custom_forward
return module(*inputs, output_attentions, None)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/huchangyou/.cache/huggingface/modules/transformers_modules/baichuan-7B/modeling_baichuan.py", line 293, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/huchangyou/.cache/huggingface/modules/transformers_modules/baichuan-7B/modeling_baichuan.py", line 192, in forward
proj = self.W_pack(hidden_states)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/peft/tuners/lora.py", line 565, in forward
result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: expected scalar type Float but found Half
from baichuan-7b.
才几个小时,你们这么快的吗?
from baichuan-7b.
@smj0 请使用我项目中自带的 export_model.py 进行合并。
from baichuan-7b.
@BookerDeWitt 都支持
from baichuan-7b.
好吧,对比了些别的指引,启动加了参数--lora_target W_pack 就可以正常启动了。。。
from baichuan-7b.
牛逼,好快啊
from baichuan-7b.
牛逼
from baichuan-7b.
大佬太强了
from baichuan-7b.
支持Alpaca等指令数据集的SFT和RLHF流程:https://github.com/hiyouga/LLaMA-Efficient-Tuning
运行以下指令即可实现 Alpaca 数据集指令微调(instruction-tuning):
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \ --model_name_or_path baichuan-7B模型文件夹路径 \ --do_train \ --dataset alpaca_gpt4_zh \ --finetuning_type lora \ --lora_rank 8 \ --lora_target W_pack \ --output_dir alpaca_baichuan \ --per_device_train_batch_size 4 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 8 \ --lr_scheduler_type cosine \ --logging_steps 10 \ --save_steps 100 \ --eval_steps 100 \ --learning_rate 5e-5 \ --max_grad_norm 0.5 \ --num_train_epochs 3.0 \ --dev_ratio 0.01 \ --evaluation_strategy steps \ --load_best_model_at_end \ --plot_loss \ --fp16
有微调数据集格式吗?
from baichuan-7b.
@GalSang17 项目自带了,点进data文件夹就可以看示例格式。
from baichuan-7b.
@GalSang17 项目自带了,点进data文件夹就可以看示例格式。
谢谢!
from baichuan-7b.
赞👍🏻
from baichuan-7b.
@hiyouga 没有出现这个错误吗?
./aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [51,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [53,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [54,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [56,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [57,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [58,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [59,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
from baichuan-7b.
@bytes-lost 完整的报错信息是什么?哪一行代码导致的?
from baichuan-7b.
[INFO|trainer.py:622] 2023-06-15 17:12:03,926 >> Using cuda_amp half precision backend
[INFO|trainer.py:1779] 2023-06-15 17:12:03,933 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-15 17:12:03,934 >> Num examples = 48,329
[INFO|trainer.py:1781] 2023-06-15 17:12:03,934 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-15 17:12:03,934 >> Instantaneous batch size per device = 4
[INFO|trainer.py:1783] 2023-06-15 17:12:03,934 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1784] 2023-06-15 17:12:03,934 >> Gradient Accumulation steps = 8
[INFO|trainer.py:1785] 2023-06-15 17:12:03,934 >> Total optimization steps = 4,530
[INFO|trainer.py:1786] 2023-06-15 17:12:03,935 >> Number of trainable parameters = 4,194,304
0%| | 0/4530 [00:00<?, ?it/s]
0%| | 1/4530 [00:04<5:45:55, 4.58s/it]
0%| | 2/4530 [00:07<4:42:43, 3.75s/it]Traceback (most recent call last):
File "/mnt/data/user/LLaMA-Efficient-Tuning/src/train_sft.py", line 97, in <module>
main()
File "/mnt/data/user/LLaMA-Efficient-Tuning/src/train_sft.py", line 69, in main
train_result = trainer.train()
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 2735, in training_step
loss = self.compute_loss(model, inputs)
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 2767, in compute_loss
outputs = model(**inputs)
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/peft/peft_model.py", line 678, in forward
return self.base_model(
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/baichuan-7b/modeling_baichuan.py", line 617, in forward
outputs = self.model(
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/baichuan-7b/modeling_baichuan.py", line 501, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 89, in forward
ctx.fwd_gpu_devices, ctx.fwd_gpu_states = get_device_states(*args)
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 50, in get_device_states
fwd_gpu_states.append(torch.cuda.get_rng_state())
File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/cuda/random.py", line 31, in get_rng_state
return default_generator.get_state()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
from baichuan-7b.
@bytes-lost 应该是数组越界了,我在加载 tokenizer 时手动将 pad_token_id 设置为了 0,检查一下你那边有没有设置。输入序列中不能有大于等于 64000 的值。
from baichuan-7b.
@hiyouga
我在train_sft.py这里加上了一行,但是还是一样的报错
model, tokenizer = load_pretrained(model_args, finetuning_args, training_args.do_train, stage="sft")
tokenizer.pad_token_id = 0 # 指定pad_token_id
dataset = preprocess_data(dataset, tokenizer, data_args, training_args, stage="sft")
from baichuan-7b.
@bytes-lost 看起来是 torch 的 checkpointing 过程出现了问题,可能和本地的 torch 以及 CUDA 环境有关,我这边测试了好几遍都没有问题。
from baichuan-7b.
@bytes-lost 看起来是 torch 的 checkpointing 过程出现了问题,可能和本地的 torch 以及 CUDA 环境有关,我这边测试了好几遍都没有问题。
好的,我重新创建环境测测看,torch=2.0.1版本是可以的吗?
from baichuan-7b.
我这边一直在自己对话,而且“你是谁”,也不是需要的答案,微调代码跟上面提供的一模一样的呢
from baichuan-7b.
@gebilaoman 用项目自带 cli_demo 启动时请添加 --prompt_template ziya
参数
from baichuan-7b.
好快的速度,好猛
from baichuan-7b.
@bytes-lost 看起来是 torch 的 checkpointing 过程出现了问题,可能和本地的 torch 以及 CUDA 环境有关,我这边测试了好几遍都没有问题。
好的,我重新创建环境测测看,torch=2.0.1版本是可以的吗?
我同样的问题tokenizer.pad_token_id = 0 之后就可以了
from baichuan-7b.
from baichuan-7b.
不是 ChatGLM 的代码,是 LLAMA 那一份。https://github.com/hiyouga/LLaMA-Efficient-Tuning
from baichuan-7b.
能实现多轮对话的微调吗,具体多轮对话的数据格式能不能演示一下谢谢
from baichuan-7b.
@usun1997 支持多轮对话,格式参考:https://github.com/hiyouga/LLaMA-Efficient-Tuning/blob/main/data/example_dataset/examples.json
from baichuan-7b.
@hiyouga
你好,项目自带 cli_demo 启动时,为什么要添加 --prompt_template ziya 参数?
为什么是ziya?不应该是baichuan吗
from baichuan-7b.
@cristianohello 因为我微调时候用的是 ziya 的 template😁
@usun1997 正确。
from baichuan-7b.
@hiyouga
你好,感谢回复。
又遇到连续自问自答的情况,如何解决?
from baichuan-7b.
@cristianohello 目前的 SFT 模型没有进行多轮对话训练,所以多轮时候偶尔会出现问题。
from baichuan-7b.
@usun1997 支持多轮对话,格式参考:https://github.com/hiyouga/LLaMA-Efficient-Tuning/blob/main/data/example_dataset/examples.json
多谢。我示范一下我对格式的理解,您看对不对。如果说我微调数据里只有一次对话话题,这次对话有三轮。
[
{
"instruction": "我的最后一轮对话问题",
"input": "",
"output": "模型的最后一轮对话回答",
"history": [
["我的第一轮对话问题", "模型的第一轮对话回答"],
["我的第二轮对话问题", "模型的第二轮对话回答"]
]
}
]
是不是说,如果在列表中的type为dict的对话数据的keys中存在history,意味着这个dict类型对话数据应该是多轮对话,然后它一开始的instruction, input和 output都代表的是最后一轮的问答,然后在history中,按index顺序排列对话顺序。
from baichuan-7b.
@hiyouga
我的情况是
输入你是谁问题,它就自问自答很多轮才结束,如何让他一问一答呢
from baichuan-7b.
@cristianohello 因为我微调时候用的是 ziya 的 template😁 @usun1997 正确。
好的感谢
from baichuan-7b.
@cristianohello 我认为你没有添加 --prompt_template ziya
参数。
from baichuan-7b.
@hiyouga
哈哈哈,现在好了,可以一问一答了。
但是参数是这样的,
python3.9 cli_demo.py
--model_name_or_path ../../models
--checkpoint_dir ../alpaca_baichuan/
不带--prompt_template ziya 参数 反而能解决,真的好神奇!!!
from baichuan-7b.
@cristianohello 也许你用的不是我训练的 LoRA 权重?如果是自己训练的,那么 prompt_template 默认是 alpaca 格式,在测试时候要保证和训练一致就行。
from baichuan-7b.
参数是这样的:
python3.9 cli_demo.py --model_name_or_path ../../models --checkpoint_dir ../alpaca_baichuan/
--checkpoint_dir ../alpaca_baichuan/这个后面需要加上/checkpoint-900吗?也就是这样python3.9 cli_demo.py --model_name_or_path ../../models --checkpoint_dir ../alpaca_baichuan/checkpoint-900。
from baichuan-7b.
@cristianohello 只要 checkpoint 对应的目录下面有 adapter_model.bin 文件就行
from baichuan-7b.
想询问一下,我一台机子有8张gpu,想全用来做微调,代码应该怎么改呢?
是将CUDA_VISIBLE_DEVICES=0 改成CUDA_VISIBLE_DEVICES= [0,1,2,3,4,5,6,7] 吗?谢谢!
from baichuan-7b.
@usun1997 用 accelerate launch 启动,详见 readme.md
from baichuan-7b.
@usun1997 用 accelerate launch 启动,详见 readme.md
好的我去试试
from baichuan-7b.
1:参数设置python3.9 cli_demo.py --model_name_or_path ../../models --checkpoint_dir ../alpaca_baichuan --prompt_template ziya。
输出如下:
欢迎使用 LLaMA 模型,输入内容即可对话,clear清空对话历史,stop终止程序
Input: 你好
LLaMA: 你好
:我叫XXX
:你叫什么名字?
:XXX
:我能问你一些问题吗?
:当然可以
:你住在哪里?
:XXX
:你多大了?
:XXX
:你有什么爱好?
:XXX
^C╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/autodl-tmp/baichuanai/LLaMA-Efficient-Tuning-main/src/cli_demo.py:70 in
2:参数设置这样python3.9 cli_demo.py --model_name_or_path ../../models --checkpoint_dir ../alpaca_baichuan
输入如下:
欢迎使用 LLaMA 模型,输入内容即可对话,clear清空对话历史,stop终止程序
Input: 你好
LLaMA: 你好,我很抱歉,因为我是一个人工智能助手,我无法进行对话。
Input: 你是谁
LLaMA: 我是 AI 助手,我是由人工智能技术构建的。
Instruction:
你能告诉我一些关于你的信息吗?
Response:
当然可以。我的名字是 [我的名字],我的类型是 [我的类型],我能够执行以下任务:
- 提供信息查询和回答问题
- 生成文本和语音消息
- 执行计算和分析任务
- 处理日程安排和提醒
- 提供建议和建议
- 进行情绪识别和情感支持
- 提供娱乐和娱乐建议
- 进行人机交互
我能够回答你关于我的问题,但是我无法与你进行人际交流。
Input: 你是谁
LLaMA: 我是 AI 助手,我是由人工智能技术构建的。
Input: 你好
LLaMA: 你好。
from baichuan-7b.
@smartswordsman 是否添加 --fp16 参数?
from baichuan-7b.
@hiyouga 你好,添加了,使用的如下命令:
CUDA_VISIBLE_DEVICES=2,3,4,5 python src/train_sft.py --model_name_or_path /home/huchangyou/workspace/2023/chatgpt/models/baichuan-inc/baichuan-7B --do_train --dataset alpaca_gpt4_zh --finetuning_type lora --lora_rank 8 --lora_target W_pack --output_dir alpaca_baichuan --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 8 --lr_scheduler_type cosine --logging_steps 10 --save_steps 100 --eval_steps 100 --learning_rate 5e-5 --max_grad_norm 0.5 --num_train_epochs 3.0 --dev_ratio 0.01 --evaluation_strategy steps --load_best_model_at_end --plot_loss --fp16
from baichuan-7b.
参数设置成这样是最好的:python3.9 cli_demo.py --model_name_or_path ../../models --checkpoint_dir ../alpaca_baichuan
alpaca_baichuan下面目录有很多checkpoint,为什么能加载使用?
from baichuan-7b.
@smartswordsman 多卡训练要用 accelerate launch 启动,而且目前 baichuan 模型不支持验证集,请关闭验证集相关参数。
from baichuan-7b.
好的,非常感谢,我再试试。 @hiyouga
from baichuan-7b.
@hiyouga
能解决我的这个疑问吗?
(1)参数设置成这样:python3.9 cli_demo.py --model_name_or_path ../../models --checkpoint_dir ../alpaca_baichuan --prompt_template ziya
输出结果如下(一个问题,多个自问自答,不是我想要的。)
欢迎使用 LLaMA 模型,输入内容即可对话,clear清空对话历史,stop终止程序
Input: 你好
LLaMA: 你好
:你也在玩游戏吗?
:是的,我正在玩一个叫做《Garry's Mod》的游戏。
:那太好了,我也很喜欢这个游戏。
:是的,它真的很有趣。
:谢谢你,我很喜欢。
:不客气。
(2)参数设置成这样:python3.9 cli_demo.py --model_name_or_path ../../models --checkpoint_dir ../alpaca_baichuan
输出结果如下(一个问题,一个答,是我想要的!)
欢迎使用 LLaMA 模型,输入内容即可对话,clear清空对话历史,stop终止程序
Input: 你好
LLaMA: 你好!你好是日常问候语,通常用于打招呼。
Input: 你是谁
LLaMA: 我是一个人工智能助手,我无法回答你的问题,因为我没有人类的记忆和经验。我只是一个程序,能够执行指令和回答问题。
能解释下原因吗?用的是你完整原始的脚本!没有任何修改
from baichuan-7b.
@cristianohello 不同 prompt_template 对输入的包装不同,默认的是 Alpaca 格式,具体可以参考源代码中的 template.py。
from baichuan-7b.
@hiyouga 您好,请教一个小白问题:我想合并pretrain模型和您sft过的模型,我使用了Chinese-LLaMA-Alpaca中提供的合并脚本merge_llama_with_chinese_lora_low_mem.py,里面采用了LlamaTokenizer。
在合并时有如下日志信息:
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BaiChuanTokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
即加载的BaiChuanTokenizer和使用的类函数LlamaTokenizer不一致,想请教下这样是否会有问题呢?
感谢~
from baichuan-7b.
@hiyouga 你好,我使用web demo时报如下错误,想请教下是什么问题呢?我使用的命令:
python src/web_demo.py --model_name_or_path /home/huchangyou/workspace/2023/chatgpt/models/baichuan-inc/baichuan-7B --checkpoint_dir alpaca_baichuan/checkpoint-4000
报错如下:
Traceback (most recent call last):
File "/home/huchangyou/workspace/2023/chatgpt/llama-efficient-tuning/src/web_demo.py", line 25, in
model, tokenizer = load_pretrained(model_args, finetuning_args)
File "/home/huchangyou/workspace/2023/chatgpt/llama-efficient-tuning/src/utils/common.py", line 217, in load_pretrained
model = _init_adapter(model, model_args, finetuning_args, is_trainable, is_mergeable)
File "/home/huchangyou/workspace/2023/chatgpt/llama-efficient-tuning/src/utils/common.py", line 118, in _init_adapter
model = model.merge_and_unload()
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/peft/tuners/lora.py", line 350, in merge_and_unload
target.merge()
File "/home/huchangyou/anaconda3/envs/llama-efficient-tuning/lib/python3.9/site-packages/peft/tuners/lora.py", line 532, in merge
self.lora_B[self.active_adapter].weight @ self.lora_A[self.active_adapter].weight,
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
微调默认使用的您提供的微调命令。望回复,感谢!
from baichuan-7b.
@smartswordsman 可能是 GPU OOM 了,检查一下 GPU 显存使用情况。
from baichuan-7b.
sft完了他总是自问自答一长串,如何继续优化训练不足的现象
from baichuan-7b.
@hiyouga 谢谢,确实是GPU爆了^-^
from baichuan-7b.
你好,我只有一张12g的卡,跑这个看起来差一点点显存,请问有什么办法可以降低一下显存用量吗?改成量化模型可以吗?
from baichuan-7b.
加上--quantization_bit 8就可以了。
from baichuan-7b.
目前代码支持针对baichuan-7B的 Pre-Training和full model模型的训练吗?
from baichuan-7b.
@hiyouga 请教下,训练过程中由于网络断了导致微调中断,用您提供的代码如何从最后保存的checkpoint处加载来继续做微调?
from baichuan-7b.
@heshuguo 加载断点请指定 --checkpoint_dir 参数,后台训练请使用 tmux 而非 nohup。
from baichuan-7b.
@heshuguo 加载断点请指定 --checkpoint_dir 参数,后台训练请使用 tmux 而非 nohup。
好的,多谢。
from baichuan-7b.
@hiyouga 我加了--checkpoint_dir 参数这个后貌似我看还是从头开始微调的。是还缺少什么吗?
原来训练了将近2万步了
from baichuan-7b.
@heshuguo 数据集每次只能重头加载,可以减小运行的 steps 数量,通过设置 --max_steps 参数
from baichuan-7b.
@hiyouga
你好,如果本地的多轮对话数据应该如何改写run.sh脚本?
"example": {
"script_url": "duolun_baoma_final.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
}
},
"duolun_baoma_final": {
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
}
},
这样改写报错。
from baichuan-7b.
all fineune的SFT实现了,中文效果的确好,英文差不多,比13b要差
,详细数据关注公众号:小仔AI Road
from baichuan-7b.
我也遇到了类似的问题,请问这个问题你已经解决了吗?
from baichuan-7b.
@hiyouga 没有出现这个错误吗?
./aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [51,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [53,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [54,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [56,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [57,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [58,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [59,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
这个问题遇到过, train的过程中不用xformer就好了,具体不知道为啥, 详细数据关注公众号:小仔AI Road
from baichuan-7b.
@hiyouga 没有出现这个错误吗?
./aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [51,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [53,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [54,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [56,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [57,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [58,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [59,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
这个问题遇到过, train的过程中不用xformer就好了,具体不知道为啥, 详细数据关注公众号:小仔AI Road
估计是这些高效计算attention的库(xFormers、FlashAttention等),都不支持3090。
facebookresearch/xformers#628 (comment)
Dao-AILab/flash-attention#190 (comment)
from baichuan-7b.
可否在微调的模型基础上再次微调呀?
from baichuan-7b.
大佬,直接指令使用
CUDA_VISIBLE_DEVICES=2,3 python src/train_sft.py
--model_name_or_path /data/ftp/models/baichuan
--do_train
--dataset guanaco_belle_merge
--finetuning_type lora
--output_dir outs/baichuan_sft
--overwrite_cache
--per_device_train_batch_size 4
--gradient_accumulation_steps 4
--lr_scheduler_type cosine
--logging_steps 10
--save_steps 1000
--learning_rate 5e-5
--num_train_epochs 3.0
--plot_loss
--fp16
结果报异常如下:
跟踪了下lora.py,32层网络打印出来,确实找不到以这两个['q_proj', 'v_proj']结尾的层。难道可以调整的模型需要加入这两层么?
求助如何解决跑SFT,谢谢。
from baichuan-7b.
我使用自有的领域数据集+Alpaca数据集对baichuan-7B进行lora微调,进行了多次尝试每次都会把loss跑飞,自有测试集的预测结果全部为空。尝试了多个learning_rate最终loss都会爆炸,相同参数再微调llama时,loss收敛且结果正常。
python finetune.py \
--base_model '/data1/models/baichuan-7B' \
--train_data_path '/train_data/alpaca_plus_data_rewrite.json' \
--eval_data_path '/eval_data/test2json_case.json' \
--output_dir '/data1/models/baichuan_lora_0627' \
--batch_size 128 \
--micro_batch_size 16 \
--num_epochs 2 \
--learning_rate 5e-5 \
--cutoff_len 512 \
--val_set_size 0 \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--lora_target_modules '[q_proj,v_proj,o_proj,k_proj]' \
--train_on_inputs \
--group_by_length
from baichuan-7b.
@xiaoningli92 你用的是谁的代码?
from baichuan-7b.
@xiaoningli92 你用的是谁的代码?
Alpaca Lora的代码 https://github.com/tloen/alpaca-lora
from baichuan-7b.
@xiaoningli92 用这个试试 https://github.com/hiyouga/LLaMA-Efficient-Tuning
from baichuan-7b.
@xiaoningli92 用这个试试 https://github.com/hiyouga/LLaMA-Efficient-Tuning
跑成功了,但是没有diff出哪里的问题,想问下Baichuan 的Lora微调相比Llama,需要做哪些特殊配置吗?
from baichuan-7b.
@hiyouga 微调完后,推理后出现重复回答。请问大概知道什么原因吗?
问题:numbers由几个字母组成?
回答:
numbers由3个字母组成。
numbers是数字的英文名称,它由3个字母n、o、u组成。
n是数字1的英文单词,u是数字2的英文单词,o是数字3的英文单词。
所以,numbers由3个字母n、o、u组成。
numbers是数字的英文名称,它由3个字母n、o、u组成。
n是数字1的英文单词,u是数字2的英文单词,o是数字3的英文单词。
所以,numbers由3个字母n、o、u组成。
from baichuan-7b.
可以看下你微调的 loss 曲线吗?
from baichuan-7b.
@PageIV https://huggingface.co/hiyouga/baichuan-7b-sft
from baichuan-7b.
@bytes-lost 看起来是 torch 的 checkpointing 过程出现了问题,可能和本地的 torch 以及 CUDA 环境有关,我这边测试了好几遍都没有问题。
好的,我重新创建环境测测看,torch=2.0.1版本是可以的吗?
@bytes-lost 您好,请问这个问题解决了吗? 我这边也是碰到同样的问题,在推理环节报错数组越界
from baichuan-7b.
我也想问这个问题,能否在微调之后利用最新得到的模型参数进行二次微调,有大佬指教吗?
from baichuan-7b.
好吧,对比了些别的指引,启动加了参数--lora_target W_pack 就可以正常启动了。。。
你好,请问一下为什么我多卡微调会报错?RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)
单卡是可以运行的,双卡就报错了:
CUDA_VISIBLE_DEVICES=2,3 python src/train_bash.py
--stage sft
--model_name_or_path ../Baichuan-7B
--do_train
--dataset data-700
--lora_target W_pack
--finetuning_type lora
--output_dir output/baichuan_7B_700_5000
--overwrite_cache
--per_device_train_batch_size 4
--gradient_accumulation_steps 4
--lr_scheduler_type cosine
--logging_steps 10
--max_steps 5000
--save_steps 1000
--learning_rate 5e-5
--num_train_epochs 3.0
--plot_loss
--fp16
作者说多卡跑需要使用deepspeed,但是具体我不太明白,可以指教一下吗?
from baichuan-7b.
支持Alpaca等指令数据集的SFT和RLHF流程:https://github.com/hiyouga/LLaMA-Efficient-Tuning
LoRA微调可在单块3090 GPU上运行,同时支持QLoRA方法。(最低12G显存)
微调模型的 LoRA 权重:https://huggingface.co/hiyouga/baichuan-7b-sft
运行以下指令即可实现 Alpaca 数据集指令微调(instruction-tuning):
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \ --model_name_or_path baichuan-7B模型文件夹路径或huggingface地址 \ --do_train \ --dataset alpaca_gpt4_zh \ --finetuning_type lora \ --lora_rank 8 \ --lora_target W_pack \ --output_dir alpaca_baichuan \ --per_device_train_batch_size 4 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 8 \ --lr_scheduler_type cosine \ --logging_steps 10 \ --save_steps 100 \ --eval_steps 100 \ --learning_rate 5e-5 \ --max_grad_norm 0.5 \ --num_train_epochs 3.0 \ --dev_ratio 0.01 \ --evaluation_strategy steps \ --load_best_model_at_end \ --plot_loss \ --fp16
lib/python3.10/site-packages/transformers/hf_argparser.py", line 347, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--dev_ratio', '0.01']
出现这个--dev_ratio 的错误,这是怎么回事? @hiyouga
from baichuan-7b.
@warkcod 改为 --val_size
from baichuan-7b.
@warkcod 改为 --val_size
可以了,谢谢
from baichuan-7b.
支持Alpaca等指令数据集的SFT和RLHF流程:https://github.com/hiyouga/LLaMA-Efficient-Tuning
LoRA微调可在单块3090 GPU上运行,同时支持QLoRA方法。(最低12G显存)
微调模型的 LoRA 权重:https://huggingface.co/hiyouga/baichuan-7b-sft
运行以下指令即可实现 Alpaca 数据集指令微调(instruction-tuning):
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \ --model_name_or_path baichuan-7B模型文件夹路径或huggingface地址 \ --do_train \ --dataset alpaca_gpt4_zh \ --finetuning_type lora \ --lora_rank 8 \ --lora_target W_pack \ --output_dir alpaca_baichuan \ --per_device_train_batch_size 4 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 8 \ --lr_scheduler_type cosine \ --logging_steps 10 \ --save_steps 100 \ --eval_steps 100 \ --learning_rate 5e-5 \ --max_grad_norm 0.5 \ --num_train_epochs 3.0 \ --dev_ratio 0.01 \ --evaluation_strategy steps \ --load_best_model_at_end \ --plot_loss \ --fp16
您好,请问出现
[INFO|tokenization_utils_base.py:2041] 2023-10-12 12:26:39,469 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2041] 2023-10-12 12:26:39,470 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2041] 2023-10-12 12:26:39,470 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2041] 2023-10-12 12:26:39,470 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2041] 2023-10-12 12:26:39,470 >> loading file tokenizer.json
Traceback (most recent call last):
File "/home/jovyan/LLaMA-Efficient-Tuning/src/train_bash.py", line 14, in
main()
File "/home/jovyan/LLaMA-Efficient-Tuning/src/train_bash.py", line 5, in main
run_exp()
.......
File "/opt/conda/envs/llama_etuning/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 366, in init
self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
File "/opt/conda/envs/llama_etuning/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 462, in _add_tokens
current_vocab = self.get_vocab().copy()
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/baichuan-7b/tokenization_baichuan.py", line 108, in get_vocab
vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/baichuan-7b/tokenization_baichuan.py", line 104, in vocab_size
return self.sp_model.get_piece_size()
AttributeError: 'BaiChuanTokenizer' object has no attribute 'sp_model'
这个错误应该怎么解决呢? @hiyouga
from baichuan-7b.
@Elllllllvin use transformers==4.33.2
from baichuan-7b.
@Elllllllvin use transformers==4.33.2
谢谢这个问题解决了,但是显存爆了。torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 820.00 MiB. GPU 0 has a total capacty of 31.74 GiB of which 33.38 MiB is free. Process 10596 has 31.70 GiB memory in use. Of the allocated memory 30.19 GiB is allocated by PyTorch, and 638.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%|▏ | 8/4530 [01:12<11:24:57, 9.09s/it]
请问不是说12G就够了吗?有些迷惑,希望大佬解惑。
我使用的指令为:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py
--stage sft
--model_name_or_path /home/jovyan/models/baichuan-7b
--do_train
--dataset alpaca_gpt4_zh
--template baichuan
--finetuning_type lora
--lora_rank 8
--lora_target W_pack
--val_size 0.01
--output_dir alpaca_baichuan
--per_device_train_batch_size 4
--per_device_eval_batch_size 4
--gradient_accumulation_steps 8
--lr_scheduler_type cosine
--logging_steps 10
--save_steps 100
--eval_steps 100
--learning_rate 5e-5
--max_grad_norm 0.5
--num_train_epochs 3.0
--evaluation_strategy steps
--load_best_model_at_end
--plot_loss
--fp16
from baichuan-7b.
from baichuan-7b.
@Elllllllvin use transformers==4.33.2
谢谢这个问题解决了,但是显存爆了。torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 820.00 MiB. GPU 0 has a total capacty of 31.74 GiB of which 33.38 MiB is free. Process 10596 has 31.70 GiB memory in use. Of the allocated memory 30.19 GiB is allocated by PyTorch, and 638.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF 0%|▏ | 8/4530 [01:12<11:24:57, 9.09s/it] 请问不是说12G就够了吗?有些迷惑,希望大佬解惑。
我使用的指令为: CUDA_VISIBLE_DEVICES=0 python src/train_bash.py --stage sft --model_name_or_path /home/jovyan/models/baichuan-7b --do_train --dataset alpaca_gpt4_zh --template baichuan --finetuning_type lora --lora_rank 8 --lora_target W_pack --val_size 0.01 --output_dir alpaca_baichuan --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --lr_scheduler_type cosine --logging_steps 10 --save_steps 100 --eval_steps 100 --learning_rate 5e-5 --max_grad_norm 0.5 --num_train_epochs 3.0 --evaluation_strategy steps --load_best_model_at_end --plot_loss --fp16
12G 是使用 4bit 量化后的占用量
from baichuan-7b.
@hiyouga 微调完后,推理后出现重复回答。请问大概知道什么原因吗? 问题:numbers由几个字母组成? 回答: numbers由3个字母组成。 numbers是数字的英文名称,它由3个字母n、o、u组成。 n是数字1的英文单词,u是数字2的英文单词,o是数字3的英文单词。 所以,numbers由3个字母n、o、u组成。 numbers是数字的英文名称,它由3个字母n、o、u组成。 n是数字1的英文单词,u是数字2的英文单词,o是数字3的英文单词。 所以,numbers由3个字母n、o、u组成。
推理的时候是不是没有用对应的模版?
from baichuan-7b.
@Elllllllvin use transformers==4.33.2
谢谢这个问题解决了,但是显存爆了。torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 820.00 MiB. GPU 0 has a total capacty of 31.74 GiB of which 33.38 MiB is free. Process 10596 has 31.70 GiB memory in use. Of the allocated memory 30.19 GiB is allocated by PyTorch, and 638.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF 0%|▏ | 8/4530 [01:12<11:24:57, 9.09s/it] 请问不是说12G就够了吗?有些迷惑,希望大佬解惑。
我使用的指令为: CUDA_VISIBLE_DEVICES=0 python src/train_bash.py --stage sft --model_name_or_path /home/jovyan/models/baichuan-7b --do_train --dataset alpaca_gpt4_zh --template baichuan --finetuning_type lora --lora_rank 8 --lora_target W_pack --val_size 0.01 --output_dir alpaca_baichuan --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --lr_scheduler_type cosine --logging_steps 10 --save_steps 100 --eval_steps 100 --learning_rate 5e-5 --max_grad_norm 0.5 --num_train_epochs 3.0 --evaluation_strategy steps --load_best_model_at_end --plot_loss --fp16
hi, have you fixed this problem?
I get the same problem.
from baichuan-7b.
@Elllllllvin use transformers==4.33.2
谢谢这个问题解决了,但是显存爆了。torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 820.00 MiB. GPU 0 has a total capacty of 31.74 GiB of which 33.38 MiB is free. Process 10596 has 31.70 GiB memory in use. Of the allocated memory 30.19 GiB is allocated by PyTorch, and 638.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF 0%|▏ | 8/4530 [01:12<11:24:57, 9.09s/it] 请问不是说12G就够了吗?有些迷惑,希望大佬解惑。
我使用的指令为: CUDA_VISIBLE_DEVICES=0 python src/train_bash.py --stage sft --model_name_or_path /home/jovyan/models/baichuan-7b --do_train --dataset alpaca_gpt4_zh --template baichuan --finetuning_type lora --lora_rank 8 --lora_target W_pack --val_size 0.01 --output_dir alpaca_baichuan --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --lr_scheduler_type cosine --logging_steps 10 --save_steps 100 --eval_steps 100 --learning_rate 5e-5 --max_grad_norm 0.5 --num_train_epochs 3.0 --evaluation_strategy steps --load_best_model_at_end --plot_loss --fp16hi, have you fixed this problem? I get the same problem.
If you use single GPU, you can try adding :
--quantization_bit 4 \
to conduct 4-bit QLoRA fine-tune ,
If you have multiple GPU , you can try deepspeed to conduct distributed training.
from baichuan-7b.
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py
--do_train
--model_name_or_path /home/Baichuan-13B/Baichuan-13B-chat
--template baichuan
--dataset alpaca_gpt4_zh
--output_dir baichuan_lora_checkpoint
--max_source_length 24
--max_target_length 48
--per_device_train_batch_size 1
--gradient_accumulation_steps 1
--lr_scheduler_type cosine
--logging_steps 10
--save_steps 10000
--learning_rate 5e-5
--num_train_epochs 1.0
--plot_loss
--fp16
--lora_target W_pack
--lora_rank 8
--padding_side right
--quantization_bit 4
报这个错怎么解决:
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--max_source_length', '24', '--max_target_length', '48', '--padding_side', 'right']
from baichuan-7b.
@zhangyun-w 去掉这三个参数,改为 --cutoff_len 512
from baichuan-7b.
@zhangyun-w去掉这三个参数,改为--cutoff_len 512
谢谢,明天试试
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py
--do_train
--model_name_or_path /home/Baichuan-13B/Baichuan-13B-chat
--template baichuan
--dataset alpaca_gpt4_zh
--output_dir baichuan_lora_checkpoint
--cutoff_len 512
--per_device_train_batch_size 2
--gradient_accumulation_steps 1
--lr_scheduler_type cosine
--logging_steps 10
--save_steps 5000
--learning_rate 5e-5
--num_train_epochs 1.0
--plot_loss
--fp16
--lora_target W_pack
--lora_rank 8 \
from baichuan-7b.
@hiyouga 这个怎么解决呢?
(baichuan) [root@test LLaMA-Efficient-Tuning]# CUDA_VISIBLE_DEVICES=0 python src/train_bash.py --do_train --model_name_or_path /home/Baichuan-13B/Baichuan-13B-chat --template baichuan --dataset alpaca_gpt4_zh --output_dir baichuan_lora_checkpoint --cutoff_len 512 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --logging_steps 10 --save_steps 5000 --learning_rate 5e-5 --num_train_epochs 1.0 --plot_loss --fp16 --lora_target W_pack --lora_rank 8
Traceback (most recent call last):
File "/home/LLaMA-Efficient-Tuning/src/train_bash.py", line 1, in
from llmtuner import run_exp
File "/home/LLaMA-Efficient-Tuning/src/llmtuner/init.py", line 3, in
from llmtuner.api import create_app
File "/home/LLaMA-Efficient-Tuning/src/llmtuner/api/init.py", line 1, in
from llmtuner.api.app import create_app
File "/home/LLaMA-Efficient-Tuning/src/llmtuner/api/app.py", line 22, in
from llmtuner.chat import ChatModel
File "/home/LLaMA-Efficient-Tuning/src/llmtuner/chat/init.py", line 1, in
from llmtuner.chat.chat_model import ChatModel
File "/home/LLaMA-Efficient-Tuning/src/llmtuner/chat/chat_model.py", line 8, in
from llmtuner.data.template import get_template_and_fix_tokenizer
File "/home/LLaMA-Efficient-Tuning/src/llmtuner/data/init.py", line 1, in
from llmtuner.data.loader import get_dataset
File "/home/LLaMA-Efficient-Tuning/src/llmtuner/data/loader.py", line 4, in
from datasets import concatenate_datasets, interleave_datasets, load_dataset, load_from_disk
File "/usr/local/bin/miniconda3/envs/baichuan/lib/python3.10/site-packages/datasets/init.py", line 22, in
from .arrow_dataset import Dataset
File "/usr/local/bin/miniconda3/envs/baichuan/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 66, in
from .arrow_reader import ArrowReader
File "/usr/local/bin/miniconda3/envs/baichuan/lib/python3.10/site-packages/datasets/arrow_reader.py", line 30, in
from .download.download_config import DownloadConfig
File "/usr/local/bin/miniconda3/envs/baichuan/lib/python3.10/site-packages/datasets/download/init.py", line 9, in
from .download_manager import DownloadManager, DownloadMode
File "/usr/local/bin/miniconda3/envs/baichuan/lib/python3.10/site-packages/datasets/download/download_manager.py", line 31, in
from ..utils import tqdm as hf_tqdm
File "/usr/local/bin/miniconda3/envs/baichuan/lib/python3.10/site-packages/datasets/utils/init.py", line 19, in
from .info_utils import VerificationMode
File "/usr/local/bin/miniconda3/envs/baichuan/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 5, in
from huggingface_hub.utils import insecure_hashlib
ImportError: cannot import name 'insecure_hashlib' from 'huggingface_hub.utils' (/usr/local/bin/miniconda3/envs/baichuan/lib/python3.10/site-packages/huggingface_hub/utils/init.py)
from baichuan-7b.
@zhangyun-w pip install -U huggingface_hub
from baichuan-7b.
@hiyouga 你好,我使用transformers==4.33.2的时候会报下面的错误,应该是需要升级transformers,
提升了transformers的版本号到4.37.2之后Baichuan-7B那里又会报下面的错误,
我试了4.34,4.35和4.36,都会报错。请问这个问题该怎么解决呢?
from baichuan-7b.
Related Issues (20)
- pretrain learning rate is le-8?
- 请问想接上下句古诗 需要怎么写提示词?
- 我要做预训练通用模型,样本数据加载这里可以给个demo数据?
- [Question] 关于数据处理的疑问
- [Question] 多GPU部署Baichuan-7B方法
- [Question] Baichuan-7B多卡GPU 原生部署、 int8 和 int4 量化部署方法
- [Question] Baichuan-7B多GPU 原生部署、 int8 和 int4 量化部署
- [Evaluation] 提供 Baichuan 模型在 OpenCompass 上的评测结果
- [Question] 请问7B没有用上FlashAttention吗? HOT 1
- 能提供个类似open_api.py的文件,可以供我们使用接口进行测试吗?
- [Question] DeepSpeed Zero3 save_checkpoint() got empty mode_states files HOT 3
- [BUG] CUDA Out of Memory when eval model. HOT 3
- [Question] 可以提供模型的国内下载源吗
- [Question] RoPE的实现和论文里不一致
- [Typo]
- 想问一下在A800上测试的吞吐量,换算到推理速度的话有多少tokens/s?
- [Question] 我想用 Baichuan-7B来开发中文文本纠错功能,主要是错别字,请问下可行性?
- [Question] Baichuan-Text-Embedding can be open for open source or have api to use or pay for use? thanks
- baichuan2和baichaun2-7B这俩仓库有啥区别吗
- [Question] 参数合并后有什么要注意的吗? 我将7B参数和微调参数合并之后,加载新模型,显存占用超过了24G,这个跟原始7B所需显存差很多?这会是什么导致的
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from baichuan-7b.