
baby-llama2-chinese's Introduction

Hi there 👋

About Me.

  • 🌴 I'm now an AI algorithm engineer at Alibaba Group.
  • 🌱 I graduated from Xi'an Jiaotong University with a master's degree.
  • ⚡ I am a data science competition enthusiast.
  • 🐝 I'm currently very interested in large language models. I have experience pre-training LLMs at the tens-of-billions scale.
  • 📫 WeChat: qq2257164884, QQ🐧: 2257164884.
  • 🍀 6 top-10 finishes in Ali-Tianchi competitions and 7 silver medals on Kaggle.


baby-llama2-chinese's People

Contributors

billvsme, dllxw, jh01231230, jianhu-chen, zglxjtu



baby-llama2-chinese's Issues

Multi-node, multi-GPU pretrain

Can this repo be trained with distributed training across multiple nodes, each with multiple GPUs? I used 4 nodes with two GPUs per node, but only one node did any work; the GPUs on the other nodes stayed idle.

Thanks!
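
For reference, a hedged sketch of how such a run is usually launched and initialized. It assumes the script reads the standard torchrun environment variables (as llama2.c-style pretrain scripts typically do); when only one node makes progress, a common cause is that the other nodes were started without matching --nnodes / --node_rank / --master_addr arguments, so they never join the same process group. The command lines and variable names below are illustrative, not taken from this repo.

import os
import torch
import torch.distributed as dist

# Launch the same command on every node (illustrative values):
#   node 0:  torchrun --nnodes=4 --nproc_per_node=2 --node_rank=0 \
#            --master_addr=<node0_ip> --master_port=29500 pretrain.py
#   node k:  identical, except --node_rank=k
dist.init_process_group(backend="nccl")    # reads RANK / WORLD_SIZE / MASTER_* from the environment

rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])
torch.cuda.set_device(local_rank)
print(f"rank {rank}/{world_size} -> cuda:{local_rank}")   # a 4x2 job should print 8 distinct ranks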

Cannot find medical_qa_144w

I was able to find the other pretraining and fine-tuning datasets, but not "medical_qa_144w.csv". Where can this part of the data be obtained?

Question about the checkpoint used for SFT

Hi, in the SFT stage the code loads "model.load_state_dict(torch.load('./out/baike_pretrain/epoch_0.pth'))", i.e. the epoch-0 checkpoint, but pretraining runs for 2 epochs. Why start SFT from the first epoch's checkpoint rather than the second?

/track1/train_valid.json

This code is great, thanks.

Could you share /track1/train_valid.json, or at least its exact format? I'd like to run the whole pipeline.

Computing the model's parameter count

max_seq_len = 512
dim = 512
n_layers = 8
n_heads = 8

If I change these values, how do I compute the resulting parameter count? Is there a formula?
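
A rough answer: max_seq_len adds no parameters (rotary position embeddings have no learned weights), so only the vocabulary size, dim, n_layers and the FFN width matter. Below is a hedged estimation sketch under llama2.c-style assumptions that are not taken verbatim from this repo's model.py: multiple_of = 32 for the SwiGLU hidden size, n_kv_heads equal to n_heads, and an output head tied to the token embedding.

def estimate_params(vocab_size=64793, dim=512, n_layers=8, n_heads=8, multiple_of=32):
    # SwiGLU FFN hidden size, rounded up to a multiple of `multiple_of` (llama2.c convention)
    hidden = int(2 * (4 * dim) / 3)
    hidden = multiple_of * ((hidden + multiple_of - 1) // multiple_of)

    embed = vocab_size * dim                   # token embedding; output head assumed tied to it
    attn = 4 * dim * dim                       # wq, wk, wv, wo (n_heads only changes the head split)
    ffn = 3 * dim * hidden                     # w1, w2, w3
    norms = 2 * dim                            # two RMSNorm weights per layer
    per_layer = attn + ffn + norms
    return embed + n_layers * per_layer + dim  # plus the final RMSNorm

print(f"{estimate_params():,}")                # ~58.5M for the 512 / 8 layers / 8 heads config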

This file cannot be found

Traceback (most recent call last):
File "/share/home/xiongx/px/baby-llama2-chinese-main/data_process.py", line 238, in
with open(data_path,'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/baidubaike_563w_1.bin'

Summary of a few issues

Very interesting work; as an LLM beginner I learned a lot, and the overall pipeline is very clear. However, after running it for two days I did not get good results, so I'm summarizing the problems I hit here, and anyone who solves them later is welcome to discuss. (Environment: A100-40G, CUDA 11.6, PyTorch 2.0; in practice I slightly enlarged the model in the code.)

  1. Flash Attention. It is only available with PyTorch >= 2.0; in my tests it is close to twice as fast as vanilla attention and uses somewhat less memory, so use it if you can. With PyTorch 2.0, I verified that the official CUDA 11.7 build also installs and works fine on a CUDA 11.6 system.
  2. Loss going NaN. During pretraining with float16, NaNs appear after a few thousand steps; lowering the learning rate and similar tweaks did not help. One reply said an older PyTorch version avoids it.
  3. Loss increasing. Other forums suggested bfloat16 is more stable, but in practice the loss dropped briefly and then kept climbing, and changing the learning rate did not help. I then tried float32 and saw the same drop-then-rise pattern regardless of learning rate, which is odd, so I have given up for now. (A bfloat16 autocast sketch follows this list.)
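
For what it's worth, a minimal sketch of switching the autocast dtype to bfloat16, assuming the training loop uses torch.cuda.amp.autocast plus a GradScaler the way pretrain.py-style loops usually do. The variable names (ctx, scaler, compute_loss) are illustrative, not the repo's exact ones.

import torch

# Pick bfloat16 when the GPU supports it; fall back to float16 otherwise.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16

ctx = torch.cuda.amp.autocast(dtype=dtype)
# Loss scaling is only needed for float16; bfloat16 has float32's exponent range.
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == torch.float16))

# Inside the training loop (illustrative, not the repo's exact loop):
#     with ctx:
#         loss = compute_loss(model, X, Y)
#     scaler.scale(loss).backward()
#     scaler.step(optimizer)
#     scaler.update()
#     optimizer.zero_grad(set_to_none=True)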

Differences in the upstream data preprocessing

In other llama2 pretraining pipelines, the encoded data is used directly as the labels:
data["labels"] = data["input_ids"].clone()
In this repo the preprocessing happens inside PretokDataset:
x = chunk[:-1]
y = chunk[1:]
yield x, y
Is there any difference between the two approaches?
And what would change if PretokDataset were modified to the following? (A comparison sketch follows this issue.)
yield chunk, chunk
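
A short comparison, under the assumption that this repo's model computes the loss directly between logits[t] and y[t] (i.e. it does not shift internally), whereas Hugging-Face-style causal LM heads shift logits against labels inside the model:

import torch

tokens = torch.tensor([10, 11, 12, 13, 14])   # one pretokenized chunk (made-up ids)

# Style A (this repo's PretokDataset): the dataset does the shift, so position t of x
# is trained to predict y[t], which is the next token of the original sequence.
x, y = tokens[:-1], tokens[1:]
print(x.tolist(), y.tolist())                 # [10, 11, 12, 13] [11, 12, 13, 14]

# Style B (HF-style pipelines): labels = input_ids.clone(), and the *model* shifts
# logits[..., :-1, :] against labels[..., 1:] inside its loss, so the effective target
# is still the next token. End to end the two styles train the same objective.

# Yielding (chunk, chunk) into a model that does NOT shift internally would instead
# teach it to copy the current token, which is a different (and useless) objective.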

How many epochs are needed before the results look reasonable?

Many thanks for sharing this!
I tried the wiki dataset and trained for 3 epochs, then ran eval on examples taken from the training data, and the results were not great. Roughly how many epochs are needed before the output becomes usable?
Also, I'm temporarily using a low-end GPU with batch_size=4. Is that too small, and could it be the cause of the poor results?

prompt: Introduce Zhaotong Airport
answer: Bourth (Deux-Sèvres)
Bourth (Bourth, bʁe) is a commune in the Deux-Sèvres department of France, belonging to the Pa. Saint-Laurent district.
Geography
(P625) area P2046, located in the Deux-Sèvres department of the Nouvelle-Aquitaine region of France; the department lies on France's western coast, bordered by Maine-et-Loire to the north, Vendée to the west, and to the south

Attention!! A fatal typo in the inference code is why everyone was seeing poor results. Please take note!

In the version of eval.py I originally committed, I made a typo in the line that builds the prompt token ids, which made the inference results look very bad. This was a fatal mistake; it sat there for months and I only found it today. Seeing so many people in the issues saying the results were poor, I kept assuming their training was insufficient, but it was actually a typo bug in the inference code. Apply the fix below and you should see much better results.

 x=tokenizer.encode(prompt,add_special_tokens=False)+[tokenizer.special_tokens['<bos>']] # corrected code

Of course, I imagine anyone who thought about it a bit had already spotted this, but beginners could easily have missed it.

Dataset question

Hi, this is my first attempt at training a model. Where do I download the datasets, and where should I put them?

eos token decodes to an empty string

While reading the data-processing script I noticed the text contains no eos token. Experimenting, I found the eos token id is 2, but tokenizer.decode([2]) returns an empty string. I'm not sure what is going on.
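
One possible explanation, hedged because it depends on the ChatGLM tokenizer internals: sentencepiece-based tokenizers usually drop control/special pieces when decoding, so a bare special-token id can decode to an empty string even though the id itself is valid. A quick way to inspect it, assuming tokenizer is the same ChatGLMTokenizer instance eval.py constructs and that its special_tokens dict has an '<eos>' entry alongside the '<bos>' entry used there:

# Assumes `tokenizer` is already constructed as in eval.py; '<eos>' key is an assumption.
print(tokenizer.special_tokens)                      # mapping of special token strings to ids
eos_id = tokenizer.special_tokens['<eos>']
print(eos_id, repr(tokenizer.decode([eos_id])))      # may be '' if decode drops special pieces
print(repr(tokenizer.decode([eos_id, 100, 200])))    # compare against a mix with ordinary ids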

Question about training data and epoch count

Hi, I trained pretrain and sft for 8 epochs each, but the output is still gibberish. I used baidu + wiki + the medical train set as the pretraining data without modification, and the SFT dataset unchanged as well. All other hyperparameters are the defaults, and I did not use sft_to_pretrain. How did you train on your side?

sft dataset

Which dataset is used for SFT? Is there a download link?

Error when running pretraining

I get an error when running pretrain.py.

I suspect the machine is missing some environment variables.

It happens around lines 212-217 of pretrain.py.

Could the author document the required environment variables? (A sketch of the usual defaults follows.)

Thanks.
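
For reference, DDP setups in llama2.c-style training scripts usually read the environment variables that torchrun exports. A hedged sketch of defaulting them so the script also runs on a single GPU without torchrun; the names are the standard torch.distributed ones, not necessarily the exact ones pretrain.py reads:

import os

# The standard variables torchrun exports; defaulted here so a plain
# `python pretrain.py` single-GPU run also works.
os.environ.setdefault("RANK", "0")           # global rank of this process
os.environ.setdefault("LOCAL_RANK", "0")     # rank within the node (selects the GPU)
os.environ.setdefault("WORLD_SIZE", "1")     # total number of processes
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

ddp = int(os.environ["WORLD_SIZE"]) > 1      # only initialize DDP when actually distributed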

Inference errors out without SFT; please take a look

Traceback (most recent call last):
File "/home/hope/work/baby-llama2-chinese/eval_hope.py", line 67, in
model.load_state_dict(state_dict, strict=False)
File "/home/hope/miniconda3/envs/llama2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Transformer:
size mismatch for tok_embeddings.weight: copying a param with shape torch.Size([64793, 1024]) from checkpoint, the shape in current model is torch.Size([64793, 512]).

Question about tokenizer

Why can ChatGLM's tokenizer be used directly to train a Llama model? My understanding is that Llama's embeddings may differ from ChatGLM's, and the model may never have seen similar embeddings or tokens. What is the rationale for borrowing the ChatGLM tokenizer here?
Thanks

Has anyone else hit NaN loss during pretraining?

[2023-08-30 16:04:47,404][pretrain.py][INFO] Epoch:0/2 loss:11.271 lr:0.0000000 epoch_Time:137483.0min:
[2023-08-30 16:08:27,427][pretrain.py][INFO] Epoch:0/2 loss:6.268 lr:0.0001000 epoch_Time:1208.0min:
[2023-08-30 16:12:01,041][pretrain.py][INFO] Epoch:0/2 loss:5.627 lr:0.0001000 epoch_Time:1121.0min:
[2023-08-30 16:15:35,618][pretrain.py][INFO] Epoch:0/2 loss:4.548 lr:0.0000999 epoch_Time:1091.0min:
[2023-08-30 16:19:08,321][pretrain.py][INFO] Epoch:0/2 loss:4.591 lr:0.0000997 epoch_Time:1072.0min:
[2023-08-30 16:22:43,731][pretrain.py][INFO] Epoch:0/2 loss:4.309 lr:0.0000994 epoch_Time:1062.0min:
[2023-08-30 16:26:16,924][pretrain.py][INFO] Epoch:0/2 loss:4.294 lr:0.0000991 epoch_Time:1053.0min:
[2023-08-30 16:29:49,699][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000987 epoch_Time:1044.0min:
[2023-08-30 16:33:33,730][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000983 epoch_Time:1043.0min:
[2023-08-30 16:37:10,391][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000977 epoch_Time:1039.0min:
[2023-08-30 16:40:49,196][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000971 epoch_Time:1035.0min:
[2023-08-30 16:44:29,060][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000965 epoch_Time:1031.0min:
[2023-08-30 16:48:10,314][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000958 epoch_Time:1029.0min:
[2023-08-30 16:51:50,553][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000950 epoch_Time:1025.0min:
[2023-08-30 16:55:41,688][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000941 epoch_Time:1025.0min:
[2023-08-30 16:59:56,754][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000932 epoch_Time:1033.0min:
[2023-08-30 17:04:02,156][pretrain.py][INFO] Epoch:0/2 loss:nan lr:0.0000922 epoch_Time:1036.0min:
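
The pattern above (loss fine for a few thousand steps, then NaN under float16) is consistent with an overflow somewhere in the forward or backward pass. A hedged sketch of common defensive measures, assuming a GradScaler-based loop like the one in pretrain.py-style scripts; switching to bfloat16, as discussed in an issue above, is the other common fix.

import torch

# model, optimizer, scaler and loss come from the surrounding training loop.
if not torch.isfinite(loss):
    optimizer.zero_grad(set_to_none=True)           # skip the step instead of poisoning weights
else:
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                      # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                          # GradScaler itself skips steps with inf/nan grads
    scaler.update()
    optimizer.zero_grad(set_to_none=True)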

MemoryError while processing the 563w Baidu Baike data

Traceback (most recent call last):
File "C:\Users\zhou\Desktop\baby_llm\data_process.py", line 145, in
process_baidu()
File "C:\Users\zhou\Desktop\baby_llm\data_process.py", line 126, in process_baidu
doc_ids+=text_id
MemoryError
How can this be resolved? Has anyone else run into it?
Sorry for the trouble, and thanks.
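
The traceback shows that process_baidu accumulates every token id of the corpus in one in-memory list (doc_ids) before writing. A hedged workaround sketch: flush the buffer to the output .bin file in chunks so peak memory stays bounded. The per-record formatting is represented by a placeholder build_text function, since the exact JSON schema isn't reproduced in this issue; the uint16 dtype is an assumption that holds whenever the vocabulary is smaller than 65536.

import json
import numpy as np

def process_baidu_streaming(src_path, dst_path, tokenizer, flush_every=5_000_000):
    """Tokenize the corpus line by line, flushing token ids to disk in chunks."""
    buffer = []
    with open(src_path, "r", encoding="utf-8") as fin, open(dst_path, "wb") as fout:
        for line in fin:
            record = json.loads(line)
            text = build_text(record)        # placeholder for data_process.py's per-record formatting
            buffer += tokenizer.encode(text, add_special_tokens=False)
            if len(buffer) >= flush_every:   # bound peak memory instead of growing doc_ids forever
                np.array(buffer, dtype=np.uint16).tofile(fout)   # uint16 assumed: vocab < 65536
                buffer = []
        if buffer:
            np.array(buffer, dtype=np.uint16).tofile(fout)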

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Traceback (most recent call last):
File "/home/afan/worksapce/train/baby-llama2-chinese/infer.py", line 91, in
generated_tokens = model.generate(input_tokens, num_samples, max_new_tokens, temperature=temperature, top_k=top_k)
File "/home/afan/anaconda3/envs/baby/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/afan/worksapce/train/baby-llama2-chinese/model.py", line 341, in generate
idx_next = torch.multinomial(probs, num_samples=1)
RuntimeError: probability tensor contains either inf, nan or element < 0

At this point in model.py:
logits = self(idx_cond)
print(logits) outputs:
tensor([[[nan, nan, nan, ..., nan, nan, nan]]], device='cuda:1',dtype=torch.float16)

Could someone help figure out what is causing this?
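
A hedged diagnostic: if the float16 pretraining run itself hit NaN losses (as in the issue above), the saved checkpoint may already contain non-finite weights, and every forward pass will then produce NaN logits regardless of the sampling code. A quick check over the checkpoint, with the path written as a placeholder:

import torch

state_dict = torch.load("./out/pretrain/epoch_x.pth", map_location="cpu")   # placeholder path
bad = [k for k, v in state_dict.items()
       if torch.is_floating_point(v) and not torch.isfinite(v).all()]
print("tensors containing inf/nan:", bad if bad else "none")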

Baidu Cloud is painfully slow

Strongly suggest switching the share from Baidu Cloud to Aliyun; Baidu Cloud's speed throttling is brutal. It's painful.

Could you share some evaluation results?

Many of the evaluation examples actually appear in the SFT data, which misled me into thinking the model has very fluent question-answering ability.

For production this isn't really a problem; it means that for the QA pairs production actually needs, the model can already answer fluently. I just didn't expect a 50M model to be usable, when the 7B models I normally use feel hopelessly dumb.

sft.py crashes with CUDA out of memory; how do I fix it?

Run log:

(llama2) [llama2-chinese]$ python sft.py 
tokens per iteration will be: 16,384
breaks down as: 1 grad accum steps * 1 processes * 32 batch size * 512 max seq len
                                                   prompt                                             answer
757309  选择以下列表中的一个数学公式并解释它,“a² + b² = c²”、“y = mx + b”...  \n“a² + b² = c²” 表示勾股定理,用于计算直角三角形的斜边长度。\n“y = ...
31228            给出一句话,用另一种语言(如法语、德语等)重新表达。\n生命中最重要的事情是什么  Quelle est la chose la plus importante dans la...
227106  描述如何制作一杯拿铁咖啡,包括所需材料和步骤。 \n所需材料: \n- 2盎司浓缩咖啡 \n...  步骤:\n1. 准备好所需材料。\n2. 在咖啡杯中倒入2盎司的浓缩咖啡。\n3. 在另一个...
53255   提供两个类别,例如“A”和“B”,该为一组数据点分配这两个类别之一,并给出理由。\n类别1:...  数据点1属于产品设计类别,因为它涉及产品的安全和设计方面,需要重新设计产品形状以减少意外伤害...
752602                               提供一份食谱\n煎虾饼需要哪些材料?\n              煎虾饼的材料通常包括虾仁、豆腐、鸡蛋、淀粉、调味品(盐、胡椒粉、姜末等)。
...                                                   ...                                                ...
303642  给定一段文本,编写一个python函数,计算其中单词的数量。\n“编程是一项非常有趣的技能,...  以下是一个计算文本中单词数量的Python函数:\n```\ndef count_words...
560061  给定一段格式混乱的文本,请将其按照规定的格式进行排版,并输出排版后的结果。\n标题: 世界闻...  标题:世界闻名的科学家\n文本:爱因斯坦、牛顿和霍金都是伟大的科学家,他们所做出的贡献推动了...
642915  给定一段文本,请问其中出现最多的单词是什么?\n文本: 散步是我最喜欢的活动之一。我发现它可...                                       出现最多的单词是“我”。
227969  根据给定的文本情感,提供情感分析结果和可信度得分。\n文本:"我喜欢这个电影,演员表现得非常...                                 情感分析结果:积极\n可信度得分:高
45020   为下列一段文本生成一个简洁的标题。\n文本: 这个夏天,因为天气炎热和各种植物的成长,在我们...                                         夏日花园里的多彩花朵

[802899 rows x 2 columns]
Initializing a new model from scratch
WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0
(the warning above appears 12 times in the original log, once per transformer layer in this 12-layer run)
num decayed parameter tensors: 85, with 218,129,408 parameters
num non-decayed parameter tensors: 25, with 25,600 parameters
using fused AdamW: False
/home/qxj/conda/envs/llama2/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
[2023-09-04 10:49:52,275][sft.py][INFO] Epoch:[0/2](0/25091) loss:2.822 lr:0.0000000 epoch_Time:759.0min:
Traceback (most recent call last):
  File "sft.py", line 323, in <module>
    train_epoch(epoch)
  File "sft.py", line 75, in train_epoch
    scaler.scale(loss).backward()
  File "/home/qxj/conda/envs/llama2/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/qxj/conda/envs/llama2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 3.95 GiB (GPU 0; 39.59 GiB total capacity; 33.25 GiB already allocated; 2.56 GiB free; 35.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
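
The log shows 32 sequences of 512 tokens per step on a ~218M-parameter model on a 40 GiB card, so the usual fixes apply: lower the per-step batch size and compensate with gradient accumulation, and optionally cap the allocator's split size as the error message itself suggests. A hedged sketch; the names batch_size and gradient_accumulation_steps match the "tokens per iteration" breakdown in the log, but the exact config names in sft.py may differ.

import os

# Must be set before the first CUDA allocation to have any effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

# Keep the effective batch constant while cutting peak activation memory:
# 32 sequences * 1 accumulation step  ->  8 sequences * 4 accumulation steps
batch_size = 8
gradient_accumulation_steps = 4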

Error while processing the Baidu dataset

At around record 473,000 the process gets killed by the system, and I don't know why. The machine has 128 GB of RAM and a 48 GB GPU, but GPU memory shouldn't even be in use at this stage, right?

The latest transformers version raises an error

With transformers==4.35.2 the code fails with: AttributeError: 'ChatGLMTokenizer' object has no attribute 'tokenizer'.
requirements.txt needs to pin transformers==4.33.2.

Handing in my homework

I long,
like Li Bai,
to feel joy, delight, and calm.
I want to go home,
because my home is in a picturesque place.
My mood,
like Li Bai,
makes me feel cheerful
and also makes me feel joy.
I look forward to
my future

32K context length

If the context length is extended to 32K, is changing the parameter enough, or will that cause problems?

Handing in my homework

  • Handing in my homework
    The trained model's results are, shall we say, touching. Some examples below; I wonder whether the author's model behaves the same way.
---------------
[prompt]: What surgical treatments are there for spontaneous supratentorial intracerebral hemorrhage?
[answer]:  What imaging examinations are there for acute intracerebral hematoma? contrast-enhanced scan; head CT
---------------

---------------
[prompt]: Please describe the history of oral mucosal absorption
[answer]:  What can pulmonary hypertension turn into? coronary heart disease
---------------

Why are parameters with dim >= 2 weight-decayed in the optimizer configuration, while parameters with dim < 2 are not?

In the optimizer-configuration code, why do parameters of dimension 2 or higher get weight decay while the lower-dimensional ones do not?

create optim groups. Any parameters that is 2D will be weight decayed, otherwise no.

i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.

decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]    # weight matrices and embeddings
nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]   # biases and RMSNorm/LayerNorm scales
optim_groups = [
    {'params': decay_params, 'weight_decay': weight_decay},
    {'params': nodecay_params, 'weight_decay': 0.0},
]

Why does the '_orig_mod' prefix need to be handled where pretrain.py compiles the model (line 309)?

Running sft.py fails because the state_dict keys cannot be matched:

Initializing a new model from scratch
Traceback (most recent call last):
  File "sft.py", line 295, in <module>
    model.load_state_dict(torch.load('./out/20230915_baike_pretrain/epoch_0.pth'))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Transformer:
        Missing key(s) in state_dict: "tok_embeddings.weight", "layers.0.attention.wq.weight", "layers.0.attention.wk.weight", "layers.0.attention.wv.weight"...

After initializing the model at line 292 of sft.py, stripping the prefix with the same code used in the resume branch resolves the error. (torch.compile wraps the original module, so a checkpoint saved from a compiled model has every key prefixed with "_orig_mod."; loading it into an uncompiled model therefore requires stripping that prefix.)

model = init_model()
pretrain_state_dict = torch.load('./out/baike_pretrain/epoch_0.pth')
# strip the prefix torch.compile adds so the keys match the uncompiled model
unwanted_prefix = "_orig_mod."
for k, v in list(pretrain_state_dict.items()):
    if k.startswith(unwanted_prefix):
        pretrain_state_dict[k[len(unwanted_prefix):]] = pretrain_state_dict.pop(k)
model.load_state_dict(pretrain_state_dict)

A very strange error: IndexError: index 35930 is out of bounds for axis 1 with size 2048

(_) user@calculator:~/Player/baby-llama2-chinese$ python3 __pretrain.py 
tokens per iteration will be: 2,048
breaks down as: 1 grad accum steps * 1 processes * 1 batch size * 2048 max seq len
memmap:True train data.shape:(702015, 2048)
downloading finished.....
Initializing a new model from scratch
num decayed parameter tensors: 85, with 2,746,744,832 parameters
num non-decayed parameter tensors: 25, with 102,400 parameters
using fused AdamW: True
Traceback (most recent call last):
  File "/home/user/Player/baby-llama2-chinese/__pretrain.py", line 317, in <module>
    train_epoch(epoch)
  File "/home/user/Player/baby-llama2-chinese/__pretrain.py", line 51, in train_epoch
    for step, (X, Y) in enumerate(train_loader):
  File "/home/user/anaconda3/envs/_/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/user/anaconda3/envs/_/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/user/anaconda3/envs/_/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/user/anaconda3/envs/_/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/user/anaconda3/envs/_/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/user/anaconda3/envs/_/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/user/anaconda3/envs/_/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/user/Player/baby-llama2-chinese/dataset.py", line 36, in __getitem__
    sample = self.data[index]
  File "/home/user/anaconda3/envs/_/lib/python3.10/site-packages/numpy/core/memmap.py", line 334, in __getitem__
    res = super().__getitem__(index)
IndexError: index 35930 is out of bounds for axis 1 with size 2048

I don't know where this 35930 comes from; 2048 is max_seq_len.
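
A hedged first check: with a memmapped dataset, an out-of-bounds index along axis 1 often means the memmap's dtype or shape does not match what data_process.py actually wrote, so indices get interpreted against the wrong geometry. Verifying that the file size is consistent with the assumed dtype and sequence length is cheap; the path and uint16 dtype below are assumptions, not taken from the repo.

import os
import numpy as np

data_path = "./data/pretrain_data.bin"   # placeholder: whichever .bin the dataset maps
dtype = np.uint16                        # assumption: token ids written as uint16
max_seq_len = 2048

n_tokens = os.path.getsize(data_path) // np.dtype(dtype).itemsize
print("total tokens:", n_tokens, "| remainder vs max_seq_len:", n_tokens % max_seq_len)

rows = n_tokens // max_seq_len
arr = np.memmap(data_path, dtype=dtype, mode="r", shape=(rows, max_seq_len))
print("shape:", arr.shape, "| max id in first rows:", int(arr[:1000].max()))   # should look like token ids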

Problem with GBK encoding

When running the data preprocessing script I get: UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 27: illegal multibyte sequence. I tried switching gbk to gb18030 and using errors='ignore' to skip illegal characters, but both still failed. I also tried other datasets, such as the Baidu and medical ones, and hit the same problem.

Is this an editor/environment issue or a data issue, and how can I fix it?
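
Most likely the data files are UTF-8 while Python on Windows defaults to the system code page (GBK) when open() is called without an encoding, so the decode fails. A hedged sketch of the usual fix; the paths are placeholders.

# Pass the encoding explicitly instead of relying on Windows' default code page (GBK).
with open("./data/corpus.json", "r", encoding="utf-8") as f:    # placeholder path
    raw = f.read()

# The tokenized .bin artifacts are binary and should be opened with 'rb',
# which skips text decoding entirely.
with open("./data/pretrain_data.bin", "rb") as f:
    blob = f.read()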

How is the parameter count calculated?

The README gives the following configuration for the 50M-parameter model:
max_seq_len = 512
dim = 512
n_layers = 8
n_heads = 8

How do I compute the parameter count from these values? (The estimation sketch under the earlier parameter-count issue applies here as well.)
