Comments (12)
Solved: swapping PyTorch Lightning's Trainer for HuggingFace's Trainer made multi-node multi-GPU training work.
My guess is it was some DataLoader synchronization issue inside PyTorch Lightning. Stuck on this for the better part of a month, finally resolved.
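For reference, a minimal sketch of the HuggingFace Trainer route for a causal-LM fine-tune like the one in this thread. This is not the commenter's actual code (it wasn't kept, see later replies); the train_dataset variable and the DeepSpeed config path are assumptions:

from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

model = GPT2LMHeadModel.from_pretrained("IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese")
training_args = TrainingArguments(
    output_dir="./ckpt",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    bf16=True,
    deepspeed="training_config.json",  # reuse the DeepSpeed config shown below
)
# train_dataset must yield dicts with input_ids / attention_mask / labels,
# e.g. the GPT2Dataset defined later in this thread.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
# Launch identically on each node, e.g.:
#   torchrun --nnodes 2 --nproc_per_node 8 --node_rank <0|1> finetune_hf.py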
@Zyriix Did you solve it? Same situation here: PyTorch Lightning, multi-node multi-GPU, and I'm also stuck at torch.distributed.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size). Could you share how you fixed it on your side?
This is usually caused by inconsistent data across nodes during the gather. Concretely, there are a few possibilities:
- Inconsistent loss_dict: e.g. you define several losses, node 1 computes loss1 while node 2 computes loss2. If your loss_dict does not set default values for all of these losses, the gather waits forever.
- Inconsistent metric_dict: e.g. node 1 logs an L2 loss while node 2 logs IoU; the same hang occurs.
Checking these two usually resolves it. Simply turning off sync in log_dict, i.e. skipping cross-node metric aggregation, also reduces the chance of this happening (see the sketch below).
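A minimal sketch of the sync-off suggestion inside a LightningModule (compute_loss is a hypothetical stand-in for your own loss logic):

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper, not part of the thread
    # sync_dist=False logs per-rank values only and skips the cross-node
    # all-gather that hangs when ranks log different keys.
    self.log_dict({"train_loss": loss}, sync_dist=False)
    return loss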
Hi, I recently hit the same problem in my training code, and it matches your description closely: in my program the computation graph is highly data-dependent, and certain inputs skip parts of the network, so the losses differ across GPUs. Could we exchange contact info to discuss this further?
I'm in a similar situation: specific data goes through specific branches, and training hangs across nodes once SyncBN is enabled. Very much like your case. Did you ever solve it?
Same here. I'm training an MoE where different tokens take different branches, and it hangs during backward. Did you find a fix?
After adding all sorts of configuration, it now hangs at the very start of training: epoch 0 never actually begins. Both machines have GPU memory allocated and GPU utilization sits at 100%, but training never starts.
Some references I consulted:
A roundup of PyTorch multi-node multi-GPU hang problems
Script freezes with no output when using DistributedDataParallel
Stuck/hang problems encountered during PyTorch training
PyTorch DataLoader deadlocks and hangs: training stalls after one epoch, and fixes
Training starts, then hangs for half an hour with no progress
Do you know these fine details of model training?
The main changes were:
- Prefixed the launch command with OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 to avoid deadlocks from oversubscribed threads;
- Removed tqdm from data loading;
- Set drop_last=True, pin_memory=True, and num_workers=0 on the data-loading DataLoader.
Updated launch script:
#!/bin/bash
set -x -e
echo "START TIME: $(date)"
MICRO_BATCH_SIZE=1
ROOT_DIR=$(pwd)
ZERO_STAGE=3
config_json="$ROOT_DIR/training_config.json"
cat <<EOT >$config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "steps_per_print": 1000,
  "gradient_clipping": 1,
  "zero_optimization": {
    "stage": ${ZERO_STAGE},
    "allgather_partitions": false,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "stage3_max_live_parameters": 2e8,
    "stage3_max_reuse_distance": 2e8,
    "stage3_prefetch_bucket_size": 2e8,
    "stage3_param_persistence_threshold": 2e8,
    "sub_group_size": 2e8,
    "round_robin_gradients": true
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-5,
      "betas": [0.9, 0.95],
      "eps": 1e-8,
      "weight_decay": 1e-2
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 5e-6,
      "warmup_max_lr": 1e-5
    }
  }
}
EOT
export PL_DEEPSPEED_CONFIG_PATH=$config_json
TRAINER_ARGS="
--max_epochs 1 \
--num_nodes 2 \
--gpus 8 \
--strategy deepspeed_stage_${ZERO_STAGE}_offload \
--default_root_dir $ROOT_DIR \
--dirpath $ROOT_DIR/ckpt \
--save_top_k 3 \
--monitor train_loss \
--mode min \
--save_last \
"
DATA_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets
DATA_ARGS="
--data_dir $DATA_DIR \
--max_seq_length 64 \
--train_batchsize $MICRO_BATCH_SIZE \
--valid_batchsize $MICRO_BATCH_SIZE \
--train_data test_train.txt \
--valid_data test.txt \
--test_data test.txt
"
PRETRAINED_MODEL_PATH="IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese"
MODEL_ARGS="
--pretrained_model_path ${PRETRAINED_MODEL_PATH} \
--output_save_path $ROOT_DIR/predict.json \
--learning_rate 1e-4 \
--weight_decay 0.1 \
--warmup 0.01 \
"
MASTER_ADDR="IP"
MASTER_PORT="9010"
DISTRIBUTED_ARGS="
--nnodes 2 \
--nproc_per_node=8 \
--master_addr ${MASTER_ADDR} \
--master_port ${MASTER_PORT} \
--node_rank 1 \
--max_restarts=0
"
SCRIPTS_PATH=${ROOT_DIR}/finetune_gpt2.py
export CMD=" \
$DISTRIBUTED_ARGS \
$SCRIPTS_PATH \
$TRAINER_ARGS \
$MODEL_ARGS \
$DATA_ARGS \
"
export NCCL_SOCKET_IFNAME=enp129s0f0
export NCCL_IB_DISABLE=1
#python ${CMD}
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 torchrun ${CMD}
#python -m torch.distributed.launch ${CMD}
Training script:
# -*- coding: utf-8 -*-
# @Time : 2022/8/9 11:46
# @File : finetune_gpt2.py
# @Description : None
# ----------------------------------------------
# ☆ ☆ ☆ ☆ ☆ ☆ ☆
# >>> Author : Alex
# >>> Mail : [email protected]
# >>> Github : https://github.com/koking0
# >>> Blog : https://alex007.blog.csdn.net/
# ☆ ☆ ☆ ☆ ☆ ☆ ☆
import argparse
import os

import pytorch_lightning as pl
import torch as th
from pytorch_lightning import Trainer, loggers
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.distributed.elastic.multiprocessing.errors import record
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers.optimization import get_linear_schedule_with_warmup


class GPT2Dataset(Dataset):
    """
    Dataset used for the yuyuan medical QA task.
    Only supports small datasets; large datasets may be slow.
    For large datasets, use mmap-based datasets (work in progress).
    """

    def __init__(self, data_path, name, args):
        super().__init__()
        self.tokenizer = GPT2Tokenizer.from_pretrained(args.pretrained_model_path)
        self.tokenizer.add_special_tokens({'pad_token': '<|endoftext|>'})
        self.data_size = os.path.getsize(data_path) / 1024 / 1024 / 1024
        self.data_type_name = name
        self.data = self.load_data(data_path)
        self.max_seq_length = args.max_seq_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.encode(self.data[index])

    def load_data(self, data_path):
        # Files up to 5 GB are read into memory at once; larger ones are streamed.
        if self.data_size <= 5:
            with open(data_path, "rt", encoding='utf8') as f:
                lines = f.readlines()
            data_gen = lines
        else:
            data_gen = open(data_path, "rt", encoding='utf8')
        data = []
        for idx, line in enumerate(data_gen):
            data.append(line)
        if self.data_size > 5:
            data_gen.close()
        return data

    def encode(self, item):
        """
        Convert a raw text sample into model inputs.
        """
        inputs_dict = self.tokenizer.encode_plus(item, max_length=self.max_seq_length, padding='max_length',
                                                 truncation=True, return_tensors='pt')
        target = inputs_dict["input_ids"]
        labels = target.clone().detach()
        labels[target == self.tokenizer.pad_token_id] = -100
        labels = labels.squeeze().numpy().tolist()
        # Restore the first masked position to the eos id (50256) so the model
        # still learns to emit end-of-text.
        if -100 in labels:
            labels[labels.index(-100)] = 50256
        return {
            "input_ids": inputs_dict["input_ids"].squeeze(),
            "attention_mask": inputs_dict["attention_mask"].squeeze(),
            "labels": th.tensor(labels)
        }


class GPT2DataModel(pl.LightningDataModule):
    @staticmethod
    def add_data_specific_args(parent_args):
        parser = parent_args.add_argument_group('GPT2DataModel')
        parser.add_argument('--data_dir', type=str, required=True)
        parser.add_argument('--num_workers', default=0, type=int)
        parser.add_argument('--train_data', default='train.txt', type=str)
        parser.add_argument('--valid_data', default='valid.txt', type=str)
        parser.add_argument('--test_data', default='test.txt', type=str)
        parser.add_argument('--train_batchsize', type=int, required=True)
        parser.add_argument('--valid_batchsize', type=int, required=True)
        parser.add_argument('--max_seq_length', default=512, type=int)
        return parent_args

    def __init__(self, args):
        super().__init__()
        self.args = args
        self.train_batchsize = args.train_batchsize
        self.valid_batchsize = args.valid_batchsize
        if not args.do_eval_only:
            self.train_data = GPT2Dataset(os.path.join(args.data_dir, args.train_data), 'train', args)
            self.valid_data = GPT2Dataset(os.path.join(args.data_dir, args.valid_data), 'valid', args)
        self.test_data = GPT2Dataset(os.path.join(args.data_dir, args.test_data), 'test', args)

    def train_dataloader(self):
        return DataLoader(self.train_data, shuffle=True, batch_size=self.train_batchsize, drop_last=True,
                          pin_memory=True, num_workers=self.args.num_workers)

    def val_dataloader(self):
        return DataLoader(self.valid_data, shuffle=False, batch_size=self.valid_batchsize, drop_last=True,
                          pin_memory=True, num_workers=self.args.num_workers)

    def predict_dataloader(self):
        return DataLoader(self.test_data, shuffle=False, batch_size=self.valid_batchsize, drop_last=True,
                          pin_memory=True, num_workers=self.args.num_workers)


class GPT2FinetuneMedicalQAModelCheckpoint:
    @staticmethod
    def add_argparse_args(parent_args):
        parser = parent_args.add_argument_group('BaseModel')
        parser.add_argument('--monitor', default='train_loss', type=str)
        parser.add_argument('--mode', default='min', type=str)
        parser.add_argument('--dirpath', default='./ckpt/', type=str)
        parser.add_argument('--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str)
        parser.add_argument('--save_last', action='store_true', default=True)
        parser.add_argument('--save_top_k', default=3, type=float)
        parser.add_argument('--every_n_train_steps', default=1000, type=float)
        parser.add_argument('--save_weights_only', default=True, type=bool)
        return parent_args

    def __init__(self, args):
        self.callbacks = ModelCheckpoint(monitor=args.monitor, save_top_k=args.save_top_k, mode=args.mode,
                                         save_weights_only=args.save_weights_only, dirpath=args.dirpath,
                                         filename=args.filename, save_last=args.save_last)


class GPT2Finetune(pl.LightningModule):
    @staticmethod
    def add_model_specific_args(parent_args):
        parser = parent_args.add_argument_group("BaseModel")
        parser.add_argument("--learning_rate", default=1e-4, type=float)
        parser.add_argument("--weight_decay", default=0.1, type=float)
        parser.add_argument("--warmup", default=0.01, type=float)
        return parent_args

    def __init__(self, args, num_data):
        super().__init__()
        self.args = args
        self.num_data = num_data
        self.model = GPT2LMHeadModel.from_pretrained(args.pretrained_model_path)

    def setup(self, stage) -> None:
        if stage == 'fit':
            num_gpus = self.trainer.gpus if self.trainer.gpus is not None else 0
            self.total_step = int(self.trainer.max_epochs * self.num_data /
                                  (max(1, num_gpus) * self.trainer.accumulate_grad_batches))
            print('Total training step:', self.total_step)

    def training_step(self, batch, batch_idx):
        output = self.model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'],
                            labels=batch['labels'])
        # output = self.model(input_ids=batch['input_ids'], labels=batch['labels'])
        # acc = self.comput_metrix(output.logits, batch['labels'])
        self.log('train_loss', output.loss)
        return output.loss

    def comput_metrix(self, logits, labels):
        y_pred = th.argmax(logits, dim=-1)
        y_pred = y_pred.view(size=(-1,))
        y_true = labels.view(size=(-1,)).float()
        corr = th.eq(y_pred, y_true)
        acc = th.sum(corr.float()) / labels.size()[0]
        return acc

    def validation_step(self, batch, batch_idx):
        output = self.model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'],
                            labels=batch['labels'])
        self.log('val_loss', output.loss)

    def configure_optimizers(self):
        no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        paras = list(filter(lambda p: p[1].requires_grad, self.named_parameters()))
        paras = [{
            'params': [p for n, p in paras if not any(nd in n for nd in no_decay)],
            'weight_decay': self.args.weight_decay
        }, {
            'params': [p for n, p in paras if any(nd in n for nd in no_decay)],
            'weight_decay': 0.0
        }]
        optimizer = th.optim.AdamW(paras, lr=self.args.learning_rate)
        scheduler = get_linear_schedule_with_warmup(
            optimizer, int(self.total_step * self.args.warmup), self.total_step)
        return [{
            'optimizer': optimizer,
            'lr_scheduler': {
                'scheduler': scheduler,
                'interval': 'step',
                'frequency': 1
            }
        }]


@record
def train():
    total_parser = argparse.ArgumentParser("Summary Task")
    total_parser.add_argument('--local_rank', type=int)
    total_parser.add_argument('--do_eval_only', action='store_true', default=False)
    total_parser.add_argument('--pretrained_model_path', default=None, type=str)
    total_parser.add_argument('--output_save_path', default='./predict.json', type=str)
    # * Args for data preprocessing
    total_parser = GPT2DataModel.add_data_specific_args(total_parser)
    # * Args for training
    total_parser = Trainer.add_argparse_args(total_parser)
    total_parser = GPT2FinetuneMedicalQAModelCheckpoint.add_argparse_args(total_parser)
    total_parser = GPT2Finetune.add_model_specific_args(total_parser)
    # * Args for base model
    args = total_parser.parse_args()
    data_model = GPT2DataModel(args)
    model = GPT2Finetune(args, len(data_model.train_dataloader()))
    checkpoint_callback = GPT2FinetuneMedicalQAModelCheckpoint(args).callbacks
    logger = loggers.TensorBoardLogger(save_dir=os.path.join(args.default_root_dir, 'log/'), name='MedicalQA-GPT2')
    trainer = Trainer.from_argparse_args(args, logger=logger, callbacks=[checkpoint_callback])
    trainer.tune(model)
    trainer.fit(model, data_model)
    model.model.save_pretrained("./models/finetune/gpt2")


if __name__ == '__main__':
    train()
Node 0 machine log:
$ bash finetune_gpt2.sh
++ date
+ echo 'START TIME: 2022年 09月 01日 星期四 16:34:02 CST'
START TIME: 2022年 09月 01日 星期四 16:34:02 CST
+ MICRO_BATCH_SIZE=1
++ pwd
+ ROOT_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog
+ ZERO_STAGE=3
+ config_json=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ cat
+ export PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ TRAINER_ARGS='
--max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last '
+ DATA_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets
+ DATA_ARGS='
--data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
'
+ PRETRAINED_MODEL_PATH=IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese
+ MODEL_ARGS='
--pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01 '
+ MASTER_ADDR=IP
+ MASTER_PORT=9010
+ DISTRIBUTED_ARGS='
--nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 0 --max_restarts=0
'
+ SCRIPTS_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py
+ export 'CMD=
--nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 0 --max_restarts=0
/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py
--max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last
--pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01
--data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
'
+ CMD='
--nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 0 --max_restarts=0
/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py
--max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last
--pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01
--data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
'
+ export NCCL_SOCKET_IFNAME=enp129s0f0
+ NCCL_SOCKET_IFNAME=enp129s0f0
+ export NCCL_IB_DISABLE=1
+ NCCL_IB_DISABLE=1
+ OMP_NUM_THREADS=1
+ MKL_NUM_THREADS=1
+ torchrun --nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 0 --max_restarts=0 /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py --max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01 --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
initializing deepspeed distributed: GLOBAL_RANK: 5, MEMBER: 6/16
initializing deepspeed distributed: GLOBAL_RANK: 7, MEMBER: 8/16
/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:445: LightningDeprecationWarning: Setting `Trainer(gpus=8)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=8)` instead.
rank_zero_deprecation(
Loading DeepSpeed config from set PL_DEEPSPEED_CONFIG_PATH environment variable
initializing deepspeed distributed: GLOBAL_RANK: 3, MEMBER: 4/16
initializing deepspeed distributed: GLOBAL_RANK: 4, MEMBER: 5/16
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/16
initializing deepspeed distributed: GLOBAL_RANK: 6, MEMBER: 7/16
initializing deepspeed distributed: GLOBAL_RANK: 1, MEMBER: 2/16
initializing deepspeed distributed: GLOBAL_RANK: 2, MEMBER: 3/16
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step:Total training step: 5776 5776
Total training step: 5776
/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:2179: LightningDeprecationWarning: `Trainer.gpus` was deprecated in v1.6 and will be removed in v1.8. Please use `Trainer.num_devices` or `Trainer.device_ids` to get device information instead.
rank_zero_deprecation(
Total training step: 5776
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
You have specified an optimizer and/or scheduler within the DeepSpeed config. It is recommended to define it in `LightningModule.configure_optimizers`.
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.11814284324646 seconds
Time to load cpu_adam op: 2.935582160949707 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.088909149169922 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8845674991607666 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.0597143173217773 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.990757942199707 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.0667808055877686 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.0561718940734863 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.41974496841430664 seconds
Loading extension module utils...
Time to load utils op: 0.20378637313842773 seconds
Loading extension module utils...
Time to load utils op: 0.4045596122741699 seconds
Loading extension module utils...
Time to load utils op: 0.40416693687438965 seconds
Loading extension module utils...
Time to load utils op: 0.30461955070495605 seconds
Loading extension module utils...
Time to load utils op: 0.40419793128967285 seconds
Loading extension module utils...
Time to load utils op: 0.5049059391021729 seconds
Loading extension module utils...
Time to load utils op: 0.4040186405181885 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Time to load utils op: 0.00046753883361816406 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005242824554443359 secondsNo modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005707740783691406 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005829334259033203 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007224082946777344 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0009551048278808594 seconds
Time to load utils op: 0.0009477138519287109 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004756450653076172 seconds
| Name | Type | Params | Params per Device
--------------------------------------------------------------
0 | model | GPT2LMHeadModel | 3.6 B | 222 M
--------------------------------------------------------------
3.6 B Trainable params
0 Non-trainable params
3.6 B Total params
14,225.080Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:219: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Sanity Checking DataLoader 0: 100%|█| 2/2 [01:57<00:00, 58.51s/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:219: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/2896 [00:00<?, ?it/s]
Node 1 machine log:
$ bash finetune_gpt2.sh
++ date
+ echo 'START TIME: 2022年 09月 01日 星期四 16:34:09 CST'
START TIME: 2022年 09月 01日 星期四 16:34:09 CST
+ MICRO_BATCH_SIZE=1
++ pwd
+ ROOT_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog
+ ZERO_STAGE=3
+ config_json=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ cat
+ export PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ TRAINER_ARGS='
--max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last '
+ DATA_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets
+ DATA_ARGS='
--data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
'
+ PRETRAINED_MODEL_PATH=IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese
+ MODEL_ARGS='
--pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01 '
+ MASTER_ADDR=IP
+ MASTER_PORT=9010
+ DISTRIBUTED_ARGS='
--nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 1 --max_restarts=0
'
+ SCRIPTS_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py
+ export 'CMD=
--nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 1 --max_restarts=0
/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py
--max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last
--pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01
--data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
'
+ CMD='
--nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 1 --max_restarts=0
/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py
--max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last
--pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01
--data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
'
+ export NCCL_SOCKET_IFNAME=enp129s0f0
+ NCCL_SOCKET_IFNAME=enp129s0f0
+ export NCCL_IB_DISABLE=1
+ NCCL_IB_DISABLE=1
+ OMP_NUM_THREADS=1
+ MKL_NUM_THREADS=1
+ torchrun --nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 1 --max_restarts=0 /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py --max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01 --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
initializing deepspeed distributed: GLOBAL_RANK: 9, MEMBER: 10/16
initializing deepspeed distributed: GLOBAL_RANK: 14, MEMBER: 15/16
initializing deepspeed distributed: GLOBAL_RANK: 10, MEMBER: 11/16
initializing deepspeed distributed: GLOBAL_RANK: 13, MEMBER: 14/16
initializing deepspeed distributed: GLOBAL_RANK: 11, MEMBER: 12/16
initializing deepspeed distributed: GLOBAL_RANK: 15, MEMBER: 16/16
initializing deepspeed distributed: GLOBAL_RANK: 12, MEMBER: 13/16
initializing deepspeed distributed: GLOBAL_RANK: 8, MEMBER: 9/16
Total training step:Total training step: 57765776
Total training step: 5776
Total training step: Total training step:5776
Total training step:5776
5776
Total training step: 5776
Total training step: 5776
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1943840980529785 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1576390266418457 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1438350677490234 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1594393253326416 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.2506330013275146 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.2735142707824707 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.295503854751587 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.099400281906128 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.4705345630645752 seconds
Loading extension module utils...
Time to load utils op: 0.4043726921081543 seconds
Loading extension module utils...
Time to load utils op: 0.4040102958679199 seconds
Loading extension module utils...
Time to load utils op: 0.5046920776367188 seconds
Loading extension module utils...
Time to load utils op: 0.40497612953186035 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.5046734809875488 seconds
Loading extension module utils...
Time to load utils op: 0.504218339920044 seconds
Time to load utils op: 0.504563570022583 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004253387451171875 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004715919494628906 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Time to load utils op: 0.0009021759033203125 seconds
Time to load utils op: 0.0009865760803222656 seconds
Time to load utils op: 0.0009179115295410156 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
No modifications detected for re-loaded extension module utils, skipping build step...Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.0006961822509765625 seconds
Time to load utils op: 0.0010385513305664062 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00115966796875 seconds
Hi, may I ask whether you were using a torch lightning model? I ran into this problem too. Could I see how you wrote your huggingface trainer?
I don't really remember; the code from back then wasn't kept. [facepalm]
For this you can try the fix mentioned above: set a default value for every loss.
For example:
# Set a default for every loss so all ranks produce the same keys.
# (If these values are synced across ranks, use zero tensors on the right
# device instead of Python scalars.)
loss1 = 0
loss2 = 0
if cond1:
    loss1 = mse(net1(input), target)
else:
    loss2 = mse(net2(input), target)
loss = loss1 + loss2
loss_dict = {'loss': loss, 'loss1': loss1, 'loss2': loss2}
return loss, loss_dict
Besides losses, distributed training also gathers gradients. The differing computation graphs you mention can make gradients differ between nodes, leaving the program waiting for gradients indefinitely. Using DeepSpeed stage 2 may help with this class of problem; I'm not sure, but it's worth a try.
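Not mentioned in the thread, but for the plain-DDP strategy (as opposed to the DeepSpeed strategy used in the script above), a standard mitigation for ranks whose graphs skip some parameters is DDP's find_unused_parameters flag. A minimal Lightning sketch, assuming PL 1.6+:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# find_unused_parameters=True lets the DDP reducer mark parameters that got
# no gradient on this rank, so gradient collectives stop waiting forever.
# It adds per-step overhead, so enable it only when graphs really differ.
trainer = Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=2,
    strategy=DDPStrategy(find_unused_parameters=True),
)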
Yes, I also hit the hang when training an MoE. I tried with only one expert and the multi-GPU run works fine, so I suspect the gradients differ because different GPUs pick different experts, which blocks the backward parameter update? I'm using DeepSpeed stage 2 and it still doesn't solve the problem.
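One workaround that may be worth trying for the MoE case (a hedged sketch, not verified on this setup): add a zero-weighted term that touches every expert's parameters, so each rank's graph registers the same parameter set and gradient reductions line up. self.experts and task_loss are hypothetical names:

# Keep all experts in the autograd graph on every rank, even unrouted ones:
# the 0.0-weighted term contributes zero gradient but registers every parameter.
dummy = sum(p.sum() for expert in self.experts for p in expert.parameters())
loss = task_loss + 0.0 * dummy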