Comments (12)

koking0 commented on June 16, 2024

Solved it: I swapped PyTorch Lightning's Trainer for Hugging Face's Trainer, and multi-node multi-GPU training now works.
My guess is that it was some DataLoader synchronization issue inside PyTorch Lightning. This had me stuck for the better part of a month; finally resolved.
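For reference, a minimal sketch of what such a switch could look like. The original code was not kept (see below), so everything here, the dataset handling, the hyperparameters, and the reuse of the DeepSpeed config, is a hypothetical reconstruction rather than the author's actual script:

# Hypothetical sketch: fine-tuning with transformers.Trainer instead of
# pytorch_lightning.Trainer; launched the same way with torchrun.
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, Trainer, TrainingArguments)

model_name = "IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained(model_name)

# Placeholder data file; one training example per line, as in the script below.
dataset = load_dataset("text", data_files={"train": "test_train.txt"})
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=64),
                      batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="./ckpt",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    learning_rate=1e-5,
    bf16=True,
    deepspeed="training_config.json",  # the same DeepSpeed config shown below
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    # Causal-LM collator: labels are the input ids, shifted inside the model.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()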

jiemosang commented on June 16, 2024

@Zyriix Did you solve it? Same setup here: PyTorch Lightning, multi-node multi-GPU. I am also stuck at torch.distributed.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size). Could I ask how you fixed it on your end?

This problem is usually caused by inconsistent data across nodes during a gather. Concretely, there are a few possibilities:

  1. An inconsistent loss_dict: e.g. you define several losses and node1 computes loss1 while node2 computes loss2. If the loss_dict does not provide default values for these losses, the gather waits forever.
  2. An inconsistent metric_dict: e.g. node1 logs an L2 loss while node2 logs IoU; the same thing happens.

Checking these two usually resolves it. Simply turning off sync in log_dict, i.e. not computing metrics across nodes, also reduces the chance of this happening.

Hi, I recently hit the same problem in my training code, and it matches your description closely: in my program the computation graph is highly data-dependent, and certain inputs skip parts of the network, so the loss differs across GPUs. Could we exchange contact details to discuss this further?

Mine is similar: specific data goes through specific branches, and after enabling syncbn across machines it hangs, which sounds just like your case. Did you manage to solve it?

matrix-yang commented on June 16, 2024

Mine is similar: specific data goes through specific branches, and after enabling syncbn across machines it hangs, which sounds just like your case. Did you manage to solve it?

Same here. I'm training an MoE, where different tokens take different branches, and it hangs during backward. Did you find a fix?

koking0 commented on June 16, 2024

After adding all kinds of configuration, it now hangs right at the start of training: the epoch-0 loop never begins, yet GPU memory on both machines is allocated and GPU utilization sits at 100%. Training just never starts.

Resources consulted:
A roundup of PyTorch multi-node multi-GPU hang problems
Script freezes with no output when using DistributedDataParallel
Hangs and stalls encountered during PyTorch training
DataLoader deadlocks and hangs during PyTorch training, stopping after one epoch, with solutions
Training starts, then hangs for half an hour with no progress
When it comes to training models, do you know these details?
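When a hang like this produces no stack trace, two standard environment variables can make the stall visible (these are general PyTorch/NCCL debugging knobs, not something taken from the references above):

# Prepended to the launch command below. NCCL_DEBUG=INFO logs every collective
# and transport choice; TORCH_DISTRIBUTED_DEBUG=DETAIL reports collectives that
# do not match across ranks.
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 torchrun ${CMD}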

The main changes were:

  1. Prepended OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 to the launch command, to avoid deadlocks caused by extra threads;
  2. Removed tqdm from data loading;
  3. In the DataLoader that loads the data, set drop_last=True, pin_memory=True, and num_workers=0;

The updated launch script:

#!/bin/bash

set -x -e

echo "START TIME: $(date)"
MICRO_BATCH_SIZE=1
ROOT_DIR=$(pwd)

ZERO_STAGE=3

config_json="$ROOT_DIR/training_config.json"

cat <<EOT >$config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "steps_per_print": 1000,
  "gradient_clipping": 1,
  "zero_optimization": {
    "stage": ${ZERO_STAGE},
    "allgather_partitions": false,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "stage3_max_live_parameters" : 2e8,
    "stage3_max_reuse_distance" : 2e8,
    "stage3_prefetch_bucket_size": 2e8,
    "stage3_param_persistence_threshold": 2e8,
    "sub_group_size" : 2e8,
    "round_robin_gradients": true
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-5,
      "betas": [0.9,0.95],
      "eps": 1e-8,
      "weight_decay": 1e-2
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params":{
      "warmup_min_lr": 5e-6,
      "warmup_max_lr": 1e-5
    }
  }
}
EOT

export PL_DEEPSPEED_CONFIG_PATH=$config_json
TRAINER_ARGS="
    --max_epochs 1 \
    --num_nodes 2 \
    --gpus 8 \
    --strategy deepspeed_stage_${ZERO_STAGE}_offload \
    --default_root_dir $ROOT_DIR \
    --dirpath $ROOT_DIR/ckpt \
    --save_top_k 3 \
    --monitor train_loss \
    --mode min \
    --save_last \
"

DATA_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets
DATA_ARGS="
    --data_dir $DATA_DIR \
    --max_seq_length 64 \
    --train_batchsize $MICRO_BATCH_SIZE \
    --valid_batchsize $MICRO_BATCH_SIZE \
    --train_data test_train.txt \
    --valid_data test.txt \
    --test_data  test.txt
"

PRETRAINED_MODEL_PATH="IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese"
MODEL_ARGS="
    --pretrained_model_path ${PRETRAINED_MODEL_PATH} \
    --output_save_path $ROOT_DIR/predict.json \
    --learning_rate 1e-4 \
    --weight_decay 0.1 \
    --warmup 0.01 \
"

MASTER_ADDR="IP"
MASTER_PORT="9010"
DISTRIBUTED_ARGS="
    --nnodes 2 \
    --nproc_per_node=8 \
    --master_addr ${MASTER_ADDR} \
    --master_port ${MASTER_PORT} \
    --node_rank 1 \
    --max_restarts=0
"

SCRIPTS_PATH=${ROOT_DIR}/finetune_gpt2.py

export CMD=" \
    $DISTRIBUTED_ARGS \
    $SCRIPTS_PATH \
    $TRAINER_ARGS \
    $MODEL_ARGS \
    $DATA_ARGS \
"

export NCCL_SOCKET_IFNAME=enp129s0f0
export NCCL_IB_DISABLE=1

#python ${CMD}
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 torchrun ${CMD}
#python -m torch.distributed.launch ${CMD}

The training script:

# -*- coding: utf-8 -*-
# @Time        : 2022/8/9 11:46
# @File        : finetune_gpt2.py
# @Description : None
# ----------------------------------------------
# ☆ ☆ ☆ ☆ ☆ ☆ ☆
# >>> Author    : Alex
# >>> Mail      : [email protected]
# >>> Github    : https://github.com/koking0
# >>> Blog      : https://alex007.blog.csdn.net/
# ☆ ☆ ☆ ☆ ☆ ☆ ☆
import argparse
import os

import pytorch_lightning as pl
import torch as th
from pytorch_lightning import Trainer, loggers
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.distributed.elastic.multiprocessing.errors import record
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers.optimization import get_linear_schedule_with_warmup


class GPT2Dataset(Dataset):
	"""
	Dataset used for the yuyuan medical QA task.
	Only supports small datasets; it may be slow on large ones.
	For large datasets, use mmap datasets (work in progress).
	"""

	def __init__(self, data_path, name, args):
		super().__init__()
		self.tokenizer = GPT2Tokenizer.from_pretrained(args.pretrained_model_path)
		self.tokenizer.add_special_tokens({'pad_token': '<|endoftext|>'})
		self.data_size = os.path.getsize(data_path) / 1024 / 1024 / 1024
		self.data_type_name = name
		self.data = self.load_data(data_path)
		self.max_seq_length = args.max_seq_length

	def __len__(self):
		return len(self.data)

	def __getitem__(self, index):
		return self.encode(self.data[index])

	def load_data(self, data_path):
		# Small files (<= 5 GB) are read into memory at once; larger files are streamed line by line.
		if self.data_size <= 5:
			with open(data_path, "rt", encoding='utf8') as f:
				lines = f.readlines()
			data_gen = lines
		else:
			data_gen = open(data_path, "rt", encoding='utf8')

		data = []
		for idx, line in enumerate(data_gen):
			data.append(line)

		if self.data_size > 5:
			data_gen.close()
		return data

	def encode(self, item):
		"""
		Convert a raw text line into model inputs.
		"""
		inputs_dict = self.tokenizer.encode_plus(item, max_length=self.max_seq_length, padding='max_length',
		                                         truncation=True, return_tensors='pt')
		target = inputs_dict["input_ids"]
		labels = target.clone().detach()
		labels[target == self.tokenizer.pad_token_id] = -100

		labels = labels.squeeze().numpy().tolist()
		# Replace the first -100 (the first padded position) with the eos id (50256),
		# presumably so each sequence keeps one labelled end-of-text token.
		if -100 in labels:
			labels[labels.index(-100)] = 50256

		return {
			"input_ids": inputs_dict["input_ids"].squeeze(),
			"attention_mask": inputs_dict["attention_mask"].squeeze(),
			"labels": th.tensor(labels)
		}


class GPT2DataModel(pl.LightningDataModule):
	@staticmethod
	def add_data_specific_args(parent_args):
		parser = parent_args.add_argument_group('GPT2DataModel')
		parser.add_argument('--data_dir', type=str, required=True)
		parser.add_argument('--num_workers', default=0, type=int)
		parser.add_argument('--train_data', default='train.txt', type=str)
		parser.add_argument('--valid_data', default='valid.txt', type=str)
		parser.add_argument('--test_data', default='test.txt', type=str)
		parser.add_argument('--train_batchsize', type=int, required=True)
		parser.add_argument('--valid_batchsize', type=int, required=True)
		parser.add_argument('--max_seq_length', default=512, type=int)
		return parent_args

	def __init__(self, args):
		super().__init__()
		self.args = args
		self.train_batchsize = args.train_batchsize
		self.valid_batchsize = args.valid_batchsize
		if not args.do_eval_only:
			self.train_data = GPT2Dataset(os.path.join(args.data_dir, args.train_data), 'train set', args)
			self.valid_data = GPT2Dataset(os.path.join(args.data_dir, args.valid_data), 'validation set', args)
		self.test_data = GPT2Dataset(os.path.join(args.data_dir, args.test_data), 'test set', args)

	def train_dataloader(self):
		return DataLoader(self.train_data, shuffle=True, batch_size=self.train_batchsize, drop_last=True,
		                  pin_memory=True, num_workers=self.args.num_workers)

	def val_dataloader(self):
		return DataLoader(self.valid_data, shuffle=False, batch_size=self.valid_batchsize, drop_last=True,
		                  pin_memory=True, num_workers=self.args.num_workers)

	def predict_dataloader(self):
		return DataLoader(self.test_data, shuffle=False, batch_size=self.valid_batchsize, drop_last=True,
		                  pin_memory=True, num_workers=self.args.num_workers)


class GPT2FinetuneMedicalQAModelCheckpoint:
	@staticmethod
	def add_argparse_args(parent_args):
		parser = parent_args.add_argument_group('BaseModel')

		parser.add_argument('--monitor', default='train_loss', type=str)
		parser.add_argument('--mode', default='min', type=str)
		parser.add_argument('--dirpath', default='./ckpt/', type=str)
		parser.add_argument('--filename', default='model-{epoch:02d}-{train_loss:.4f}', type=str)
		parser.add_argument('--save_last', action='store_true', default=True)
		parser.add_argument('--save_top_k', default=3, type=float)
		parser.add_argument('--every_n_train_steps', default=1000, type=float)
		parser.add_argument('--save_weights_only', default=True, type=bool)

		return parent_args

	def __init__(self, args):
		self.callbacks = ModelCheckpoint(monitor=args.monitor, save_top_k=args.save_top_k, mode=args.mode,
		                                 save_weights_only=args.save_weights_only, dirpath=args.dirpath,
		                                 filename=args.filename, save_last=args.save_last)


class GPT2Finetune(pl.LightningModule):

	@staticmethod
	def add_model_specific_args(parent_args):
		parser = parent_args.add_argument_group("BaseModel")
		parser.add_argument("--learning_rate", default=1e-4, type=float)
		parser.add_argument("--weight_decay", default=0.1, type=float)
		parser.add_argument("--warmup", default=0.01, type=float)
		return parent_args

	def __init__(self, args, num_data):
		super().__init__()
		self.args = args
		self.num_data = num_data
		self.model = GPT2LMHeadModel.from_pretrained(args.pretrained_model_path)

	def setup(self, stage) -> None:
		if stage == 'fit':
			num_gpus = self.trainer.gpus if self.trainer.gpus is not None else 0
			self.total_step = int(self.trainer.max_epochs * self.num_data /
			                      (max(1, num_gpus) * self.trainer.accumulate_grad_batches))
			print('Total training step:', self.total_step)

	def training_step(self, batch, batch_idx):
		output = self.model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'],
		                    labels=batch['labels'])
		# output = self.model(input_ids=batch['input_ids'], labels=batch['labels'])
		# acc = self.comput_metrix(output.logits, batch['labels'])
		self.log('train_loss', output.loss)
		return output.loss

	def comput_metrix(self, logits, labels):
		y_pred = th.argmax(logits, dim=-1)
		y_pred = y_pred.view(size=(-1,))
		y_true = labels.view(size=(-1,)).float()
		corr = th.eq(y_pred, y_true)
		acc = th.sum(corr.float()) / labels.size()[0]
		return acc

	def validation_step(self, batch, batch_idx):
		output = self.model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'],
		                    labels=batch['labels'])
		self.log('val_loss', output.loss)

	def configure_optimizers(self):
		no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
		paras = list(filter(lambda p: p[1].requires_grad, self.named_parameters()))
		paras = [{
			'params':
				[p for n, p in paras if not any(nd in n for nd in no_decay)],
			'weight_decay': self.args.weight_decay
		}, {
			'params': [p for n, p in paras if any(nd in n for nd in no_decay)],
			'weight_decay': 0.0
		}]
		optimizer = th.optim.AdamW(paras, lr=self.args.learning_rate)
		scheduler = get_linear_schedule_with_warmup(
			optimizer, int(self.total_step * self.args.warmup),
			self.total_step)

		return [{
			'optimizer': optimizer,
			'lr_scheduler': {
				'scheduler': scheduler,
				'interval': 'step',
				'frequency': 1
			}
		}]


@record
def train():
	total_parser = argparse.ArgumentParser("Summary Task")
	total_parser.add_argument('--local_rank', type=int)
	total_parser.add_argument('--do_eval_only', action='store_true', default=False)
	total_parser.add_argument('--pretrained_model_path', default=None, type=str)
	total_parser.add_argument('--output_save_path', default='./predict.json', type=str)
	# * Args for data preprocessing
	total_parser = GPT2DataModel.add_data_specific_args(total_parser)
	# * Args for training
	total_parser = Trainer.add_argparse_args(total_parser)
	total_parser = GPT2FinetuneMedicalQAModelCheckpoint.add_argparse_args(total_parser)
	total_parser = GPT2Finetune.add_model_specific_args(total_parser)
	# * Args for base model
	args = total_parser.parse_args()

	data_model = GPT2DataModel(args)
	model = GPT2Finetune(args, len(data_model.train_dataloader()))
	checkpoint_callback = GPT2FinetuneMedicalQAModelCheckpoint(args).callbacks
	logger = loggers.TensorBoardLogger(save_dir=os.path.join(args.default_root_dir, 'log/'), name='MedicalQA-GPT2')
	trainer = Trainer.from_argparse_args(args, logger=logger, callbacks=[checkpoint_callback])
	trainer.tune(model)
	trainer.fit(model, data_model)

	model.model.save_pretrained("./models/finetune/gpt2")


if __name__ == '__main__':
	train()

Node0 logs:

$ bash finetune_gpt2.sh 
++ date
+ echo 'START TIME: Thu Sep 01 16:34:02 CST 2022'
START TIME: Thu Sep 01 16:34:02 CST 2022
+ MICRO_BATCH_SIZE=1
++ pwd
+ ROOT_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog
+ ZERO_STAGE=3
+ config_json=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ cat
+ export PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ TRAINER_ARGS='
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last '
+ DATA_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets
+ DATA_ARGS='
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
'
+ PRETRAINED_MODEL_PATH=IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese
+ MODEL_ARGS='
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01 '
+ MASTER_ADDR=IP
+ MASTER_PORT=9010
+ DISTRIBUTED_ARGS='
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 0     --max_restarts=0
'
+ SCRIPTS_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py
+ export 'CMD=     
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 0     --max_restarts=0
     /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py     
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last      
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01      
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
 '
+ CMD='     
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 0     --max_restarts=0
     /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py     
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last      
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01      
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
 '
+ export NCCL_SOCKET_IFNAME=enp129s0f0
+ NCCL_SOCKET_IFNAME=enp129s0f0
+ export NCCL_IB_DISABLE=1
+ NCCL_IB_DISABLE=1
+ OMP_NUM_THREADS=1
+ MKL_NUM_THREADS=1
+ torchrun --nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 0 --max_restarts=0 /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py --max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01 --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
initializing deepspeed distributed: GLOBAL_RANK: 5, MEMBER: 6/16
initializing deepspeed distributed: GLOBAL_RANK: 7, MEMBER: 8/16
/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:445: LightningDeprecationWarning: Setting `Trainer(gpus=8)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=8)` instead.
  rank_zero_deprecation(
Loading DeepSpeed config from set PL_DEEPSPEED_CONFIG_PATH environment variable
initializing deepspeed distributed: GLOBAL_RANK: 3, MEMBER: 4/16
initializing deepspeed distributed: GLOBAL_RANK: 4, MEMBER: 5/16
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/16
initializing deepspeed distributed: GLOBAL_RANK: 6, MEMBER: 7/16
initializing deepspeed distributed: GLOBAL_RANK: 1, MEMBER: 2/16
initializing deepspeed distributed: GLOBAL_RANK: 2, MEMBER: 3/16
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:2179: LightningDeprecationWarning: `Trainer.gpus` was deprecated in v1.6 and will be removed in v1.8. Please use `Trainer.num_devices` or `Trainer.device_ids` to get device information instead.
  rank_zero_deprecation(
Total training step: 5776
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
You have specified an optimizer and/or scheduler within the DeepSpeed config. It is recommended to define it in `LightningModule.configure_optimizers`.
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.11814284324646 seconds
Time to load cpu_adam op: 2.935582160949707 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.088909149169922 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8845674991607666 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.0597143173217773 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.990757942199707 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.0667808055877686 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.0561718940734863 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.41974496841430664 seconds
Loading extension module utils...
Time to load utils op: 0.20378637313842773 seconds
Loading extension module utils...
Time to load utils op: 0.4045596122741699 seconds
Loading extension module utils...
Time to load utils op: 0.40416693687438965 seconds
Loading extension module utils...
Time to load utils op: 0.30461955070495605 seconds
Loading extension module utils...
Time to load utils op: 0.40419793128967285 seconds
Loading extension module utils...
Time to load utils op: 0.5049059391021729 seconds
Loading extension module utils...
Time to load utils op: 0.4040186405181885 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Time to load utils op: 0.00046753883361816406 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005242824554443359 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005707740783691406 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005829334259033203 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007224082946777344 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0009551048278808594 seconds
Time to load utils op: 0.0009477138519287109 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004756450653076172 seconds

  | Name  | Type            | Params | Params per Device
--------------------------------------------------------------
0 | model | GPT2LMHeadModel | 3.6 B  | 222 M            
--------------------------------------------------------------
3.6 B     Trainable params
0         Non-trainable params
3.6 B     Total params
14,225.080 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:219: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0: 100%|█| 2/2 [01:57<00:00, 58.51s/it]
/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:219: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:   0%|                       | 0/2896 [00:00<?, ?it/s]

Node1 logs:

$ bash finetune_gpt2.sh 
++ date
+ echo 'START TIME: Thu Sep 01 16:34:09 CST 2022'
START TIME: Thu Sep 01 16:34:09 CST 2022
+ MICRO_BATCH_SIZE=1
++ pwd
+ ROOT_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog
+ ZERO_STAGE=3
+ config_json=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ cat
+ export PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ PL_DEEPSPEED_CONFIG_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/training_config.json
+ TRAINER_ARGS='
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last '
+ DATA_DIR=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets
+ DATA_ARGS='
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
'
+ PRETRAINED_MODEL_PATH=IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese
+ MODEL_ARGS='
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01 '
+ MASTER_ADDR=IP
+ MASTER_PORT=9010
+ DISTRIBUTED_ARGS='
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 1     --max_restarts=0
'
+ SCRIPTS_PATH=/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py
+ export 'CMD=     
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 1     --max_restarts=0
     /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py     
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last      
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01      
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
 '
+ CMD='     
    --nnodes 2     --nproc_per_node=8     --master_addr IP     --master_port 9010     --node_rank 1     --max_restarts=0
     /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py     
    --max_epochs 1     --num_nodes 2     --gpus 8     --strategy deepspeed_stage_3_offload     --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog     --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt     --save_top_k 3     --monitor train_loss     --mode min     --save_last      
    --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese     --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json     --learning_rate 1e-4     --weight_decay 0.1     --warmup 0.01      
    --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets     --max_seq_length 64     --train_batchsize 1     --valid_batchsize 1     --train_data test_train.txt     --valid_data test.txt     --test_data  test.txt
 '
+ export NCCL_SOCKET_IFNAME=enp129s0f0
+ NCCL_SOCKET_IFNAME=enp129s0f0
+ export NCCL_IB_DISABLE=1
+ NCCL_IB_DISABLE=1
+ OMP_NUM_THREADS=1
+ MKL_NUM_THREADS=1
+ torchrun --nnodes 2 --nproc_per_node=8 --master_addr IP --master_port 9010 --node_rank 1 --max_restarts=0 /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py --max_epochs 1 --num_nodes 2 --gpus 8 --strategy deepspeed_stage_3_offload --default_root_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog --dirpath /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/ckpt --save_top_k 3 --monitor train_loss --mode min --save_last --pretrained_model_path IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese --output_save_path /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/predict.json --learning_rate 1e-4 --weight_decay 0.1 --warmup 0.01 --data_dir /home/liuzhaofeng/nlg_pipeline/gpt2/dialog/datasets --max_seq_length 64 --train_batchsize 1 --valid_batchsize 1 --train_data test_train.txt --valid_data test.txt --test_data test.txt
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
num_data: 46213
initializing deepspeed distributed: GLOBAL_RANK: 9, MEMBER: 10/16
initializing deepspeed distributed: GLOBAL_RANK: 14, MEMBER: 15/16
initializing deepspeed distributed: GLOBAL_RANK: 10, MEMBER: 11/16
initializing deepspeed distributed: GLOBAL_RANK: 13, MEMBER: 14/16
initializing deepspeed distributed: GLOBAL_RANK: 11, MEMBER: 12/16
initializing deepspeed distributed: GLOBAL_RANK: 15, MEMBER: 16/16
initializing deepspeed distributed: GLOBAL_RANK: 12, MEMBER: 13/16
initializing deepspeed distributed: GLOBAL_RANK: 8, MEMBER: 9/16
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
Total training step: 5776
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1943840980529785 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1576390266418457 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1438350677490234 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1594393253326416 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.2506330013275146 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.2735142707824707 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.295503854751587 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.099400281906128 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /home/liuzhaofeng/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.4705345630645752 seconds
Loading extension module utils...
Time to load utils op: 0.4043726921081543 seconds
Loading extension module utils...
Time to load utils op: 0.4040102958679199 seconds
Loading extension module utils...
Time to load utils op: 0.5046920776367188 seconds
Loading extension module utils...
Time to load utils op: 0.40497612953186035 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.5046734809875488 seconds
Loading extension module utils...
Time to load utils op: 0.504218339920044 seconds
Time to load utils op: 0.504563570022583 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004253387451171875 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004715919494628906 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Time to load utils op: 0.0009021759033203125 seconds
Time to load utils op: 0.0009865760803222656 seconds
Time to load utils op: 0.0009179115295410156 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.0006961822509765625 seconds
Time to load utils op: 0.0010385513305664062 seconds
Using /home/liuzhaofeng/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00115966796875 seconds

GPU, GPU memory, CPU, and RAM usage:
[image]

Zyriix commented on June 16, 2024

Hi, may I ask whether you were using a torch lightning model? I ran into the same problem. Could I see how you wrote your huggingface trainer?

Zyriix commented on June 16, 2024

@koking0

koking0 commented on June 16, 2024

Hi, may I ask whether you were using a torch lightning model? I ran into the same problem. Could I see how you wrote your huggingface trainer?

I don't remember exactly; the code from back then wasn't kept. [facepalm]

YIFanH commented on June 16, 2024

@Zyriix Did you solve it? Same setup here: PyTorch Lightning, multi-node multi-GPU. I am also stuck at torch.distributed.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size). Could I ask how you fixed it on your end?

Zyriix commented on June 16, 2024

@Zyriix Did you solve it? Same setup here: PyTorch Lightning, multi-node multi-GPU. I am also stuck at torch.distributed.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size). Could I ask how you fixed it on your end?

This problem is usually caused by inconsistent data across nodes during a gather. Concretely, there are a few possibilities:

  1. An inconsistent loss_dict: e.g. you define several losses and node1 computes loss1 while node2 computes loss2. If the loss_dict does not provide default values for these losses, the gather waits forever.
  2. An inconsistent metric_dict: e.g. node1 logs an L2 loss while node2 logs IoU; the same thing happens.

Checking these two usually resolves it. Simply turning off sync in log_dict, i.e. not computing metrics across nodes, also reduces the chance of this happening; a minimal sketch follows below.
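A minimal Lightning sketch of that suggestion (the keys, branch condition, and loss functions are illustrative, not from anyone's actual code): every rank logs the same set of keys with default values, and sync_dist=False keeps logging from triggering a cross-node gather.

import torch
import pytorch_lightning as pl

class ConsistentLoggingModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # Defaults ensure every rank has every key, even when a branch is skipped.
        loss1 = torch.zeros((), device=self.device)
        loss2 = torch.zeros((), device=self.device)
        if batch["cond"].any():  # data-dependent branch, as described above
            loss1 = self.loss_fn1(self.net1(batch["x"]), batch["y"])
        else:
            loss2 = self.loss_fn2(self.net2(batch["x"]), batch["y"])
        loss = loss1 + loss2
        # Same keys on every rank; sync_dist=False avoids a cross-node reduction.
        self.log_dict({"loss": loss, "loss1": loss1, "loss2": loss2}, sync_dist=False)
        return loss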

jacken3 commented on June 16, 2024

@Zyriix Did you solve it? Same setup here: PyTorch Lightning, multi-node multi-GPU. I am also stuck at torch.distributed.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size). Could I ask how you fixed it on your end?

This problem is usually caused by inconsistent data across nodes during a gather. Concretely, there are a few possibilities:

  1. An inconsistent loss_dict: e.g. you define several losses and node1 computes loss1 while node2 computes loss2. If the loss_dict does not provide default values for these losses, the gather waits forever.
  2. An inconsistent metric_dict: e.g. node1 logs an L2 loss while node2 logs IoU; the same thing happens.

Checking these two usually resolves it. Simply turning off sync in log_dict, i.e. not computing metrics across nodes, also reduces the chance of this happening.

Hi, I recently hit the same problem in my training code, and it matches your description closely: in my program the computation graph is highly data-dependent, and certain inputs skip parts of the network, so the loss differs across GPUs. Could we exchange contact details to discuss this further?

Zyriix commented on June 16, 2024

@Zyriix Did you solve it? Same setup here: PyTorch Lightning, multi-node multi-GPU. I am also stuck at torch.distributed.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size). Could I ask how you fixed it on your end?

This problem is usually caused by inconsistent data across nodes during a gather. Concretely, there are a few possibilities:

  1. An inconsistent loss_dict: e.g. you define several losses and node1 computes loss1 while node2 computes loss2. If the loss_dict does not provide default values for these losses, the gather waits forever.
  2. An inconsistent metric_dict: e.g. node1 logs an L2 loss while node2 logs IoU; the same thing happens.

Checking these two usually resolves it. Simply turning off sync in log_dict, i.e. not computing metrics across nodes, also reduces the chance of this happening.

Hi, I recently hit the same problem in my training code, and it matches your description closely: in my program the computation graph is highly data-dependent, and certain inputs skip parts of the network, so the loss differs across GPUs. Could we exchange contact details to discuss this further?

As mentioned above, this can be addressed by giving every loss a default value. For example:

loss1 = 0
loss2 = 0
if cond1:
    loss1 = mse(net1(input), target)
else:
    loss2 = mse(net2(input), target)
loss = loss1 + loss2
loss_dict = {'loss': loss, 'loss1': loss1, 'loss2': loss2}
return loss, loss_dict

Besides the losses, distributed training also gathers gradients. The differing computation graphs you both mention can leave the nodes with different gradients, so the program waits for gradients indefinitely. Using DeepSpeed stage 2 may help with this kind of problem; I am not sure, but it is worth trying.
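A related workaround for data-dependent graphs, not mentioned in this thread and applicable to plain DDP rather than the DeepSpeed strategies, is to tell DDP to tolerate parameters that receive no gradient on some ranks:

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# find_unused_parameters=True makes DDP traverse the autograd graph each step
# and mark parameters that produced no gradient, so the gradient all-reduce
# does not wait on them forever. (Raw PyTorch equivalent:
# torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True))
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=2,
    strategy=DDPStrategy(find_unused_parameters=True),
)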

YangsongLan commented on June 16, 2024

Besides the losses, distributed training also gathers gradients. The differing computation graphs you both mention can leave the nodes with different gradients, so the program waits for gradients indefinitely. Using DeepSpeed stage 2 may help with this kind of problem; I am not sure, but it is worth trying.

Yes, I also hit the hang when training an MoE. I tried it: with only one expert, the multi-GPU run is fine. I suspect that different cards selecting different experts produce different gradients, so the backward pass cannot update the parameters? I am using DeepSpeed stage 2 and the problem is still not solved.
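One trick sometimes used for exactly this MoE situation (offered as a suggestion, not something verified in this thread) is to make every expert enter every step's computation graph, for example via dense gating weights so unselected experts contribute with near-zero weight; then all ranks produce gradients for the same parameter set and the all-reduce cannot deadlock.

import torch
import torch.nn as nn

class AllExpertsMoE(nn.Module):
    # Toy MoE layer: soft gating over all experts instead of a hard top-1 pick,
    # so every expert's parameters receive a (possibly tiny) gradient on every rank.
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                                          # x: (batch, dim)
        weights = torch.softmax(self.gate(x), dim=-1)              # (batch, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, dim, E)
        return (outs * weights.unsqueeze(1)).sum(dim=-1)           # (batch, dim)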
