
paddlepaddle / plsc

Paddle Large Scale Classification Tools, supporting ArcFace, CosFace, PartialFC, and Data Parallel + Model Parallel training. Models include ResNet, ViT, Swin, DeiT, CaiT, FaceViT, MoCo, MAE, ConvMAE, and CAE.

License: Apache License 2.0

Python 86.36% Shell 12.37% Jupyter Notebook 1.27%
face-recognition arcface cosface partial-fc data-parallel model-parallel large-scale paddlepaddle paddle distributed-training

plsc's Introduction


Introduction

PLSC is an open-source repository of Paddle Large Scale Classification Tools, which supports large-scale classification model pre-training as well as fine-tuning for downstream tasks.

Available Models

Top News 🔥

Update (2023-01-11): PLSC v2.4 is released. We refactored the entire repository by task type and adapted it to PaddlePaddle release 2.4. Four new models were added: FaceViT, CaiT, MoCo v3, and MAE. Every model in the repository can now be trained from scratch to the original official accuracy, notably ViT-Large on the ImageNet21K dataset, and we also provide an ImageNet21K data preprocessing method. For AMP training, PLSC uses FP16 O2 by default, which speeds up training while maintaining accuracy.

Update (2022-07-18): PLSC v2.3 is released, a major upgrade that is more modular and highly extensible. It supports more tasks, such as ViT and DeiT. The static graph mode is no longer maintained as of this release.

Update (2022-01-11): Supported the NHWC data format for FP16, improving throughput by 10% and reducing GPU memory by 30%. PLSC now supports 92 million classes on a single node with 8 NVIDIA V100 (32GB) GPUs at high training throughput, and adds best-checkpoint saving. We also released 18 pretrained models and PLSC v2.2.

Update (2021-12-11): Released a Zhihu technical article and a Bilibili open class.

Update (2021-10-10): Added FP16 training with improved throughput and optimized GPU memory usage, supporting 60 million classes on a single node with 8 NVIDIA V100 (32GB) GPUs at high training throughput.

Update (2021-09-10): This repository supports both static and dynamic graph modes with PaddlePaddle v2.2, supporting 48 million classes on a single node with 8 NVIDIA V100 (32GB) GPUs. Added PartialFC, SparseMomentum, and ArcFace/CosFace (referred to as MarginLoss). Backbones include IResNet and MobileNet.

Installation

PLSC can be used in two ways: as an external third-party library, imported into your own project with import plsc, or developed and used locally based on this repository.

Note: As PaddlePaddle continues to iterate, the PLSC master branch tracks the PaddlePaddle develop branch, so API mismatches may occur with older PaddlePaddle versions.

Install plsc as a third-party library

pip install git+https://github.com/PaddlePaddle/PLSC@master

For stable development, you can install a released version of plsc.

pip install plsc==2.4

Install plsc locally

git clone https://github.com/PaddlePaddle/PLSC.git
cd /path/to/PLSC/
# [optional] pip install -r requirements.txt
python setup.py develop
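
After either installation method, a quick import check confirms the package is usable. This is a minimal sketch; the plsc.engine.engine.Engine import path assumes the v2.x package layout used by tools/train.py and plsc-export.

# Minimal post-install sanity check (a sketch; the Engine path assumes the v2.x layout).
import plsc
from plsc.engine.engine import Engine  # training/evaluation/export entry point

print('PLSC imported from:', plsc.__file__)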

See Installation instructions.

Getting Started

See Quick Run Recognition for the basic usage of PLSC.

Tutorials

See more tutorials.

Documentation

See documentation for the usage of more APIs or modules.

License

This project is released under the Apache 2.0 license.

Citation

@misc{plsc,
    title={PLSC: An Easy-to-use and High-Performance Large Scale Classification Tool},
    author={PLSC Contributors},
    howpublished = {\url{https://github.com/PaddlePaddle/PLSC}},
    year={2022}
}

plsc's People

Contributors

danleifeng, gavin1332, guoxiawang, guru4elephant, liujie0926, mrxlt, qizhaoaoe, zhwesky2010


plsc's Issues

Error rate as high as 83.33% when predicting the example images with the pretrained model; where might the problem be?

Overview

I set up the PLSC runtime environment on both Windows and Linux. Following the installation instructions in the README, building Paddle from source failed several times, so I ended up installing paddlepaddle-gpu with pip (GPU environments on both platforms). After downloading the pretrained model, I ran prediction on the test image; the results are shown in the figure below.
The error rate is clearly very high; only Rachel is recognized correctly.

The pretrained model download and the prediction code were run exactly as instructed in the README; apart from some file paths, no code was modified.

[image: friends2]

Environment

  1. Windows 10 + Quadro P1000 + Cuda 10.1
  2. CentOS 7 64 bit + RTX 2080 Ti + Cuda 10.1
  3. PaddlePaddle-gpu 2.2 (same on both platforms)

Questions

  1. Does the platform have to be Linux?
  2. Were the example images predicted with the pretrained model described in the README?
  3. My prediction accuracy is far below the example's; which steps should I check for problems?

Could the developers give some hints? Thanks!

Error when running the program

2020-05-22 16:49:30,924-INFO: User manually set optimizer.
Traceback (most recent call last):
File "train_fp16.py", line 346, in
ins.train()
File "/ssd2/wangjian/k8s_debug/plsc-env/python/lib/python2.7/site-packages/plsc/entry.py", line 907, in train
dist_strategy=strategy)
File "/ssd2/wangjian/k8s_debug/plsc-env/python/lib/python2.7/site-packages/plsc/entry.py", line 469, in build_program
dist_optimizer.minimize(loss)
File "/ssd2/wangjian/k8s_debug/plsc-env/python/lib/python2.7/site-packages/paddle/fluid/incubate/fleet/collective/init.py", line 420, in minimize
no_grad_set=no_grad_set)
File "/ssd2/wangjian/k8s_debug/plsc-env/python/lib/python2.7/site-packages/plsc/models/dist_algo.py", line 424, in minimize
optimize_ops = self._optimizer.apply_gradients(scaled_params_grads)
File "/ssd2/wangjian/k8s_debug/plsc-env/python/lib/python2.7/site-packages/paddle/fluid/optimizer.py", line 705, in apply_gradients
optimize_ops = self._create_optimization_pass(params_grads)
File "/ssd2/wangjian/k8s_debug/plsc-env/python/lib/python2.7/site-packages/paddle/fluid/optimizer.py", line 535, in _create_optimization_pass
self._create_global_learning_rate()
File "/ssd2/wangjian/k8s_debug/plsc-env/python/lib/python2.7/site-packages/paddle/fluid/optimizer.py", line 276, in _create_global_learning_rate
"learning rate variable is create outside optimizer,"
TypeError: learning rate variable is create outside optimizer,can not create new learning rate variable for new program

Add the support for arbitrary input

Currently, plsc only supports inputs named "image" and "label", which does not cover common use cases, so we need to add support for arbitrary inputs.

Error when installing PLSC

(Python 2) Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-OM7fxe/Pillow

As the number of classes grows, the allgather of hidden features on each GPU still consumes memory even though the parameters can be split across GPUs

As the number of classes grows, the classification-layer parameters can be split across GPUs, but the allgather of hidden features on each GPU also consumes memory. As the classification layer grows, adding GPUs can keep the per-GPU memory of the FC-layer parameters constant, but the hidden features x still grow with the number of GPUs. Since single-GPU memory is limited, this also limits how far simply adding GPUs can cope with a linearly growing number of classes. This problem is raised in the paper "Partial FC: Training 10 Million Identities on a Single Machine".
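
As rough, illustrative arithmetic for the scaling described above (made-up sizes, not measured numbers): if each GPU keeps a fixed number of classes in its FC shard, the per-GPU FC weight stays constant, while the allgathered hidden features grow linearly with the number of GPUs.

# Rough memory model (illustrative only): FC shard per GPU vs. allgathered features per GPU.
FEAT_DIM = 512                # hidden feature dimension
BATCH_PER_GPU = 64            # local batch size
CLASSES_PER_GPU = 1_000_000   # classes held by each GPU's FC shard
BYTES = 2                     # FP16

for num_gpus in (8, 16, 32, 64):
    fc_shard = CLASSES_PER_GPU * FEAT_DIM * BYTES           # constant per GPU
    gathered = num_gpus * BATCH_PER_GPU * FEAT_DIM * BYTES  # grows linearly with num_gpus
    print(f"GPUs={num_gpus:3d}  FC shard/GPU={fc_shard / 2**20:7.1f} MiB  "
          f"allgathered features/GPU={gathered / 2**20:6.2f} MiB")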

Problems exporting model

I'm trying to do:

export PADDLE_NNODES=1
export PADDLE_MASTER="127.0.0.1:12538"
export CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch \
    --nnodes=$PADDLE_NNODES \
    --master=$PADDLE_MASTER \
    --devices=$CUDA_VISIBLE_DEVICES \
    plsc-export \
    -c ./configs/IResNet100_WebFace42M_CosFace_pfc02_1n8c_dp_mp_fp16o1.yaml \
    -o Global.pretrained_model=/home/angela/Descargas/IResNet100_WebFace42M_CosFace_pfc02_1n8c_dp_mp_fp16o1 \
    -o FP16.level=O0 \
    -o Model.data_format=NCHW

and I get this error:

LAUNCH INFO 2024-03-11 12:29:45,551 ------------------------- ERROR LOG DETAIL -------------------------
Traceback (most recent call last):
  File "/home/angela/Descargas/PLSC-master/ven/bin/plsc-export", line 26, in <module>
    sys.exit(main())
  File "/home/angela/Descargas/PLSC-master/ven/bin/plsc-export", line 19, in main
    engine = Engine(config, mode="export")
  File "/home/angela/Descargas/PLSC-master/ven/lib/python3.7/site-packages/plsc/engine/engine.py", line 98, in __init__
    paddle.set_flags(RELATED_FLAGS_SETTING)
  File "/home/angela/Descargas/PLSC-master/ven/lib/python3.7/site-packages/paddle/fluid/framework.py", line 7493, in set_flags
    "Flag %s cannot set its value through this function." % (key)
ValueError: Flag FLAGS_cudnn_exhaustive_search cannot set its value through this function.

Could someone point me in the right direction? Thank you.

Loss becomes NaN when the learning rate is too small

When training on a custom dataset, I found that a learning rate that is too small causes the loss to become NaN. In particular, whenever the LR is reduced to one tenth of its original value (e.g. 0.025 to 0.0025, or 0.1 to 0.01), the loss turns NaN and training aborts. See the training logs:

0.01_training.log
0.1_training.log
0.2_training.log

How can this be solved?

Also, during validation:

  • The loss drops to around 15 at the beginning and then barely decreases;
  • the XNorm metric drops very quickly (e.g. from 64288.505847 to 0.017919);
  • while the Accuracy-Highest metric keeps hovering around 0.50903.

I am new to face recognition and not sure what is going on here; any advice would be appreciated.

Thanks!

What does the inference result of inference.py mean?

The main function in inference.py is:

def paddle_inference(args):
    import cv2                              # not shown in the original snippet
    import numpy as np                      # not shown in the original snippet
    import paddle.inference as paddle_infer

    config = paddle_infer.Config(args.model_file, args.params_file)
    predictor = paddle_infer.create_predictor(config)

    input_names = predictor.get_input_names()
    input_handle = predictor.get_input_handle(input_names[0])

    img = cv2.imread(args.image_path)
    # normalize to mean 0.5, std 0.5 (0.00784313725 == 1 / 127.5)
    img = (img - 127.5) * 0.00784313725
    # BGR2RGB
    img = img[:, :, ::-1]
    # HWC -> CHW, then add the batch dimension
    img = img.transpose((2, 0, 1))
    img = np.expand_dims(img, 0)
    img = img.astype('float32')

    input_handle.copy_from_cpu(img)

    predictor.run()

    output_names = predictor.get_output_names()
    output_handle = predictor.get_output_handle(output_names[0])
    output_data = output_handle.copy_to_cpu()

    print('paddle inference result: ', output_data.shape)

With a model trained on the MS1M_v3 dataset, I run inference on MS1M_v3\images\00000001.jpg. According to the paddle_inference function, the result is an array of shape (1, 128). What does this array mean? Shouldn't the output be a single class value (e.g. 0 or some other number)?

Thanks!
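
For context, the (1, 128) output is the face feature embedding produced by the backbone rather than a class id; recognition compares the embeddings of different images. Below is a minimal sketch of such a comparison using cosine similarity; the names are illustrative, and random vectors are used only to keep the snippet self-contained.

import numpy as np

def cosine_similarity(a, b):
    # Higher similarity suggests the two face crops are more likely the same identity.
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice emb1/emb2 would be the (1, 128) arrays returned by paddle_inference
# for two face crops; random vectors here only make the sketch runnable.
emb1 = np.random.rand(1, 128).astype('float32')
emb2 = np.random.rand(1, 128).astype('float32')
print('cosine similarity:', cosine_similarity(emb1, emb2))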

How is the output model designed?

I trained a model based on MS1M_v3_arcface_MobileFaceNet_128_0.1; the layout of the output model is shown in the screenshots below:

[Screenshot_1]

[Screenshot_2]

Is it because validation uses three datasets (with different classes), so that three models are output?
Also, why does the output consist of many scattered files instead of a single file like FresResNet50.pdiparams? Can these scattered files be merged?

Thanks!

Question about the training loss with multiple datasets in PLSC

PLSC solves large-scale classification training very well. In practice, model training uses multiple datasets, and these datasets have overlapping identities to varying degrees; cleaning up the overlapping IDs between datasets is also troublesome. So training takes multiple datasets as input, shares the backbone parameters, and uses a separate classification layer for each dataset. In the current PLSC code,

shard_logit = loss._get_info('shard_logit')
shard_prob = loss._get_info('shard_prob')
shard_label = loss._get_info('shard_label')
shard_dim = loss._get_info('shard_dim')

is roughly where the large classification-layer weights are split across multiple GPUs (my description here may not be precise).

The current PLSC code cannot handle the connection weights of multiple classification heads, so I modified the minimize function in plsc/model/dist_algo.py, roughly as follows:

def compute_gradient_multi_branches(self,
                                    loss,
                                    dataset_name,
                                    startup_program=None,
                                    parameter_list=None,
                                    no_grad_set=None,
                                    callbacks=None):
    assert loss._get_info('shard_logit_{}'.format(dataset_name))

    shard_logit = loss._get_info('shard_logit_{}'.format(dataset_name))
    shard_prob = loss._get_info('shard_prob_{}'.format(dataset_name))
    shard_label = loss._get_info('shard_label_{}'.format(dataset_name))
    shard_dim = loss._get_info('shard_dim_{}'.format(dataset_name))

    op_maker = fluid.core.op_proto_and_checker_maker
    op_role_key = op_maker.kOpRoleAttrName()
    op_role_var_key = op_maker.kOpRoleVarAttrName()
    backward_role = int(op_maker.OpRole.Backward)
    loss_backward_role = int(op_maker.OpRole.Loss) | int(
        op_maker.OpRole.Backward)

    # minimize a scalar of reduce_sum to generate the backward network
    scalar = fluid.layers.reduce_sum(shard_logit)
    block = loss.block

    if not self._use_fp16:
        # ret = self._optimizer.minimize(scalar)
        params_grads = self._optimizer.backward(scalar)
        print(loss, scalar, dataset_name)
        # remove the unnecessary ops
        index = 0
        """
        for i, op in enumerate(block.ops):
            if op.all_attrs()[op_role_key] == loss_backward_role:
                index = i
                break
        """
        for i, op in enumerate(block.ops):
            print(i, dataset_name, block.ops[i])

The goal is to compute gradients separately for each branch's classification loss and then aggregate the gradients across the branches. However, I found that the ops generated for the branches of different datasets differ greatly. I had implemented this before in TensorFlow, where only the final classification-layer parameters differ and are not shared, so I could, for example, average the gradients of the shared parameters and then update them. In my experiment, taking WebFace and VGGFace2 as an example, the ops of these two branches differ a lot; one branch has many more ops than the other.

Some of the extra ops:

inputs {
parameter: "X"
arguments: "prelu_32.w_0@GRAD"
}
outputs {
parameter: "Out"
arguments: "prelu_32.w_0@GRAD"
}
type: "c_sync_calc_stream"
attrs {
name: "op_device"
type: STRING
s: ""
}
attrs {
name: "op_role"
type: INT
i: 1
}
attrs {
name: "op_callstack"
type: STRINGS
strings: "
}
attrs {
name: "op_namescope"
type: STRING
s: "/"
}
attrs {
name: "op_role_var"
type: STRINGS
}

inputs {
parameter: "X"
arguments: "prelu_24.w_0@GRAD"
}
outputs {
parameter: "Out"
arguments: "prelu_24.w_0@GRAD"
}
type: "c_allreduce_sum"
attrs {
name: "op_device"
type: STRING
s: ""
}
attrs {
name: "ring_id"
type: INT
i: 0
}
attrs {
name: "use_calc_stream"
type: BOOLEAN
b: false
}
attrs {
name: "op_role"
type: INT
i: 1
}
attrs {
name: "op_role_var"
type: STRINGS
}

The above is joint training on multiple datasets with a separate classification layer per dataset. Since I ran into the problem above, I would appreciate any suggestions.
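
A framework-agnostic sketch of the gradient-aggregation idea described above: average the gradients of shared backbone parameters across the per-dataset branches, while each branch-specific classification head keeps its own gradient. The function and names below are illustrative only, not PLSC API.

def aggregate_branch_grads(branch_grads, shared_param_names):
    # branch_grads: one {param_name: grad_array} dict per dataset branch.
    merged = {}
    for grads in branch_grads:
        for name, grad in grads.items():
            if name in shared_param_names:
                merged.setdefault(name, []).append(grad)  # shared backbone param: collect
            else:
                merged[name] = [grad]                     # branch-specific head param
    # Average collected gradients; single-owner entries are unchanged by the division.
    return {name: sum(gs) / len(gs) for name, gs in merged.items()}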

Operator "c_allgather" has not been registered.

When using PLSC with a PaddlePaddle built from source, the following problem occurs:
Operator "c_allgather" has not been registered. I do not know how to resolve it.
The pip-installed paddlepaddle runs fine. I built paddlepaddle from source because I want to understand the code in depth and do some customized development later.

The detailed error is as follows:

File "train_multi_input.py", line 66, in
train_multi_input()
File "train_multi_input.py", line 62, in train_multi_input
ins.train()
File "PLSC_multi_modelParal_acm/plsc/entry.py", line 948, in train
dist_strategy=strategy,
File "PLSC_multi_modelParal_acm/plsc/entry.py", line 448, in build_program
batch_multi_list=self.train_batch_size_multi
File "PLSC_multi_modelParal_acm/plsc/models/base_model.py", line 169, in get_output
batch_multi_list=batch_multi_list)
File "PLSC_multi_modelParal_acm/plsc/models/dist_algo.py", line 1244, in distributed_arcface_classify
param_attr=param_attr)
File "PLSC_multi_modelParal_acm/plsc/models/dist_algo.py", line 977, in arcface_classify
norm_x, nranks=self.nranks, use_calc_stream=True)
File "python2.7/dist-packages/paddle/fluid/layers/collective.py", line 128, in _c_allgather
'use_calc_stream': use_calc_stream
File "python2.7/dist-packages/paddle/fluid/layer_helper.py", line 43, in append_op
return self.main_program.current_block().append_op(*args, **kwargs)
File "python2.7/dist-packages/paddle/fluid/framework.py", line 2610, in append_op
attrs=kwargs.get("attrs", None))
File "python2.7/dist-packages/paddle/fluid/framework.py", line 1870, in init
proto = OpProtoHolder.instance().get_op_proto(type)
File "python2.7/dist-packages/paddle/fluid/framework.py", line 1751, in get_op_proto
raise ValueError("Operator "%s" has not been registered." % type)
ValueError: Operator "c_allgather" has not been registered

issues of training with dynamic graph

Hi, I want to modify the entry module of PLSC to replace the static graph mode with dynamic graph mode, but I don't know how to assign a GPU device to each process in dynamic graph mode, as in the official code:
[image]

I would appreciate it if you could give me some advice!

Error when exporting a dynamic model to ONNX

After running the PLSC dynamic-graph example, I exported the model to ONNX, but loading the ONNX model with OpenCV 4.4.0 reports an import error. I then changed the input shape from [None, 3, 112, 112] to [1, 3, 112, 112], exported to ONNX again, and ran onnx-simplifier, but loading the model with OpenCV still errors at the fully connected layer. I don't know the cause; any advice would be appreciated, thanks.

Runtime error

File "/home/yx/anaconda3/lib/python3.7/site-packages/plsc/entry.py", line 982, in train
acc5))
TypeError: unsupported format string passed to numpy.ndarray.format
Has anyone run into this?

Training error

λ f02d1b16ca1e /home/PLSC mkdir -p ./dataset/
λ f02d1b16ca1e /home/PLSC tar -xzf MS1M_v3_One_Sample.tgz -C ./dataset/

λ f02d1b16ca1e /home/PLSC
λ f02d1b16ca1e /home/PLSC python plsc/data/dataset/tools/lfw_style_bin_dataset_converter.py --bin_path ./dataset/MS1M_v3_One_Sample/agedb_30.bin --out_dir ./dataset/MS1M_v3_One_Sample/agedb_30/ --flip_test
convert 6000 pair images.
plsc/data/dataset/tools/lfw_style_bin_dataset_converter.py:66: DeprecationWarning: FLIP_LEFT_RIGHT is deprecated and will be removed in Pillow 10 (2023-07-01). Use Transpose.FLIP_LEFT_RIGHT instead.
img1 = img1.transpose(Image.FLIP_LEFT_RIGHT)
plsc/data/dataset/tools/lfw_style_bin_dataset_converter.py:73: DeprecationWarning: FLIP_LEFT_RIGHT is deprecated and will be removed in Pillow 10 (2023-07-01). Use Transpose.FLIP_LEFT_RIGHT instead.
img2 = img2.transpose(Image.FLIP_LEFT_RIGHT)
convert 6000 pair horizontal flip images.
λ f02d1b16ca1e /home/PLSC export CUDA_VISIBLE_DEVICES=0
λ f02d1b16ca1e /home/PLSC python tools/train.py -c ./plsc/configs/FaceRecognition/IResNet50_MS1MV3OneSample_ArcFace_0.1_1n8c_dp_fp32.yaml
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
/home/PLSC/plsc/data/preprocess/timm_autoaugment.py:38: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
_RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
/home/PLSC/plsc/data/preprocess/timm_autoaugment.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
_RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
[2022/08/16 02:58:22] plsc INFO: DataLoader :
[2022/08/16 02:58:22] plsc INFO: Eval :
[2022/08/16 02:58:22] plsc INFO: dataset :
[2022/08/16 02:58:22] plsc INFO: cls_label_path : ./dataset/MS1M_v3_One_Sample/agedb_30/label.txt
[2022/08/16 02:58:22] plsc INFO: image_root : ./dataset/MS1M_v3_One_Sample/agedb_30
[2022/08/16 02:58:22] plsc INFO: name : FaceVerificationDataset
[2022/08/16 02:58:22] plsc INFO: transform_ops :
[2022/08/16 02:58:22] plsc INFO: DecodeImage :
[2022/08/16 02:58:22] plsc INFO: channel_first : False
[2022/08/16 02:58:22] plsc INFO: to_rgb : True
[2022/08/16 02:58:22] plsc INFO: NormalizeImage :
[2022/08/16 02:58:22] plsc INFO: mean : [0.5, 0.5, 0.5]
[2022/08/16 02:58:22] plsc INFO: order :
[2022/08/16 02:58:22] plsc INFO: scale : 1.0/255.0
[2022/08/16 02:58:22] plsc INFO: std : [0.5, 0.5, 0.5]
[2022/08/16 02:58:22] plsc INFO: ToCHWImage : None
[2022/08/16 02:58:22] plsc INFO: loader :
[2022/08/16 02:58:22] plsc INFO: num_workers : 0
[2022/08/16 02:58:22] plsc INFO: use_shared_memory : True
[2022/08/16 02:58:22] plsc INFO: sampler :
[2022/08/16 02:58:22] plsc INFO: batch_size : 128
[2022/08/16 02:58:22] plsc INFO: drop_last : False
[2022/08/16 02:58:22] plsc INFO: name : BatchSampler
[2022/08/16 02:58:22] plsc INFO: shuffle : False
[2022/08/16 02:58:22] plsc INFO: Train :
[2022/08/16 02:58:22] plsc INFO: dataset :
[2022/08/16 02:58:22] plsc INFO: cls_label_path : ./dataset/MS1M_v3_One_Sample/label.txt
[2022/08/16 02:58:22] plsc INFO: image_root : ./dataset/MS1M_v3_One_Sample/
[2022/08/16 02:58:22] plsc INFO: name : FaceIdentificationDataset
[2022/08/16 02:58:22] plsc INFO: transform_ops :
[2022/08/16 02:58:22] plsc INFO: DecodeImage :
[2022/08/16 02:58:22] plsc INFO: channel_first : False
[2022/08/16 02:58:22] plsc INFO: to_rgb : True
[2022/08/16 02:58:22] plsc INFO: RandFlipImage :
[2022/08/16 02:58:22] plsc INFO: flip_code : 1
[2022/08/16 02:58:22] plsc INFO: NormalizeImage :
[2022/08/16 02:58:22] plsc INFO: mean : [0.5, 0.5, 0.5]
[2022/08/16 02:58:22] plsc INFO: order :
[2022/08/16 02:58:22] plsc INFO: scale : 1.0/255.0
[2022/08/16 02:58:22] plsc INFO: std : [0.5, 0.5, 0.5]
[2022/08/16 02:58:22] plsc INFO: ToCHWImage : None
[2022/08/16 02:58:22] plsc INFO: loader :
[2022/08/16 02:58:22] plsc INFO: num_workers : 8
[2022/08/16 02:58:22] plsc INFO: use_shared_memory : True
[2022/08/16 02:58:22] plsc INFO: sampler :
[2022/08/16 02:58:22] plsc INFO: batch_size : 128
[2022/08/16 02:58:22] plsc INFO: drop_last : False
[2022/08/16 02:58:22] plsc INFO: name : DistributedBatchSampler
[2022/08/16 02:58:22] plsc INFO: shuffle : True
[2022/08/16 02:58:22] plsc INFO: DistributedStrategy :
[2022/08/16 02:58:22] plsc INFO: data_parallel : True
[2022/08/16 02:58:22] plsc INFO: Export :
[2022/08/16 02:58:22] plsc INFO: export_type : onnx
[2022/08/16 02:58:22] plsc INFO: input_shape : ['None', 3, 112, 112]
[2022/08/16 02:58:22] plsc INFO: Global :
[2022/08/16 02:58:22] plsc INFO: accum_steps : 1
[2022/08/16 02:58:22] plsc INFO: checkpoint : None
[2022/08/16 02:58:22] plsc INFO: device : gpu
[2022/08/16 02:58:22] plsc INFO: distributed : False
[2022/08/16 02:58:22] plsc INFO: epochs : 25
[2022/08/16 02:58:22] plsc INFO: eval_during_train : True
[2022/08/16 02:58:22] plsc INFO: eval_func : face_verification_eval
[2022/08/16 02:58:22] plsc INFO: eval_interval : 200
[2022/08/16 02:58:22] plsc INFO: eval_unit : step
[2022/08/16 02:58:22] plsc INFO: max_num_latest_checkpoint : 0
[2022/08/16 02:58:22] plsc INFO: output_dir : ./output/
[2022/08/16 02:58:22] plsc INFO: pretrained_model : None
[2022/08/16 02:58:22] plsc INFO: print_batch_step : 10
[2022/08/16 02:58:22] plsc INFO: rank : 0
[2022/08/16 02:58:22] plsc INFO: save_interval : 1
[2022/08/16 02:58:22] plsc INFO: seed : 2022
[2022/08/16 02:58:22] plsc INFO: task_type : recognition
[2022/08/16 02:58:22] plsc INFO: train_epoch_func : defualt_train_one_epoch
[2022/08/16 02:58:22] plsc INFO: use_visualdl : True
[2022/08/16 02:58:22] plsc INFO: world_size : 1
[2022/08/16 02:58:22] plsc INFO: LRScheduler :
[2022/08/16 02:58:22] plsc INFO: boundaries : [10, 16, 22]
[2022/08/16 02:58:22] plsc INFO: decay_unit : epoch
[2022/08/16 02:58:22] plsc INFO: name : Step
[2022/08/16 02:58:22] plsc INFO: values : [0.2, 0.02, 0.002, 0.0002]
[2022/08/16 02:58:22] plsc INFO: Loss :
[2022/08/16 02:58:22] plsc INFO: Train :
[2022/08/16 02:58:22] plsc INFO: MarginLoss :
[2022/08/16 02:58:22] plsc INFO: m1 : 1.0
[2022/08/16 02:58:22] plsc INFO: m2 : 0.5
[2022/08/16 02:58:22] plsc INFO: m3 : 0.0
[2022/08/16 02:58:22] plsc INFO: model_parallel : False
[2022/08/16 02:58:22] plsc INFO: s : 64.0
[2022/08/16 02:58:22] plsc INFO: weight : 1.0
[2022/08/16 02:58:22] plsc INFO: Metric :
[2022/08/16 02:58:22] plsc INFO: Eval :
[2022/08/16 02:58:22] plsc INFO: LFWAcc :
[2022/08/16 02:58:22] plsc INFO: flip_test : True
[2022/08/16 02:58:22] plsc INFO: Model :
[2022/08/16 02:58:22] plsc INFO: class_num : 93431
[2022/08/16 02:58:22] plsc INFO: data_format : NCHW
[2022/08/16 02:58:22] plsc INFO: name : IResNet50
[2022/08/16 02:58:22] plsc INFO: num_features : 512
[2022/08/16 02:58:22] plsc INFO: pfc_config :
[2022/08/16 02:58:22] plsc INFO: model_parallel : False
[2022/08/16 02:58:22] plsc INFO: sample_ratio : 0.1
[2022/08/16 02:58:22] plsc INFO: Optimizer :
[2022/08/16 02:58:22] plsc INFO: grad_clip :
[2022/08/16 02:58:22] plsc INFO: always_clip : True
[2022/08/16 02:58:22] plsc INFO: clip_norm : 2.0
[2022/08/16 02:58:22] plsc INFO: clip_norm_max : 2.0
[2022/08/16 02:58:22] plsc INFO: name : ClipGradByGlobalNorm
[2022/08/16 02:58:22] plsc INFO: no_clip_list : ['partialfc']
[2022/08/16 02:58:22] plsc INFO: momentum : 0.9
[2022/08/16 02:58:22] plsc INFO: name : Momentum
[2022/08/16 02:58:22] plsc INFO: use_master_param : False
[2022/08/16 02:58:22] plsc INFO: weight_decay : 0.0005
[2022/08/16 02:58:22] plsc INFO: profiler_options : None
[2022/08/16 02:58:22] plsc INFO: train with paddle 2.3.1 and device Place(gpu:0)
[2022/08/16 02:58:22] plsc INFO: Loading dataset ./dataset/MS1M_v3_One_Sample/label.txt
[2022/08/16 02:58:23] plsc INFO: Load dataset finished, 93431 samples
W0816 02:58:23.308667 876 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.2, Runtime API Version: 11.2
W0816 02:58:23.372398 876 gpu_resources.cc:91] device: 0, cuDNN Version: 8.1.
[2022/08/16 02:58:24] plsc INFO: Number of Parameters is 91.43M.
Traceback (most recent call last):
File "tools/train.py", line 34, in
engine = Engine(config, mode="train")
File "/home/PLSC/plsc/engine/engine.py", line 213, in init
self.lr_scheduler, self.model)
File "/home/PLSC/plsc/optimizer/init.py", line 60, in build_optimizer
param_group[key] = get_fused_params(param_group[key])
File "/home/PLSC/plsc/core/param_fuse.py", line 454, in get_fused_params
var_groups = assign_group_by_size(params)
File "/home/PLSC/plsc/core/param_fuse.py", line 391, in assign_group_by_size
parameters, is_sparse_gradient, [group_size, group_size])
ValueError: (InvalidArgument) argument (position 1) must be list of Tensor, but got ParamBase at pos 0 (at /paddle/paddle/fluid/pybind/eager_utils.cc:240)

How should communication between nodes be configured for multi-node multi-GPU training?

The multi-node multi-GPU experiment fails with the following error log:

  File "train.py", line 55, in <module>
    main()
  File "train.py", line 51, in main
    ins.train()
  File "/export/data/PLSC-master/plsc/entry.py", line 927, in train
    exe.run(self.startup_program)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 783, in run
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 778, in run
    use_program_cache=use_program_cache)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 831, in _run_impl
    use_program_cache=use_program_cache)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 905, in _run_program
    fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet:

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::NCCLCommContext::CreateNCCLComm(ncclUniqueId*, int, int, int, int)
3   paddle::operators::CCommInitOp::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
5   paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
6   paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)

----------------------
Error Message Summary:
----------------------
Error: An error occurred here. There is no accurate error hint for this error yet. We are continuously in the process of increasing hint for this kind of error check. It would be helpful if you could inform us of how this conversion went by opening a github issue. And we will resolve it with high priority.
  - New issue link: https://github.com/PaddlePaddle/Paddle/issues/new
  - Recommended issue content: all error stack information
  [unhandled system error] at (/paddle/paddle/fluid/platform/collective_helper.cc:67)
  [operator < c_comm_init > error]

My attempts and analysis of the problem:
(1) The main cause of the error above is probably that some communication ports between the two physical machines have been blocked by the company (I am not sure which, e.g. port 22), but communication between the machines is not blocked in general: I can regularly transfer data between them with nc or wget.

(2) For multi-node experiments I usually start Docker containers on each physical machine and set up passwordless SSH so the machines can log into each other. On our machines this works fine for multi-node training with TensorFlow and PyTorch, but with Paddle and PLSC the problem above appears.
(3) To check whether this is a physical-machine communication problem, I created two Docker containers on the same physical machine, set up the same passwordless SSH between them, and then the code runs normally.

Based on this analysis I can only conclude that there are some restrictions on communication between our machines, even though they can reach each other.

Could you explain how your multi-node communication is configured, and how to set up a distributed training cluster for multi-node multi-GPU training?

MobilefaceNet_128_arcface_dynamic_0.1_fp16_NHWC resume KeyError

Yesterday I trained the MobileFaceNet_128_arcface_dynamic_0.1_NHWC_FP1620220217 model for Epoch=25. Today, when trying to resume from that model, I got
KeyError: 'create_parameter_49.w_0_velocity_0'. No other training parameters were changed; I only set resume to True and
checkpoint_dir=/media/aiis/project/zw/PLSC-release-2.2/Webclear_arcface_dynamic_0.1_NHWC_FP1620220217/MobileFaceNet_128/24. I am not sure what is going on; could you help take a look?

Training error with two servers, 4 GPUs each

server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073'
