distributed-sod-project's Introduction

Distributed Salient Object Detection Project

License: MIT

Folders & Files

  • backbone : Code for the backbone networks.
  • loss : Code for the loss functions.
  • module : Code for the key modules.
  • network : Code for the network.
  • output : Stores all results.
  • utils : Utility code.
    • utils/config.py : Configuration file for model training and testing.
    • utils/dataset.py : Code for creating the dataloader.
  • main.py : The main entry script.

My Environment

Recommended way to install these packages:

# create env
conda create -n pt python=3.8
conda activate pt

# install pytorch cuda cudnn
conda install pytorch torchvision cudatoolkit=10.2 cudnn -c pytorch

# some tools
pip install tqdm
# (optional) https://github.com/Lyken17/pytorch-OpCounter
# pip install thop
pip install prefetch-generator

# install tensorboard, we can use `from torch.utils.tensorboard import SummaryWriter`
pip install tensorboard

# I use apex for mixed precision and good distributed training support
# https://github.com/NVIDIA/apex#linux
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

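Automatic Mixed Precision Training Experiments

The time per epoch cannot be pinned down here. My comparison experiments use a multi-scale training strategy, so the input size differs from batch to batch, which may affect the training time of each epoch. Data transfer speed also limits training time to some extent, so any timing here is for reference only.

  • Without amp
    • Simple model: ~9009MiB
    • Model with checkpointing: ~4501MiB
  • With amp
    • Simple model: ~5527MiB
    • Checkpoint: ~3385MiB
  • Distributed only
    • Simple model: ~5733MiB+5733MiB
    • Checkpoint: ~3335MiB+2863MiB
  • amp and distributed together
    • Simple model: ~3409MiB+3325MiB
    • Checkpoint: ~2107MiB+2215MiB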
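
To make the single-GPU figures easier to compare, they can be expressed as relative savings against the plain baseline. The following is only a small sketch: the dictionary keys and the `reduction_percent` helper are names made up for illustration, and only the MiB values come from the list above.

```python
# GPU memory figures (MiB) for the single-GPU runs, copied from the list above.
BASELINE_MIB = 9009  # simple model, no amp, no checkpointing

single_gpu_mib = {
    "checkpoint only": 4501,
    "amp only": 5527,
    "amp + checkpoint": 3385,
}


def reduction_percent(baseline: float, value: float) -> float:
    """Return how much smaller `value` is than `baseline`, in percent."""
    return (1.0 - value / baseline) * 100.0


if __name__ == "__main__":
    for name, mib in single_gpu_mib.items():
        saved = reduction_percent(BASELINE_MIB, mib)
        print(f"{name:>18}: {mib} MiB ({saved:.1f}% less than baseline)")
```

On these numbers, amp alone saves roughly 39%, checkpointing alone roughly 50%, and the two together roughly 62%.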

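Notes

Although this code draws on many references (see the list at the end), it still has some shortcomings that remain unresolved for now:

  • Distributed testing. mmdetection implements it, but its code is too deeply nested and I have no motivation to imitate it for now. I tried the simple strategy below instead; its results differ slightly from direct single-GPU testing, and I have not figured out why yet.
    • Use a distributed sampler.
    • For the metric results obtained on different GPUs from different data subsets, use torch's distributed collective all_reduce to collect them, then divide by the number of GPUs to get the final metric.
  • With distributed training, the batch size is doubled relative to a single GPU, so I scaled the learning rate linearly as well. This does not seem to improve results much.
  • How should tqdm be used in this kind of distributed training, and how should it be configured?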
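
One possible contributor to the gap between distributed and single-GPU testing (an assumption, not a cause confirmed for this project) is that torch's `DistributedSampler` pads the dataset by repeating samples when its length is not divisible by the number of GPUs, so averaging the per-GPU means counts some samples twice. The sketch below reproduces that index assignment in plain Python, for the `shuffle=False`, `drop_last=False` case, using made-up scores; `sampler_indices` and the score values are hypothetical.

```python
import math
from typing import Dict, List


def sampler_indices(n: int, world_size: int, rank: int) -> List[int]:
    """Mimic DistributedSampler index assignment (shuffle=False, drop_last=False)."""
    per_rank = math.ceil(n / world_size)
    total = per_rank * world_size
    indices = list(range(n))
    indices += indices[: total - n]  # pad by repeating the first samples
    return indices[rank:total:world_size]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Hypothetical per-image scores for a 5-image test set (made-up numbers).
scores = [0.0, 1.0, 1.0, 1.0, 1.0]
world_size = 2

# Strategy from the note above: each rank averages its own subset, then the
# per-rank means are summed (all_reduce) and divided by the number of GPUs.
per_rank_scores = [
    [scores[i] for i in sampler_indices(len(scores), world_size, rank)]
    for rank in range(world_size)
]
mean_of_means = mean([mean(subset) for subset in per_rank_scores])

# Alternative: collect (index, score) pairs from every rank and drop the
# padded duplicates before averaging.
merged: Dict[int, float] = {}
for rank in range(world_size):
    for i in sampler_indices(len(scores), world_size, rank):
        merged[i] = scores[i]
deduplicated_mean = mean([merged[i] for i in sorted(merged)])

single_gpu_mean = mean(scores)

if __name__ == "__main__":
    print(f"mean of per-rank means: {mean_of_means:.4f}")
    print(f"deduplicated mean:      {deduplicated_mean:.4f}")
    print(f"single-GPU mean:        {single_gpu_mean:.4f}")
```

In a real run the per-sample results could be collected with a gather collective such as `torch.distributed.all_gather` instead of reducing per-GPU means. Metrics that are not plain per-image averages can differ for the same reason even without padding.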
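
On the tqdm question, one common pattern (not something this project is confirmed to use) is to show the progress bar only on the rank 0 process and silence it everywhere else through tqdm's `disable` argument. The helper names below are hypothetical, and the sketch assumes the launcher exports a `RANK` environment variable.

```python
import os
from typing import Iterable, Optional


def get_rank(default: int = 0) -> int:
    """Read the global rank from the RANK environment variable.

    Assumption: the launcher exports RANK; otherwise `default` is used,
    which makes a plain single-process run behave like rank 0.
    """
    return int(os.environ.get("RANK", default))


def is_main_process(rank: Optional[int] = None) -> bool:
    """True only for rank 0, the process that should own the progress bar."""
    return (get_rank() if rank is None else rank) == 0


def progress(iterable: Iterable, **tqdm_kwargs):
    """Wrap `iterable` in tqdm, silenced on every rank except 0."""
    # Imported lazily so the helpers above can be used without tqdm installed.
    from tqdm import tqdm

    return tqdm(iterable, disable=not is_main_process(), **tqdm_kwargs)
```

A training loop would then iterate with `for batch in progress(loader, desc="train"):`, where `loader` stands for whatever dataloader the script builds.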

References

distributed-sod-project's People

Contributors

lartpang


distributed-sod-project's Issues

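Hello, I have a question

I use apex to speed up model training. I found that when I use adaptive pooling in a shallow layer of the backbone (for example, the second layer of resnet50), training becomes extremely slow. At first I thought the shallow feature maps were so large that the computation was too heavy, but when I removed apex, training was very fast (although GPU memory usage was higher than with apex). Do you know what causes this?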
