
Comments (3)

BestJuly commented on August 18, 2024

Hi, @Zhuysheng. Thank you for your interest.

In your case, are the multi-GPU settings, such as batch size and learning rate, exactly the same as in single-GPU mode?

I also tried the multi-GPU setting in one of my experimental environments (PyTorch 1.4, CUDA 10.1, V100). The logs are shown below.

Logs when using 2 GPUs

Using 2 GPUs
Train: [1/240][590/596] 	loss 15.278 (17.058)	1_p -0.006 (-0.004)	2_p 0.000 (0.000))
Train: [2/240][590/596] 	loss 15.249 (15.257)	1_p 0.007 (0.002))	2_p -0.000 (0.001)
Train: [3/240][590/596] 	loss 15.245 (15.254)	1_p 0.011 (0.002))	2_p 0.002 (0.002))
Train: [4/240][590/596] 	loss 15.242 (15.253)	1_p 0.021 (0.003))	2_p -0.002 (0.005)
Train: [5/240][590/596] 	loss 15.199 (15.248)	1_p -0.013 (0.004)	2_p 0.088 (0.021))
Train: [6/240][590/596] 	loss 15.241 (15.239)	1_p 0.013 (0.021))	2_p 0.027 (0.047))
Train: [7/240][590/596] 	loss 15.220 (15.222)	1_p 0.058 (0.069)	2_p 0.025 (0.063))
Train: [8/240][590/596] 	loss 15.232 (15.204)	1_p 0.064 (0.109)	2_p 0.029 (0.096)
Train: [9/240][590/596] 	loss 15.214 (15.171)	1_p 0.144 (0.172)	2_p 0.044 (0.136)
Train: [10/240][590/596]	loss 15.246 (15.132)	1_p 0.162 (0.251)	2_p 0.107 (0.201)
Train: [11/240][590/596]	loss 15.073 (15.075)	1_p 0.485 (0.372)	2_p 0.366 (0.288)
Train: [12/240][590/596]	loss 4.763 (14.979)	1_p 1.307 (0.618)	2_p 0.443 (0.399)
...

Logs when using 1 GPU: training script is the same

Train: [1/240][590/596] 	loss 15.279 (17.406))   1_p -0.005 (-0.001)     2_p -0.009 (-0.005)
Train: [2/240][590/596] 	loss 15.252 (15.252)    1_p -0.000 (0.004)      2_p 0.014 (0.005))
Train: [3/240][590/596] 	loss 15.253 (15.250)    1_p 0.007 (0.006))      2_p 0.009 (0.008))
Train: [4/240][590/596] 	loss 15.253 (15.248)    1_p 0.004 (0.012))      2_p 0.006 (0.011))
Train: [5/240][590/596] 	loss 15.268 (15.248)    1_p 0.011 (0.026))      2_p 0.018 (0.019))
Train: [6/240][590/596] 	loss 15.262 (15.243)    1_p 0.024 (0.041))      2_p 0.007 (0.031))
Train: [7/240][590/596] 	loss 15.259 (15.235)    1_p 0.039 (0.072)       2_p 0.017 (0.056))
Train: [8/240][590/596] 	loss 15.254 (15.217)    1_p 0.100 (0.119)       2_p 0.021 (0.086)
...

It seems it runs fine for me: the loss decreases as the number of epochs increases.
The multi-GPU part here uses nn.DataParallel. One thing I observed is that with multi-GPU training, the performance is different from single-GPU training. I searched for this and found this; however, it still cannot explain or solve the problem you met.
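For reference, a minimal sketch (with a placeholder model, not the exact code of this repo) of how nn.DataParallel is typically applied; it splits each input batch across the visible GPUs and gathers the outputs on GPU 0, which is one reason multi-GPU behavior can differ from single-GPU training:

```python
import torch
import torch.nn as nn

# Placeholder model; the repo's actual backbone (e.g. C3D) would go here.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on each GPU and splits every batch,
    # so a batch of 24 becomes 6 samples per GPU on 4 GPUs.
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(24, 128).cuda()
out = model(x)  # outputs are gathered back on the default GPU
```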

I am not sure whether your situation is affected by the environment. For multi-GPU training, you could also try distributed training instead of nn.DataParallel, and you could try different training settings, such as adjusting the learning rate.
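If you go the distributed route, here is a hedged sketch of a typical DistributedDataParallel setup (the launch command and variable names are generic, not from this repo; newer PyTorch uses torchrun, older versions use python -m torch.distributed.launch):

```python
# Launch with, e.g.: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")        # RANK/WORLD_SIZE come from the launcher
local_rank = int(os.environ["LOCAL_RANK"])     # one process per GPU
torch.cuda.set_device(local_rank)

model = nn.Linear(128, 10).cuda(local_rank)    # placeholder model
model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced across processes

# ... training loop ...
dist.destroy_process_group()
```

Unlike nn.DataParallel, each DDP process sees its own per-GPU batch, so the per-GPU batch size and learning rate should be chosen with that in mind.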


Zhuysheng commented on August 18, 2024

@BestJuly Thanks for your quick reply.

On a single RTX 2080 Ti, I set batch_size=6 for C3D.
On 4 RTX 2080 Ti GPUs with nn.DataParallel, batch_size=24.

I think the main differences are the environment and the batch size; I will try your suggestions.

By the way, does batch_size affect SSL learning a lot?


BestJuly commented on August 18, 2024

@Zhuysheng
In many papers, batch size affects SSL learning results. However, in our case I do not have ablation studies/experiments on batch size. Usually, if you change the batch size, the learning rate should be changed accordingly, following the linear scaling rule: new_lr = old_lr * new_batch_size / old_batch_size.
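A tiny worked example of that rule with hypothetical numbers (the actual values depend on your setup):

```python
# Linear scaling rule: scale the learning rate with the batch size.
old_batch_size, old_lr = 6, 0.01   # hypothetical single-GPU settings
new_batch_size = 24                # e.g. 4 GPUs x 6 samples each
new_lr = old_lr * new_batch_size / old_batch_size
print(new_lr)                      # 0.04
```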

And if you are asking whether the learning rate affects the performance of SSL, my answer is YES.
I also do not have ablation studies/experiments on the code of this repo. However, in a modified experimental version of the code, I found that a different learning rate (with the same batch size) does affect the performance: about 4% improvement on video retrieval and 1% improvement on video recognition when switching to a different learning rate. Therefore, you can try changing the learning rate.

