
Comments (3)

BestJuly commented on August 18, 2024

Hi, @Zhuysheng. Thank you for your interest.

In your case, are the multi-GPU settings, such as batch size and learning rate, exactly the same as in single-GPU mode?

I also tried the multi-GPU setting in one of my experimental environments (PyTorch 1.4, CUDA 10.1, V100). The logs are shown below.

Logs when using 2 GPUs

Using 2 GPUs
Train: [1/240][590/596] 	loss 15.278 (17.058)	1_p -0.006 (-0.004)	2_p 0.000 (0.000))
Train: [2/240][590/596] 	loss 15.249 (15.257)	1_p 0.007 (0.002))	2_p -0.000 (0.001)
Train: [3/240][590/596] 	loss 15.245 (15.254)	1_p 0.011 (0.002))	2_p 0.002 (0.002))
Train: [4/240][590/596] 	loss 15.242 (15.253)	1_p 0.021 (0.003))	2_p -0.002 (0.005)
Train: [5/240][590/596] 	loss 15.199 (15.248)	1_p -0.013 (0.004)	2_p 0.088 (0.021))
Train: [6/240][590/596] 	loss 15.241 (15.239)	1_p 0.013 (0.021))	2_p 0.027 (0.047))
Train: [7/240][590/596] 	loss 15.220 (15.222)	1_p 0.058 (0.069)	2_p 0.025 (0.063))
Train: [8/240][590/596] 	loss 15.232 (15.204)	1_p 0.064 (0.109)	2_p 0.029 (0.096)
Train: [9/240][590/596] 	loss 15.214 (15.171)	1_p 0.144 (0.172)	2_p 0.044 (0.136)
Train: [10/240][590/596]	loss 15.246 (15.132)	1_p 0.162 (0.251)	2_p 0.107 (0.201)
Train: [11/240][590/596]	loss 15.073 (15.075)	1_p 0.485 (0.372)	2_p 0.366 (0.288)
Train: [12/240][590/596]	loss 4.763 (14.979)	1_p 1.307 (0.618)	2_p 0.443 (0.399)
...

Logs when using 1 GPU: training script is the same

Train: [1/240][590/596] 	loss 15.279 (17.406))   1_p -0.005 (-0.001)     2_p -0.009 (-0.005)
Train: [2/240][590/596] 	loss 15.252 (15.252)    1_p -0.000 (0.004)      2_p 0.014 (0.005))
Train: [3/240][590/596] 	loss 15.253 (15.250)    1_p 0.007 (0.006))      2_p 0.009 (0.008))
Train: [4/240][590/596] 	loss 15.253 (15.248)    1_p 0.004 (0.012))      2_p 0.006 (0.011))
Train: [5/240][590/596] 	loss 15.268 (15.248)    1_p 0.011 (0.026))      2_p 0.018 (0.019))
Train: [6/240][590/596] 	loss 15.262 (15.243)    1_p 0.024 (0.041))      2_p 0.007 (0.031))
Train: [7/240][590/596] 	loss 15.259 (15.235)    1_p 0.039 (0.072)       2_p 0.017 (0.056))
Train: [8/240][590/596] 	loss 15.254 (15.217)    1_p 0.100 (0.119)       2_p 0.021 (0.086)
...

It seems it runs fine for me: the loss decreases as the number of epochs increases.
The multi-GPU part here uses nn.DataParallel. One thing I observed is that with multi-GPU training, the performance is different from single-GPU training. I searched for this and found this; however, it still cannot explain or solve the problem you met.
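For reference, a minimal sketch (with a placeholder model, not the exact code of this repo) of how nn.DataParallel is typically applied; it splits each input batch across the visible GPUs and gathers the outputs on GPU 0, which is one reason multi-GPU behavior can differ from single-GPU training:

```python
import torch
import torch.nn as nn

# Placeholder model; the repo's actual backbone (e.g. C3D) would go here.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on each GPU and splits every batch,
    # so a batch of 24 becomes 6 samples per GPU on 4 GPUs.
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(24, 128).cuda()
out = model(x)  # outputs are gathered back on the default GPU
```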

I am not sure whether your situation is affected by the environment. For multi-GPU training, you could also try distributed training instead of nn.DataParallel, and you could try different training settings, such as adjusting the learning rate.
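If you go the distributed route, here is a hedged sketch of a typical DistributedDataParallel setup (the launch command and variable names are generic, not from this repo; newer PyTorch uses torchrun, older versions use python -m torch.distributed.launch):

```python
# Launch with, e.g.: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")        # RANK/WORLD_SIZE come from the launcher
local_rank = int(os.environ["LOCAL_RANK"])     # one process per GPU
torch.cuda.set_device(local_rank)

model = nn.Linear(128, 10).cuda(local_rank)    # placeholder model
model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced across processes

# ... training loop ...
dist.destroy_process_group()
```

Unlike nn.DataParallel, each DDP process sees its own per-GPU batch, so the per-GPU batch size and learning rate should be chosen with that in mind.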


Zhuysheng commented on August 18, 2024

@BestJuly Thanks for your quick reply.

On a single RTX 2080 Ti, I set batch_size=6 for C3D.
On 4 RTX 2080 Ti GPUs with nn.DataParallel, batch_size=24.

I think the main differences are the environment and the batch size; I will try your suggestions.

By the way, does batch_size affect SSL learning a lot?


BestJuly commented on August 18, 2024

@Zhuysheng
In many papers, batch size affects SSL learning results. However, in our case I do not have ablation studies/experiments on batch size. Usually, if you change the batch size, the learning rate should be changed accordingly, following the linear scaling rule: new_lr = old_lr * new_batch_size / old_batch_size.
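A tiny worked example of that rule with hypothetical numbers (the actual values depend on your setup):

```python
# Linear scaling rule: scale the learning rate with the batch size.
old_batch_size, old_lr = 6, 0.01   # hypothetical single-GPU settings
new_batch_size = 24                # e.g. 4 GPUs x 6 samples each
new_lr = old_lr * new_batch_size / old_batch_size
print(new_lr)                      # 0.04
```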

And if you are asking whether the learning rate affects the performance of SSL, my answer is YES.
I also do not have ablation studies/experiments on the code of this repo. However, in a modified experimental version of the code, I found that a different learning rate (with the same batch size) does affect the performance: about 4% improvement on video retrieval and 1% improvement on video recognition when switching to a different learning rate. Therefore, you can try changing the learning rate.

