
Comments (5)

CanyonWind commented on May 28, 2024

Hi, thanks for your interest. Could you please give --train-constraint-method random a try? I previously found that using evolution constraints from the beginning makes the supernet hard to converge. What I did before was to train the supernet without constraints / with random constraints for the first 30/60 epochs, then use evolution constraints for the rest. Please feel free to let me know whether it helps.
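A minimal sketch of that staged schedule (illustrative only; the helper name and epoch boundaries are assumptions, not the repo's actual code):

```python
# Staged constraint schedule described above: no constraints for the first
# 30 epochs, random constraints until epoch 60, then evolution constraints.
def constraint_method_for_epoch(epoch):
    """Hypothetical helper mapping an epoch to --train-constraint-method."""
    if epoch < 30:
        return 'none'       # let the supernet weights warm up freely
    elif epoch < 60:
        return 'random'     # sample architectures under random constraints
    else:
        return 'evolution'  # switch to evolution-searched constraints
```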

CanyonWind commented on May 28, 2024

I tried the evolution constraints for 2 epochs; please refer to the log below.

Namespace(batch_norm=False, batch_size=64, block_choices='0, 0, 3, 1, 1, 1, 0, 0, 2, 0, 2, 1, 1, 0, 2, 0, 2, 1, 3, 2', channel_choices='6, 5, 3, 5, 2, 6, 3, 4, 2, 5, 7, 5, 4, 6, 7, 4, 4, 5, 4, 3', channels_layout='OneShot', crop_ratio=0.875, cs_warm_up=False, data_dir='~/.mxnet/datasets/imagenet', dtype='float16', epoch_start_cs=0, flop_param_method='lookup_table', hard_weight=0.5, ignore_first_two_cs=False, input_size=224, label_smoothing=True, last_conv_after_pooling=True, last_gamma=False, log_interval=50, logging_file='./logs/shufflenas_supernet+_wc.log', lr=0.65, lr_decay=0.1, lr_decay_epoch='40,60', lr_decay_period=0, lr_mode='cosine', mixup=False, mixup_alpha=0.2, mixup_off_epoch=0, mode='imperative', model='ShuffleNas', momentum=0.9, no_wd=True, num_epochs=120, num_gpus=1, num_workers=16, rec_train='/home/alex/imagenet/rec/train.rec', rec_train_idx='/home/alex/imagenet/rec/train.idx', rec_val='/home/alex/imagenet/rec/val.rec', rec_val_idx='/home/alex/imagenet/rec/val.idx', reduced_dataset_scale=1, resume_epoch=0, resume_params='', resume_states='', save_dir='params_shufflenas_supernet+_wc', save_frequency=10, teacher=None, temperature=20, train_bottom_constraints='flops-190-params-2.8', train_constraint_method='evolution', train_upper_constraints='flops-330-params-5.0', use_all_blocks=False, use_all_channels=False, use_gn=False, use_pretrained=False, use_rec=True, use_se=True, warmup_epochs=5, warmup_lr=0.0, wd=4e-05)
Epoch[0] Batch [49]	Speed: 267.692524 samples/sec	accuracy=0.000937	lr=0.000325
Epoch[0] Batch [99]	Speed: 420.260783 samples/sec	accuracy=0.000625	lr=0.000649
Epoch[0] Batch [149]	Speed: 434.132753 samples/sec	accuracy=0.000625	lr=0.000974
Epoch[0] Batch [199]	Speed: 455.834203 samples/sec	accuracy=0.000625	lr=0.001299
Epoch[0] Batch [249]	Speed: 444.076933 samples/sec	accuracy=0.000812	lr=0.001624
Epoch[0] Batch [299]	Speed: 446.958638 samples/sec	accuracy=0.000937	lr=0.001948
Epoch[0] Batch [349]	Speed: 440.735658 samples/sec	accuracy=0.000937	lr=0.002273
Epoch[0] Batch [399]	Speed: 442.374003 samples/sec	accuracy=0.000937	lr=0.002598
Epoch[0] Batch [449]	Speed: 435.325226 samples/sec	accuracy=0.001007	lr=0.002922
Epoch[0] Batch [499]	Speed: 439.740531 samples/sec	accuracy=0.000969	lr=0.003247
Epoch[0] Batch [549]	Speed: 449.363078 samples/sec	accuracy=0.000966	lr=0.003572
Epoch[0] Batch [599]	Speed: 427.282463 samples/sec	accuracy=0.000964	lr=0.003897
Epoch[0] Batch [649]	Speed: 439.999006 samples/sec	accuracy=0.000937	lr=0.004221
Epoch[0] Batch [699]	Speed: 454.338982 samples/sec	accuracy=0.000915	lr=0.004546
Epoch[0] Batch [749]	Speed: 442.066367 samples/sec	accuracy=0.000854	lr=0.004871
Epoch[0] Batch [799]	Speed: 447.217162 samples/sec	accuracy=0.000879	lr=0.005195
Epoch[0] Batch [849]	Speed: 418.756385 samples/sec	accuracy=0.000864	lr=0.005520
Epoch[0] Batch [899]	Speed: 430.115587 samples/sec	accuracy=0.000868	lr=0.005845
Epoch[0] Batch [949]	Speed: 422.384265 samples/sec	accuracy=0.000872	lr=0.006170
Epoch[0] Batch [999]	Speed: 442.137708 samples/sec	accuracy=0.000937	lr=0.006494
...
Epoch[0] Batch [19799]	Speed: 434.382863 samples/sec	accuracy=0.010476	lr=0.128586
Epoch[0] Batch [19849]	Speed: 442.456485 samples/sec	accuracy=0.010524	lr=0.128910
Epoch[0] Batch [19899]	Speed: 431.092918 samples/sec	accuracy=0.010570	lr=0.129235
Epoch[0] Batch [19949]	Speed: 445.330133 samples/sec	accuracy=0.010624	lr=0.129560
Epoch[0] Batch [19999]	Speed: 444.129423 samples/sec	accuracy=0.010666	lr=0.129884
[Epoch 0] training: accuracy=0.010680
[Epoch 0] speed: 437 samples/sec	time cost: 3014.399720
[Epoch 0] validation: err-top1=0.966212 err-top5=0.888407
Epoch[1] Batch [49]	Speed: 441.229930 samples/sec	accuracy=0.030937	lr=0.130326
Epoch[1] Batch [99]	Speed: 431.210921 samples/sec	accuracy=0.029844	lr=0.130651
Epoch[1] Batch [149]	Speed: 451.693710 samples/sec	accuracy=0.028542	lr=0.130975
Epoch[1] Batch [199]	Speed: 453.126118 samples/sec	accuracy=0.027344	lr=0.131300
Epoch[1] Batch [249]	Speed: 439.301388 samples/sec	accuracy=0.027250	lr=0.131625
Epoch[1] Batch [299]	Speed: 452.420660 samples/sec	accuracy=0.028021	lr=0.131950
Epoch[1] Batch [349]	Speed: 456.589121 samples/sec	accuracy=0.028705	lr=0.132274
Epoch[1] Batch [399]	Speed: 441.290773 samples/sec	accuracy=0.028555	lr=0.132599
Epoch[1] Batch [449]	Speed: 443.353213 samples/sec	accuracy=0.028889	lr=0.132924
Epoch[1] Batch [499]	Speed: 455.609001 samples/sec	accuracy=0.029063	lr=0.133248
Epoch[1] Batch [549]	Speed: 435.873114 samples/sec	accuracy=0.029261	lr=0.133573
Epoch[1] Batch [599]	Speed: 435.406145 samples/sec	accuracy=0.028958	lr=0.133898
Epoch[1] Batch [649]	Speed: 432.422730 samples/sec	accuracy=0.028990	lr=0.134223
Epoch[1] Batch [699]	Speed: 445.527597 samples/sec	accuracy=0.028795	lr=0.134547
Epoch[1] Batch [749]	Speed: 445.781965 samples/sec	accuracy=0.028958	lr=0.134872
Epoch[1] Batch [799]	Speed: 437.717070 samples/sec	accuracy=0.029004	lr=0.135197
Epoch[1] Batch [849]	Speed: 450.319020 samples/sec	accuracy=0.028732	lr=0.135521
Epoch[1] Batch [899]	Speed: 446.804164 samples/sec	accuracy=0.028750	lr=0.135846
Epoch[1] Batch [949]	Speed: 448.955765 samples/sec	accuracy=0.028766	lr=0.136171
Epoch[1] Batch [999]	Speed: 429.807388 samples/sec	accuracy=0.028875	lr=0.136496
...
Epoch[1] Batch [19799]	Speed: 445.965960 samples/sec	accuracy=0.054782	lr=0.258587
Epoch[1] Batch [19849]	Speed: 439.394236 samples/sec	accuracy=0.054872	lr=0.258912
Epoch[1] Batch [19899]	Speed: 431.452251 samples/sec	accuracy=0.054946	lr=0.259236
Epoch[1] Batch [19949]	Speed: 445.569749 samples/sec	accuracy=0.054993	lr=0.259561
Epoch[1] Batch [19999]	Speed: 430.832503 samples/sec	accuracy=0.055055	lr=0.259886
[Epoch 1] training: accuracy=0.055072
[Epoch 1] speed: 442 samples/sec	time cost: 2977.685636
[Epoch 1] validation: err-top1=0.901088 err-top5=0.748339
Epoch[2] Batch [49]	Speed: 435.165452 samples/sec	accuracy=0.089375	lr=0.260327
Epoch[2] Batch [99]	Speed: 438.497906 samples/sec	accuracy=0.090313	lr=0.260652
Epoch[2] Batch [149]	Speed: 442.370125 samples/sec	accuracy=0.088438	lr=0.260977
Epoch[2] Batch [199]	Speed: 449.992227 samples/sec	accuracy=0.084687	lr=0.261301
Epoch[2] Batch [249]	Speed: 451.044595 samples/sec	accuracy=0.084187	lr=0.261626
Epoch[2] Batch [299]	Speed: 435.895423 samples/sec	accuracy=0.083646	lr=0.261951
Epoch[2] Batch [349]	Speed: 442.705869 samples/sec	accuracy=0.083571	lr=0.262276
Epoch[2] Batch [399]	Speed: 431.949651 samples/sec	accuracy=0.083086	lr=0.262600
Epoch[2] Batch [449]	Speed: 448.379354 samples/sec	accuracy=0.083403	lr=0.262925
Epoch[2] Batch [499]	Speed: 439.455696 samples/sec	accuracy=0.083531	lr=0.263250
Epoch[2] Batch [549]	Speed: 419.410924 samples/sec	accuracy=0.082812	lr=0.263574
Epoch[2] Batch [599]	Speed: 435.331664 samples/sec	accuracy=0.082474	lr=0.263899
Epoch[2] Batch [649]	Speed: 430.067405 samples/sec	accuracy=0.082187	lr=0.264224
Epoch[2] Batch [699]	Speed: 456.241039 samples/sec	accuracy=0.082388	lr=0.264549
Epoch[2] Batch [749]	Speed: 452.860384 samples/sec	accuracy=0.081917	lr=0.264873
Epoch[2] Batch [799]	Speed: 432.486923 samples/sec	accuracy=0.081738	lr=0.265198
Epoch[2] Batch [849]	Speed: 450.029449 samples/sec	accuracy=0.081801	lr=0.265523
Epoch[2] Batch [899]	Speed: 445.616156 samples/sec	accuracy=0.081233	lr=0.265847
Epoch[2] Batch [949]	Speed: 430.188969 samples/sec	accuracy=0.081299	lr=0.266172
Epoch[2] Batch [999]	Speed: 430.283522 samples/sec	accuracy=0.081641	lr=0.266497
...
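For context: in the Namespace above, train_bottom_constraints='flops-190-params-2.8' and train_upper_constraints='flops-330-params-5.0' define the window that sampled architectures must fall into, with costs taken from a lookup table (flop_param_method='lookup_table'). A minimal sketch of how such a window could be enforced, with hypothetical helper names that are not from the repo:

```python
# Illustrative sketch: reject-sample architectures until one lands inside
# the [bottom, upper] FLOPs/params window parsed from the constraint strings.
def parse_constraint(s):
    # 'flops-190-params-2.8' -> {'flops': 190.0, 'params': 2.8}
    key1, val1, key2, val2 = s.split('-')
    return {key1: float(val1), key2: float(val2)}

def sample_within(bottom, upper, sample_arch, lookup_cost, max_tries=1000):
    lo, hi = parse_constraint(bottom), parse_constraint(upper)
    for _ in range(max_tries):
        arch = sample_arch()               # random block/channel choices
        flops, params = lookup_cost(arch)  # cost from the lookup table
        if lo['flops'] <= flops <= hi['flops'] and \
           lo['params'] <= params <= hi['params']:
            return arch
    raise RuntimeError('no architecture found inside the constraint window')
```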

cavalleria commented on May 28, 2024

> Hi, thanks for your interest. Could you please give --train-constraint-method random a try? I previously found that using evolution constraints from the beginning makes the supernet hard to converge. What I did before was to train the supernet without constraints / with random constraints for the first 30/60 epochs, then use evolution constraints for the rest. Please feel free to let me know whether it helps.

Thanks for your quick reply. I noticed that you set cs-warm-up = false and epoch-start-cs = 0, so I modified my training script according to your training log and ran 3 epochs; the accuracy and validation top-1 error look normal. Then I have some questions:
1. The README describes the supernet training details as follows:

> The reason why we did this in the supernet training is that during our experiments we found that, for a supernet without SE, doing Block Selection from the beginning works well; nevertheless, doing Channel Selection from the beginning causes the network not to converge at all. The Channel Selection range needs to be gradually enlarged, otherwise accuracy crashes in a free fall, and the range can only be allowed to span (0.6 ~ 2.0); smaller channel scales make the network crash too. For a supernet with SE, Channel Selection with the full choices (0.2 ~ 2.0) can be used from the beginning and it converges. However, doing this seems to harm accuracy: compared to the same se-supernet with Channel Selection warm-up, the Channel-Selection-from-scratch model stayed about 10% behind in training accuracy during the whole procedure.
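A minimal sketch of the warm-up idea described in that paragraph (the scheduling details are assumptions for illustration, not the repo's exact implementation):

```python
# Channel Selection warm-up sketch: start from the safe scale range
# (0.6 ~ 2.0) and gradually enlarge it toward the full range (0.2 ~ 2.0).
CANDIDATE_SCALES = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]

def allowed_scales(epoch, epoch_start_cs=0, warmup_epochs=30):
    """Hypothetical helper: lower the minimum channel scale over time."""
    progress = min(max(epoch - epoch_start_cs, 0) / warmup_epochs, 1.0)
    min_scale = 0.6 - progress * (0.6 - 0.2)  # 0.6 -> 0.2 as warm-up ends
    return [s for s in CANDIDATE_SCALES if s >= min_scale - 1e-9]
```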

My understanding is that if use-se = true, channel selection can be used from the beginning and it converges (epoch-start-cs = 0, cs-warm-up = false), but it is left about 10% behind in training accuracy compared to the same se-supernet with channel selection warm-up (epoch-start-cs = 0, cs-warm-up = true). Is that right?
2. If I train the supernet with use-se = true, epoch-start-cs = 0, and cs-warm-up = true, but it can't converge, should I follow --train-constraint-method none / random / evolution (epochs 0~30 / 30~60 / 60~120) to progressively train the supernet?
3. When I use 8 Titan X GPUs, should the learning rate be increased 8 times (8 * 0.65)? Also, I find the GPUs are often idle in multi-GPU training.
thx~

CanyonWind commented on May 28, 2024
  1. Yes, you are right. At least this is the phenomenon I observed in my few experiments.
  2. I'm not sure about this part, because the channel selection warm-up experiment was done quite a while ago; I usually just train the se-supernet with no warm-up now. BTW, --train-constraint-method none / random / evolution (epochs 0~30 / 30~60 / 60~120) is about how to make the evolution constraint work, not about channel selection warm-up. Nevertheless, you are welcome to give it a try.
  3. Yes, this is still a pain in the axx for me too... Multi-GPU support for supernet training is still problematic. However, I don't have much time to spend on it right now. Sorry about that. (On the learning-rate part, see the note below.)
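A note on the learning-rate half of question 3: a common convention for multi-GPU training is the linear scaling rule (a general heuristic, not guidance confirmed by this repo), which scales the base learning rate with the effective batch size and relies on warm-up to keep early training stable:

```python
# Linear scaling rule (common heuristic; not confirmed by this repo):
# the per-GPU batch size stays fixed, so the effective batch grows 8x,
# and the base lr is scaled by the same factor.
base_lr = 0.65                  # single-GPU lr from the log above
num_gpus = 8
scaled_lr = base_lr * num_gpus  # 0.65 * 8 = 5.2
# The log above uses warmup_epochs=5 with warmup_lr=0.0, ramping the lr
# linearly toward its target; such warm-up matters more at a large lr.
```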

CanyonWind commented on May 28, 2024

Closing the issue due to no further response. Please feel free to reopen if necessary.
