
Comments (9)

lorenmt commented on May 30, 2024

Hi Duo,

Thanks for your interest in this work.

For ImageNet training, you may need to slightly modify the code for the multi-gpu training setup. What I did was to follow this repo: "https://github.com/pytorch/examples/tree/master/imagenet".

I am sorry that I cannot recall the exact loss curve for the ImageNet dataset, but shape-adaptor performance should always be similar to or better than the human-designed version.

The initially low accuracy is likely because the ResNet has been compressed significantly in AutoSC mode. To ensure a fair comparison, we typically increase the network width so that the compressed but "wider" network matches human-designed networks in FLOPs, as we showed in the MobileNet experiment in the original paper.
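As a back-of-the-envelope sketch of that FLOPs-matching idea (my own illustration, not code from the repo): conv FLOPs scale roughly with the product of input and output channels, so widening every layer by a uniform factor w multiplies FLOPs by about w^2, which suggests w = sqrt(r) to recover a target FLOPs ratio r.

```python
import math

# Illustrative sketch only (not from the shape-adaptor repo): conv FLOPs scale
# roughly quadratically with a uniform channel-width multiplier w, so to reach
# a target/compressed FLOPs ratio r, pick w = sqrt(r).
def width_multiplier(flops_compressed, flops_target):
    """Approximate uniform channel-width multiplier to match a FLOPs budget."""
    return math.sqrt(flops_target / flops_compressed)

# e.g. if AutoSC compressed the network to 1/4 of the original FLOPs,
# widening every layer's channels by ~2x restores the budget
print(width_multiplier(1.0, 4.0))  # -> 2.0
```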

Another important note:
The update step for shape adaptor parameters on ImageNet is 1500. This might be why your performance appears low. Please check and follow the hyper-parameter table in Appendix B.

Please let me know if you have further questions or updated results.

from shape-adaptor.

lorenmt commented on May 30, 2024

Yes, ImageNet runs can easily hit OOM issues when the step size is very small. But that shouldn't happen in AutoSC mode, where memory is bounded by the human-designed network. You can check the printed shapes to make sure the learned shape is as expected.

I will contact the co-author regarding the ImageNet training script and get back to you soon.

d-li14 commented on May 30, 2024

Thanks for your prompt reply!
I have checked that the learned shape is smaller than the human-designed one, so it is abnormal to run into OOM issues.
Looking forward to your news about the ImageNet training script. Thanks.

lorenmt commented on May 30, 2024

Hello again, I have now updated the ImageNet training file. Sorry for the mistake: we actually trained AutoSC on ImageNet with a step size of 200, and the standard shape adaptor with a step size of 1500.

As I mentioned above, we used the official ImageNet training script as a guideline, so the training script should behave similarly for each flag. Since I currently don't have a multi-GPU setup to test the script, I am not 100% confident that it is bug-free. But if there is any bug, it should be very minor and quite easy to fix.

Please let me know whether this works.

d-li14 commented on May 30, 2024

Hi, many thanks for the timely update.
After training one epoch in human-designed mode, it reports:

Traceback (most recent call last):                                                                                                                   
  File "model_training_imagenet.py", line 481, in <module>                                                                                           
    main()                                                                                                                                           
  File "model_training_imagenet.py", line 128, in main                                                                                               
    main_worker(args.gpu, ngpus_per_node, args)                                                                                                      
  File "model_training_imagenet.py", line 260, in main_worker                                                                                        
    flops, params = profile(model, inputs=(input_data, ), verbose=False)                                                                             
  File "/home/lid/anaconda3/lib/python3.7/site-packages/thop/profile.py", line 188, in profile                                                       
    model(*inputs)                                                                                                                                   
  File "/home/lid/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__                                           
    result = self.forward(*input, **kwargs)                                                                                                          
  File "/home/lid/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 149, in forward                                    
    "them on device: {}".format(self.src_device_obj, t.device))                                                                                      
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu  

After changing the line flops, params = profile(model, inputs=(input_data, ), verbose=False) to flops, params = profile(model.module, inputs=(input_data, ), verbose=False), it still reports the following (at what should be the beginning of the second epoch):

Traceback (most recent call last):
  File "model_training_imagenet.py", line 481, in <module>
    main()
  File "model_training_imagenet.py", line 128, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "model_training_imagenet.py", line 254, in main_worker
    loss_val, train_acc = weight_train(train_loader, model, criterion, weight_optimizer, alpha_optimizer, epoch, args)
  File "model_training_imagenet.py", line 342, in weight_train
    output = model(images)
  File "/home/lid/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lid/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 149, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

I have checked, but it seems that none of the model's parameters are on the CPU. Have you ever met similar issues?
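For what it's worth, here is a minimal sketch of the unwrapping idea (my own illustration, assuming torch is installed; thop's profile just runs a forward pass, so a plain forward reproduces the same device check): forwarding the unwrapped model.module with an input on the same device sidesteps the DataParallel device assertion in the traceback.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumes torch is installed): nn.DataParallel insists that
# inputs and parameters sit on device_ids[0], which is what the RuntimeError
# above complains about. Forwarding the unwrapped model.module with an input
# on the same device avoids that check entirely.
model = nn.DataParallel(nn.Conv2d(3, 8, 3))
dummy = torch.randn(1, 3, 32, 32)  # on a CUDA box you would also do dummy = dummy.cuda()

out = model.module(dummy)  # unwrap before profiling/forwarding
assert tuple(out.shape) == (1, 8, 30, 30)  # 3x3 conv: 32x32 -> 30x30
```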

lorenmt commented on May 30, 2024

Really sorry about the issue, but I did not encounter any such bug on my end.

It shouldn't be a model.module issue in the FLOP-counting operation.

Just to make sure: did you run the script with the following command? In human-designed mode (original ResNet), you should run

python model_training_imagenet.py --dist-url 'tcp://127.0.0.1:0' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --data YOUR_IMAGENET_PATH --network resnet --mode human-imagenet

d-li14 commented on May 30, 2024

Thanks a lot for specifying your way of running. Maybe because I run it in the DataParallel mode instead of DistributedDataParallel. Now it runs smoothly following your instruction and I may update the final results.
By the way, I think model.module is needed now to output the model.module.shape_list. How do you think about that?
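A tiny illustration of that point (my own stand-in class, not the repo's actual model): nn.DataParallel only forwards parameters, buffers, and submodules, so a plain Python attribute like shape_list is reachable only through .module.

```python
import torch.nn as nn

# Illustrative stand-in for the repo's model, not its actual code: DataParallel
# does not forward plain Python attributes, so an attribute such as shape_list
# must be read through .module.
class ShapeAdaptorNet(nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        self.shape_list = [32, 16, 8]  # illustrative values only

model = nn.DataParallel(ShapeAdaptorNet())
print(model.module.shape_list)  # -> [32, 16, 8]
assert not hasattr(model, "shape_list")  # the wrapper itself has no such attribute
```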

lorenmt commented on May 30, 2024

Good to hear it works. Yes, I will modify that accordingly.

So, just to make sure: is model.module.shape_list the only thing that needs to change? Is model.module still required for the FLOP count to work, as in your previous post?

d-li14 commented on May 30, 2024

Yes, merely changing the two model.module.shape_list references works in my experience. The thop-related code does not need modification now. Thanks for all your assistance!
