Comments (9)
Hi Duo,
Thanks for the interest in this work.
For ImageNet training, you may need to slightly modify the code for the multi-gpu training setup. What I did was to follow this repo: "https://github.com/pytorch/examples/tree/master/imagenet".
I am sorry, but I don't remember the exact loss curve for the ImageNet dataset; shape-adaptor performance should always be similar to or better than the human-designed version, though.
The low initial accuracy might be because the ResNet has been compressed significantly in AutoSC mode. To ensure fairness, we typically increase the network width so that the compressed "wider" network matches the human-designed network in FLOPs, as we showed in the MobileNet experiment in the original paper.
Another important note:
The update step for shape-adaptor parameters on ImageNet is 1500; this might be why your performance appears low. Please check and follow the hyper-parameter table in Appendix B.
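To make the role of that hyper-parameter concrete, here is a minimal sketch of the schedule, assuming the update step means "update the shape-adaptor (alpha) parameters once every N weight-training iterations". The names `should_update_alpha` and `alpha_step` are illustrative, not identifiers from the repo:

```python
# Hedged sketch of the alpha update schedule; names are illustrative.
def should_update_alpha(iteration: int, alpha_step: int) -> bool:
    """Update shape-adaptor (alpha) parameters every `alpha_step` iterations."""
    return iteration % alpha_step == 0

# With the standard ImageNet setting from Appendix B (step = 1500),
# alpha is updated far less often than the network weights:
alpha_updates = [i for i in range(1, 10_001) if should_update_alpha(i, 1500)]
print(alpha_updates)  # [1500, 3000, 4500, 6000, 7500, 9000]
```

Using a much smaller step than intended would update the shape far more aggressively, which matches the low-accuracy symptom described above.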
Please let me know if you have further questions or updated results.
from shape-adaptor.
Yes, ImageNet training can easily run into OOM issues with a very small step size. But that shouldn't happen in AutoSC mode, where memory is bounded by the human-designed network. You can check the printed shapes to make sure the learned shape is as expected.
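A back-of-the-envelope sketch of why an unexpectedly large learned shape causes OOM on ImageNet (all numbers below are illustrative, not from the paper): activation memory per layer scales with batch × channels × height × width, so a learned spatial size twice the human-designed one quadruples that layer's activation memory.

```python
# Back-of-the-envelope activation-memory estimate; numbers are illustrative.
def activation_mb(batch: int, channels: int, h: int, w: int) -> float:
    """Memory of one float32 feature map in megabytes."""
    return batch * channels * h * w * 4 / 1024**2

human   = activation_mb(32, 256, 56, 56)    # human-designed spatial size
learned = activation_mb(32, 256, 112, 112)  # learned shape twice as large
assert learned == 4 * human  # doubling H and W quadruples activation memory
print(f"{human:.0f} MB -> {learned:.0f} MB")  # 98 MB -> 392 MB
```

This is why checking the printed shapes is the quickest diagnostic: if the learned shape stays at or below the human-designed one, memory should be bounded too.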
I will contact the co-author regarding the ImageNet training script, and will get back to you soon.
Thanks for your prompt reply!
I have checked that the learned shape is smaller than the human-designed one, so it is strange that OOM issues occur.
Looking forward to your news on the ImageNet training script. Thanks.
Hello again, I have now updated the ImageNet training file. Sorry for the mistake: we actually trained AutoSC on ImageNet with step size 200, and the standard shape adaptor with step size 1500.
As I mentioned above, we used the official ImageNet training script as a guideline, so the training logic for each flag should be similar. Since I currently don't have a multi-GPU setup to test the script, I am not 100% confident that it is bug-free, but any remaining bug should be minor and easy to fix.
Please let me know whether this works.
Hi, many thanks for your timely update.
After training one epoch in human-designed mode, it reports:
Traceback (most recent call last):
File "model_training_imagenet.py", line 481, in <module>
main()
File "model_training_imagenet.py", line 128, in main
main_worker(args.gpu, ngpus_per_node, args)
File "model_training_imagenet.py", line 260, in main_worker
flops, params = profile(model, inputs=(input_data, ), verbose=False)
File "/home/lid/anaconda3/lib/python3.7/site-packages/thop/profile.py", line 188, in profile
model(*inputs)
File "/home/lid/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lid/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 149, in forward
"them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
After changing the line
flops, params = profile(model, inputs=(input_data, ), verbose=False)
to
flops, params = profile(model.module, inputs=(input_data, ), verbose=False)
it still reports the following (at what should be the beginning of the second epoch):
Traceback (most recent call last):
File "model_training_imagenet.py", line 481, in <module>
main()
File "model_training_imagenet.py", line 128, in main
main_worker(args.gpu, ngpus_per_node, args)
File "model_training_imagenet.py", line 254, in main_worker
loss_val, train_acc = weight_train(train_loader, model, criterion, weight_optimizer, alpha_optimizer, epoch, args)
File "model_training_imagenet.py", line 342, in weight_train
output = model(images)
File "/home/lid/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lid/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 149, in forward
"them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
I have checked, and it seems that none of the model's parameters are on the CPU. Have you ever run into similar issues?
Really sorry about the issue, but I did not encounter any bug like the one you showed here. It shouldn't be a model.module issue in the FLOP-counting operation.
Just to make sure: did you run the script with the following command? In human-designed mode (original ResNet), you should run
python model_training_imagenet.py --dist-url 'tcp://127.0.0.1:0' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --data YOUR_IMAGENET_PATH --network resnet --mode human-imagenet
Thanks a lot for specifying how you run it. The problem was probably that I ran it in DataParallel mode instead of DistributedDataParallel. Now it runs smoothly following your instructions, and I will update the final results.
By the way, I think model.module is now needed to output model.module.shape_list. What do you think?
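For readers hitting the same thing: torch.nn.DataParallel and DistributedDataParallel store the wrapped network as .module and do not forward custom attributes such as shape_list, so they must be read through the wrapper. A torch-free toy stand-in (Wrapper and Net are hypothetical names, not code from the repo) shows the access pattern:

```python
# Toy stand-in for DataParallel's wrapping behavior; no torch required.
class Wrapper:
    """Like nn.DataParallel: keeps the real network as `.module` and
    forwards calls, but not arbitrary custom attributes."""
    def __init__(self, module):
        self.module = module
    def __call__(self, x):
        return self.module(x)

class Net:
    def __init__(self):
        self.shape_list = [224, 112, 56, 28, 14, 7]  # illustrative values
    def __call__(self, x):
        return x

model = Wrapper(Net())
# model.shape_list would raise AttributeError; unwrap via `.module` instead:
print(model.module.shape_list)  # [224, 112, 56, 28, 14, 7]
```

The same unwrapping applies when a tool needs the bare network rather than the parallel wrapper, as in the profile(model.module, ...) change above.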
Good to hear it works. Yes, I will modify that accordingly.
So, just to make sure: are the two model.module.shape_list changes the only thing we need? And is model.module still required for the FLOP count, as shown in the previous post?
Yes, in my experience merely changing the two model.module.shape_list references works; the thop-related code does not need modification now. Thanks for all your assistance!