Fast-MobileNetV2

Optimized CUDA Kernels for Fast MobileNetV2 Inference

Development Steps

① Implement MobileNetV2 with PyTorch, and parse the given ONNX model with Python to analyze the network structure. --- mobilenet_v2/nn/onnx/
② Implement MobileNetV2 with C++ (only sequential layer structures and weights, no forward computation), and parse the given ONNX model with Python to extract the weights. --- mobilenet_v2/nn/
③ Implement wrappers and tests for cuDNN/cuBLAS primitives: Conv, Gemm, and Pool. --- mobilenet_v2/cudnn/
- Here, Gemm can be implemented with cuBLAS or treated as a 1x1 Conv2d in cuDNN; we take the former approach.
④ Implement cuDNN-accelerated MobileNetV2 using the wrappers and the C++ network implemented above. --- mobilenet_v2/cudnn/
⑤ Implement and optimize CUDA kernels: Conv, Gemm, and Pool. --- mobilenet_v2/fast_mobilenet/
- Here, Conv can be implemented using Im2Col + Gemm or the Winograd algorithm; we only implemented the former.
⑥ Implement our Fast-MobileNetV2 as a whole. --- mobilenet_v2/fast_mobilenet/
⑦ Compare and optimize: e.g. parameter tuning, model-specific / hardware-specific optimization, ...
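
The Im2Col + Gemm approach mentioned in step ⑤ can be illustrated with a small CPU reference. This is only a sketch in plain Python, not the repo's CUDA kernel: the function names (`im2col`, `gemm`, `conv2d_im2col`), the flat CHW input layout, and the batch size of 1 are assumptions made for illustration. It also shows why a 1x1 convolution reduces to a plain Gemm: with a 1x1 kernel the unfolded matrix is just the input's channel planes.

```python
def im2col(x, C, H, W, kh, kw, stride=1, pad=0):
    """Unfold a flat CHW image into a (C*kh*kw) x (out_h*out_w) matrix."""
    out_h = (H + 2 * pad - kh) // stride + 1
    out_w = (W + 2 * pad - kw) // stride + 1
    cols = []
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                row = []
                for oy in range(out_h):
                    for ox in range(out_w):
                        y = oy * stride - pad + i
                        xx = ox * stride - pad + j
                        inside = 0 <= y < H and 0 <= xx < W
                        # Out-of-bounds reads correspond to zero padding.
                        row.append(x[(c * H + y) * W + xx] if inside else 0.0)
                cols.append(row)
    return cols, out_h, out_w


def gemm(a, b):
    """Naive product of nested-list matrices: (M x K) times (K x N)."""
    b_cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in b_cols] for row in a]


def conv2d_im2col(x, w, C, H, W, kh, kw, stride=1, pad=0):
    """w is a K x (C*kh*kw) matrix of flattened filters.

    Returns a K x (out_h*out_w) output matrix plus the output height and width.
    """
    cols, out_h, out_w = im2col(x, C, H, W, kh, kw, stride, pad)
    return gemm(w, cols), out_h, out_w
```

A reference like this is mainly useful for checking a GPU kernel's output on small shapes; the unfolded matrix duplicates input values, so a real kernel has to trade that extra memory against the efficiency of a single large Gemm.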

Test Steps

nn

Re-implement MobileNetV2 ONNX model with PyTorch and test inference:

(conda) >> cd mobilenet_v2/nn/onnx/
(conda) >> python pytorchMobileNetV2.py

Save weights in MobileNetV2 ONNX model to plain-text files:

(conda) >> cd mobilenet_v2/nn/weights/
(conda) >> python save_weights.py
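
For readers who want to see what a plain-text weight dump might look like, here is a minimal sketch. The file layout (one line with the shape, one line with the flattened values) and the names `save_tensor` / `load_tensor` are assumptions for illustration; the actual format written by save_weights.py may differ. In a real script the values would come from the ONNX graph initializers (for example via `onnx.numpy_helper.to_array`), which this stdlib-only sketch leaves out.

```python
def save_tensor(path, shape, values):
    """Write a tensor as two text lines: the shape, then the flattened values."""
    count = 1
    for d in shape:
        count *= d
    if count != len(values):
        raise ValueError("shape does not match number of values")
    with open(path, "w") as f:
        f.write(" ".join(str(d) for d in shape) + "\n")
        # repr() keeps enough digits for the float to round-trip exactly.
        f.write(" ".join(repr(float(v)) for v in values) + "\n")


def load_tensor(path):
    """Read back a tensor written by save_tensor (hypothetical format)."""
    with open(path) as f:
        shape = [int(t) for t in f.readline().split()]
        values = [float(t) for t in f.readline().split()]
    return shape, values
```

Plain text is slower and larger than a binary dump, but it is easy to inspect by hand and to parse from C++ with a simple stream read.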

Show MobileNetV2 topology in C++ and check loaded weights:

>> cd mobilenet_v2/nn/examples/
>> make show
>> ./show.out
>> make check
>> ./check.out
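
As a rough picture of the topology being shown, the sketch below enumerates the bottleneck blocks from the (t, c, n, s) settings in Table 2 of the MobileNetV2 paper [1]. The function name `bottleneck_blocks` and the dictionary fields are hypothetical, and the given ONNX model may organize or fuse layers differently, so treat this as a paper-level outline and not a description of the repo's C++ classes.

```python
# (expansion t, output channels c, repeats n, first stride s), per the paper.
SETTINGS = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]


def bottleneck_blocks(in_channels=32):
    """List every inverted-residual block with its channel sizes and stride."""
    blocks = []
    c_in = in_channels  # channels produced by the first 3x3 convolution
    for t, c, n, s in SETTINGS:
        for i in range(n):
            stride = s if i == 0 else 1  # only the first repeat downsamples
            blocks.append({
                "in": c_in,
                "hidden": c_in * t,  # width after the 1x1 expansion
                "out": c,
                "stride": stride,
                # The skip connection (Add) needs matching shapes.
                "residual": stride == 1 and c_in == c,
            })
            c_in = c
    return blocks


if __name__ == "__main__":
    for idx, b in enumerate(bottleneck_blocks()):
        print(idx, b)
```

Under these settings the network has 17 bottleneck blocks, 10 of which carry a residual Add.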

cudnn

Show the versions of CUDA and cuDNN:

>> cd mobilenet_v2/cudnn/
>> bash version.sh

Operator tests:

>> cd mobilenet_v2/cudnn/tests/test_op/
>> make
>> ./testConv.o
>> ./testGemm.o
>> ./testPool.o
>> ./testAdd.o

Network test:

(conda) >> cd mobilenet_v2/cudnn/tests/test_net/
(conda) >> python generate_data.py
(conda) >> conda deactivate
>> make
>> ./testCudnnMobileNetV2.o
>> source ~/.bashrc
(conda) >> python compare_cudnn_onnx.py
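
The comparison step boils down to checking two output vectors against each other within a tolerance. The sketch below shows one way this could be done; the helper names and the default tolerances (`rtol=1e-3`, `atol=1e-5`) are assumptions, and compare_cudnn_onnx.py may use different checks.

```python
def max_abs_diff(out, ref):
    """Largest element-wise absolute difference between two flat outputs."""
    if len(out) != len(ref):
        raise ValueError("outputs have different lengths")
    return max(abs(x - y) for x, y in zip(out, ref))


def all_close(out, ref, rtol=1e-3, atol=1e-5):
    """True if every element satisfies |out - ref| <= atol + rtol * |ref|."""
    if len(out) != len(ref):
        return False
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(out, ref))


def top1(scores):
    """Index of the highest score, i.e. the predicted class."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Small floating-point differences between cuDNN and ONNX results are expected because the two may accumulate sums in a different order, so a tolerance check plus top-1 agreement is usually more informative than exact equality.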

our kernels

Operator tests:

>> cd mobilenet_v2/fast_mobilenet/tests/test_op/
>> make
>> ./testConv.o
>> ./testGemm.o
>> ./testPool.o
>> ./testAdd.o
>> ./testIm2Col.o

Network test:

(conda) >> cd mobilenet_v2/fast_mobilenet/tests/test_net/
(conda) >> python generate_data.py
(conda) >> conda deactivate
>> make
>> ./testFastMobileNetV2.o
>> source ~/.bashrc
(conda) >> python compare_fast_onnx.py

Test Environment

NVIDIA Tesla V100 GPU
CUDA version 10.2.89
cuDNN version 8.2.4
Run the Python sources of this repo in an Anaconda environment (we use Python 3.9.7)
Do NOT run the CUDA sources of this repo in an Anaconda environment

Tech Stack

MobileNetV2: Inverted Residuals and Linear Bottlenecks
ONNX Python API
cuDNN and cuBLAS API
CUDA C++ Programming
GPU Architecture and Compiler Optimization

Reference

[1] Sandler, Mark, et al. "Mobilenetv2: Inverted residuals and linear bottlenecks." Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 2018.

[2] NVIDIA Corporation. "NVIDIA cuDNN Documentation." available at: https://docs.nvidia.com/deeplearning/cudnn/api/index.html

[3] NVIDIA Corporation. "NVIDIA cuBLAS Documentation." available at: https://docs.nvidia.com/cuda/cublas/index.html

[4] Lavin, Andrew, and Scott Gray. "Fast algorithms for convolutional neural networks." Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 2016.

[5] Mark Harris. "CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops." available at: https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

[6] Mark Harris. "Optimizing Parallel Reduction in CUDA." available at: https://vuduc.org/teaching/cse6230-hpcta-fa12/slides/cse6230-fa12--05b-reduction-notes.pdf
