
Conv2D_CFU

Conv2D acceleration using CFU Playground framework, Mini-Project, Jan - May 2022

This repository holds code/experiments for the project Conv2D Acceleration using the framework CFU Playground. This is a forked repository of:

CFU-Playground

Brief Description

The convolution operation accounts for a major share of the cycles spent in Deep Neural Networks, so accelerating it can reduce inference time. Convolution exposes several dimensions of parallelism, which must be coupled with a good dataflow to benefit from memory re-use. Several frameworks exist to accelerate the convolution operation, but most are designed in isolation, or accelerate the whole network on hardware while the CPU core merely performs the initial data transfer. CFU Playground enables development of an accelerator in an integrated SoC environment, avoiding the storage and network bottlenecks that can arise when designing in isolation, while significant operations still run on the VexRiscv core. The accelerator, called a CFU (Custom Function Unit), is invoked from the TFLite kernels using macros, and since a given kernel can be re-used across multiple networks, this offers more flexibility in terms of hardware.

Setting up the environment

The documentation of the framework provides clear guidelines for setting up the environment. Most of the dependencies are open source; the only proprietary toolchain needed is Xilinx Vivado. The following link guides the user through building the environment:

Setup Guide

Getting Started and Software Baseline

A few examples from the source were run using Renode simulation, and these can help in getting started with the framework. The files are available at:

Renode Examples

However, as Renode is not cycle-accurate, in practice either Verilator or an actual FPGA is to be used. A software baseline for an MNIST neural network was developed using the TFLite model, and it was profiled to identify the bottlenecks. It was further analysed to identify opportunities for parallelism as well as for input data re-use. The code is made available at:

Software Baseline

CFU Hardware Accelerator

The hardware accelerator was developed based on these inferences and was optimised iteratively, varying aspects such as the cache structure, the degree of data re-use and the amount of parallelism, until significant performance was obtained. The accelerator was placed and routed on a Nexys4 Artix-7 FPGA, and was successfully tested. The code for the hardware accelerator is available at:

Accelerator

Results and Conclusions

  • The framework CFU-Playground was reviewed against other existing frameworks, and its particular advantages, such as the integrated SoC environment and significant use of the CPU core beyond the initial data transfer, have been exploited in this work.

  • The Conv2D operation for the 3x3 case in an MNIST neural network was analysed, and the software baseline was set up to locate the bottlenecks. The baseline for the given network took 335 M cycles to execute on a VexRiscv core placed and routed on a Nexys4 Artix-7 FPGA. The Conv2D operations consumed 334 M of these cycles, and within Conv2D the MAC operations were the bottleneck, executing for 310 M cycles. The code was unrolled to reduce the loop overheads and to gain from the spatial locality of the cache, which reduced the cycle count to 220 M cycles.

  • Using methods for parallelism such as SIMD accumulation along the input depth, parallel computation of independent strides, and input data re-use across strides as well as across multiple output channels, an accelerator was built. When synthesised, the integrated core had a critical path of 9.446 ns compared to 8.785 ns for the bare core, indicating only a minimal increase owing to the integrated SoC environment. The inference took 15 M cycles on this integrated core.

  • The cache of the core was slightly modified to lower the number of bytes per line, and on synthesis the critical path improved to 9.338 ns. The network took 13 M cycles to execute on this integrated core.

  • The overall speed-up obtained was 26x for the base network, and when tested for kernel re-use on a larger network the speed-up obtained was 33x. This assumes the baseline and the integrated accelerator run at the same clock, since the difference in critical paths is minimal. Thus, the framework provides a better environment for the development of accelerators, and it was used here to accelerate 3x3 Conv2D kernels of an MNIST neural network with a significant reduction in cycles.
