

CudAI

Sean Foley, Kaitlyn Keil, Kevin Zhang

The Goal

This project is an exploration into machine learning and CUDA programming. We intend to make a vanilla machine learning (ML) algorithm using only vectors to accomplish a basic task and then parallelize it with our graphics card’s computational power. Specifically, this takes the form of a backpropagation neural network (BNN) that can start from nothing and learn how to do a simple classification task, such as predicting XOR outputs. A possible stretch goal is increasing the complexity of our BNN to make predictions on data with more features, such as the MNIST digits dataset or house price prediction. We will also make our code as approachable as we can, and informative enough that other people can use our repo to learn about CUDA and ML.

Learning Goals

Our main goals for this project are to learn the theory behind neural networks and how they are implemented in code; to understand the basics of parallel computing, which includes learning NVIDIA's CUDA platform and seeing how it boosts performance; and to build on our current knowledge of C by jumping into the more practical world of C++.

Our Accomplishment

By the end of this project, we were able to successfully implement our own neural network. Our neural network is made up of many neurons arranged in layers. Each of these neurons is what is known as a perceptron. Every neuron has a weighted connection to all the neurons in the previous layer and all the neurons in the next layer. Each neuron sums its weighted inputs, applies the network's transfer function, and passes the result to the next layer, until the output layer is reached. To begin with, these weights are random. During supervised training, the network takes inputs, pushes them through the layers, and reports the output. Based on the error between the reported output and the desired output, it then corrects the weights through backpropagation: the error is used to compute a gradient, and the weights are adjusted via gradient descent to minimize the error.
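To make the feed-forward and weight-update steps concrete, here is a minimal, self-contained sketch of the math for a single neuron. The names, learning rate, and transfer function are illustrative only, not taken from our classes:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Minimal sketch of the per-neuron math described above (hypothetical names,
// not the classes from our repo): weighted sum of inputs, a tanh transfer
// function, and a gradient-descent weight update driven by the output error.
double transfer(double x) { return std::tanh(x); }
double transferDerivative(double out) { return 1.0 - out * out; } // derivative expressed via the tanh output

int main() {
    std::vector<double> inputs  = {1.0, 0.0};   // one XOR sample
    std::vector<double> weights = {0.3, -0.2};  // weights start out random
    double target = 1.0, eta = 0.15;            // desired output and learning rate

    // Feed forward: weighted sum, then the transfer function.
    double sum = 0.0;
    for (size_t i = 0; i < inputs.size(); ++i) sum += inputs[i] * weights[i];
    double output = transfer(sum);

    // Backpropagation step for this single neuron: gradient of the error
    // with respect to each weight, then a small step downhill.
    double delta = (target - output) * transferDerivative(output);
    for (size_t i = 0; i < inputs.size(); ++i) weights[i] += eta * delta * inputs[i];

    std::printf("output %.3f, updated w0 %.3f, w1 %.3f\n", output, weights[0], weights[1]);
    return 0;
}
```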

[Animation: code architecture]

Figure 1: Code architecture. Networks contain layers, layers contain neurons, and neurons contain connections to other neurons.

Our project consists of two major milestones. The first is a working, non-parallel neural network in C++. As shown in Figure 1, it is structured as classes--specifically, a Network, a number of Layers in the network, and several Neurons per layer; each Neuron holds a Connection struct for each of the Neurons in the neighboring layers, and these Connections hold the weights and deltas. This network reports a final average error of about 0.002 on an XOR data set, taking 0.335 seconds to train on 100,000 samples (averaged over 100 runs), as shown in Figure 2.

[Images: C++ initial values and C++ final results]

Figure 2: Output from the C++ version of the neural network, our first iteration. The network starts off with a fairly high error and incorrect predictions, but after 10,000 trials the error is brought close to 0, with essentially correct predictions every time.
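To make the nesting in Figure 1 concrete, below is a simplified sketch of that class hierarchy. It is illustrative only; the real classes in the repo carry more state and methods (feed-forward, backpropagation, transfer functions, and so on).

```cpp
#include <vector>

// Simplified sketch of the hierarchy in Figure 1, not the full implementation:
// Connections carry weights and deltas, Neurons carry Connections, Layers carry
// Neurons, and the Network carries Layers.
struct Connection {
    double weight;       // strength of the link to a neuron in a neighboring layer
    double deltaWeight;  // last change applied to the weight
};

class Neuron {
public:
    std::vector<Connection> outputWeights; // one Connection per neuron in the next layer
    double outputValue = 0.0;
    double gradient = 0.0;
};

typedef std::vector<Neuron> Layer;

class Network {
public:
    std::vector<Layer> layers; // layers.front() is the input layer, layers.back() the output layer
};

int main() {
    Network net;
    net.layers.assign(3, Layer(3)); // a small fixed topology, just to show the nesting
    return 0;
}
```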

The next milestone was incorporating CUDA to parallelize the processing and speed up the neural network. To ease the transition from C++ to CUDA, we built a number of simple programs to help us understand CUDA programming. For instance, managed_working_example.cu is a testbed for understanding the unified memory system.
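The shape of such a testbed looks roughly like the following. This is a generic unified-memory sketch rather than the actual contents of managed_working_example.cu:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Generic unified-memory sketch (not the actual contents of
// managed_working_example.cu): allocate one array visible to both CPU and GPU,
// launch a kernel over it, and synchronize before reading the results back.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // one element per thread
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 10;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float)); // unified memory: the same pointer works on host and device
    for (int i = 0; i < n; ++i) data[i] = float(i);

    addOne<<<(n + 255) / 256, 256>>>(data, n);   // enough 256-thread blocks to cover the array
    cudaDeviceSynchronize();                     // block until the GPU has finished

    std::printf("data[0] = %f, data[%d] = %f\n", data[0], n - 1, data[n - 1]);
    cudaFree(data);
    return 0;
}
```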

[Diagram: unified memory]

Figure 3: A high-level graphic showing the relationship between the CPU, where processes generally start (the blue), and the GPU, which is used in parallel for its speed (the green). The left side shows how managing memory between the two processors is difficult when the two memory banks are separate; the right side shows how unified memory can make using the GPU easier by simplifying memory usage. Note that "dramatically lowering developer effort" only applies to the API itself; the user still has just as hard a time learning how to use it… Image source

The most notable obstacle with incorporating CUDA is memory. Using CUDA means using your graphics card's computational power to supplement your CPU. In general, a program runs as a process that uses the CPU's memory to read and write data and to perform operations. CUDA allows you to move work onto the GPU temporarily, which can speed up suitable operations significantly, and the GPU can multithread at far larger scales than the CPU, allowing you to parallelize your program by orders of magnitude. Moving from the CPU (the "host") to the GPU (the "device") so the GPU's power can be utilized boils down to a memory issue, as shown by the graphic in Figure 3, which gives a simplified view of the CPU and GPU. In order to use CUDA's interface with the GPU, memory must be allocated properly so that the GPU can interact with it. This includes creating unified memory, properly initializing variables, tagging functions so they are correctly identified as host-side or device-side, and more.
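For example, functions have to be tagged so the compiler knows where they are allowed to run. The snippet below is a generic illustration of that tagging, not code from our repo:

```cpp
#include <math.h>
#include <cstdio>
#include <cuda_runtime.h>

// Generic illustration of CUDA function tagging (not code from our repo):
// __host__ __device__ functions may run on either processor, while __global__
// functions are kernels launched from the host and executed on the device.
__host__ __device__ float transfer(float x) { return tanhf(x); }

__global__ void applyTransfer(float *values, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) values[i] = transfer(values[i]); // device-side call of the tagged function
}

int main() {
    const int n = 4;
    float *values = nullptr;
    cudaMallocManaged(&values, n * sizeof(float));
    for (int i = 0; i < n; ++i) values[i] = 0.5f * i;

    applyTransfer<<<1, n>>>(values, n);
    cudaDeviceSynchronize();

    std::printf("transfer(1.0) on the host: %f; values[2] from the device: %f\n",
                transfer(1.0f), values[2]);
    cudaFree(values);
    return 0;
}
```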

Trying to integrate our object-oriented program with unified memory brings up a number of peculiarities of CUDA programming. From our research, it seems that not many others have tried to use nested classes with CUDA. There seem to be two main reasons for this. First, for CUDA code to be fast, data and functions need to be loaded locally onto the GPU. We need to explicitly allocate space on the correct device for data, and tag functions as GPU- or CPU-specific. As a result, basic facilities like C++ vectors and their operations are not available on the GPU; we have to explicitly write most of the functions we want to use there. The second reason for avoiding nested classes builds on the first: object-oriented code becomes very complicated when you have to be explicit about the memory locations of functions and attributes. In particular, initializing custom objects that contain other objects is tricky, and the documentation on this is sparse.
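Our approach to this, following the Unified Memory in CUDA 6 post listed in the resources, was a Managed base class that overrides operator new and operator delete so that every derived object lives in unified memory. The sketch below shows the general pattern; the exact class in our repo may differ in details:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Sketch of a Managed base class in the style of the "Unified Memory in CUDA 6"
// post (our own class may differ in details): anything that inherits from it is
// placed in unified memory when created with new, so the same object can be
// reached from both host and device code.
class Managed {
public:
    void *operator new(std::size_t len) {
        void *ptr = nullptr;
        cudaMallocManaged(&ptr, len); // a host-only call: the root of the device-side flaw discussed below
        cudaDeviceSynchronize();
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaDeviceSynchronize();
        cudaFree(ptr);
    }
};

// Example: a neuron-like object whose storage is visible to GPU kernels.
class ManagedNeuron : public Managed {
public:
    float outputValue = 0.0f;
};

int main() {
    ManagedNeuron *n = new ManagedNeuron(); // allocated in unified memory
    n->outputValue = 1.0f;
    delete n;
    return 0;
}
```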

By inundating our class definitions with cudaMallocManaged calls and jumping back and forth between host and device, we are able to create a neural network that uses multiple threads on the GPU to train. Running it against a profiler, we get the results seen in Figure 4. The first thing that stands out is how much slower this is: almost 300 times slower, at 1 minute 37 seconds (run only once; we did not repeat the time trials), despite the parallelization. The time sink is most likely how often we move between device and host, which is a costly switch. To avoid this slowdown, most CUDA programs copy their memory over and enter the GPU once, perform all their operations on the device side, and make one final return to the CPU host.

[Images: CUDA time trial and profiler breakdown]

Figure 4: Time trial (top) and NVIDIA profiler breakdown (bottom) of the runtime of the neural network using CUDA. The slowdown relative to the C++ run above is thought to come from the large number of swaps between host and device in our implementation. For example, the breakdown shows about 36 GB of device-to-host unified memory traffic, and the function with the longest runtime was cudaDeviceSynchronize, a blocking call that waits until all threads have finished on the device side before returning to the host side.

In light of this, we started cuda_BNN_faster.cu, which is meant to be an optimized version. We intended to initialize everything on the CPU, then pass values to the GPU and let it go from there, with no switching. Figure 5 shows an example of how we might create our network to be passed as a parameter to the GPU. However, this dive into structuring classes for use inside __global__ functions revealed a flaw in our original implementation of the nested classes. Because they all inherit from a class we call Managed, which sets aside managed memory for each instance, all of our classes are constructed using a __host__ function, cudaMallocManaged. While this was not a problem when we set everything up on the host side, trying to initialize the network on the GPU could not work; it exited silently without completing the requested functionality. The GPU can only operate inside __device__ and __global__ functions, so any host-side operations attempted after the program has crossed over to the GPU cause undefined behavior. Because of this, and because of how we would have to restructure our program, the classes give us nothing that plain functions would not do better in this setup.

[Images: example of OOP CUDA and broken OOP CUDA]

Figure 5: An example script of our plan to use OOP with GPU __global__ functions, which reveals the flaw in this method when applied to our program. For a class to be passed to a __global__ function, all of its methods must be tagged __host__ and __device__. But when creating a class inside a class (i.e., an array of Layers in a Network), the internal classes must be dynamically allocated using cudaMallocManaged, a __host__ function. Thus, when the object is passed in, the __global__ function breaks on the __host__ call, because the device-side processor cannot validly access a host-side function.

Our next step would have been to optimize cuda_BNN_faster.cu by transforming its architecture into a more functional programming style. This would remove the overhead of OOP and allow for easier memory management between host and device, our largest hurdle. However, our project ran out of time before we could finish implementing that framework. Despite the lack of time, we do understand the relationship between C++ and CUDA. CUDA shines when there are a lot of computations that can take place all at once, and machine learning definitely lends itself to that: the network needs to process lots of data in the feed-forward direction and to update all of its weights during backpropagation. C++ is a great low-level language that can use CUDA's optimized memory allocation to its full potential. Our research has shown us that C++ and CUDA can be very powerful when paired together and used correctly, but the small scale of our work and the steep learning curve may have prevented us from seeing the whole picture clearly in time.
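As an illustration of the direction we were headed, the sketch below is hypothetical, not code from cuda_BNN_faster.cu: the weights live in flat arrays allocated once on the host, raw pointers are handed to a kernel, and the per-sample work stays on the device with a single synchronization at the end.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch of the functional restructuring described above (not code
// from cuda_BNN_faster.cu): flat weight arrays allocated once on the host, one
// kernel doing a layer's feed-forward pass entirely on the device.
__global__ void forwardLayer(const float *inputs, const float *weights,
                             float *outputs, int nIn, int nOut) {
    int j = blockIdx.x * blockDim.x + threadIdx.x; // one output neuron per thread
    if (j >= nOut) return;
    float sum = 0.0f;
    for (int i = 0; i < nIn; ++i)
        sum += inputs[i] * weights[j * nIn + i];   // row j holds neuron j's weights
    outputs[j] = tanhf(sum);                       // transfer function applied on the device
}

int main() {
    const int nIn = 2, nOut = 3;
    float *inputs, *weights, *outputs;
    cudaMallocManaged(&inputs, nIn * sizeof(float));
    cudaMallocManaged(&weights, nIn * nOut * sizeof(float));
    cudaMallocManaged(&outputs, nOut * sizeof(float));

    inputs[0] = 1.0f; inputs[1] = 0.0f;                      // one XOR sample
    for (int k = 0; k < nIn * nOut; ++k) weights[k] = 0.1f;  // stand-in for random initialization

    forwardLayer<<<1, nOut>>>(inputs, weights, outputs, nIn, nOut);
    cudaDeviceSynchronize(); // a single return to the host after the device-side work

    for (int j = 0; j < nOut; ++j) std::printf("output[%d] = %f\n", j, outputs[j]);
    cudaFree(inputs); cudaFree(weights); cudaFree(outputs);
    return 0;
}
```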

Reflection and Final Thoughts

While our final product did not reach as far as we had hoped, we did learn a lot. We understand what machine learning is and how it is implemented in code from scratch. We were exposed to the C++ language and some of the advantages and disadvantages of OOP. We learned a lot about the layout of GPU blocks and how to efficiently access GPU memory. And we figured out CUDA and how to incorporate it into a C++ program, albeit a slow one, with the knowledge base needed to continue self-learning and moving forward on the topic. We believe we reached the lower bound of the project, with enough understanding to have reached our upper bound if we had more time.

Whether or not this was an initial learning goal, this project helped us understand memory and how it is handled between the GPU and CPU, as well as the nuances of GPU functions. These nuances led to learning more about class inheritance, overriding functions, and other fun aspects of OOP in C++.

To different extents, we achieved our goal of learning both C++ and CUDA. We weren't able to develop the neural network as much as we had planned, but we learned everything we wanted to and more. As a team, we believe that we got as much as we could have out of this project within the time allowed. Learning more than enough to build a CUDA-enabled neural network that can handle a simple prediction task is a satisfying conclusion to this project.

Resources

CUDA Basics

An Even Easier Introduction to CUDA This is a basic introduction to how CUDA works, including elements like tagging functions, memory allocation for the GPU (in this case, unified memory), synchronization, and how to call kernels. It also makes for a good test of whether all elements are working.

Unified Memory in CUDA 6 While the introduction to CUDA mentioned functions like cudaMallocManaged and unified memory, the details remained fuzzy for us. Particularly when we started defining classes that we wanted to be accessible on both the CPU and the GPU, we needed this more detailed explanation of how unified memory works and how to use it with inheritance.

Neural Networks

15 Steps to Implement a Neural Net A fairly high-level walkthrough of creating a backpropagation neural network. This is valuable when you want a more general picture of the steps without having the specifics defined.

A Neural Network in 10 Lines of C++ Code and sister article A Neural Network in 10 Lines of CUDA C++ Code These articles are liars, as the final file ends up being much more than 10 lines, but the core of the learning algorithm is short and simple. They provide a basic introduction to the math behind neural networks and a good insight into the differences between CUDA and straight C++.

David Miller’s Neural Network in C++ Tutorial We use this tutorial, and the basic structure that it set for neurons, layers, and a network, as the backbone of our own BNN. Though rather long, it provides a great breakdown of all the different parts, and a more object-centered way of thinking about the network than most matrix-based tutorials.
