
Gradient Descent from Scratch

A gradient descent algorithm for classification, implemented from scratch using the NumPy library.

Dependencies

  • Numpy
  • Matplotlib
  • Pandas
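The dependencies can be installed with pip, for example:
pip install numpy matplotlib pandas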

Importing Dataset

The datasets are first imported as pandas DataFrames and then converted into NumPy arrays
train_data_frame = pd.read_csv('train_dataset.csv', header=None)
test_data_frame  = pd.read_csv('test_dataset.csv',  header=None)
train_dataset = np.array(train_data_frame)
test_dataset  = np.array(test_data_frame)
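As a quick sanity check (assuming the usual MNIST CSV layout of one label column followed by 784 pixel columns), the arrays should come out as:
print(train_dataset.shape)   # (number_of_training_samples, 785)
print(test_dataset.shape)    # (number_of_test_samples, 785)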

Splitting data into input and label

In the MNIST CSV files, column 0 is the label and the remaining 784 columns are the pixel inputs
train_lable = np.array([train_dataset[:, 0]])         # shape (1, m)
train_data  = np.array(train_dataset[:, 1:785]).T     # shape (784, m), samples as columns

test_lable  = np.array([test_dataset[:, 0]]).T        # shape (m, 1)
test_data   = np.array(test_dataset[:, 1:785])        # shape (m, 784)

It is good practice to shuffle the data first, and the shuffle should happen before splitting into labels and inputs so that it actually takes effect on the arrays used for training
numpy.random.shuffle() shuffles the array in place along its first axis

np.random.shuffle(train_dataset)
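A minimal sketch of that ordering, reusing the split from above so labels and pixels stay aligned:
np.random.shuffle(train_dataset)                       # shuffle whole rows first
train_lable = np.array([train_dataset[:, 0]])          # then split into labels ...
train_data  = np.array(train_dataset[:, 1:785]).T      # ... and inputs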

Parameter Initialization

In this network, weights are initialized with small random values while biases are initialized to zero, each stored as a list of NumPy arrays (one entry per layer transition)
def __init__(self, size):
	self.biases  = [np.zeros([y, 1]) for y in size[1:]]
	self.weights = [np.random.randn(y, x)*0.01 for x, y in zip(size[:-1], size[1:])]
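For example, with the layer sizes used later in this README, the parameter shapes come out as (a quick check, not part of the original code):
net = network([784, 2800, 10])
print([w.shape for w in net.weights])   # [(2800, 784), (10, 2800)]
print([b.shape for b in net.biases])    # [(2800, 1), (10, 1)]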

Choosing Hyperparameters

Mini Batch Size is the number of input samples flowing through the network at a time; the error is computed over the whole batch
Learning Rate (alpha) decides how strongly weights and biases are updated during backpropagation
Number of Epochs decides how many times the whole dataset is used to train the network
A reasonable starting point for the Mini Batch Size is 1/10th of the available data; tune it manually after each training run to find a good value
Alpha should be chosen so that learning is neither very slow nor takes such large steps that the network diverges away from the local minimum
The Number of Epochs should be chosen so that the network does not overfit to noise
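A sketch of the 1/10th heuristic for the mini batch size (the training call further below uses mini_batch_size=2000, alpha=0.01 and epochs=40):
mini_batch_size = train_data.shape[1] // 10   # train_data stores samples as columns
alpha  = 0.01
epochs = 40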

Feed Forward

For layer l = 2, 3,..., L compute
  • z[l] = W[l]·A[l−1] + B[l]
  • A[l] = σ(z[l])
def train_feed_forward(self, size, input, activators, mini_batch_size):
    # cache the raw input and every layer's pre-activation z for backpropagation
    self.z = [np.zeros([y, mini_batch_size]) for y in size]
    self.z[0] = input
    for i, (bias, weight) in enumerate(zip(self.biases, self.weights)):
        input = np.dot(weight, input) + bias                # z[l] = W[l]·A[l-1] + B[l]
        self.z[i + 1] = input
        input = getattr(activator, activators[i])(input)    # A[l] = σ(z[l]), activation looked up by name
    return input
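As a minimal, self-contained sketch of one forward step (smaller layer sizes chosen here only to show how the shapes line up):
rng = np.random.default_rng(0)
A_prev = rng.random((4, 5))              # previous layer: 4 neurons, mini batch of 5 samples
W      = rng.standard_normal((3, 4))     # weight matrix of a 3-neuron layer
B      = np.zeros((3, 1))                # bias column, broadcast over the batch
z      = np.dot(W, A_prev) + B           # z[l] = W[l]·A[l-1] + B[l]  ->  shape (3, 5)
A      = 1.0 / (1.0 + np.exp(-z))        # A[l] = σ(z[l])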

Activation Functions

Applying activation functions changes the network from linear to non-linear so that it can fit the outputs more accurately; without them it would be no different from linear regression
def sigmoid(z, derivative=False):
    if derivative==True:
        s = activator.sigmoid(z=z, derivative=False)
        return s * (1 - s)
    return (1.0 / (1.0 + np.exp(-z)))
def softmax(z, derivative=False):
    if derivative==True:
        s = activator.softmax(z=z, derivative=False)
        return s * (1 - s)                                          # diagonal of the softmax Jacobian only
    return (np.exp(z) / np.sum(np.exp(z), axis=0, keepdims=True))   # normalize each column (sample) separately
def tanh(z, derivative=False):
    if derivative==True:
        return (1 - activator.tanh(z=z, derivative=False) ** 2)     # d/dz tanh(z) = 1 - tanh(z)^2
    return (np.tanh(z))
def relu(z, derivative=False):
    if derivative==True:
        return (z > 0).astype(z.dtype)                              # 1 where z > 0, else 0
    return (np.maximum(z, 0))
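A quick numeric check of the functions above (assuming, as in the rest of the code, that they live in the activator class):
z = np.array([[-2.0], [0.0], [2.0]])                   # a column of pre-activations
print(activator.sigmoid(z=z).ravel())                  # ≈ [0.119 0.5   0.881]
print(activator.relu(z=z, derivative=True).ravel())    # [0. 0. 1.]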

Error and Loss Function

For error calculation, the mean squared error is used
Output error: δ[L] = ∇aC ⊙ σ′(z[L])
Mean squared error: (Predicted_value − Expected_value)²
The derivative used in the code below drops the constant factor of 2, which only rescales the learning rate
def loss(self, Y, Y_hat, derivative=False):
    if derivative==True:
        return (Y_hat-Y)
    return ((Y_hat - Y) ** 2)
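A small worked example of the output-error formula with made-up numbers (two classes kept for brevity, targets assumed one-hot):
y       = np.array([[0.0], [1.0]])        # expected output
y_hat   = np.array([[0.3], [0.8]])        # network output after a sigmoid layer
grad_C  = y_hat - y                       # ∇aC of the squared error: [0.3, -0.2]
sig_der = y_hat * (1 - y_hat)             # σ′(z[L]) written via the activations, since y_hat = σ(z[L])
delta_L = grad_C * sig_der                # δ[L] = ∇aC ⊙ σ′(z[L])  ≈  [0.063, -0.032]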

Backpropagation

In an ANN, the output depends on every neuron it passes through
For the output layer we have the labels, so its expected values are known directly
But for all the other layers there is no such direct target
So finding their optimal values is a little harder

For each l=L−1,L−2,…,2 compute
δ[l] = ((W[l+1])ᵀ δ[l+1]) ⊙ σ′(z[l])
# caller: gradients for one mini batch
delta_nabla = self.find_nabla(size=size, activators=activators, mini_batch=mini_batch, mini_batch_size=mini_batch_size, y=y, alpha=alpha)

# inside find_nabla: forward pass, then walk the layers backwards
y_hat = self.train_feed_forward(size=size, input=mini_batch, activators=activators, mini_batch_size=mini_batch_size)
delta_nabla_b = [np.zeros([y, 1]) for y in size[1:]]
delta_nabla_w = [np.zeros([y, x]) for x, y in zip(size[:-1], size[1:])]

# output layer: δ[L] = ∇aC ⊙ σ′
delta = self.loss(Y=y, Y_hat=y_hat, derivative=True) * activator.sigmoid(z=y_hat, derivative=True)
delta_nabla_b[-1] += np.sum(delta, axis=1, keepdims=True)   # sum per output neuron over the batch
delta_nabla_w[-1] += np.dot(delta, self.z[-2].T)

# repeated for every hidden layer, moving backwards (layer_no = L-1, ..., 2):
# δ[l] = ((W[l+1])ᵀ δ[l+1]) ⊙ σ′(z[l])
delta = np.dot(self.weights[layer_no].T, delta) * activator.sigmoid(z=self.z[layer_no-1], derivative=True)

delta_nabla_b[layer_no-1] += np.sum(delta, axis=1, keepdims=True)
delta_nabla_w[layer_no-1] += np.dot(delta, self.z[layer_no-2].T)

delta_nabla = [delta_nabla_b, delta_nabla_w]
Updating Weights and Biases
∂C/∂W[l]_{j,k} = A[l−1]_k · δ[l]_j
∂C/∂B[l]_j = δ[l]_j
self.biases  = [b-((alpha/mini_batch_size)*n_b) for b, n_b in zip(self.biases, delta_nabla[0])]
self.weights = [w-((alpha/mini_batch_size)*n_w) for w, n_w in zip(self.weights, delta_nabla[1])]

Creating Network

In size_layers, define the number of neurons in every layer of the network, and in activations the activation function applied after each layer

The network will still work if layers are added or removed
neuron_layer = {"size_layers": [784, 2800, 10], "activations": ["tanh", "sigmoid"] }
my_network = network(neuron_layer["size_layers"])

Training Network

my_network.grad_descn(size=neuron_layer["size_layers"], expected_value=train_lable, training_data=train_data, activators=neuron_layer["activations"], alpha=0.01, mini_batch_size=2000, epochs=40)
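The body of grad_descn is not shown in this README; a minimal sketch of the mini-batch loop it presumably runs (hypothetical structure, variable names chosen only for illustration):
def grad_descn_sketch(net, training_data, expected_value, size, activators, alpha, mini_batch_size, epochs):
    m = training_data.shape[1]                                    # samples are stored as columns
    for epoch in range(epochs):
        for start in range(0, m, mini_batch_size):
            mini_batch = training_data[:, start:start + mini_batch_size]
            y          = expected_value[:, start:start + mini_batch_size]   # the original may also one-hot encode here
            # gradients for this mini batch (see find_nabla above)
            delta_nabla = net.find_nabla(size=size, activators=activators, mini_batch=mini_batch,
                                         mini_batch_size=mini_batch_size, y=y, alpha=alpha)
            # averaged gradient descent step, as in the update rule above
            net.biases  = [b - (alpha / mini_batch_size) * n_b for b, n_b in zip(net.biases,  delta_nabla[0])]
            net.weights = [w - (alpha / mini_batch_size) * n_w for w, n_w in zip(net.weights, delta_nabla[1])]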

Testing Network

result = my_network.test_feed_forward(size=neuron_layer["size_layers"], input=test_data.T, activators=neuron_layer["activations"])

no_trues = 0

# for every test sample, pick the output neuron with the highest activation
# and compare it with the true label
for i in range(len(test_data)):
    max_ans = result[0, i]
    max_ind = 0
    for j in range(10):
        if(result[j, i]>max_ans):
            max_ind = j
            max_ans = result[j, i]
    if(test_lable[i]==max_ind):
        no_trues+=1
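The same check can be written with np.argmax, which also gives the accuracy directly (assuming result has shape (10, number_of_test_samples)):
predictions = np.argmax(result, axis=0)                 # predicted digit per test sample
no_trues    = np.sum(predictions == test_lable.ravel())
accuracy    = no_trues / len(test_data)
print(f"accuracy: {accuracy:.4f}")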
