104: The one with the “Learning”

Hi readers!

In the past two posts we saw the different parameters and functions used in a NN. In this post, we are going to look at the very last topic: Backpropagation.

So let’s dive in!

Backpropagation

This topic is going to be a bit mathematical. To understand its fundamental concepts, you need to know high school calculus, but you can get an idea even without knowing the underlying maths. This post will cover the mathematical intuition behind the concept.

Before we look at learning in a NN, let's think about how we learn. We perform a task and we get its results or feedback. If the results are bad, then we act on the feedback and improve our performance.

The learning in a NN is analogous to this. Here, our performance measure is the cost function (J). If we can find how this cost varies with respect to the weights W and biases b, we can find a way to reduce the cost.

J as a function of weights and biases

As you might have studied in high school, a 3D surface can also be visualized as a contour plot. So first let's plot J as a function of w and b.

Image source: medium.com

Our ultimate goal is to reach that point B, which is also known as the “global minimum”. We need to reach that point irrespective of where we start. We can start at point A or any other point on the surface of the curve, but our destination is going to be that point B.

To put the learning of NN in simple terms, it learns the route to this point, the global minimum.

Let’s see how it determines the route. (it certainly doesn’t have Google Maps 😅😜)

Gradient Descent

This is the most commonly used algorithm to determine the route.

We will try to determine how much the cost will vary when we tweak the weights and how much it will vary when we tweak the biases. In mathematical terms, we want to determine the rate of change of J with respect to W and b. Ah… we have heard this phrase somewhere…

YOU GOT IT! WE ARE GOING TO USE DERIVATIVES.

Equations used in gradient descent. (img 1)

We calculate the partial derivatives since J is a function of both W and b. I encourage you to calculate the derivative terms for the error functions we have discussed in the previous posts.

There is a new term, “alpha”, which is known as the learning rate. I will discuss it below. The fancy “:=” operator is just your normal assignment operator.
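For reference, the update rules in img 1 can be written out like this (a reconstruction assuming the standard gradient-descent form, with alpha as the learning rate):

```latex
W &:= W - \alpha \, \frac{\partial J}{\partial W} \\
b &:= b - \alpha \, \frac{\partial J}{\partial b}
```

The second term in each rule (alpha times the partial derivative) is the “T2” referred to below.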

So for simplicity let’s just take a 2D graph of J plotted against W.

Sorry for the poor drawing…

That’s the curve we’ll be getting and point “a” is our destination.

Let's start from point “1”. You can see that the slope (dJ/dw) at point 1 is positive. Thus T2 from img 1 is positive and hence w decreases. If you look at the above image, the w at point 1 is actually higher than the w at the global minimum. So we need our w value to decrease, and this is precisely what the update does.

Similarly, at point “2”, dJ/dw is negative. Hence T2 is going to be negative and w increases. The value of w at “2” is less than the w at the global minimum, so it has to increase. This is exactly what the update does.

So, since we take gradients, the name “gradient”, and since we descend, “descent” — and there is our magical name, “gradient descent”!!! Voila!!

Congrats on making it this far. This is one of the key concepts in DL, so make sure you understand how it works. If you want to go a step further, calculate the gradients for the cost functions and try to reduce the update equations down to simple matrix operations (they can be reduced to that).
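To make the update rule concrete, here is a minimal sketch of gradient descent on a toy cost. The cost J(w) = (w - 3)^2 and the starting points are my own illustrative choices, not from this series:

```python
# Toy cost J(w) = (w - 3)^2, whose global minimum is at w = 3.
# Its derivative is dJ/dw = 2 * (w - 3).
def gradient_descent(w_init, alpha, steps):
    w = w_init
    for _ in range(steps):
        grad = 2 * (w - 3)    # dJ/dw at the current w
        w = w - alpha * grad  # the update rule: w := w - alpha * dJ/dw
    return w

# Whichever side of the minimum we start on, we converge to w = 3,
# just like starting from point "1" or point "2" in the drawing above.
print(round(gradient_descent(w_init=10.0, alpha=0.1, steps=100), 4))  # → 3.0
print(round(gradient_descent(w_init=-5.0, alpha=0.1, steps=100), 4))  # → 3.0
```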

Learning Rate

I will now explain the concept of the learning rate (alpha) in a NN. It determines the size of the step we take as we descend. It can be too big, too small, or just the right size.

Your first instinct might be to set it high so that the model learns faster. But then it might overshoot the global minimum and just keep bouncing back and forth. In other words, it might not converge.

When alpha is large
gif maker: ezgif.com

On the other hand if alpha is too small, it might take forever to converge.

So the ideal learning rate should make the learning process faster, but also make sure that the model converges.

Ideal value of alpha
gif maker: ezgif.com
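The behavior in the two animations can be reproduced numerically. This sketch runs the same update on a toy cost J(w) = w^2 (minimum at w = 0); the specific alpha values are illustrative choices of mine:

```python
# Gradient descent on J(w) = w^2, whose derivative is dJ/dw = 2w.
def descend(alpha, steps=50, w=5.0):
    for _ in range(steps):
        w = w - alpha * 2 * w  # w := w - alpha * dJ/dw
    return w

print(descend(alpha=1.5))    # too large: each step overshoots, |w| blows up
print(descend(alpha=0.001))  # too small: w barely moves toward 0
print(descend(alpha=0.1))    # reasonable: w converges very close to 0
```

With alpha = 1.5 each update multiplies w by (1 - 2 * 1.5) = -2, so the iterate bounces across the minimum with growing magnitude — exactly the non-convergent behavior described above.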

There is also a gradient ascent algorithm that seeks the global maximum instead of the global minimum. Refer here for more information.

Types of Gradient Descent Algorithm

The most commonly used algorithms are

Batch Gradient Descent: All the training examples are taken as a single batch and gradient descent is done on the entire batch.

Mini-Batch Gradient Descent: The training set is split into smaller batches and gradient descent is done on each of these batches separately.

Stochastic Gradient Descent (SGD): If the batch size is 1, that is, gradient descent is done on each example separately, it is known as SGD. Popular algorithms like Adam, AdaGrad and RMSProp are all extensions of SGD.

Adam is the most popular of these and is the go-to algorithm if you are not sure what to choose.
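The three variants above differ only in how much data each update sees. Here is a sketch that contrasts them on a toy linear fit y ≈ w * x; the data set, learning rate, and batch sizes are illustrative assumptions of mine:

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with true w = 2

def grad(w, batch):
    # gradient of the mean squared error over one batch: d/dw mean((w*x - y)^2)
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(batch_size, alpha=0.05, epochs=100):
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        random.shuffle(data)
        # step through the shuffled data in chunks of batch_size,
        # doing one gradient-descent update per chunk
        for i in range(0, len(data), batch_size):
            w -= alpha * grad(w, data[i:i + batch_size])
    return w

w_batch = train(batch_size=len(xs))  # Batch GD: one update per epoch
w_mini  = train(batch_size=2)        # Mini-batch GD: updates on small chunks
w_sgd   = train(batch_size=1)        # SGD: one update per example
print(w_batch, w_mini, w_sgd)        # all three converge to ≈ 2.0
```

Note that this plain loop is only the SGD-style update itself; optimizers like Adam additionally keep running averages of past gradients to adapt the step size.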

So that brings us to the end of this post. In the next post we’ll put it all together and construct our first Neural Network!
