Intro to Gradient Descent

Article
06/05/2016

Hello World!

This is a pull from my privately owned blog: www.dacrook.com -> to find latest and greatest, thats a great place to head.

In this article we are going to cover a simple version of Gradient Descent. It is important to note that this version of gradient descent is using Sum of Squares as its cost function to reduce. This implementation utilizes vectorized algorithms. Lets start off with...

What is Gradient Descent

There are a ton of confusing explanations, including wikipedia. Lets just keep it simple. Gradient descent is simply a method to minimize the output of a particular function by taking baby steps in the downward sloping direction. These steps reduce the slope until it reaches a local minimum or the slope becomes zero. This is more easily visualized with the below graphic.

You can see here at step 1 we have a very steep slope. We take a big step, overshoot a bit, but at least we are closer, the slope is reduced, take another step, overshoot again, but again we are closer and finally in step 4, we have reached the local minimum of that function, where our slope is zero.

What does slope have to do with the minimum of a function?

Above is a parabolic equation $latex y = x^2 $ the minimum of this function is when x = 0. You can think of step 1 being x = -100, step 2 x = +50, and step 3, x = -25. Each of these steps the slope reduces as x becomes closer to the minimum, which is x = 0. Basically the slope tells us how big of a step we need to take in order to get closer to our minimum as well as in what direction. In step 1, we need to increase x, in step 2 we need to decrease x, and in step 3 we need to again increase it. At step 4, we are just right.

Awesome, so how do we determine slope?

Well now thats a tricky question isn't it? Basically it boils down to remembering your calculus 2 classes from college (or high school). I will cover derivatives and partial derivatives in depth in a seperate article, but the jist of it is that a derivative is the rate of change. In this example, the rate of change for step 1 is very high, while the rate of change for step 2 is lower, step 3 it is lower and step 4 is is zero. Derivatives is a way to understand the rate of change in a localized context for immediately near neighbors. Here is a great article on the topic for now.

So lets cut to the chase, derivatives basically have a series of rules and you can read all about them on wikipedia or by taking a calculus class. A great rule though is that if your equation is y = (ANYTHING)^2 (like sum squares), your derivative is y = 2 * (ANYTHING).

Ok, but I have a bunch of variables and weights

Yup, so you need to use "Partial Derivatives", basically partial derivatives allow you to see to what extent and individual component is contributing to the entirety of the slope. See the below graphic for a representation.

The above figure is an excerpt from a "contour plot". Basically it allows us to display 3 dimensional information on a 2 dimensional plot, in the instance we have an equation such as $latex y = \theta_1 \cdot x + \theta_2 \cdot z + \theta_3 $ think of $latex \theta_1 $ as green and $latex \theta_2 $ as blue. So to attain the full derivative, which is pictured in red, we can simply take the sum of the 2 partials displayed in green and blue. This gives us the contribution that each component has in the whole. By doing this, we can seperate out each weight and move it into the correct direction an appropriate amount. Partial derivatives give us the power to do this with any polynomial.

Ok, thats great, what does it mean for sum of squares? Basically we take the partial derivative with respect to each theta value. The rule here is that you take the anything theta is multiplied with and multiply the entire equation by that as well. So you really end up with this equation for each theta of J.
$latex \frac{ \sum_i^m \left(H_{\theta}\left(x\right) - y\right) \cdot x_j } { m } $

where the original sum of squares equation is

$latex \frac{ \sum_i^m \left(H_{\theta}\left(x\right) - y\right)^2 } { m } $
You can look up the rules here. It gets hairy quick.

So what does the code look like?

Great question! First I want to mention all code samples in this article are in F# (I will have a python version at a later date for folks). Feel free to read this article on why F# as well as go here to check out the language. Here is the original sum of squares. Notice we are using linear algebra to vectorize our equation. This is an optimization step we are taking. Read more about vectorizing here.

 let SumSquares (theta:Vector<float>) (y:Vector<float>) (trainingData:Matrix<float>) =
let m = trainingData.RowCount |> float
(trainingData |> LinearRegressionPredict theta)
|> subtract y
|> square
|> divideBy m

Here is the new optimization equation

 let StepThetas (y:Vector<float>) (initTheta:Vector<float>) (alpha:float) (predict:Prediction) (trainingData:Matrix<float>) =
let m = trainingData.RowCount |> float
let err = trainingData
|> predict initTheta
|> subtract y
let nT = (trainingData * err)
|> divideVecBy m
|> multiply alpha
initTheta |> subtract nT

Woah, thats way more stuff...

Sure does. That's because this is the "theta optimization" equation and not simply the partial derivative. That is the code implementation of this mathematical equation:
$latex \theta_j = \theta_j - \alpha \cdot \frac{ \sum_i^m \left(H_{\theta}\left(x\right) - y\right) \cdot x_j } { m } $

Notice the partial derivative of sum of squares is right in there. But we have this $latex \alpha $ thing in there and this $latex \theta_j $ thing too.

This is the core of gradient descent

The optimization portion is the meat and potatoes of it. Lets review what we have here. We have a partial derivative, the equation that says "Here is my contribution to the total error for this particular theta". We then multiply by an arbitrary $latex \alpha $ value. This is basically a scaling term that lets us make baby steps towards our goal. Ill speak more deeply about $latex \alpha $ (alpha) later. We then subtract our original theta by this.

Lets think about this, subtraction is basically addition of the inverse. So in essence we are adding to our weight the inverse of its contribution to error, thereby reducing it. We multiply by alpha to help ensure the step is of appropriate magnitude.

To Gradient Descent

Here is the final wrapping code.

 let GradientDescent (y:Vector<float>) (initTheta:Vector<float>) (alpha:float) (maxIters:int) (H:HypoEquation) (trainingData:Matrix<float>) =
let m = trainingData.RowCount |> float
let mutable nThetas = initTheta
for i = 0 to maxIters do
let r = trainingData
|> StepThetas y nThetas alpha H
nThetas <- r
nThetas

So you can see have a rather naive' implementation of gradient descent. It simply iterates some number, optimizes the weights and returns the trained weights. I do want to call attention to one neat detail in this implementation. Notice the parameter H, HypoEquation. This is the hypothesis equation, or the prediction equation. For linear regression it is:

 let LinearRegressionPredict : HypoEquation =
fun(theta:Vector<float>) (trainingData:Matrix<float>) ->
(trainingData * theta)

Simple vectorized equation. Whats interesting here is that HypoEquation is defined as a typed function that takes a Vector and a Matrix and produces a Vector. As long as something adheres to this (which most prediction equations do), our gradient descent will deal with it perfectly fine. This means our implementation of gradient descent for Linear Regression also will work for logistic regression and a variety of others.

Sweet, so now we can pump our gradient descent with all sorts of brains such as automatic alpha setting, data normalization etc etc and it will just simply work.

So How do we Test it?

Well lets start with some test data...

 let tData = matrix [[ 1.0; 2.0; 5.0]
[1.0; 3.0; 6.0]
[1.0; 3.0; 7.0]]
let yVals = vector [ 5.0; 6.0; 7.0]

Basically, we have a bi-nomial of the form:

$latex y = \theta_1 \cdot x + \theta_2 \cdot z + \theta_3 $

We appended our Matrix with a column-vector of 1's to be our third theta. The yVals is our actual values. We also need a set of initial theta values...

 let theta = vector [0.0; 0.0; 0.0]

We could probably have picked better initials, but meh, Gradient Descent will optimize anything. Ok, so to the test, lets see it do some work...

 let weights =
tData
|> GradientDescent yVals theta 0.001 2000 LinearRegressionPredict
tData
|> SumSquares weights yVals

We create a variable "weights" pass the training data to Gradient Descent, the yvals, the initial theta values. We want an alpha of .001, we want 2000 training steps and pass it the LinearRegressionPrediction function. Finally we want to see how much error our weights have against the fake training data by passing tData to SumSquares, the trained weights (instead of theta) and the actual yVals. Feel free to try various numbers of training steps and watch your errors drop as gradient descent optimizes your values.

Summary

Well this ended up a larger article than I had anticipated, but I think we covered a fair number of key topics.

Functional Languages Rock for ML. I personally like F#.
Derivatives & Partial Derivatives
Cost function vs Optimization Function
Gradient Descent
Implementation

I hope this was useful and please feel free to leave comments, questions, requests.