Sigmoid Classifiers Decoded

Hello World,

Another pull from my primary dev blog.
Sigmoid really isn't that complicated (once you understand it, of course).  Some background knowledge, in case you are coming at this totally fresh: the sigmoid function is used in machine learning primarily as a hypothesis function for classifiers.  What is interesting is that this same function is used for binary classifiers and multi-class classifiers, and is the backbone of modern neural networks.

Here is the sigmoid function:    $latex \frac{ 1 }{ 1 + e^{-z}} $
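If you'd rather see it in code, here's a minimal sketch in Python with numpy (the function name sigmoid is just my own label for it):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^-z), works elementwise on scalars or numpy arrays
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(4))    # ~0.982, large z pushes the output toward 1
print(sigmoid(-4))   # ~0.018, very negative z pushes it toward 0
```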

Bounded Function

So let's think about this for a minute.  It is used for a variety of classifiers.  Sigmoid is basically a bounded function: its output is always a value between 0 and 1.  In the case of binary classifiers, this is straightforward.

Sigmoid's Output

If sigmoid outputs 0.75, you translate this as: there is a 75% probability that this thing is 1 and a 25% probability that it is 0.  In this scenario we would tell the user "this thing is a 1 with 75% probability".  In the case of multi-class classifiers, you use the one-vs-all methodology and pick the highest score.  So let's say you have classes A, B, C and D.  The first call would be "is it A or not", the second would be "is it B or not", and so on.  You would get back an answer that looks like this: [0.02, 0.01, 0.94, 0.03], where 0.94 corresponds to C.  We would then return to the user "this thing is a C with 94% probability".
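Here's a rough sketch of how you might turn those four one-vs-all scores into an answer (the class names and scores are just the example from above):

```python
# one-vs-all: one binary "is it X or not" score per class
scores = {"A": 0.02, "B": 0.01, "C": 0.94, "D": 0.03}

best_class = max(scores, key=scores.get)
print(f"this thing is a {best_class} with {scores[best_class]:.0%} probability")
# -> this thing is a C with 94% probability
```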

So how does it work?

Well, sigmoid is a "trained" function just like our linear regression, so first, be aware of that.  If we take a look at the sigmoid function $latex \frac{ 1 }{ 1 + e^{-z}} $ you will notice this z thing.  This is the magic parameter.  z is actually the vector of prediction parameters (the sample's features) times the vector of theta values, or $latex z = X * \theta $.  The minus sign is already baked into the $latex e^{-z} $ term of the formula.

What does this mean?  Well, same as everything else, you are calculating a contribution by defining a polynomial function with some weights and plugging in the values of the sample.  So to create a basic linear decision boundary with 2 parameters, you could do: $latex z = \theta_1 x_1 + \theta_2 x_2 + \theta_b $.  You could also define more complex polynomial functions to create more complex decision boundaries, or to account for potential interactions between combinations of parameters.
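To make that concrete, here's a small sketch of the mechanics (the theta values here are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up weights: theta_1, theta_2 and the bias term theta_b
theta = np.array([2.0, -1.5, 0.5])

# one sample with two features, plus a constant 1 for the bias
x = np.array([0.8, 0.3, 1.0])

z = x @ theta            # the linear part: theta_1*x_1 + theta_2*x_2 + theta_b
print(z)                 # 1.65
print(sigmoid(z))        # ~0.84, i.e. "class 1 with 84% probability"
```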

[Figure: decision boundary separating two classes of data]

As you can see here, the blue line divides two classes of data.

What about the 1 or 0 thing?

That's what the sigmoid function does.  Now that we have our "linear regression" term, we compress it so that it can only output values between 0 and 1.  Take note of $latex \frac{ 1 }{ 1 + e^{-z}} $: since the entire function is 1 / something, the largest it can ever get is 1.  The larger z gets, the closer $latex e^{-z} $ gets to zero and the closer the result gets to 1; the smaller (more negative) z gets, the larger $latex e^{-z} $ gets and the closer the result gets to 0.
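A quick sketch of that squashing behaviour, reusing the same sigmoid function and thresholding at 0.5 to get the 1-or-0 answer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10, -2, 0, 2, 10]:
    p = sigmoid(z)
    label = 1 if p >= 0.5 else 0
    print(f"z = {z:>3}  ->  sigmoid(z) = {p:.4f}  ->  predict {label}")
# z = 0 sits exactly on the decision boundary, where the output is 0.5
```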

So where do the magic parameters come from?

I'll do an article later on generating complex polynomials on the fly, but for now, just think "combinations of parameters that can produce circular boundaries and other funny curves".  So $latex z = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 $ is a great example.  But looking at your data is crucial, and also remember that overfitting can become problematic, so use regularization.
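A rough sketch of expanding two raw features into those polynomial terms by hand (in practice you might lean on something like scikit-learn's PolynomialFeatures, but written out it's just this):

```python
import numpy as np

def expand_features(x1, x2):
    # [1, x1, x2, x1*x2, x1^2, x2^2]; the leading 1 is for the bias term
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

print(expand_features(2.0, 3.0))
# [1. 2. 3. 6. 4. 9.]
```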

Finally, the last part: gradient descent is used to train those magic theta values.
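To round it out, here's a minimal sketch of what that training loop can look like; the toy data, fixed learning rate and iteration count are all assumptions for illustration (batch gradient descent on the logistic cost):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: 4 samples, 2 features, with a column of 1s for the bias term
X = np.array([[ 1.0,  0.5, 1.0],
              [ 2.0,  1.0, 1.0],
              [-1.0, -0.5, 1.0],
              [-2.0, -1.5, 1.0]])
y = np.array([1, 1, 0, 0])

theta = np.zeros(X.shape[1])
learning_rate = 0.1

for _ in range(1000):
    predictions = sigmoid(X @ theta)              # hypothesis for every sample
    gradient = X.T @ (predictions - y) / len(y)   # gradient of the log-loss
    theta -= learning_rate * gradient             # step downhill

print(theta)                # the learned "magic" theta values
print(sigmoid(X @ theta))   # pushed toward [1, 1, 0, 0]
```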