March 2019

Volume 34 Number 3

### [C#]

# Support Vector Machines Using C#

A support vector machine (SVM) is a software system that can make predictions using data. The original type of SVM was designed to perform binary classification, for example predicting whether a person is male or female, based on their height, weight, and annual income. There are also variations of SVMs that can perform multiclass classification (predicting one of three or more classes) and regression (predicting a numeric value).

In this article I present a complete working example of an SVM that’s implemented using only raw C# without any external libraries. A good way to see where this article is headed is to examine the demo program in **Figure 1**.

**Figure 1 SVM Demo Program in Action**

The demo begins by setting up eight dummy training data items. Each item has three predictor values, followed by the class to predict encoded as -1 or +1. The dummy data doesn’t represent a real problem, so the predictor values have no particular meaning. In a realistic SVM problem, it’s important to normalize the predictor values, usually by applying min-max normalization so that all values are scaled between 0.0 and 1.0.

The demo creates an SVM classifier that uses a polynomial kernel function. (I’ll explain kernel functions shortly.) Training the SVM model generated three support vectors: (4, 5 7), (7, 4, 2) and (9, 7, 5). These are training data items [0], [1] and [4]. Each support vector has an associated weight value: (-0.000098, -0.000162, 0.000260). Additionally, an SVM model has a single value called a bias, which is -2.505727 for the demo data.

The support vectors, weights and biases define the trained SVM model. The demo program uses the model to generate predicted class values for each of the eight training items. A decision value that’s negative corresponds to a predicted class label of -1 and a positive decision value corresponds to a predicted class of +1. Therefore, the trained model correctly predicts all eight of the training items. This isn’t too surprising because the problem is so simple.

The demo concludes by predicting the class for a new, previously unseen item with predictor values (3, 5, 7). The computed decision value is -1.274 and therefore the predicted class label is -1. The computation of the decision value is explained later in this article.

This article assumes you have intermediate or better programming skill with C#, but doesn’t assume you know anything about SVMs. The demo program is coded using C# and though it’s complicated, you should be able to refactor it to another language, such as Java or Python, if you wish.

The code for the demo program is too long to present in its entirety in this article, but the complete source code is available in the accompanying file download. All normal error checking has been removed to keep the main ideas as clear as possible.

## Overall Program Structure

The structure of the demo program is shown in **Figure 2**. All the control logic is contained in a single Main method. Program-defined class SupportVectorMachine declares all member fields as public so you can more easily inspect them programmatically.

I used Visual Studio 2017 to create the demo program, but there are no significant .NET Framework dependencies so any version of Visual Studio will work fine. I created a new C# console application and named it SVM_CSharp. After the template code loaded into the editor window, I removed all unneeded using statements and then added a reference to the Collections.Generic assembly. In the Solution Explorer window, I right-clicked on file Program.cs and renamed it to SVM_Program.cs and allowed Visual Studio to automatically rename class Program.

Figure 2 Demo Program Structure

```
using System;
using System.Collections.Generic;
namespace SVM_CSharp
{
class SVM_Program
{
static void Main(string[] args)
{
// Set up training data
// Create SVM object, set parameters
// Train the SVM
// Display SVM properties
// Use trained SVM to make a prediction
}
}
public class SupportVectorMachine
{
// All member fields are declared public
public SupportVectorMachine(string kernelType,
int seed) . .
public double PolyKernel(double[] v1, double[] v2) . .
public double ComputeDecision(double[] input) . .
public int Train(double[][] X_matrix,
int[] y_vector, int maxIter) . .
public double Accuracy(double[][] X_matrix,
int[] y_vector) . .
private bool TakeStep(int i1, int i2,
double[][] X_matrix, int[] y_vector) . .
private int ExamineExample(int i2, double[][] X_matrix,
int[] y_vector) . .
private double ComputeAll(double[] vector,
double[][] X_matrix, int[] y_vector) . .
}
}
```

## Using the SVM

The demo program sets up eight hardcoded training items:

```
double[][] train_X = new double[8][] {
new double[] { 4,5,7 },
...
new double[] { 8,9,10 } };
int[] train_y = new int[8]{ -1, -1, -1, -1, 1, 1, 1, 1 };
```

In a non-demo scenario, you should normalize the predictor values, and you’d likely store your data in a text file. The SVM classifier is created like so:

```
var svm = new SupportVectorMachine("poly", 0);
svm.gamma = 1.0;
svm.coef = 0.0;
svm.degree = 2;
```

The “poly” argument is really a dummy value because the SVM is hardcoded to use a polynomial kernel function. The 0 argument is a seed value for the random component of the training algorithm. The gamma, coef (also called constant), and degree arguments are parameters for the polynomial kernel function. Next, parameters for the training algorithm are specified:

```
svm.complexity = 1.0;
svm.epsilon = 0.001;
svm.tolerance = 0.001;
int maxIter = 1000;
```

All of these values are hyperparameters that must be determined by trial and error. The main challenge when using any implementation of an SVM is understanding which kernel to use, the kernel parameters and the training parameters. Training is performed by a single statement:

```
int iter = svm.Train(train_X, train_y, maxIter);
```

Training an SVM classifier is an iterative process and method Train returns the actual number of iterations that were executed, as an aid for debugging when things go wrong. After training, the SVM object holds a List<double[]> collection of the support vectors, an array that holds the model weights (one per support vector) and a single bias value. They’re displayed like this:

```
foreach (double[] vec in svm.supportVectors) {
for (int i = 0; i < vec.Length; ++i)
Console.Write(vec[i].ToString("F1") + " ");
Console.WriteLine("");
}
for (int i = 0; i < svm.weights.Length; ++i)
Console.Write(svm.weights[i].ToString("F6") + " ");
Console.WriteLine("");
Console.WriteLine("Bias = " + svm.bias.ToString("F6") + "\n");
```

The demo concludes by making a prediction:

```
double[] unknown = new double[] { 3, 5, 7 };
double predDecVal = svm.ComputeDecision(unknown);
Console.WriteLine("Predicted value for (3.0 5.0 7.0) = " +
predDecVal.ToString("F3"));
int predLabel = Math.Sign(predDecVal);
Console.WriteLine("Predicted label for (3.0 5.0 7.0) = " +
predLabel);
```

The decision value is type double. If the decision value is negative, the predicted class is -1 and if the decision value is positive, the predicted class is +1.

## Understanding SVMs

SVMs are quite difficult to understand, and they’re extremely difficult to implement. Take a look at the graph in **Figure 3**. The goal is to create a rule that distinguishes between the red data and the blue data. The graph shows a problem where the data has just two dimensions (number of predictor variables) only so that the problem can be visualized, but SVMs can work with data with three or more dimensions.

**Figure 3 Basic SVM Concepts**

An SVM works by finding the widest possible lane that separates the two classes and then identifies the one or more points from each class that are closest to the edge of the separating lane.

To classify a new, previously unseen data point, all you have to do is see which side of the middle of the lane the new point falls. In **Figure 3**, the circled red point at (0.3, 0.65) and the circled blue points at (0.5, 0.75) and (0.65, 0.6) are called the support vectors. In my head, however, I think of them as “support points” because I usually think of vectors as lines.

There are three major challenges that must be solved to implement a useable SVM. First, what do you do if the data isn’t linearly separable as it is in **Figure 3**? Second, just how do you find the support vectors, weights and biases values? Third, how do you deal with training data points that are anomalous and are on the wrong side of the boundary lane?

As this article shows, you can deal with non-linearly separable data by using what’s called a kernel function. You can determine the support vectors, weights and biases using an algorithm called sequential minimal optimization (SMO). And you can deal with inconsistent training data using an idea known as complexity, which penalizes bad data.

## Kernel Functions

There are many different types of kernel functions. Briefly, a kernel function takes two vectors and combines them in some way to produce a single scalar value. Although it’s not obvious, by using a kernel function, you can enable an SVM to handle data that’s not linearly separable. This is called “the kernel trick.”

Suppose you have a vector v1 = (3, 5, 2) and a second vector v2 = (4, 1, 0). A very simple kernel is called the linear kernel and it returns the sum of the products of the vector elements:

```
K(v1, v2) = (3 * 4) + (5 * 1) + (2 * 0) = 17.0
```

Many kernel functions have an optional scaling factor, often called gamma. For the previous example, if gamma is set to 0.5, then:

```
K(v1, v2) = 0.5 * [(3 * 4) + (5 * 1) + (2 * 0)] = 8.5
```

The demo program uses a polynomial kernel with degree = 2, gamma = 1.0 and constant = 0. In words, you compute the sum of products, then multiply by gamma, then add the constant, then raise to the degree. For example:

```
K(v1, v2) = [1.0 * ((3*4) + (5*1) + (2*0)) + 0]^2 = (1 * 17 + 0)^2 = 289.0
```

The polynomial kernel is implemented by the demo program like so:

```
public double PolyKernel(double[] v1, double[] v2)
{
double sum = 0.0;
for (int i = 0; i < v1.Length; ++i)
sum += v1[i] * v2[i];
double z = this.gamma * sum + this.coef;
return Math.Pow(z, this.degree);
}
```

The values of gamma, degree and constant (named coef to avoid a name clash with a language keyword) are class members and their values are supplied elsewhere. The demo program hard codes the kernel function so if you want to experiment with a different function, you’ll have to implement it yourself.

A common alternative to the polynomial kernel is the radial basis function (RBF) kernel. In words, you sum the squared differences between elements, multiply by negative gamma, then raise to e (Euler’s number, approximately 2.718). In code the RBF kernel could look like:

```
public double RbfKernel(double[] v1, double[] v2)
{
double sum = 0.0;
for (int i = 0; i < v1.Length; ++i)
sum += (v1[i] - v2[i]) * (v1[i] - v2[i]);
return Math.Exp(-this.gamma * sum);
}
```

Notice that different kernel functions have different parameters. The linear and RBF kernels only require gamma, but the polynomial kernel requires gamma, degree and constant. The choice of kernel function to use, and the values of the parameters that apply to the kernel function being used, are free parameters and must be determined by trial and error. All machine learning classification techniques have hyperparameters, but SVMs tend to be particularly sensitive to their hyperparameter values.

## The Sequential Minimal Optimization Algorithm

There are many algorithms that can be used to determine the support vectors for an SVM problem. The SMO algorithm is the most common. The demo program follows the original explanation of SMO given in the 1998 research paper, “Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines,” which can be found in many places on the Internet.

The SMO algorithm is very complex and a full explanation would require roughly 200 pages (I know because I once reviewed an entire book dedicated just to SMO). SMO has three key functions: a top-level Train function that calls a helper ExamineExample function, which calls a helper TakeStep function.

The signature of TakeStep is: private bool TakeStep(int i1, int i2, double[][] X_matrix, int[] y_vector). Parameter X_matrix holds the training data predictor values. Parameter y_vector holds the training data target values, which are either -1 or +1. The i1 and i2 parameters are a first index and a second index pointing into the training data. On each call to TakeStep, the algorithm attempts to find a better solution and returns true if an improvement is found using the two training data items, false otherwise.

The signature of ExamineExample is: private int ExamineExample(nt i2, double[][] X_matrix, int[] y_vector). The function returns the number of changes that occurred so that TakeStep can determine if progress has been made.

Both TakeStep and ExamineExample use a class-scope variable named complexity. Larger values of complexity increasingly penalize outlier training data and force the SMO algorithm to try to find a solution that deals with them, at the expense of model overfitting. Because complexity is a parameter used by SMO, it will always be present, unlike parameters associated with the kernel function used, which might be present or absent.

The TakeStep function uses a class-scope variable named epsilon, and the ExamineExample function uses a class-scope variable named tolerance. Both are small values, set to 0.001 by default in the demo. Epsilon is used internally to determine when to stop iterating, which in turn affects the number of support vectors found. Tolerance is used when computing error. The values of epsilon and tolerance are free parameters and the effect of changing them varies quite a bit (from a very small to a very large effect) depending on the particular problem at which you’re looking.

The code for method Train is presented in **Figure 4**. The method is iterative and returns the number of iterations that were performed. The method accepts a maxIter parameter to set a hard limit on the number of iterations performed. In theory, the SMO algorithm will always converge and stop iterating, but theory doesn’t always match practice with an algorithm as complex as SMO.

Figure 4 The Training Method

```
public int Train(double[][] X_matrix, int[] y_vector, int maxIter)
{
int N = X_matrix.Length;
this.alpha = new double[N];
this.errors = new double[N];
int numChanged = 0;
bool examineAll = true;
int iter = 0;
while (iter < maxIter && numChanged > 0 || examineAll == true) {
++iter;
numChanged = 0;
if (examineAll == true) {
for (int i = 0; i < N; ++i)
numChanged += ExamineExample(i, X_matrix, y_vector);
}
else {
for (int i = 0; i < N; ++i)
if (this.alpha[i] != 0 && this.alpha[i] !=
this.complexity)
numChanged += ExamineExample(i, X_matrix, y_vector);
}
if (examineAll == true)
examineAll = false;
else if (numChanged == 0)
examineAll = true;
} // While
List<int> indices = new List<int>(); // support vectors
for (int i = 0; i < N; ++i) {
// Only store vectors with Lagrange multipliers > 0
if (this.alpha[i] > 0) indices.Add(i);
}
int num_supp_vectors = indices.Count;
this.weights = new double[num_supp_vectors];
for (int i = 0; i < num_supp_vectors; ++i) {
int j = indices[i];
this.supportVectors.Add(X_matrix[j]);
this.weights[i] = this.alpha[j] * y_vector[j];
}
this.bias = -1 * this.bias;
return iter;
}
```

In addition to explicitly returning the number of iterations performed, Train finds and stores the indices of the training data items that are support vectors. After the number of support vectors is known, the array that holds the values of the weights can be allocated. The weight values are alpha values that are non-zero.

The Train method has many possible customization points. For example, the demo code stores support vectors as a List<double[]> collection. An alternative is to store just the indices of the support vectors, in a List<int> collection or in an int[] array object. Examining the Train method carefully is the best way to start to understand SVMs and the SMO algorithm.

## Understanding the SVM Mechanism

If you refer to **Figure 1**, you’ll see the trained SVM has three support vectors: (4, 5, 7), (7, 4, 2) and (9, 7, 5). And the model has three weight values = (-0.000098, -0.000162, 0.000260) and bias = -2.506. The decision value for input (3, 5, 7) is computed by calculating the value of the kernel function with each of the three support vectors, then multiplying each kernel value by its corresponding weight, summing, then adding the bias:

```
x = (3, 5, 7)
sv1 = (4, 5, 7)
sv2 = (7, 4, 2)
sv3 = (9, 7, 5)
K(x, sv1) * wt1 = 7396.0 * -0.000098 = -0.725
K(x, sv2) * wt2 = 3025.0 * -0.000162 = -0.490
K(x, sv3) * wt3 = 9409.0 * 0.000260 = 2.446
decision = -0.725 + -0.490 + 2.446 + -2.506 = -1.274
prediction = Sign(decision) = -1
```

Notice that if the predictor values are not normalized, as in the demo, the values of the kernels can become very large, forcing the values of the weights to become very small, which could possibly lead to arithmetic errors.

The SVM mechanism points out strengths and weaknesses of the technique. SVM focuses only on the key support vectors, and therefore tends to be resilient to bad training data. When the number of support vectors is small, an SVM is somewhat interpretable, an advantage compared to many other techniques. Compared to many other classification techniques, notably neural networks, SVMs can often work well with limited training data, but SVMs can have trouble dealing with very large training datasets. The major disadvantages of SVMs is that SVMs are very complex and they require you to specify the value of many hyperparameters.

## Wrapping Up

As this article shows, implementing a support vector machine is quite complex and difficult. Because of this, there are very few SVM library implementations available. Most SVM libraries are based on a C++ implementation called LibSVM, which was created by a group of researchers. Because calling C++ is often difficult, there are several libraries that provide wrappers over LibSVM, written in Python, Java, C# and other languages.

By experimenting with the code presented in this article, you’ll gain a good understanding of exactly how SVMs work and be able to use a library implementation more effectively. Because the code in this article is self-contained and simplified, you’ll be able to explore alternative kernel functions and their parameters, and the SMO training algorithm parameters epsilon, tolerance, and complexity.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: Yihe Dong, Chris Lee