May 2019

Volume 34 Number 5

### [Test Run]

# Weighted k-NN Classification Using C#

The goal of a machine learning (ML) classification problem is to predict a discrete value. For example, you might want to predict the political leaning (conservative, moderate, liberal) of a person based on their age, annual health care expenses, sex, years of education and so on. There are many different classification techniques, including neural networks, decision trees and logistic regression. In this article I show how to implement the weighted k-nearest neighbors algorithm, using the C# language.

The best way to understand where this article is headed is to take a look at the demo run in **Figure 1** and a graph of the associated data in **Figure 2**. The demo program sets up 33 dummy data items. Each item represents a person’s age, annual health care expenses and an arbitrary class to predict (0, 1, 2), which you can think of as political leaning, for concreteness. The data has only two predictor variables so it can be displayed in a graph, but k-NN works with any number of predictors.

**Figure 1 Weighted k-NN Classification Demo Run**

**Figure 2 Weighted k-NN Data**

The data has been normalized. When using k-NN, it’s important to normalize the data so that large values, such as an annual expense of $30,000, don’t overwhelm small values, such as an age of 25. The goal of the demo is to predict the class of a new person who has normalized age and expenses of (0.35, 0.38). The demo sets k = 6 and finds the six labeled data items that are closest to the item-to-classify.

After identifying the six closest labeled data items, the demo uses a weighted voting technique to reach a decision. The weighted voting values for classes (0, 1, 2) are (0.3879, 0.4911, 0.1209), so the item at (0.35, 0.38) would be classified as class 1.

This article assumes you have intermediate or better programming skills with C# or a C-family language such as Python or Java, but doesn’t assume you know anything about the weighted k-NN algorithm. The complete demo code and the associated data are presented in this article. The source code and the data are also available in the accompanying download. All normal error checking has been removed to keep the main ideas as clear as possible.

## Understanding the Weighted k-NN Algorithm

The weighted k-NN algorithm is one of the simplest of all ML techniques, but there are a few tricky implementation details. The first question most developers have is, “How do you determine the value of k?” The somewhat unsatisfying answer is that the value of k must be determined by trial and error.

When using k-NN, you must compute the distances from the item-to-classify to all the labeled data. There are many distance functions, but using Euclidean distance is simple and effective. The Euclidean distance between the two items is the square root of the sum of the squared differences of coordinates. For example, if a labeled data item is (0.20, 0.60) and the item to classify is (0.35, 0.38) then:

```
dist = sqrt( (0.20 - 0.35)^2 + (0.60 - 0.38)^2 )
= sqrt(0.0225 + 0.0484)
= 0.2663
```

After computing all distances and finding the k-nearest distances, you must use a voting algorithm to determine the predicted class. There are many ways to compute a prediction from distances. The demo program uses the inverse weights technique, which is best explained by example. Suppose, as in the demo, the six closest distances are (0.0728, 0.0781, 0.1118, 0.1217, 0.1300, 0.1414) and the associated labeled classes are (0, 1, 0, 1, 1, 2). You compute the inverse of each distance, find the sum of the inverses, then divide each inverse by the sum. For the demo, the inverses are (13.7363, 12.8041, 8.9445, 8.2169, 7.6923, 7.0721). The sum of the inverses is 58.4663. Dividing each inverse distance by the sum gives the weights: (0.2349, 0.2190, 0.1530, 0.1405, 0.1316, 0.1210).

Each weight is a vote for its associated class. In the demo, the first and third closest items have class 0, so the vote for class 0 is the sum of the first and third weights: 0.2349 + 0.1530 = 0.3879. Similarly, the vote for class 1 is 0.2190 + 0.1405 + 0.1316 = 0.4911. And the vote for class 2 is just the sixth weight, 0.1210. The final prediction is the class that has the largest vote.

## The Demo Program

The complete demo program, with a few minor edits to save space, is presented in **Figure 3**. To create the program, I launched Visual Studio and created a new console application named WeightedKNN. I used Visual Studio 2017 but the demo has no significant .NET Framework dependencies.

Figure 3 The Weighted k-NN Demo Program

```
using System;
namespace WeightedKNN
{
class WeightedKNNProgram
{
static void Main(string[] args)
{
Console.WriteLine("Begin weighted k-NN demo ");
Console.WriteLine("Normalized age-expenses data: ");
Console.WriteLine("[id = 0, 0.25, 0.43, class = 0]");
Console.WriteLine(" . . . ");
Console.WriteLine("[id = 32, 0.46, 0.22, class = 2]");
double[][] data = new double[33][];
data[0] = new double[] { 0, 0.25, 0.43, 0 };
data[1] = new double[] { 1, 0.26, 0.54, 0 };
data[2] = new double[] { 2, 0.27, 0.60, 0 };
data[3] = new double[] { 3, 0.37, 0.31, 0 };
data[4] = new double[] { 4, 0.38, 0.70, 0 };
data[5] = new double[] { 5, 0.49, 0.32, 0 };
data[6] = new double[] { 6, 0.46, 0.70, 0 };
data[7] = new double[] { 7, 0.55, 0.32, 0 };
data[8] = new double[] { 8, 0.58, 0.74, 0 };
data[9] = new double[] { 9, 0.67, 0.42, 0 };
data[10] = new double[] { 10, 0.69, 0.51, 0 };
data[11] = new double[] { 11, 0.66, 0.63, 0 };
data[12] = new double[] { 12, 0.29, 0.43, 1 };
data[13] = new double[] { 13, 0.35, 0.51, 1 };
data[14] = new double[] { 14, 0.39, 0.63, 1 };
data[15] = new double[] { 15, 0.47, 0.40, 1 };
data[16] = new double[] { 16, 0.48, 0.50, 1 };
data[17] = new double[] { 17, 0.45, 0.61, 1 };
data[18] = new double[] { 18, 0.55, 0.41, 1 };
data[19] = new double[] { 19, 0.57, 0.53, 1 };
data[20] = new double[] { 20, 0.56, 0.62, 1 };
data[21] = new double[] { 21, 0.68, 0.12, 1 };
data[22] = new double[] { 22, 0.78, 0.24, 1 };
data[23] = new double[] { 23, 0.86, 0.30, 1 };
data[24] = new double[] { 24, 0.86, 0.22, 1 };
data[25] = new double[] { 25, 0.86, 0.12, 1 };
data[26] = new double[] { 26, 0.78, 0.14, 1 };
data[27] = new double[] { 27, 0.28, 0.13, 2 };
data[28] = new double[] { 28, 0.25, 0.21, 2 };
data[29] = new double[] { 29, 0.39, 0.14, 2 };
data[30] = new double[] { 30, 0.37, 0.24, 2 };
data[31] = new double[] { 31, 0.45, 0.13, 2 };
data[32] = new double[] { 32, 0.46, 0.22, 2 };
double[] item = new double[] { 0.35, 0.38 };
Console.WriteLine("Nearest (k=6) to (0.35, 0.38):");
Analyze(item, data, 6, 3); // 3 classes
Console.WriteLine("End weighted k-NN demo ");
Console.ReadLine();
} // Main
// ------------------------------------------------------
static void Analyze(double[] item, double[][] data,
int k, int c)
{
// 1. Compute all distances
int N = data.Length;
double[] distances = new double[N];
for (int i = 0; i < N; ++i)
distances[i] = DistFunc(item, data[i]);
// 2. Get ordering
int[] ordering = new int[N];
for (int i = 0; i < N; ++i)
ordering[i] = i;
double[] distancesCopy = new double[N];
Array.Copy(distances, distancesCopy, distances.Length);
Array.Sort(distancesCopy, ordering);
// 3. Show info for k nearest
double[] kNearestDists = new double[k];
for (int i = 0; i < k; ++i)
{
int idx = ordering[i];
Show(data[idx]);
Console.WriteLine(" distance = " +
distances[idx].ToString("F4"));
kNearestDists[i] = distances[idx];
}
// 4. Vote
double[] votes = new double[c]; // one per class
double[] wts = MakeWeights(k, kNearestDists);
Console.WriteLine("Weights (inverse technique): ");
for (int i = 0; i < wts.Length; ++i)
Console.Write(wts[i].ToString("F4") + " ");
Console.WriteLine("\n\nPredicted class: ");
for (int i = 0; i < k; ++i) {
int idx = ordering[i];
int predClass = (int)data[idx][3];
votes[predClass] += wts[i] * 1.0;
}
for (int i = 0; i < c; ++i)
Console.WriteLine("[" + i + "] " +
votes[i].ToString("F4"));
} // Analyze
static double[] MakeWeights(int k, double[] distances)
{
// Inverse technique
double[] result = new double[k]; // one perneighbor
double sum = 0.0;
for (int i = 0; i < k; ++i)
{
result[i] = 1.0 / distances[i];
sum += result[i];
}
for (int i = 0; i < k; ++i)
result[i] /= sum;
return result;
}
static double DistFunc(double[] item, double[] dataPoint)
{
double sum = 0.0;
for (int i = 0; i < 2; ++i) {
double diff = item[i] - dataPoint[i+1];
sum += diff * diff;
}
return Math.Sqrt(sum);
}
static void Show(double[] v)
{
Console.Write("idx = " + v[0].ToString().PadLeft(3) +
" (" + v[1].ToString("F2") + " " +
v[2].ToString("F2") + ") class = " + v[3]);
}
} // Program
} // ns
```

After the template code loaded, in the editor window I removed all unneeded namespace references, leaving just the reference to the top-level System namespace. In the Solution Explorer window, I right-clicked on file Program.cs, renamed it to the more descriptive WeightedKNNProgram.cs, and allowed Visual Studio to automatically rename class Program.

For simplicity, the demo program uses a static code approach rather than an object-oriented programming approach. Most of the work is done in the Analyze method.

## Loading Data into Memory

The demo program hardcodes the dummy data and the item-to-classify:

```
double[][] data = new double[33][];
data[0] = new double[] { 0, 0.25, 0.43, 0 };
// Etc.
data[32] = new double[] { 32, 0.46, 0.22, 2 };
double[] item = new double[] { 0.35, 0.38 };
```

In a non-demo scenario, you’d likely store your data in a text file or SQL table and write a helper method to load the data into an array-of-arrays-style matrix. I named the data as just “data,” rather than something like trainData because data used by k-NN isn’t really used to train a general model.

The first value in each data item is an arbitrary ID. I used consecutive integers from zero to 32 but, in many cases, you’d have meaningful IDs, such as a person’s employee ID number. The k-NN algorithm doesn’t need data IDs, so the ID column could’ve been omitted. The second and third values are the normalized predictor values. The fourth value is the class ID. Class IDs for k-NN don’t need to be zero-based integers, but using this scheme is programmatically convenient.

## Computing Distances

The first step in the k-NN algorithm is to compute the distance from each labeled data item to the item-to-be classified. Based on my experience and a few research results, the choice of distance function to apply isn’t as critical as you might think, and simple Euclidean distance usually works well.

The demo implementation of Euclidean distance is:

```
static double DistFunc(double[] item, double[] dataPoint)
{
double sum = 0.0;
for (int i = 0; i < 2; ++i) {
double diff = item[i] - dataPoint[i+1];
sum += diff * diff;
}
return Math.Sqrt(sum);
}
```

Notice that the for-loop indexing is hardcoded with 2 for the number of predictor values, and the difference between item[i] and dataPoint[i+1] is offset to account for the data ID value at position [0]. The point here is that there’s usually a tradeoff between coding an ML algorithm for a specific problem and coding the algorithm with a view toward general-purpose use. Because k-NN is so simple and easy to implement, I usually prefer the code-for-specific-problem approach.

At first thought, the requirement of computing a distance function seems to impose the major restriction on k-NN that all predictor variables must be strictly numeric. And until recently this idea was true, for the most part. However, one of the reasons for the increased interest in k-NN is that it’s possible for k-NN to deal with mixed numeric and non-numeric data by using neural encoding of the predictor variables.

Briefly, the idea is to encode predictor variables using a neural autoencoder. An autoencoder can deal with non-numeric data using one-hot or 1-of-(N-1) encoding. An autoencoder predicts its output values to match its input values. The central hidden layer of an autoencoder is a completely numeric representation of the source numeric and non-numeric data.

## Sorting or Ordering the Distances

After all distances have been computed, the k-NN algorithm must find the k-nearest (smallest) distances. One approach is to augment the entire labeled dataset with each distance value, then explicitly sort the augmented data. An alternative approach, used by the demo, is to use a neat overload of the .NET Array.Sort method to sort just the distances and the associated data indices in parallel. The key code is:

```
int[] ordering = new int[N];
for (int i = 0; i < N; ++i)
ordering[i] = i;
double[] distancesCopy = new double[N];
Array.Copy(distances, distancesCopy, distances.Length);
Array.Sort(distancesCopy, ordering);
```

The array named ordering initially holds 0, 1, 2 . . 32. A copy of the 33 distances to the item-to-classify is created because you don’t want to lose the distances in their original order. The statement Array.Sort(distancesCopy, ordering) sorts the values in the distancesCopy array from smallest to largest using the Quicksort algorithm, and simultaneously rearranges the index values in the ordering array. The result is that the first value in array ordering is the index to the data of the item with the smallest distance, the second value in ordering holds the index to the second-closest distance and so on. Nice!

## Determining k-NN Weights and Voting

The most primitive form of using the k-nearest distances to predict a class is to use a simple majority rule approach. For example, if k = 4 and c = 3, and two of the closest distances are associated with class 2, and one closest distance is associated with class 0, and one closest distance is associated with class 1, then a majority rule approach predicts class 2. Note that weighted k-NN using uniform weights, each with value 1/k, is equivalent to the majority rule approach.

The majority rule approach has two significant problems: First, ties are possible. Second, all distances are equally weighted. The most common weighting scheme for weighted k-NN is to apply the inverse weights approach used by the demo program. But there are many other techniques. For example, the reverse distances technique sums all distances, divides all distances by the sum, then reverses the order.

Another approach is to use the rank of the k-nearest distances (1, 2, . . 6) instead of the distances themselves. For k = 6, the rank order centroid weights would be calculated as (1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6) / 6 = 0.4083, (1/2 + 1/3 + 1/4 + 1/5 + 1/6) / 6 = 0.2417, and so on.

For the demo program, the inverse, uniform, reverse and centroids weights are:

```
Inverse Uniform Reverse Centroids
---------------------------------------
0.2349 0.1667 0.2156 0.4083
0.2190 0.1667 0.1982 0.2417
0.1530 0.1667 0.1856 0.1583
0.1405 0.1667 0.1705 0.1028
0.1316 0.1667 0.1191 0.0611
0.1210 0.1667 0.1110 0.0278
```

## Wrapping Up

The weighted k-NN classification algorithm has received increased attention recently for two reasons. First, by using neural autoencoding, k-NN can deal with mixed numeric and non-numeric predictor values. Second, compared to many other classification algorithms, the results of weighted k-NN are relatively easy to interpret. Interpretability has become increasingly important due to legal requirements of ML techniques from regulations such as the European Union’s GDPR.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products including Azure and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd