November 2019

Volume 34 Number 11

### [Test Run]

# Mixture Model Clustering Using C#

Data clustering is the process of grouping data items so that similar items are in the same group/cluster and dissimilar items are in different clusters. The most commonly used clustering algorithm is called k-means. The k-means approach is simple and effective, but it doesn’t always work well with a dataset that has skewed distributions.

In this article I explain how to implement mixture model data clustering using the C# language. The best way to understand what mixture model clustering is and to see where this article is headed is to examine the demo program in **Figure 1**. The demo sets up a tiny dummy dataset with eight items. Each data item represents the height and width of a package of some sort. The first item is (0.2000, 0.7000) and the last item is (0.7000, 0.1000).

**Figure 1 Mixture Model Clustering Demo Run**

There are many variations of mixture model clustering. The variation presented in this article works only with numeric data and requires that the data be normalized or scaled so that all values are roughly in the same range. Normalization prevents large values, such as annual incomes, from overwhelming small values such as high school GPAs.

The demo program specifies the number of clusters as K = 3. Most clustering techniques, including mixture model clustering, require you to set the number of clusters, and this must be determined by trial and error. Mixture model clustering uses a technique called expectation-maximization (EM) optimization, which is an iterative process. The demo iterates five times.

The final clustering result is contained in a matrix called membership weights, w. The w matrix has eight rows. Each row of w corresponds to a row of the dataset.

The first row of w is (0.9207, 0.0793, 0.0000). Notice that the values in each row of w sum to 1.0 so they can be loosely interpreted as the probability that the corresponding data item belongs to a cluster. Therefore, data item [0] belongs to cluster 0 because column [0] of w has the largest value. Similarly, the value of w[7] is (0.0000, 0.2750, 0.7250), so data item [7] belongs to cluster 2.

The demo program displays the values of internal data structures named Nk (column sums of w), a (mixture weights), u (cluster means) and V (cluster variances). These values are used by the EM optimization algorithm part of mixture model clustering.

This article assumes you have intermediate or better programming skills with C#, but doesn’t assume you know anything about mixture model data clustering. The demo is coded using C#, but you shouldn’t have any trouble refactoring the code to another language such as JavaScript or Visual Basic. The complete demo code is presented in this article. The source code is also available in the accompanying download.

## Understanding the Multivariate Gaussian Distribution

Mixture model clustering and EM are really general techniques rather than specific algorithms. The key math equations for the variation of mixture model clustering and EM optimization used in this article are shown in **Figure 2**. Depending on your math background, the equations may appear intimidating, but they’re actually quite simple, as you’ll see shortly.

**Figure 2 Mixture Model Expectation-Maximization Optimization Equations**

The key to the variation of mixture model clustering presented here is understanding the multivariate Gaussian (also called normal) probability distribution. And to understand the multivariate Gaussian distribution, you must understand the univariate Gaussian distribution.

Suppose you have a dataset that consists of the heights (in inches) of men who work at a large tech company. The data would look like 70.5, 67.8, 71.4 and so forth (all inches). If you made a bar graph for the frequency of the heights, you’d likely get a set of bars where the highest bar is for heights between 68.0 to 72.0 inches (frequency about 0.6500). You’d get slightly lower bars for heights between 64.0 to 68.0 inches, and for 72.0 inches to 76.0 inches (about 0.1400 each), and so on. The total area of the bars would be 1.0. Now if you drew a smooth line connecting the tops of the bars, you’d see a bell-shaped curve.

The univariate Gaussian distribution is a mathematical abstraction of this idea. The exact shape of the bell-shaped curve is determined by the mean (average) and variance (measure of spread). The math equation that defines the bell-shaped curve of a univariate Gaussian distribution is called the probability density function (PDF). The math equation for the PDF of a univariate Gaussian distribution is shown as equation (1) in **Figure 2**. The variance is given by sigma squared. For example, if x = 1.5, u = 0, and var = 1.0 then f(x, u, var) = 0.1295, which is the height of the bell-shaped curve at point x = 1.5.

The value of the PDF for a given value of x isn’t a probability. The probability of a range of values for x is the area under the curve of the PDF function, which can be determined using calculus. Even though a PDF value by itself isn’t a probability, you can compare PDF values to get relative likelihoods. For example, if PDF(x1) = 0.5000 and PDF(x2) = 0.4000 you can conclude that the probability of x1 is greater than the probability of x2.

Now, suppose your dataset has items where each has two or more values. For example, instead of just the heights of men, you have (height, weight, age). The math abstraction of this scenario is called the multivariate Gaussian distribution. The dimension, d, of a multivariate Gaussian distribution is the number of values in each data item, so (height, weight, age) data has dimension d = 3.

The equation for the PDF of a multivariate Gaussian distribution is shown as equation (2) in **Figure 2**. The x is in bold to indicate it’s a multivalued vector. For a multivariate Gaussian distribution, instead of a variance, there is a (d x d) shape covariance matrix, represented by Greek letter uppercase sigma. Calculating the PDF for a univariate x is simple, but calculating the PDF for a multivariate x is extremely difficult because you must calculate the determinant and inverse of a covariance matrix.

Instead of using the multivariate Gaussian distribution, you can greatly simplify mixture model calculations by assuming that each component of a multivalued data item follows an independent univariate Gaussian distribution. For example, suppose d = 3 and x = (0.4, 0.6, 0.8). Instead of calculating a multivalued mean and a (d x d) multivalued covariance matrix and then using equation (2), you can look at each data component separately. This would give you three means, each a single value, and three variances, each also a single value. Then you can calculate three separate PDF values using equation (1) and average the three PDF values to get a final PDF value.

This technique is called a Gaussian mixture model because it assumes the data is clustered according to a mixture of K multivariate Gaussian probability distributions. It’s known as a naive approach because the assumption that the data components are mathematically independent of each other is rarely true for real-life data. However, the naive Gaussian PDF approach is dramatically simpler than calculating a true multivariate Gaussian PDF, and the naive technique often (but not always) works well in practice.

## The Demo Program

The complete demo program, with several minor edits to save space, is presented in **Figure 3**. To create the program, I launched Visual Studio and created a new console application named MixtureModel. I used Visual Studio 2017, but the demo has no significant .NET Framework dependencies.

Figure 3 The Mixture Model Clustering Demo Program

```
using System;
namespace MixtureModel
{
class MixtureModelProgram
{
const int N = 8; const int K = 3; const int d = 2;
static void Main(string[] args)
{
Console.WriteLine("Begin mixture model with demo ");
double[][] x = new double[N][]; // 8 x 2 data
x[0] = new double[] { 0.2, 0.7 };
x[1] = new double[] { 0.1, 0.9 };
x[2] = new double[] { 0.2, 0.8 };
x[3] = new double[] { 0.4, 0.5 };
x[4] = new double[] { 0.5, 0.4 };
x[5] = new double[] { 0.9, 0.3 };
x[6] = new double[] { 0.8, 0.2 };
x[7] = new double[] { 0.7, 0.1 };
Console.WriteLine("Data (height, width): ");
Console.Write("[0] "); VectorShow(x[0]);
Console.Write("[1] "); VectorShow(x[1]);
Console.WriteLine(". . . ");
Console.Write("[7] "); VectorShow(x[7]);
Console.WriteLine("K=3, initing w, a, u, S, Nk");
double[][] w = MatrixCreate(N, K);
double[] a = new double[K] { 1.0/K, 1.0/K, 1.0/K };
double[][] u = MatrixCreate(K, d);
double[][] V = MatrixCreate(K, d, 0.01);
double[] Nk = new double[K];
u[0][0] = 0.2; u[0][1] = 0.7;
u[1][0] = 0.5; u[1][1] = 0.5;
u[2][0] = 0.8; u[2][1] = 0.2;
Console.WriteLine("Performing 5 E-M iterations ");
for (int iter = 0; iter < 5; ++iter) {
UpdateMembershipWts(w, x, u, V, a); // E step
UpdateNk(Nk, w); // M steps
UpdateMixtureWts(a, Nk);
UpdateMeans(u, w, x, Nk);
UpdateVariances(V, u, w, x, Nk);
}
Console.WriteLine("Clustering done. \n");
ShowDataStructures(w, Nk, a, u, V);
Console.WriteLine("End mixture model demo");
Console.ReadLine();
} // Main
static void ShowDataStructures(double[][] w,
double[] Nk, double[] a, double[][] u, double[][] V)
{
Console.WriteLine("w:"); MatrixShow(w, true);
Console.WriteLine("Nk:"); VectorShow(Nk, true);
Console.WriteLine("a:"); VectorShow(a, true);
Console.WriteLine("u:"); MatrixShow(u, true);
Console.WriteLine("V:"); MatrixShow(V, true);
}
static void UpdateMembershipWts(double[][] w,
double[][] x, double[][] u, double[][] V, double[] a)
{
for (int i = 0; i < N; ++i) {
double rowSum = 0.0;
for (int k = 0; k < K; ++k) {
double pdf = NaiveProb(x[i], u[k], V[k]);
w[i][k] = a[k] * pdf;
rowSum += w[i][k];
}
for (int k = 0; k < K; ++k)
w[i][k] = w[i][k] / rowSum;
}
}
static void UpdateNk(double[] Nk, double[][] w)
{
for (int k = 0; k < K; ++k) {
double sum = 0.0;
for (int i = 0; i < N; ++i)
sum += w[i][k];
Nk[k] = sum;
}
}
static void UpdateMixtureWts(double[] a, double[] Nk)
{
for (int k = 0; k < K; ++k)
a[k] = Nk[k] / N;
}
static void UpdateMeans(double[][] u, double[][] w,
double[][] x, double[] Nk)
{
double[][] result = MatrixCreate(K, d);
for (int k = 0; k < K; ++k) {
for (int i = 0; i < N; ++i)
for (int j = 0; j < d; ++j)
result[k][j] += w[i][k] * x[i][j];
for (int j = 0; j < d; ++j)
result[k][j] = result[k][j] / Nk[k];
}
for (int k = 0; k < K; ++k)
for (int j = 0; j < d; ++j)
u[k][j] = result[k][j];
}
static void UpdateVariances(double[][] V, double[][] u,
double[][] w, double[][] x, double[] Nk)
{
double[][] result = MatrixCreate(K, d);
for (int k = 0; k < K; ++k) {
for (int i = 0; i < N; ++i)
for (int j = 0; j < d; ++j)
result[k][j] += w[i][k] * (x[i][j] - u[k][j]) *
(x[i][j] - u[k][j]);
for (int j = 0; j < d; ++j)
result[k][j] = result[k][j] / Nk[k];
}
for (int k = 0; k < K; ++k)
for (int j = 0; j < d; ++j)
V[k][j] = result[k][j];
}
static double ProbDenFunc(double x, double u, double v)
{
// Univariate Gaussian
if (v == 0.0) throw new Exception("0 in ProbDenFun");
double left = 1.0 / Math.Sqrt(2.0 * Math.PI * v);
double pwr = -1 * ((x - u) * (x - u)) / (2 * v);
return left * Math.Exp(pwr);
}
static double NaiveProb(double[] x, double[] u,
double[] v)
{
// Poor man's multivariate Gaussian PDF
double sum = 0.0;
for (int j = 0; j < d; ++j)
sum += ProbDenFunc(x[j], u[j], v[j]);
return sum / d;
}
static double[][] MatrixCreate(int rows, int cols,
double v = 0.0)
{
double[][] result = new double[rows][];
for (int i = 0; i < rows; ++i)
result[i] = new double[cols];
for (int i = 0; i < rows; ++i)
for (int j = 0; j < cols; ++j)
result[i][j] = v;
return result;
}
static void MatrixShow(double[][] m, bool nl = false)
{
for (int i = 0; i < m.Length; ++i) {
for (int j = 0; j < m[0].Length; ++j) {
Console.Write(m[i][j].ToString("F4") + " ");
}
Console.WriteLine("");
}
if (nl == true)
Console.WriteLine("");
}
static void VectorShow(double[] v, bool nl = false)
{
for (int i = 0; i < v.Length; ++i)
Console.Write(v[i].ToString("F4") + " ");
Console.WriteLine("");
if (nl == true)
Console.WriteLine("");
}
} // Program class
} // ns
```

After the template code loaded, at the top of the editor window I removed all unneeded namespace references, leaving just the reference to the top-level System namespace. In the Solution Explorer window, I right-clicked on file Program.cs, renamed it to the more descriptive MixtureModelProgram.cs and allowed Visual Studio to automatically rename class Program.

All normal error checking has been omitted to keep the main ideas as clear as possible. I used static methods and class-scope constants rather than OOP for simplicity.

## The Data Structures

The demo code sets constants N = 8 (number of data items), d = 2 (data dimensionality), and K = 3 (number of clusters). There are six data structures. Matrix X is an array-of-arrays with size (8 x 2) and it holds the data. In a non-demo scenario you’d load normalized data into memory from a text file.

Matrix w, which has size (8 x 3), holds the membership weights and the clustering information. Each row of w represents a data item, and each column of w corresponds to a cluster. Mixture model clustering assumes that each cluster follows some probability distribution. The most commonly assumed distribution is the multivariate Gaussian, so the technique is called Gaussian mixture model (GMM). The demo uses a simplified Gaussian, so I call the technique naive Gaussian mixture model, but this isn’t a standard name.

Helper array Nk has 3 cells and holds the sums of the 3 columns of w. Array a has 3 cells and is called the mixture weights. The values in array a sum to 1.0 and are weights that indicate the contribution of each Gaussian distribution to the model. For example, if a = (0.30, 0.60, 0.10), then the first Gaussian distribution contributes 30 percent to the mixture. Each cell of the a array is initialized to 1 / 3 = 0.3333 so that each Gaussian initially contributes equally.

Matrices u and V have size (3 x 2) and hold the univariate means and variances. For example, if u[1][0] = 0.55, the mean for cluster 1, data component 0 (package height), is 0.55. If V [2][1] = 0.33, the variance for cluster 2, data component 1 (package width), is 0.33.

## Updating Membership Weights

The membership weights stored in matrix w are updated in each EM iteration according to equation (3) in **Figure 2**. In words, for each N = 8 rows of w, you compute the K = 3 naive probabilities using the corresponding values of X, u, and V and then multiply each naive probability by the corresponding mixture weight stored in the a vector. After you compute the weighted probabilities for a row, each of the values in the row is normalized by dividing by the row sum.

Updating the membership weights is the expectation step (often shortened to E-step) of the EM optimization algorithm. After membership weights in w are updated, the Nk column sums and mixture weights in vector a are updated using equation (4), the means in matrix u are updated using equation (5), and the variances in matrix V are updated using equation (6) in **Figure 2**. Updating Nk, a, u and V is the maximization step of EM optimization. The idea is to find the values in a, u and V that best match the data.

## Wrapping Up

The code and most of the math notation used in this article are based on a short paper titled “The EM Algorithm for Gaussian Mixtures.” The paper isn’t on a stable Web site and has no author given, but the subtitle and URL indicate the paper is part of lecture notes from a UC Irvine computer science class. You can easily find the paper as a .pdf file by doing an Internet search for the title. I used this somewhat obscure paper because it’s clear and accurate, and almost all of the other Gaussian mixture model resources I found had significant technical errors.

Data clustering is usually an exploratory process. Clustering results are typically examined visually to see if any interesting patterns appear. Most of my colleagues start a clustering investigation by looking at a graph of the source data. If the data appears evenly distributed, then using the k-means algorithm (or one of its many variations) usually works well. But if the data appears skewed, a mixture model clustering approach such as the one presented in this article often gives better results. Most clustering algorithms, including naive Gaussian mixture model, are extremely sensitive to initial values of the means and variances, so these values are often supplied via a preliminary k-means analysis.

## Aloha, Servus and Ciao

It’s been a tremendous pleasure writing articles about software development, testing and machine learning for *MSDN Magazine* over the past 17 years. Rather than look back, I prefer to look forward. The essence of *MSDN Magazine* has always been filling the gap between low-level software documentation (too much detail) and high-level Hello World-style examples (not enough detail). The need to fill that gap won’t go away so I speculate that the soul of *MSDN Magazine* will quickly reemerge in an online format of some sort. So goodbye/hello for now and keep your eyes open for *MSDN Magazine* authors and editors providing useful information in a new format to the software developer community.

– James McCaffrey

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products including Azure and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: Ricky Loynd, Kirk Olynyk