October 2016

Volume 31 Number 10

[Test Run]

ANOVA with C#

By James McCaffrey

Analysis of variance (ANOVA) is a classical statistics technique that’s used to infer if the means (averages) of three or more groups are all equal, in situations where you only have sample data. For example, suppose there are three different introductory computer science classes at a university. Each class is taught by the same teacher, but uses a different textbook and a different teaching philosophy. You want to know if student performance is the same or not.

You have an exam that evaluates computer science proficiency on a one-to-15 scale, but because the exam is very expensive and time-consuming, you can give the exam to only six randomly selected students in each class. You administer the exam and perform ANOVA on the samples to infer if the means of all three classes are the same or not.

If you’re new to ANOVA, the name of the technique may be mildly confusing because the goal is to analyze the means of data sets. ANOVA is named as it is because behind the scenes it analyzes variances to make inferences about means.

A good way to get an idea of what ANOVA is and to see where this article is headed is to take a look at the demo program in Figure 1. The demo sets up hardcoded scores for three groups. Notice that there are only four scores in Group1 and only five scores in Group3—it’s quite common for sample sizes to be unequal because test subjects can drop out or data can be lost or corrupted.

Figure 1 ANOVA with C# Demo Program

There are two main steps to ANOVA. In the first step, an F-statistic value and a pair of values called the “degrees of freedom” (df) are calculated using the sample data. In the second step, the values of F and df are used to determine the probability that all population means are the same (the p-value). The first step is relatively easy. The second step is very difficult.

In the demo, the value of F is 15.884. In general, the larger F is, the less likely it is that all population means are equal. I’ll explain why df = (2, 12) shortly. Using F and df, the p-value is calculated to be 0.000425. This is very small, so you’d conclude that the population means are likely not all the same. At this point, you could perform additional statistical tests to determine which population means are different from each other. For the demo data, it appears that Group1 (sample mean = 4.50) is worse than Group2 (mean = 9.67) and Group3 (mean = 10.60).

The Demo Program

To create the demo program, I launched Visual Studio, clicked on File | New | Project and selected the C# Console Application option. The demo program has no significant .NET dependencies, so any version of Visual Studio will work. After the template code loaded, in the Solution Explorer window I right-clicked on file Program.cs, renamed it to AnovaProgram.cs and allowed Visual Studio to automatically rename class Program for me.

At the top of the editor window, I deleted all unnecessary using statements, leaving just the one that references the top-level System namespace. The overall structure of the program is shown in Figure 2. The demo program is too long to present in its entirety, but the complete demo source code is available in the download that accompanies this article.

Figure 2 Demo Program Structure

using System;
namespace Anova
{
  class AnovaProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin ANOVA using C# demo");
      // Set up sample data
      // Use data to calculate F and df
      // Use F and df to calculate p-value
      Console.WriteLine("End ANOVA demo");
    }
    static double Fstat(double[][] data, out int[] df) { . . }
    static double LogGamma(double z) { . . }
    static double BetaIncCf(double a, double b, double x) { . . }
    static double BetaInc(double a, double b, double x) { . . }
    static double PF(double a, double b, double x) { . . }
    static double QF(double a, double b, double x) { . . }
    static void ShowData(double[][] data, string[] colNames) { . . }
  }
}

Static method Fstat computes and returns an F-statistic based on data stored in an array-of-arrays object. The method also calculates and returns two df values in an array out-parameter. Function ShowData is just a little helper function to display the sample means.
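For reference, here’s a minimal sketch of what ShowData might look like. This is one plausible implementation, not necessarily the exact code in the download:

static void ShowData(double[][] data, string[] colNames)
{
  for (int i = 0; i < data.Length; ++i)
  {
    double sum = 0.0;
    for (int j = 0; j < data[i].Length; ++j)
      sum += data[i][j];
    double mean = sum / data[i].Length;
    Console.WriteLine(colNames[i] + " mean = " + mean.ToString("F2"));
  }
}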

The remaining five methods are all used to calculate the p-value. Method QF is the primary method. It calls method PF, which in turn calls method BetaInc, which in turn calls methods BetaIncCf and LogGamma.

After some preliminary WriteLine messages, the Main method sets up and displays the sample data:

double[][] data = new double[3][]; // 3 groups
data[0] = new double[] { 3, 4, 6, 5 };
data[1] = new double[] { 8, 12, 9, 11, 10, 8 };
data[2] = new double[] { 13, 9, 11, 8, 12 };
string[] colNames = new string[] { "Group1", "Group2", "Group3" };
ShowData(data, colNames);

In a non-demo scenario, your data would likely be stored in a text file and you’d write a helper function to read and load the data into an array-of-arrays.
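For example, a bare-bones loader could look like the following sketch. The file format (one comma-delimited group per line) and the method name LoadData are assumptions for illustration:

// Hypothetical helper: assumes each line of the file holds one
// group's comma-delimited scores, for example "3,4,6,5"
static double[][] LoadData(string fn)
{
  string[] lines = System.IO.File.ReadAllLines(fn);
  double[][] result = new double[lines.Length][];
  for (int i = 0; i < lines.Length; ++i)
  {
    string[] tokens = lines[i].Split(',');
    result[i] = new double[tokens.Length];
    for (int j = 0; j < tokens.Length; ++j)
      result[i][j] = double.Parse(tokens[j]);
  }
  return result;
}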

The F-statistic and df are calculated like so:

int[] df = null;
double F = Fstat(data, out df);

For ANOVA, the df for a data set is a pair of values. The first value is K - 1 where K is the number of groups, and the second value is N - K where N is the total number of sample values. So for the demo data, df = (K-1, N-K) = (3-1, 15-3) = (2, 12).

The p-value is computed and displayed like this:

double pValue = QF(df[0], df[1], F);
Console.Write("p-value = ");
Console.WriteLine(pValue.ToString("F6"));

In short, when performing an ANOVA, the calling statements are very simple. But there’s a lot of work that goes on behind the scenes.

Calculating the F-Statistic

Calculating the value of an F-statistic has several sub-steps. Suppose the sample data values are the ones from the demo:

Group1: 3.00, 4.00, 6.00, 5.00
Group2: 8.00, 12.00, 9.00, 11.00, 10.00, 8.00
Group3: 13.00, 9.00, 11.00, 8.00, 12.00

The first sub-step is to calculate the means of each group, and the overall mean of all sample values. For the demo data:

means[0] = (3.0 + 4.0 + 6.0 + 5.0) / 4 = 4.50
means[1] = (8.0 + 12.0 + 9.0 + 11.0 + 10.0 + 8.0) / 6 = 9.67
means[2] = (13.0 + 9.0 + 11.0 + 8.0 + 12.0) / 5 = 10.60
gMean = (3.0 + 4.0 + . . . + 12.0) / 15 = 8.60

The definition of method Fstat starts with:

static double Fstat(double[][] data, out int[] df)
{
  int K = data.Length; // Number of groups
  int[] n = new int[K]; // Number of items in each group
  int N = 0; // Total number of data points
  for (int i = 0; i < K; ++i) {
    n[i] = data[i].Length;
    N += data[i].Length;
  }
...

At this point, local array n has the number of values in each group, K has the number of groups, and N is the total number of values in all groups. Next, the group means are calculated into an array named means, and the overall grand mean is calculated into variable gMean:

double[] means = new double[K];
double gMean = 0.0;
for (int i = 0; i < K; ++i) {
  for (int j = 0; j < data[i].Length; ++j) {
    means[i] += data[i][j];
    gMean += data[i][j];
  }
  means[i] /= n[i];
}
gMean /= N;

The next sub-step is to calculate the “sum of squares between groups” (SSb) and “mean square between groups” (MSb). SSb is the weighted sum of squared differences between each group mean and the overall mean. MSb = SSb / (K-1) where K is the number of groups. For the demo data:

SSb = (4 * (4.50 - 8.60)^2) +
 (6 * (9.67 - 8.60)^2) +
 (5 * (10.60 - 8.60)^2) = 94.07
MSb = 94.07 / (3-1) = 47.03

The code that calculates SSb and MSb is:

double SSb = 0.0;
for (int i = 0; i < K; ++i)
  SSb += n[i] * (means[i] - gMean) * (means[i] - gMean);
double MSb = SSb / (K - 1);

The next sub-step is to calculate the “sum of squares within groups” (SSw) and the “mean square within groups” (MSw). SSw is the sum of squared differences between each sample value and its group mean. MSw = SSw / (N-K). For the demo data:

SSw = (3.0 - 4.50)^2 + . . + (8.0 - 9.67)^2 +
 . . + (12.0 - 10.60)^2 = 35.53
MSw = 35.53 / (15-3) = 2.96

The code that calculates SSw and MSw is:

double SSw = 0.0;
for (int i = 0; i < K; ++i)
  for (int j = 0; j < data[i].Length; ++j)
    SSw += (data[i][j] - means[i]) * (data[i][j] - means[i]);
double MSw = SSw / (N - K);

The final sub-step is to calculate the two df values and the F-statistic. The two df values are K - 1, and N - K. And F = MSb / MSw. For the demo data:

df = (K-1, N-K) = (3-1, 15-3) = (2, 12)
F = 47.03 / 2.96 = 15.88.

The demo code that calculates df and F is:

...
  df = new int[2];
  df[0] = K - 1;
  df[1] = N - K;
  double F = MSb / MSw;
  return F;
} // Fstat

I think you’ll agree that calculating an F-statistic and df values from a set of data is mechanical and relatively easy once you know the math equations.

Calculating the p-value

Converting an F-statistic and its df values into a p-value, the probability that all population means are equal given the sample data that produced F and df, is simple in principle but extremely difficult in practice. I’ll explain as briefly as possible, leaving out an enormous amount of detail that would otherwise require a huge amount of additional explanation. Take a look at the graph in Figure 3.

Figure 3 Calculating the p-Value from an F-Statistic and df Values

Each possible pair of df values determines a graph called the F-distribution. The shape of an F-distribution can vary wildly based on the values of df. The graph in Figure 3 shows an F-distribution for df = (4, 12). I used df = (4, 12) rather than the df = (2, 12) from the demo data because the shape of the df = (2, 12) F-distribution is very atypical.

The total area under any F-distribution is exactly 1.0. If you know the value of the F-statistic, then the p-value is the area under the F-distribution from F to positive infinity. Somewhat confusingly, the area under the F-distribution from zero to the F-statistic is often called PF, and the area under the F-distribution from the F-statistic to positive infinity (representing the p-value) is often called QF. Because the total area under the distribution is 1, PF + QF = 1. It turns out that it’s a bit easier to calculate PF than QF, so to find the p-value (QF), you typically calculate PF then subtract that from 1 to get QF.
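Because PF + QF = 1, method QF reduces to a one-line wrapper (a sketch consistent with the description above):

static double QF(double a, double b, double x)
{
  // p-value = area under the F-distribution from x to +infinity
  return 1.0 - PF(a, b, x);
}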

Calculating PF is brutally difficult but, luckily, magic estimation equations have been known for decades. These math equations, and hundreds of others, can be found in a famous reference, “Handbook of Mathematical Functions” by M. Abramowitz and I.A. Stegun. The book is often called simply “A&S” among scientific programmers. Each A&S equation has an ID number.

In the demo, method PF is really just a wrapper around method BetaInc:

static double PF(double a, double b, double x)
{
  double z = (a * x) / (a * x + b);
  return BetaInc(a / 2, b / 2, z);
}

The name of method BetaInc stands for “incomplete Beta.” Method BetaInc uses A&S equations 6.6.2 and 26.5.8, which in turn rely on a LogGamma function and a BetaIncCf function. The LogGamma function is extremely difficult to explain and to implement. Briefly, the mathematical Gamma function extends the notion of factorial to real-valued numbers. Just like factorials, Gamma function values can become astronomically large, so it’s common to compute the log of Gamma to keep values manageable.
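The demo’s exact BetaInc isn’t listed here, but a standard formulation of the regularized incomplete Beta function, the symmetry-plus-continued-fraction approach popularized by the “Numerical Recipes” books, looks like this sketch:

static double BetaInc(double a, double b, double x)
{
  // Regularized incomplete Beta function I_x(a, b)
  if (x <= 0.0) return 0.0;
  if (x >= 1.0) return 1.0;
  // Leading factor, computed in log space to avoid overflow
  double bt = Math.Exp(LogGamma(a + b) - LogGamma(a) - LogGamma(b) +
    a * Math.Log(x) + b * Math.Log(1.0 - x));
  if (x < (a + 1.0) / (a + b + 2.0))
    return bt * BetaIncCf(a, b, x) / a; // CF converges quickly here
  else
    return 1.0 - bt * BetaIncCf(b, a, 1.0 - x) / b; // use symmetry
}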

Calculating LogGamma is very tricky and there are several algorithms you can use. The demo program uses an algorithm called the Lanczos approximation with (g=5, n=7). The A&S reference has different algorithms that can calculate LogGamma, but the Lanczos approximation, which was not known when A&S was published, gives more accurate results.
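Here’s a sketch of a Lanczos-based LogGamma using the widely published (g=5, n=7) coefficients; it’s one common formulation, not necessarily identical to the download:

static double LogGamma(double z)
{
  // Lanczos approximation with g = 5, n = 7;
  // the coefficients are the widely published values
  double[] coef = { 76.18009172947146, -86.50532032941677,
    24.01409824083091, -1.231739572450155,
    0.1208650973866179E-2, -0.5395239384953E-5 };
  double denom = z + 1.0;
  double tmp = z + 5.5;
  tmp -= (z + 0.5) * Math.Log(tmp);
  double series = 1.000000000190015;
  for (int i = 0; i < 6; ++i)
  {
    series += coef[i] / denom;
    denom += 1.0;
  }
  return -tmp + Math.Log(2.5066282746310005 * series / z);
}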

The name of method BetaIncCf stands for “incomplete Beta computed by continued fraction.” The demo program uses A&S equation 26.5.8 for method BetaIncCf.
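The continued fraction is typically evaluated with a technique called Lentz’s algorithm. The following sketch uses the well-known Numerical Recipes-style formulation and isn’t claimed to be the download’s exact code:

static double BetaIncCf(double a, double b, double x)
{
  // Continued fraction for the incomplete Beta function,
  // evaluated with Lentz's algorithm
  int maxIter = 100;
  double eps = 3.0E-7;     // convergence tolerance
  double fpmin = 1.0E-30;  // guard against division by zero
  double qab = a + b;
  double qap = a + 1.0;
  double qam = a - 1.0;
  double c = 1.0;
  double d = 1.0 - qab * x / qap;
  if (Math.Abs(d) < fpmin) d = fpmin;
  d = 1.0 / d;
  double h = d;
  for (int m = 1; m <= maxIter; ++m)
  {
    int m2 = 2 * m;
    double aa = m * (b - m) * x / ((qam + m2) * (a + m2));
    d = 1.0 + aa * d; // even step of the recurrence
    if (Math.Abs(d) < fpmin) d = fpmin;
    c = 1.0 + aa / c;
    if (Math.Abs(c) < fpmin) c = fpmin;
    d = 1.0 / d;
    h *= d * c;
    aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2));
    d = 1.0 + aa * d; // odd step of the recurrence
    if (Math.Abs(d) < fpmin) d = fpmin;
    c = 1.0 + aa / c;
    if (Math.Abs(c) < fpmin) c = fpmin;
    d = 1.0 / d;
    double del = d * c;
    h *= del;
    if (Math.Abs(del - 1.0) < eps) break; // converged
  }
  return h;
}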

Wrapping Up

An ANOVA test makes three mathematical assumptions: that the group data items are mathematically independent; that the population data sets are distributed normally (as in the Gaussian distribution); and that the population data sets have equal variances.

There are several ways you can test these assumptions, but interpreting their results is challenging. The problem is that it’s highly unlikely that real-world data is exactly normal and has exactly equal variances, though ANOVA still works when data is somewhat non-normal or has unequal variances. The bottom line is that it’s extremely difficult to verify ANOVA’s assumptions, so you should be very conservative when interpreting results.

ANOVA is closely related to the t-test. The t-test determines if the population means of exactly two groups are equal, in situations where you have only sample data. So, if you have three groups, as in the demo, instead of using ANOVA you could conceivably perform three t-tests, comparing groups 1 and 2, groups 1 and 3, and groups 2 and 3. However, this approach isn’t recommended because performing multiple tests inflates the probability of a Type I error (a false positive).

The kind of ANOVA explained in this article is called one-way (or one-factor) ANOVA. A different technique, called two-way ANOVA, is used when there are two factors.

ANOVA is based on the calculated value of an F-statistic from a data set. There are other statistical tests that use an F-statistic. For example, you can use an F-statistic to infer if the variances of two groups of data are the same.
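As an illustration only, a two-group variance comparison could reuse the article’s QF method. The helper names here are hypothetical, not part of the demo:

// Hypothetical sketch: F-test for equal variances of two samples,
// reusing the QF method for the p-value
static double VarianceFTestPValue(double[] x, double[] y)
{
  double vx = SampleVariance(x);
  double vy = SampleVariance(y);
  // Convention: larger variance in the numerator so F >= 1
  double F = (vx > vy) ? vx / vy : vy / vx;
  double dfNum = (vx > vy) ? x.Length - 1 : y.Length - 1;
  double dfDen = (vx > vy) ? y.Length - 1 : x.Length - 1;
  double p = 2.0 * QF(dfNum, dfDen, F); // two-tailed
  return Math.Min(1.0, p);
}

static double SampleVariance(double[] v)
{
  double mean = 0.0;
  foreach (double t in v) mean += t;
  mean /= v.Length;
  double ss = 0.0;
  foreach (double t in v) ss += (t - mean) * (t - mean);
  return ss / (v.Length - 1); // unbiased (n-1) estimate
}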


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee and Kirk Olynk

