November 2014

Volume 29 Number 11

# Test Run : Consensus Classification Using C#

James McCaffrey

In machine learning (ML), classification is the process of creating a model (typically a math equation of some sort, or a set of rules) that predicts a value that can take on only discrete, non-numeric values. For example, you might want to predict the political party (Democrat or Republican) of a member of Congress based on his or her voting record. Training the model is the process of finding the set of constants (for a math equation model) or set of rules (for a rule-based model) so that, when the model is presented with training data whose output (dependent variable) values are known, the computed outputs closely match the known outputs. The trained model can then be used to make predictions for new data with unknown outputs.

Although there are many standard classification algorithms and techniques, including Naive Bayes classification, logistic regression classification and neural network classification, for some problems a custom classification algorithm is useful. This article presents a custom technique that, for lack of a better name, I call consensus classification. The best way to get a feel for what consensus classification is, and to see where this article is headed, is to take a look at the demo program in Figure 1.

Figure 1 Consensus Classification in Action

The goal of the demo program is to create a model that predicts the political party, Democrat or Republican, of a member of the U.S. House of Representatives based on the representative’s voting record on 16 legislative bills. The raw data set consists of 100 items, each of which corresponds to a Representative and has 17 fields. The first 16 fields in each item are votes where character “y” is a yes vote and character “n” is a no vote. The last field in each item is the Representative’s actual political party. For example, the first two data items are:

``````n, y, y, n, y, y, n, n, n, n, n, n, y, y, y, y, democrat
n, y, n, y, y, y, n, n, n, n, n, y, y, y, n, y, republican
``````

The demo data is a subset of a well-known benchmark set called the Congressional Voting Records Data Set. The complete benchmark data set has 435 items, some of which have a vote value of “?,” indicating an unknown vote or abstention. The demo subset consists of the first 100 items from the full set that have no “?” votes. Additionally, the source data set has political party in the first column; I moved party to the last column, which is much more convenient when programming.
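The preprocessing just described can be sketched in a few lines. Here is an illustrative Python version (the demo itself is C#, and the helper name `prepare` is my own invention), showing the drop-items-with-“?” filter and the move-party-to-last-column reordering:

```python
# Illustrative sketch, not the article's code: drop items containing a "?"
# vote and move the party label from the first field to the last.

def prepare(raw_lines):
    """Return items with no '?' votes, with party moved to the last field."""
    items = []
    for line in raw_lines:
        fields = [f.strip() for f in line.split(",")]
        if "?" in fields:
            continue  # skip items with unknown votes / abstentions
        party, votes = fields[0], fields[1:]
        items.append(votes + [party])  # party becomes the last field
    return items

raw = [
    "republican, n, y, n, y, y, y, n, n, n, n, n, y, y, y, n, y",
    "democrat, ?, y, y, ?, y, y, n, n, n, n, y, n, y, y, y, y",
]
clean = prepare(raw)
# Only the first item survives; its last field is the party label.
```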

Although it’s not necessary to know which legislative bill each of the 16 votes corresponds to, the topic of each bill is suggested by the following 16 short terms: handicapped-infants, water-project, adopt-budget, physician-fees, el-salvador, school-religion, anti-satellite, nicaraguan-contras, mx-missile, immigration-bill, synfuels-cutback, education-spending, superfund-sue, crime-bill, duty-free, south-africa.

The demo program splits the 100-item data set into an 80-item set used to train the model, and a 20-item data set used to estimate the accuracy of the resulting model. A consensus classification model consists of a set of simple rules such as, “If the Representative voted ‘y’ on bill 0 and ‘n’ on bill 3 and ‘y’ on bill 15, then the Representative is a Republican.” In the demo, the number of Boolean conditions in each simple rule is set to 5, and the total number of rules is set to 500. Furthermore, each simple rule is required to be at least 90 percent accurate on the training data items for which the rule is applicable.

Behind the scenes, the training process generated the 500 simple rules. After the model was created, the demo program applied the model rules to the training data and got a 93.75 percent accuracy. Then the model was applied to the 20-item test set, resulting in 84.21 percent accuracy—16 correct predictions, three incorrect predictions, and one data item where none of the 500 rules in the model was applicable.
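Notice that the reported test accuracy excludes the non-applicable item from its denominator: 16 correct out of the 19 items where at least one rule applied gives 84.21 percent. A quick arithmetic check:

```python
# Sanity check of the reported test accuracy: the one item where no rule
# applied is excluded, so accuracy = correct / (correct + wrong).
num_correct, num_wrong, num_unknown = 16, 3, 1
test_acc = num_correct / (num_correct + num_wrong)
print(round(test_acc, 4))  # 0.8421
```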

This article assumes you have at least intermediate programming skills and a basic knowledge of ML classification. The demo is coded using C#, but you shouldn’t have much trouble refactoring the code to another language, such as Visual Basic .NET or Python. The demo code is a bit too long to present in its entirety, but the entire source code is available in the download that accompanies this article at msdn.microsoft.com/magazine/msdnmag1114.

## Overall Program Structure

The overall structure of the demo program, with some WriteLine statements removed and minor edits to save space, is presented in Figure 2. To create the demo, I launched Visual Studio and created a new C# console application and named it ConsensusClassification. The demo has no significant Microsoft .NET Framework dependencies, so any relatively recent version of Visual Studio will work. After the template-generated code loaded, in the Solution Explorer window I renamed file Program.cs to the more descriptive ConsensusProgram.cs and Visual Studio automatically renamed class Program for me.

Figure 2 Overall Program Structure

``````using System;
using System.Collections.Generic;

namespace ConsensusClassification
{
  class ConsensusProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin consensus classification demo");
      Console.WriteLine("Goal is predict political party");

      string[][] allData = new string[100][];
      allData[0] = new string[] { "n", "y", "y", "n", "y", "y", "n", "n",
        "n", "n", "n", "n", "y", "y", "y", "y", "democrat" };
      allData[1] = new string[] { "n", "y", "n", "y", "y", "y", "n", "n",
        "n", "n", "n", "y", "y", "y", "n", "y", "republican" };
      // Etc.
      allData[99] = new string[] { "y", "n", "y", "n", "n", "y", "y", "y",
        "y", "y", "y", "n", "n", "n", "y", "y", "democrat" };

      Console.WriteLine("All data: ");
      ShowData(allData, 5, true);

      Console.WriteLine("Creating 80-20 train-test data");
      string[][] trainData;
      string[][] testData;
      MakeTrainTest(allData, 0, out trainData, out testData); // 0 = seed
      Console.WriteLine("Training data: \n");
      ShowData(trainData, 3, true);
      Console.WriteLine("Test data: \n");
      ShowData(testData, 3, true);

      int numConditions = 5; // Conditions per rule
      int maxNumRules = 500;
      double minAccuracy = 0.90; // Min % rule accuracy
      Console.WriteLine("Setting number conditions per rule = " +
        numConditions);
      Console.WriteLine("Setting max number simple rules    = " +
        maxNumRules);
      Console.WriteLine("Setting simple rule min accuracy   = " +
        minAccuracy.ToString("F2"));

      ConsensusClassifier cc =
        new ConsensusClassifier(numConditions, maxNumRules);
      Console.WriteLine("Starting training");
      cc.Train(trainData, minAccuracy);
      Console.WriteLine("Done");
      Console.WriteLine("Created " + cc.RuleCount() + " simple rules");

      double trainAcc = cc.Accuracy(trainData);
      Console.WriteLine("Accuracy on train data = " +
        trainAcc.ToString("F4"));

      int numCorrect, numWrong, numUnknown;
      double testAcc = cc.Accuracy(testData, out numCorrect,
        out numWrong, out numUnknown);
      Console.WriteLine("Accuracy on test data  = " +
        testAcc.ToString("F4"));
      Console.WriteLine("Number correct = " + numCorrect);
      Console.WriteLine("Number wrong   = " + numWrong);
      Console.WriteLine("Number unknown = " + numUnknown);

      Console.WriteLine("End consensus classification demo\n");
    }

    static void MakeTrainTest(string[][] allData, int seed,
      out string[][] trainData, out string[][] testData) { . . }

    static void ShowData(string[][] rawData, int numRows,
      bool indices) { . . }
  } // Program

  public class ConsensusClassifier { . . }
} // ns
``````

All the program logic is contained in the Main method. The demo has two static helper methods, MakeTrainTest and ShowData. The classification logic is contained in a single program-defined class named ConsensusClassifier. The classifier exposes six public methods: a single constructor; RuleCount, which returns the actual number of simple rules in the model; ComputeOutput, which makes predictions after the model has been created; Train, which creates the model; and an overloaded Accuracy, which computes the percentage of correct predictions.

In the Main method, the demo program hardcodes the 100-item source data into an array-of-arrays-style string matrix:

``````string[][] allData = new string[100][];
allData[0] = new string[] { "n", "y", "y", "n", "y", "y", "n", "n",
  "n", "n", "n", "n", "y", "y", "y", "y", "democrat" };
...
``````

In a non-demo scenario, your data would likely be in a text file and you’d load the data into memory using a helper method. The source data is split into a training set and a test set, like so:
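A minimal loader along those lines might look like the following Python sketch (the file name and helper name are hypothetical, and Python is used here only for illustration; the demo itself hardcodes its data):

```python
# Hypothetical loader sketch (the demo hardcodes its data instead). Assumes
# a comma-delimited text file, one item per line: 16 votes then the party.
import os
import tempfile

def load_data(path):
    data = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                data.append([tok.strip() for tok in line.split(",")])
    return data

# Tiny self-contained demonstration using a temporary file.
path = os.path.join(tempfile.mkdtemp(), "votes.txt")
with open(path, "w") as f:
    f.write("n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y,democrat\n")
rows = load_data(path)
```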

``````string[][] trainData;
string[][] testData;
MakeTrainTest(allData, 0, out trainData, out testData);
``````

The 80/20 split percentages are hardcoded. The 0-value argument passed to method MakeTrainTest is a seed value for a System.Random object so that the train-test split can assign data items randomly. An alternative is to use a stratified approach so that the proportions of Democrat and Republican data items in the training and test matrices are roughly the same as the percentages in the overall source data.
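A seeded 80/20 split in the spirit of MakeTrainTest can be sketched as follows in Python (the shuffle-the-indices approach is my assumption, not necessarily the demo’s exact mechanism):

```python
# Sketch of a seeded 80/20 train-test split. The helper name mirrors the
# demo's MakeTrainTest, but the shuffling details here are assumptions.
import random

def make_train_test(all_data, seed, train_pct=0.80):
    rng = random.Random(seed)          # seeded so the split is reproducible
    indices = list(range(len(all_data)))
    rng.shuffle(indices)               # random assignment of items
    n_train = int(train_pct * len(all_data))
    train = [all_data[i] for i in indices[:n_train]]
    test = [all_data[i] for i in indices[n_train:]]
    return train, test

data = [[i] for i in range(100)]       # stand-in for the 100 voting records
train, test = make_train_test(data, seed=0)
```

Because the generator is seeded, calling the helper again with the same seed reproduces the same split.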

The demo creates the model with this code:

``````int numConditions = 5;
int maxNumRules = 500;
double minAccuracy = 0.90;
ConsensusClassifier cc =
new ConsensusClassifier(numConditions, maxNumRules);
cc.Train(trainData, minAccuracy);
``````

For the demo, I decided to pass the number of Boolean conditions in each rule and the maximum number of rules to the constructor, and to pass a reference to the training data and the minimum accuracy each rule must meet to method Train. Many developers prefer to pass all relevant parameters to the constructor (at the expense of a constructor with a large number of parameters). Instead, I passed each value to the method it most directly relates to (at the expense of a somewhat arbitrary-looking calling interface).

The demo program concludes by computing and displaying the accuracy of the model:

``````double trainAcc = cc.Accuracy(trainData);
Console.WriteLine("Accuracy on train data = " + trainAcc.ToString("F4"));
int numCorrect, numWrong, numUnknown;
double testAcc = cc.Accuracy(testData, out numCorrect, out numWrong,
out numUnknown);
Console.WriteLine("Accuracy on test data  = " + testAcc.ToString("F4"));
Console.WriteLine("Number correct = " + numCorrect);
Console.WriteLine("Number wrong   = " + numWrong);
Console.WriteLine("Number unknown = " + numUnknown);
``````

The first overloaded Accuracy returns just the percentage of correct classifications. The second overloaded Accuracy returns, in addition, the number of correct, incorrect and non-applicable data items, where non-applicable means that none of the model’s rules apply to a particular data item. For example, suppose the number-of-conditions parameter was set to 3 and one of the data items has vote0 = ‘y,’ vote1 = ‘y’ and vote2 = ‘n.’ Even with 500 rules, it’s possible for none of the rules to take this combination of votes into account. An alternative is to design method Accuracy so that non-applicable data items aren’t possible. One common way to do this is to add a final rule that resembles, “. . .else Representative is a Democrat,” where Democrat is a default value, which is typically the most common dependent-variable value.
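The counting logic of the second Accuracy overload can be sketched like this in Python (the toy predictor is a stand-in for the real rule set; returning None models a data item to which no rule applies):

```python
# Sketch of the overloaded Accuracy idea: items where no rule applies are
# tallied separately and excluded from the accuracy ratio.

def accuracy(data, predict):
    correct = wrong = unknown = 0
    for item in data:
        votes, actual = item[:-1], item[-1]
        guess = predict(votes)
        if guess is None:
            unknown += 1              # no rule applicable to this item
        elif guess == actual:
            correct += 1
        else:
            wrong += 1
    acc = correct / (correct + wrong) if correct + wrong > 0 else 0.0
    return acc, correct, wrong, unknown

def toy_predict(votes):
    # Stand-in model: vote 0 decides the party; "?" means no rule applies.
    if votes[0] == "?":
        return None
    return "democrat" if votes[0] == "y" else "republican"

data = [["y", "democrat"], ["n", "republican"],
        ["y", "republican"], ["?", "democrat"]]
acc, num_correct, num_wrong, num_unknown = accuracy(data, toy_predict)
```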

The demo program doesn’t use the model to predict the political party of a new Representative. Predicting the political party of a Representative who voted “yes” on bills 0 through 7 and “no” on bills 8 through 15 could resemble this:

``````string[] newData = new string[] { "y", "y", "y", "y", "y", "y", "y", "y",
"n", "n", "n", "n", "n", "n", "n", "n" };
int party = cc.ComputeOutput(newData);
Console.WriteLine("Predicted party = " + party);
``````

Method ComputeOutput returns an integer value, 0 or 1, where 0 indicates Republican and 1 indicates Democrat.

## The Consensus Algorithm

The heart of the consensus classifier is its rule set. The two primary issues to deal with are deciding how to represent a rule, and generating rules. The demo program represents a rule as an integer array. The scheme is best explained using a concrete example, as shown in Figure 3. Suppose the number of Boolean conditions in each rule is set to three. A single rule is then represented as an array with seven cells. If the array holds the values { 3, 0, 8, 1, 15, 1, 0 }, the rule corresponds to, “If vote 3 is 0 and vote 8 is 1 and vote 15 is 1, then party is 0.” The rule set is a generic List collection of rules.

Figure 3 Consensus Classification Data Structures

The meaning of 0 and 1 can vary for each column/vote. The demo program constructs an array of Dictionary collections, one per column/vote. Zero-based integer IDs are assigned to each string value in the order the values are encountered in each column of the training data. For example, in column 0, for vote 0, “n” is encountered first in the training data and is assigned 0, and “y” is encountered next and is assigned 1. But in column 1, “y” is encountered first and is assigned 0, and “n” is encountered second and is assigned 1.

Similarly, in column 16 of the training data (the last column, which holds the dependent variable values “democrat” and “republican”), “republican” is encountered first and so is 0, and “democrat” is 1. The string-to-integer mapping information is stored in a class member array named stringToInt, as shown in Figure 3. So the expression stringToInt[1][“y”] returns the zero-based index value for a yes vote for vote 1, which, for the demo training data, is 0.

Here’s the high-level pseudo-code for creating the rules that define the classification model:

``````loop while numRules < maxRules && trial < maxTrials
  ++trial
  select a random row of training data
  select numConditions random columns of selected row
  create candidate rule from selected columns of selected row
  if candidate rule already in rule list, continue
  if candidate rule does not meet minimum accuracy, continue
  candidate rule is good so add rule to rule list
end loop
``````
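A runnable Python rendering of this pseudocode, exercised on a tiny encoded data set, might look like the following (the maxTrials value, the duplicate test and the helper names are my assumptions about the implementation, not the demo’s exact C# code):

```python
# Sketch of rule generation using the flat (col, val, ..., class) encoding.
import random

def rule_accuracy(rule, data):
    """Accuracy of a rule over only the items it applies to."""
    n = (len(rule) - 1) // 2
    applicable = correct = 0
    for item in data:  # item = encoded votes + [encoded class]
        if all(item[rule[2 * i]] == rule[2 * i + 1] for i in range(n)):
            applicable += 1
            if item[-1] == rule[-1]:
                correct += 1
    return correct / applicable if applicable else 0.0

def make_rules(train, num_conditions, max_rules, min_accuracy,
               max_trials=10000, seed=0):
    rng = random.Random(seed)
    num_votes = len(train[0]) - 1
    rules, trial = [], 0
    while len(rules) < max_rules and trial < max_trials:
        trial += 1
        row = train[rng.randrange(len(train))]      # random training row
        cols = sorted(rng.sample(range(num_votes), num_conditions))
        cand = []
        for c in cols:                              # candidate conditions
            cand += [c, row[c]]
        cand.append(row[-1])                        # predicted class
        if cand in rules:
            continue                                # duplicate rule
        if rule_accuracy(cand, train) < min_accuracy:
            continue                                # not accurate enough
        rules.append(cand)
    return rules

# Toy encoded training set: two votes plus a class that equals vote 0.
train = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [1, 1, 1]]
rules = make_rules(train, num_conditions=1, max_rules=10, min_accuracy=0.90)
```

On this toy set, only the two rules keyed on vote 0 survive the accuracy filter; the rules keyed on vote 1 are only 50 percent accurate and are rejected.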

A concrete example may help clarify. In the main loop, suppose a row of the training data is randomly selected. Next, suppose numConditions has been set to 3 and the three randomly selected columns are 1, 2 and 15. In the selected row, these columns have values “y,” “n” and “y,” and the dependent variable has value “republican.” The column values are looked up in the array-of-Dictionary collections and map to 0, 0 and 0. Therefore, the candidate rule is an integer array with values { 1, 0, 2, 0, 15, 0, 0 }, which can be interpreted as, “If column 1 is 0 and column 2 is 0 and column 15 is 0, then party is 0.”

Next, the 80 items in the training data are scanned to see whether the candidate rule meets the minimum accuracy criterion. Note that the candidate rule will be correct for at least one data item: the item that was used to create the rule. However, the candidate rule won’t necessarily be applicable to every training item. For example, the candidate rule { 1, 0, 2, 0, 15, 0, 0 } isn’t applicable to any data item whose value in column 1, 2 or 15 maps to 1 rather than the required 0.

Once the rule set has been created, the output for a given set of input data is determined by what might be described as counting votes. For each data item, all of the rules in the rule set are analyzed. Suppose, as in the demo, the rule set has 500 rules. And suppose that for one data item, 220 of the rules predict a political party of Republican, 250 of the rules predict Democrat, and 30 of the rules aren’t applicable. The classifier would predict the person associated with the data is a Democrat. In other words, the output is a consensus of the predictions from the rule set.
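The voting step can be sketched like this (Python for illustration; returning None for a non-applicable item, and how a tie is broken, are assumed conventions rather than the demo’s documented behavior):

```python
# Sketch of consensus output: each applicable rule casts a vote for its
# class, and the class with the most votes wins.

def compute_output(encoded_votes, rules):
    tally = {}
    for rule in rules:
        n = (len(rule) - 1) // 2
        if all(encoded_votes[rule[2 * i]] == rule[2 * i + 1]
               for i in range(n)):
            tally[rule[-1]] = tally.get(rule[-1], 0) + 1
    if not tally:
        return None                      # no rule applies to this item
    return max(tally, key=tally.get)     # consensus: most-voted class

rules = [
    [0, 1, 2, 0, 1],   # if vote 0 is 1 and vote 2 is 0, then class 1
    [1, 0, 2, 0, 1],   # if vote 1 is 0 and vote 2 is 0, then class 1
    [0, 1, 1, 0, 0],   # if vote 0 is 1 and vote 1 is 0, then class 0
]
item = [1, 0, 0]       # class 1 gets two votes, class 0 gets one
```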