Naïve Bayes Classifier using RevoScaleR

In this article, we describe one simple and effective family of classification methods known as Naïve Bayes. In RevoScaleR, Naïve Bayes classifiers can be implemented using the rxNaiveBayes function. Classification, simply put, is the act of dividing observations into classes or categories. Some examples of this are the classification of product reviews into positive or negative categories or the detection of email spam. These classification examples can be achieved manually using a set of rules. However, this is not efficient or scalable. In Naïve Bayes and other machine learning based classification algorithms, the decision criteria for assigning class are learned from a training data set, which has classes assigned manually to each observation.

The rxNaiveBayes Algorithm

The Naïves Bayes classification method is simple, effective, and robust. This method can be applied to data large or small, it requires minimal training data, and is unlikely to produce a classifier that performs poorly compared to more complex algorithms. This family of classifiers utilizes Bayes Theorem to determine the probability that an observation belongs to a certain class. A training dataset is used to calculate prior probabilities of an observation occurring in a class within the predefined set of classes. In RevoScaleR this is done using the rxNaiveBayes function. These probabilities are then used to calculate posterior probabilities that an observation belongs to each class. The class membership is decided by choosing the class with the largest posterior probability for each observation. This is accomplished with the rxPredict function using the Naïve Bayes object from a call to rxNaiveBayes. Part of the beauty of Naïve Bayes is its simplicity due to the conditional independent assumption: that the values of each predictor are independent of each other given the class. This assumption reduces the number of parameters needed and in turn makes the algorithm extremely efficient. Naïve Bayes methods differ in their choice of distribution for any continuous independent variables. Our implementation via the rxNaiveBayes function assumes the distribution to be Gaussian.

A Simple Naïve Bayes Classifier

In Logistic Regression Models, we fit a simple logistic regression model to rpart’s kyphosis data and in Decision Trees and Decision Forests we used the kyphosis data again to create classification and regression trees. We can use the same data with our Naïve Bayes classifier to see which patients are more likely to acquire Kyphosis based on age, number, and start. We can train and test our classifier on the kyphosis data for the sake of illustration. We use the rxNaiveBayes function to construct a classifier for the kyphosis data:

#  A Simple Naïve Bayes Classifier

data("kyphosis", package="rpart")
kyphNaiveBayes <- rxNaiveBayes(Kyphosis ~ Age + Start + Number, data = kyphosis)
kyphNaiveBayes

  Call:
  rxNaiveBayes(formula = Kyphosis ~ Age + Start + Number, data = kyphosis)
  
  A priori probabilities:
  Kyphosis
	 absent   present 
  0.7901235 0.2098765 
  
  Predictor types:
	Variable    Type
  1      Age numeric
  2    Start numeric
  3   Number numeric
  
  Conditional probabilities:
  $Age
			 Means   StdDev
  absent  79.89062 61.86111
  present 97.82353 39.27505
  
  $Start
			  Means   StdDev
  absent  12.609375 4.427967
  present  7.294118 4.283175
  
  $Number
			 Means   StdDev
  absent  3.750000 1.414214
  present 5.176471 1.878673
  

The returned object kyphNaiveBayes is an object of class rxNaiveBayes. Objects of this class provide the following useful components: apriori and tables. The apriori component contains the conditional probabilities for the response variable, in this case the Kyphosis variable. The tables component contains a list of tables, one for each predictor variable. For a categorical variable, the table contains the conditional probabilities of the variable given the target level of the response variable. For a numeric variable, the table contains the mean and standard deviation of the variable given the target level of the response variable. These components are printed in the output above.

We can use our Naïve Bayes object with rxPredict to re-classify the Kyphosis variable for each child in our original dataset:

kyphPred <- rxPredict(kyphNaiveBayes,kyphosis)

When we table the results from the Naïve Bayes classifier with the original Kyphosis variable, it appears that 13 of 81 children are misclassified:

table(kyphPred[["Kyphosis_Pred"]], kyphosis[["Kyphosis"]])

          absent present
  absent      59       8
  present      5       9

A Larger Naïve Bayes Classifier

As a more complex example, consider the mortgage default example. For that example, there are ten input files total and we use nine input data files to create the training data set. We then use the model built from those files to make predictions on the final dataset. In this section we will use the same strategy to build a Naïve Bayes classifier on the first nine data sets and assign the outcome variable for the tenth data set.

The mortgage default data sets are available for download online. With the data downloaded we can create the training data set and test data set as follows (remember to modify the first line to match the location of the mortgage default text data files on your own system):

#  A Larger Naïve Bayes Classifier


bigDataDir <- "C:/MRS/Data"
mortCsvDataName <- file.path(bigDataDir, "mortDefault", "mortDefault")
trainingDataFileName <- "mortDefaultTraining"
mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "")
targetDataFileName <- "mortDefault2009.xdf"
defaultLevels <- as.character(c(0,1))
ageLevels <- as.character(c(0:40))
yearLevels <- as.character(c(2000:2009))
colInfo <- list(list(name  = "default", type = "factor",
    levels = defaultLevels), list(name = "houseAge", type = "factor",
    levels = ageLevels), list(name = "year", type = "factor",
    levels = yearLevels))
append= FALSE
for (i in 2000:2008)
{
    importFile <- paste(mortCsvDataName, i, ".csv", sep = "")
    rxImport(inData = importFile, outFile = trainingDataFileName,
    colInfo = colInfo, append = append, overwrite=TRUE)
    append = TRUE
}

rxImport(inData = mortCsv2009, outFile = targetDataFileName, 
    colInfo = colInfo)

In the above code the response variable default is converted to a factor using the colInfo argument to rxImport. For the rxNaiveBayes function, the response variable must be a factor or you will get an error.

Now that we have training and test data sets we can fit a Naïve Bayes classifier with our training data using rxNaiveBayes and assign values of the default variable for observations within the test data using rxPredict:

mortNB <- rxNaiveBayes(default ~ year + creditScore + yearsEmploy + ccDebt,
	data = trainingDataFileName, smoothingFactor = 1)
mortNBPred <- rxPredict(mortNB, data = targetDataFileName)

Notice that we added an additional argument, smoothingFactor, to our rxNaiveBayes call. This is a useful argument when your data are missing levels of a certain variable that you expect to be in your test data. Based on our training data, the conditional probability for year 2009 will be 0, since it only includes data between the years of 2000 and 2008. If we try to use our classifier on the test data without specifying a smoothing factor in our call to rxNaiveBayes the function rxPredict produces no results since our test data only has data from 2009. In general, smoothing is used to avoid overfitting your model. It follows that to achieve the optimal classifier you may want to smooth the conditional probabilities even if every level of each variable is observed.

We can compare the predicted values of the default variable from the Naïve Bayes classifier with the actual data in the test dataset:

results <- table(mortNBPred[["default_Pred"]], rxDataStep(targetDataFileName, 
    maxRowsByCols=6000000)[["default"]])
results

         0      1
  0 877272   3792
  1  97987  20949

pctMisclassified <- sum(results[2:3])/sum(results)*100
pctMisclassified

  [1] 10.1779

These results demonstrate a 10.2% misclassification rate using our Naïve Bayes classifier.

Naïve Bayes with Missing Data

You can control the handling of missing data using the byTerm argument in the rxNaiveBayes function. By default, byTerm is set to TRUE, which means that missings are removed by variable before computing the conditional probabilities. If you prefer to remove observations with missings in any variable before computations are done, set the byTerm argument to FALSE.