测试运行

项目
07/10/2015

使用 C# 进行 Naive Bayes 分类

James McCaffrey Naive Bayes 分类是一种计算机学习方法，可用于预测特定的数据用例属于哪个类别。在本文中，我将介绍 Naive Bayes 分类的工作原理，并提供一个使用 C# 语言编码的示例。

有许多独立工具可用于执行 Naive Bayes 分类。不过，这些工具可能很难或根本无法直接集成到应用程序中，而且很难根据特定需要进行定制。此外，还可能存在版权问题隐患。本文将会帮助您打下坚实的基础，使您无需依赖任何外部因素便可向 .NET 应用程序添加 Naive Bayes 分类功能。

要了解 Naive Bayes 分类的概念以及本文主题，最好看一下图 1 所示的演示程序屏幕快照。演示程序首先生成 40 行数据，这些数据用于训练分类器。多数情况下，您将使用现有数据源，但为了演示的简单性，我生成了虚拟数据。第一行数据为“administrative,right,72.0,female”，其中第一个字段是职业，第二个字段是习惯用手，第三个字段是身高（以英寸为单位），第四个字段是性别。分类器的用途是根据一组指定的职业、习惯用手和身高值来预测性别。由于因变量 sex 有两个可能的值，因此这是一个二元分类示例。

Naive Bayes Classification Demo 图 1 Naive Bayes 分类演示

生成原始数据后，演示程序通过将身高值分箱，将每个数值身高字段转换为一个类别：矮、中或高。稍后我将介绍，将数值数据分箱为分类数据是一种利弊掺半的方法。将训练数据分箱之后，演示程序扫描 40 行分类数据，然后计算结点计数。例如，个人职业为行政管理且性别为男性的数据用例数为 2。此外，还将计算每个因变量值（即要预测的属性，在本例中为男性或女性）的总数。可以看到，训练数据中有 24 位男性，16 位女性。

这样，演示程序便具备了所需的全部信息，能够对职业为教育、习惯用手为右手，身高为高的新数据用例进行性别分类。在本例中，演示最终确定数据用例为男性的概率为 0.3855，为女性的概率为 0.6145，因此系统判定数据用例很可能是一位女性。

在后面几部分内容中，我首先详细介绍 Naive Bayes 分类的工作原理，然后逐步为您介绍演示程序中的代码，并说明如何根据自己的需要修改演示程序。本文假定您至少具备 C 系列语言的初级编程技能，但对 Naive Bayes 分类知识不作要求。演示程序的代码太长，无法在此完整提供，但您可以从 MSDN 下载网站下载完整的源代码，网址为 archive.msdn.microsoft.com/mag201302TestRun。

Naive Bayes 分类的工作原理

以图 1 为例，程序目标是预测职业为教育，习惯用右手且身高较高（不低于 71.0 英寸）的人员的性别（男性或女性）。为此，我们可以计算具有这一指定信息的人员是男性的概率，以及具有这一指定信息的人员是女性的概率，然后依据其中较大的概率来预测性别。若用符号表示，我们希望得出 P(male | X)（含义是“给定自变量值 X 的条件下是男性的概率”）和 P(female | X)，其中 X 是 (education, right, tall)。Naive Bayes 中“naive”一词指所有 X 属性都假定为数学上的自变量，这可以极大地简化分类。您可以找到许多在线参考资料，其中介绍了 Naive Bayes 分类背后很有趣的数学计算，但结果都是相对简单的。符号表示如下：

P(male | X) =
  [ P(education | male) * P(right | male) * P(tall | male) * P(male) ] /
    [ PP(male | X) + PP(female | X) ]

可以看到，该等式是一个分数。分子有时不严格地称为部分概率，是四个项的乘积。本文中，我用非标准符号 PP 表示部分概率项。分母是两项之和，其中一项是分子。首先要计算 P(education | male)，即在某人是男性的前提下其职业为教育的概率。此概率最终可通过以下公式估计得出，即将职业为教育且性别为男性的训练用例的计数除以性别为男性（任意职业）的用例数：

P(education | male ) = count(education & male) / count(male) = 2/24 = 0.0833

同理可得：

P(right | male) = count(right & male) / count(male) = 17/24 = 0.7083
P(tall | male) = count(tall & male) / count(male) = 4/24 = 0.1667

其次要计算 P(male)。在 Naive Bayes 术语中，它称为先验概率。对于计算先验概率的最佳方法，存在着一些争议。一方面，我们可以假定没有任何理由认为存在的男性比存在的女性多，因此为 P(male) 赋值 0.5。另一方面，我们可以利用训练数据有 24 位男性和 16 位女性这一事实，估计出 P(male) 概率为 24/40 = 0.6000。我更喜欢后一种方法，即使用训练数据估计先验概率。

P(education | female) = count(education & female) / count(female) = 4/16 = 0.2500
P(right | female) = count(right & female) / count(female) = 14/16 = 0.8750
P(tall | female) = count(tall & female) / count(female) = 2/16 = 0.1250
P(female) = 16/40 = 0.4000

因此，P(male | X) 的部分概率分子为：

PP(male | X) = 0.0833 * 0.7083 * 0.1667 * 0.6000 = 0.005903

同理可得，在 X = (education, right, tall) 的前提下女性的部分概率为：

PP(female | X) = 0.2500 * 0.8750 * 0.1250 * 0.4000 = 0.010938

最后，得出男性和女性的总体概率分别为：

P(male | X) = 0.005903 / (0.005903 + 0.010938) = 0.3505
P(female | X) = 0.010938 / (0.005903 + 0.010938) = 0.6495

这些总体概率有时称为后验概率。由于 P(female | X) 大于 P(male | X)，因此系统判定未知人员的性别为女性。但是，请等一等。这两个概率 0.3505 和 0.6495 与图 1 所示的两个概率 0.3855 和 0.6145 虽然很接近，却截然不同。产生这一差异的原因在于，演示程序使用了一项对基本 Naive Bayes 的重大可选修改，这项修改称为 Laplacian 平滑处理。

Laplacian 平滑处理

参考图 1 可以发现，人员职业为建筑且性别为女性的训练用例计数为 0。在演示中，X 值为 (education, right, tall)，其中并不包含建筑。但假设 X 为 (construction, right, tall)，那么，在计算 PP(female | X) 的过程中，必须计算 P(construction | female) = count(construction & female) / count(female)，而该概率为 0，这又会使整个部分概率为零。简而言之，当结点计数为 0 时很糟糕。避免此情况的最常用方法就是给所有结点计数加 1。此方法看似粗暴，实际却有可靠的数学基础。该方法称为加一平滑处理，是 Laplacian 平滑处理的一种具体形式。

利用 Laplacian 平滑处理，如果如前所述 X = (education, right, tall)，则 P(male | X) 和 P(female | X) 计算如下：

P(education | male ) =
count(education & male) + 1 / count(male) + 3 = 3/27 = 0.1111
P(right | male) =
count(right & male) + 1 / count(male) + 3 = 18/27 = 0.6667
P(tall | male) =
count(tall & male) + 1 / count(male) + 3 = 5/27 = 0.1852
P(male) = 24/40 = 0.6000
P(education | female) =
count(education & female) + 1 / count(female) + 3 = 5/19 = 0.2632
P(right | female) =
count(right & female) + 1 / count(female) + 3 = 15/19 = 0.7895
P(tall | female) =
count(tall & female) + 1 / count(female) + 3 = 3/19 = 0.1579
P(female) = 16/40 = 0.4000

部分概率为：

PP(male | X) = 0.1111 * 0.6667 * 0.1852 * 0.6000 = 0.008230
PP(female | X) = 0.2632 * 0.7895 * 0.1579 * 0.4000 = 0.013121

因此，两个最终概率为：

P(male | X) = 0.008230 / (0.008230 + 0.013121) = 0.3855
P(female | X) = 0.013121 / (0.008230 + 0.013121) = 0.6145

这些便是图 1 所示屏幕快照中的值。可以看到，虽然每个结点计数加了 1，但给分母 count(male) 和 count(female) 加了 3。在某种程度上来说，3 是任意值，因为 Laplacian 平滑处理并不指定要使用的任何具体值。在本例中，它是 X 属性 (occupation, dominance, height) 的数目。在 Laplacian 平滑处理中，这是加到部分概率分母的最常见的值，但您也可以试用其他值。在 Naive Bayes 的数学文字中，要加到分母的值通常被赋予符号 k。您还会看到，在 Naive Bayes Laplacian 平滑处理中，通常不会修改先验概率 P(male) 和 P(female)。

程序的整体结构

图 1 所示运行中的演示程序是单个 C# 控制台应用程序。 Main 方法如图 2 所示（其中删除了一些 WriteLine 语句）。

图 2 Naive Bayes 程序结构

using System;
namespace NaiveBayes
{
  class Program
  {
    static Random ran = new Random(25); // Arbitrary
    static void Main(string[] args)
    {
      try
      {
        string[] attributes = new string[] { "occupation", "dominance",
          "height", "sex"};
        string[][] attributeValues = new string[attributes.Length][];
        attributeValues[0] = new string[] { "administrative",
          "construction", "education", "technology" };
        attributeValues[1] = new string[] { "left", "right" };
        attributeValues[2] = new string[] { "short", "medium", "tall" };
        attributeValues[3] = new string[] { "male", "female" };
        double[][] numericAttributeBorders = new double[1][];
        numericAttributeBorders[0] = new double[] { 64.0, 71.0 };
        string[] data = MakeData(40);
        for (int i = 0; i < 4; ++i)
          Console.WriteLine(data[i]);
        string[] binnedData = BinData(data, attributeValues,
          numericAttributeBorders);
        for (int i = 0; i < 4; ++i)
          Console.WriteLine(binnedData[i]);
        int[][][] jointCounts = MakeJointCounts(binnedData, attributes,
          attributeValues);
        int[] dependentCounts = MakeDependentCounts(jointCounts, 2);
        Console.WriteLine("Total male = " + dependentCounts[0]);
        Console.WriteLine("Total female = " + dependentCounts[1]);
        ShowJointCounts(jointCounts, attributeValues);
        string occupation = "education";
        string dominance = "right";
        string height = "tall";
        bool withLaplacian = true;
        Console.WriteLine(" occupation = " + occupation);
        Console.WriteLine(" dominance = " + dominance);
        Console.WriteLine(" height = " + height);
        int c = Classify(occupation, dominance, height, jointCounts,
          dependentCounts, withLaplacian, 3);
        if (c == 0)
          Console.WriteLine("\nData case is most likely male");
        else if (c == 1)
          Console.WriteLine("\nData case is most likely female");
        Console.WriteLine("\nEnd demo\n");
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    } // End Main
    // Methods to create data
    // Method to bin data
    // Method to compute joint counts
    // Helper method to compute partial probabilities
    // Method to classify a data case
  } // End class Program
}

程序首先设置硬编码的 X 属性 occupation、dominance 和 height 以及因变量属性 sex。在某些情况下，您可能更愿意通过扫描现有数据源确定这些属性，尤其在数据源为包含标题的数据文件或是包含列名的 SQL 表时。演示程序还指定九个分类 X 属性值： occupation 的 (administrative, construction, education, technology)；dominance 的 (left, right) 和 height 的 (short, medium, tall)。在本例中，sex 有两个因变量属性值： (male, female)。同样，您也可以通过扫描数据以编程方式确定属性值。

演示程序通过设置硬编码的边界值 64.0 和 71.0 将 height 数值装箱，从而将小于等于 64.0 的 height 值分类为 short；将介于 64.0 与 71.0 之间的 height 值分类为 medium；将大于等于 71.0 的 height 值分类为 tall。将 Naive Bayes 数值数据装箱时，边界值的数目比类别数少一个。在本例中，确定 64.0 和 71.0 的方式如下：先扫描训练数据找出最小和最大 height 值（57.0 和 78.0），计算这两者之差 21.0，然后通过将该差值除以 height 类别数目 3 计算出间隔大小（即 7.0）。多数情况下，您都会以编程而不是手动方式确定数值 X 属性的边界值。

演示程序通过调用 Helper 方法 MakeData 生成有些随机的训练数据。 MakeData 再调用 Helper 方法 MakeSex、MakeOccupation、MakeDominance 和 MakeHeight。例如，这些 Helper 方法会生成数据，使男性职业更可能为建筑和技术，男性习惯用手更可能为右手，而男性身高更可能介于 66.0 和 72.0 英寸之间。

Main 中调用的主要方法及其用途如下：BinData 用于分类身高数据；MakeJointCounts 用于扫描装箱数据并计算结点计数；MakeDependentCounts 用于计算男性和女性的总人数；Classify 使用结点计数和因变量计数执行 Naive Bayes 分类。

数据装箱

方法 BinData 如图 3 所示。该方法接受一个由逗号分隔字符串组成的数组，其中每个字符串类似于“education,left,67.5,male”。许多情况下，您将从每行均为字符串的文本文件读取训练数据。该方法使用 String.Split 将每个字符串解析为标记。 Token[2] 表示 height。它通过 double.Parse 方法从字符串转换为双精度类型。 height 数值与边界值进行比较，直到找到 height 的间隔，然后确定字符串形式的相应 height 类别。然后，使用旧标记、逗号分隔符和新计算出的 height 类别字符串将所得的字符串连在一起。

图 3 用于对身高进行分类的方法 BinData

static string[] BinData(string[] data, string[][] attributeValues,
  double[][] numericAttributeBorders)
{
  string[] result = new string[data.Length];
  string[] tokens;
  double heightAsDouble;
  string heightAsBinnedString;
  for (int i = 0; i < data.Length; ++i)
  {
    tokens = data[i].Split(',');
    heightAsDouble = double.Parse(tokens[2]);
    if (heightAsDouble <= numericAttributeBorders[0][0]) // Short
      heightAsBinnedString = attributeValues[2][0];
    else if (heightAsDouble >= numericAttributeBorders[0][1]) // Tall
      heightAsBinnedString = attributeValues[2][2];
    else
      heightAsBinnedString = attributeValues[2][1]; // Medium
    string s = tokens[0] + "," + tokens[1] + "," + heightAsBinnedString +
      "," + tokens[3];
    result[i] = s;
  }
  return result;
}

将数值数据装箱并非执行 Naive Bayes 分类的硬性要求。 Naive Bayes 可直接处理数值数据，但这些方法超出了本文的讨论范围。数据装箱的优点是十分简单，而且无需对数据的数学分布（如高斯或泊松分布）明确作出任何具体假定。不过，数据装箱实际上会丢失信息，而且需要确定并指定将数据划分为多少个类别。

确定结点计数

Naive Bayes 分类的关键在于计算结点计数。在演示示例中，共有九个自变量 X 属性值 (administrative, construction, … tall) 和两个因变量属性值 (male, female)，因此总共必须计算并存储 9 * 2 = 18 个结点计数。我的首选方法是将结点计数存储在一个三维数组 int[][][] jointCounts 中。第一个索引表示自变量 X 属性；第二个索引表示自变量 X 属性值；第三个索引表示因变量属性值。例如，jointCounts[0][3][1] 表示属性 0 (occupation)、属性值 3 (technology) 和 sex 1 (female)，换句话说，jointCounts[0][3][1] 中的值是职业为技术且性别为女性的训练用例的计数。方法 MakeJointCounts 如图 4 所示。

图 4 方法 MakeJointCounts

static int[][][] MakeJointCounts(string[] binnedData, string[] attributes,
  string[][] attributeValues)
{
  int[][][] jointCounts = new int[attributes.Length - 1][][]; // -1 (no sex)
  jointCounts[0] = new int[4][]; // 4 occupations
  jointCounts[1] = new int[2][]; // 2 dominances
  jointCounts[2] = new int[3][]; // 3 heights
  jointCounts[0][0] = new int[2]; // 2 sexes for administrative
  jointCounts[0][1] = new int[2]; // construction
  jointCounts[0][2] = new int[2]; // education
  jointCounts[0][3] = new int[2]; // technology
  jointCounts[1][0] = new int[2]; // left
  jointCounts[1][1] = new int[2]; // right
  jointCounts[2][0] = new int[2]; // short
  jointCounts[2][1] = new int[2]; // medium
  jointCounts[2][2] = new int[2]; // tall
  for (int i = 0; i < binnedData.Length; ++i)
  {
    string[] tokens = binnedData[i].Split(',');
    int occupationIndex = AttributeValueToIndex(0, tokens[0]);
    int dominanceIndex = AttributeValueToIndex(1, tokens[1]);
    int heightIndex = AttributeValueToIndex(2, tokens[2]);
    int sexIndex = AttributeValueToIndex(3, tokens[3]);
    ++jointCounts[0][occupationIndex][sexIndex];
    ++jointCounts[1][dominanceIndex][sexIndex];
    ++jointCounts[2][heightIndex][sexIndex];
  }
  return jointCounts;
}

该实现包含许多硬编码值，这样更容易理解。例如，下面三条语句可换为一个 for 循环，该循环通过在数组 attributeValues 中使用 Length 属性来分配空间：

jointCounts[0] = new int[4][]; // 4 occupations
jointCounts[1] = new int[2][]; // 2 dominances
jointCounts[2] = new int[3][]; // 3 heights

Helper 函数 AttributeValueToIndex 接受属性索引和属性值字符串并返回相应的索引。例如，AttributeValueToIndex(2, “medium”) 返回 height 属性中“medium”的索引，也就是 1。

演示程序使用方法 MakeDependentCounts 确定男性和女性数据用例的数目。有多种方法可以做到这一点。参考图 1 可以发现，有一种方法是添加三个属性中任何一个的结点计数。例如，男性人数是 count(administrative & male)、count(construction & male)、count(education & male) 和 count(technology & male) 的总和：

static int[] MakeDependentCounts(int[][][] jointCounts,
  int numDependents)
{
  int[] result = new int[numDependents];
  for (int k = 0; k < numDependents; ++k) 
  // Male then female
    for (int j = 0; j < jointCounts[0].Length; ++j)
    // Scanning attribute 0
      result[k] += jointCounts[0][j][k];
  return result;
}

对数据用例进行分类

方法 Classify 如图 5 所示，该方法很短，因为它依赖于 Helper 方法。

图 5 方法 Classify

static int Classify(string occupation, string dominance, string height,
  int[][][] jointCounts, int[] dependentCounts, bool withSmoothing,
  int xClasses)
{
  double partProbMale = PartialProbability("male", occupation, dominance,
    height, jointCounts, dependentCounts, withSmoothing, xClasses);
  double partProbFemale = PartialProbability("female", occupation, dominance,
    height, jointCounts, dependentCounts, withSmoothing, xClasses);
  double evidence = partProbMale + partProbFemale;
  double probMale = partProbMale / evidence;
  double probFemale = partProbFemale / evidence;
  if (probMale > probFemale) return 0;
  else return 1;
}

方法 Classify 接受以下参数：jointCounts 和 dependentCounts 数组；一个布尔型字段，用于指示是否使用 Laplacian 平滑处理；参数 xClasses，在本示例中为 3，因为有三个自变量 (occupation, dominance, height)。此参数也可从 jointCounts 参数推断得到。

方法 Classify 返回 int 值，用于表示预测因变量的索引。实际上，您可能想要返回每个因变量的概率的数组。请注意，分类基于 probMale 和 probFemale，它们均是通过部分概率与证据值相除得出的。您可能希望直接忽略证据项，只比较部分概率值本身。

方法 Classify 返回具有最大概率的因变量的索引。替代方法是提供一个阈值。例如，假设 probMale 为 0.5001，probFemale 为 0.4999。您可能认为这两个值过于接近，因而返回一个表示“未定”的分类值。

方法 PartialProbability 代 Classify 执行了大部分操作，如图 6 所示。

图 6 方法 PartialProbability

static double PartialProbability(string sex, string occupation, string dominance,
  string height, int[][][] jointCounts, int[] dependentCounts,
  bool withSmoothing, int xClasses)
{
  int sexIndex = AttributeValueToIndex(3, sex);
  int occupationIndex = AttributeValueToIndex(0, occupation);
  int dominanceIndex = AttributeValueToIndex(1, dominance);
  int heightIndex = AttributeValueToIndex(2, height);
  int totalMale = dependentCounts[0];
  int totalFemale = dependentCounts[1];
  int totalCases = totalMale + totalFemale;
  int totalToUse = 0;
  if (sex == "male") totalToUse = totalMale;
  else if (sex == "female") totalToUse = totalFemale;
  double p0 = (totalToUse * 1.0) / (totalCases); // Prob male or female
  double p1 = 0.0;
  double p2 = 0.0;
  double p3 = 0.0;
  if (withSmoothing == false)
  {
    p1 = (jointCounts[0][occupationIndex][sexIndex] * 1.0) / totalToUse
    p2 = (jointCounts[1][dominanceIndex][sexIndex] * 1.0) / totalToUse;  
    p3 = (jointCounts[2][heightIndex][sexIndex] * 1.0) / totalToUse;     
  }
  else if (withSmoothing == true)
  {
    p1 = (jointCounts[0][occupationIndex][sexIndex] + 1) /
     ((totalToUse + xClasses) * 1.0); 
    p2 = (jointCounts[1][dominanceIndex][sexIndex] + 1) /
     ((totalToUse + xClasses) * 1.0 ;
    p3 = (jointCounts[2][heightIndex][sexIndex] + 1) /
     ((totalToUse + xClasses) * 1.0);
  }
  //return p0 * p1 * p2 * p3; // Risky if any very small values
  return Math.Exp(Math.Log(p0) + Math.Log(p1) + Math.Log(p2) + Math.Log(p3));
}

为清楚起见，方法 PartialProbability 几乎全部采用了硬编码。例如，有四个概率组分 p0、p1、p2 和 p3。您可以使用概率数组使 PartialProbability 更通用，该数组大小由 jointCounts 数组确定。

请注意，该方法并不是返回四个概率组分的乘积，而是返回每个组分的对数和的对等指数。使用对数概率是计算机学习算法中的一种标准方法，可用于避免非常小的实数数值可能出现的数值错误。

总结

本文展示的示例应该能为您向 .NET 应用程序添加 Naive Bayes 分类功能打下良好的基础。 Naive Bayes 分类是一种相对粗糙的方法，但相较神经网络分类、逻辑回归分类和支持向量机分类等比较精深的其他方法，确有多方面优势。 Naive Bayes 十分简单，相对易于实现，并且能够适应超大型数据集。此外，Naive Bayes 还可轻松应对多项分类问题，即包含三个或更多因变量的问题。

James McCaffrey博士供职于 Volt Information Sciences, Inc.，在该公司他负责管理对华盛顿州雷蒙德市沃什湾 Microsoft 总部园区的软件工程师进行的技术培训。他参与过多项 Microsoft 产品的研发工作，其中包括 Internet Explorer 和 MSN Search。他是《.NET Test Automation Recipes》(Apress, 2006) 的作者，您可以通过以下电子邮箱地址与他联系：jammc@microsoft.com。

衷心感谢以下 Microsoft 技术专家对本文的审阅： Rich Caruana