# Data Clustering Using Naive Bayes Inference

James McCaffrey

## Program Structure

``````using System;
using System.Collections.Generic;

namespace ClusteringBayesian
{
  class ClusteringBayesianProgram
  {
    static Random random = null;

    static void Main(string[] args)
    {
      try
      {
        Console.WriteLine("\nBegin data clustering using Naive Bayes demo\n");
        random = new Random(6); // Seed of 6 gives a nice demo

        int numTuples = 8;
        Console.WriteLine("Generating " + numTuples +
          " tuples of location-income-politics data");
        string[] rawData = MakeData(numTuples);

        Console.WriteLine("\nRaw data in string form is:\n");
        ShowData(rawData, numTuples);

        string[] attNames = new string[] { "Location", "Income", "Politics" };
        string[][] attValues = new string[attNames.Length][];
        attValues[0] = new string[] { "Urban", "Suburban", "Rural" }; // Location
        attValues[1] = new string[] { "Low", "Medium", "High", "VeryHigh" }; // Income
        attValues[2] = new string[] { "Liberal", "Conservative" }; // Politics

        Console.WriteLine("Converting raw data to int form and " +
          "storing in memory\n");
        int[][] tuplesAsInt = TuplesToInts(rawData, attValues);

        Console.WriteLine("Data in int form is:\n");
        ShowData(tuplesAsInt, numTuples);

        int numClusters = 3;
        int numTrials = 10; // Times to get good seed indexes (different tuples)
        Console.WriteLine("\nSetting numClusters to " + numClusters);

        Console.WriteLine("\nInitializing Naive Bayes joint counts with " +
          "Laplacian smoothing");
        int[][][] jointCounts = InitJointCounts(tuplesAsInt, attValues,
          numClusters);
        int[] clusterCounts = new int[numClusters];

        Console.WriteLine("\nBegin clustering data using Naive Bayes " +
          "(with equal priors) ");
        int[] clustering = Cluster(tuplesAsInt, numClusters, numTrials,
          jointCounts, clusterCounts, true);
        Console.WriteLine("Clustering complete");

        Console.WriteLine("\nClustering in internal form is:\n");
        for (int i = 0; i < clustering.Length; ++i)
          Console.Write(clustering[i] + " ");

        Console.WriteLine("\nRaw data displayed in clusters is:\n");
        DisplayClustered(tuplesAsInt, numClusters, clustering, attValues);
        Console.WriteLine("\nEnd demo\n");
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    } // Main

    // Other methods go here

  } // class
} // ns
``````

``````int numTuples = 4;
string[] rawData = new string[] { "Urban,VeryHigh,Liberal",
"Rural,Medium,Conservative", "Urban,Low,Liberal",
"Suburban,Medium,Liberal" };
``````
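The raw comma-delimited strings are converted to int form by replacing each attribute value with its index in the corresponding attValues array. The C# demo does this in method TuplesToInts; here is a minimal Python sketch of that mapping (the Python names are illustrative, not from the code download):

```python
# Each attribute value is replaced by its index in the attValues table,
# so "Urban,VeryHigh,Liberal" becomes [0, 3, 0].
att_values = [
    ["Urban", "Suburban", "Rural"],         # Location
    ["Low", "Medium", "High", "VeryHigh"],  # Income
    ["Liberal", "Conservative"],            # Politics
]

def tuples_to_ints(raw_data):
    result = []
    for line in raw_data:
        tokens = line.split(",")
        result.append([att_values[j].index(tok) for j, tok in enumerate(tokens)])
    return result

raw_data = ["Urban,VeryHigh,Liberal", "Rural,Medium,Conservative"]
print(tuples_to_ints(raw_data))  # [[0, 3, 0], [2, 1, 1]]
```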

## Naive Bayes Inference

``````[0] Suburban VeryHigh  Liberal
[1] Rural    Medium    Conservative
[2] Suburban VeryHigh  Conservative
[3] Suburban Low       Liberal
[4] Suburban High      Liberal
[5] Suburban Low       Conservative
[6] Urban    Low       Liberal
[7] Rural    Low       Conservative
``````

``````[  2,  0, -1, -1, -1, -1,  1, -1 ]
``````

``````c0 : Rural    Medium   Conservative  (tuple 1)
c1 : Urban    Low      Liberal       (tuple 6)
c2 : Suburban VeryHigh Liberal       (tuple 0)
``````

``````P(c0 | X)
P(c1 | X)
P(c2 | X)
``````

``````P(c0 | X) = PP(c0 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
P(c1 | X) = PP(c1 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
P(c2 | X) = PP(c2 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
``````

The partial probability PP(c0 | X) is:

``````PP(c0 | X) = P(Suburban     | c0) *
             P(VeryHigh     | c0) *
             P(Conservative | c0) *
             P(c0)
           = count(Suburban     & c0) / count c0 *
             count(VeryHigh     & c0) / count c0 *
             count(Conservative & c0) / count c0 *
             P(c0)
``````

``````PP(c0 | X) = P(Suburban     | c0) *
             P(VeryHigh     | c0) *
             P(Conservative | c0) *
             P(c0)
           = count(Suburban     & c0) / (count c0 + 3) *
             count(VeryHigh     & c0) / (count c0 + 3) *
             count(Conservative & c0) / (count c0 + 3) *
             1 / numClusters
           = (1 / 4) * (1 / 4) * (2 / 4) * (1 / 3)
           = 0.0104
``````

``````PP(c1 | X) = count(Suburban     & c1) / (count c1 + 3) *
             count(VeryHigh     & c1) / (count c1 + 3) *
             count(Conservative & c1) / (count c1 + 3) *
             1 / numClusters
           = (1 / 4) * (1 / 4) * (1 / 4) * (1 / 3)
           = 0.0052

PP(c2 | X) = count(Suburban     & c2) / (count c2 + 3) *
             count(VeryHigh     & c2) / (count c2 + 3) *
             count(Conservative & c2) / (count c2 + 3) *
             1 / numClusters
           = (2 / 4) * (2 / 4) * (1 / 4) * (1 / 3)
           = 0.0208
``````
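The smoothed partial-probability arithmetic above can be reproduced with a short Python sketch (illustrative only; the counts passed in already include the +1 Laplacian smoothing, and every denominator is the cluster count plus 3, exactly as in the equations):

```python
# Partial probabilities for X = (Suburban, VeryHigh, Conservative)
# against the three seed clusters, with equal priors of 1/numClusters.
num_clusters = 3

def partial(smoothed_counts, cluster_count):
    pp = 1.0 / num_clusters  # equal priors
    for c in smoothed_counts:
        pp *= c / (cluster_count + 3)  # +3 smoothing denominator
    return pp

pp_c0 = partial([1, 1, 2], 1)  # counts of Suburban, VeryHigh, Conservative in c0
pp_c1 = partial([1, 1, 1], 1)
pp_c2 = partial([2, 2, 1], 1)
print(round(pp_c0, 4), round(pp_c1, 4), round(pp_c2, 4))  # 0.0104 0.0052 0.0208
```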

``````P(c0 | X) = PP(c0 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
          = 0.0104 / (0.0104 + 0.0052 + 0.0208)
          = 0.2857
P(c1 | X) = PP(c1 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
          = 0.0052 / (0.0104 + 0.0052 + 0.0208)
          = 0.1429
P(c2 | X) = PP(c2 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
          = 0.0208 / (0.0104 + 0.0052 + 0.0208)
          = 0.5714
``````
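The normalization step can be verified in a few lines of Python (illustrative):

```python
# Dividing each partial probability by the sum of all three
# yields probabilities that sum to 1.
pps = [0.0104, 0.0052, 0.0208]
total = sum(pps)
probs = [round(pp / total, 4) for pp in pps]
print(probs)  # [0.2857, 0.1429, 0.5714]
```

Tuple X is assigned to cluster c2, the cluster with the largest probability.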

## The Clustering Method

Figure 4 The Cluster Method

``````static int[] Cluster(int[][] tuplesAsInt, int numClusters,
  int numTrials, int[][][] jointCounts, int[] clusterCounts,
  bool equalPriors)
{
  int numRows = tuplesAsInt.Length;
  int[] clustering = new int[numRows];
  for (int i = 0; i < clustering.Length; ++i)
    clustering[i] = -1;

  int[] goodIndexes = GetGoodIndexes(tuplesAsInt, numClusters, numTrials);
  for (int i = 0; i < goodIndexes.Length; ++i)
  {
    int idx = goodIndexes[i];
    clustering[idx] = i;
  }

  for (int i = 0; i < goodIndexes.Length; ++i)
  {
    int idx = goodIndexes[i];
    for (int j = 0; j < tuplesAsInt[idx].Length; ++j)
    {
      int v = tuplesAsInt[idx][j];
      ++jointCounts[j][v][i]; // Very tricky indexing
    }
  }

  for (int i = 0; i < clusterCounts.Length; ++i)
    ++clusterCounts[i];

  for (int i = 0; i < tuplesAsInt.Length; ++i)
  {
    if (clustering[i] != -1) continue; // Tuple already clustered
    double[] allProbabilities = AllProbabilities(tuplesAsInt[i],
      jointCounts, clusterCounts, equalPriors);
    int c = IndexOfLargestProbability(allProbabilities);
    clustering[i] = c; // Assign tuple i to cluster c
    for (int j = 0; j < tuplesAsInt[i].Length; ++j)
    {
      int v = tuplesAsInt[i][j];
      ++jointCounts[j][v][c];
    }
    ++clusterCounts[c];
  } // Main loop

  return clustering;
}
``````
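Helpers AllProbabilities and IndexOfLargestProbability are not listed here. The following is a hypothetical Python sketch of the logic they would implement, using the +3 smoothing denominator from the worked example (the non-equal-priors branch, which uses the cluster counts as priors, is an assumption, not the article's code):

```python
# Sketch of the per-tuple assignment step used by Cluster: compute the
# smoothed partial probability of the tuple for each cluster, normalize,
# and pick the cluster with the largest probability.
def all_probabilities(tuple_as_int, joint_counts, cluster_counts, equal_priors):
    num_clusters = len(cluster_counts)
    num_assigned = sum(cluster_counts)
    partials = []
    for c in range(num_clusters):
        # equal priors assume every cluster is equally likely
        pp = 1.0 / num_clusters if equal_priors else cluster_counts[c] / num_assigned
        for j, v in enumerate(tuple_as_int):
            pp *= joint_counts[j][v][c] / (cluster_counts[c] + 3)  # +3 as above
        partials.append(pp)
    total = sum(partials)
    return [pp / total for pp in partials]

def index_of_largest(probs):
    return max(range(len(probs)), key=lambda i: probs[i])

# Joint counts after seeding with tuples 1, 6, 0 (including +1 smoothing):
jc = [
    [[1, 2, 1], [1, 1, 2], [2, 1, 1]],             # Location
    [[1, 2, 1], [2, 1, 1], [1, 1, 1], [1, 1, 2]],  # Income
    [[1, 2, 2], [2, 1, 1]],                        # Politics
]
probs = all_probabilities([1, 3, 1], jc, [1, 1, 1], True)  # tuple 2 = X
print([round(p, 4) for p in probs])  # [0.2857, 0.1429, 0.5714]
```

Fed the seeded joint counts from the worked example, the sketch reproduces the 0.2857 / 0.1429 / 0.5714 probabilities for tuple X.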

``````loop n times
  select a random tuple index
  unassign the tuple from its curr cluster
  assign the tuple using non-equal priors computation
end loop
``````

INBIAC builds the clustering by assigning each tuple, one at a time, to the cluster with the greatest probability of membership. The probability computation uses equal priors, that is, it assumes each cluster is equally likely. After clustering, however, information is available about how likely each cluster actually is, and that information can be used to possibly refine the clustering result. The code download implements a method that uses this approach.

## Getting Good Initial Seed Tuples

The INBIAC algorithm seeds each cluster with a single tuple. It's important that these seed tuples be as different from one another as possible. There are many dissimilarity measures used by clustering algorithms. Method GetGoodIndexes generates a set of random candidate indexes, then computes the total number of tuple attributes in which the candidates differ, a metric called the Hamming distance. This process is repeated numTrials times, and the indexes of the tuples with the greatest dissimilarity are returned.

``````[0] Suburban VeryHigh Liberal
[1] Rural    Medium   Conservative
[2] Suburban VeryHigh Conservative
``````
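The dissimilarity between two tuples in int form is just their Hamming distance, sketched in Python (illustrative):

```python
# Hamming distance: the number of attribute positions where two tuples differ.
def hamming(t1, t2):
    return sum(1 for a, b in zip(t1, t2) if a != b)

# Tuples 0 and 1 of the demo data differ in all three attributes:
# (Suburban, VeryHigh, Liberal) vs. (Rural, Medium, Conservative)
print(hamming([1, 3, 0], [2, 1, 1]))  # 3
```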

``````init a result array
loop numTrials times
  generate numClusters distinct random tuple indexes
  compute their dissimilarity
  if curr dissimilarity > greatest dissimilarity
    greatest dissimilarity = curr dissimilarity
    store curr indexes into result array
  end if
end loop
return result array
``````
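The GetGoodIndexes loop above can be sketched in Python (illustrative; summing the Hamming distance over all pairs of candidate tuples is an assumption about how the total dissimilarity is computed):

```python
import itertools
import random

# Best-of-numTrials search for maximally dissimilar seed tuples.
def get_good_indexes(tuples, num_clusters, num_trials, rng=random.Random(0)):
    best_indexes, best_dissim = None, -1
    for _ in range(num_trials):
        # numClusters distinct random tuple indexes
        candidate = rng.sample(range(len(tuples)), num_clusters)
        # total attribute differences over all pairs of candidate tuples
        dissim = sum(
            sum(a != b for a, b in zip(tuples[i], tuples[j]))
            for i, j in itertools.combinations(candidate, 2)
        )
        if dissim > best_dissim:
            best_dissim, best_indexes = dissim, candidate
    return best_indexes
```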

``````init a dictionary to store result indexes
init a result array
set count = 0
while count < numClusters
  generate a random tuple index
  if index not in dictionary
    add index to dictionary
    store index into result array
    increment count
  end if
end while
return result
``````
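The distinct-index helper can be sketched in Python (illustrative; a set plays the role of the dictionary):

```python
import random

# Generate numClusters distinct random tuple indexes, rejecting duplicates.
def distinct_random_indexes(num_clusters, num_tuples, rng=random.Random(0)):
    seen = set()
    result = []
    while len(result) < num_clusters:
        idx = rng.randrange(num_tuples)
        if idx not in seen:  # only keep indexes not generated before
            seen.add(idx)
            result.append(idx)
    return result
```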