# Converting Numeric Data to Categorical Data

James McCaffrey

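## K-Means Clustering

The k-means clustering algorithm is fairly simple, and it has many variations. In its most basic form, for a given set of data points and a given number of clusters k, the initialization step assigns each data point to a randomly selected cluster. Then the mean of the data points in each cluster is computed. Next, each data point is scanned and reassigned to the cluster whose mean is closest to that data point. The compute-means and reassign-clusters steps repeat until no data point is reassigned to a new cluster.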
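The basic form described above can be sketched in a few lines of C#. This is a minimal illustration for one-dimensional data, not the article's Discretizer code: the `BasicKMeans` class, its `Cluster` method, the `seed` parameter, and the iteration cap are all assumptions introduced for the example. An empty cluster simply keeps its previous mean here, which is a simplification.

```csharp
using System;

public static class BasicKMeans
{
  // Returns the cluster ID (0..k-1) assigned to each data point.
  public static int[] Cluster(double[] data, int k, int seed)
  {
    Random rnd = new Random(seed);
    int[] clustering = new int[data.Length];

    // Initialization: assign every point to a randomly selected cluster.
    for (int i = 0; i < data.Length; ++i)
      clustering[i] = rnd.Next(0, k);

    double[] means = new double[k];
    bool changed = true;
    int maxIter = data.Length * 10; // safety cap (an assumption of this sketch)

    for (int iter = 0; changed && iter < maxIter; ++iter)
    {
      // Step 1: compute the mean of the points in each cluster.
      double[] sums = new double[k];
      int[] counts = new int[k];
      for (int i = 0; i < data.Length; ++i)
      {
        sums[clustering[i]] += data[i];
        counts[clustering[i]]++;
      }
      for (int c = 0; c < k; ++c)
        if (counts[c] > 0)
          means[c] = sums[c] / counts[c]; // an empty cluster keeps its old mean

      // Step 2: reassign each point to the cluster with the closest mean.
      changed = false;
      for (int i = 0; i < data.Length; ++i)
      {
        int best = 0;
        for (int c = 1; c < k; ++c)
          if (Math.Abs(data[i] - means[c]) < Math.Abs(data[i] - means[best]))
            best = c;
        if (best != clustering[i])
        {
          clustering[i] = best;
          changed = true;
        }
      }
    }
    return clustering;
  }
}
```

A call such as `BasicKMeans.Cluster(new double[] { 1, 2, 3, 100, 101, 102 }, 2, 0)` returns one cluster ID per input value. Because the starting assignment is random, the result can depend on the seed.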

## Overall Program Structure

``````
double[] rawData = new double[] { 66.0, 66.0, ...
};
Discretizer d = new Discretizer(rawData);
double numericVal = 75.5;
int catVal = d.Discretize(numericVal);
``````

``````
using System;
using System.Collections.Generic;

namespace BinningData
{
  class BinningProgram
  {
    static void Main(string[] args)
    {
      try
      {
        Console.WriteLine("\nBegin discretization of continuous data demo\n");
        double[] rawData = new double[20] {
          66, 66, 66, 67, 67, 67, 67, 68, 68, 69,
          73, 73, 73, 74, 76, 78,
          60, 61, 62, 62 };
        Console.WriteLine("Raw data:");
        ShowVector(rawData, 2, 10);
        Console.WriteLine("\nCreating a discretizer on the raw data");
        Discretizer d = new Discretizer(rawData);
        Console.WriteLine("\nDiscretizer creation complete");
        Console.WriteLine("\nDisplaying internal structure of the discretizer:\n");
        Console.WriteLine(d.ToString());
        Console.WriteLine("\nGenerating three existing and three new data values");
        double[] newData = new double[6] { 62.0, 66.0, 73.0, 59.5, 75.5, 80.5 };
        Console.WriteLine("\nData values:");
        ShowVector(newData, 2, 10);
        Console.WriteLine("\nDiscretizing the data:\n");
        for (int i = 0; i < newData.Length; ++i)
        {
          int cat = d.Discretize(newData[i]);
          Console.WriteLine(newData[i].ToString("F2") + " -> " + cat);
        }
        Console.WriteLine("\n\nEnd discretization demo");
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    } // Main

    public static void ShowVector(double[] vector, int decimals, int itemsPerRow) { . . }
  } // Program

  public class Discretizer
  {
    public Discretizer(double[] rawData) { . . }
    private static double[] GetDistinctValues(double[] array) { . . }
    private static bool AreEqual(double x1, double x2) { . . }
    public int Discretize(double x) { . . }
    public override string ToString() { . . }
    private void InitializeClustering() { . . }
    private int[] GetInitialIndexes() { . . }
    private int InitialCluster(int di, int[] initialIndexes) { . . }
    private void Cluster() { . . }
    private bool ComputeMeans() { . . }
    private bool AssignAll() { . . }
    private int MinIndex(double[] distances) { . . }
    private static double Distance(double x1, double x2) { . . }
  }
} // ns
``````

## The Discretizer Class

The Discretizer class has four data members:

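``````
private double[] data;     // distinct values of the raw data, sorted ascending
private int k;             // number of clusters (categories)
private double[] means;    // current mean of each cluster
private int[] clustering;  // cluster ID assigned to each entry of data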
``````

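The Discretizer constructor uses the numeric data to enable a Discretize method that accepts a numeric value and returns a zero-based integer categorical value. Notice that the Discretizer determines the number of categories automatically.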

## The Discretizer Constructor

The Discretizer constructor is defined as:

``````
public Discretizer(double[] rawData)
{
  // Work on a sorted copy so the caller's array is left untouched
  double[] sortedRawData = new double[rawData.Length];
  Array.Copy(rawData, sortedRawData, rawData.Length);
  Array.Sort(sortedRawData);
  this.data = GetDistinctValues(sortedRawData);
  this.clustering = new int[data.Length];
  this.k = (int)Math.Sqrt(data.Length); // heuristic
  this.means = new double[k];
  this.Cluster();
}
``````

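``````
// Assumes array is sorted and has at least one element
private static double[] GetDistinctValues(double[] array)
{
  List<double> distinctList = new List<double>();
  distinctList.Add(array[0]); // the first value is always distinct
  for (int i = 0; i < array.Length - 1; ++i)
    if (AreEqual(array[i], array[i + 1]) == false)
      distinctList.Add(array[i + 1]); // a new value starts here
  double[] result = new double[distinctList.Count];
  distinctList.CopyTo(result);
  return result;
}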
``````

``````
private static bool AreEqual(double x1, double x2)
{
  if (Math.Abs(x1 - x2) < 0.000001) return true;
  else return false;
}
``````

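The AreEqual method uses an arbitrary closeness threshold of 0.000001. You might want to pass this value into the Discretizer object as an input parameter. A variable named epsilon is often used in this scenario.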
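One way to make the threshold configurable is sketched below. The `ToleranceComparer` class and its constructor parameter are hypothetical names introduced for illustration; the same idea could be applied by adding an epsilon parameter to the Discretizer constructor and storing it in a field.

```csharp
using System;

// Hypothetical helper: holds the closeness threshold instead of hard-coding it.
public class ToleranceComparer
{
  private readonly double epsilon;

  // epsilon defaults to the 0.000001 threshold used by AreEqual in the article.
  public ToleranceComparer(double epsilon = 0.000001)
  {
    if (epsilon <= 0.0)
      throw new ArgumentOutOfRangeException("epsilon", "epsilon must be positive");
    this.epsilon = epsilon;
  }

  // Two values count as equal when their absolute difference is below epsilon.
  public bool AreEqual(double x1, double x2)
  {
    return Math.Abs(x1 - x2) < epsilon;
  }
}
```

With the default threshold, `new ToleranceComparer().AreEqual(1.0, 1.1)` is false, while a looser comparer such as `new ToleranceComparer(0.5)` treats the same pair as equal.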

## The Clustering Algorithm

The heart of the Discretizer class is the code that performs k-means clustering. The Cluster method is listed in Figure 4.

``````
private void Cluster()
{
  InitializeClustering();
  ComputeMeans();
  bool changed = true; bool success = true;
  int ct = 0;
  int maxCt = data.Length * 10; // Heuristic
  // Stop when nothing is reassigned, a cluster becomes empty, or the cap is hit
  while (changed == true && success == true && ct < maxCt) {
    ++ct;
    changed = AssignAll();
    success = ComputeMeans();
  }
}
``````

``````
private static double Distance(double x1, double x2)
{
  return Math.Sqrt((x1 - x2) * (x1 - x2));
}
``````

``````
private bool ComputeMeans()
{
  double[] sums = new double[k];
  int[] counts = new int[k];
  for (int i = 0; i < data.Length; ++i)
  {
    int c = clustering[i]; // Cluster ID
    sums[c] += data[i];
    counts[c]++;
  }
  for (int c = 0; c < sums.Length; ++c)
  {
    if (counts[c] == 0)
      return false; // fail
    else
      sums[c] = sums[c] / counts[c];
  }
  sums.CopyTo(this.means, 0);
  return true; // Success
}
``````
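The Cluster method also calls AssignAll, whose body is not listed in this section. The following is a minimal sketch of what such a reassignment step might look like, written as a standalone static method so it can run on its own. The `AssignmentSketch` class name and the explicit `data`, `means`, and `clustering` parameters are assumptions of the sketch, and the nearest-mean search is inlined rather than delegated to Distance and MinIndex.

```csharp
using System;

public static class AssignmentSketch
{
  // Reassigns each value to the cluster with the nearest mean.
  // Returns true if at least one value moved to a different cluster.
  public static bool AssignAll(double[] data, double[] means, int[] clustering)
  {
    bool changed = false;
    for (int i = 0; i < data.Length; ++i)
    {
      // Find the index of the mean closest to data[i].
      int nearest = 0;
      double best = Math.Abs(data[i] - means[0]);
      for (int c = 1; c < means.Length; ++c)
      {
        double d = Math.Abs(data[i] - means[c]);
        if (d < best)
        {
          best = d;
          nearest = c;
        }
      }
      if (nearest != clustering[i])
      {
        clustering[i] = nearest;
        changed = true;
      }
    }
    return changed;
  }
}
```

The Boolean return value is what lets a loop like the one in Cluster stop as soon as a full pass leaves every value in its current cluster.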

## Initializing the Clustering

``````
private int[] GetInitialIndexes()
{
  // Border indexes that split data into k roughly equal runs
  int interval = data.Length / k;
  int[] result = new int[k];
  for (int i = 0; i < k; ++i)
    result[i] = interval * (i + 1);
  return result;
}
``````

``````
private int InitialCluster(int di, int[] initialIndexes)
{
  // The first border index greater than di determines the cluster
  for (int i = 0; i < initialIndexes.Length; ++i)
    if (di < initialIndexes[i])
      return i;
  return initialIndexes.Length - 1; // Last cluster
}
``````

``````
private void InitializeClustering()
{
  int[] initialIndexes = GetInitialIndexes();
  for (int di = 0; di < data.Length; ++di)
  {
    int c = InitialCluster(di, initialIndexes);
    clustering[di] = c;
  }
}
``````
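To see what these three methods produce together, the sketch below folds them into one standalone function. The 20 raw values in the demo contain 11 distinct values, so the constructor's heuristic gives k = (int)Math.Sqrt(11) = 3, the interval is 11 / 3 = 3, and the border indexes are 3, 6, and 9. The `InitSketch` class and `InitialClustering` method are hypothetical names used only for this illustration.

```csharp
using System;

public static class InitSketch
{
  // Combines GetInitialIndexes, InitialCluster and InitializeClustering
  // for n sorted distinct values and k clusters.
  public static int[] InitialClustering(int n, int k)
  {
    int interval = n / k;
    int[] borders = new int[k];
    for (int i = 0; i < k; ++i)
      borders[i] = interval * (i + 1);

    int[] clustering = new int[n];
    for (int di = 0; di < n; ++di)
    {
      int c = k - 1; // indexes at or past the last border go to the last cluster
      for (int i = 0; i < k; ++i)
      {
        if (di < borders[i])
        {
          c = i;
          break;
        }
      }
      clustering[di] = c;
    }
    return clustering;
  }
}
```

For n = 11 and k = 3 the result is 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2: the first two clusters each start with three values and the last cluster absorbs the remaining five.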

## The Discretize Method

``````
public int Discretize(double x)
{
  double[] distances = new double[k];
  for (int c = 0; c < k; ++c)
    distances[c] = Distance(x, means[c]); // means holds the cluster means directly
  return MinIndex(distances); // ID of the cluster whose mean is closest to x
}
``````
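Discretize relies on a MinIndex helper whose body is not listed in this section. A minimal sketch of such a helper, together with a standalone nearest-mean lookup, might look like the following. The `NearestMeanSketch` class, the explicit `means` parameter, and the mean values used in the usage note are illustrative assumptions, not output from the article's demo.

```csharp
using System;

public static class NearestMeanSketch
{
  // Returns the index of the smallest value; ties go to the lowest index.
  public static int MinIndex(double[] distances)
  {
    int indexOfMin = 0;
    for (int i = 1; i < distances.Length; ++i)
      if (distances[i] < distances[indexOfMin])
        indexOfMin = i;
    return indexOfMin;
  }

  // Maps x to the zero-based ID of the cluster whose mean is closest.
  public static int Discretize(double x, double[] means)
  {
    double[] distances = new double[means.Length];
    for (int c = 0; c < means.Length; ++c)
      distances[c] = Math.Abs(x - means[c]);
    return MinIndex(distances);
  }
}
```

With illustrative means of 61.0, 67.5, and 75.25, a value of 59.5 maps to category 0, 66.0 maps to category 1, and 80.5 maps to category 2. Values outside the range of the original data are still assigned to the nearest cluster.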

## Summary

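**Dr. James McCaffrey** works at Microsoft's Redmond, Wash., campus. He has worked on several Microsoft products, including Internet Explorer and MSN Search. He is the author of ".NET Test Automation Recipes" (Apress, 2006) and can be reached at jammc@microsoft.com.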