Discretization Methods

Some algorithms that are used to create data mining models in Microsoft SQL Server 2005 Analysis Services (SSAS) require specific content types in order to function correctly. For example, some algorithms, such as the Microsoft Naive Bayes algorithm, cannot use continuous columns as input or cannot predict continuous values. Additionally, some columns may contain so many distinct values that the algorithm cannot easily identify interesting patterns in the data from which to create a model.

In these cases, you can discretize the data in the columns so that you can use the algorithms to produce a mining model. Discretization is the process of putting the values of a continuous set of data into buckets so that there is a discrete number of possible states. The buckets themselves are treated as ordered, discrete values. You can discretize both numeric and string columns.

There are several methods that you can use to discretize data. Each method automatically computes the number of buckets to generate by using the following equation:

`Number of Buckets = sqrt(n)`

In this equation, n is the number of distinct values of data in the column. If you do not want Analysis Services to calculate the number of buckets, you can use the DiscretizationBuckets property to specify the number of buckets manually.
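The bucket-count rule can be illustrated with a short Python sketch. This is not Analysis Services code: the function name `default_bucket_count` and its `discretization_buckets` parameter are hypothetical stand-ins for the automatic calculation and the DiscretizationBuckets property, and rounding the square root to the nearest integer is an assumption, since the text gives only `sqrt(n)`.

```python
import math

def default_bucket_count(values, discretization_buckets=None):
    """Return the number of buckets to use for a column of values.

    discretization_buckets is a hypothetical stand-in for a manual
    override; None means "compute the count from the data".
    """
    if discretization_buckets is not None:
        return discretization_buckets
    distinct = len(set(values))  # n = number of distinct values
    # Assumption: round sqrt(n) to the nearest integer, with a minimum of 1.
    return max(1, round(math.sqrt(distinct)))

ages = list(range(18, 82))             # 64 distinct values
print(default_bucket_count(ages))      # 8, because sqrt(64) = 8
print(default_bucket_count(ages, 5))   # 5, the manual override wins
```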

The following table describes the methods that you can use to discretize data in Analysis Services.

| Discretization method | Description |
| --- | --- |
| AUTOMATIC | Analysis Services determines which discretization method to use. |
| CLUSTERS | The algorithm divides the data into groups by sampling the training data, initializing to a number of random points, and then running several iterations of the Microsoft Clustering algorithm using the Expectation Maximization (EM) clustering method. The CLUSTERS method is useful because it works on any distribution curve. However, it requires more processing time than the other discretization methods. This method can only be used with numeric columns. |
| EQUAL_AREAS | The algorithm divides the data into groups that contain an equal number of values. This method works best for normal distribution curves, but does not work well if the distribution includes a large number of values that fall in a narrow range of the continuous data. For example, if one-half of the order items that are specified in a case diagram have a Cost value of zero, one-half of the data falls under a single point in the curve. In such a distribution, the method splits the data at that point across multiple areas in an effort to make the areas equal, which produces an inaccurate representation of the data. You can use the EQUAL_AREAS method to discretize strings. |
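The weakness of EQUAL_AREAS on skewed data can be seen in a minimal Python sketch of equal-frequency bucketing. This is an illustration of the general technique only, not the Analysis Services implementation; `equal_area_boundaries` is a hypothetical helper that returns the upper boundary of every bucket except the last.

```python
def equal_area_boundaries(values, buckets):
    """Split sorted values into buckets holding roughly equal counts.

    Returns the largest value in each bucket except the last one.
    """
    if not values or buckets < 1:
        raise ValueError("need at least one value and one bucket")
    ordered = sorted(values)
    n = len(ordered)
    boundaries = []
    for i in range(1, buckets):
        # Index of the last value that falls into bucket i.
        idx = max(0, (i * n) // buckets - 1)
        boundaries.append(ordered[idx])
    return boundaries

uniform = list(range(1, 101))            # evenly spread costs 1..100
skewed = [0] * 50 + list(range(1, 51))   # half of the costs are zero

print(equal_area_boundaries(uniform, 4))  # [25, 50, 75]
print(equal_area_boundaries(skewed, 4))   # [0, 0, 25]
```

With the skewed data, the first two boundaries are both zero, so a single value is split across two buckets. This is the distortion described for distributions in which many values occur at one point.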

The CLUSTERS and THRESHOLDS methods use a random sample of 1000 records to discretize data. Use the EQUAL_AREAS method if you do not want the algorithm to sample data.
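To make the sampling and clustering steps of the CLUSTERS method more concrete, the following Python sketch draws a random sample of at most 1000 values, starts from randomly chosen points, and runs a fixed number of EM iterations for a one-dimensional Gaussian mixture. It is a simplified stand-in and not the Microsoft Clustering algorithm: the function names, the iteration count, the initial variance, and the choice to place bucket boundaries midway between adjacent cluster means are all assumptions made for illustration.

```python
import math
import random

def em_cluster_means(values, k, sample_size=1000, iterations=25, seed=0):
    """Estimate k cluster centers with a plain 1-D Gaussian-mixture EM."""
    rng = random.Random(seed)
    # Work on a random sample rather than on the full column.
    sample = rng.sample(values, min(sample_size, len(values)))
    means = rng.sample(sample, k)                  # random starting points
    spread = (max(sample) - min(sample)) or 1.0
    variances = [(spread / k) ** 2] * k            # assumed initial spread
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each cluster for each sampled value.
        resp = []
        for x in sample:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300            # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue                           # leave an empty cluster as is
            means[j] = sum(r[j] * x for r, x in zip(resp, sample)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, sample)) / nj
            variances[j] = max(var, 1e-6)          # keep the variance positive
            weights[j] = nj / len(sample)
    return sorted(means)

def boundaries_from_means(means):
    """Assumption: cut halfway between neighboring cluster means."""
    return [(a + b) / 2 for a, b in zip(means, means[1:])]

data_rng = random.Random(42)
costs = ([data_rng.gauss(10, 1) for _ in range(1500)]
         + [data_rng.gauss(50, 1) for _ in range(1500)])
centers = em_cluster_means(costs, k=2)
print(centers)
print(boundaries_from_means(centers))
```

Because the result depends on the sample and on the random starting points, repeated runs with different seeds can yield different buckets, and the repeated passes over the sample account for the extra processing time compared with a single sort.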