Feature Selection in Data Mining
When you build a data mining model in Microsoft SQL Server 2005 Analysis Services (SSAS), the dataset frequently contains more information than is needed to build the model, although it is difficult to tell what is necessary until after you have built the model. For example, a dataset may contain 500 columns that describe characteristics of customers, but perhaps only 50 of those columns are used to build a particular model. While the additional columns do not affect the output of the model, they do increase the time that is required to process the model and the space that is needed to store the model. To solve this problem, certain Microsoft algorithms implement feature selection. Feature selection automatically chooses the attributes in a dataset that are most likely to be used in the model. The following algorithms support feature selection:
- Naive Bayes
- Decision Trees
- Neural Network
Feature selection works on input attributes, on predictable attributes, or on the number of states in a column, depending on the algorithm. You can control when feature selection is turned on by using the algorithm parameters MAXIMUM_INPUT_ATTRIBUTES, MAXIMUM_OUTPUT_ATTRIBUTES, and MAXIMUM_STATES. If a model contains more input columns than the number that is specified in the MAXIMUM_INPUT_ATTRIBUTES parameter, the algorithm ignores any columns that it calculates to be uninteresting. Similarly, if a model contains more predictable columns than the number that is specified in the MAXIMUM_OUTPUT_ATTRIBUTES parameter, the algorithm ignores any columns that it calculates to be uninteresting. If a column contains more states than the number that is specified in the MAXIMUM_STATES parameter, the least popular states are grouped together and treated as missing. If any one of these parameters is set to 0, feature selection is turned off, which can increase processing time and affect performance.
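The thresholds described above can be illustrated with a minimal Python sketch. This is not how Analysis Services implements feature selection. The function names and the `scores` dictionary are hypothetical, and the interestingness score of each attribute is assumed to be supplied by the caller, because the calculation is not described here. The sketch shows only the cutoff behavior: keep at most the specified number of attributes or states, and treat 0 as "no limit."

```python
from collections import Counter


def select_attributes(scores, maximum):
    """Keep the `maximum` highest-scoring attributes.

    `scores` maps attribute name -> interestingness score (an assumed,
    caller-supplied value). A `maximum` of 0 turns selection off.
    """
    if maximum == 0 or len(scores) <= maximum:
        return list(scores)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:maximum]


def limit_states(values, maximum_states):
    """Keep the most frequent states of one column.

    Less popular states are replaced with None, which stands in for
    "missing" in this sketch. A `maximum_states` of 0 turns the limit off.
    """
    counts = Counter(values)
    if maximum_states == 0 or len(counts) <= maximum_states:
        return list(values)
    kept = {state for state, _ in counts.most_common(maximum_states)}
    return [value if value in kept else None for value in values]
```

For example, with the hypothetical scores `{"Age": 0.9, "Region": 0.1, "Income": 0.7}` and a limit of 2, the sketch keeps Age and Income and ignores Region.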
Only the input attributes and states that the algorithm selects are included in the model-building process and can be used for prediction. Predictable columns that are ignored by feature selection can still be used for prediction, but those predictions are based only on the global statistics that exist in the model.
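As a rough illustration of a prediction that relies only on global statistics, the following Python sketch returns the most frequent state of a predictable column together with its overall share of the non-missing cases. It assumes that "global statistics" means the overall distribution of the column's states, which is an interpretation rather than a documented detail, and the function name is hypothetical.

```python
from collections import Counter


def predict_from_global_statistics(training_values):
    """Return (most frequent state, its share of non-missing cases).

    No input attributes are consulted, so every case receives the same
    prediction. None stands in for a missing value and is skipped.
    """
    counts = Counter(v for v in training_values if v is not None)
    total = sum(counts.values())
    state, count = counts.most_common(1)[0]
    return state, count / total
```

Because no input attributes are involved, a prediction of this kind is identical for every case, regardless of the values in that case's other columns.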