Filter Based Feature Selection
Identifies the features in a dataset with the greatest predictive power
Category: Feature Selection Modules
This article describes how to use the Filter Based Feature Selection module in Azure Machine Learning Studio, to identify the columns in your input dataset that have the greatest predictive power.
In general, feature selection refers to the process of applying statistical tests to inputs, given a specified output, to determine which columns are more predictive of the output. The Filter Based Feature Selection module provides multiple feature selection algorithms to choose from, including correlation methods such as Pearsons's or Kendall's correlation, mutual information scores, and chi-squared values. Azure Machine Learning also supports feature value counts as an indicator of information value.
When you use the Filter Based Feature Selection module, you provide a dataset, identify the column that contains the label or dependent variable, and then specify a single method to use in measuring feature importance.
The module outputs a dataset that contains the best feature columns, as ranked by predictive power. It also outputs the names of the features and their scores from the selected metric.
What is filter-based feature selection and why use it?
This module for feature selection is called "filter-based" because you use the selected metric to identify irrelevant attributes, and filter out redundant columns from your model. You choose a single statistical measure that suits your data, and the module calculates a score for each feature column. The columns are returned ranked by their feature scores.
By choosing the right features, you can potentially improve the accuracy and efficiency of classification.
You typically use only the columns with the best scores to build your predictive model. Columns with poor feature selection scores can be left in the dataset and ignored when you build a model.
How to choose a feature selection metric
The Filter-Based Feature Selection provides a variety of metrics for assessing the information value in each column. This section provides a general description of each metric, and how it is applied. Additional requirements for using each metric are stated in the Technical Notes section and in the instructions for configuring each module.
Pearson’s correlation statistic, or Pearson’s correlation coefficient, is also known in statistical models as the
rvalue. For any two variables, it returns a value that indicates the strength of the correlation
Pearson's correlation coefficient is computed by taking the covariance of two variables and dividing by the product of their standard deviations. The coefficient is not affected by changes of scale in the two variables.
The mutual information score measures the contribution of a variable towards reducing uncertainty about the value of another variable: namely, the label. Many variations of the mutual information score have been devised to suit different distributions.
The mutual information score is particularly useful in feature selection because it maximizes the mutual information between the joint distribution and target variables in datasets with many dimensions.
Kendall's rank correlation is one of several statistics that measure the relationship between rankings of different ordinal variables or different rankings of the same variable. In other words, it measures the similarity of orderings when ranked by the quantities. Both this coefficient and Spearman’s correlation coefficient are designed for use with non-parametric and non-normally distributed data.
Spearman's coefficient is a nonparametric measure of statistical dependence between two variables, and is sometimes denoted by the Greek letter rho. The Spearman’s coefficient expresses the degree to which two variables are monotonically related. It is also called Spearman rank correlation, because it can be used with ordinal variables.
The two-way chi-squared test is a statistical method that measures how close expected values are to actual results. The method assumes that variables are random and drawn from an adequate sample of independent variables. The resulting chi-squared statistic indicates how far results are from the expected (random) result.
The Fisher score (also called the Fisher method, or Fisher combined probability score) is sometimes termed the information score, because it represents the amount of information that one variable provides about some unknown parameter on which it depends.
The score is computed by measuring the variance between the expected value of the information and the observed value. When variance is minimized, information is maximized. Since the expectation of the score is zero, the Fisher information is also the variance of the score.
Count-based feature selection is a simple yet relatively powerful way of finding information about predictors. The basic idea underlying count-based featurization is simple: by calculating counts of individual values within a column, you can get an idea of the distribution and weight of values, and from this, understand which columns contain the most important information.
Count-based feature selection is a non-supervised method of feature selection, meaning you don't need a label column. This method also reduces the dimensionality of the data without losing information.
For more information about how count-based features are created and why they are useful in machine learning, see Learning with Counts.
If you need a different option for custom feature selection method, use the Execute R Script module.
How to configure Filter-Based Feature Selection
This module provides two methods for determining feature scores:
You choose a standard statistical metric, and the module computes the correlation between a pair of columns, the label column and a feature column
With the count-based method, the module calculates a score based purely on the values in the column.
Generate feature scores using a traditional statistical metric
Add the Filter-Based Feature Selection module to your experiment. You can find it in the Feature Selection category in Studio.
Connect an input dataset that contains at least two columns that are potential features.
To ensure that a column should be analyzed and a feature score generated, use the Edit Metadata module to set the IsFeature attribute.
Ensure that the columns you are providing as input are potential features. For example, a column that contains a single value has no information value.
If you know there are columns that would make bad features, you can remove them from the column selection. You could also use the Edit Metadata module to flag them as Categorical.
For Feature scoring method, choose one of the following established statistical methods to use in calculating scores.
Method Requirements Pearson Correlation Label can be text or numeric. Features must be numeric. Mutual Information Labels and features can be text or numeric. Use this method for computing feature importance for two categorical columns. Kendall Correlation Label can be text or numeric but features must be numeric. Spearman Correlation Label can be text or numeric but features must be numeric. Chi Squared Labels and features can be text or numeric. Use this method for computing feature importance for two categorical columns. Fisher Score Label can be text or numeric but features must be numeric. Counts See: To use Count-Based Feature Selection
If you change the selected metric, all other selections will be reset, so be sure to set this option first!)
Select the Operate on feature columns only option to generate a score only for those columns that have been previously marked as features.
If you deselect this option, the module will create a score for any column that otherwise meets the criteria, up to the number of columns specified in Number of desired features.
For Target column, click Launch column selector to choose the label column either by name or by its index (indexes are one-based).
A label column is required for all methods that involve statistical correlation. The module returns a design-time error if you choose no label column or multiple label columns.
For Number of desired features, type the number of feature columns you want returned as a result.
The minimum number of features you can specify is 1, but we recommend that you increase this value.
If the specified number of desired features is greater than the number of columns in the dataset, then all features are returned, even those with zero scores.
- If you specify fewer result columns than there are feature columns, the features are ranked by descending score, and only the top features are returned.
Run the experiment, or select the Filter Based Feature Selection module and then click Run selected.
Results of feature selection
After processing is complete:
To see a complete list of the feature columns that were analyzed, and their scores, right-click the module, select Features, and click Visualize.
To view the dataset that is generated based on your feature selection criteria, right-click the module, select Dataset, and click Visualize.
If the dataset contains fewer columns than you expected, check the module settings, and the data types of the columns provided as input. For example, if you set Number of desired features to 1, the output dataset contains just two columns: the label column, and the most highly ranked feature column.
Use count-based feature selection
Add the Filter-Based Feature Selection module to your experiment. You can find it in the list of modules in Studio, in the Feature Selection group.
Connect an input dataset that contains at least two columns that are possible features.
Select Count Based from the list of statistical methods in the Feature scoring method dropdown list.
For Minimum number of non-zero elements, indicate the minimum number of feature columns to include in the output.
By default, the module outputs all columns that meet the requirements. The module cannot output any column that gets a score of zero.
Run the experiment, or select just the module, and click Run Selected.
Results of count-based feature selection
- To see the list of feature columns with their scores, right-click the module, select Features, and click Visualize .
- To see the dataset containing the analyzed columns, right-click the module, select Dataset, and click Visualize.
Unlike other methods, the Count Based feature selection method does not rank the variables by highest scores, but returns all variables with a non-zero score, in their original order.
String features always get a zero (0) score and are thus are not output.
You can see examples of how feature selection is used in the Azure AI Gallery:
Text Classification; In the third step of this sample, Filter-Based Feature Selection is used to identify the 15 best features. Feature hashing is used to convert the text documents to numeric vectors. Pearson’s correlation is then used on the vector features.
Machine learning feature selection and feature engineering: This article provides an introduction to feature selection and feature engineering in machine learning.
To see examples of feature scores, see Table of scores compared.
You can find this module under Data Transformation, in the Filters category.
If you use Pearson Correlation, Kendall Correlation, or Spearman Correlation on a numeric feature and a categorical label, the feature score is calculated as follows:
For each level in the categorical column, compute the conditional mean of numeric column.
Correlate the column of conditional means with the numeric column.
A feature selection score cannot be generated for any column that is designated as a label or as a score column.
If you attempt to use a scoring method with a column of a data type not supported by the method, either the module will raise an error, or a zero score will be assigned to the column.
If a column contains logical (true/false) values, they are processed as True = 1 and False = 0.
A column cannot be a feature if it has been designated as a Label or a Score.
How missing values are handled
You cannot specify as a target (label) column any column that has all missing values.
If a column contains missing values, they are ignored when computing the score for the column.
If a column designated as a feature column has all missing values, a zero score is assigned.
Table of scores compared
To give you an idea of how the scores compare when using different metrics, the following table presents some feature selection scores from multiple features in the automobile price dataset, given the dependent variable highway-mpg.
|Feature column||Pearson score||Count score||Kendall score||Mutual information|
Mutual information scores can be created for all column types, including strings.
The other scores included in this table, such as Pearson's correlation or count-based feature selection, require numeric values. String features get a score of 0 and hence are not included in the output. For exceptions, see the Technical Notes section.
The count-based method does not treat a label column any differently from feature columns.
|Dataset||Data Table||Input dataset|
|Feature scoring method||List||Scoring method||Choose the method to use for scoring|
|Operate on feature columns only||Any||Boolean||true||Indicate whether to use only feature columns in the scoring process|
|Target column||Any||ColumnSelection||None||Specify the target column|
|Number of desired features||>=1||Integer||1||Specify the number of features to output in results|
|Minimum number of non-zero elements||>=1||Integer||1||Specify the number of features to output (for CountBased method)|
|Filtered dataset||Data Table||Filtered dataset|
|Features||Data Table||Names of output columns and feature selection scores|
|Error 0001||Exception occurs if one or more specified columns of data set couldn't be found.|
|Error 0003||Exception occurs if one or more of inputs are null or empty.|
|Error 0004||Exception occurs if parameter is less than or equal to specific value.|
|Error 0017||Exception occurs if one or more specified columns have type unsupported by current module.|
For a list of errors specific to Studio modules, see Machine Learning Error codes
For a list of API exceptions, see Machine Learning REST API Error Codes.