Replace Discrete Values

Replaces discrete values from one column with numeric values based on another column

Category: Statistical Functions

Note

Applies to: Machine Learning Studio

This content pertains only to Studio. Similar drag and drop modules have been added to the visual interface in Machine Learning service. Learn more in this article comparing the two versions.

Module overview

This article describes how to use the Replace Discrete Values module in Azure Machine Learning Studio, to generate a probability score that can be used to represent a discrete value. This score can be useful for understanding the information value of the discrete values.

How it works:

You select a column that contains the discrete (or categorical) value, and then select another column to use for reference.

Depending on whether the second column is categorical or non-categorical, the module computes one of the following values:

  • The conditional probability for the second column given the values in the first column.
  • The mean and standard deviation for each group of values in the first column.

The module outputs both a dataset with the scores, and a function that you can save and apply to other datasets.

How to configure Replace Discrete Values

Tip

We recommend working with only one pair of columns at a time. The module does not raise an error if you select multiple columns to analyze. However, in practice, if you choose multiple columns, they are matched by an internal heuristic, not by order of selection.

Therefore, we recommend that you select a single pair of columns each time, one for Discrete columns and one for Replacement columns.

If you need to generate scores for multiple columns, use separate instances of Replace Discrete Values.

  1. Add the Replace Discrete Values module to your experiment. You can find this module in the Statistical Functions group in the experiment items list in Azure Machine Learning Studio.

  2. Connect a dataset containing at least one column of categorical data.

  3. Discrete columns: Click Launch column selector to choose a column that contain discrete (or categorical) values.

    Any discrete columns that you select must be categorical. If you get an error, use the Edit Metadata module to change the column type.

  4. Replacement columns: Click Launch column selector to choose the column that contains the values to use in computing a replacement score.

    If you select multiple columns for Discrete columns, you must choose an equal number of replacement columns.

  5. Run the experiment.

    Note

    You cannot choose which statistical function to apply. The module calculates an appropriate measure, based on the data type of the column selected for Replacement column.

Results

The module computes one of the following values for each pair of columns:

  • If the second column contains categorical values, the module computes the conditional probability of the second column, given the values in the first column.

    For example, assume you chose occupation from the Census dataset as the discrete column and choose gender as the replacement column. The output of the module would be the:

    P(gender | occupation)

  • If the second column contains non-categorical values that can be converted to numbers (such as numeric or Boolean values not marked as categorical), the module outputs the mean and standard deviation for each group of values in the first column.

    For example, assume you use occupation as the Discrete column and the other column is the numeric column hours-per-week. The module would output these new values:

    Mean(hours-per-week | occupation)

    Std-Dev(hours-per-week | occupation)

In addition to the probability scores, the module also outputs a transformed dataset. In this dataset, the column selected as the Replacement columns is replaced with a column containing the computed scores.

Tip

The columns in the source dataset are not actually changed or deleted by the operation; the score columns are new ones generated by the module and output instead of the source data.

To view the source values together with the probability scores, use the Add Columns module.

Examples

The usage of Replace Discrete Values can be illustrated by some simple examples.

Example 1 - Replace a categorical value with a probability score

The following table contains a categorical column X, and a column Y with True/False values that are treated as categorical values. When you use Replace Discrete Values, it calculates a conditional probability score for the probability of Y given X, as shown in the third column.

X Y P(Y|X)
Blue 0 P(Y=0|X=Blue) = 0.5
Blue 1 P(Y=1|X=Blue) = 0.5
Green 0 P(Y=0|X=Green) = 2/3
Green 0 P(Y=0|X=Green) = 2/3
Green 1 P(Y=1|X=Green) = 1/3
Red 0 P(Y=0|X=Red) = .75
Red 0 P(Y=0|X=Red) = .75
Red 1 P(Y=1|X=Red) = .25
Red 0 P(Y=0|X=Red) = .75

Example 2 - Calculate mean and standard deviation based on a noncategorical column

When the second column is numerical, Replace Discrete Values calculates the mean and standard deviation instead of a conditional probability score.

The following example is based on the Auto Prices sample dataset, simplified as follows:

  • A small subset of columns was selected.

  • Only the top 30 rows were extracted, by using the Head option of the Partition and Sample module.

  • The Replace Discrete Values module was used to compute the mean and standard deviation for vehicle curb weight. given the categorical column, num-of-doors.

The following table illustrates the results:

Body Num-of-doors Curb-weight Mean(curb-weight|num-of-doors) Std-Dev(curb-weight|num-of-doors)
std two 2548 2429.785714 507.45699
std four 2337 2625.6 493.409877
std two 2507 2429.785714 507.45699
turbo four 3086 2625.6 5 493.409877
std four 1989 2625.6 493.409877
turbo 2191
std four 2535 2625.6 493.409877

You can verify the mean for each group of values by using the AVERAGEIF function in Excel.

Example 3 - Handling missing values

This example demonstrates how missing values (nulls) propagate to the results when conditional probability scores are calculated.

  • If the discrete value column and the calculation lookup column contains any missing values, the missing values are propagated to the new column.

  • If the discrete value column contains only missing values, the module cannot process the column and an error message appears.

X Y P(Y|X)
1 True P(Y=true|X=1) = 1/2
1 False P(Y=false|X=1) = 1/2
2 True P(Y=true|X=2) = 1/3
2 False P(Y=false|X=2) = 1/3
2 Null P(Y=null|X=2) = null

Technical notes

  • You must ensure that any discrete columns you want to replace are categorical, or the module will return an error. To do this, use the Edit Metadata module.

  • If the second column contains Boolean values, the True-False values are processed as numeric with FALSE and TRUE equivalent to 0 and 1 respectively.

  • The formula for the standard deviation column calculates the population standard deviation. Therefore, N is used in the denominator instead of (N - 1).

  • If the second column contains noncategorical data (numeric or Boolean values), the module computes the mean and standard deviation of Y for the given value of X.

    That is, for each row in the dataset indexed by i:

    Mean(Y│X)i = Mean(Y│X = Xi)

    StdDev(Y│X)i = StdDev(Y│X = Xi)

  • If the second column contains categorical data or values that are neither numeric nor Boolean, the module computes the conditional probability of Y for the given value of X.

  • Any Boolean values in the second column are processed as numeric data with FALSE and TRUE equivalent to 0 and 1 respectively.

  • If there is a class in the discrete column, such that a row with a missing value is present in the second column, the sum of conditional probabilities within the class is less than one.

Expected inputs

Name Type Description
Dataset Data Table Input dataset

Module parameters

Name Range Type Default Description
Discrete columns Any ColumnSelection Selects the columns that contain discrete values
Replacement columns Any ColumnSelection Selects the columns that contain the data to use in place of the discrete values

Outputs

Name Type Description
Supplemented dataset Data Table Dataset with replaced data
Transform function ITransform interface Definition of the transform function, which can be applied to other datasets

Exceptions

Exception Description
Error 0001 Exception occurs if one or more specified columns of the data set couldn't be found.
Error 0003 Exception occurs if one or more of inputs are null or empty.
Error 0020 Exception occurs if the number of columns in some of the datasets passed to the module is too small.
Error 0021 Exception occurs if the number of rows in some of the datasets passed to the module is too small.
Error 0017 Exception occurs if one or more specified columns have a type that is unsupported by the current module.
Error 0026 Exception occurs when columns with the same name are not allowed.
Error 0022 Exception occurs if the number of selected columns in the input dataset does not equal the expected number.

For a list of errors specific to Studio modules, see Machine Learning Error codes.

For a list of API exceptions, see Machine Learning REST API Error Codes.

See also

Statistical Functions