# Prepare Data

Learn how to use ML.NET to prepare data for additional processing or building a model.

Data is often unclean and sparse. Additionally, ML.NET machine learning algorithms expect input or features to be in a single numerical vector. Therefore, one of the goals of data preparation is to get the data into the format expected by ML.NET algorithms.

## Filter data

Sometimes, not all data in a dataset is relevant for analysis. An approach to remove irrelevant data is filtering. The `DataOperationsCatalog` contains a set of filter operations that take in an `IDataView` containing all of the data and return an `IDataView` containing only the data points of interest. It's important to note that because filter operations are not an `IEstimator` or `ITransformer` like those in the `TransformsCatalog`, they cannot be included as part of an `EstimatorChain` or `TransformerChain` data preparation pipeline.

Using the following input data, which is loaded into an `IDataView`:

```
HomeData[] homeDataList = new HomeData[]
{
    new HomeData
    {
        NumberOfBedrooms = 1f,
        Price = 100000f
    },
    new HomeData
    {
        NumberOfBedrooms = 2f,
        Price = 300000f
    },
    new HomeData
    {
        NumberOfBedrooms = 6f,
        Price = 600000f
    }
};
```

To filter data based on the value of a column, use the `FilterRowsByColumn` method.

```
// Apply filter
IDataView filteredData = mlContext.Data.FilterRowsByColumn(data, "Price", lowerBound: 200000, upperBound: 1000000);
```

The sample above keeps the rows in the dataset with a price between 200000 and 1000000. Applying this filter returns only the last two rows in the data and excludes the first row because its price of 100000 is not within the specified range.
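The snippets in this section use `mlContext` and `data` without showing where they come from. The following is a minimal end-to-end sketch of one way to set them up, assuming the `Microsoft.ML` NuGet package is referenced. The shape of the `HomeData` class is an assumption (the article never shows it), and `FilterSketch` and `FilteredPrices` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Assumed shape of the HomeData class used in this article.
public class HomeData
{
    public float NumberOfBedrooms { get; set; }
    public float Price { get; set; }
}

public static class FilterSketch
{
    // Hypothetical helper: loads the sample rows, filters them by Price,
    // and returns the prices of the rows that remain.
    public static List<float> FilteredPrices()
    {
        var mlContext = new MLContext();

        HomeData[] homeDataList =
        {
            new HomeData { NumberOfBedrooms = 1f, Price = 100000f },
            new HomeData { NumberOfBedrooms = 2f, Price = 300000f },
            new HomeData { NumberOfBedrooms = 6f, Price = 600000f }
        };

        // Load the in-memory array into an IDataView.
        IDataView data = mlContext.Data.LoadFromEnumerable(homeDataList);

        // Keep only the rows whose Price falls within the bounds.
        IDataView filteredData = mlContext.Data.FilterRowsByColumn(
            data, "Price", lowerBound: 200000, upperBound: 1000000);

        // Read the filtered rows back as objects so they can be inspected.
        return mlContext.Data
            .CreateEnumerable<HomeData>(filteredData, reuseRowObject: false)
            .Select(row => row.Price)
            .ToList();
    }

    public static void Main()
    {
        foreach (float price in FilteredPrices())
        {
            Console.WriteLine(price);
        }
    }
}
```

Reading the result back with `CreateEnumerable` is convenient for small samples like this one; for large datasets it materializes every row, so it is best kept to debugging and exploration.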

## Replace missing values

Missing values are a common occurrence in datasets. One approach to dealing with missing values is to replace them with the default value for the given type, if any, or with another meaningful value such as the mean value in the data.

Using the following input data, which is loaded into an `IDataView`:

```
HomeData[] homeDataList = new HomeData[]
{
    new HomeData
    {
        NumberOfBedrooms = 1f,
        Price = 100000f
    },
    new HomeData
    {
        NumberOfBedrooms = 2f,
        Price = 300000f
    },
    new HomeData
    {
        NumberOfBedrooms = 6f,
        Price = float.NaN
    }
};
```

Notice that the last element in the list has a missing value for `Price`. To replace the missing values in the `Price` column, use the `ReplaceMissingValues` method.

**Important:** `ReplaceMissingValues` only works with numerical data.

```
// Define replacement estimator
var replacementEstimator = mlContext.Transforms.ReplaceMissingValues("Price", replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean);
// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer replacementTransformer = replacementEstimator.Fit(data);
// Transform data
IDataView transformedData = replacementTransformer.Transform(data);
```

ML.NET supports various replacement modes. The sample above uses the `Mean` replacement mode, which fills in the missing value with that column's average value. The replacement fills in the `Price` property for the last element in the data with 200,000 since it's the average of 100,000 and 300,000.
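The arithmetic behind the `Mean` mode can be checked with a small dependency-free sketch. It is plain C# rather than ML.NET, and `MeanReplacementSketch` and `ReplaceWithMean` are hypothetical names. It assumes that missing values are represented as `float.NaN` and that the mean is taken over the non-missing entries only, as in the example above.

```csharp
using System;
using System.Linq;

public static class MeanReplacementSketch
{
    // Replaces NaN entries with the mean of the non-missing entries.
    // Throws InvalidOperationException if every entry is NaN.
    public static float[] ReplaceWithMean(float[] values)
    {
        float mean = values.Where(v => !float.IsNaN(v)).Average();
        return values.Select(v => float.IsNaN(v) ? mean : v).ToArray();
    }

    public static void Main()
    {
        float[] prices = { 100000f, 300000f, float.NaN };
        float[] replaced = ReplaceWithMean(prices);

        // The last entry is now the mean of the first two.
        Console.WriteLine(string.Join(", ", replaced));
    }
}
```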

## Use normalizers

Normalization is a data pre-processing technique used to standardize features that are not on the same scale, which helps algorithms converge faster. For example, the ranges for values like age and income vary significantly, with age generally being in the range of 0-100 and income generally being in the range of zero to thousands. Visit the transforms page for a more detailed list and description of normalization transforms.

### Min-Max normalization

Using the following input data, which is loaded into an `IDataView`:

```
HomeData[] homeDataList = new HomeData[]
{
    new HomeData
    {
        NumberOfBedrooms = 2f,
        Price = 200000f
    },
    new HomeData
    {
        NumberOfBedrooms = 1f,
        Price = 100000f
    }
};
```

Normalize the data with min-max normalization using the `NormalizeMinMax` method.

```
// Define min-max estimator
var minMaxEstimator = mlContext.Transforms.NormalizeMinMax("Price");
// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer minMaxTransformer = minMaxEstimator.Fit(data);
// Transform data
IDataView transformedData = minMaxTransformer.Transform(data);
```

The original price values `[200000,100000]` are converted to `[ 1, 0.5 ]` using the `MinMax` normalization formula, which generates output values in the range of 0-1.
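The `[ 1, 0.5 ]` result may look surprising, because the textbook formula `(x - min) / (max - min)` would map these two prices to `[1, 0]`. The values shown are consistent with a zero-preserving variant that divides each value by the largest absolute value in the column. This is assumed here to be the effect of the default `fixZero: true` option of `NormalizeMinMax`. The sketch below is plain C# that compares both formulas on the sample data; `MinMaxSketch` and its methods are hypothetical names, not part of ML.NET.

```csharp
using System;
using System.Linq;

public static class MinMaxSketch
{
    // Textbook min-max scaling: (x - min) / (max - min).
    // Does not guard against a column where all values are equal.
    public static float[] TextbookMinMax(float[] values)
    {
        float min = values.Min();
        float max = values.Max();
        return values.Select(v => (v - min) / (max - min)).ToArray();
    }

    // Zero-preserving variant: x / max(|x|), so that 0 stays 0.
    // Assumed to match the default behavior described above.
    public static float[] ZeroPreservingMinMax(float[] values)
    {
        float maxAbs = values.Max(v => Math.Abs(v));
        return values.Select(v => v / maxAbs).ToArray();
    }

    public static void Main()
    {
        float[] prices = { 200000f, 100000f };

        Console.WriteLine(string.Join(", ", ZeroPreservingMinMax(prices)));
        Console.WriteLine(string.Join(", ", TextbookMinMax(prices)));
    }
}
```

For these two prices the zero-preserving variant yields 1 and 0.5, while the textbook formula yields 1 and 0.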

### Binning

Binning converts continuous values into a discrete representation of the input. For example, suppose one of your features is age. Instead of using the actual age value, binning creates ranges for that value. 0-18 could be one bin, another could be 19-35 and so on.

Using the following input data, which is loaded into an `IDataView`:

```
HomeData[] homeDataList = new HomeData[]
{
    new HomeData
    {
        NumberOfBedrooms = 1f,
        Price = 100000f
    },
    new HomeData
    {
        NumberOfBedrooms = 2f,
        Price = 300000f
    },
    new HomeData
    {
        NumberOfBedrooms = 6f,
        Price = 600000f
    }
};
```

Normalize the data into bins using the `NormalizeBinning` method. The `maximumBinCount` parameter enables you to specify the number of bins needed to classify your data. In this example, data will be put into two bins.

```
// Define binning estimator
var binningEstimator = mlContext.Transforms.NormalizeBinning("Price", maximumBinCount: 2);
// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
var binningTransformer = binningEstimator.Fit(data);
// Transform Data
IDataView transformedData = binningTransformer.Transform(data);
```

The result of binning creates bin bounds of `[0,200000,Infinity]`. Therefore the resulting bins are `[0,1,1]` because the first observation is between 0-200000 and the others are greater than 200000 but less than infinity.
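The bin assignment described above can be sketched in plain C#. The sketch takes the bounds from the text as given, rather than deriving them from the data as `NormalizeBinning` does, and it assumes that a value belongs to the first bin whose upper bound it is strictly below. `BinningSketch` and `BinIndex` are hypothetical names, not part of ML.NET.

```csharp
using System;
using System.Linq;

public static class BinningSketch
{
    // Returns the index of the first bin whose upper bound exceeds the value.
    // Values at or above every bound fall into the last bin.
    public static int BinIndex(float value, float[] upperBounds)
    {
        for (int i = 0; i < upperBounds.Length; i++)
        {
            if (value < upperBounds[i])
            {
                return i;
            }
        }
        return upperBounds.Length - 1;
    }

    public static void Main()
    {
        // Upper bounds of the two bins from the text: [0, 200000) and [200000, Infinity).
        float[] upperBounds = { 200000f, float.PositiveInfinity };
        float[] prices = { 100000f, 300000f, 600000f };

        int[] bins = prices.Select(p => BinIndex(p, upperBounds)).ToArray();
        Console.WriteLine(string.Join(", ", bins)); // prints "0, 1, 1"
    }
}
```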

## Work with categorical data

Non-numeric categorical data needs to be converted to a number before being used to build a machine learning model.

Using the following input data, which is loaded into an `IDataView`:

```
CarData[] cars = new CarData[]
{
    new CarData
    {
        Color = "Red",
        VehicleType = "SUV"
    },
    new CarData
    {
        Color = "Blue",
        VehicleType = "Sedan"
    },
    new CarData
    {
        Color = "Black",
        VehicleType = "SUV"
    }
};
```

The categorical `VehicleType` property can be converted into a number using the `OneHotEncoding` method.

```
// Define categorical transform estimator
var categoricalEstimator = mlContext.Transforms.Categorical.OneHotEncoding("VehicleType");
// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer categoricalTransformer = categoricalEstimator.Fit(data);
// Transform Data
IDataView transformedData = categoricalTransformer.Transform(data);
```

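The resulting transform converts the text value of `VehicleType` to a number by assigning each distinct value a category key. With the default `Indicator` output kind, each entry is then emitted as a one-hot vector built from its key. The keys assigned to the entries in the `VehicleType` column are the following when the transform is applied: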

```
[
1, // SUV
2, // Sedan
1 // SUV
]
```
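The relationship between these key values and one-hot vectors can be illustrated with a dependency-free sketch. It is plain C#, not the ML.NET implementation, and it assumes that keys are 1-based, that they are assigned in order of first occurrence, and that an indicator vector has a 1 at position `key - 1`. `OneHotSketch`, `ToKeys`, and `ToIndicator` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class OneHotSketch
{
    // Assigns each distinct value a 1-based key in order of first occurrence.
    public static int[] ToKeys(string[] values)
    {
        var keyByValue = new Dictionary<string, int>();
        var keys = new int[values.Length];

        for (int i = 0; i < values.Length; i++)
        {
            if (!keyByValue.TryGetValue(values[i], out int key))
            {
                key = keyByValue.Count + 1;
                keyByValue[values[i]] = key;
            }
            keys[i] = key;
        }
        return keys;
    }

    // Builds a one-hot vector with a 1 in the slot that belongs to the key.
    public static float[] ToIndicator(int key, int categoryCount)
    {
        var vector = new float[categoryCount];
        vector[key - 1] = 1f;
        return vector;
    }

    public static void Main()
    {
        string[] vehicleTypes = { "SUV", "Sedan", "SUV" };
        int[] keys = ToKeys(vehicleTypes);
        Console.WriteLine(string.Join(", ", keys)); // prints "1, 2, 1"

        foreach (int key in keys)
        {
            // Two distinct categories, so each vector has two slots.
            Console.WriteLine(string.Join(", ", ToIndicator(key, 2)));
        }
    }
}
```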

## Work with text data

Text data needs to be transformed into numbers before using it to build a machine learning model. Visit the transforms page for a more detailed list and description of text transforms.

Using data like the data below, which has been loaded into an `IDataView`:

```
ReviewData[] reviews = new ReviewData[]
{
    new ReviewData
    {
        Description = "This is a good product",
        Rating = 4.7f
    },
    new ReviewData
    {
        Description = "This is a bad product",
        Rating = 2.3f
    }
};
```

The minimum step to convert text to a numerical vector representation is to use the `FeaturizeText` method. By using the `FeaturizeText` transform, a series of transformations is applied to the input text column, resulting in a numerical vector representing the lp-normalized word and character n-grams.

```
// Define text transform estimator
var textEstimator = mlContext.Transforms.Text.FeaturizeText("Description");
// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer textTransformer = textEstimator.Fit(data);
// Transform data
IDataView transformedData = textTransformer.Transform(data);
```

The resulting transform would convert the text values in the `Description` column to a numerical vector that looks similar to the output below:

```
[ 0.2041241, 0.2041241, 0.2041241, 0.4082483, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0, 0, 0, 0, 0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136, 0 ]
```

Combine complex text processing steps into an `EstimatorChain` to remove noise and potentially reduce the amount of required processing resources as needed.

```
// Define text transform estimator
var textEstimator = mlContext.Transforms.Text.NormalizeText("Description")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Description"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Description"))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Description"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Description"))
    .Append(mlContext.Transforms.NormalizeLpNorm("Description"));
```

`textEstimator` contains a subset of operations performed by the `FeaturizeText` method. The benefit of a more complex pipeline is control and visibility over the transformations applied to the data.

Using the first entry as an example, the following is a detailed description of the results produced by the transformation steps defined by `textEstimator`:

**Original Text: This is a good product**

| Transform | Description | Result |
|---|---|---|
| 1. NormalizeText | Converts all letters to lowercase by default | this is a good product |
| 2. TokenizeIntoWords | Splits string into individual words | ["this","is","a","good","product"] |
| 3. RemoveDefaultStopWords | Removes stop words like *is* and *a* | ["good","product"] |
| 4. MapValueToKey | Maps the values to keys (categories) based on the input data | [1,2] |
| 5. ProduceNgrams | Transforms text into sequence of consecutive words | [1,1,1,0,0] |
| 6. NormalizeLpNorm | Scales inputs by their lp-norm | [ 0.577350529, 0.577350529, 0.577350529, 0, 0 ] |
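The last step of the table can be checked by hand: the vector `[1,1,1,0,0]` has an L2 norm of the square root of 3, so each non-zero entry becomes roughly 0.577. The sketch below reproduces that arithmetic in plain C#, assuming the L2 norm is the one applied (the default for `NormalizeLpNorm`). `LpNormSketch` and `L2Normalize` are hypothetical names, not part of ML.NET.

```csharp
using System;
using System.Linq;

public static class LpNormSketch
{
    // Divides each entry by the vector's L2 norm (square root of the sum of squares).
    // An all-zero vector is returned unchanged to avoid dividing by zero.
    public static double[] L2Normalize(double[] vector)
    {
        double norm = Math.Sqrt(vector.Sum(x => x * x));
        return vector.Select(x => norm == 0 ? 0 : x / norm).ToArray();
    }

    public static void Main()
    {
        // N-gram counts from step 5 of the table.
        double[] ngramCounts = { 1, 1, 1, 0, 0 };
        double[] normalized = L2Normalize(ngramCounts);

        Console.WriteLine(string.Join(", ", normalized));
    }
}
```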
