Data transformations

Data transformations are used to prepare data for model training. The transformations in this guide return classes that implement the IEstimator interface. Data transformations can be chained together. Each transformation both expects and produces data of specific types and formats, which are specified in the linked reference documentation.

Some data transformations require training data to calculate their parameters. For example, the NormalizeMeanVariance transformer calculates the mean and variance of the training data during the Fit() operation and then uses those parameters in the Transform() operation.
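The split between learning parameters and applying them can be sketched in a few lines. This Python sketch is not the ML.NET API: the class name `MeanVarianceNormalizer` and its `fit`/`transform` methods are hypothetical and only mirror the Fit()/Transform() pattern described above. It assumes the common convention of scaling to unit variance, that is, dividing by the standard deviation.

```python
import math


class MeanVarianceNormalizer:
    """Hypothetical stand-in for a transformation that needs training data."""

    def fit(self, values):
        # Learn the parameters (mean and variance) from the training data only.
        n = len(values)
        self.mean = sum(values) / n
        self.variance = sum((v - self.mean) ** 2 for v in values) / n
        return self

    def transform(self, values):
        # Reuse the learned parameters; nothing is recomputed from the new data.
        std = math.sqrt(self.variance) or 1.0  # guard against zero variance
        return [(v - self.mean) / std for v in values]


normalizer = MeanVarianceNormalizer().fit([2.0, 4.0, 6.0, 8.0])
scaled = normalizer.transform([5.0, 9.0])
```

Here the value 5.0 maps to 0.0 because it equals the mean learned during fitting, even though it never appeared in the training data.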

Other data transformations don't require training data. For example, the ConvertToGrayscale transformation can perform the Transform() operation without having seen any training data during the Fit() operation.

Column mapping and grouping

Transform Definition
Concatenate Concatenate one or more input columns into a new output column
CopyColumns Copy and rename one or more input columns
DropColumns Drop one or more input columns
SelectColumns Select one or more columns to keep from the input data

Normalization and scaling

Transform Definition
NormalizeMeanVariance Subtract the mean (of the training data) and divide by the standard deviation (of the training data), so that the output has unit variance
NormalizeLogMeanVariance Normalize based on the logarithm of the training data
NormalizeLpNorm Scale input vectors by their lp-norm, where p is 1, 2 or infinity. Defaults to the l2 (Euclidean distance) norm
NormalizeGlobalContrast Scale each value in a row by subtracting the mean of the row data, dividing by either the standard deviation or the l2-norm of the row data, and multiplying by a configurable scale factor (default 2)
NormalizeBinning Assign the input value to a bin index and divide by the number of bins to produce a float value between 0 and 1. The bin boundaries are calculated to evenly distribute the training data across bins
NormalizeSupervisedBinning Assign the input value to a bin based on its correlation with the label column
NormalizeMinMax Scale the input by the difference between the minimum and maximum values in the training data
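The binning normalization described for NormalizeBinning can be illustrated as follows. This is a minimal Python sketch of the idea, not the ML.NET implementation: the function names `fit_bin_edges` and `bin_normalize` are hypothetical, and the way the edges are chosen (simple quantiles of the sorted training values) is an assumption.

```python
from bisect import bisect_right


def fit_bin_edges(training_values, num_bins):
    # Pick edges so that each bin holds about the same number of training values.
    ordered = sorted(training_values)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]


def bin_normalize(value, edges):
    # The bin index is the number of edges at or below the value.
    index = bisect_right(edges, value)
    # Divide by the number of bins (one more than the number of edges).
    return index / (len(edges) + 1)


edges = fit_bin_edges([1, 2, 3, 4, 5, 6, 7, 8], num_bins=4)
normalized = [bin_normalize(v, edges) for v in [1, 4, 8]]
```

With eight training values and four bins, each bin receives two training values, and every output falls between 0 and 1.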

Conversions between data types

Transform Definition
ConvertType Convert the type of an input column to a new type
MapValue Map values to keys (categories) based on the supplied dictionary of mappings
MapValueToKey Map values to keys (categories) by creating the mapping from the input data
MapKeyToValue Convert keys back to their original values
MapKeyToVector Convert keys back to vectors of original values
MapKeyToBinaryVector Convert keys back to a binary vector of original values
Hash Hash the value in the input column
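The round trip between values and keys (as in MapValueToKey followed by MapKeyToValue) can be sketched like this. The Python functions below are hypothetical illustrations, not ML.NET calls, and reserving key 0 for values that were not seen while building the mapping is an assumption of this sketch.

```python
def fit_value_to_key(values):
    # Assign keys in order of first appearance; key 0 is kept for unseen values.
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def map_value_to_key(values, mapping):
    return [mapping.get(value, 0) for value in values]


def map_key_to_value(keys, mapping):
    # Invert the mapping; the reserved key 0 has no original value.
    inverse = {key: value for value, key in mapping.items()}
    return [inverse.get(key) for key in keys]


mapping = fit_value_to_key(["red", "green", "red", "blue"])
keys = map_value_to_key(["blue", "red", "purple"], mapping)
restored = map_key_to_value(keys, mapping)
```

The value "purple" was not present when the mapping was built, so it receives the reserved key and cannot be restored.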

Text transformations

Transform Definition
FeaturizeText Transform a text column into a float array of normalized ngram and char-gram counts
TokenizeIntoWords Split one or more text columns into individual words
TokenizeIntoCharactersAsKeys Split one or more text columns into individual characters
NormalizeText Change case, remove diacritical marks, punctuation marks and numbers
ProduceNgrams Transform a text column into a bag of counts of ngrams (sequences of consecutive words)
ProduceWordBags Transform a text column into a vector of bag-of-words ngram counts
ProduceHashedNgrams Transform a text column into a vector of hashed ngram counts
ProduceHashedWordBags Transform a text column into a bag of hashed ngram counts
RemoveDefaultStopWords Remove default stop words for the specified language from input columns
RemoveStopWords Remove the specified stop words from input columns
LatentDirichletAllocation Transform a document (represented as a vector of floats) into a vector of floats over a set of topics
ApplyWordEmbedding Convert vectors of text tokens into sentence vectors using a pre-trained model
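To make "a bag of counts of ngrams" concrete, the following Python sketch splits text into words and counts runs of consecutive words. The function names are borrowed from the table for readability, but the code is a hypothetical illustration rather than the ML.NET implementation; lowercasing and whitespace splitting are assumptions.

```python
from collections import Counter


def tokenize_into_words(text):
    # Assumed tokenization: lowercase, then split on whitespace.
    return text.lower().split()


def produce_ngrams(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = tokenize_into_words("the cat sat on the cat")
bigrams = produce_ngrams(tokens, 2)
```

The pair ("the", "cat") occurs twice in the sentence, so its count is 2, while every other pair is counted once.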

Image transformations

Transform Definition
ConvertToGrayscale Convert an image to grayscale
ConvertToImage Convert a vector of pixels to ImageDataViewType
ExtractPixels Convert pixels from input image into a vector of numbers
LoadImages Load images from a folder into memory
ResizeImages Resize images

Categorical data transformations

Transform Definition
OneHotEncoding Convert one or more text columns into one-hot encoded vectors
OneHotHashEncoding Convert one or more text columns into hash-based one-hot encoded vectors
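One-hot encoding can be sketched as follows. This Python code is only an illustration of the concept, with hypothetical function names. Representing a category that was not seen during fitting as an all-zero vector is an assumption of the sketch.

```python
def fit_categories(values):
    # Record each distinct category once, in order of first appearance.
    return list(dict.fromkeys(values))


def one_hot(value, categories):
    # A category not seen during fitting becomes an all-zero vector.
    return [1.0 if value == category else 0.0 for category in categories]


categories = fit_categories(["cat", "dog", "cat", "bird"])
encoded = [one_hot(v, categories) for v in ["dog", "bird", "fish"]]
```

Each encoded vector has one slot per category learned during fitting, with at most one slot set to 1.0.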

Missing values

Transform Definition
IndicateMissingValues Create a new boolean output column, the value of which is true when the value in the input column is missing
ReplaceMissingValues Create a new output column, the value of which is set to a default value if the value is missing from the input column, and the input value otherwise
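The two missing-value operations can be illustrated with a short Python sketch. The function names are hypothetical, and representing a missing numeric value as NaN is an assumption made for the example.

```python
import math


def indicate_missing(values):
    # True wherever the input value is missing (NaN in this sketch).
    return [math.isnan(v) for v in values]


def replace_missing(values, default=0.0):
    # Substitute the default for missing values; pass other values through.
    return [default if math.isnan(v) else v for v in values]


column = [1.5, float("nan"), 3.0]
flags = indicate_missing(column)
filled = replace_missing(column)
```

Both functions produce a new column and leave the input column unchanged, which matches the "create a new output column" wording in the table.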

Feature selection

Transform Definition
SelectFeaturesBasedOnCount Select features whose count of non-default values meets a minimum threshold
SelectFeaturesBasedOnMutualInformation Select the features on which the data in the label column is most dependent
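Count-based feature selection can be sketched as follows. This is a hypothetical Python illustration, not the ML.NET implementation. It assumes that the default value is zero and that a feature is kept when its count of non-default values is at least the threshold.

```python
def select_features_by_count(rows, min_count):
    # Count the non-default (non-zero) values in each column.
    num_columns = len(rows[0])
    counts = [sum(1 for row in rows if row[j] != 0) for j in range(num_columns)]
    # Keep the columns whose count reaches the threshold.
    kept = [j for j, count in enumerate(counts) if count >= min_count]
    return kept, [[row[j] for j in kept] for row in rows]


rows = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0], [4.0, 5.0, 6.0]]
kept, reduced = select_features_by_count(rows, min_count=2)
```

The middle column has only one non-zero value, so it is dropped, while the first and last columns are kept.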

Custom transformations

Transform Definition
CustomMapping Transform existing columns onto new ones with a user-defined mapping