Data featurization in automated machine learning

Learn about the data featurization settings in Azure Machine Learning, and how to customize those features for automated machine learning experiments.

Feature engineering and featurization

Training data consists of rows and columns. Each row is an observation or record, and the columns of each row are the features that describe each record. Typically, the features that best characterize the patterns in the data are selected to create predictive models.

Although many of the raw data fields can be used directly to train a model, it's often necessary to create additional (engineered) features that better differentiate patterns in the data. This process is called feature engineering: domain knowledge of the data is used to create features that, in turn, help machine learning algorithms learn better.

In Azure Machine Learning, data-scaling and normalization techniques are applied to make feature engineering easier. Collectively, these techniques and this feature engineering are called featurization in automated ML experiments.

Prerequisites

This article assumes that you already know how to configure an automated ML experiment.

Important

The Python commands in this article require the latest azureml-train-automl package version.

For information about configuration, see the following articles:

Configure featurization

In every automated machine learning experiment, automatic scaling and normalization techniques are applied to your data by default. These techniques are types of featurization that help certain algorithms that are sensitive to features on different scales. You can enable more featurization, such as missing-values imputation, encoding, and transforms.

Note

Steps for automated machine learning featurization (such as feature normalization, handling missing data, or converting text to numeric) become part of the underlying model. When you use the model for predictions, the same featurization steps that are applied during training are applied to your input data automatically.

For experiments that you configure with the Python SDK, you can enable or disable the featurization setting and further specify the featurization steps to be used for your experiment. If you're using the Azure Machine Learning studio, see the steps to enable featurization.

The following table shows the accepted settings for featurization in the AutoMLConfig class:

Featurization configuration Description
"featurization": 'auto' Specifies that, as part of preprocessing, data guardrails and featurization steps are to be done automatically. This setting is the default.
"featurization": 'off' Specifies that featurization steps are not to be done automatically.
"featurization": 'FeaturizationConfig' Specifies that customized featurization steps are to be used. Learn how to customize featurization.

Automatic featurization

The following table summarizes techniques that are automatically applied to your data. These techniques are applied for experiments that are configured by using the SDK or the studio. To disable this behavior, set "featurization": 'off' in your AutoMLConfig object.

Note

If you plan to export your AutoML-created models to an ONNX model, only the featurization options indicated with an asterisk ("*") are supported in the ONNX format. Learn more about converting models to ONNX.

Featurization steps Description
Drop high cardinality or no variance features* Drop these features from training and validation sets. Applies to features with all values missing, with the same value across all rows, or with high cardinality (for example, hashes, IDs, or GUIDs).
Impute missing values* For numeric features, impute with the average of values in the column. For categorical features, impute with the most frequent value.
Generate more features* For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second. For forecasting tasks, these additional DateTime features are created: ISO year, Half (half-year), Calendar month as string, Week, Day of week as string, Day of quarter, Day of year, AM/PM (0 if the hour is before noon (12 pm), 1 otherwise), AM/PM as string, Hour of day (12-hr basis). For Text features: Term frequency based on unigrams, bigrams, and trigrams. Learn more about how this is done with BERT.
Transform and encode* Transform numeric features that have few unique values into categorical features. One-hot encoding is used for low-cardinality categorical features. One-hot-hash encoding is used for high-cardinality categorical features.
Word embeddings A text featurizer converts vectors of text tokens into sentence vectors by using a pre-trained model. Each word's embedding vector in a document is aggregated with the rest to produce a document feature vector.
Cluster Distance Trains a k-means clustering model on all numeric columns. Produces k new features (one new numeric feature per cluster) that contain the distance of each sample to the centroid of each cluster.
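
As a rough illustration of two rows in the table above, the following standard-library sketch expands a timestamp into the calendar parts listed for DateTime features and one-hot encodes a low-cardinality categorical column. The function names expand_datetime and one_hot are hypothetical, and this is not AutoML's implementation; it only shows the kind of output these steps produce.

```python
from datetime import datetime


def expand_datetime(ts: datetime) -> dict:
    """Split one timestamp into the calendar parts listed in the table."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "day_of_week": ts.weekday(),          # Monday == 0 (Python convention)
        "day_of_year": ts.timetuple().tm_yday,
        "quarter": (ts.month - 1) // 3 + 1,
        "week_of_year": ts.isocalendar()[1],  # ISO 8601 week number
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
    }


def one_hot(values: list) -> list:
    """One-hot encode a categorical column: one output column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


features = expand_datetime(datetime(2021, 3, 15, 14, 30, 5))
encoded = one_hot(["sedan", "wagon", "sedan", "hatchback"])
print(features)
print(encoded)
```

Each raw timestamp becomes ten numeric columns, and each categorical value becomes a row with a single 1 in the column for its category.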

Data guardrails

Data guardrails help you identify potential issues with your data (for example, missing values or class imbalance). They also help you take corrective actions for improved results.

Data guardrails are applied:

  • For SDK experiments: When the parameters "featurization": 'auto' or validation=auto are specified in your AutoMLConfig object.
  • For studio experiments: When automatic featurization is enabled.

You can review the data guardrails for your experiment:

  • By setting show_output=True when you submit an experiment by using the SDK.

  • In the studio, on the Data guardrails tab of your automated ML run.

Data guardrail states

Data guardrails display one of three states:

State Description
Passed No data problems were detected and no action is required by you.
Done Changes were applied to your data. We encourage you to review the corrective actions that AutoML took, to ensure that the changes align with the expected results.
Alerted A data issue was detected but couldn't be remedied. We encourage you to revise and fix the issue.

Supported data guardrails

The following table describes the data guardrails that are currently supported and the associated statuses that you might see when you submit your experiment:

Guardrail Status Condition for trigger
Missing feature values imputation Passed No missing feature values were detected in your training data. Learn more about missing-value imputation.
Missing feature values imputation Done Missing feature values were detected in your training data and were imputed.
High cardinality feature handling Passed Your inputs were analyzed, and no high-cardinality features were detected.
High cardinality feature handling Done High-cardinality features were detected in your inputs and were handled.
Validation split handling Done The validation configuration was set to 'auto' and the training data contained fewer than 20,000 rows. Each iteration of the trained model was validated by using cross-validation. Learn more about validation data.
Validation split handling Done The validation configuration was set to 'auto', and the training data contained more than 20,000 rows. The input data has been split into a training dataset and a validation dataset for validation of the model.
Class balancing detection Passed Your inputs were analyzed, and all classes are balanced in your training data. A dataset is considered to be balanced if each class has good representation in the dataset, as measured by number and ratio of samples.
Class balancing detection Alerted Imbalanced classes were detected in your inputs. To fix model bias, fix the balancing problem. Learn more about imbalanced data.
Class balancing detection Done Imbalanced classes were detected in your inputs and the sweeping logic has determined to apply balancing.
Memory issues detection Passed The selected values (horizon, lag, rolling window) were analyzed, and no potential out-of-memory issues were detected. Learn more about time-series forecasting configurations.
Memory issues detection Done The selected values (horizon, lag, rolling window) were analyzed and will potentially cause your experiment to run out of memory. The lag or rolling-window configurations have been turned off.
Frequency detection Passed The time series was analyzed, and all data points are aligned with the detected frequency.
Frequency detection Done The time series was analyzed, and data points that don't align with the detected frequency were detected. These data points were removed from the dataset. Learn more about data preparation for time-series forecasting.
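
The validation split rule in the table can be summarized as a small decision function. This is a sketch under stated assumptions, not AutoML's code: choose_validation_strategy and minority_ratio are hypothetical names, and because the table does not say what happens at exactly 20,000 rows, the sketch arbitrarily groups that case with the larger datasets. The ratio helper only reports a number; the thresholds AutoML uses to flag imbalance are not given in this article.

```python
from collections import Counter

ROW_THRESHOLD = 20_000  # row count named in the guardrail table


def choose_validation_strategy(n_rows: int) -> str:
    """Return the validation approach the table describes for 'auto'."""
    if n_rows < ROW_THRESHOLD:
        return "cross-validation"
    # Assumption: exactly 20,000 rows is grouped with the larger datasets.
    return "train-validation split"


def minority_ratio(labels: list) -> float:
    """Ratio of the rarest class count to the most common class count."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


print(choose_validation_strategy(5_000))   # cross-validation
print(choose_validation_strategy(50_000))  # train-validation split
print(minority_ratio(["a"] * 90 + ["b"] * 10))
```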

Customize featurization

You can customize your featurization settings to ensure that the data and features that are used to train your ML model result in relevant predictions.

To customize featurization, specify "featurization": FeaturizationConfig in your AutoMLConfig object. If you're using the Azure Machine Learning studio for your experiment, see the how-to article. To customize featurization for forecasting task types, refer to the forecasting how-to.

Supported customizations include:

Customization Definition
Column purpose update Override the autodetected feature type for the specified column.
Transformer parameter update Update the parameters for the specified transformer. Currently supports Imputer (mean, most frequent, and median) and HashOneHotEncoder.
Drop columns Specifies columns to drop from being featurized.
Block transformers Specifies block transformers to be used in the featurization process.

Note

The drop columns functionality is deprecated as of SDK version 1.19. Drop columns from your dataset as part of data cleansing, prior to consuming it in your automated ML experiment.

Create the FeaturizationConfig object by using API calls:

from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
# Exclude a transformer from the featurization process.
featurization_config.blocked_transformers = ['LabelEncoder']
# Deprecated as of SDK version 1.19; prefer dropping columns during data cleansing.
featurization_config.drop_columns = ['aspiration', 'stroke']
# Override the autodetected purpose of two columns.
featurization_config.add_column_purpose('engine-size', 'Numeric')
featurization_config.add_column_purpose('body-style', 'CategoricalHash')
# The default imputation strategy is mean; override it for three columns.
featurization_config.add_transformer_params('Imputer', ['engine-size'], {"strategy": "median"})
featurization_config.add_transformer_params('Imputer', ['city-mpg'], {"strategy": "median"})
featurization_config.add_transformer_params('Imputer', ['bore'], {"strategy": "most_frequent"})
# Update the HashOneHotEncoder parameters (an empty column list is passed here).
featurization_config.add_transformer_params('HashOneHotEncoder', [], {"number_of_bits": 3})
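
To make the three Imputer strategies concrete, here is a minimal standard-library sketch of what mean, median, and most-frequent imputation do to a column with missing values. The impute helper and the sample column are hypothetical and only illustrate the strategies named above; AutoML applies its own transformers.

```python
from statistics import mean, median, mode


def impute(column: list, strategy: str = "mean") -> list:
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in column if v is not None]
    statistic = {"mean": mean, "median": median, "most_frequent": mode}[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in column]


engine_size = [100, None, 120, 120, 300]
print(impute(engine_size, "mean"))           # missing value filled with 160
print(impute(engine_size, "median"))         # missing value filled with 120
print(impute(engine_size, "most_frequent"))  # missing value filled with 120
```

The mean is pulled upward by the single large value (300), while the median and most-frequent strategies are not, which is one reason to override the default for skewed columns.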

Featurization transparency

Every AutoML model has featurization automatically applied. Featurization includes automated feature engineering (when "featurization": 'auto') and scaling and normalization, which then impacts the selected algorithm and its hyperparameter values. AutoML supports different methods to ensure you have visibility into what was applied to your model.

Consider this forecasting example:

  • There are four input features: A (Numeric), B (Numeric), C (Numeric), D (DateTime).
  • Numeric feature C is dropped because it is an ID column with all unique values.
  • Numeric features A and B have missing values and hence are imputed by the mean.
  • DateTime feature D is featurized into 11 different engineered features.

To get this information, use the fitted_model output from your automated ML experiment run.

automl_config = AutoMLConfig(…)
automl_run = experiment.submit(automl_config …)
best_run, fitted_model = automl_run.get_output()

Automated feature engineering

The get_engineered_feature_names() method returns a list of engineered feature names.

Note

Use 'timeseriestransformer' for task='forecasting'; otherwise, use 'datatransformer' for a 'regression' or 'classification' task.

fitted_model.named_steps['timeseriestransformer'].get_engineered_feature_names()

This list includes all engineered feature names.

['A', 'B', 'A_WASNULL', 'B_WASNULL', 'year', 'half', 'quarter', 'month', 'day', 'hour', 'am_pm', 'hour12', 'wday', 'qday', 'week']

The get_featurization_summary() method returns a featurization summary of all the input features.

fitted_model.named_steps['timeseriestransformer'].get_featurization_summary()

Output

[{'RawFeatureName': 'A',
  'TypeDetected': 'Numeric',
  'Dropped': 'No',
  'EngineeredFeatureCount': 2,
  'Transformations': ['MeanImputer', 'ImputationMarker']},
 {'RawFeatureName': 'B',
  'TypeDetected': 'Numeric',
  'Dropped': 'No',
  'EngineeredFeatureCount': 2,
  'Transformations': ['MeanImputer', 'ImputationMarker']},
 {'RawFeatureName': 'C',
  'TypeDetected': 'Numeric',
  'Dropped': 'Yes',
  'EngineeredFeatureCount': 0,
  'Transformations': []},
 {'RawFeatureName': 'D',
  'TypeDetected': 'DateTime',
  'Dropped': 'No',
  'EngineeredFeatureCount': 11,
  'Transformations': ['DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime']}]
Output Definition
RawFeatureName Input feature/column name from the dataset provided.
TypeDetected Detected datatype of the input feature.
Dropped Indicates if the input feature was dropped or used.
EngineeredFeatureCount Number of features generated through automated feature engineering transforms.
Transformations List of transformations applied to input features to generate engineered features.
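
Because the summary is a plain list of dictionaries, it can be inspected with ordinary Python. The following sketch uses a trimmed copy of the sample output above (only the keys it needs) to list the dropped columns and total the engineered features. The variable names are illustrative.

```python
# Trimmed copy of the sample featurization summary shown above.
summary = [
    {"RawFeatureName": "A", "Dropped": "No", "EngineeredFeatureCount": 2},
    {"RawFeatureName": "B", "Dropped": "No", "EngineeredFeatureCount": 2},
    {"RawFeatureName": "C", "Dropped": "Yes", "EngineeredFeatureCount": 0},
    {"RawFeatureName": "D", "Dropped": "No", "EngineeredFeatureCount": 11},
]

# Raw columns that featurization removed.
dropped = [row["RawFeatureName"] for row in summary if row["Dropped"] == "Yes"]

# Total engineered features; this matches the 15 names in the engineered feature list.
total_engineered = sum(row["EngineeredFeatureCount"] for row in summary)

print(dropped)           # ['C']
print(total_engineered)  # 15
```

With a real run, replace the literal with the list returned by get_featurization_summary().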

Scaling and normalization

To understand the scaling/normalization and the selected algorithm with its hyperparameter values, use fitted_model.steps.

The following sample output is from running fitted_model.steps for a chosen run:

[('RobustScaler', 
  RobustScaler(copy=True, 
  quantile_range=[10, 90], 
  with_centering=True, 
  with_scaling=True)), 

  ('LogisticRegression', 
  LogisticRegression(C=0.18420699693267145, class_weight='balanced', 
  dual=False, 
  fit_intercept=True, 
  intercept_scaling=1, 
  max_iter=100, 
  multi_class='multinomial', 
  n_jobs=1, penalty='l2', 
  random_state=None, 
  solver='newton-cg', 
  tol=0.0001, 
  verbose=0, 
  warm_start=False))]

To get more details, use this helper function:

from pprint import pprint

def print_model(model, prefix=""):
    """Print each pipeline step and its parameters, recursing into ensembles."""
    for step in model.steps:
        print(prefix + step[0])
        # Ensemble with weighted estimators: list the members, then print each one.
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': [e[0] for e in step[1].estimators], 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        # Stacked model: show the meta learner, then print each base learner.
        elif hasattr(step[1], '_base_learners') and hasattr(step[1], '_meta_learner'):
            print("\nMeta Learner")
            pprint(step[1]._meta_learner)
            print()
            for estimator in step[1]._base_learners:
                print_model(estimator[1], estimator[0] + ' - ')
        else:
            # Plain step (scaler or estimator): show its parameters.
            pprint(step[1].get_params())
            print()

This helper function prints the following output for a particular run that uses RobustScaler for scaling and LogisticRegression as the algorithm.

RobustScaler
{'copy': True,
'quantile_range': [10, 90],
'with_centering': True,
'with_scaling': True}

LogisticRegression
{'C': 0.18420699693267145,
'class_weight': 'balanced',
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'max_iter': 100,
'multi_class': 'multinomial',
'n_jobs': 1,
'penalty': 'l2',
'random_state': None,
'solver': 'newton-cg',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
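
To give a sense of what the RobustScaler step in this output does, the sketch below reimplements its arithmetic with the standard library: subtract the median, then divide by the spread between the 10th and 90th percentiles (the quantile_range shown above). The helper names are hypothetical, and the percentile uses linear interpolation, which is an assumption about the interpolation mode rather than something stated in this article.

```python
from statistics import median


def percentile(sorted_values: list, p: float) -> float:
    """Percentile with linear interpolation between neighboring ranks."""
    rank = p * (len(sorted_values) - 1) / 100
    low = int(rank)
    high = min(low + 1, len(sorted_values) - 1)
    frac = rank - low
    return sorted_values[low] + frac * (sorted_values[high] - sorted_values[low])


def robust_scale(values: list, quantile_range=(10, 90)) -> list:
    """Center on the median and scale by the chosen percentile spread."""
    ordered = sorted(values)
    center = median(ordered)
    spread = percentile(ordered, quantile_range[1]) - percentile(ordered, quantile_range[0])
    # Note: a constant column would give spread == 0; this sketch does not handle that.
    return [(v - center) / spread for v in values]


scaled = robust_scale(list(range(11)))  # values 0..10
print(scaled)
```

For the values 0 through 10, the median is 5 and the 10th-to-90th percentile spread is 8, so 5 maps to 0 and 9 maps to 0.5. Because the center and spread come from order statistics, a few extreme values change the scaling far less than they would with mean and standard deviation.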

Predict class probability

Models produced using automated ML all have wrapper objects that mirror functionality from their open-source origin class. Most classification model wrapper objects returned by automated ML implement the predict_proba() function, which accepts an array-like or sparse matrix data sample of your features (X values), and returns an n-dimensional array of each sample and its respective class probability.

Assuming you have retrieved the best run and fitted model using the same calls from above, you can call predict_proba() directly from the fitted model, supplying an X_test sample in the appropriate format depending on the model type.

best_run, fitted_model = automl_run.get_output()
class_prob = fitted_model.predict_proba(X_test)

If the underlying model does not support the predict_proba() function or the format is incorrect, a model class-specific exception will be thrown. See the RandomForestClassifier and XGBoost reference docs for examples of how this function is implemented for different model types.
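
The array returned by predict_proba() holds one row per sample and one column per class. As a sketch of how such output is typically consumed, the following plain-Python example picks the most probable class for each row. The probabilities and class_labels are made-up illustrative values, not output from a real run, and the column order is an assumption.

```python
# Hypothetical probabilities: one row per sample, one column per class.
class_prob = [
    [0.10, 0.70, 0.20],
    [0.80, 0.15, 0.05],
    [0.25, 0.25, 0.50],
]
class_labels = ["low", "medium", "high"]  # assumed column order

predictions = []
for row in class_prob:
    # Index of the largest probability in this row (argmax).
    best_index = max(range(len(row)), key=row.__getitem__)
    predictions.append(class_labels[best_index])

print(predictions)  # ['medium', 'low', 'high']
```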

BERT integration in automated ML

BERT is used in the featurization layer of AutoML. In this layer, if a column contains free text or other types of data like timestamps or simple numbers, then featurization is applied accordingly.

For BERT, the model is fine-tuned and trained by using the user-provided labels. From here, document embeddings are output as features alongside others, such as timestamp-based features (for example, day of week).

Steps to invoke BERT

To invoke BERT, set enable_dnn: True in your automl_settings and use a GPU compute (vm_size = "STANDARD_NC6" or a higher GPU). If a CPU compute is used, AutoML enables the BiLSTM DNN featurizer instead of BERT.

AutoML takes the following steps for BERT.

  1. Preprocessing and tokenization of all text columns. For example, the "StringCast" transformer can be found in the final model's featurization summary. An example of how to produce the model's featurization summary can be found in this notebook.

  2. Concatenate all text columns into a single text column, hence the StringConcatTransformer in the final model.

    Our implementation of BERT limits the total text length of a training sample to 128 tokens. That means all text columns, when concatenated, should ideally be at most 128 tokens in length. If multiple columns are present, prune each column so that this condition is satisfied. Otherwise, for concatenated columns longer than 128 tokens, BERT's tokenizer layer truncates the input to 128 tokens.

  3. As part of feature sweeping, AutoML compares BERT against the baseline (bag of words features) on a sample of the data. This comparison determines if BERT would give accuracy improvements. If BERT performs better than the baseline, AutoML then uses BERT for text featurization for the whole data. In that case, you will see the PretrainedTextDNNTransformer in the final model.
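
The concatenation and length limit described in step 2 can be pictured with a short sketch: join the text columns of one sample, then keep only the first 128 tokens. Whitespace splitting stands in for BERT's real subword tokenizer here, which is a simplifying assumption, and concat_and_truncate is a hypothetical helper rather than part of the SDK.

```python
MAX_TOKENS = 128  # limit stated above for a single training sample


def concat_and_truncate(text_columns: list, max_tokens: int = MAX_TOKENS) -> list:
    """Join text columns, tokenize naively, and keep the first max_tokens."""
    combined = " ".join(text_columns)
    tokens = combined.split()  # assumption: whitespace tokens, not subword tokens
    return tokens[:max_tokens]


title = "short product title"
review = " ".join(["word"] * 200)  # a long second column
tokens = concat_and_truncate([title, review])
print(len(tokens))  # 128
```

In this sketch, truncation keeps the front of the concatenated text, so a long early column can crowd out later ones. That is one way to see why pruning each column before training is recommended.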

BERT generally runs longer than other featurizers. For better performance, we recommend using "STANDARD_NC24r" or "STANDARD_NC24rs_V3" for their RDMA capabilities.

AutoML distributes BERT training across multiple nodes if they are available (up to a maximum of eight nodes). To enable this, set the max_concurrent_iterations parameter in your AutoMLConfig object to a value higher than 1.

Supported languages for BERT in AutoML

AutoML currently supports around 100 languages and, depending on the dataset's language, chooses the appropriate BERT model. For German data, we use the German BERT model. For English, we use the English BERT model. For all other languages, we use the multilingual BERT model.

In the following code, the German BERT model is triggered because the dataset language is set to deu, the three-letter language code for German according to ISO classification:

from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig(dataset_language='deu')

automl_settings = {
    "experiment_timeout_minutes": 120,
    "primary_metric": 'accuracy',
    # All other settings you want to use
    "featurization": featurization_config,
    "enable_dnn": True,  # This enables BERT DNN featurizer
    "enable_voting_ensemble": False,
    "enable_stack_ensemble": False
}
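
The language rule described in this section amounts to a three-way choice. The sketch below expresses it as a lookup. The function select_bert_variant and the returned labels are hypothetical descriptions, not identifiers from the SDK, and 'eng' is assumed to be the three-letter code used for English.

```python
def select_bert_variant(dataset_language: str) -> str:
    """Map a three-letter language code to the BERT family described above."""
    variants = {"deu": "German BERT", "eng": "English BERT"}
    # Every other language falls back to the multilingual model.
    return variants.get(dataset_language.lower(), "multilingual BERT")


for code in ("deu", "eng", "fra"):
    print(code, "->", select_bert_variant(code))
```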

Next steps