Configure automated machine learning experiments

Automated machine learning picks an algorithm and hyperparameters for you and generates a model ready for deployment. There are several options that you can use to configure automated machine learning experiments. In this guide, learn how to define various configuration settings.

To view examples of automated machine learning experiments, see Tutorial: Train a classification model with automated machine learning or Train models with automated machine learning in the cloud.

Configuration options available in automated machine learning:

  • Select your experiment type: Classification, Regression or Time Series Forecasting
  • Data source, formats, and fetch data
  • Choose your compute target: local or remote
  • Automated machine learning experiment settings
  • Run an automated machine learning experiment
  • Explore model metrics
  • Register and deploy model

Select your experiment type

Before you begin your experiment, you should determine the kind of machine learning problem you are solving. Automated machine learning supports task types of classification, regression and forecasting.

Automated machine learning supports the following algorithms during the automation and tuning process. As a user, there is no need for you to specify the algorithm. While DNN algorithms are available during training, automated ML does not build DNN models.

| Classification | Regression | Time Series Forecasting |
| --- | --- | --- |
| Logistic Regression | Elastic Net | Elastic Net |
| Light GBM | Light GBM | Light GBM |
| Gradient Boosting | Gradient Boosting | Gradient Boosting |
| Decision Tree | Decision Tree | Decision Tree |
| K Nearest Neighbors | K Nearest Neighbors | K Nearest Neighbors |
| Linear SVC | LARS Lasso | LARS Lasso |
| C-Support Vector Classification (SVC) | Stochastic Gradient Descent (SGD) | Stochastic Gradient Descent (SGD) |
| Random Forest | Random Forest | Random Forest |
| Extremely Randomized Trees | Extremely Randomized Trees | Extremely Randomized Trees |
| Xgboost | Xgboost | Xgboost |
| DNN Classifier | DNN Regressor | DNN Regressor |
| DNN Linear Classifier | Linear Regressor | Linear Regressor |
| Naive Bayes | | |
| Stochastic Gradient Descent (SGD) | | |

Data source and format

Automated machine learning supports data that resides on your local desktop or in the cloud such as Azure Blob Storage. The data can be read into scikit-learn supported data formats. You can read the data into:

  • Numpy arrays X (features) and y (target variable, also known as the label)
  • Pandas dataframe

Examples:

  • Numpy arrays

    from sklearn import datasets

    # Load the digits dataset as Numpy arrays: X (features) and y (labels)
    digits = datasets.load_digits()
    X_digits = digits.data
    y_digits = digits.target
    
  • Pandas dataframe

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder

    df = pd.read_csv("https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv", delimiter="\t", quotechar='"')
    # Encode the string labels as integers, then drop the label column from the features
    le = LabelEncoder()
    y = le.fit_transform(df["Label"].values)
    df = df.drop(["Label"], axis=1)
    df_train, _, y_train, _ = train_test_split(df, y, test_size=0.1, random_state=42)
    

Fetch data for running experiment on remote compute

If you are using a remote compute target to run your experiment, the data fetch must be wrapped in a separate Python script that implements a get_data() function. This script runs on the remote compute where the automated machine learning experiment is executed, and it eliminates the need to fetch the data over the wire for each iteration. Without get_data, your experiment fails when you run on remote compute.

Here is an example of get_data:

%%writefile $project_folder/get_data.py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
def get_data(): # Burning man 2016 data
    df = pd.read_csv("https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv", delimiter="\t", quotechar='"')
    # get integer labels
    le = LabelEncoder()
    le.fit(df["Label"].values)
    y = le.transform(df["Label"].values)
    df = df.drop(["Label"], axis=1)
    df_train, _, y_train, _ = train_test_split(df, y, test_size=0.1, random_state=42)
    return { "X" : df, "y" : y }

In your AutoMLConfig object, specify the data_script parameter and provide the path to the get_data script file, similar to the following:

automl_config = AutoMLConfig(****, data_script=project_folder + "/get_data.py", **** )

The get_data script can return:

| Key | Type | Mutually exclusive with | Description |
| --- | --- | --- | --- |
| X | Pandas Dataframe or Numpy Array | data_train, label, columns | All features to train with |
| y | Pandas Dataframe or Numpy Array | label | Label data to train with. For classification, should be an array of integers. |
| X_valid | Pandas Dataframe or Numpy Array | data_train, label | Optional. All features to validate with. If not specified, X is split between train and validate. |
| y_valid | Pandas Dataframe or Numpy Array | data_train, label | Optional. The label data to validate with. If not specified, y is split between train and validate. |
| sample_weight | Pandas Dataframe or Numpy Array | data_train, label, columns | Optional. A weight value for each sample. Use when you would like to assign different weights for your data points. |
| sample_weight_valid | Pandas Dataframe or Numpy Array | data_train, label, columns | Optional. A weight value for each validation sample. If not specified, sample_weight is split between train and validate. |
| data_train | Pandas Dataframe | X, y, X_valid, y_valid | All data (features + label) to train with |
| label | string | X, y, X_valid, y_valid | Which column in data_train represents the label |
| columns | Array of strings | | Optional. Whitelist of columns to use for features |
| cv_splits_indices | Array of integers | | Optional. List of indexes to split the data for cross validation |
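
If your features and label live in a single table, you can return data_train and label instead of X and y. Below is a minimal sketch of such a get_data script; it reuses the dataset URL and "Label" column from the earlier example and follows the keys in the table above. This variant is illustrative, not the only option.

%%writefile $project_folder/get_data.py
import pandas as pd

def get_data():
    # Read the full dataset; the "Label" column holds the target.
    df = pd.read_csv("https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv", delimiter="\t", quotechar='"')
    # data_train carries features and label together; label names the target column.
    return {"data_train": df, "label": "Label"}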

Load and prepare data using DataPrep SDK

Automated machine learning experiments support data loading and transforms by using the data prep SDK. Using the SDK provides the ability to:

  • Load from many file types with parsing parameter inference (encoding, separator, headers)
  • Type-conversion using inference during file loading
  • Connection support for MS SQL Server and Azure Data Lake Storage
  • Add column using an expression
  • Impute missing values
  • Derive column by example
  • Filtering
  • Custom Python transforms

To learn more about the data prep SDK, refer to the How to prepare data for modeling article. Below is an example of loading data by using the data prep SDK.

import azureml.dataprep as dprep

# The data referenced here was pulled from `sklearn.datasets.load_digits()`.
simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'

# `auto_read_file` intelligently figures out the delimiters and datatypes of a file.
X = dprep.auto_read_file(simple_example_data_root + 'X.csv').skip(1)  # Remove the header row.

# Here we read a comma-delimited file and convert all columns to integers.
y = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex=True))
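
The resulting dataflows can then be handed to the experiment configuration. Below is a short sketch, assuming (as in the companion automated machine learning data prep notebooks) that the X and y dataflows created above can be passed directly as the X and y parameters:

from azureml.train.automl import AutoMLConfig

# X and y are the data prep dataflows created above; automated machine learning
# materializes them when the experiment runs.
automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',
                             iterations=10,
                             X=X,
                             y=y,
                             n_cross_validations=3)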

Train and validation data

You can specify separate training and validation sets either through get_data() or directly in the AutoMLConfig method.

Cross validation split options

K-Folds Cross Validation

Use the n_cross_validations setting to specify the number of cross validations. The training data set is randomly split into n_cross_validations folds of equal size. During each cross validation round, one of the folds is used for validation of the model trained on the remaining folds. This process repeats for n_cross_validations rounds until each fold is used once as the validation set. The average scores across all n_cross_validations rounds are reported, and the corresponding model is retrained on the whole training data set.

Monte Carlo Cross Validation (a.k.a. Repeated Random Sub-Sampling)

Use validation_size to specify the percentage of the training dataset that should be used for validation, and use n_cross_validations to specify the number of cross validations. During each cross validation round, a subset of size validation_size will be randomly selected for validation of the model trained on the remaining data. Finally, the average scores across all n_cross_validations rounds will be reported, and the corresponding model will be retrained on the whole training data set. Monte Carlo is not supported for time series forecasting.
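
Below is a minimal sketch of a Monte Carlo configuration, assuming X and y are already loaded; here 20% of the training data is held out in each of 5 rounds (the other values are illustrative):

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='regression',
                             primary_metric='r2_score',
                             iterations=25,
                             X=X,
                             y=y,
                             validation_size=0.2,    # hold out 20% of the data in each round
                             n_cross_validations=5)  # 5 random sub-sampling rounds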

Custom validation dataset

Use a custom validation dataset if a random split is not acceptable, which is usually the case with time series data or imbalanced data. You can specify your own validation dataset; the model is evaluated against that validation dataset instead of a random split.
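
For example, below is a minimal sketch that passes an explicit validation set; X_train, y_train, X_valid, and y_valid are assumed to be splits you have already prepared:

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',
                             iterations=25,
                             X=X_train,
                             y=y_train,
                             X_valid=X_valid,  # the model is evaluated against this set
                             y_valid=y_valid)  # instead of a random split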

Compute to run experiment

Next determine where the model will be trained. An automated machine learning training experiment can run on the following compute options:

  • Your local machine, such as a local desktop or laptop – generally used when you have a small dataset and you are still in the exploration stage.
  • A remote machine in the cloud – Azure Machine Learning Managed Compute is a managed service that enables you to train machine learning models on clusters of Azure virtual machines.

See the GitHub site for example notebooks with local and remote compute targets.
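
As a rough sketch of the remote option, the following provisions (or reuses) a Managed Compute cluster and points the experiment at it. The workspace object ws, the cluster name, and the VM size are placeholders, and the data is fetched through the get_data.py script described earlier:

from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.train.automl import AutoMLConfig

# Provision a managed compute cluster; name and VM size are placeholders.
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
compute_target = ComputeTarget.create(ws, 'cpucluster', compute_config)
compute_target.wait_for_completion(show_output=True)

# Remote runs fetch data with get_data.py instead of sending it over the wire.
automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',
                             iterations=10,
                             compute_target=compute_target,
                             data_script=project_folder + "/get_data.py",
                             n_cross_validations=3,
                             path=project_folder)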

Configure your experiment settings

There are several options that you can use to configure your automated machine learning experiment. These parameters are set by instantiating an AutoMLConfig object. See the AutoMLConfig class for a full list of parameters.

Some examples include:

  1. A classification experiment using AUC weighted as the primary metric, with a maximum time of 12,000 seconds per iteration, the experiment set to end after 50 iterations, and 2 cross-validation folds.

    automl_classifier = AutoMLConfig(
        task='classification',
        primary_metric='AUC_weighted',
        max_time_sec=12000,
        iterations=50,
        blacklist_models=['XGBoostClassifier'],
        X=X,
        y=y,
        n_cross_validations=2)
    
  2. Below is an example of a regression experiment set to end after 100 iterations, with each iteration lasting up to 600 seconds and 5 cross-validation folds.

    automl_regressor = AutoMLConfig(
        task='regression',
        max_time_sec=600,
        iterations=100,
        whitelist_models=['kNN regressor'],
        primary_metric='r2_score',
        X=X,
        y=y,
        n_cross_validations=5)
    

The three different task parameter values (classification, regression, and forecasting) determine the list of algorithms to apply. Use the whitelist_models or blacklist_models parameters to further modify the iterations by including or excluding specific algorithms. The list of supported models can be found in the SupportedAlgorithms class.

Primary Metric

The primary metric, as shown in the examples above, determines the metric to be optimized during model training. The primary metrics you can select are determined by the task type you choose. Below is a list of available metrics.

| Classification | Regression | Time Series Forecasting |
| --- | --- | --- |
| accuracy | spearman_correlation | spearman_correlation |
| AUC_weighted | normalized_root_mean_squared_error | normalized_root_mean_squared_error |
| average_precision_score_weighted | r2_score | r2_score |
| norm_macro_recall | normalized_mean_absolute_error | normalized_mean_absolute_error |
| precision_score_weighted | | |

Data pre-processing and featurization

If you use preprocess=True, the following data preprocessing steps are performed automatically for you:

  1. Drop high cardinality or no variance features
    • Drop features with no useful information from training and validation sets. These include features with all values missing, same value across all rows or with extremely high cardinality (e.g., hashes, IDs or GUIDs).
  2. Missing value imputation
    • For numerical features, impute missing values with average of values in the column.
    • For categorical features, impute missing values with most frequent value.
  3. Generate additional features
    • For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.
    • For Text features: Term frequency based on word unigrams, bi-grams, and tri-grams; count vectorizer.
  4. Transformations and encodings
    • Numeric features with very few unique values are transformed into categorical features.
    • Depending on cardinality of categorical features, perform label encoding or (hashing) one-hot encoding.
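
Enabling these steps is a single flag on the configuration. Below is a minimal sketch, assuming X and y are already loaded:

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',
                             iterations=25,
                             preprocess=True,  # run the preprocessing and featurization steps listed above
                             X=X,
                             y=y,
                             n_cross_validations=3)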

Ensemble Models

Ensemble learning improves machine learning results and predictive performance by combining multiple models, as opposed to using single models. When you use automated machine learning, you can train ensemble models by using the Caruana ensemble selection algorithm with sorted ensemble initialization. The ensemble iteration appears as the last iteration of your run.

Time Series Forecasting

For time series forecasting task type you have additional parameters to define.

  1. time_column_name - This is a required parameter that defines the name of the column in your training data that contains the date/time series.
  2. max_horizon - This defines the length of time you want to predict out, based on the periodicity of the training data. For example, if you have training data with daily time grains, you define how far out in days you want the model to predict.
  3. grain_column_names - This defines the names of the columns that contain individual time series data in your training data. For example, if you are forecasting sales of a particular brand by store, you would define store and brand columns as your grain columns.

See an example of these settings being used below; a notebook example is available here.

import logging
from azureml.train.automl import AutoMLConfig

# Setting Store and Brand as grains for training.
grain_column_names = ['Store', 'Brand']
nseries = data.groupby(grain_column_names).ngroups

# View the number of time series data with defined grains
print('Data contains {0} individual time-series.'.format(nseries))
time_series_settings = {
    'time_column_name': time_column_name,
    'grain_column_names': grain_column_names,
    'drop_column_names': ['logQuantity'],
    'max_horizon': n_test_periods
}

automl_config = AutoMLConfig(task='forecasting',
                             debug_log='automl_oj_sales_errors.log',
                             primary_metric='normalized_root_mean_squared_error',
                             iterations=10,
                             X=X_train,
                             y=y_train,
                             n_cross_validations=5,
                             path=project_folder,
                             verbosity=logging.INFO,
                             **time_series_settings)

Run experiment

Submit the experiment to run and generate a model: pass the AutoMLConfig to the submit method of the experiment.

run = experiment.submit(automl_config, show_output=True)

Note

Dependencies are first installed on a new machine. It may take up to 10 minutes before output is shown. Setting show_output to True results in output being shown on the console.
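
Once the run completes, you can retrieve the best iteration and its fitted model. Below is a short sketch, assuming ws is your workspace object and the experiment name is a placeholder:

from azureml.core.experiment import Experiment

experiment = Experiment(ws, "automl-classification")  # placeholder experiment name
run = experiment.submit(automl_config, show_output=True)

# get_output returns the best-scoring child run and its fitted model.
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)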

Exit Criteria

There are a few options you can define to end your experiment (a combined sketch follows this list).

  1. No criteria - If you do not define any exit parameters, the experiment continues until no further progress is made on your primary metric.
  2. Number of iterations - You define the number of iterations for the experiment to run. You can optionally add iteration_timeout_minutes to define a time limit in minutes per iteration.
  3. Exit after a length of time - Using experiment_timeout_minutes in your settings, you can define how long, in minutes, an experiment should continue to run.
  4. Exit after a score has been reached - Using experiment_exit_score, you can choose to complete the experiment after a score based on your primary metric has been reached.
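
Below is a sketch that combines several exit criteria; the values are illustrative only, and X and y are assumed to be loaded already:

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',
                             iterations=50,                  # stop after at most 50 iterations
                             iteration_timeout_minutes=10,   # cap each iteration at 10 minutes
                             experiment_timeout_minutes=60,  # stop the whole experiment after an hour
                             experiment_exit_score=0.95,     # or stop once AUC_weighted reaches 0.95
                             X=X,
                             y=y,
                             n_cross_validations=3)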

Explore model metrics

You can view your results in a widget or inline if you are in a notebook. See Track and evaluate models for more details.
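
For example, below is a short sketch that shows the run in the Jupyter widget and also pulls per-iteration metrics programmatically; run is the object returned by experiment.submit:

from azureml.widgets import RunDetails

# Show the run summary and per-iteration metrics inline in the notebook.
RunDetails(run).show()

# Or retrieve the metrics of each child run (iteration) programmatically.
for child_run in run.get_children():
    print(child_run.id, child_run.get_metrics())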

Classification metrics

The following metrics are saved in each iteration for a classification task.

| Metric | Description | Calculation | Extra Parameters |
| --- | --- | --- | --- |
| AUC_macro | AUC is the Area under the Receiver Operating Characteristic Curve. Macro is the arithmetic mean of the AUC for each class. | Calculation | average="macro" |
| AUC_micro | AUC is the Area under the Receiver Operating Characteristic Curve. Micro is computed globally by combining the true positives and false positives from each class. | Calculation | average="micro" |
| AUC_weighted | AUC is the Area under the Receiver Operating Characteristic Curve. Weighted is the arithmetic mean of the score for each class, weighted by the number of true instances in each class. | Calculation | average="weighted" |
| accuracy | Accuracy is the percent of predicted labels that exactly match the true labels. | Calculation | None |
| average_precision_score_macro | Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Macro is the arithmetic mean of the average precision score of each class. | Calculation | average="macro" |
| average_precision_score_micro | Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Micro is computed globally by combining the true positives and false positives at each cutoff. | Calculation | average="micro" |
| average_precision_score_weighted | Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Weighted is the arithmetic mean of the average precision score for each class, weighted by the number of true instances in each class. | Calculation | average="weighted" |
| balanced_accuracy | Balanced accuracy is the arithmetic mean of recall for each class. | Calculation | average="macro" |
| f1_score_macro | F1 score is the harmonic mean of precision and recall. Macro is the arithmetic mean of the F1 score for each class. | Calculation | average="macro" |
| f1_score_micro | F1 score is the harmonic mean of precision and recall. Micro is computed globally by counting the total true positives, false negatives, and false positives. | Calculation | average="micro" |
| f1_score_weighted | F1 score is the harmonic mean of precision and recall. Weighted is the mean, weighted by class frequency, of the F1 score for each class. | Calculation | average="weighted" |
| log_loss | This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier's predictions. For a single sample with true label yt in {0,1} and estimated probability yp that yt = 1, the log loss is -log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp)). | Calculation | None |
| norm_macro_recall | Normalized macro recall is macro recall normalized so that random performance has a score of 0 and perfect performance has a score of 1. This is achieved by norm_macro_recall := (recall_score_macro - R)/(1 - R), where R is the expected value of recall_score_macro for random predictions (that is, R = 0.5 for binary classification and R = 1/C for C-class classification problems). | Calculation | average="macro", then (recall_score_macro - R)/(1 - R) as defined in the description |
| precision_score_macro | Precision is the percent of elements labeled as a certain class that actually are in that class. Macro is the arithmetic mean of precision for each class. | Calculation | average="macro" |
| precision_score_micro | Precision is the percent of elements labeled as a certain class that actually are in that class. Micro is computed globally by counting the total true positives and false positives. | Calculation | average="micro" |
| precision_score_weighted | Precision is the percent of elements labeled as a certain class that actually are in that class. Weighted is the arithmetic mean of precision for each class, weighted by the number of true instances in each class. | Calculation | average="weighted" |
| recall_score_macro | Recall is the percent of elements actually in a certain class that are correctly labeled. Macro is the arithmetic mean of recall for each class. | Calculation | average="macro" |
| recall_score_micro | Recall is the percent of elements actually in a certain class that are correctly labeled. Micro is computed globally by counting the total true positives and false negatives. | Calculation | average="micro" |
| recall_score_weighted | Recall is the percent of elements actually in a certain class that are correctly labeled. Weighted is the arithmetic mean of recall for each class, weighted by the number of true instances in each class. | Calculation | average="weighted" |
| weighted_accuracy | Weighted accuracy is accuracy where the weight given to each example is equal to the proportion of true instances in that example's true class. | Calculation | sample_weight is a vector equal to the proportion of that class for each element in the target |

Regression and time series forecasting metrics

The following metrics are saved in each iteration for a regression or forecasting task.

| Metric | Description | Calculation | Extra Parameters |
| --- | --- | --- | --- |
| explained_variance | Explained variance is the proportion to which a mathematical model accounts for the variation of a given data set. It is the percent decrease in variance of the original data to the variance of the errors. When the mean of the errors is 0, it is equal to the coefficient of determination (see r2_score below). | Calculation | None |
| r2_score | R2 is the coefficient of determination, or the percent reduction in squared errors compared to a baseline model that outputs the mean. When the mean of the errors is 0, it is equal to explained variance. | Calculation | None |
| spearman_correlation | Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. | Calculation | None |
| mean_absolute_error | Mean absolute error is the expected value of the absolute value of the difference between the target and the prediction. | Calculation | None |
| normalized_mean_absolute_error | Normalized mean absolute error is mean absolute error divided by the range of the data. | Calculation | Divide by range of the data |
| median_absolute_error | Median absolute error is the median of all absolute differences between the target and the prediction. This loss is robust to outliers. | Calculation | None |
| normalized_median_absolute_error | Normalized median absolute error is median absolute error divided by the range of the data. | Calculation | Divide by range of the data |
| root_mean_squared_error | Root mean squared error is the square root of the expected squared difference between the target and the prediction. | Calculation | None |
| normalized_root_mean_squared_error | Normalized root mean squared error is root mean squared error divided by the range of the data. | Calculation | Divide by range of the data |
| root_mean_squared_log_error | Root mean squared log error is the square root of the expected squared logarithmic error. | Calculation | None |
| normalized_root_mean_squared_log_error | Normalized root mean squared log error is root mean squared log error divided by the range of the data. | Calculation | Divide by range of the data |

Explain the model

While automated machine learning capabilities are generally available, the model explainability feature is still in public preview.

Automated machine learning allows you to understand feature importance. During the training process, you can get global feature importance for the model. For classification scenarios, you can also get class-level feature importance. You must provide a validation dataset (X_valid) to get feature importance.

There are two ways to generate feature importance.

  • Once an experiment is complete, you can use the explain_model method on any iteration.

    from azureml.train.automl.automlexplainer import explain_model
    
    shap_values, expected_values, overall_summary, overall_imp, per_class_summary, per_class_imp = \
        explain_model(fitted_model, X_train, X_test)
    
    #Overall feature importance
    print(overall_imp)
    print(overall_summary)
    
    #Class-level feature importance
    print(per_class_imp)
    print(per_class_summary)
    
  • To view feature importance for all iterations, set the model_explainability flag to True in the AutoMLConfig.

    automl_config = AutoMLConfig(task = 'classification',
                                 debug_log = 'automl_errors.log',
                                 primary_metric = 'AUC_weighted',
                                 max_time_sec = 12000,
                                 iterations = 10,
                                 verbosity = logging.INFO,
                                 X = X_train,
                                 y = y_train,
                                 X_valid = X_test,
                                 y_valid = y_test,
                                 model_explainability=True,
                                 path=project_folder)
    

    Once done, you can use the retrieve_model_explanation method to retrieve feature importance for a specific iteration.

    from azureml.train.automl.automlexplainer import retrieve_model_explanation
    
    shap_values, expected_values, overall_summary, overall_imp, per_class_summary, per_class_imp = \
        retrieve_model_explanation(best_run)
    
    #Overall feature importance
    print(overall_imp)
    print(overall_summary)
    
    #Class-level feature importance
    print(per_class_imp)
    print(per_class_summary)
    

You can visualize the feature importance chart in your workspace in the Azure portal. The chart is also shown when you use the Jupyter widget in a notebook. To learn more about the charts, refer to the Sample Azure Machine Learning service notebooks article.

from azureml.widgets import RunDetails
RunDetails(local_run).show()

feature importance graph

Next steps

Learn more about how and where to deploy a model.

Learn more about how to train a regression model with automated machine learning or how to train by using automated machine learning on a remote resource.