Configure automated machine learning experiments

Automated machine learning picks an algorithm and hyperparameters for you and generates a model ready for deployment. There are several options that you can use to configure automated machine learning experiments. In this guide, learn how to define various configuration settings.

To view examples of automated machine learning experiments, see Tutorial: Train a classification model with automated machine learning or Train models with automated machine learning in the cloud.

Configuration options available in automated machine learning:

  • Select your experiment type: Classification, Regression or Forecasting
  • Data source, formats, and fetch data
  • Choose your compute target: local or remote
  • Automated machine learning experiment settings
  • Run an automated machine learning experiment
  • Explore model metrics
  • Register and deploy model

Select your experiment type

Before you begin your experiment, you should determine the kind of machine learning problem you are solving. Automated machine learning supports task types of classification, regression and forecasting.

While automated machine learning capabilities are generally available, forecasting is still in public preview.

Automated machine learning supports the following algorithms during the automation and tuning process. As a user, there is no need for you to specify the algorithm.

| Classification | Regression | Forecasting |
|---|---|---|
| Logistic Regression | Elastic Net | Elastic Net |
| Stochastic Gradient Descent (SGD) | Light GBM | Light GBM |
| Naive Bayes | Gradient Boosting | Gradient Boosting |
| C-Support Vector Classification (SVC) | Decision Tree | Decision Tree |
| Linear SVC | K Nearest Neighbors | K Nearest Neighbors |
| K Nearest Neighbors | LARS Lasso | LARS Lasso |
| Decision Tree | Stochastic Gradient Descent (SGD) | Stochastic Gradient Descent (SGD) |
| Random Forest | Random Forest | Random Forest |
| Extremely Randomized Trees | Extremely Randomized Trees | Extremely Randomized Trees |
| Gradient Boosting | | |
| Light GBM | | |

Data source and format

Automated machine learning supports data that resides on your local desktop or in the cloud such as Azure Blob Storage. The data can be read into scikit-learn supported data formats. You can read the data into:

  • Numpy arrays X (features) and y (the target variable, also known as the label)
  • Pandas dataframe

Examples:

  • Numpy arrays

    from sklearn import datasets

    digits = datasets.load_digits()
    X_digits = digits.data
    y_digits = digits.target
    
  • Pandas dataframe

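    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    df = pd.read_csv("https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv", delimiter="\t", quotechar='"')
    # get integer labels
    le = LabelEncoder()
    y = le.fit_transform(df["Label"].values)
    df = df.drop(["Label"], axis=1)
    df_train, _, y_train, _ = train_test_split(df, y, test_size=0.1, random_state=42)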
    

Fetch data for running experiment on remote compute

If you are using a remote compute to run your experiment, the data fetch must be wrapped in a get_data() function in a separate Python script. This script runs on the remote compute where the automated machine learning experiment runs. get_data eliminates the need to fetch the data over the wire for each iteration. Without get_data, your experiment will fail when you run on remote compute.

Here is an example of get_data:

%%writefile $project_folder/get_data.py 
import pandas as pd 
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import LabelEncoder 
def get_data(): # Burning man 2016 data 
    df = pd.read_csv("https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv", delimiter="\t", quotechar='"') 
    # get integer labels 
    le = LabelEncoder() 
    le.fit(df["Label"].values) 
    y = le.transform(df["Label"].values) 
    df = df.drop(["Label"], axis=1) 
    df_train, _, y_train, _ = train_test_split(df, y, test_size=0.1, random_state=42) 
    return { "X" : df, "y" : y }

In your AutoMLConfig object, you specify the data_script parameter and provide the path to the get_data script file similar to below:

automl_config = AutoMLConfig(****, data_script=project_folder + "/get_data.py", **** )

The get_data script can return the following keys:

| Key | Type | Mutually Exclusive with | Description |
|---|---|---|---|
| X | Pandas Dataframe or Numpy Array | data_train, label, columns | All features to train with |
| y | Pandas Dataframe or Numpy Array | label | Label data to train with. For classification, should be an array of integers. |
| X_valid | Pandas Dataframe or Numpy Array | data_train, label | Optional. All features to validate with. If not specified, X is split between train and validate. |
| y_valid | Pandas Dataframe or Numpy Array | data_train, label | Optional. The label data to validate with. If not specified, y is split between train and validate. |
| sample_weight | Pandas Dataframe or Numpy Array | data_train, label, columns | Optional. A weight value for each sample. Use when you would like to assign different weights for your data points. |
| sample_weight_valid | Pandas Dataframe or Numpy Array | data_train, label, columns | Optional. A weight value for each validation sample. If not specified, sample_weight is split between train and validate. |
| data_train | Pandas Dataframe | X, y, X_valid, y_valid | All data (features+label) to train with |
| label | string | X, y, X_valid, y_valid | Which column in data_train represents the label |
| columns | Array of strings | | Optional. Whitelist of columns to use for features |
| cv_splits_indices | Array of integers | | Optional. List of indexes to split the data for cross validation |
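
The mutual-exclusivity rules can be checked before a run is submitted. The following is a minimal sketch, not part of the SDK: check_get_data_keys is a hypothetical helper, and its exclusion map only transcribes the "Mutually Exclusive with" information listed for each key.

```python
# Hypothetical helper (not part of the Azure ML SDK): reports keys returned by
# get_data() that the documentation lists as mutually exclusive.
EXCLUSIVE_WITH = {
    "X": {"data_train", "label", "columns"},
    "y": {"label"},
    "X_valid": {"data_train", "label"},
    "y_valid": {"data_train", "label"},
    "sample_weight": {"data_train", "label", "columns"},
    "sample_weight_valid": {"data_train", "label", "columns"},
    "data_train": {"X", "y", "X_valid", "y_valid"},
    "label": {"X", "y", "X_valid", "y_valid"},
}


def check_get_data_keys(result):
    """Return a sorted list of (key, conflicting_key) pairs found in result."""
    present = set(result)
    conflicts = set()
    for key in present:
        for other in EXCLUSIVE_WITH.get(key, set()) & present:
            # Store each pair once, regardless of which side reported it.
            conflicts.add(tuple(sorted((key, other))))
    return sorted(conflicts)


# Features/labels style and data_train/label style are both valid on their own.
ok_arrays = check_get_data_keys({"X": [[1, 2]], "y": [0]})
ok_frame = check_get_data_keys({"data_train": "df", "label": "Label", "columns": ["a"]})
# Mixing the two styles is reported as a conflict.
mixed = check_get_data_keys({"X": [[1, 2]], "data_train": "df"})
```

An empty list means the returned keys are compatible with each other; it does not validate the types or shapes of the values.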

Load and prepare data using DataPrep SDK

Automated machine learning experiments support data loading and transforms using the DataPrep SDK. Using the SDK provides the ability to:

  • Load from many file types with parsing parameter inference (encoding, separator, headers)
  • Type-conversion using inference during file loading
  • Connection support for MS SQL Server and Azure Data Lake Storage
  • Add column using an expression
  • Impute missing values
  • Derive column by example
  • Filtering
  • Custom Python transforms

To learn more about the DataPrep SDK, refer to the How to prepare data for modeling article. Below is an example of loading data using the DataPrep SDK.

import azureml.dataprep as dprep

# The data referenced here was pulled from `sklearn.datasets.load_digits()`.
simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'

# `auto_read_file` infers the delimiters and datatypes of a file.
X = dprep.auto_read_file(simple_example_data_root + 'X.csv').skip(1)  # Remove the header row.

# Here we read a comma delimited file and convert all columns to integers.
y = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex = True))

Train and validation data

You can specify separate training and validation sets either through get_data() or directly in the AutoMLConfig constructor.

Cross validation split options

K-Folds Cross Validation

Use the n_cross_validations setting to specify the number of cross validations. The training data set is randomly split into n_cross_validations folds of equal size. During each cross validation round, one of the folds is used to validate the model trained on the remaining folds. This process repeats for n_cross_validations rounds until each fold has been used once as the validation set. The average scores across all n_cross_validations rounds are reported, and the corresponding model is retrained on the whole training data set.
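
The splitting scheme can be illustrated in a few lines of plain Python. This is a minimal sketch of K-fold splitting in general, not the service's internal implementation; k_fold_indices, cross_validated_score, and the fit_and_score callback are hypothetical names introduced for illustration.

```python
import random


def k_fold_indices(n_samples, n_cross_validations, seed=0):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding distributes samples so that fold sizes differ by at most one.
    folds = [indices[i::n_cross_validations] for i in range(n_cross_validations)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validated_score(n_samples, n_cross_validations, fit_and_score):
    """Average the score returned by fit_and_score(train, validation) over all folds."""
    scores = [fit_and_score(train, validation)
              for train, validation in k_fold_indices(n_samples, n_cross_validations)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(n_samples=10, n_cross_validations=5))
```

Every sample appears in exactly one validation fold, which is what distinguishes this scheme from repeated random sub-sampling.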

Monte Carlo Cross Validation (a.k.a. Repeated Random Sub-Sampling)

Use validation_size to specify the percentage of the training dataset that should be used for validation, and use n_cross_validations to specify the number of cross validations. During each cross validation round, a subset of size validation_size will be randomly selected for validation of the model trained on the remaining data. Finally, the average scores across all n_cross_validations rounds will be reported, and the corresponding model will be retrained on the whole training data set.
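
For comparison, repeated random sub-sampling can be sketched as follows. The function name monte_carlo_splits is hypothetical, and the sketch assumes validation_size is given as a fraction between 0 and 1; it illustrates the technique rather than the service's own code.

```python
import random


def monte_carlo_splits(n_samples, validation_size, n_cross_validations, seed=0):
    """Yield (train_indices, validation_indices) for each random round.

    Assumption of this sketch: validation_size is a fraction in (0, 1).
    """
    rng = random.Random(seed)
    n_valid = max(1, round(n_samples * validation_size))
    for _ in range(n_cross_validations):
        # Each round draws a fresh random subset, independent of earlier rounds.
        validation = set(rng.sample(range(n_samples), n_valid))
        train = [i for i in range(n_samples) if i not in validation]
        yield train, sorted(validation)


rounds = list(monte_carlo_splits(n_samples=20, validation_size=0.1, n_cross_validations=3))
```

Because each round samples independently, a given row may be used for validation in several rounds or in none.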

Custom validation dataset

Use a custom validation dataset if a random split is not acceptable (usually for time series data or imbalanced data). You can specify your own validation dataset, and the model will be evaluated against it instead of against a randomly selected subset.
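
For time series data, one common way to build such a validation set is to hold out the most recent rows. The sketch below assumes the rows are already sorted by time; chronological_split is a hypothetical helper, and the two parts it returns would then be supplied as the training data and as X_valid and y_valid.

```python
def chronological_split(rows, validation_fraction=0.2):
    """Split time-ordered rows: earliest rows for training, latest for validation."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    n_valid = max(1, int(len(rows) * validation_fraction))
    return rows[:-n_valid], rows[-n_valid:]


# Stand-in for 10 samples that are already sorted by timestamp.
history = list(range(10))
train_rows, valid_rows = chronological_split(history, validation_fraction=0.2)
```

Unlike a random split, no validation row precedes any training row, so the model is never evaluated on data older than what it was trained on.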

Compute to run experiment

Next determine where the model will be trained. An automated machine learning training experiment can run on the following compute options:

  • Your local machine such as a local desktop or laptop – Generally used when you have a small dataset and you are still in the exploration stage.
  • A remote machine in the cloud – Azure Machine Learning Managed Compute is a managed service that enables the ability to train machine learning models on clusters of Azure virtual machines.

See the GitHub site for example notebooks with local and remote compute targets.

Configure your experiment settings

There are several options that you can use to configure your automated machine learning experiment. These parameters are set by instantiating an AutoMLConfig object.

Some examples include:

  1. A classification experiment that uses AUC weighted as the primary metric, with a maximum time of 12,000 seconds per iteration, set to end after 50 iterations and to use 2 cross validation folds.

    automl_classifier = AutoMLConfig(
        task='classification',
        primary_metric='AUC_weighted',
        max_time_sec=12000,
        iterations=50,
        X=X, 
        y=y,
        n_cross_validations=2)
    
  2. A regression experiment set to end after 100 iterations, with each iteration lasting up to 600 seconds, and to use 5 cross validation folds.

    automl_regressor = AutoMLConfig(
        task='regression',
        max_time_sec=600,
        iterations=100,
        primary_metric='r2_score',
        X=X, 
        y=y,
        n_cross_validations=5)
    

This table lists parameter settings available for your experiment and their default values.

| Property | Description | Default Value |
|---|---|---|
| task | Specify the type of machine learning problem. Allowed values are: classification, regression, forecasting. | None |
| primary_metric | Metric that you want to optimize in building your model. For example, if you specify accuracy as the primary_metric, automated machine learning looks to find a model with maximum accuracy. You can only specify one primary_metric per experiment. Allowed values for Classification: accuracy, AUC_weighted, precision_score_weighted, balanced_accuracy, average_precision_score_weighted. Allowed values for Regression: normalized_mean_absolute_error, spearman_correlation, normalized_root_mean_squared_error, normalized_root_mean_squared_log_error, r2_score. | For Classification: accuracy. For Regression: spearman_correlation. |
| experiment_exit_score | You can set a target value for your primary_metric. Once a model is found that meets the primary_metric target, automated machine learning stops iterating and the experiment terminates. If this value is not set (default), or if the target is never reached, the experiment continues until it has run the number of iterations specified in iterations. Takes a double value. | None |
| iterations | Maximum number of iterations. Each iteration is equal to a training job that results in a pipeline. A pipeline is data preprocessing plus a model. To get a high-quality model, use 250 or more. | 100 |
| max_concurrent_iterations | Maximum number of iterations to run in parallel. This setting works only for remote compute. | 1 |
| max_cores_per_iteration | Indicates how many cores on the compute target are used to train a single pipeline. If the algorithm can leverage multiple cores, this increases the performance on a multi-core machine. You can set it to -1 to use all the cores available on the machine. | 1 |
| iteration_timeout_minutes | Limits the amount of time (minutes) a particular iteration takes. If an iteration exceeds the specified amount, that iteration gets canceled. If not set, the iteration continues to run until it is finished. | None |
| n_cross_validations | Number of cross validation splits. | None |
| validation_size | Size of the validation set as a percentage of all training samples. | None |
| preprocess | True/False. True enables the experiment to perform preprocessing on the input. The following is a subset of the preprocessing steps. Missing data: imputes the missing data, numerical with the average and text with the most frequent value. Categorical values: if the data type is numeric and the number of unique values is less than 5 percent, converts into one-hot encoding. For the complete list, check the GitHub repository. Note: if data is sparse, you cannot use preprocess = True. | False |
| blacklist_models | Automated machine learning tries many different algorithms. Configure this setting to exclude certain algorithms from the experiment. Useful if you are aware that some algorithms do not work well for your dataset. Excluding algorithms can save you compute resources and training time. Allowed values for Classification: LogisticRegression, SGD, MultinomialNaiveBayes, BernoulliNaiveBayes, SVM, LinearSVM, KNN, DecisionTree, RandomForest, ExtremeRandomTrees, LightGBM, GradientBoosting, TensorFlowDNN, TensorFlowLinearClassifier. Allowed values for Regression: ElasticNet, GradientBoosting, DecisionTree, KNN, LassoLars, SGD, RandomForest, ExtremeRandomTree, LightGBM, TensorFlowLinearRegressor, TensorFlowDNN. Allowed values for Forecasting: ElasticNet, GradientBoosting, DecisionTree, KNN, LassoLars, SGD, RandomForest, ExtremeRandomTree, LightGBM, TensorFlowLinearRegressor, TensorFlowDNN. | None |
| whitelist_models | Automated machine learning tries many different algorithms. Configure this setting to include only certain algorithms in the experiment. Useful if you are aware that some algorithms do work well for your dataset. Allowed values for Classification: LogisticRegression, SGD, MultinomialNaiveBayes, BernoulliNaiveBayes, SVM, LinearSVM, KNN, DecisionTree, RandomForest, ExtremeRandomTrees, LightGBM, GradientBoosting, TensorFlowDNN, TensorFlowLinearClassifier. Allowed values for Regression: ElasticNet, GradientBoosting, DecisionTree, KNN, LassoLars, SGD, RandomForest, ExtremeRandomTree, LightGBM, TensorFlowLinearRegressor, TensorFlowDNN. Allowed values for Forecasting: ElasticNet, GradientBoosting, DecisionTree, KNN, LassoLars, SGD, RandomForest, ExtremeRandomTree, LightGBM, TensorFlowLinearRegressor, TensorFlowDNN. | None |
| verbosity | Controls the level of logging, with INFO being the most verbose and CRITICAL being the least. Verbosity level takes the same values as defined in the Python logging package. Allowed values are: logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL. | logging.INFO |
| X | All features to train with. | None |
| y | Label data to train with. For classification, should be an array of integers. | None |
| X_valid | Optional. All features to validate with. If not specified, X is split between train and validate. | None |
| y_valid | Optional. The label data to validate with. If not specified, y is split between train and validate. | None |
| sample_weight | Optional. A weight value for each sample. Use when you would like to assign different weights for your data points. | None |
| sample_weight_valid | Optional. A weight value for each validation sample. If not specified, sample_weight is split between train and validate. | None |
| run_configuration | RunConfiguration object. Used for remote runs. | None |
| data_script | Path to a file containing the get_data method. Required for remote runs. | None |
| model_explainability | Optional. True/False. True enables the experiment to compute feature importance for every iteration. You can also use the explain_model() method on a specific iteration to compute feature importance on demand for that iteration after the experiment is complete. | False |
| enable_ensembling | Flag to enable an ensembling iteration after all the other iterations complete. | True |
| ensemble_iterations | Number of iterations during which a fitted pipeline is chosen to be part of the final ensemble. | 15 |
| experiment_timeout_minutes | Limits the amount of time (minutes) that the whole experiment run can take. | None |
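
The interaction between iterations and experiment_exit_score can be pictured with a small simulation. This is a sketch of the stopping rule as described in the table, not SDK code: run_iterations is a hypothetical function, the scores are made-up inputs, and the sketch assumes a primary metric where higher is better (such as accuracy).

```python
def run_iterations(scores, iterations=100, experiment_exit_score=None):
    """Return (iterations_run, best_score) for a sequence of per-iteration scores.

    Assumption of this sketch: a higher score is better.
    """
    best = None
    count = 0
    for score in scores[:iterations]:
        count += 1
        if best is None or score > best:
            best = score
        # Stop early once the target is met; otherwise use up all iterations.
        if experiment_exit_score is not None and best >= experiment_exit_score:
            break
    return count, best


stopped_early = run_iterations([0.6, 0.7, 0.92, 0.95], iterations=4, experiment_exit_score=0.9)
never_reached = run_iterations([0.6, 0.7], iterations=5, experiment_exit_score=0.9)
no_target = run_iterations([0.1, 0.2, 0.3], iterations=2)
```

In the first call the loop stops at the third iteration because the target was met; in the other two calls the run ends only when the scores or the iteration budget run out.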

    Data pre-processing and featurization

    If you use preprocess=True, the following data preprocessing steps are performed automatically for you:

    1. Drop high cardinality or no variance features
      • Drop features with no useful information from training and validation sets. These include features with all values missing, same value across all rows or with extremely high cardinality (e.g., hashes, IDs or GUIDs).
    2. Missing value imputation
      • For numerical features, impute missing values with average of values in the column.
      • For categorical features, impute missing values with most frequent value.
    3. Generate additional features
      • For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.
      • For Text features: Term frequency based on word unigram, bi-grams, and tri-gram, Count vectorizer.
    4. Transformations and encodings
      • Numeric features with very few unique values transformed into categorical features.
      • Depending on cardinality of categorical features, perform label encoding or (hashing) one-hot encoding.
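
A few of these steps can be sketched in plain Python to show what they do to a small table. This is an illustration under simplifying assumptions, not the service's implementation: the helper names are hypothetical, columns are plain lists with None marking missing values, and only constant-column removal, imputation, and one-hot encoding are shown.

```python
from collections import Counter
from statistics import mean


def drop_constant_columns(table):
    """Remove columns that are entirely missing or hold a single distinct value."""
    return {name: values for name, values in table.items()
            if len({v for v in values if v is not None}) > 1}


def impute(values):
    """Fill None with the mean (numeric column) or the most frequent value (otherwise)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 column per distinct value."""
    return {c: [1 if v == c else 0 for v in values] for c in sorted(set(values))}


table = {
    "age": [30, None, 50, 40],
    "city": ["Oslo", "Paris", "Oslo", None],
    "constant": [1, 1, 1, 1],
}
cleaned = {name: impute(col) for name, col in drop_constant_columns(table).items()}
encoded_city = one_hot(cleaned["city"])
```

The "constant" column is dropped because it carries no information, the missing age becomes the column mean, and the missing city becomes the most frequent city before encoding.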

    Run experiment

    Submit the experiment to run and generate a model. Pass the AutoMLConfig to the submit method to generate the model.

    run = experiment.submit(automl_config, show_output=True)
    

    Note

    Dependencies are first installed on a new machine. It may take up to 10 minutes before output is shown. Setting show_output to True results in output being shown on the console.

    Explore model metrics

    You can view your results in a widget or inline if you are in a notebook. See Track and evaluate models for more details.

    Classification metrics

    The following metrics are saved in each iteration for a classification task.

    | Primary Metric | Description | Extra Parameters |
    |---|---|---|
    | AUC_macro | AUC is the Area under the Receiver Operating Characteristic Curve. Macro is the arithmetic mean of the AUC for each class. | average="macro" |
    | AUC_micro | AUC is the Area under the Receiver Operating Characteristic Curve. Micro is computed globally by combining the true positives and false positives from each class. | average="micro" |
    | AUC_weighted | AUC is the Area under the Receiver Operating Characteristic Curve. Weighted is the arithmetic mean of the score for each class, weighted by the number of true instances in each class. | average="weighted" |
    | accuracy | Accuracy is the percent of predicted labels that exactly match the true labels. | None |
    | average_precision_score_macro | Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Macro is the arithmetic mean of the average precision score of each class. | average="macro" |
    | average_precision_score_micro | Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Micro is computed globally by combining the true positives and false positives at each cutoff. | average="micro" |
    | average_precision_score_weighted | Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Weighted is the arithmetic mean of the average precision score for each class, weighted by the number of true instances in each class. | average="weighted" |
    | balanced_accuracy | Balanced accuracy is the arithmetic mean of recall for each class. | average="macro" |
    | f1_score_macro | F1 score is the harmonic mean of precision and recall. Macro is the arithmetic mean of the F1 score for each class. | average="macro" |
    | f1_score_micro | F1 score is the harmonic mean of precision and recall. Micro is computed globally by counting the total true positives, false negatives, and false positives. | average="micro" |
    | f1_score_weighted | F1 score is the harmonic mean of precision and recall. Weighted is the mean of the F1 score for each class, weighted by class frequency. | average="weighted" |
    | log_loss | This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. For a single sample with true label yt in {0,1} and estimated probability yp that yt = 1, the log loss is -log P(yt \| yp) = -(yt log(yp) + (1 - yt) log(1 - yp)). | None |
    | norm_macro_recall | Normalized Macro Recall is Macro Recall normalized so that random performance has a score of 0 and perfect performance has a score of 1. This is achieved by norm_macro_recall := (recall_score_macro - R)/(1 - R), where R is the expected value of recall_score_macro for random predictions (i.e., R=0.5 for binary classification and R=(1/C) for C-class classification problems). | average="macro", followed by the normalization (recall_score_macro - R)/(1 - R) |
    | precision_score_macro | Precision is the percent of elements labeled as a certain class that actually are in that class. Macro is the arithmetic mean of precision for each class. | average="macro" |
    | precision_score_micro | Precision is the percent of elements labeled as a certain class that actually are in that class. Micro is computed globally by counting the total true positives and false positives. | average="micro" |
    | precision_score_weighted | Precision is the percent of elements labeled as a certain class that actually are in that class. Weighted is the arithmetic mean of precision for each class, weighted by the number of true instances in each class. | average="weighted" |
    | recall_score_macro | Recall is the percent of elements actually in a certain class that are correctly labeled. Macro is the arithmetic mean of recall for each class. | average="macro" |
    | recall_score_micro | Recall is the percent of elements actually in a certain class that are correctly labeled. Micro is computed globally by counting the total true positives and false negatives. | average="micro" |
    | recall_score_weighted | Recall is the percent of elements actually in a certain class that are correctly labeled. Weighted is the arithmetic mean of recall for each class, weighted by the number of true instances in each class. | average="weighted" |
    | weighted_accuracy | Weighted accuracy is accuracy where the weight given to each example is equal to the proportion of true instances in that example's true class. | sample_weight is a vector equal to the proportion of that class for each element in the target |
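
The difference between the macro, micro, and weighted averages, and the normalization behind norm_macro_recall, can be seen on a tiny example. The sketch below computes recall in plain Python under stated assumptions: recall_summary is a hypothetical helper, the labels are made up, and at least two classes must be present.

```python
def recall_summary(y_true, y_pred):
    """Return macro, micro, and weighted recall plus norm_macro_recall."""
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = {c: sum(t == c and p == c for t, p in zip(y_true, y_pred)) for c in labels}
    support = {c: sum(t == c for t in y_true) for c in labels}  # true instances per class
    recall = {c: true_pos[c] / support[c] if support[c] else 0.0 for c in labels}

    total = len(y_true)
    macro = sum(recall.values()) / len(labels)                  # plain mean over classes
    micro = sum(true_pos.values()) / total                      # pooled counts
    weighted = sum(recall[c] * support[c] for c in labels) / total
    r = 1 / len(labels)                                         # expected macro recall at random
    return {
        "macro": macro,
        "micro": micro,
        "weighted": weighted,
        "norm_macro_recall": (macro - r) / (1 - r),
    }


# Imbalanced toy labels: four samples of class 0, two of class 1.
summary = recall_summary(y_true=[0, 0, 0, 0, 1, 1], y_pred=[0, 0, 0, 1, 1, 0])
```

Here class 0 has recall 0.75 and class 1 has recall 0.5, so the macro average (0.625, which is also the balanced accuracy) is lower than the micro and weighted averages (both 4/6), because the macro average gives the smaller class equal say.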

    Regression and forecasting metrics

    The following metrics are saved in each iteration for a regression or forecasting task.

    | Primary Metric | Description | Extra Parameters |
    |---|---|---|
    | explained_variance | Explained variance is the proportion to which a mathematical model accounts for the variation of a given data set. It is the percent decrease in variance of the original data to the variance of the errors. When the mean of the errors is 0, it is equal to r2_score. | None |
    | r2_score | R2 is the coefficient of determination or the percent reduction in squared errors compared to a baseline model that outputs the mean. When the mean of the errors is 0, it is equal to explained variance. | None |
    | spearman_correlation | Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. | None |
    | mean_absolute_error | Mean absolute error is the expected value of the absolute value of the difference between the target and the prediction. | None |
    | normalized_mean_absolute_error | Normalized mean absolute error is mean absolute error divided by the range of the data. | Divide by range of the data |
    | median_absolute_error | Median absolute error is the median of all absolute differences between the target and the prediction. This loss is robust to outliers. | None |
    | normalized_median_absolute_error | Normalized median absolute error is median absolute error divided by the range of the data. | Divide by range of the data |
    | root_mean_squared_error | Root mean squared error is the square root of the expected squared difference between the target and the prediction. | None |
    | normalized_root_mean_squared_error | Normalized root mean squared error is root mean squared error divided by the range of the data. | Divide by range of the data |
    | root_mean_squared_log_error | Root mean squared log error is the square root of the expected squared logarithmic error. | None |
    | normalized_root_mean_squared_log_error | Normalized root mean squared log error is root mean squared log error divided by the range of the data. | Divide by range of the data |
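
Several of these definitions can be worked through on a small example. The sketch below is plain Python, not the service's implementation: regression_summary is a hypothetical helper, and it assumes that "range of the data" means the range of the true target values.

```python
import math


def regression_summary(y_true, y_pred):
    """Compute a few of the regression metrics from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    data_range = max(y_true) - min(y_true)      # assumption: range of the true targets

    mean_true = sum(y_true) / n
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    mean_error = sum(errors) / n
    error_variance = sum((e - mean_error) ** 2 for e in errors) / n
    return {
        "mean_absolute_error": mae,
        "normalized_mean_absolute_error": mae / data_range,
        "root_mean_squared_error": rmse,
        "normalized_root_mean_squared_error": rmse / data_range,
        "r2_score": 1 - sum(e * e for e in errors) / total_ss,
        "explained_variance": 1 - error_variance / (total_ss / n),
    }


# Predictions that are consistently 0.5 too high.
metrics = regression_summary([1.0, 2.0, 3.0, 4.0, 5.0], [1.5, 2.5, 3.5, 4.5, 5.5])
```

With a constant bias the errors have zero variance, so explained_variance is 1.0 while r2_score drops to 0.875; the two agree only when the mean of the errors is 0.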

    Explain the model

    While automated machine learning capabilities are generally available, the model explainability feature is still in public preview.

    Automated machine learning allows you to understand feature importance. During the training process, you can get global feature importance for the model. For classification scenarios, you can also get class-level feature importance. You must provide a validation dataset (X_valid) to get feature importance.

    There are two ways to generate feature importance.

    • Once an experiment is complete, you can use the explain_model method on any iteration.

      from azureml.train.automl.automlexplainer import explain_model
      
      shap_values, expected_values, overall_summary, overall_imp, per_class_summary, per_class_imp = \
          explain_model(fitted_model, X_train, X_test)
      
      #Overall feature importance
      print(overall_imp)
      print(overall_summary) 
      
      #Class-level feature importance
      print(per_class_imp)
      print(per_class_summary) 
      
    • To view feature importance for all iterations, set the model_explainability flag to True in AutoMLConfig.

      import logging

      automl_config = AutoMLConfig(task = 'classification',
                                   debug_log = 'automl_errors.log',
                                   primary_metric = 'AUC_weighted',
                                   max_time_sec = 12000,
                                   iterations = 10,
                                   verbosity = logging.INFO,
                                   X = X_train, 
                                   y = y_train,
                                   X_valid = X_test,
                                   y_valid = y_test,
                                   model_explainability=True,
                                   path=project_folder)
      

      Once done, you can use the retrieve_model_explanation method to retrieve feature importance for a specific iteration.

      from azureml.train.automl.automlexplainer import retrieve_model_explanation
      
      shap_values, expected_values, overall_summary, overall_imp, per_class_summary, per_class_imp = \
          retrieve_model_explanation(best_run)
      
      #Overall feature importance
      print(overall_imp)
      print(overall_summary) 
      
      #Class-level feature importance
      print(per_class_imp)
      print(per_class_summary) 
      

    You can visualize the feature importance chart in your workspace in the Azure portal. The chart is also shown when using the Jupyter widget in a notebook. To learn more about the charts refer to the Sample Azure ML notebooks article.

    from azureml.widgets import RunDetails
    RunDetails(local_run).show()
    

    [Figure: feature importance graph]

    Next steps

    Learn more about how and where to deploy a model.

    Learn more about how to train a classification model with automated machine learning or how to train using automated machine learning on a remote resource.