Configure automated machine learning experiments
Automated machine learning picks an algorithm and hyperparameters for you and generates a model ready for deployment. There are several options that you can use to configure automated machine learning experiments. In this guide, learn how to define various configuration settings.
To view examples of automated machine learning experiments, see Tutorial: Train a classification model with automated machine learning or Train models with automated machine learning in the cloud.
Configuration options available in automated machine learning:
- Select your experiment type: Classification, Regression or Forecasting
- Data source, formats, and fetch data
- Choose your compute target: local or remote
- Automated machine learning experiment settings
- Run an automated machine learning experiment
- Explore model metrics
- Register and deploy model
Select your experiment type
Before you begin your experiment, you should determine the kind of machine learning problem you are solving. Automated machine learning supports task types of classification, regression and forecasting.
While automated machine learning capabilities are generally available, forecasting is still in public preview.
Automated machine learning supports the following algorithms during the automation and tuning process. As a user, there is no need for you to specify the algorithm.
Data source and format
Automated machine learning supports data that resides on your local desktop or in the cloud, such as in Azure Blob Storage. The data can be read into scikit-learn supported data formats. You can read the data into:
- NumPy arrays X (features) and y (target variable, also known as the label)
- Pandas dataframe
Examples:
NumPy arrays
from sklearn import datasets

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
Pandas dataframe
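import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv", delimiter="\t", quotechar='"')
# get integer labels
y = LabelEncoder().fit_transform(df["Label"].values)
df = df.drop(["Label"], axis=1)
df_train, _, y_train, _ = train_test_split(df, y, test_size=0.1, random_state=42)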
Fetch data for running experiment on remote compute
If you use a remote compute to run your experiment, the data fetch must be wrapped in a separate Python script that defines get_data(). This script runs on the remote compute where the automated machine learning experiment runs. get_data() eliminates the need to fetch the data over the wire for each iteration. Without get_data(), your experiment will fail when you run on remote compute.
Here is an example of get_data():
%%writefile $project_folder/get_data.py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def get_data():
    # Burning man 2016 data
    df = pd.read_csv("https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv", delimiter="\t", quotechar='"')
    # get integer labels
    le = LabelEncoder()
    le.fit(df["Label"].values)
    y = le.transform(df["Label"].values)
    df = df.drop(["Label"], axis=1)
    df_train, _, y_train, _ = train_test_split(df, y, test_size=0.1, random_state=42)
    return { "X" : df, "y" : y }
In your AutoMLConfig object, you specify the data_script parameter and provide the path to the get_data script file, similar to the following:
automl_config = AutoMLConfig(****, data_script=project_folder + "/get_data.py", **** )
The get_data script can return:
Key | Type | Mutually Exclusive with | Description |
---|---|---|---|
X | Pandas Dataframe or Numpy Array | data_train, label, columns | All features to train with |
y | Pandas Dataframe or Numpy Array | label | Label data to train with. For classification, should be an array of integers. |
X_valid | Pandas Dataframe or Numpy Array | data_train, label | Optional. All features to validate with. If not specified, X is split between train and validate |
y_valid | Pandas Dataframe or Numpy Array | data_train, label | Optional. The label data to validate with. If not specified, y is split between train and validate |
sample_weight | Pandas Dataframe or Numpy Array | data_train, label, columns | Optional. A weight value for each sample. Use when you would like to assign different weights for your data points |
sample_weight_valid | Pandas Dataframe or Numpy Array | data_train, label, columns | Optional. A weight value for each validation sample. If not specified, sample_weight is split between train and validate |
data_train | Pandas Dataframe | X, y, X_valid, y_valid | All data (features+label) to train with |
label | string | X, y, X_valid, y_valid | Which column in data_train represents the label |
columns | Array of strings | | Optional. Whitelist of columns to use for features |
cv_splits_indices | Array of integers | | Optional. List of indexes to split the data for cross validation |
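The "Mutually Exclusive with" column can be checked before a run is submitted. The following is a minimal sketch, not part of the SDK: `MUTUALLY_EXCLUSIVE` and `find_conflicts` are hypothetical names that only encode the table above.

```python
# Hypothetical helper (not part of the Azure ML SDK): encodes the
# "Mutually Exclusive with" column of the get_data return table.
MUTUALLY_EXCLUSIVE = {
    "X": {"data_train", "label", "columns"},
    "y": {"label"},
    "X_valid": {"data_train", "label"},
    "y_valid": {"data_train", "label"},
    "sample_weight": {"data_train", "label", "columns"},
    "sample_weight_valid": {"data_train", "label", "columns"},
    "data_train": {"X", "y", "X_valid", "y_valid"},
    "label": {"X", "y", "X_valid", "y_valid"},
}


def find_conflicts(returned):
    """Return sorted (key, conflicting_key) pairs found in a get_data() result."""
    keys = set(returned)
    conflicts = set()
    for key in keys:
        for other in MUTUALLY_EXCLUSIVE.get(key, set()) & keys:
            conflicts.add(tuple(sorted((key, other))))
    return sorted(conflicts)


print(find_conflicts({"X": None, "y": None}))           # no conflicting keys
print(find_conflicts({"X": None, "data_train": None}))  # X conflicts with data_train
```

A check like this could run at the end of get_data() so that a conflicting return value is caught locally rather than on the remote compute.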
Load and prepare data using DataPrep SDK
Automated machine learning experiments support data loading and transforms using the DataPrep SDK. Using the SDK provides the ability to:
- Load from many file types with parsing parameter inference (encoding, separator, headers)
- Type-conversion using inference during file loading
- Connection support for MS SQL Server and Azure Data Lake Storage
- Add column using an expression
- Impute missing values
- Derive column by example
- Filtering
- Custom Python transforms
To learn about the DataPrep SDK, refer to the How to prepare data for modeling article. Below is an example of loading data using the DataPrep SDK.
import azureml.dataprep as dprep

# The data referenced here was pulled from `sklearn.datasets.load_digits()`.
simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'
# You can use `auto_read_file` which intelligently figures out delimiters and datatypes of a file.
X = dprep.auto_read_file(simple_example_data_root + 'X.csv').skip(1) # Remove the header row.
# Here we read a comma delimited file and convert all columns to integers.
y = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex = True))
Train and validation data
You can specify separate training and validation sets either through get_data() or directly in the AutoMLConfig constructor.
Cross validation split options
K-Folds Cross Validation
Use the n_cross_validations setting to specify the number of cross validations. The training data set will be randomly split into n_cross_validations folds of equal size. During each cross validation round, one of the folds will be used for validation of the model trained on the remaining folds. This process repeats for n_cross_validations rounds until each fold is used once as the validation set. The average scores across all n_cross_validations rounds will be reported, and the corresponding model will be retrained on the whole training data set.
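The K-fold procedure can be sketched in plain Python. This is an illustration of the splitting and averaging logic only, not the service's implementation; `k_fold_indices`, `cross_validated_score`, and `score_fn` are hypothetical names.

```python
import random


def k_fold_indices(n_samples, n_cross_validations, seed=0):
    """Yield (train_indices, valid_indices) for each cross validation round."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::n_cross_validations] for i in range(n_cross_validations)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


def cross_validated_score(n_samples, n_cross_validations, score_fn):
    """Average score_fn(train, valid) over all rounds."""
    scores = [score_fn(train, valid)
              for train, valid in k_fold_indices(n_samples, n_cross_validations)]
    return sum(scores) / len(scores)


for train, valid in k_fold_indices(n_samples=10, n_cross_validations=5):
    print(len(train), "training rows,", len(valid), "validation rows")
```

Every row lands in exactly one validation fold, which is what distinguishes this scheme from repeated random sub-sampling.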
Monte Carlo Cross Validation (a.k.a. Repeated Random Sub-Sampling)
Use validation_size to specify the percentage of the training dataset that should be used for validation, and use n_cross_validations to specify the number of cross validations. During each cross validation round, a subset of size validation_size will be randomly selected for validation of the model trained on the remaining data. Finally, the average scores across all n_cross_validations rounds will be reported, and the corresponding model will be retrained on the whole training data set.
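Repeated random sub-sampling can be sketched as follows. This is an illustration only: `monte_carlo_splits` is a hypothetical name, and the sketch assumes validation_size is given as a fraction between 0 and 1.

```python
import random


def monte_carlo_splits(n_samples, validation_size, n_cross_validations, seed=0):
    """Yield (train_indices, valid_indices); validation_size is a fraction in (0, 1)."""
    rng = random.Random(seed)
    n_valid = max(1, round(n_samples * validation_size))
    for _ in range(n_cross_validations):
        # Each round draws a fresh random validation subset.
        valid = set(rng.sample(range(n_samples), n_valid))
        train = [i for i in range(n_samples) if i not in valid]
        yield train, sorted(valid)


for train, valid in monte_carlo_splits(10, validation_size=0.2, n_cross_validations=3):
    print("validate on", valid, "train on", len(train), "rows")
```

Because each round samples independently, a row may appear in several validation subsets or in none.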
Custom validation dataset
Use a custom validation dataset if a random split is not acceptable, which is usually the case for time series data or imbalanced data. When you specify your own validation dataset, the model is evaluated against that dataset instead of a randomly selected one.
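For time series data, one common way to build a custom validation set is to hold out the most recent rows. The sketch below assumes rows are dictionaries with a sortable timestamp field; `chronological_split` and its parameters are hypothetical names, not SDK functions.

```python
def chronological_split(rows, timestamp_key, valid_fraction=0.2):
    """Split rows so that the validation set holds the most recent observations."""
    ordered = sorted(rows, key=lambda row: row[timestamp_key])
    n_valid = max(1, int(len(ordered) * valid_fraction))
    cut = len(ordered) - n_valid
    return ordered[:cut], ordered[cut:]


rows = [{"t": t, "value": 2 * t} for t in (5, 3, 9, 1, 7, 2, 8, 4, 6, 0)]
train_rows, valid_rows = chronological_split(rows, "t", valid_fraction=0.2)
print([row["t"] for row in valid_rows])
```

The two parts would then be converted to the X, y and X_valid, y_valid inputs described in this article.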
Compute to run experiment
Next determine where the model will be trained. An automated machine learning training experiment can run on the following compute options:
- Your local machine, such as a local desktop or laptop: generally used when you have a small dataset and you are still in the exploration stage.
- A remote machine in the cloud: Azure Machine Learning Managed Compute is a managed service that lets you train machine learning models on clusters of Azure virtual machines.
See the GitHub site for example notebooks with local and remote compute targets.
Configure your experiment settings
There are several options that you can use to configure your automated machine learning experiment. These parameters are set by instantiating an AutoMLConfig object.
Some examples include:
Classification experiment using AUC weighted as the primary metric with a max time of 12,000 seconds per iteration, with the experiment to end after 50 iterations and 2 cross validation folds.
automl_classifier = AutoMLConfig(
    task='classification',
    primary_metric='AUC_weighted',
    max_time_sec=12000,
    iterations=50,
    X=X,
    y=y,
    n_cross_validations=2)
Below is an example of a regression experiment set to end after 100 iterations, with each iteration lasting up to 600 seconds and with 5 cross validation folds.
automl_regressor = AutoMLConfig(
    task='regression',
    max_time_sec=600,
    iterations=100,
    primary_metric='r2_score',
    X=X,
    y=y,
    n_cross_validations=5)
This table lists parameter settings available for your experiment and their default values.
Property | Description | Default Value |
---|---|---|
task | Specify the type of machine learning problem. Allowed values are classification, regression, and forecasting. | None |
primary_metric | Metric that you want to optimize in building your model. For example, if you specify accuracy as the primary_metric, automated machine learning looks to find a model with maximum accuracy. You can only specify one primary_metric per experiment. Allowed values depend on whether the task is classification or regression. | Classification: accuracy; Regression: spearman_correlation |
experiment_exit_score | You can set a target value for your primary_metric. Once a model is found that meets the primary_metric target, automated machine learning stops iterating and the experiment terminates. If this value is not set (default), or if the target is never reached, the experiment continues until it reaches the number of iterations specified in iterations. Takes a double value. | None |
iterations | Maximum number of iterations. Each iteration is a training job that results in a pipeline. A pipeline is data preprocessing plus a model. To get a high-quality model, use 250 or more. | 100 |
max_concurrent_iterations | Maximum number of iterations to run in parallel. This setting works only for remote compute. | 1 |
max_cores_per_iteration | Indicates how many cores on the compute target are used to train a single pipeline. If the algorithm can leverage multiple cores, this increases the performance on a multi-core machine. You can set it to -1 to use all the cores available on the machine. | 1 |
iteration_timeout_minutes | Limits the amount of time (in minutes) a particular iteration takes. If an iteration exceeds the specified amount, that iteration gets canceled. If not set, the iteration continues to run until it is finished. | None |
n_cross_validations | Number of cross validation splits. | None |
validation_size | Size of the validation set as a percentage of all training samples. | None |
preprocess | True/False. True enables the experiment to perform preprocessing on the input. The preprocessing steps are described under Data pre-processing and featurization. Note: if data is sparse, you cannot use preprocess=True. | False |
blacklist_models | An automated machine learning experiment tries many different algorithms. Configure this setting to exclude certain algorithms from the experiment. Useful if you are aware that certain algorithms do not work well for your dataset. Excluding algorithms can save you compute resources and training time. Allowed values differ for classification, regression, and forecasting. | None |
whitelist_models | An automated machine learning experiment tries many different algorithms. Configure this setting to include only certain algorithms in the experiment. Useful if you are aware that certain algorithms work well for your dataset. Allowed values differ for classification, regression, and forecasting. | None |
verbosity | Controls the level of logging, with INFO being the most verbose and CRITICAL being the least. The verbosity level takes the same values as defined in the Python logging package. | logging.INFO |
X | All features to train with. | None |
y | Label data to train with. For classification, should be an array of integers. | None |
X_valid | Optional. All features to validate with. If not specified, X is split between train and validate. | None |
y_valid | Optional. The label data to validate with. If not specified, y is split between train and validate. | None |
sample_weight | Optional. A weight value for each sample. Use when you would like to assign different weights for your data points. | None |
sample_weight_valid | Optional. A weight value for each validation sample. If not specified, sample_weight is split between train and validate. | None |
run_configuration | RunConfiguration object. Used for remote runs. | None |
data_script | Path to a file containing the get_data method. Required for remote runs. | None |
model_explainability | Optional. True/False. True enables the experiment to compute feature importance for every iteration. You can also use the explain_model() method on a specific iteration to compute feature importance on demand for that iteration after the experiment is complete. | False |
enable_ensembling | Flag to enable an ensembling iteration after all the other iterations complete. | True |
ensemble_iterations | Number of iterations during which a fitted pipeline is chosen to be part of the final ensemble. | 15 |
experiment_timeout_minutes | Limits the amount of time (in minutes) that the whole experiment run can take. | None |
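The interaction between iterations and experiment_exit_score can be sketched as a simple loop. This is an illustration of the stopping rule described in the table, not the service's code: `run_iterations` is a hypothetical name, the `scores` list stands in for the primary metric produced by each training job, and the sketch assumes a higher metric is better.

```python
def run_iterations(scores, iterations=100, experiment_exit_score=None):
    """Return (best_score, iterations_run) under the stopping rule.

    `scores` stands in for the primary metric of each trained pipeline.
    Assumes a higher primary metric is better.
    """
    best = None
    ran = 0
    for score in scores[:iterations]:
        ran += 1
        if best is None or score > best:
            best = score
        # Stop early once the target for the primary metric is met.
        if experiment_exit_score is not None and best >= experiment_exit_score:
            break
    return best, ran


simulated = [0.70, 0.80, 0.92, 0.95]
print(run_iterations(simulated, iterations=4, experiment_exit_score=0.9))
print(run_iterations(simulated, iterations=4))
```

With a target of 0.9 the loop stops at the third simulated iteration; without a target it runs all four.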
Data pre-processing and featurization
If you use preprocess=True, the following data preprocessing steps are performed automatically for you:
- Drop high cardinality or no variance features
  - Drop features with no useful information from training and validation sets. These include features with all values missing, the same value across all rows, or extremely high cardinality (e.g., hashes, IDs, or GUIDs).
- Missing value imputation
  - For numerical features, impute missing values with the average of the values in the column.
  - For categorical features, impute missing values with the most frequent value.
- Generate additional features
  - For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.
  - For Text features: Term frequency based on word unigrams, bigrams, and trigrams; count vectorizer.
- Transformations and encodings
  - Numeric features with very few unique values are transformed into categorical features.
  - Depending on the cardinality of categorical features, perform label encoding or (hashing) one-hot encoding.
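Two of the steps above, dropping uninformative features and imputing missing values, can be sketched in plain Python. This is a simplified illustration, not the service's preprocessing code; `preprocess_columns` is a hypothetical name and missing values are represented as None.

```python
from collections import Counter


def preprocess_columns(columns):
    """columns maps a feature name to a list of values; None marks a missing value."""
    result = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        # Drop features that are entirely missing or have a single distinct value.
        if not present or len(set(present)) == 1:
            continue
        if all(isinstance(v, (int, float)) for v in present):
            fill = sum(present) / len(present)            # numerical: column average
        else:
            fill = Counter(present).most_common(1)[0][0]  # categorical: most frequent value
        result[name] = [fill if v is None else v for v in values]
    return result


raw = {
    "age": [20, None, 40, 30],
    "city": ["a", "b", None, "a"],
    "constant": [1, 1, 1, 1],
    "empty": [None, None, None, None],
}
print(preprocess_columns(raw))
```

High-cardinality detection, feature generation, and encoding are left out of the sketch to keep it short.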
Run experiment
Submit the experiment to run and generate a model. Pass the AutoMLConfig to the submit method to generate the model.
run = experiment.submit(automl_config, show_output=True)
Note
Dependencies are first installed on a new machine. It may take up to 10 minutes before output is shown.
Setting show_output to True results in output being shown on the console.
Explore model metrics
You can view your results in a widget or inline if you are in a notebook. See Track and evaluate models for more details.
Classification metrics
The following metrics are saved in each iteration for a classification task.
Primary Metric | Description | Calculation | Extra Parameters |
---|---|---|---|
AUC_Macro | AUC is the Area under the Receiver Operating Characteristic Curve. Macro is the arithmetic mean of the AUC for each class. | Calculation | average="macro" |
AUC_Micro | AUC is the Area under the Receiver Operating Characteristic Curve. Micro is computed globally by combining the true positives and false positives from each class | Calculation | average="micro" |
AUC_Weighted | AUC is the Area under the Receiver Operating Characteristic Curve. Weighted is the arithmetic mean of the score for each class, weighted by the number of true instances in each class | Calculation | average="weighted" |
accuracy | Accuracy is the percent of predicted labels that exactly match the true labels. | Calculation | None |
average_precision_score_macro | Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Macro is the arithmetic mean of the average precision score of each class | Calculation | average="macro" |
average_precision_score_micro | Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Micro is computed globally by combining the true positives and false positives at each cutoff | Calculation | average="micro" |
average_precision_score_weighted | Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. Weighted is the arithmetic mean of the average precision score for each class, weighted by the number of true instances in each class | Calculation | average="weighted" |
balanced_accuracy | Balanced accuracy is the arithmetic mean of recall for each class. | Calculation | average="macro" |
f1_score_macro | F1 score is the harmonic mean of precision and recall. Macro is the arithmetic mean of F1 score for each class | Calculation | average="macro" |
f1_score_micro | F1 score is the harmonic mean of precision and recall. Micro is computed globally by counting the total true positives, false negatives, and false positives | Calculation | average="micro" |
f1_score_weighted | F1 score is the harmonic mean of precision and recall. Weighted is the mean of the F1 score for each class, weighted by class frequency | Calculation | average="weighted" |
log_loss | This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. For a single sample with true label yt in {0,1} and estimated probability yp that yt = 1, the log loss is -log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp)) | Calculation | None |
norm_macro_recall | Normalized Macro Recall is Macro Recall normalized so that random performance has a score of 0 and perfect performance has a score of 1. This is achieved by norm_macro_recall := (recall_score_macro - R)/(1 - R), where R is the expected value of recall_score_macro for random predictions (i.e., R=0.5 for binary classification and R=(1/C) for C-class classification problems) | Calculation | average = "macro" and then (recall_score_macro - R)/(1 - R), where R is the expected value of recall_score_macro for random predictions (i.e., R=0.5 for binary classification and R=(1/C) for C-class classification problems) |
precision_score_macro | Precision is the percent of elements labeled as a certain class that actually are in that class. Macro is the arithmetic mean of precision for each class | Calculation | average="macro" |
precision_score_micro | Precision is the percent of elements labeled as a certain class that actually are in that class. Micro is computed globally by counting the total true positives and false positives | Calculation | average="micro" |
precision_score_weighted | Precision is the percent of elements labeled as a certain class that actually are in that class. Weighted is the arithmetic mean of precision for each class, weighted by number of true instances in each class | Calculation | average="weighted" |
recall_score_macro | Recall is the percent of elements actually in a certain class that are correctly labeled. Macro is the arithmetic mean of recall for each class | Calculation | average="macro" |
recall_score_micro | Recall is the percent of elements actually in a certain class that are correctly labeled. Micro is computed globally by counting the total true positives and false negatives | Calculation | average="micro" |
recall_score_weighted | Recall is the percent of elements actually in a certain class that are correctly labeled. Weighted is the arithmetic mean of recall for each class, weighted by number of true instances in each class | Calculation | average="weighted" |
weighted_accuracy | Weighted accuracy is accuracy where the weight given to each example is equal to the proportion of true instances in that example's true class | Calculation | sample_weight is a vector equal to the proportion of that class for each element in the target |
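The difference between the macro, micro, and weighted variants in the table can be seen by computing precision by hand on a small example, together with the norm_macro_recall normalization. This is a plain-Python sketch with hypothetical function names, not the code the service uses.

```python
from collections import Counter


def precision_scores(y_true, y_pred):
    """Return macro, micro, and weighted precision for a multi-class problem."""
    classes = sorted(set(y_true) | set(y_pred))
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)  # true positives + false positives per class
    support = Counter(y_true)    # number of true instances per class
    per_class = {c: (true_pos[c] / predicted[c] if predicted[c] else 0.0)
                 for c in classes}
    return {
        # Macro: unweighted mean of the per-class scores.
        "macro": sum(per_class.values()) / len(classes),
        # Micro: computed globally from the total counts.
        "micro": sum(true_pos.values()) / len(y_pred),
        # Weighted: per-class scores weighted by the number of true instances.
        "weighted": sum(per_class[c] * support[c] for c in classes) / len(y_true),
    }


def norm_macro_recall(recall_score_macro, n_classes):
    """(recall_score_macro - R) / (1 - R), with R = 1/C for C classes."""
    r = 1.0 / n_classes
    return (recall_score_macro - r) / (1.0 - r)


print(precision_scores([0, 0, 0, 0, 1, 1], [0, 0, 1, 1, 1, 1]))
print(norm_macro_recall(0.5, n_classes=2))
```

In this example class 0 has precision 1.0 and class 1 has precision 0.5, so the three averages differ because they weight the classes differently.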
Regression and forecasting metrics
The following metrics are saved in each iteration for a regression or forecasting task.
Primary Metric | Description | Calculation | Extra Parameters |
---|---|---|---|
explained_variance | Explained variance is the proportion to which a mathematical model accounts for the variation of a given data set. It is the percent decrease in variance of the original data to the variance of the errors. When the mean of the errors is 0, it is equal to the coefficient of determination (r2_score). | Calculation | None |
r2_score | R2 is the coefficient of determination or the percent reduction in squared errors compared to a baseline model that outputs the mean. When the mean of the errors is 0, it is equal to explained variance. | Calculation | None |
spearman_correlation | Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. | Calculation | None |
mean_absolute_error | Mean absolute error is the expected value of absolute value of difference between the target and the prediction | Calculation | None |
normalized_mean_absolute_error | Normalized mean absolute error is mean absolute error divided by the range of the data | Calculation | Divide by range of the data |
median_absolute_error | Median absolute error is the median of all absolute differences between the target and the prediction. This loss is robust to outliers. | Calculation | None |
normalized_median_absolute_error | Normalized median absolute error is median absolute error divided by the range of the data | Calculation | Divide by range of the data |
root_mean_squared_error | Root mean squared error is the square root of the expected squared difference between the target and the prediction | Calculation | None |
normalized_root_mean_squared_error | Normalized root mean squared error is root mean squared error divided by the range of the data | Calculation | Divide by range of the data |
root_mean_squared_log_error | Root mean squared log error is the square root of the expected squared logarithmic error | Calculation | None |
normalized_root_mean_squared_log_error | Normalized root mean squared log error is root mean squared log error divided by the range of the data | Calculation | Divide by range of the data |
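The normalized variants in the table divide an error metric by the range of the data. The sketch below shows this for mean absolute error and root mean squared error. It assumes that "range of the data" means the maximum minus the minimum of the true target values; `regression_metrics` is a hypothetical name.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and their range-normalized variants."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Assumption: the range of the data is max(y_true) - min(y_true).
    data_range = max(y_true) - min(y_true)
    return {
        "mean_absolute_error": mae,
        "root_mean_squared_error": rmse,
        "normalized_mean_absolute_error": mae / data_range,
        "normalized_root_mean_squared_error": rmse / data_range,
    }


print(regression_metrics([10, 20, 30, 40], [12, 18, 33, 37]))
```

Normalizing by the range makes the errors comparable across targets that are measured on different scales.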
Explain the model
While automated machine learning capabilities are generally available, the model explainability feature is still in public preview.
Automated machine learning allows you to understand feature importance. During the training process, you can get global feature importance for the model. For classification scenarios, you can also get class-level feature importance. You must provide a validation dataset (X_valid) to get feature importance.
There are two ways to generate feature importance.
Once an experiment is complete, you can use the explain_model method on any iteration.
from azureml.train.automl.automlexplainer import explain_model

shap_values, expected_values, overall_summary, overall_imp, per_class_summary, per_class_imp = \
    explain_model(fitted_model, X_train, X_test)

# Overall feature importance
print(overall_imp)
print(overall_summary)

# Class-level feature importance
print(per_class_imp)
print(per_class_summary)
To view feature importance for all iterations, set the model_explainability flag to True in AutoMLConfig.
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'AUC_weighted',
                             max_time_sec = 12000,
                             iterations = 10,
                             verbosity = logging.INFO,
                             X = X_train,
                             y = y_train,
                             X_valid = X_test,
                             y_valid = y_test,
                             model_explainability=True,
                             path=project_folder)
Once done, you can use the retrieve_model_explanation method to retrieve feature importance for a specific iteration.
from azureml.train.automl.automlexplainer import retrieve_model_explanation

shap_values, expected_values, overall_summary, overall_imp, per_class_summary, per_class_imp = \
    retrieve_model_explanation(best_run)

# Overall feature importance
print(overall_imp)
print(overall_summary)

# Class-level feature importance
print(per_class_imp)
print(per_class_summary)
You can visualize the feature importance chart in your workspace in the Azure portal. The chart is also shown when using the Jupyter widget in a notebook. To learn more about the charts refer to the Sample Azure ML notebooks article.
from azureml.widgets import RunDetails
RunDetails(local_run).show()
Next steps
Learn more about how and where to deploy a model.
Learn more about how to train a classification model with Automated machine learning or how to train using Automated machine learning on a remote resource.