AutoMLConfig Class

Represents configuration for submitting an automated ML experiment in Azure Machine Learning.

This configuration object contains and persists the parameters for configuring the experiment run, as well as the training data to be used at run time. For guidance on selecting your settings, see https://aka.ms/AutoMLConfig.

Inheritance
builtins.object
AutoMLConfig

Constructor

AutoMLConfig(task: str, path: typing.Union[str, NoneType] = None, iterations: typing.Union[int, NoneType] = None, primary_metric: typing.Union[str, NoneType] = None, positive_label: typing.Union[typing.Any, NoneType] = None, compute_target: typing.Union[typing.Any, NoneType] = None, spark_context: typing.Union[typing.Any, NoneType] = None, X: typing.Union[typing.Any, NoneType] = None, y: typing.Union[typing.Any, NoneType] = None, sample_weight: typing.Union[typing.Any, NoneType] = None, X_valid: typing.Union[typing.Any, NoneType] = None, y_valid: typing.Union[typing.Any, NoneType] = None, sample_weight_valid: typing.Union[typing.Any, NoneType] = None, cv_splits_indices: typing.Union[typing.List[typing.List[typing.Any]], NoneType] = None, validation_size: typing.Union[float, NoneType] = None, n_cross_validations: typing.Union[int, NoneType] = None, y_min: typing.Union[float, NoneType] = None, y_max: typing.Union[float, NoneType] = None, num_classes: typing.Union[int, NoneType] = None, featurization: typing.Union[str, azureml.automl.core.featurization.featurizationconfig.FeaturizationConfig] = 'auto', max_cores_per_iteration: int = 1, max_concurrent_iterations: int = 1, iteration_timeout_minutes: typing.Union[int, NoneType] = None, mem_in_mb: typing.Union[int, NoneType] = None, enforce_time_on_windows: bool = True, experiment_timeout_hours: typing.Union[float, NoneType] = None, experiment_exit_score: typing.Union[float, NoneType] = None, enable_early_stopping: bool = True, blocked_models: typing.Union[typing.List[str], NoneType] = None, blacklist_models: typing.Union[typing.List[str], NoneType] = None, exclude_nan_labels: bool = True, verbosity: int = 20, enable_tf: bool = False, model_explainability: bool = True, allowed_models: typing.Union[typing.List[str], NoneType] = None, whitelist_models: typing.Union[typing.List[str], NoneType] = None, enable_onnx_compatible_models: bool = False, enable_voting_ensemble: bool = True, enable_stack_ensemble: typing.Union[bool, 
NoneType] = None, debug_log: str = 'automl.log', training_data: typing.Union[typing.Any, NoneType] = None, validation_data: typing.Union[typing.Any, NoneType] = None, test_data: typing.Union[typing.Any, NoneType] = None, test_size: typing.Union[float, NoneType] = None, label_column_name: typing.Union[str, NoneType] = None, weight_column_name: typing.Union[str, NoneType] = None, cv_split_column_names: typing.Union[typing.List[str], NoneType] = None, enable_local_managed: bool = False, enable_dnn: bool = False, forecasting_parameters: typing.Union[azureml.automl.core.forecasting_parameters.ForecastingParameters, NoneType] = None, **kwargs: typing.Any) -> None

Parameters

task
str or Tasks

The type of task to run. Values can be 'classification', 'regression', or 'forecasting' depending on the type of automated ML problem to solve.

path
str

The full path to the Azure Machine Learning project folder. If not specified, the default is to use the current directory or ".".

iterations
int

The total number of different algorithm and parameter combinations to test during an automated ML experiment. If not specified, the default is 1000 iterations.

primary_metric
str or Metric

The metric that Automated Machine Learning will optimize for model selection. Automated Machine Learning collects more metrics than it can optimize. You can use get_primary_metrics to get a list of valid metrics for your given task. For more information on how metrics are calculated, see https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train#primary-metric.

If not specified, accuracy is used for classification tasks, normalized root mean squared error is used for forecasting and regression tasks, accuracy is used for image classification and image multi-label classification, and mean average precision is used for image object detection.

positive_label
<xref:Any>

The positive class label that Automated Machine Learning uses to calculate binary metrics. Binary metrics are calculated in two cases for classification tasks:

  1. The label column consists of two classes, indicating a binary classification task. AutoML uses the specified positive class when positive_label is passed in; otherwise, AutoML picks a positive class based on the label-encoded value.
  2. The task is multiclass classification and positive_label is specified.

For more information on classification, check out metrics for classification scenarios.
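The role of a positive class in binary metrics can be illustrated with a minimal sketch. It is not AutoML's implementation; the helper name binary_precision_recall and the sample labels are hypothetical. It treats positive_label as the positive class and every other label as negative, which is one way to read the multiclass case above.

```python
from typing import Hashable, Sequence, Tuple


def binary_precision_recall(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    positive_label: Hashable,
) -> Tuple[float, float]:
    """Return (precision, recall), treating positive_label as the positive class.

    Hypothetical helper for illustration; all other labels count as negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels only.
y_true = ["cat", "dog", "cat", "bird", "cat"]
y_pred = ["cat", "cat", "dog", "bird", "cat"]
precision, recall = binary_precision_recall(y_true, y_pred, positive_label="cat")
```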

compute_target
AbstractComputeTarget

The Azure Machine Learning compute target to run the Automated Machine Learning experiment on. See https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#local-remote for more information on compute targets.

spark_context
<xref:SparkContext>

The Spark context. Only applicable when used inside an Azure Databricks/Spark environment.

X
DataFrame or ndarray or Dataset or TabularDataset

The training features to use when fitting pipelines during an experiment. This setting is being deprecated. Please use training_data and label_column_name instead.

y
DataFrame or ndarray or Dataset or TabularDataset

The training labels to use when fitting pipelines during an experiment. This is the value your model will predict. This setting is being deprecated. Please use training_data and label_column_name instead.

sample_weight
DataFrame or ndarray or TabularDataset

The weight to give to each training sample when running fitting pipelines. Each row should correspond to a row in the X and y data.

Specify this parameter when specifying X. This setting is being deprecated. Please use training_data and weight_column_name instead.

X_valid
DataFrame or ndarray or Dataset or TabularDataset

Validation features to use when fitting pipelines during an experiment.

If specified, then y_valid or sample_weight_valid must also be specified. This setting is being deprecated. Please use validation_data and label_column_name instead.

y_valid
DataFrame or ndarray or Dataset or TabularDataset

Validation labels to use when fitting pipelines during an experiment.

Both X_valid and y_valid must be specified together. This setting is being deprecated. Please use validation_data and label_column_name instead.

sample_weight_valid
DataFrame or ndarray or TabularDataset

The weight to give to each validation sample when running scoring pipelines. Each row should correspond to a row in the X and y data.

Specify this parameter when specifying X_valid. This setting is being deprecated. Please use validation_data and weight_column_name instead.

cv_splits_indices
<xref:List>[<xref:List>[ndarray]]

The indices at which to split the training data for cross-validation. Each row is a separate cross fold. Within each cross fold, provide two numpy arrays: the first with the indices of the samples to use for training data and the second with the indices to use for validation data. That is, [[t1, v1], [t2, v2], ...], where t1 is the training indices for the first cross fold and v1 is the validation indices for the first cross fold.

To specify existing data as validation data, use validation_data. To let AutoML extract validation data out of training data instead, specify either n_cross_validations or validation_size. Use cv_split_column_names if you have cross validation column(s) in training_data.
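As a rough illustration of the [[t1, v1], [t2, v2], ...] layout described for cv_splits_indices, the following sketch builds fold indices with numpy. The helper name make_cv_splits_indices, the shuffling, and the seed are assumptions made for this example; any scheme that yields pairs of training and validation index arrays has the same shape.

```python
import numpy as np


def make_cv_splits_indices(n_rows: int, n_folds: int, seed: int = 0):
    """Build [[train_idx, valid_idx], ...] with one pair per cross fold.

    Hypothetical helper: rows are shuffled once, cut into n_folds chunks,
    and each chunk serves as the validation set of exactly one fold.
    """
    if n_folds < 2 or n_folds > n_rows:
        raise ValueError("n_folds must be between 2 and n_rows")
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_rows), n_folds)
    splits = []
    for k in range(n_folds):
        valid_idx = np.sort(folds[k])
        train_idx = np.sort(
            np.concatenate([folds[j] for j in range(n_folds) if j != k])
        )
        splits.append([train_idx, valid_idx])
    return splits


cv_splits_indices = make_cv_splits_indices(n_rows=10, n_folds=3)
```

The resulting list could then be passed as the cv_splits_indices argument, provided the indices refer to rows of the training data.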

validation_size
float

What fraction of the data to hold out for validation when user validation data is not specified. This should be between 0.0 and 1.0 non-inclusive.

Specify validation_data to provide validation data, otherwise set n_cross_validations or validation_size to extract validation data out of the specified training data. For custom cross validation fold, use cv_split_column_names.

For more information, see Configure data splits and cross-validation in automated machine learning.

n_cross_validations
int

How many cross validations to perform when user validation data is not specified.

Specify validation_data to provide validation data, otherwise set n_cross_validations or validation_size to extract validation data out of the specified training data. For custom cross validation fold, use cv_split_column_names.

For more information, see Configure data splits and cross-validation in automated machine learning.

y_min
float

Minimum value of y for a regression experiment. The combination of y_min and y_max are used to normalize test set metrics based on the input data range. This setting is being deprecated. Instead, this value will be computed from the data.

y_max
float

Maximum value of y for a regression experiment. The combination of y_min and y_max are used to normalize test set metrics based on the input data range. This setting is being deprecated. Instead, this value will be computed from the data.

num_classes
int

The number of classes in the label data for a classification experiment. This setting is being deprecated. Instead, this value will be computed from the data.

featurization
str or FeaturizationConfig

Indicates whether the featurization step should be done automatically ('auto'), turned off ('off'), or customized (a FeaturizationConfig object). Note: If the input data is sparse, featurization cannot be turned on.

Column type is automatically detected. Based on the detected column type, preprocessing/featurization is done as follows:

  • Categorical: Target encoding, one hot encoding, drop high cardinality categories, impute missing values.

  • Numeric: Impute missing values, cluster distance, weight of evidence.

  • DateTime: Several features such as day, seconds, minutes, and hours.

  • Text: Bag of words, pre-trained Word embedding, text target encoding.

More details can be found in the article Configure automated ML experiments in Python.

To customize featurization step, provide a FeaturizationConfig object. Customized featurization currently supports blocking a set of transformers, updating column purpose, editing transformer parameters, and dropping columns. For more information, see Customize feature engineering.

Note: Timeseries features are handled separately when the task type is set to forecasting independent of this parameter.

max_cores_per_iteration
int

The maximum number of threads to use for a given training iteration. Acceptable values:

  • Greater than 1 and less than or equal to the maximum number of cores on the compute target.

  • Equal to -1, which means to use all the possible cores per iteration per child-run.

  • Equal to 1, the default.

max_concurrent_iterations
int

Represents the maximum number of iterations that would be executed in parallel. The default value is 1.

  • AmlCompute clusters support one iteration running per node. For multiple AutoML experiment parent runs executed in parallel on a single AmlCompute cluster, the sum of the max_concurrent_iterations values for all experiments should be less than or equal to the maximum number of nodes. Otherwise, runs will be queued until nodes are available.

  • DSVM supports multiple iterations per node. max_concurrent_iterations should be less than or equal to the number of cores on the DSVM. For multiple experiments run in parallel on a single DSVM, the sum of the max_concurrent_iterations values for all experiments should be less than or equal to the maximum number of nodes.

  • Databricks - max_concurrent_iterations should be less than or equal to the number of worker nodes on Databricks.

max_concurrent_iterations does not apply to local runs. Formerly, this parameter was named concurrent_iterations.
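The AmlCompute rule above is a simple sum check. The sketch below expresses it in plain Python; the function name runs_will_queue and the example values are hypothetical and are not part of the SDK.

```python
from typing import Sequence


def runs_will_queue(max_concurrent_per_experiment: Sequence[int], max_nodes: int) -> bool:
    """Return True if the combined max_concurrent_iterations exceeds the node count.

    Hypothetical helper: on an AmlCompute cluster, one iteration runs per node,
    so any excess iterations wait until nodes become available.
    """
    return sum(max_concurrent_per_experiment) > max_nodes


# Two experiments with max_concurrent_iterations=4 each fit on an 8-node cluster.
fits = not runs_will_queue([4, 4], max_nodes=8)
# Adding a third experiment with 2 concurrent iterations would cause queuing.
queues = runs_will_queue([4, 4, 2], max_nodes=8)
```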

iteration_timeout_minutes
int

Maximum time in minutes that each iteration can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.

mem_in_mb
int

Maximum memory usage allowed for each iteration before it terminates. If not specified, a value of 1 PB or 1073741824 MB is used.

enforce_time_on_windows
bool

Whether to enforce a time limit on model training at each iteration on Windows. The default is True. If running from a Python script file (.py), see the documentation for allowing resource limits on Windows.

experiment_timeout_hours
float

Maximum amount of time in hours that all iterations combined can take before the experiment terminates. Can be a decimal value like 0.25 representing 15 minutes. If not specified, the default experiment timeout is 6 days. To specify a timeout less than or equal to 1 hour, make sure your dataset's size is not greater than 10,000,000 (rows times columns); otherwise, an error results.
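The two numeric rules for experiment_timeout_hours can be checked with a few lines of arithmetic. This is a sketch only; the helper names and the constant name are hypothetical, and the size limit is taken directly from the description above.

```python
# Size limit (rows times columns) stated above for timeouts of 1 hour or less.
SHORT_TIMEOUT_MAX_CELLS = 10_000_000


def timeout_minutes(experiment_timeout_hours: float) -> float:
    """Convert a fractional hour value to minutes."""
    return experiment_timeout_hours * 60


def short_timeout_allowed(n_rows: int, n_columns: int) -> bool:
    """Return True if the dataset is small enough for a timeout of 1 hour or less."""
    return n_rows * n_columns <= SHORT_TIMEOUT_MAX_CELLS


quarter_hour = timeout_minutes(0.25)          # 15.0 minutes
ok = short_timeout_allowed(100_000, 100)      # exactly 10,000,000 cells
too_big = short_timeout_allowed(100_001, 100)
```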

experiment_exit_score
float

Target score for experiment. The experiment terminates after this score is reached. If not specified (no criteria), the experiment runs until no further progress is made on the primary metric. For more information on exit criteria, see this article.

enable_early_stopping
bool

Whether to enable early termination if the score is not improving in the short term. The default is True.

Early stopping logic:

  • No early stopping for first 20 iterations (landmarks).

  • Early stopping window starts on the 21st iteration and looks for early_stopping_n_iters iterations (currently set to 10). This means that the first iteration where stopping can occur is the 31st.

  • AutoML still schedules 2 ensemble iterations AFTER early stopping, which might result in higher scores.

  • Early stopping is triggered if the absolute value of best score calculated is the same for past early_stopping_n_iters iterations, that is, if there is no improvement in score for early_stopping_n_iters iterations.
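The early stopping rules above can be approximated with a short simulation. This is a sketch under stated assumptions, not AutoML's implementation: the function name first_stopping_iteration is hypothetical, higher scores are assumed to be better, and "no improvement" is read as the best score so far being unchanged across the last early_stopping_n_iters iterations. The two ensemble iterations scheduled after stopping are not modeled.

```python
from typing import Optional, Sequence


def first_stopping_iteration(
    scores: Sequence[float],
    landmarks: int = 20,
    early_stopping_n_iters: int = 10,
) -> Optional[int]:
    """Return the 1-based iteration at which early stopping would trigger, or None.

    Hypothetical helper: no stopping during the first `landmarks` iterations,
    then stop once the best score so far has not changed for
    `early_stopping_n_iters` consecutive iterations.
    """
    best_so_far = []
    best = float("-inf")
    for score in scores:
        best = max(best, score)
        best_so_far.append(best)

    first_candidate = landmarks + early_stopping_n_iters + 1  # 31 with the defaults
    for i in range(first_candidate, len(scores) + 1):
        if best_so_far[i - 1] == best_so_far[i - 1 - early_stopping_n_iters]:
            return i
    return None


flat = first_stopping_iteration([0.5] * 40)
improving_then_flat = first_stopping_iteration(
    [i / 100 for i in range(1, 26)] + [0.25] * 20
)
```

With a constant score, stopping occurs at the 31st iteration, matching the description. When the score improves through iteration 25 and then stays flat, this sketch stops at iteration 35, ten iterations after the last improvement.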

blocked_models
list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task

A list of algorithms to ignore for an experiment. If enable_tf is False, TensorFlow models are included in blocked_models.

blacklist_models
list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task

Deprecated parameter, use blocked_models instead.

exclude_nan_labels
bool

Whether to exclude rows with NaN values in the label. The default is True.

verbosity
int

The verbosity level for writing to the log file. The default is INFO or 20. Acceptable values are defined in the Python logging library.

enable_tf
bool

Deprecated parameter to enable/disable TensorFlow algorithms. The default is False.

model_explainability
bool

Whether to enable explaining the best AutoML model at the end of all AutoML training iterations. The default is True. For more information, see Interpretability: model explanations in automated machine learning.

allowed_models
list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task

A list of model names to search for an experiment. If not specified, then all models supported for the task are used minus any specified in blocked_models or deprecated TensorFlow models. The supported models for each task type are described in the SupportedModels class.
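The interaction between allowed_models and blocked_models can be pictured as list filtering. The sketch below is illustrative only: the helper models_to_search is hypothetical, the model names are a small sample rather than the full SupportedModels list, and applying blocked_models even when allowed_models is given is an assumption of this example.

```python
from typing import Iterable, List, Optional


def models_to_search(
    supported: Iterable[str],
    allowed_models: Optional[Iterable[str]] = None,
    blocked_models: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the candidate model names, preserving the order of `supported`.

    Hypothetical helper: if allowed_models is None, start from every supported
    model; in both cases remove anything listed in blocked_models (assumption).
    """
    allowed = None if allowed_models is None else set(allowed_models)
    blocked = set(blocked_models or [])
    return [
        name
        for name in supported
        if (allowed is None or name in allowed) and name not in blocked
    ]


# Illustrative subset of classification model names.
supported = ["LightGBM", "LogisticRegression", "RandomForest", "XGBoostClassifier"]
default_search = models_to_search(supported, blocked_models=["RandomForest"])
restricted_search = models_to_search(supported, allowed_models=["LightGBM", "RandomForest"])
```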

whitelist_models
list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task

Deprecated parameter, use allowed_models instead.

enable_onnx_compatible_models
bool

Whether to enable or disable enforcing the ONNX-compatible models. The default is False. For more information about Open Neural Network Exchange (ONNX) and Azure Machine Learning, see this article.

forecasting_parameters
ForecastingParameters

A ForecastingParameters object to hold all the forecasting specific parameters.

time_column_name
str

The name of the time column. This parameter is required when forecasting to specify the datetime column in the input data used for building the time series and inferring its frequency. This setting is being deprecated. Please use forecasting_parameters instead.

max_horizon
int

The desired maximum forecast horizon in units of time-series frequency. The default value is 1.

Units are based on the time interval of your training data (for example, monthly or weekly) that the forecaster should predict out. When task type is forecasting, this parameter is required. For more information on setting forecasting parameters, see Auto-train a time-series forecast model. This setting is being deprecated. Please use forecasting_parameters instead.

grain_column_names
str or list(str)

The names of columns used to group a timeseries. It can be used to create multiple series. If grain is not defined, the data set is assumed to be one time-series. This parameter is used with task type forecasting. This setting is being deprecated. Please use forecasting_parameters instead.

target_lags
int or list(int)

The number of past periods to lag from the target column. The default is 1. This setting is being deprecated. Please use forecasting_parameters instead.

When forecasting, this parameter represents the number of rows to lag the target values based on the frequency of the data. This is represented as a list or single integer. Lag should be used when the relationship between the independent variables and the dependent variable does not match up or correlate by default. For example, when trying to forecast demand for a product, the demand in any month may depend on the price of specific commodities 3 months prior. In this example, you may want to lag the target (demand) negatively by 3 months so that the model is training on the correct relationship. For more information, see Auto-train a time-series forecast model.
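To make the idea of lagging concrete, the sketch below shifts a target series by a number of rows. It is a plain-Python illustration, not the featurizer AutoML uses; the function name lag_target and the generated column names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Union


def lag_target(
    target: Sequence[float],
    target_lags: Union[int, Sequence[int]],
) -> Dict[str, List[Optional[float]]]:
    """Return one lagged copy of `target` per requested lag.

    Hypothetical helper: row t of the column for lag k holds target[t - k],
    or None when there is not enough history.
    """
    lags = [target_lags] if isinstance(target_lags, int) else list(target_lags)
    columns: Dict[str, List[Optional[float]]] = {}
    for lag in lags:
        if lag < 1:
            raise ValueError("lags must be positive integers")
        columns[f"target_lag{lag}"] = [
            target[t - lag] if t >= lag else None for t in range(len(target))
        ]
    return columns


# Monthly demand lagged by 3 rows: each row sees the value from 3 months earlier.
lagged = lag_target([10, 20, 30, 40, 50], target_lags=3)
```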

feature_lags
str

Flag for generating lags for the numeric features. This setting is being deprecated. Please use forecasting_parameters instead.

target_rolling_window_size
int

The number of past periods used to create a rolling window average of the target column. This setting is being deprecated. Please use forecasting_parameters instead.

When forecasting, this parameter represents n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model.
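A rolling window average over past periods can be sketched as follows. This is an illustration under assumptions: the helper name rolling_mean_of_past is hypothetical, and the window is taken to cover only the rows before the current one.

```python
from typing import List, Optional, Sequence


def rolling_mean_of_past(target: Sequence[float], window: int) -> List[Optional[float]]:
    """Return, for each row, the mean of the previous `window` target values.

    Hypothetical helper: rows with fewer than `window` earlier values get None.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    result: List[Optional[float]] = []
    for t in range(len(target)):
        if t < window:
            result.append(None)
        else:
            result.append(sum(target[t - window:t]) / window)
    return result


rolling = rolling_mean_of_past([1, 2, 3, 4, 5], window=2)
```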

country_or_region
str

The country/region used to generate holiday features. This should be an ISO 3166 two-letter country/region code, for example 'US' or 'GB'. This setting is being deprecated. Please use forecasting_parameters instead.

use_stl
str

Configure STL decomposition of the time-series target column. use_stl can take three values: None (default), for no STL decomposition; 'season', to generate only the season component; and 'season_trend', to generate both season and trend components. This setting is being deprecated. Please use forecasting_parameters instead.

seasonality
int or str

Set time series seasonality. If seasonality is set to 'auto', it will be inferred. This setting is being deprecated. Please use forecasting_parameters instead.

short_series_handling_configuration
str

The parameter defining how AutoML should handle short time series.

Possible values: 'auto' (default), 'pad', 'drop' and None.

  • auto: short series will be padded if there are no long series, otherwise short series will be dropped.
  • pad: all the short series will be padded.
  • drop: all the short series will be dropped.
  • None: the short series will not be modified.

If set to 'pad', the table will be padded with zeroes and empty values for the regressors and random values for the target, with the mean equal to the target value median for the given time series ID. If the median is greater than or equal to zero, the minimal padded value will be clipped at zero. Input:

Output assuming minimal number of values is four:

Note: There are two parameters, short_series_handling_configuration and the legacy short_series_handling. When both parameters are set, they are synchronized as shown in the table below (short_series_handling_configuration and short_series_handling are marked as handling_configuration and handling respectively, for brevity).

freq
str or None

Forecast frequency.

When forecasting, this parameter represents the period with which the forecast is desired, for example daily, weekly, yearly, etc. The forecast frequency is dataset frequency by default. You can optionally set it to greater (but not lesser) than dataset frequency. We'll aggregate the data and generate the results at forecast frequency. For example, for daily data, you can set the frequency to be daily, weekly or monthly, but not hourly. The frequency needs to be a pandas offset alias. Please refer to pandas documentation for more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects

target_aggregation_function
str or None

The function to be used to aggregate the time series target column to conform to a user-specified frequency. If target_aggregation_function is set but the freq parameter is not set, an error is raised. The possible target aggregation functions are: "sum", "max", "min" and "mean".
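The combination of freq and target_aggregation_function amounts to grouping the target by a coarser period and reducing each group. The sketch below aggregates daily values to weekly ones using only the standard library. It is not how AutoML performs the aggregation; the helper name aggregate_weekly and the Monday-based week alignment are assumptions of this example.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean
from typing import Dict, Iterable, Tuple

# The four aggregation functions named above.
AGGREGATORS = {"sum": sum, "max": max, "min": min, "mean": mean}


def aggregate_weekly(
    rows: Iterable[Tuple[date, float]],
    target_aggregation_function: str,
) -> Dict[date, float]:
    """Aggregate (day, value) pairs into one value per week.

    Hypothetical helper: weeks are keyed by their Monday (an assumption).
    """
    if target_aggregation_function not in AGGREGATORS:
        raise ValueError("expected one of: " + ", ".join(AGGREGATORS))
    buckets = defaultdict(list)
    for day, value in rows:
        week_start = day - timedelta(days=day.weekday())
        buckets[week_start].append(value)
    reducer = AGGREGATORS[target_aggregation_function]
    return {week: reducer(values) for week, values in sorted(buckets.items())}


# Two full weeks of daily data starting on Monday, 2024-01-01, with values 1..14.
daily = [(date(2024, 1, 1) + timedelta(days=i), i + 1) for i in range(14)]
weekly_sum = aggregate_weekly(daily, "sum")
weekly_max = aggregate_weekly(daily, "max")
```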

enable_voting_ensemble
bool

Whether to enable/disable VotingEnsemble iteration. The default is True. For more information about ensembles, see Ensemble configuration.

enable_stack_ensemble
bool

Whether to enable/disable StackEnsemble iteration. The default is None. If the enable_onnx_compatible_models flag is set, then StackEnsemble iteration will be disabled. Similarly, for Timeseries tasks, StackEnsemble iteration will be disabled by default, to avoid risks of overfitting due to the small training set used in fitting the meta learner. For more information about ensembles, see Ensemble configuration.

debug_log
str

The log file to write debug information to. If not specified, 'automl.log' is used.

training_data
DataFrame or Dataset or DatasetDefinition or TabularDataset

The training data to be used within the experiment. It should contain both training features and a label column (optionally a sample weights column). If training_data is specified, then the label_column_name parameter must also be specified.

training_data was introduced in version 1.0.81.

validation_data
DataFrame or Dataset or DatasetDefinition or TabularDataset

The validation data to be used within the experiment. It should contain both training features and label column (optionally a sample weights column). If validation_data is specified, then training_data and label_column_name parameters must be specified.

validation_data was introduced in version 1.0.81. For more information, see Configure data splits and cross-validation in automated machine learning.

test_data
Dataset or TabularDataset

The test data to be used for a test run that will automatically be started after model training is complete. The test run will get predictions using the best model and will compute metrics given these predictions.

If neither this parameter nor the test_size parameter is specified, no test run is executed automatically after model training is completed. Test data should contain both features and label column. If test_data is specified, then the label_column_name parameter must be specified.

test_size
float

What fraction of the training data to hold out for test data for a test run that will automatically be started after model training is complete. The test run will get predictions using the best model and will compute metrics given these predictions.

This should be between 0.0 and 1.0 non-inclusive. If test_size is specified at the same time as validation_size, then the test data is split from training_data before the validation data is split. For example, if validation_size=0.1, test_size=0.1 and the original training data has 1000 rows, then the test data will have 100 rows, the validation data will contain 90 rows and the training data will have 810 rows.

For regression based tasks, random sampling is used. For classification tasks, stratified sampling is used. Forecasting does not currently support specifying a test dataset using a train/test split.

If neither this parameter nor the test_data parameter is specified, no test run is executed automatically after model training is completed.
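The row counts in the test_size example above follow from splitting off the test data first and then splitting validation data from what remains. The sketch below reproduces that arithmetic; the helper name split_row_counts is hypothetical, and rounding to the nearest whole row is an assumption of this example.

```python
from typing import Tuple


def split_row_counts(
    n_rows: int, test_size: float, validation_size: float
) -> Tuple[int, int, int]:
    """Return (training_rows, validation_rows, test_rows).

    Hypothetical helper: the test split is taken from the full data first,
    then the validation split is taken from the remaining rows.
    """
    for name, fraction in (("test_size", test_size), ("validation_size", validation_size)):
        if not 0.0 < fraction < 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, non-inclusive")
    test_rows = round(n_rows * test_size)
    remaining = n_rows - test_rows
    validation_rows = round(remaining * validation_size)
    training_rows = remaining - validation_rows
    return training_rows, validation_rows, test_rows


counts = split_row_counts(1000, test_size=0.1, validation_size=0.1)  # (810, 90, 100)
```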

label_column_name
Union[str, int]

The name of the label column. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers.

This parameter is applicable to training_data, validation_data and test_data parameters. label_column_name was introduced in version 1.0.81.

weight_column_name
Union[str, int]

The name of the sample weight column. Automated ML supports a weighted column as an input, causing rows in the data to be weighted up or down. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers.

This parameter is applicable to training_data and validation_data parameters. weight_column_name was introduced in version 1.0.81.

cv_split_column_names
list(str)

List of names of the columns that contain custom cross validation split. Each of the CV split columns represents one CV split where each row is marked either 1 for training or 0 for validation.

This parameter is applicable to training_data parameter for custom cross validation purposes. cv_split_column_names was introduced in version 1.6.0.

Use either cv_split_column_names or cv_splits_indices.

For more information, see Configure data splits and cross-validation in automated machine learning.
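The 1/0 marking used by a CV split column can be read as a pair of index lists, one for training rows and one for validation rows. The sketch below shows that reading for a single column; the helper name split_column_to_indices and the sample column are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_column_to_indices(split_column: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (training_rows, validation_rows) for one CV split column.

    Hypothetical helper: 1 marks a training row and 0 marks a validation row.
    """
    training_rows: List[int] = []
    validation_rows: List[int] = []
    for row, flag in enumerate(split_column):
        if flag == 1:
            training_rows.append(row)
        elif flag == 0:
            validation_rows.append(row)
        else:
            raise ValueError(f"row {row}: expected 1 or 0, got {flag!r}")
    return training_rows, validation_rows


# One CV split column for a five-row dataset.
train_rows, valid_rows = split_column_to_indices([1, 1, 0, 1, 0])
```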

enable_local_managed
bool

Disabled parameter. Local managed runs cannot be enabled at this time.

enable_dnn
bool

Whether to include DNN based models during model selection. The default is False.

Remarks

The following code shows a basic example of creating an AutoMLConfig object and submitting an experiment for regression:


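   import logging

   from azureml.core import Experiment, Workspace
   from azureml.train.automl import AutoMLConfig

   # compute_target, train_data, and label must be defined before this point:
   # the compute target to run on, the training data, and the label column name.
   automl_settings = {
       "n_cross_validations": 3,
       "primary_metric": 'r2_score',
       "enable_early_stopping": True,
       "experiment_timeout_hours": 1.0,
       "max_concurrent_iterations": 4,
       "max_cores_per_iteration": -1,
       "verbosity": logging.INFO,
   }

   automl_config = AutoMLConfig(task='regression',
                                compute_target=compute_target,
                                training_data=train_data,
                                label_column_name=label,
                                **automl_settings
                                )

   # Load the workspace from a local config file and submit the experiment.
   ws = Workspace.from_config()
   experiment = Experiment(ws, "your-experiment-name")
   run = experiment.submit(automl_config, show_output=True)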

A full sample is available at Regression

Examples of using AutoMLConfig for forecasting are in these notebooks:

Examples of using AutoMLConfig for all task types can be found in these automated ML notebooks.

For background on automated ML, see the articles:

For more information about the different options for configuring training/validation data splits and cross-validation for your automated machine learning (AutoML) experiments, see Configure data splits and cross-validation in automated machine learning.

Methods

get_supported_dataset_languages

Get supported languages and their corresponding language codes in ISO 639-3.

get_supported_dataset_languages(use_gpu: bool) -> typing.Dict[typing.Any, typing.Any]

Parameters

cls

Class object of AutoMLConfig.

use_gpu

Boolean indicating whether GPU compute is being used.

Returns

Dictionary of format {<language code>: <language name>}. Language codes adhere to the ISO 639-3 standard. For details, see https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes