AutoMLConfig class
Definition
Represents configuration for submitting an automated ML experiment in Azure Machine Learning.
This configuration object contains and persists the parameters for configuring the experiment run, as well as the training data to be used at run time. For guidance on selecting your settings, see https://aka.ms/AutoMLConfig.
AutoMLConfig(task: str, path: typing.Union[str, NoneType] = None, iterations: typing.Union[int, NoneType] = None, primary_metric: typing.Union[str, NoneType] = None, compute_target: typing.Union[typing.Any, NoneType] = None, spark_context: typing.Union[typing.Any, NoneType] = None, X: typing.Union[typing.Any, NoneType] = None, y: typing.Union[typing.Any, NoneType] = None, sample_weight: typing.Union[typing.Any, NoneType] = None, X_valid: typing.Union[typing.Any, NoneType] = None, y_valid: typing.Union[typing.Any, NoneType] = None, sample_weight_valid: typing.Union[typing.Any, NoneType] = None, cv_splits_indices: typing.Union[typing.List[typing.List[typing.Any]], NoneType] = None, validation_size: typing.Union[float, NoneType] = None, n_cross_validations: typing.Union[int, NoneType] = None, y_min: typing.Union[float, NoneType] = None, y_max: typing.Union[float, NoneType] = None, num_classes: typing.Union[int, NoneType] = None, featurization: typing.Union[str, azureml.automl.core.featurization.featurizationconfig.FeaturizationConfig] = 'auto', max_cores_per_iteration: int = 1, max_concurrent_iterations: int = 1, iteration_timeout_minutes: typing.Union[int, NoneType] = None, mem_in_mb: typing.Union[int, NoneType] = None, enforce_time_on_windows: bool = True, experiment_timeout_hours: typing.Union[float, NoneType] = None, experiment_exit_score: typing.Union[float, NoneType] = None, enable_early_stopping: bool = False, blocked_models: typing.Union[typing.List[str], NoneType] = None, blacklist_models: typing.Union[typing.List[str], NoneType] = None, exclude_nan_labels: bool = True, verbosity: int = 20, enable_tf: bool = False, model_explainability: bool = True, allowed_models: typing.Union[typing.List[str], NoneType] = None, whitelist_models: typing.Union[typing.List[str], NoneType] = None, enable_onnx_compatible_models: bool = False, enable_voting_ensemble: bool = True, enable_stack_ensemble: typing.Union[bool, NoneType] = None, debug_log: str = 'automl.log', 
training_data: typing.Union[typing.Any, NoneType] = None, validation_data: typing.Union[typing.Any, NoneType] = None, label_column_name: typing.Union[str, NoneType] = None, weight_column_name: typing.Union[str, NoneType] = None, cv_split_column_names: typing.Union[typing.List[str], NoneType] = None, enable_local_managed: bool = False, enable_dnn: bool = False, **kwargs: typing.Any) -> None
- Inheritance
- builtins.object → AutoMLConfig
Parameters
- task
- str
The type of task to run. Values can be 'classification', 'regression', or 'forecasting' depending on the type of automated ML problem to solve.
- path
- str
The full path to the Azure Machine Learning project folder. If not specified, the default is to use the current directory or ".".
- iterations
- int
The total number of different algorithm and parameter combinations to test during an automated ML experiment. If not specified, the default is 1000 iterations.
- primary_metric
- str or azureml.train.automl.constants.Metric
The metric that Automated Machine Learning will optimize for model selection. Automated Machine Learning collects more metrics than it can optimize. You can use get_primary_metrics(task) to get a list of valid metrics for your given task. For more information on how metrics are calculated, see https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train#primary-metric.
If not specified, accuracy is used for classification tasks, normalized root mean squared error is used for forecasting and regression tasks, accuracy is used for image classification and image multi-label classification, and mean average precision is used for image object detection.
- compute_target
- AbstractComputeTarget
The Azure Machine Learning compute target to run the Automated Machine Learning experiment on. See https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-remote for more information on compute targets.
- spark_context
- SparkContext
The Spark context. Only applicable when used inside Azure Databricks/Spark environment.
- X
- DataFrame or ndarray or Dataset or DatasetDefinition or TabularDataset
The training features to use when fitting pipelines during an experiment. This setting is being deprecated. Please use training_data and label_column_name instead.
- y
- DataFrame or ndarray or Dataset or DatasetDefinition or TabularDataset
The training labels to use when fitting pipelines during an experiment. This is the value your model will predict. This setting is being deprecated. Please use training_data and label_column_name instead.
- sample_weight
- DataFrame or ndarray or TabularDataset
The weight to give to each training sample when running fitting pipelines. Each row should correspond to a row in the X and y data.
Specify this parameter when specifying X.
- X_valid
- DataFrame or ndarray or Dataset or DatasetDefinition or TabularDataset
Validation features to use when fitting pipelines during an experiment.
If specified, then y_valid or sample_weight_valid must also be specified.
- y_valid
- DataFrame or ndarray or Dataset or DatasetDefinition or TabularDataset
Validation labels to use when fitting pipelines during an experiment.
Both X_valid and y_valid must be specified together.
- sample_weight_valid
- DataFrame or ndarray or TabularDataset
The weight to give to each validation sample when running scoring pipelines. Each row should correspond to a row in the X and y data.
Specify this parameter when specifying X_valid.
- cv_splits_indices
- List[List[ndarray]]
Indices where to split training data for cross validation. Each row is a separate cross fold. Within each cross fold, provide 2 numpy arrays: the first with the indices of the samples to use for training data and the second with the indices to use for validation data. That is, [[t1, v1], [t2, v2], ...] where t1 is the training indices for the first cross fold and v1 is the validation indices for the first cross fold.
To specify existing data as validation data, use validation_data. To let AutoML extract validation
data out of training data instead, specify either n_cross_validations or validation_size.
Use cv_split_column_names if you have cross validation column(s) in training_data.
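The [[t1, v1], [t2, v2], ...] layout described for cv_splits_indices can be sketched as follows. This is a minimal illustration only: make_cv_splits_indices is a hypothetical helper, not part of the SDK, and it returns plain lists where the parameter description calls for numpy arrays (each inner list could be wrapped with numpy.array before use).

```python
from typing import List


def make_cv_splits_indices(n_rows: int, n_folds: int) -> List[List[List[int]]]:
    """Build [[t1, v1], [t2, v2], ...] using contiguous validation folds.

    Hypothetical helper for illustration; not an azureml function.
    """
    if not 2 <= n_folds <= n_rows:
        raise ValueError("n_folds must be between 2 and n_rows")
    splits = []
    for fold in range(n_folds):
        # Rows in [start, stop) are the validation rows of this fold.
        start = fold * n_rows // n_folds
        stop = (fold + 1) * n_rows // n_folds
        valid = list(range(start, stop))
        # Every other row is used for training in this fold.
        train = list(range(0, start)) + list(range(stop, n_rows))
        splits.append([train, valid])
    return splits


cv_splits = make_cv_splits_indices(n_rows=6, n_folds=3)
for train_idx, valid_idx in cv_splits:
    print("train:", train_idx, "valid:", valid_idx)
```

Every row appears in exactly one validation fold, which is one common way to lay out folds; any other assignment that follows the same [training, validation] pairing would fit the described format.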
- validation_size
- float
What fraction of the data to hold out for validation when user validation data is not specified. This should be between 0.0 and 1.0 non-inclusive.
Specify validation_data to provide validation data, otherwise set n_cross_validations or
validation_size to extract validation data out of the specified training data.
For custom cross validation fold, use cv_split_column_names.
For more information, see Configure data splits and cross-validation in automated machine learning.
- n_cross_validations
- int
How many cross validations to perform when user validation data is not specified.
Specify validation_data to provide validation data, otherwise set n_cross_validations or
validation_size to extract validation data out of the specified training data.
For custom cross validation fold, use cv_split_column_names.
For more information, see Configure data splits and cross-validation in automated machine learning.
- y_min
- float
Minimum value of y for a regression experiment. The combination of y_min and y_max is used to
normalize test set metrics based on the input data range. If not specified, the minimum value is
inferred from the data.
- y_max
- float
Maximum value of y for a regression experiment. The combination of y_min and y_max is used to
normalize test set metrics based on the input data range. If not specified, the maximum value is
inferred from the data.
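To make the role of y_min and y_max concrete, the sketch below normalizes a root mean squared error by the range y_max - y_min. The exact formula automated ML applies is not stated here, so treat the division by the range as an assumption; normalized_rmse is a hypothetical helper, not an SDK function.

```python
import math
from typing import Optional, Sequence


def normalized_rmse(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    y_min: Optional[float] = None,
    y_max: Optional[float] = None,
) -> float:
    """RMSE divided by the target range (assumed normalization)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    # Mirror the documented fallback: infer the bounds from the data.
    lo = min(y_true) if y_min is None else y_min
    hi = max(y_true) if y_max is None else y_max
    if hi <= lo:
        raise ValueError("y_max must be greater than y_min")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (hi - lo)


inferred = normalized_rmse([0.0, 10.0], [1.0, 9.0])
explicit = normalized_rmse([0.0, 10.0], [1.0, 9.0], y_min=0.0, y_max=20.0)
print(inferred, explicit)
```

Passing a wider explicit range than the data covers lowers the normalized value, which is why the two bounds matter when comparing metrics across data sets.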
- num_classes
- int
The number of classes in the label data for a classification experiment.
- featurization
- str or FeaturizationConfig
Accepts 'auto', 'off', or a FeaturizationConfig object. Indicates whether the featurization step should be done automatically, skipped, or customized. Note: If the input data is sparse, featurization cannot be turned on.
Column type is automatically detected. Based on the detected column type, preprocessing/featurization is done as follows:
Categorical: Target encoding, one hot encoding, drop high cardinality categories, impute missing values.
Numeric: Impute missing values, cluster distance, weight of evidence.
DateTime: Several features such as day, seconds, minutes, hours etc.
Text: Bag of words, pre-trained Word embedding, text target encoding.
More details can be found in the article Configure automated ML experiments in Python.
To customize the featurization step, provide a FeaturizationConfig object. Customized featurization currently supports blocking a set of transformers, updating column purpose, editing transformer parameters, and dropping columns. For more information, see Customize feature engineering.
Note: Timeseries features are handled separately when the task type is set to forecasting independent of this parameter.
- max_cores_per_iteration
- int
The maximum number of threads to use for a given training iteration. Acceptable values:
Greater than 1 and less than or equal to the maximum number of cores on the compute target.
Equal to -1, which means to use all the possible cores per iteration per child-run.
Equal to 1, the default.
- max_concurrent_iterations
- int
Represents the maximum number of iterations that would be executed in parallel. The default value is 1.
AmlCompute clusters support one iteration running per node. For multiple AutoML experiment parent runs executed in parallel on a single AmlCompute cluster, the sum of the max_concurrent_iterations values for all experiments should be less than or equal to the maximum number of nodes. Otherwise, runs will be queued until nodes are available.
DSVM supports multiple iterations per node. max_concurrent_iterations should be less than or equal to the number of cores on the DSVM. For multiple experiments run in parallel on a single DSVM, the sum of the max_concurrent_iterations values for all experiments should be less than or equal to the maximum number of nodes.
Databricks - max_concurrent_iterations should be less than or equal to the number of worker nodes on Databricks.
max_concurrent_iterations does not apply to local runs. Formerly, this parameter
was named concurrent_iterations.
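The AmlCompute sizing rule above (the sum of max_concurrent_iterations across parallel experiments should not exceed the node count) can be checked before submitting. The following is a sketch under that reading of the rule; queued_iterations is a hypothetical helper and does not query a real cluster.

```python
from typing import Sequence


def queued_iterations(max_concurrent_per_experiment: Sequence[int], max_nodes: int) -> int:
    """Return how many concurrent iterations exceed the node budget.

    Hypothetical helper: assumes one iteration per node, as described for
    AmlCompute clusters. A result of 0 means nothing should need to queue.
    """
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    if any(value < 1 for value in max_concurrent_per_experiment):
        raise ValueError("each max_concurrent_iterations value must be at least 1")
    return max(0, sum(max_concurrent_per_experiment) - max_nodes)


fits = queued_iterations([4, 4], max_nodes=8)
overflow = queued_iterations([4, 4, 2], max_nodes=8)
print(fits, overflow)
```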
- iteration_timeout_minutes
- int
Maximum time in minutes that each iteration can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.
- mem_in_mb
- int
Maximum memory usage, in MB, that each iteration can reach before it terminates. If not specified, a value of 1 PB or 1073741824 MB is used.
- enforce_time_on_windows
- bool
Whether to enforce a time limit on model training at each iteration on Windows. The default is True. If running from a Python script file (.py), see the documentation for allowing resource limits on Windows.
- experiment_timeout_hours
- float
Maximum amount of time in hours that all iterations combined can take before the experiment terminates. Can be a decimal value like 0.25 representing 15 minutes. If not specified, the default experiment timeout is 6 days. To specify a timeout less than or equal to 1 hour, make sure your dataset's size is not greater than 10,000,000 (rows times columns) or an error results.
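The size constraint for short timeouts can be checked locally before building the configuration. This sketch only restates the rule from the description (rows times columns at most 10,000,000 when the timeout is 1 hour or less); timeout_minutes is a hypothetical helper and the error it raises is not the error the service returns.

```python
MAX_CELLS_FOR_SHORT_TIMEOUT = 10_000_000  # rows times columns, from the text above


def timeout_minutes(experiment_timeout_hours: float, n_rows: int, n_columns: int) -> float:
    """Convert the timeout to minutes after checking the documented size rule.

    Hypothetical pre-check for illustration only.
    """
    if experiment_timeout_hours <= 0:
        raise ValueError("experiment_timeout_hours must be positive")
    cells = n_rows * n_columns
    if experiment_timeout_hours <= 1 and cells > MAX_CELLS_FOR_SHORT_TIMEOUT:
        raise ValueError(
            f"{cells} cells is too large for a timeout of 1 hour or less"
        )
    return experiment_timeout_hours * 60


quarter_hour = timeout_minutes(0.25, n_rows=1_000, n_columns=10)
print(quarter_hour)
```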
- experiment_exit_score
- float
Target score for the experiment. The experiment terminates after this score is reached. If not specified (no criteria), the experiment runs until no further progress is made on the primary metric. For more information on exit criteria, see this article.
- enable_early_stopping
- bool
Whether to enable early termination if the score is not improving in the short term. The default is False.
Default behavior for stopping criteria:
If iteration and experiment timeout are not specified, then early stopping is turned on and
experiment_timeout = 6 days, num_iterations = 1000.
If experiment timeout is specified, then early_stopping = off, num_iterations = 1000.
Early stopping logic:
No early stopping for first 20 iterations (landmarks).
Early stopping window starts on the 21st iteration and looks for early_stopping_n_iters iterations
(currently set to 10). This means that the first iteration where stopping can occur is the 31st.
AutoML still schedules 2 ensemble iterations AFTER early stopping, which might result in
higher scores.
Early stopping is triggered if the absolute value of the best score calculated is the same for the past
early_stopping_n_iters iterations, that is, if there is no improvement in score for early_stopping_n_iters iterations.
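The early stopping rule above can be sketched as a small function over per-iteration scores. This is one reading of the description rather than the service's implementation: it assumes higher scores are better, ignores the two trailing ensemble iterations, and treats "no improvement" as the best-so-far score being unchanged over the last early_stopping_n_iters iterations. The function name is hypothetical.

```python
from typing import Optional, Sequence


def first_early_stop_iteration(
    scores: Sequence[float], landmarks: int = 20, n_iters: int = 10
) -> Optional[int]:
    """Return the 1-based iteration at which stopping would trigger, or None."""
    best_so_far = []
    best = float("-inf")
    for score in scores:
        best = max(best, score)
        best_so_far.append(best)
    # No stopping during the landmark iterations; the window then needs
    # n_iters iterations, so the first candidate is iteration 31 by default.
    first_candidate = landmarks + n_iters + 1
    for iteration in range(first_candidate, len(scores) + 1):
        # Compare the best score now with the best score n_iters iterations ago.
        if best_so_far[iteration - 1] == best_so_far[iteration - 1 - n_iters]:
            return iteration
    return None


flat = first_early_stop_iteration([0.5] * 40)
plateau = first_early_stop_iteration([float(k) for k in range(1, 26)] + [25.0] * 20)
rising = first_early_stop_iteration([0.01 * k for k in range(40)])
print(flat, plateau, rising)
```

In the plateau case the last improvement happens at iteration 25, so the first iteration with ten consecutive non-improving iterations behind it is iteration 35.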
- blocked_models
- list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task
A list of algorithms to ignore for an experiment. If enable_tf is False, TensorFlow models
are included in blocked_models.
- blacklist_models
- list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task
Deprecated parameter, use blocked_models instead.
- exclude_nan_labels
- bool
Whether to exclude rows with NaN values in the label. The default is True.
- verbosity
- int
The verbosity level for writing to the log file. The default is INFO or 20. Acceptable values are defined in the Python logging library.
- enable_tf
- bool
Deprecated parameter to enable/disable TensorFlow algorithms. The default is False.
- model_explainability
- bool
Whether to enable explaining the best AutoML model at the end of all AutoML training iterations. The default is True. For more information, see Interpretability: model explanations in automated machine learning.
- allowed_models
- list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task
A list of model names to search for an experiment. If not specified, then all models supported
for the task are used minus any specified in blocked_models or deprecated TensorFlow models.
The supported models for each task type are described in the
SupportedModels class.
- whitelist_models
- list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task
Deprecated parameter, use allowed_models instead.
- enable_onnx_compatible_models
- bool
Whether to enable or disable enforcing the ONNX-compatible models. The default is False. For more information about Open Neural Network Exchange (ONNX) and Azure Machine Learning, see this article.
- time_column_name
- str
The name of the time column. This parameter is required when forecasting to specify the datetime column in the input data used for building the time series and inferring its frequency.
- max_horizon
- int
The desired maximum forecast horizon in units of time-series frequency. The default value is 1.
Units are based on the time interval of your training data (for example, monthly or weekly) over which the forecaster should predict. When task type is forecasting, this parameter is required. For more information on setting forecasting parameters, see Auto-train a time-series forecast model.
- grain_column_names
- str or list(str)
The names of columns used to group a timeseries. It can be used to create multiple series. If grain is not defined, the data set is assumed to be one time-series. This parameter is used with task type forecasting.
- drop_column_names
- str or list(str)
The names of columns to drop for forecasting tasks. To customize drop columns for classification
and regression tasks, use the featurization parameter.
- target_lags
- int or list(int)
The number of past periods to lag from the target column. The default is 1.
When forecasting, this parameter represents the number of rows to lag the target values based on the frequency of the data. This is represented as a list or single integer. Lag should be used when the relationship between the independent variables and the dependent variable does not match up or correlate by default. For example, when trying to forecast demand for a product, the demand in any month may depend on the price of specific commodities 3 months prior. In this example, you may want to lag the target (demand) negatively by 3 months so that the model is training on the correct relationship. For more information, see Auto-train a time-series forecast model.
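To illustrate what lagging the target by a number of rows means, the sketch below shifts a series in plain Python. It is not how automated ML builds lag features internally; lag_series, build_lag_features, and the "lag_k" column names are hypothetical and exist only for this example.

```python
from typing import Dict, List, Optional, Sequence, Union


def lag_series(values: Sequence[float], periods: int) -> List[Optional[float]]:
    """Shift values down by `periods` rows, padding the start with None."""
    if periods < 1:
        raise ValueError("periods must be at least 1")
    n = len(values)
    k = min(periods, n)
    return [None] * k + list(values[: n - k])


def build_lag_features(
    values: Sequence[float], target_lags: Union[int, Sequence[int]]
) -> Dict[str, List[Optional[float]]]:
    """Accept a single integer or a list of lags, as the parameter does."""
    lags = [target_lags] if isinstance(target_lags, int) else list(target_lags)
    return {f"lag_{k}": lag_series(values, k) for k in lags}


demand = [10.0, 20.0, 30.0, 40.0, 50.0]
features = build_lag_features(demand, [1, 3])
print(features["lag_3"])
```

With a lag of 3, each row is paired with the target value from three periods earlier, which matches the monthly demand example in the parameter description.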
- feature_lags
- str
Flag for generating lags for the numeric features.
- target_rolling_window_size
- int
The number of past periods used to create a rolling window average of the target column.
When forecasting, this parameter represents n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model.
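A rolling window average over past periods can be sketched as follows. The assumption here is that the average for a row uses only the window of periods before that row, so no current or future target value leaks into the feature; whether automated ML aligns the window exactly this way is not stated. rolling_mean_of_past is a hypothetical helper.

```python
from typing import List, Optional, Sequence


def rolling_mean_of_past(values: Sequence[float], window: int) -> List[Optional[float]]:
    """Mean of the `window` values preceding each row; None until enough history."""
    if window < 1:
        raise ValueError("window must be at least 1")
    result: List[Optional[float]] = []
    for i in range(len(values)):
        if i < window:
            # Not enough past periods yet to fill the window.
            result.append(None)
        else:
            past = values[i - window : i]
            result.append(sum(past) / window)
    return result


rolled = rolling_mean_of_past([1.0, 2.0, 3.0, 4.0, 5.0], window=2)
print(rolled)
```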
- country_or_region
- str
The country/region used to generate holiday features. This should be an ISO 3166 two-letter country/region code, for example 'US' or 'GB'.
- use_stl
- str
Configure STL decomposition of the time-series target column. use_stl can take three values: None (default) for no STL decomposition, 'season' to generate only the season component, and 'season_trend' to generate both season and trend components.
- seasonality
- int
Set time series seasonality. If seasonality is set to -1, it will be inferred. If use_stl is not set, this parameter will not be used.
- freq
- str
The time series data set frequency.
When forecasting, this parameter represents the period with which the events are supposed to happen, for example daily, weekly, yearly, etc. The frequency needs to be a pandas offset alias. Refer to the pandas documentation for more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
- enable_voting_ensemble
- bool
Whether to enable/disable VotingEnsemble iteration. The default is True. For more information about ensembles, see Ensemble configuration.
- enable_stack_ensemble
- Optional[bool]
Whether to enable/disable StackEnsemble iteration. The default is None. If the enable_onnx_compatible_models flag is set, then StackEnsemble iteration will be disabled. Similarly, for Timeseries tasks, StackEnsemble iteration will be disabled by default, to avoid the risk of overfitting due to the small training set used in fitting the meta learner. For more information about ensembles, see Ensemble configuration.
- debug_log
- str
The log file to write debug information to. If not specified, 'automl.log' is used.
- training_data
- DataFrame or Dataset or DatasetDefinition or TabularDataset
The training data to be used within the experiment.
It should contain both training features and a label column (optionally a sample weights column).
If training_data is specified, then the label_column_name parameter must also be specified.
training_data was introduced in version 1.0.81.
- validation_data
- DataFrame or Dataset or DatasetDefinition or TabularDataset
The validation data to be used within the experiment.
It should contain both training features and a label column (optionally a sample weights column).
If validation_data is specified, then training_data and label_column_name parameters must
be specified.
validation_data was introduced in version 1.0.81. For more information, see
Configure data splits and cross-validation in automated machine learning.
- label_column_name
- str or int
The name of the label column. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers.
This parameter is applicable to training_data and validation_data parameters.
label_column_name was introduced in version 1.0.81.
- weight_column_name
- str or int
The name of the sample weight column. Automated ML supports a weighted column as an input, causing rows in the data to be weighted up or down. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers.
This parameter is applicable to training_data and validation_data parameters.
weight_column_name was introduced in version 1.0.81.
- cv_split_column_names
- list(str)
List of names of the columns that contain custom cross validation splits. Each CV split column represents one CV split, where each row is marked either 1 for training or 0 for validation.
This parameter is applicable to the training_data parameter for custom cross validation purposes.
cv_split_column_names was introduced in version 1.6.0.
Use either cv_split_column_names or cv_splits_indices.
For more information, see Configure data splits and cross-validation in automated machine learning.
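The 1/0 marking used by CV split columns can be read as a pair of index lists, which shows how this parameter relates to cv_splits_indices. The helper below is hypothetical and only illustrates the convention (1 for training, 0 for validation) stated above.

```python
from typing import List, Sequence, Tuple


def split_column_to_indices(column: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Turn one CV split column into (training indices, validation indices)."""
    train = [i for i, flag in enumerate(column) if flag == 1]
    valid = [i for i, flag in enumerate(column) if flag == 0]
    if len(train) + len(valid) != len(column):
        raise ValueError("CV split columns may only contain 1 (train) or 0 (validation)")
    return train, valid


# Two hypothetical split columns over the same five rows.
cv1 = [1, 1, 0, 1, 0]
cv2 = [0, 1, 1, 0, 1]
pairs = [list(split_column_to_indices(col)) for col in (cv1, cv2)]
print(pairs)
```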
- enable_local_managed
- bool
Flag indicating whether to allow local managed runs.
- enable_dnn
- bool
Whether to include DNN based models during model selection. The default is False.
Remarks
The following code shows a basic example of creating an AutoMLConfig object and submitting an experiment for regression:
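import logging

from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig

# Settings shared across runs; passed to AutoMLConfig as keyword arguments.
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": 'r2_score',
    "enable_early_stopping": True,
    "experiment_timeout_hours": 1.0,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "verbosity": logging.INFO,
}

# compute_target, train_data, and label must be defined beforehand.
automl_config = AutoMLConfig(task = 'regression',
                             compute_target = compute_target,
                             training_data = train_data,
                             label_column_name = label,
                             **automl_settings
                             )

# Load the workspace from a local config file and submit the experiment.
ws = Workspace.from_config()
experiment = Experiment(ws, "your-experiment-name")
run = experiment.submit(automl_config, show_output=True)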
A full sample is available at https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/regression/auto-ml-regression.ipynb
Examples of using AutoMLConfig for forecasting are in these notebooks:
Examples of using AutoMLConfig for all task types can be found in these automated ML notebooks.
For background on automated ML, see the articles:
Configure automated ML experiments in Python. In this article, there is information about the different algorithms and primary metrics used for each task type.
Auto-train a time-series forecast model. In this article, there is information about which constructor parameters and **kwargs are used in forecasting.
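As a rough illustration of how the forecasting parameters documented above are passed through **kwargs, the settings dictionary below collects several of them. The column name and all values are placeholders chosen for this sketch, not recommendations, and the dictionary would be unpacked into AutoMLConfig together with task='forecasting' in the same way as automl_settings in the regression example.

```python
# Hypothetical forecasting settings; "date" is a placeholder column name.
forecasting_settings = {
    "time_column_name": "date",       # datetime column used to build the series
    "max_horizon": 14,                # forecast 14 periods ahead
    "target_lags": [1, 7],            # lag the target by 1 and 7 periods
    "target_rolling_window_size": 7,  # rolling average over 7 past periods
    "country_or_region": "US",        # ISO 3166 two-letter code for holiday features
    "freq": "D",                      # pandas offset alias for daily data
}

for name, value in forecasting_settings.items():
    print(f"{name} = {value!r}")
```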
For more information about different options for configuring training/validation data splits and cross-validation for your automated machine learning (AutoML) experiments, see Configure data splits and cross-validation in automated machine learning.
Methods
| get_supported_dataset_languages(use_gpu: bool) -> typing.Dict[typing.Any, typing.Any] | Get supported languages and their corresponding language codes in ISO 639-3. |
get_supported_dataset_languages
Get supported languages and their corresponding language codes in ISO 639-3.
get_supported_dataset_languages(use_gpu: bool) -> typing.Dict[typing.Any, typing.Any]
Parameters
- use_gpu
- bool
Boolean indicating whether GPU compute is being used.
Returns
dictionary of format {