What is automated machine learning (AutoML)?

Automated machine learning, also referred to as automated ML or AutoML, is the process of automating the time-consuming, iterative tasks of machine learning model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity, all while sustaining model quality. Automated ML is based on a breakthrough from our Microsoft Research division.

Traditional machine learning model development is resource-intensive, requiring significant domain knowledge and time to produce and compare dozens of models. With automated machine learning, you'll reduce the time it takes to get production-ready ML models, with great ease and efficiency.

When to use AutoML: classification, regression, & forecasting

Apply automated ML when you want Azure Machine Learning to train and tune a model for you using the target metric you specify. Automated ML democratizes the machine learning model development process, and empowers its users, no matter their data science expertise, to identify an end-to-end machine learning pipeline for any problem.

Data scientists, analysts, and developers across industries can use automated ML to:

  • Implement ML solutions without extensive programming knowledge
  • Save time and resources
  • Leverage data science best practices
  • Provide agile problem-solving

Classification

Classification is a common machine learning task. It is a type of supervised learning in which models learn from training data and apply what they have learned to new data. Azure Machine Learning offers featurizations specifically for these tasks, such as deep neural network text featurizers for classification. Learn more about featurization options.

The main goal of classification models is to predict which categories new data will fall into based on learnings from its training data. Common classification examples include fraud detection, handwriting recognition, and object detection. Learn more and see an example at Create a classification model with automated ML.

See examples of classification and automated machine learning in these Python notebooks: Fraud Detection, Marketing Prediction, and Newsgroup Data Classification.

Regression

Like classification, regression is a common supervised learning task. Azure Machine Learning offers featurizations specifically for these tasks.

Unlike classification, where predicted output values are categorical, regression models predict numerical output values based on independent predictors. In regression, the objective is to help establish the relationship among those independent predictor variables by estimating how one variable impacts the others. For example, a regression model might predict automobile price based on features like gas mileage and safety rating. Learn more and see an example of regression with automated machine learning.

See examples of regression and automated machine learning for predictions in these Python notebooks: CPU Performance Prediction.

Time-series forecasting

Building forecasts is an integral part of any business, whether it's revenue, inventory, sales, or customer demand. You can use automated ML to combine techniques and approaches and get a recommended, high-quality time-series forecast. Learn more with this how-to: automated machine learning for time series forecasting.

An automated time-series experiment is treated as a multivariate regression problem. Past time-series values are "pivoted" to become additional dimensions for the regressor together with other predictors. This approach, unlike classical time series methods, has an advantage of naturally incorporating multiple contextual variables and their relationship to one another during training. Automated ML learns a single, but often internally branched model for all items in the dataset and prediction horizons. More data is thus available to estimate model parameters and generalization to unseen series becomes possible.
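The "pivoting" of past values into predictor columns can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the service's implementation: the function name `make_lag_features` and the toy series are hypothetical.

```python
from typing import List, Sequence, Tuple


def make_lag_features(
    series: Sequence[float], lags: Sequence[int]
) -> Tuple[List[List[float]], List[float]]:
    """Turn a univariate series into (features, targets) for a regressor.

    Each feature row holds the values observed `lag` steps before the
    target, so past values become ordinary predictor columns.
    """
    max_lag = max(lags)
    rows: List[List[float]] = []
    targets: List[float] = []
    for t in range(max_lag, len(series)):
        rows.append([series[t - lag] for lag in lags])
        targets.append(series[t])
    return rows, targets


# Toy series (made-up values) pivoted with lags of 1 and 2 steps.
X, y = make_lag_features([10, 12, 13, 15, 18], lags=[1, 2])
```

Each row of `X` could then be extended with other predictors (price, holiday flags, and so on) before being passed to any regressor.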

Advanced forecasting configuration includes:

  • holiday detection and featurization
  • time-series and DNN learners (Auto-ARIMA, Prophet, ForecastTCN)
  • many models support through grouping
  • rolling-origin cross validation
  • configurable lags
  • rolling window aggregate features
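Two of the items above, rolling-origin cross validation and rolling window aggregate features, can be sketched with the standard library alone. The function names and the expanding-window scheme are assumptions made for illustration; the service's own splitting logic may differ.

```python
from typing import Iterator, List, Sequence, Tuple


def rolling_origin_splits(
    n_samples: int, n_folds: int, horizon: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with an expanding training window.

    Each fold trains on everything before its origin and validates on the
    next `horizon` points, so no fold ever trains on future data.
    """
    first_origin = n_samples - n_folds * horizon
    if first_origin <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        origin = first_origin + fold * horizon
        yield list(range(origin)), list(range(origin, origin + horizon))


def rolling_mean(series: Sequence[float], window: int) -> List[float]:
    """Mean of the `window` values that precede each point."""
    return [
        sum(series[t - window:t]) / window
        for t in range(window, len(series))
    ]


splits = list(rolling_origin_splits(n_samples=10, n_folds=3, horizon=2))
```

With 10 samples, 3 folds, and a horizon of 2, the validation windows are indices 4-5, 6-7, and 8-9, each preceded by all earlier points as training data.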

See examples of forecasting with automated machine learning in these Python notebooks: Sales Forecasting, Demand Forecasting, and Beverage Production Forecast.

How AutoML works

During training, Azure Machine Learning creates a number of pipelines in parallel that try different algorithms and parameters for you. The service iterates through ML algorithms paired with feature selections, where each iteration produces a model with a training score. The higher the score, the better the model is considered to "fit" your data. Training stops once it hits the exit criteria defined in the experiment.
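The iterate, score, and stop loop described above can be sketched as follows. This is a simplified, sequential illustration (the service runs pipelines in parallel); the `search` function, the candidate names, and the scores are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Candidate = Tuple[str, Dict[str, float]]  # (algorithm name, parameters)


def search(
    candidates: Iterable[Candidate],
    score_fn: Callable[[str, Dict[str, float]], float],
    max_iterations: int,
    target_score: Optional[float] = None,
) -> Tuple[Optional[Candidate], float, int]:
    """Score candidates one by one and keep the best.

    Stops when `max_iterations` is reached or the best score meets
    `target_score` (two simple exit criteria).
    Returns (best candidate, best score, iterations used).
    """
    best: Optional[Candidate] = None
    best_score, used = float("-inf"), 0
    for candidate in candidates:
        if used >= max_iterations:
            break
        used += 1
        score = score_fn(*candidate)
        if score > best_score:
            best, best_score = candidate, score
        if target_score is not None and best_score >= target_score:
            break
    return best, best_score, used


# Made-up scores standing in for real training runs.
toy_scores = {"LogisticRegression": 0.81, "RandomForest": 0.88, "LightGBM": 0.91}
toy_candidates = [(name, {}) for name in toy_scores]
best, best_score, used = search(
    toy_candidates,
    lambda name, params: toy_scores[name],
    max_iterations=10,
    target_score=0.85,
)
```

Here the loop exits after the second candidate because its score already meets the target, which is why the third candidate is never tried.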

Using Azure Machine Learning, you can design and run your automated ML training experiments with these steps:

  1. Identify the ML problem to be solved: classification, forecasting, or regression

  2. Choose whether you want to use the Python SDK or the studio web experience: Learn about the parity between the Python SDK and studio web experience.

    Important

    The functionality in this studio, https://ml.azure.com, is accessible from Enterprise workspaces only. Learn more about editions and upgrading.

  3. Specify the source and format of the labeled training data: NumPy arrays or pandas DataFrame

  4. Configure the compute target for model training, such as your local computer, Azure Machine Learning Computes, remote VMs, or Azure Databricks. Learn about automated training on a remote resource.

  5. Configure the automated machine learning parameters that determine how many iterations to run over different models, the hyperparameter settings, advanced preprocessing/featurization, and which metrics to look at when determining the best model.

  6. Submit the training run.

  7. Review the results

The following diagram illustrates this process.

[Diagram: Automated machine learning]
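Steps 1 and 5 above can be sketched as a plain settings dictionary. The key names are assumed to mirror parameters of the SDK's AutoMLConfig class; check the class reference for your SDK version before relying on them. The values, the column name `label`, and the helper `check_settings` are hypothetical.

```python
from typing import Any, Dict, List

ALLOWED_TASKS = {"classification", "regression", "forecasting"}

# Hypothetical experiment settings (key names assumed to match AutoMLConfig).
automl_settings: Dict[str, Any] = {
    "task": "classification",          # step 1: the ML problem to solve
    "primary_metric": "AUC_weighted",  # metric used to pick the best model
    "iterations": 20,                  # how many pipelines to try
    "experiment_timeout_hours": 1,     # exit criterion
    "n_cross_validations": 5,
    "featurization": "auto",
    "label_column_name": "label",      # hypothetical column name
}


def check_settings(settings: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the sketch is consistent."""
    problems: List[str] = []
    if settings.get("task") not in ALLOWED_TASKS:
        problems.append("task must be one of: " + ", ".join(sorted(ALLOWED_TASKS)))
    if not settings.get("primary_metric"):
        problems.append("primary_metric is required")
    if settings.get("iterations", 1) < 1:
        problems.append("iterations must be at least 1")
    return problems


problems = check_settings(automl_settings)
```

With the SDK installed, a dictionary like this would typically be unpacked into the configuration object together with the training data and compute target; treat that usage as an assumption to verify against the SDK reference.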

You can also inspect the logged run information, which contains metrics gathered during the run. The training run produces a Python serialized object (.pkl file) that contains the model and data preprocessing.

While model building is automated, you can also learn how important or relevant features are to the generated models.

Learn how to use a remote compute target.

Feature engineering

Feature engineering is the process of using domain knowledge of the data to create features that help ML algorithms learn better. In Azure Machine Learning, scaling and normalization techniques are applied to facilitate feature engineering. Collectively, these techniques and feature engineering are referred to as featurization.

For automated machine learning experiments, featurization is applied automatically, but can also be customized based on your data. Learn more about what featurization is included.

Note

Automated machine learning featurization steps (feature normalization, handling missing data, converting text to numeric, etc.) become part of the underlying model. When using the model for predictions, the same featurization steps applied during training are applied to your input data automatically.
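The idea that featurization becomes part of the model can be illustrated with a toy class that stores its scaling parameters at training time and reapplies them at prediction time. This is a sketch under stated assumptions: `FeaturizedModel`, its nearest-centroid predictor, and the data are hypothetical and unrelated to the objects the service produces.

```python
from typing import Dict, List, Sequence


class FeaturizedModel:
    """Toy model that keeps its featurization (min-max scaling) with itself."""

    def fit(self, values: Sequence[float], labels: Sequence[str]) -> "FeaturizedModel":
        # Featurization parameters are learned from the training data.
        # Assumes the training values are not all identical.
        self.low, self.high = min(values), max(values)
        by_label: Dict[str, List[float]] = {}
        for value, label in zip(values, labels):
            by_label.setdefault(label, []).append(self._featurize(value))
        # The predictor (one centroid per label) is fitted on scaled data.
        self.centroids = {k: sum(v) / len(v) for k, v in by_label.items()}
        return self

    def _featurize(self, value: float) -> float:
        return (value - self.low) / (self.high - self.low)

    def predict(self, value: float) -> str:
        # The same scaling is reapplied automatically to new input.
        scaled = self._featurize(value)
        return min(self.centroids, key=lambda k: abs(self.centroids[k] - scaled))


model = FeaturizedModel().fit([0, 10, 90, 100], ["low", "low", "high", "high"])
```

Callers pass raw values to `predict`; they never need to know which scaling was used during training.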

Automatic featurization (standard)

In every automated machine learning experiment, your data is automatically scaled or normalized to help algorithms perform well. During model training, one of the following scaling or normalization techniques will be applied to each model. Learn how AutoML helps prevent over-fitting and imbalanced data in your models.

Scaling & normalization | Description
StandardScaleWrapper | Standardize features by removing the mean and scaling to unit variance
MinMaxScaler | Transform features by scaling each feature by that column's minimum and maximum
MaxAbsScaler | Scale each feature by its maximum absolute value
RobustScaler | Scale features by their quantile range
PCA | Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space
TruncatedSVDWrapper | Linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition, which means it can work with scipy.sparse matrices efficiently
SparseNormalizer | Each sample (that is, each row of the data matrix) with at least one non-zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one
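Several of the techniques in the table reduce to short formulas. The following sketch implements four of them in plain Python for a single column (or a single row, for normalization); the function names are hypothetical, and the code assumes non-constant, non-empty input rather than reproducing any library's edge-case handling.

```python
import math
from typing import List, Sequence


def standard_scale(column: Sequence[float]) -> List[float]:
    """Remove the mean and scale to unit variance (population standard deviation)."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]


def min_max_scale(column: Sequence[float]) -> List[float]:
    """Map the column's minimum to 0 and its maximum to 1."""
    low, high = min(column), max(column)
    return [(x - low) / (high - low) for x in column]


def max_abs_scale(column: Sequence[float]) -> List[float]:
    """Divide by the largest absolute value, preserving sign and sparsity."""
    peak = max(abs(x) for x in column)
    return [x / peak for x in column]


def l2_normalize(row: Sequence[float]) -> List[float]:
    """Rescale one sample so its l2 norm equals one; all-zero rows are left alone."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)
```

Note the difference in direction: the first three functions operate on a feature column, while normalization operates on one sample (row) at a time.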

Customize featurization

Additional feature engineering techniques, such as encoding and transforms, are also available.

Enable this setting with:

  • Azure Machine Learning studio: Enable Automatic featurization in the View additional configuration section with these steps.

  • Python SDK: Specify "featurization": 'auto' / 'off' / 'FeaturizationConfig' in your AutoMLConfig object. Learn more about enabling featurization.

Ensemble models

Automated machine learning supports ensemble models, which are enabled by default. Ensemble learning improves machine learning results and predictive performance by combining multiple models as opposed to using single models. The ensemble iterations appear as the final iterations of your run. Automated machine learning uses both voting and stacking ensemble methods for combining models:

  • Voting: predicts based on the weighted average of predicted class probabilities (for classification tasks) or predicted regression targets (for regression tasks).
  • Stacking: combines heterogeneous models and trains a meta-model based on the output from the individual models. The current default meta-models are LogisticRegression for classification tasks and ElasticNet for regression/forecasting tasks.
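Voting for a classification task can be sketched as a weighted average of class probabilities. The function names, the weights, and the probabilities below are hypothetical; how the service chooses its weights is not shown here.

```python
from typing import List, Sequence


def soft_vote(
    probabilities: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of per-model class probabilities for one sample."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


def predict_class(
    probabilities: Sequence[Sequence[float]], weights: Sequence[float]
) -> int:
    """Index of the class with the highest averaged probability."""
    averaged = soft_vote(probabilities, weights)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Three models, two classes; the first model carries twice the weight.
model_outputs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
averaged = soft_vote(model_outputs, weights=[2, 1, 1])
```

Even though the most heavily weighted model prefers class 0, the combined vote favors class 1 because the other two models agree strongly.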

The Caruana ensemble selection algorithm with sorted ensemble initialization is used to decide which models to use within the ensemble. At a high level, this algorithm initializes the ensemble with up to five models with the best individual scores, and verifies that these models are within a 5% threshold of the best score to avoid a poor initial ensemble. Then, for each ensemble iteration, a new model is added to the existing ensemble and the resulting score is calculated. If the new model improves the existing ensemble score, the ensemble is updated to include the new model.
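The selection procedure described above can be sketched as greedy forward selection. This is one simplified reading of the description, not the service's code: averaging member probabilities, using accuracy as the score, allowing a model to be added more than once, and all names and data are assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def average(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the members' predicted probabilities."""
    return [sum(col) / len(col) for col in zip(*predictions)]


def accuracy(probabilities: Sequence[float], labels: Sequence[int]) -> float:
    """Share of samples where thresholding at 0.5 matches the label."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probabilities, labels))
    return hits / len(labels)


def caruana_select(
    model_probs: Dict[str, List[float]],
    labels: Sequence[int],
    max_init: int = 5,
    tolerance: float = 0.05,
    max_iterations: int = 10,
) -> Tuple[List[str], float]:
    def score(names: List[str]) -> float:
        return accuracy(average([model_probs[n] for n in names]), labels)

    # Sorted initialization: best single models within `tolerance` of the top score.
    ranked = sorted(model_probs, key=lambda n: score([n]), reverse=True)
    floor = score([ranked[0]]) * (1 - tolerance)
    ensemble = [n for n in ranked[:max_init] if score([n]) >= floor]
    best = score(ensemble)

    # Greedy iterations: keep an addition only if it improves the ensemble score.
    for _ in range(max_iterations):
        trial_score, trial_name = max((score(ensemble + [n]), n) for n in model_probs)
        if trial_score <= best:
            break
        ensemble.append(trial_name)
        best = trial_score
    return ensemble, best


# Made-up validation probabilities for three models on four samples.
toy_labels = [1, 0, 1, 0]
toy_probs = {
    "a": [0.9, 0.2, 0.4, 0.1],
    "b": [0.6, 0.6, 0.9, 0.4],
    "c": [0.1, 0.9, 0.2, 0.8],
}
selected, selected_score = caruana_select(toy_probs, toy_labels)
```

In this toy data, models `a` and `b` each misclassify a different sample, so their average is correct on all four, while the weak model `c` is excluded by the threshold check.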

See the how-to for changing default ensemble settings in automated machine learning.

Guidance on local vs. remote managed ML compute targets

The web interface for automated ML always uses a remote compute target. But when you use the Python SDK, you will choose either a local compute or a remote compute target for automated ML training.

  • Local compute: Training occurs on your local laptop or VM compute.
  • Remote compute: Training occurs on Machine Learning compute clusters.

Choose compute target

Consider these factors when choosing your compute target:

  • Choose a local compute: If your scenario involves initial explorations or demos using small data and short training runs (that is, seconds or a couple of minutes per child run), training on your local computer might be a better choice. There is no setup time, and the infrastructure resources (your PC or VM) are directly available.
  • Choose a remote ML compute cluster: If you are training with larger datasets, as in production training that creates models requiring longer training runs, remote compute will provide much better end-to-end time performance because AutoML will parallelize training across the cluster's nodes. On a remote compute, the start up time for the internal infrastructure will add around 1.5 minutes per child run, plus additional minutes for the cluster infrastructure if the VMs are not yet up and running.
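The trade-off above can be made concrete with a back-of-the-envelope estimate. The 1.5 minutes per child run comes from the text; the node count, the 5-minute cluster start up, the run counts, and the function itself are hypothetical, and the model ignores real-world effects such as uneven run lengths.

```python
import math


def estimate_minutes(
    child_runs: int,
    minutes_per_run: float,
    parallel_nodes: int = 1,
    startup_per_run: float = 0.0,
    cluster_startup: float = 0.0,
) -> float:
    """Rough wall-clock estimate: runs execute in waves of `parallel_nodes`."""
    waves = math.ceil(child_runs / parallel_nodes)
    return cluster_startup + waves * (minutes_per_run + startup_per_run)


# Short runs: local (sequential, no overhead) versus a 4-node remote cluster.
local_short = estimate_minutes(40, 0.5)
remote_short = estimate_minutes(40, 0.5, parallel_nodes=4,
                                startup_per_run=1.5, cluster_startup=5.0)

# Long runs: the per-run overhead is amortized and parallelism dominates.
local_long = estimate_minutes(40, 10)
remote_long = estimate_minutes(40, 10, parallel_nodes=4,
                               startup_per_run=1.5, cluster_startup=5.0)
```

Under these assumptions the local machine finishes the short runs sooner (20 versus 25 minutes), while the cluster finishes the long runs far sooner (120 versus 400 minutes).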

Pros and cons

Consider these pros and cons when choosing to use local vs. remote.

Local compute target

  Pros (Advantages):
  • No environment start up time

  Cons (Handicaps):
  • Subset of features
  • Can't parallelize runs
  • Worse for large data
  • No data streaming while training
  • No DNN-based featurization
  • Python SDK only

Remote ML compute clusters

  Pros (Advantages):
  • Full set of features
  • Parallelize child runs
  • Large data support
  • DNN-based featurization
  • Dynamic scalability of compute cluster on demand
  • No-code experience (web UI) also available

  Cons (Handicaps):
  • Start up time for cluster nodes
  • Start up time for each child run

Feature availability

    More features are available when you use the remote compute, as shown in the table below. Some of these features are available only in an Enterprise workspace.

    Feature | Remote | Local | Requires Enterprise workspace
    Data streaming (Large data support, up to 100 GB)
    DNN-BERT-based text featurization and training
    Out-of-the-box GPU support (training and inference)
    Image Classification and Labeling support
    Auto-ARIMA, Prophet and ForecastTCN models for forecasting
    Multiple runs/iterations in parallel
    Create models with interpretability in AutoML studio web experience UI
    Feature engineering customization in studio web experience UI
    Azure ML hyperparameter tuning
    Azure ML Pipeline workflow support
    Continue a run
    Forecasting
    Create and run experiments in notebooks
    Register and visualize experiment's info and metrics in UI
    Data guardrails

    Many models

    The Many Models Solution Accelerator (preview) builds on Azure Machine Learning and enables you to use automated ML to train, operate, and manage hundreds or even thousands of machine learning models.

    For example, building a model for each instance or individual in the following scenarios can lead to improved results:

    • Predicting sales for each individual store
    • Predictive maintenance for hundreds of oil wells
    • Tailoring an experience for individual users.

    For more information, see the Many Models Solution Accelerator on GitHub.
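The "one model per instance" idea can be sketched by grouping records and fitting a separate model for each group. Here the "model" is just the group mean, standing in for a real training run; the function name, field names, and data are hypothetical and unrelated to the accelerator's code.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Hashable, List, Mapping, Sequence


def train_per_group(
    records: Sequence[Mapping[str, float]], key: str, target: str
) -> Dict[Hashable, float]:
    """Fit one trivial model (the group mean of `target`) per distinct `key` value."""
    grouped: Dict[Hashable, List[float]] = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record[target])
    return {group: mean(values) for group, values in grouped.items()}


# Made-up sales records for two stores.
sales = [
    {"store": "A", "sales": 10},
    {"store": "A", "sales": 14},
    {"store": "B", "sales": 3},
]
models = train_per_group(sales, key="store", target="sales")
```

Each store gets its own fitted value, so a store with unusual behavior does not distort the predictions for the others.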

    AutoML in Azure Machine Learning

    Azure Machine Learning offers two experiences for working with automated ML: the Python SDK and the studio web experience.

    Experiment settings

    The following settings allow you to configure your automated ML experiment.

    The Python SDK The studio web experience
    Split data into train/validation sets
    Supports ML tasks: classification, regression, and forecasting
    Optimizes based on primary metric
    Supports AML compute as compute target
    Configure forecast horizon, target lags & rolling window
    Set exit criteria
    Set concurrent iterations
    Drop columns
    Block algorithms
    Cross validation
    Supports training on Azure Databricks clusters
    View engineered feature names
    Featurization summary
    Featurization for holidays
    Log file verbosity levels

    Model settings

    These settings can be applied to the best model as a result of your automated ML experiment.

    The Python SDK The studio web experience
    Best model registration, deployment, explainability
    Enable voting ensemble & stack ensemble models
    Show best model based on non-primary metric
    Enable/disable ONNX model compatibility
    Test the model

    Run control settings

    These settings allow you to review and control your experiment runs and their child runs.

    The Python SDK The studio web experience
    Run summary table
    Cancel runs & child runs
    Get guardrails
    Pause & resume runs

    AutoML & ONNX

    With Azure Machine Learning, you can use automated ML to build a Python model and have it converted to the ONNX format. Once the models are in the ONNX format, they can be run on a variety of platforms and devices. Learn more about accelerating ML models with ONNX.

    See how to convert to ONNX format in this Jupyter notebook example. Learn which algorithms are supported in ONNX.

    The ONNX runtime also supports C#, so you can use the model built automatically in your C# apps without any need for recoding or any of the network latencies that REST endpoints introduce. Learn more about inferencing ONNX models with the ONNX runtime C# API.

    Next steps

    There are multiple resources to get you up and running with AutoML.

    Tutorials/how-tos

    Tutorials are end-to-end introductory examples of AutoML scenarios.

    How-to articles provide additional detail on the functionality AutoML offers.

    Jupyter notebook samples

    Review detailed code examples and use cases in the GitHub notebook repository for automated machine learning samples.

    Python SDK reference

    Deepen your knowledge of SDK design patterns and class specifications with the AutoML class reference documentation.

    Note

    Automated machine learning capabilities are also available in other Microsoft solutions, such as ML.NET, HDInsight, Power BI, and SQL Server.