Connect(); Special Issue 2018
Volume 33 Number 13
Accelerate AI Solutions with Automated Machine Learning
By Krishna Anumalasetty
Machine learning (ML) is being used in a wide range of applications, from autonomous cars and credit card fraud detection to predictive maintenance in manufacturing and beyond.
But there’s a problem. Building ML solutions is complex and requires highly skilled personnel with Ph.D.s in mathematics or other quantitative fields. The demand for data scientists has outpaced supply, inhibiting adoption of ML among enterprises. Many companies have vast stores of data, yet they’re unable to employ predictive analytics to improve business decision making and achieve success.
The automated ML capability in Azure Machine Learning is designed to overcome these obstacles and make AI more accessible to every developer and every organization. In this article, I’ll show how automated ML can be used to quickly build an energy demand forecasting solution.
The Machine Learning Lifecycle
Look at the data science lifecycle process shown in Figure 1. It breaks the lifecycle into four stages: Business Understanding, Data Acquisition & Understanding, Modeling, and Deployment. Every ML solution should start with the business problem you’re working to solve (the Business Understanding stage). Next, in the Data Acquisition & Understanding stage, you acquire and explore the raw data, which may reside in various data sources and in different formats. Ingesting, exploring, and de-duplicating data are some of the activities in this stage.
Figure 1 Data Science Lifecycle Process
The Modeling stage has three steps, starting with the feature-engineering process. Here you may transform existing columns and generate entirely new features. Features are the predictors that influence a prediction. An easy way to understand features is to take the example of estimating a house price, which is influenced by the size and location of the house, the number of bedrooms, and other factors. So, if you’re generating an ML model to predict a house value, these factors are the features.
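To make the idea of features and a label concrete, here is a minimal sketch using a handful of made-up house listings. The listing values, the column names and the single-feature least-squares fit are all illustrative assumptions, not part of the energy demand solution:

```python
# Hypothetical listings: size and bedrooms are features, price is the label.
houses = [
    {"size_sqft": 1000, "bedrooms": 2, "price": 200_000},
    {"size_sqft": 2000, "bedrooms": 3, "price": 300_000},
    {"size_sqft": 3000, "bedrooms": 4, "price": 400_000},
]

feature_names = ["size_sqft", "bedrooms"]
X = [[house[name] for name in feature_names] for house in houses]  # feature rows
y = [house["price"] for house in houses]                           # label vector


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Fit price against the first feature only (size) to keep the sketch small.
sizes = [row[0] for row in X]
slope, intercept = fit_line(sizes, y)
predicted_price = slope * 1500 + intercept
print(slope, intercept, predicted_price)
```

A real model would use all the features at once; the point here is only the separation of predictors (X) from the value being predicted (y).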
Next comes the model training process (in the Modeling stage) where you build ML models by applying many different algorithms and hyperparameters to the dataset. This process involves evaluating the different models you’ve built to choose the one that works best for the current application. From there, you test and finally deploy the model into production, where it’s used to run predictions on new data.
Once in the Deployment stage, you must monitor the model for drift, retraining it periodically when drift reaches a certain threshold or when new data makes clear that the model isn’t performing to expectation.
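One simple way to picture drift monitoring is to compare the error on recent predictions against the error measured when the model was deployed. The sketch below assumes mean absolute error as the metric and a 25 percent tolerance; both choices, and the function names, are illustrative rather than anything prescribed by Azure Machine Learning:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def needs_retraining(baseline_error, recent_actual, recent_predicted, tolerance=0.25):
    """Return True when recent error exceeds the baseline by more than `tolerance`."""
    recent_error = mean_absolute_error(recent_actual, recent_predicted)
    return recent_error > baseline_error * (1 + tolerance)


baseline_error = 100.0  # hypothetical error measured at deployment time

# Recent predictions that are still close to the actual values.
stable = needs_retraining(baseline_error, [5000, 5200, 5100], [4950, 5300, 5000])

# Recent predictions that have degraded noticeably.
drifted = needs_retraining(baseline_error, [5000, 5200, 5100], [4700, 5600, 4800])

print(stable, drifted)
```

In practice the threshold would be chosen per application, and the check would run on a schedule against fresh production data.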
Now, let’s see how automated ML helps build an ML model. The first step is to install the set of Azure Machine Learning Python libraries. The automated ML libraries are included in the Azure ML libraries and will be used in your ML scripts.
As a developer, if you’re familiar with languages such as C# or Java, you should be able to get up to speed in Python in no time. You can install the Azure ML SDK, which includes the automated ML libraries, on your own computer. For simplicity, however, we’ll use Azure Notebooks, an Azure service that hosts Jupyter Notebooks. Jupyter Notebook is an interactive notebook environment popular with data scientists for ML solution development. You can learn more about using Jupyter Notebooks from Frank La Vigne’s Artificially Intelligent column in the February 2018 issue of MSDN Magazine (msdn.com/magazine/mt829269).
Automated ML libraries are free to use and come pre-installed on the Azure Notebooks service, eliminating the need for a setup process. Sign in to Azure Notebooks using your Microsoft Account. If you already have an Azure subscription, you can use your Azure Active Directory (Azure AD) account (which may be the same as your corporate credentials if your organization is federated with Azure AD).
Use your Azure subscription ID to set up an ML workspace, which you can share with teammates for collaboration. You can create the workspace through the Azure Portal or through the API, like so:
from azureml.core import Workspace

# Create the workspace, or reuse it if it already exists
ws = Workspace.create(name = "energy_ml_ws",
                      subscription_id = "<azure_subscription_id>",
                      resource_group = "energy_ml_rg",
                      location = "eastus",
                      create_resource_group = True,
                      exist_ok = True)
Energy Demand Forecasting
With workspace creation complete, let’s work on the business problem—an energy demand forecasting example that features time series data. Time series forecasting is the task of predicting future values in a time-ordered sequence of observations. It’s a common problem and has applications in many industries. For example, retail companies need to forecast future product sales so they can organize their supply chains to meet demand. Similarly, package delivery companies need to estimate the demand for their services so they can plan workforce requirements and delivery routes ahead of time. In many cases, the financial impact due to inaccurate forecasts can be significant, making forecasting a business-critical activity.
This is certainly true for energy utilities, which must maintain a fine balance between the energy consumed on the grid and the energy supplied to it. Purchase too much power and the operator must store the excess energy, which is expensive. Purchase too little and it can lead to blackouts, leaving customers in the dark. Grid operators can make short-term decisions to manage energy supply to the grid and keep the load in balance, but an accurate forecast of energy demand is essential for operators to make these decisions with confidence.
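Before reaching for ML, it can help to see what the simplest possible time series forecast looks like. The following sketch implements a seasonal naive baseline, which repeats the most recently observed season. It is not the method used in this article, the series is synthetic, and the season length of 4 is chosen only to keep the example short (hourly data with a daily cycle would use 24):

```python
def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast `horizon` future steps by repeating the last observed season."""
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


# Synthetic time-ordered observations covering two seasons of length 4.
history = [10, 20, 30, 20, 11, 21, 31, 21]
forecast = seasonal_naive_forecast(history, season_length=4, horizon=6)
print(forecast)  # [11, 21, 31, 21, 11, 21]
```

A baseline like this gives a reference point: a trained model is only worth deploying if it forecasts better than simply repeating the last cycle.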
Let’s walk through how you can use ML to build an energy demand forecasting solution. There’s a public dataset available from the New York Independent System Operator (NYISO), which operates the power grid for New York State. The dataset has hourly power demand data for New York City over a period of five years.
The NYISO dataset has a timestamp and energy demand in MWh at hourly increments from 2012 to 2017. To inspect and explore the data, let’s first read the data that’s in Azure Blob Storage into a Pandas dataframe. Pandas dataframes are popular with the ML community and offer an easy way to manipulate data. Here’s the code for the import:
import pandas as pd

# Read the hourly demand data, parsing the timeStamp column as datetimes
demand = pd.read_csv(
    "https://antaignitedata.blob.core.windows.net/antaignitedata/nyc_demand.csv",
    parse_dates=['timeStamp'])
Let’s print the first few rows of the data by invoking the head method on the dataframe, which produces a simple table with timestamps and energy demand values. If I plot energy demand over a week in July 2016, the result is a line chart that shows fluctuating hourly values over the course of the week. For this I use matplotlib, a library that makes it easy to plot charts within Jupyter Notebook. Here’s the code:
import matplotlib.pyplot as plt

# Select roughly one week of hourly demand in July 2016
plt_df = demand.loc[(demand.timeStamp > '2016-07-01') & (demand.timeStamp <= '2016-07-07')]
plt.plot(plt_df['timeStamp'], plt_df['demand'])
plt.title('New York City power demand over one week in July 2016')
plt.xticks(rotation=45)
plt.show()
Of course, weather has a direct impact on energy consumption. On a hot day, use of air conditioning will significantly increase demand for electricity. So an additional dataset containing hourly weather conditions in New York City over the same time period can be used to improve prediction. I can access weather data from darksky.net. Now, I’ll read the weather dataset into a dataframe and augment the NYISO dataset with weather data by merging the two datasets, using this code:
weather = pd.read_csv(
    "https://antaignitedata.blob.core.windows.net/antaignitedata/nyc_weather.csv",
    parse_dates=['timeStamp'])

# Combine demand and weather on the shared timeStamp column
df = pd.merge(demand, weather, on=['timeStamp'], how='outer')
The resulting merged dataset will look something similar to the table in Figure 2, effectively adding weather information—precipitation and temperatures—to the NYISO dataset. To see the first few rows of the data, invoke the head method on the dataframe.
Figure 2 Energy Demand Dataset Merged with Weather Dataset
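An outer merge keeps every timestamp that appears in either dataset, so rows present in only one of them end up with missing values. The sketch below shows this on two tiny synthetic dataframes; the column names mirror the article, but the values are invented for illustration:

```python
import pandas as pd

# Tiny synthetic frames; the numbers are made up.
demand_small = pd.DataFrame({
    "timeStamp": pd.to_datetime(
        ["2016-07-01 00:00", "2016-07-01 01:00", "2016-07-01 02:00"]),
    "demand": [5000.0, 4800.0, 4700.0],
})
weather_small = pd.DataFrame({
    "timeStamp": pd.to_datetime(
        ["2016-07-01 01:00", "2016-07-01 02:00", "2016-07-01 03:00"]),
    "temp": [71.2, 70.5, 69.9],
})

# how="outer" keeps timestamps from both sides and fills the gaps with NaN.
merged = pd.merge(demand_small, weather_small, on=["timeStamp"], how="outer")
merged = merged.sort_values("timeStamp").reset_index(drop=True)

# Count the missing values the merge introduced in each column.
missing_per_column = merged.isna().sum()
print(merged)
print(missing_per_column)
```

Here the 00:00 row has no temperature and the 03:00 row has no demand, which is the kind of gap that later has to be imputed or dropped.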
Now we come to the modeling step, which involves three disciplines: feature engineering, model training and model evaluation. This is where data scientists spend most of their time, and it’s where automated ML can really help simplify things.
The dataset I’m using is time series data, which requires me to generate features. Typical features for time series data are date time features, lag features and window features. To keep things simple, I won’t go into lag features or window features. Automated ML automatically generates date time features, eliminating the need to generate them manually. It also performs many data preprocessing tasks. For example, automated ML can impute missing values in the dataset. If I inspect the dataset further, I’ll find that there are a few missing rows and a few missing values as well. Row 49175, for example, contains a missing demand value, indicated by “nan,” which stands for “not a number.”
Inspecting that row produces the following output:
array([Timestamp('2017-08-11 01:00:00'), nan, 0.0, 69.26], dtype=object)
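To give a feel for what date time feature generation and imputation involve, here is a small standard-library sketch. The specific features chosen and the use of mean imputation are assumptions made for illustration; the article doesn’t state which features or imputation strategy automated ML applies internally:

```python
from datetime import datetime


def datetime_features(ts):
    """Derive simple calendar features from a timestamp."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday is 0, Sunday is 6
        "month": ts.month,
        "is_weekend": ts.weekday() >= 5,
    }


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# The timestamp of the row with the missing demand value shown above.
features = datetime_features(datetime(2017, 8, 11, 1, 0))

# A hypothetical demand column with one missing entry.
filled = impute_mean([6000.0, None, 6200.0])

print(features)
print(filled)
```

Features such as hour of day and day of week let a model pick up the daily and weekly cycles that are typical of energy demand.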
Next, I need to split the dataset into a training dataset and a test dataset. The training dataset is used to build the model. Once the model is trained, I test it against the test dataset to evaluate its performance and to ensure it’s satisfactory before deployment. The split can be random, but because energy consumption data is seasonal and time-ordered, I want to split it strategically, by time. A small portion of the training dataset is also set aside as a validation dataset. The validation dataset is used to check that the model generalizes and isn’t overfitted to the training data. This helps make sure the model performs well on new data and not just on the training data.
Now, the column that I’m working to predict is demand. I move the demand values into their own vector “y,” which is called the label column. I’m setting aside as my test dataset any data from July 1, 2016, onward. All data older than July 1, 2016, is the training data. The code here shows this:
# Rows before July 1, 2016 are used for training; the rest for testing
train, test = (df.loc[df['timeStamp'] < '2016-07-01'],
               df.loc[df['timeStamp'] >= '2016-07-01'])
The data that was read into the dataframe has the features and label (the column I’m trying to predict). I need to move the label column into its own vector y, like so:
X = train.drop(['demand'], axis=1)  # features
y = train['demand']                 # label
y = y.values
Set aside a validation dataset within the training dataset. With time series data, the validation dataset typically covers a specific period. Here’s how the split is done:
# Use the first 90 percent of the rows for training and the rest for validation
split_index = int(X.shape[0] * 0.9)
X_train = X[0:split_index]
y_train = y[0:split_index]
X_validation = X[split_index:]
y_validation = y[split_index:]
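The same idea can be written as a small reusable helper. This is a sketch under the assumption that the rows are already sorted by time; the function name and the 90 percent default are illustrative:

```python
def chronological_split(rows, train_fraction=0.9):
    """Split time-ordered rows so that earlier rows train and later rows validate."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    split_index = int(len(rows) * train_fraction)
    return rows[:split_index], rows[split_index:]


# Stand-in for 20 time-ordered observations.
rows = list(range(20))
train_rows, validation_rows = chronological_split(rows)

print(len(train_rows), validation_rows)  # 18 [18, 19]
```

Because no shuffling takes place, every validation row is later than every training row, so the model is always validated on data from its own future.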
The feature engineering part of the Modeling stage is now complete.
The next choice a user faces is deciding which algorithm works for the dataset. Complexity can be a problem here as there are many algorithms to choose from, including support vector machine (SVM), lasso regression, ridge regression and more. What’s more, each algorithm has hyperparameters that must be tuned. Hyperparameters for each of the algorithms can be an infinite set, which means the combination of algorithms and hyperparameters is itself infinite. So, theoretically you would need to build an infinite number of models to find the best one.
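Even a coarse, finite grid shows how quickly the search space grows. The sketch below enumerates every combination in a small hypothetical search space; the algorithm names match those mentioned above, but the hyperparameter names and values are assumptions chosen for illustration:

```python
from itertools import product

# Hypothetical, deliberately tiny grids for three algorithms.
search_space = {
    "ridge": {"alpha": [0.01, 0.1, 1.0, 10.0]},
    "lasso": {"alpha": [0.01, 0.1, 1.0, 10.0], "max_iter": [1000, 5000]},
    "svm": {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"], "epsilon": [0.01, 0.1]},
}


def enumerate_candidates(space):
    """List every (algorithm, hyperparameter settings) pair in the grid."""
    candidates = []
    for algorithm, grid in space.items():
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            candidates.append((algorithm, dict(zip(names, values))))
    return candidates


candidates = enumerate_candidates(search_space)
print(len(candidates))  # 4 + 8 + 12 = 24
```

Three algorithms with a handful of values each already produce 24 models to train and compare. Continuous hyperparameters and more algorithms make exhaustive search impractical.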
Users combine features, learners (algorithms) and hyperparameters to build multiple models for a given business problem, so they can ultimately find one that yields optimal accuracy. Building many models and evaluating them is a manual, time-consuming and tedious task that can take weeks or even months. On top of that you need to maintain the solutions and the models that are deployed. As data evolves, you need to periodically perform the model building process again. Building a single model involves the tasks of feature engineering, algorithm selection and hyperparameter tuning. To get a good performing model, you need to build and evaluate several models and perform these tedious tasks repeatedly.
For those who are experienced in data science, automated ML improves productivity and saves time by performing feature engineering, algorithm selection and hyperparameter tuning for you. For new data scientists, abstracting away algorithm selection and hyperparameter tuning reduces complexity and helps you build ML solutions quickly.
As a user I would like to save all the ML models I’ve created, as well as the history of all the training jobs, so I have a record of how I got here and can refer back to it when needed. To do this, let’s create a project folder where all the artifacts get stored, and an experiment object to associate all the run history of automated ML training jobs. First, I create a project folder to store all the project-related files, like so:
project_folder = './sample_projects/automl-energydemandforecasting'
Then, I’ll choose a name for the run history container in the workspace, with this code:
experiment_name = 'energy_ml_exp'
Next, I’ll create an experiment object and associate it with the workspace. Here’s the code:
from azureml.core.experiment import Experiment

experiment = Experiment(ws, experiment_name)
From there, automated ML needs only two steps to build the models: configuring the AutoMLConfig object and submitting the experiment. Figure 3 shows the conceptual representation of automated ML.
Figure 3 Conceptual Automated Machine Learning Diagram
Now, let’s configure the automated ML settings. Here’s the code:
from azureml.train.automl import AutoMLConfig

automl_config_local = AutoMLConfig(task = 'regression',
                                   debug_log = 'automl_errors.log',
                                   primary_metric = 'spearman_correlation',
                                   iterations = 150,
                                   X = X_train,
                                   y = y_train,
                                   X_valid = X_validation,
                                   y_valid = y_validation,
                                   preprocess = True,
                                   path = project_folder)
Notice in the config settings that I set iterations = 150. As a result, automated ML will generate 150 different combinations of algorithms and hyperparameters, which culminate in 150 different models. Automated ML itself uses an ML model that was trained with millions of pipelines. Using matrix factorization techniques, automated ML generates the algorithm and hyperparameter combinations in an intelligent way, leading to much faster convergence on an optimal model. Also, setting preprocess = True triggers automated ML to perform data preprocessing and feature engineering tasks. These include generating the date time features, imputing missing values, converting categorical values into one-hot encoding, and the like, all of which eliminate the need for a data scientist to do these manually.
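One-hot encoding, one of the preprocessing steps just mentioned, turns a categorical column into one 0/1 column per category. Here is a minimal sketch of the idea; the weather categories are hypothetical and this is not how automated ML implements the transformation internally:

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


# Hypothetical categorical column describing weather conditions.
categories, encoded = one_hot_encode(["rain", "clear", "snow", "clear"])

print(categories)  # ['clear', 'rain', 'snow']
print(encoded)     # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

Each row contains exactly one 1, so algorithms that expect numeric input can consume the column without implying any ordering between the categories.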
Now you can complete the second step, which is submitting the automated ML experiment, like so:
local_run = experiment.submit(automl_config_local, show_output=True)
At this point automated ML generates a set of algorithm and hyperparameter combinations. Training of the models is done on an Azure virtual machine managed by the Azure Notebooks service. Automated ML gives you the choice of running the model training jobs on your local computer or in the Azure cloud, where you can scale up and scale out as needed for additional performance, for example by running the iterations in parallel on a cluster. You can monitor the progress of the runs in the Azure Portal, or in the Jupyter Notebook through the widget extension that comes with the SDK. Automated ML evaluates the generated models based on your criteria, and can render a leaderboard that shows the models in order of performance.
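The leaderboard idea can be illustrated with the primary metric chosen in the configuration, Spearman correlation, which is the Pearson correlation of the ranks of two series. The sketch below computes it from scratch and ranks three hypothetical candidates. The model names and predictions are invented, and this isn’t the code automated ML uses to score runs:

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)


def spearman(actual, predicted):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(actual), average_ranks(predicted))


actual = [10, 20, 30, 40, 50]
predictions = {
    "model_a": [12, 18, 33, 41, 48],  # same ordering as the actual values
    "model_b": [50, 40, 30, 20, 10],  # completely reversed ordering
    "model_c": [12, 30, 20, 41, 48],  # one adjacent pair swapped
}

# Sort candidates from best to worst score, like a leaderboard.
leaderboard = sorted(
    ((name, spearman(actual, preds)) for name, preds in predictions.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in leaderboard:
    print(name, round(score, 3))
```

Because Spearman correlation looks only at ordering, it rewards a model that tracks the rise and fall of demand even when the absolute values are off, so it is worth checking an error metric as well.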
You can also review a variety of charts for each of the models and inspect different metrics to help make decisions. In cases where accuracy is extremely important, you can further tune the generated model manually to improve its performance.
All the models generated by automated ML are stored in durable storage in Azure as serialized Python objects in the pickle (.pkl) format.
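For readers new to pickle, the following sketch shows the round trip of saving a Python object to a .pkl file and loading it back. A plain dictionary stands in for a trained model here, and the file name is arbitrary:

```python
import os
import pickle
import tempfile

# A stand-in for a trained model: just the parameters of a straight line.
model = {"kind": "linear", "slope": 100.0, "intercept": 100000.0}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.pkl")

    # Serialize the object to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Deserialize it again, as a scoring script would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

prediction = restored["slope"] * 1500 + restored["intercept"]
print(restored == model, prediction)
```

Unpickling can execute arbitrary code, so .pkl files should be loaded only from sources you trust.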
Model Evaluation and Testing
Now that automated ML has generated a high-quality model, let’s use it to run predictions on the test data that was set aside. First, I need to select the best model from all the models that were generated, using the following code:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)
That yields the following output, describing the best model generated by automated ML:
Run(Experiment: enery_ml_exp,
Id: AutoML_dc619226-fbdb-41e6-872c-6c59ab2b5209_1,
Type: None,
Status: Completed)
Pipeline(memory=None,
  steps=[('datatransformer', DataTransformer(logger=None, task=None)),
    ('sparsenormalizer', <automl.client.core.common.model_wrappers.SparseNormalizer object at 0x7f7f87ee3828>),
    ('decisiontreeregressor', DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=0.2, max_leaf_node...
      min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best'))])
Notice that although I’ve been calling these models, they are in fact pipelines that include all the preprocessing transformations through which the data must go. Now let’s test the model by running predictions on the test data to produce a charted comparison of actual data against predicted data. Use the following code:
# Separate the test features from the actual demand values
x_test = test.drop(['demand'], axis=1)
y_test = test['demand']

# Predict with the best pipeline and compute the residuals
y_pred_test = fitted_model.predict(x_test)
y_residual_test = y_test - y_pred_test

# Plot actual against predicted demand
plt.plot(x_test['timeStamp'], y_test, label='Actual')
plt.plot(x_test['timeStamp'], y_pred_test, label="Predicted")
plt.xticks(rotation=90)
plt.title('Actual demand vs predicted for test data ')
plt.legend()
plt.show()
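A chart gives a visual impression; a couple of summary numbers make the comparison easier to track over time. The sketch below computes residuals, root mean squared error (RMSE) and mean absolute percentage error (MAPE) on a few made-up values. These metrics are common choices for forecasting, not ones the article reports, and the numbers are purely illustrative:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual value is zero."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


# Hypothetical actual and predicted demand values.
actual_demand = [100.0, 200.0, 400.0]
predicted_demand = [110.0, 190.0, 400.0]

residuals = [a - p for a, p in zip(actual_demand, predicted_demand)]
rmse_value = rmse(actual_demand, predicted_demand)
mape_value = mape(actual_demand, predicted_demand)

print(residuals)              # [-10.0, 10.0, 0.0]
print(round(rmse_value, 3))   # 8.165
print(round(mape_value, 3))   # 5.0
```

The same functions could be applied to y_test and y_pred_test from the code above after converting them to lists or arrays with no missing values.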
With a little more feature engineering and the use of lag features, this model can be improved even further. Moving forward, we can expect automated ML to evolve to automate more and more areas of feature engineering, including lag feature generation.
Automated ML lets you build an ML model with just a few simple steps. We’re in the early stages of making ML model building accessible to developers to truly democratize AI. You can find many more samples and tutorials of using automated ML at aka.ms/AutomatedMLDocs.
Krishna Anumalasetty has been working as a program manager in Azure and cloud services for the last seven years, with four of those in ML and AI. Anumalasetty is the program manager who incubated automated ML as part of the AI stack in Azure. He has worked on enabling enterprise excellence capabilities, such as encryption-at-rest data protection, VNET capabilities, hybrid scenarios with on-premises connectivity, Azure Active Directory integration and more. He is currently working on enriching the automated ML capabilities with deep learning.
Thanks to the following Microsoft technical experts for reviewing this article: Sujatha Sagiraju, Bharat Sandhu
Bharat Sandhu is the Director of product marketing for Azure AI & Advanced Analytics. He is focused on enabling organizations to unlock the potential of AI to transform their businesses. Prior to his current role, he held roles in enterprise sales and business development for various emerging technologies at Microsoft and National Instruments. Bharat holds a B.S. in Computer Engineering from Texas A&M University and an MBA from INSEAD.
Sujatha Sagiraju is a Group Program Manager in the Azure Cloud & AI group and is a Microsoft veteran. Her expertise is in building large scale distributed systems. Her latest passion is accelerating and democratizing AI via Automated Machine Learning.