Using linear regression in Azure Machine Learning
Kate Baroni and Ben Boatman are enterprise solution architects in Microsoft’s Data Insights Center of Excellence. In this article, they describe their experience migrating an existing regression analysis suite to a cloud-based solution using Azure Machine Learning.
Try Azure Machine Learning for free
No credit card or Azure subscription required. Get started now.
Our project started with two goals in mind:
- Use predictive analytics to improve the accuracy of our organization’s monthly revenue projections
- Use Azure Machine Learning to confirm, optimize, increase velocity, and scale of our results.
Like many businesses, our organization goes through a monthly revenue forecasting process. Our small team of business analysts was tasked with using Azure Machine Learning to support the process and improve forecast accuracy. The team spent several months collecting data from multiple sources and running the data attributes through statistical analysis identifying key attributes relevant to services sales forecasting. The next step was to begin prototyping statistical regression models on the data in Excel. Within a few weeks, we had an Excel regression model that was outperforming the current field and finance forecasting processes. This became the baseline prediction result.
We then took the next step to moving our predictive analytics over to Azure Machine Learning to find out how Machine Learning could improve on predictive performance.
Achieving predictive performance parity
Our first priority was to achieve parity between Machine Learning and Excel regression models. Given the same data, and the same split for training and testing data, we wanted to achieve predictive performance parity between Excel and Machine Learning. Initially we failed. The Excel model outperformed the Machine Learning model. The failure was due to a lack of understanding of the base tool setting in Machine Learning. After a sync with the Machine Learning product team, we gained a better understanding of the base setting required for our data sets, and achieved parity between the two models.
Create regression model in Excel
Our Excel Regression used the standard linear regression model found in the Excel Analysis ToolPak.
We calculated Mean Absolute % Error and used it as the performance measure for the model. It took 3 months to arrive at a working model using Excel. We brought much of the learning into the Machine Learning Studio experiment which ultimately was beneficial in understanding requirements.
Create comparable experiment in Azure Machine Learning
We followed these steps to create our experiment in Machine Learning Studio:
- Uploaded the dataset as a csv file to Machine Learning Studio (very small file)
- Created a new experiment and used the Select Columns in Dataset module to select the same data features used in Excel
- Used the Split Data module (with Relative Expression mode) to divide the data into the same training datasets as had been done in Excel
- Experimented with the Linear Regression module (default options only), documented, and compared the results to our Excel regression model
Review initial results
At first, the Excel model clearly outperformed the Machine Learning Studio model:
|Mean Absolute Error||$9.5M||$ 19.4M|
|Mean Absolute Error (%)||6.03%||12.2%|
When we ran our process and results by the developers and data scientists on the Machine Learning team, they quickly provided some useful tips.
- When you use the Linear Regression module in Machine Learning Studio, two methods are provided:
- Online Gradient Descent: May be more suitable for larger-scale problems
- Ordinary Least Squares: This is the method most people think of when they hear linear regression. For small datasets, Ordinary Least Squares can be a more optimal choice.
- Consider tweaking the L2 Regularization Weight parameter to improve performance. It is set to 0.001 by default, but for our small data set we set it to 0.005 to improve performance.
When we applied the recommendations, we achieved the same baseline performance in Machine Learning Studio as with Excel:
|Excel||Studio (Initial)||Studio w/ Least Squares|
|Labeled value||Actuals (numeric)||same||same|
|Learner||Excel -> Data Analysis -> Regression||Linear Regression.||Linear Regression|
|Learner options||N/A||Defaults||ordinary least squares
L2 = 0.005
|Data Set||26 rows, 3 features, 1 label. All numeric.||same||same|
|Split: Train||Excel trained on the first 18 rows, tested on the last 8 rows.||same||same|
|Split: Test||Excel regression formula applied to the last 8 rows||same||same|
|Adjusted R Square||0.96||N/A|
|Coefficient of Determination||N/A||0.78||0.952049|
|Mean Absolute Error||$9.5M||$ 19.4M||$9.5M|
|Mean Absolute Error (%)||6.03%||12.2%||6.03%|
In addition, the Excel coefficients compared well to the feature weights in the Azure trained model:
|Excel Coefficients||Azure Feature Weights|
We wanted to consume the Machine Learning web service within Excel. Our business analysts rely on Excel and we needed a way to call the Machine Learning web service with a row of Excel data and have it return the predicted value to Excel.
We also wanted to optimize our model, using the options and algorithms available in Machine Learning Studio.
Integration with Excel
Our solution was to operationalize our Machine Learning regression model by creating a web service from the trained model. Within a few minutes, the web service was created and we could call it directly from Excel to return a predicted revenue value.
The Web Services Dashboard section includes a downloadable Excel workbook. The workbook comes pre-formatted with the web service API and schema information embedded. When you click Download Excel Workbook, the workbook opens and you can save it to your local computer.
With the workbook open, copy your predefined parameters into the blue Parameter section as shown below. Once the parameters are entered, Excel calls out to the Machine Learning web service and the predicted scored labels will display in the green Predicted Values section. The workbook will continue to create predictions for parameters based on your trained model for all row items entered under Parameters. For more information on how to use this feature, see Consuming an Azure Machine Learning Web Service from Excel.
Optimization and further experiments
Now that we had a baseline with our Excel model, we moved ahead to optimize our Machine Learning Linear Regression Model. We used the module Filter-Based Feature Selection to improve on our selection of initial data elements and it helped us achieve a performance improvement of 4.6% Mean Absolute Error. For future projects we will use this feature which could save us weeks in iterating through data attributes to find the right set of features to use for modelling.
If you want to experiment with regression, a good dataset to try is the Energy Efficiency Regression sample dataset, which has lots of numerical attributes. The dataset is provided as part of the sample datasets in Machine Learning Studio. You can use a variety of learning modules to predict either Heating Load or Cooling Load. The chart below is a performance comparison of different regression learns against the Energy Efficiency dataset predicting for the target variable Cooling Load:
|Model||Mean Absolute Error||Root Mean Squared Error||Relative Absolute Error||Relative Squared Error||Coefficient of Determination|
|Boosted Decision Tree||0.930113||1.4239||0.106647||0.021662||0.978338|
|Linear Regression (Gradient Descent)||2.035693||2.98006||0.233414||0.094881||0.905119|
|Neural Network Regression||1.548195||2.114617||0.177517||0.047774||0.952226|
|Linear Regression (Ordinary Least Squares)||1.428273||1.984461||0.163767||0.042074||0.957926|
We learned a lot by from running Excel regression and Azure Machine Learning experiments in parallel. Creating the baseline model in Excel and comparing it to models using Machine Learning Linear Regression helped us learn Azure Machine Learning, and we discovered opportunities to improve data selection and model performance.
We also found that it is advisable to use Filter-Based Feature Selection to accelerate future prediction projects. By applying feature selection to your data, you can create an improved model in Machine Learning with better overall performance.
The ability to transfer the predictive analytic forecasting from Machine Learning to Excel systemically allows a significant increase in the ability to successfully provide results to a broad business user audience.
Here are some resources for helping you work with regression:
- Regression in Excel. If you’ve never tried regression in Excel, this tutorial makes it easy: http://www.excel-easy.com/examples/regression.html
- Regression vs forecasting. Tyler Chessman wrote a blog article explaining how to do time series forecasting in Excel, which contains a good beginner’s description of linear regression. http://sqlmag.com/sql-server-analysis-services/understanding-time-series-forecasting-concepts
- Ordinary Least Squares Linear Regression: Flaws, Problems and Pitfalls. For an introduction and discussion of Regression: http://www.clockbackward.com/2009/06/18/ordinary-least-squares-linear-regression-flaws-problems-and-pitfalls/