Analyze data with Azure Machine Learning
This tutorial uses Azure Machine Learning to build a predictive machine learning model based on data stored in Azure SQL Data Warehouse. Specifically, the tutorial builds a targeted marketing campaign for Adventure Works, the bike shop, by predicting whether a customer is likely to buy a bike.
To step through this tutorial, you need:
- A SQL Data Warehouse pre-loaded with AdventureWorksDW sample data. To provision this, see Create a SQL Data Warehouse and choose to load the sample data. If you already have a data warehouse but do not have sample data, you can load sample data manually.
1. Get the data
The data is in the dbo.vTargetMail view in the AdventureWorksDW database. To read this data:
- Sign in to Azure Machine Learning studio and click My Experiments.
- Click +NEW on the bottom left of the screen and select Blank Experiment.
- Enter a name for your experiment: Targeted Marketing.
- Drag the Import Data module under Data Input and Output from the modules pane onto the canvas.
- Specify the details of your SQL Data Warehouse database in the Properties pane.
- Specify the database query to read the data of interest.
SELECT [CustomerKey]
  ,[GeographyKey]
  ,[CustomerAlternateKey]
  ,[MaritalStatus]
  ,[Gender]
  ,cast ([YearlyIncome] as int) as SalaryYear
  ,[TotalChildren]
  ,[NumberChildrenAtHome]
  ,[EnglishEducation]
  ,[EnglishOccupation]
  ,[HouseOwnerFlag]
  ,[NumberCarsOwned]
  ,[CommuteDistance]
  ,[Region]
  ,[Age]
  ,[BikeBuyer]
FROM [dbo].[vTargetMail]
Run the experiment by clicking Run under the experiment canvas.
After the experiment finishes running successfully, click the output port at the bottom of the Import Data module and select Visualize to see the imported data.
2. Clean the data
To clean the data, drop some columns that are not relevant for the model. To do this:
- Drag the Select Columns in Dataset module under Data Transformation > Manipulation onto the canvas. Connect this module to the Import Data module.
- Click Launch column selector in the Properties pane to specify which columns you wish to drop.
- Exclude two columns: CustomerAlternateKey and GeographyKey.
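Outside the studio, the same column exclusion can be sketched in a few lines of Python. This is only an illustration of what the step does, not how the Select Columns in Dataset module is implemented; the sample rows and the helper name drop_columns are made up for the example.

```python
# Hypothetical sample rows shaped like a few columns of dbo.vTargetMail.
# The values are invented for illustration only.
rows = [
    {"CustomerKey": 1, "GeographyKey": 10, "CustomerAlternateKey": "A1",
     "Age": 40, "BikeBuyer": 1},
    {"CustomerKey": 2, "GeographyKey": 20, "CustomerAlternateKey": "A2",
     "Age": 55, "BikeBuyer": 0},
]

# The two columns the tutorial excludes from the dataset.
EXCLUDED = {"CustomerAlternateKey", "GeographyKey"}


def drop_columns(rows, excluded):
    """Return copies of the rows without the excluded columns."""
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


cleaned = drop_columns(rows, EXCLUDED)
print(sorted(cleaned[0]))  # remaining column names of the first row
```

The original rows are left unchanged, which mirrors how the module passes a new dataset to the next step instead of modifying its input.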
3. Build the model
We will split the data 80-20: 80% to train a machine learning model and 20% to test the model. We will make use of the “Two-Class” algorithms for this binary classification problem.
- Drag the Split Data module onto the canvas.
- In the Properties pane, enter 0.8 for Fraction of rows in the first output dataset.
- Drag the Two-Class Boosted Decision Tree module into the canvas.
- Drag the Train Model module onto the canvas and specify its inputs by connecting it to the Two-Class Boosted Decision Tree (ML algorithm) and Split Data (data to train the algorithm on) modules.
- Then, click Launch column selector in the Properties pane. Select the BikeBuyer column as the column to predict.
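To make the 80-20 split concrete, here is a minimal Python sketch of dividing rows into a training set and a test set. It assumes a randomized split with a fixed seed; the function name split_rows and the stand-in data are hypothetical, and the sketch does not claim to reproduce the module's exact behavior.

```python
import random


def split_rows(rows, fraction=0.8, seed=0):
    """Shuffle the rows and split them into (first, second) output lists.

    `fraction` plays the role of "Fraction of rows in the first output
    dataset"; the seed is an assumption made so the example is repeatable.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 customer rows
train, test = split_rows(rows, fraction=0.8)
print(len(train), len(test))
```

Every row lands in exactly one of the two outputs, so the model is never tested on rows it was trained on.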
4. Score the model
Now, we will test how the model performs on test data. We will compare the algorithm of our choice with a different algorithm to see which performs better.
- Drag the Score Model module onto the canvas and connect it to the Train Model and Split Data modules.
- Drag the Two-Class Bayes Point Machine module onto the experiment canvas. We will compare this algorithm's performance with that of the Two-Class Boosted Decision Tree.
- Copy and paste the Train Model and Score Model modules on the canvas.
- Drag the Evaluate Model module into the canvas to compare the two algorithms.
- Run the experiment.
- Click the output port at the bottom of the Evaluate Model module and click Visualize.
The metrics provided are the ROC curve, the precision-recall diagram, and the lift curve. Looking at these metrics, we can see that the first model performed better than the second one. To look at what the first model predicted, click the output port of the Score Model module and click Visualize.
You will see two more columns added to your test dataset.
- Scored Probabilities: the likelihood that a customer is a bike buyer.
- Scored Labels: the classification done by the model: bike buyer (1) or not (0). The probability threshold for labeling is set to 50% and can be adjusted.
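The relationship between the two columns can be sketched in Python as follows. The probabilities are invented, the function name scored_labels is hypothetical, and treating a probability exactly at the threshold as a bike buyer is an assumption made for this example.

```python
def scored_labels(probabilities, threshold=0.5):
    """Turn scored probabilities into 0/1 labels.

    Assumption: a probability equal to the threshold counts as a buyer (1).
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical scored probabilities for four customers.
probs = [0.91, 0.50, 0.49, 0.12]

default_labels = scored_labels(probs)               # 50% threshold
strict_labels = scored_labels(probs, threshold=0.9)  # stricter threshold
print(default_labels, strict_labels)
```

Raising the threshold labels fewer customers as bike buyers, which is useful when a campaign should target only the most likely buyers.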
Comparing the column BikeBuyer (actual) with the Scored Labels (prediction), you can see how well the model has performed. As next steps, you can use this model to make predictions for new customers and publish this model as a web service or write results back to SQL Data Warehouse.
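As a rough illustration of comparing actual values with predictions, the following Python sketch counts matches and mismatches and derives accuracy, precision, and recall. The two lists are invented sample values, and the helper names are hypothetical; the Evaluate Model module reports these and other metrics for you.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(actual, predicted):
    """Return accuracy, precision, and recall (0.0 when undefined)."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical BikeBuyer (actual) and Scored Labels (predicted) values.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

metrics = summarize(actual, predicted)
print(metrics)
```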
To learn more about building predictive machine learning models, refer to Introduction to Machine Learning on Azure.