Tutorial: Predict automobile price with the designer (preview)
APPLIES TO: Enterprise edition only. Basic edition workspaces must be upgraded to Enterprise.
In this two-part tutorial, you learn how to use the Azure Machine Learning designer to develop and deploy a predictive analytics solution that predicts the price of any car.
In part one of the tutorial, you'll learn how to:
- Create a new pipeline.
- Import data.
- Prepare data.
- Train a machine learning model.
- Evaluate a machine learning model.
In part two of the tutorial, you'll deploy your model as a real-time inferencing endpoint to predict the price of any car based on technical specifications you send it.
A completed version of this tutorial is available as a sample pipeline.
To find it, go to the designer in your workspace. In the New pipeline section, select Sample 1 - Regression: Automobile Price Prediction (Basic).
Create a new pipeline
Azure Machine Learning pipelines organize multiple machine learning and data processing steps into a single resource. Pipelines let you organize, manage, and reuse complex machine learning workflows across projects and users.
To create an Azure Machine Learning pipeline, you need an Azure Machine Learning workspace. In this section, you learn how to create both of these resources.
Create a new workspace
If you have an Azure Machine Learning workspace with an Enterprise edition, skip to the next section.
Sign in to the Azure portal by using the credentials for your Azure subscription.
In the upper-left corner of the Azure portal, select + Create a resource.
Use the search bar to find Machine Learning.
Select Machine Learning.
In the Machine Learning pane, select Create to begin.
Provide the following information to configure your new workspace:
Field: Description
Workspace name: Enter a unique name that identifies your workspace. In this example, we use docs-ws. Names must be unique across the resource group. Use a name that's easy to recall and to differentiate from workspaces created by others.
Subscription: Select the Azure subscription that you want to use.
Resource group: Use an existing resource group in your subscription, or enter a name to create a new resource group. A resource group holds related resources for an Azure solution. In this example, we use docs-aml.
Location: Select the location closest to your users and the data resources to create your workspace.
Workspace edition: Select Enterprise. This tutorial requires the use of the Enterprise edition. The Enterprise edition is in preview and doesn't currently add any extra costs.
After you're finished configuring the workspace, select Create.
It can take several minutes to create your workspace in the cloud.
When the process is finished, a deployment success message appears.
To view the new workspace, select Go to resource.
Create the pipeline
Sign in to ml.azure.com, and select the workspace you want to work with.
Select Easy-to-use prebuilt modules.
At the top of the canvas, select the default pipeline name Pipeline-Created-on. Rename it to Automobile price prediction. The name doesn't need to be unique.
There are several sample datasets included in the designer for you to experiment with. For this tutorial, use Automobile price data (Raw).
To the left of the pipeline canvas is a palette of datasets and modules. Select Datasets, and then expand the Samples section to see the available sample datasets.
Select the dataset Automobile price data (Raw), and drag it onto the canvas.
Visualize the data
You can visualize the data to understand the dataset that you'll use.
Select the Automobile price data (Raw) module.
In the properties pane to the right of the canvas, select Outputs.
Select the graph icon to visualize the data.
Select the different columns in the data window to view information about each one.
Each row represents an automobile, and the variables associated with each automobile appear as columns. There are 205 rows and 26 columns in this dataset.
Datasets typically require some preprocessing before analysis. You might have noticed some missing values when you inspected the dataset. These missing values must be cleaned so that the model can analyze the data correctly.
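Outside the designer, the same inspection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows below are hypothetical stand-ins for the real 205-row dataset, None is assumed to mark a missing value, and count_missing is a helper name invented for this example.

```python
# Hypothetical sample rows; the real dataset has 205 rows and 26 columns.
# None stands in for a missing value (an assumption about the encoding).
rows = [
    {"make": "audi", "normalized-losses": 164, "horsepower": 102, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "horsepower": 101, "price": 16430},
    {"make": "mazda", "normalized-losses": None, "horsepower": 68, "price": None},
    {"make": "volvo", "normalized-losses": 95, "horsepower": None, "price": 16845},
]


def count_missing(rows):
    """Return a dict mapping each column name to its number of missing values."""
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return counts


missing = count_missing(rows)
for column, n in missing.items():
    print(f"{column}: {n} missing of {len(rows)}")
```

A column with a high share of missing values is a candidate for exclusion, while scattered gaps in other columns are usually handled row by row.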
Remove a column
When you train a model, you have to do something about the data that's missing. In this dataset, the normalized-losses column is missing many values, so you exclude that column from the model altogether.
Enter Select in the search box at the top of the palette to find the Select Columns in Dataset module.
Drag the Select Columns in Dataset module onto the canvas. Drop the module below the dataset module.
Connect the Automobile price data (Raw) dataset to the Select Columns in Dataset module. Drag from the dataset's output port, which is the small circle at the bottom of the dataset on the canvas, to the input port of Select Columns in Dataset, which is the small circle at the top of the module.
You create a flow of data through your pipeline when you connect the output port of one module to an input port of another.
Select the Select Columns in Dataset module.
In the properties pane to the right of the canvas, select All columns.
Select the + to add a new rule.
From the drop-down menu, select Exclude and Column names.
Enter normalized-losses in the text box.
In the lower right, select Save to close the column selector.
Select the Select Columns in Dataset module.
In the properties pane, select the Comment text box and enter Exclude normalized losses.
Comments will appear on the graph to help you organize your pipeline.
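As a rough analogy for what the exclusion rule does, the following Python sketch drops a named column from a list of rows. It is not the designer's implementation; the rows are hypothetical and exclude_columns is a helper name made up for this illustration.

```python
def exclude_columns(rows, excluded):
    """Return copies of the rows without the named columns."""
    excluded = set(excluded)
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


# Hypothetical rows; None marks a missing value (an assumption).
rows = [
    {"make": "audi", "normalized-losses": 164, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "price": 16430},
]

selected = exclude_columns(rows, ["normalized-losses"])
print(selected)
```

The input rows are left untouched, which mirrors how a module passes a new dataset to its output port rather than altering the dataset it received.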
Clean missing data
Your dataset still has missing values after you remove the normalized-losses column. You can remove the remaining missing data by using the Clean Missing Data module.
Cleaning the missing values from input data is a prerequisite for using most of the modules in the designer.
Enter Clean in the search box to find the Clean Missing Data module.
Drag the Clean Missing Data module to the pipeline canvas. Connect it to the Select Columns in Dataset module.
In the properties pane, select Remove entire row under Cleaning mode.
In the properties pane Comment box, enter Remove missing value rows.
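The effect of the Remove entire row cleaning mode can be approximated with the sketch below. It assumes that None marks a missing value and uses hypothetical rows; remove_rows_with_missing is an illustrative name, not part of any Azure API.

```python
def remove_rows_with_missing(rows):
    """Keep only the rows in which every column has a value."""
    return [
        row for row in rows
        if all(value is not None for value in row.values())
    ]


# Hypothetical rows after the normalized-losses column was excluded.
rows = [
    {"make": "audi", "horsepower": 102, "price": 13950},
    {"make": "mazda", "horsepower": 68, "price": None},
    {"make": "volvo", "horsepower": None, "price": 16845},
    {"make": "bmw", "horsepower": 101, "price": 16430},
]

cleaned = remove_rows_with_missing(rows)
print(f"kept {len(cleaned)} of {len(rows)} rows")
```

Dropping rows is the simplest strategy, but it shrinks the dataset, so it works best when only a small share of rows have gaps.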
Your pipeline should now look something like this:
Train a machine learning model
Now that you have the modules in place to process the data, you can set up the training modules.
Because you want to predict price, which is a number, you can use a regression algorithm. For this example, you use a linear regression model.
Split the data
Splitting data is a common task in machine learning. You will split your data into two separate datasets. One dataset will train the model and the other will test how well the model performed.
Enter split data in the search box to find the Split Data module. Connect the left port of the Clean Missing Data module to the Split Data module.
Be sure that the left output port of Clean Missing Data connects to Split Data. The left port contains the cleaned data. The right port contains the discarded data.
Select the Split Data module.
In the properties pane, set the Fraction of rows in the first output dataset to 0.7.
This option splits 70 percent of the data to train the model and 30 percent for testing it. The 70 percent dataset will be accessible through the left output port. The remaining data will be available through the right output port.
In the properties pane Comment box, enter Split the dataset into training set (0.7) and test set (0.3).
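A 70/30 split can be sketched in plain Python as follows. The shuffle before the cut and the fixed seed are assumptions made for this illustration, and split_rows is a hypothetical helper rather than the module's actual code.

```python
import random


def split_rows(rows, fraction=0.7, seed=42):
    """Randomly split rows; the first list gets about `fraction` of them."""
    shuffled = list(rows)                  # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical rows, identified only by an id.
rows = [{"id": i} for i in range(10)]
train, test = split_rows(rows, fraction=0.7)
print(len(train), len(test))
```

The first list plays the role of the left output port (training set) and the second the right output port (test set). Every row lands in exactly one of the two.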
Train the model
Train the model by giving it a dataset that includes the price. The algorithm constructs a model that explains the relationship between the features and the price as presented by the training data.
To select the learning algorithm, clear your module palette search box.
Expand Machine Learning Algorithms.
This option displays several categories of modules that you can use to initialize learning algorithms.
Select Regression > Linear Regression, and drag it to the pipeline canvas.
Find and drag the Train Model module to the pipeline canvas.
Connect the output of the Linear Regression module to the left input of the Train Model module.
Connect the training data output (left port) of the Split Data module to the right input of the Train Model module.
Be sure that the left output port of Split Data connects to Train Model. The left port contains the training set. The right port contains the test set.
Select the Train Model module.
In the properties pane, select Edit column selector.
In the Label column dialog box, expand the drop-down menu and select Column names.
In the text box, enter price to specify the value that your model is going to predict.
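To make the idea of training concrete, here is a minimal sketch of ordinary least squares with a single feature. It is a deliberate simplification: the Linear Regression module works with many features, and nothing here reflects its internals. The training rows are hypothetical and were constructed to lie exactly on a line so that the result is easy to check by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training rows; price is the label column.
train = [
    {"horsepower": 70, "price": 7000},
    {"horsepower": 100, "price": 13000},
    {"horsepower": 130, "price": 19000},
    {"horsepower": 160, "price": 25000},
]

intercept, slope = fit_simple_linear_regression(
    [row["horsepower"] for row in train],
    [row["price"] for row in train],
)
print(f"price = {intercept:.1f} + {slope:.1f} * horsepower")
```

The label column (price) is the only column the algorithm tries to predict; every other column it receives is treated as a feature.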
Your pipeline should look like this:
Score a machine learning model
After you train your model by using 70 percent of the data, you can use it to score the other 30 percent to see how well your model functions.
Enter score model in the search box to find the Score Model module. Drag the module to the pipeline canvas.
Connect the output of the Train Model module to the left input port of Score Model. Connect the test data output (right port) of the Split Data module to the right input port of Score Model.
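Scoring amounts to applying the trained model to rows it has not seen and storing the prediction next to the actual value. The sketch below assumes a one-feature linear model with made-up coefficients and hypothetical test rows. The output key "Scored Labels" is chosen here to echo the heading used later in this tutorial and is an assumption, not a documented column name.

```python
def score_rows(rows, intercept, slope, feature="horsepower"):
    """Return copies of the rows with a predicted price added."""
    return [
        {**row, "Scored Labels": intercept + slope * row[feature]}
        for row in rows
    ]


# Hypothetical model coefficients and test rows.
intercept, slope = -7000.0, 200.0
test = [
    {"horsepower": 90, "price": 11500},
    {"horsepower": 120, "price": 16500},
]

scored = score_rows(test, intercept, slope)
for row in scored:
    print(row["price"], row["Scored Labels"])
```

Keeping the actual price alongside the prediction is what makes the later evaluation step possible.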
Evaluate a machine learning model
Use the Evaluate Model module to evaluate how well your model scored the test dataset.
Enter evaluate in the search box to find the Evaluate Model module. Drag the module to the pipeline canvas.
Connect the output of the Score Model module to the left input of Evaluate Model.
The final pipeline should look something like this:
Run the pipeline
A pipeline runs on a compute target, which is a compute resource that's attached to your workspace. After you create a compute target, you can reuse it for future runs.
Select Run at the top of the canvas to run the pipeline.
When the Settings pane appears, select Select compute target.
If you already have an available compute target, you can select it to run this pipeline.
The designer can run experiments only on Azure Machine Learning Compute targets. Other compute targets won't be shown.
Enter a name for the compute resource.
In the Set up pipeline run dialog box, select + New experiment for the Experiment.
Experiments group similar pipeline runs together. If you run a pipeline multiple times, you can select the same experiment for successive runs.
Enter a descriptive name for Experiment Name.
You can view run status and details at the top right of the canvas.
It takes approximately five minutes to create a compute resource. After the resource is created, you can reuse it and skip this wait time for future runs.
The compute resource autoscales to zero nodes when it's idle to save cost. When you use it again after a delay, you might experience approximately five minutes of wait time while it scales back up.
View scored labels
After the run completes, you can view the results of the pipeline run. First, look at the predictions generated by the regression model.
Select the Score Model module to view its output.
In the properties pane, select Outputs > graph icon to view results.
Here you can see the predicted prices and the actual prices from the testing data.
Use the Evaluate Model module to see how well the trained model performed on the test dataset.
Select the Evaluate Model module to view its output.
In the properties pane, select Outputs > graph icon to view results.
The following statistics are shown for your model:
- Mean Absolute Error (MAE): The average of absolute errors. An error is the difference between the predicted value and the actual value.
- Root Mean Squared Error (RMSE): The square root of the average of squared errors of predictions made on the test dataset.
- Relative Absolute Error: The average of absolute errors relative to the absolute difference between actual values and the average of all actual values.
- Relative Squared Error: The average of squared errors relative to the squared difference between the actual values and the average of all actual values.
- Coefficient of Determination: Also known as the R squared value, this statistical metric indicates how well a model fits the data.
For each of the error statistics, smaller is better. A smaller value indicates that the predictions are closer to the actual values. For the coefficient of determination, the closer its value is to one (1.0), the better the predictions.
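The statistics above follow standard textbook definitions, which can be sketched as shown below. The formulas are the usual ones and may differ in detail from what the Evaluate Model module computes internally. The actual and predicted values are hypothetical, and regression_metrics is a helper name invented for this example.

```python
import math


def regression_metrics(actual, predicted):
    """Compute common regression error statistics from paired values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Deviations of the actual values from their own mean (the baseline).
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    rse = sum(sq_errors) / sq_dev
    return {
        "mae": sum(abs_errors) / n,
        "rmse": math.sqrt(sum(sq_errors) / n),
        "rae": sum(abs_errors) / abs_dev,
        "rse": rse,
        "r2": 1 - rse,  # coefficient of determination
    }


# Hypothetical actual and predicted prices.
actual = [10000, 12000, 14000, 16000]
predicted = [11000, 12000, 13000, 16000]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The two relative metrics compare the model against a baseline that always predicts the mean of the actual values, so a value below 1 means the model beats that baseline. Under these definitions the coefficient of determination equals one minus the relative squared error.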
Clean up resources
You can use the resources that you created as prerequisites for other Azure Machine Learning tutorials and how-to articles.
If you don't plan to use anything that you created, delete the entire resource group so you don't incur any charges.
In the Azure portal, select Resource groups on the left side of the window.
In the list, select the resource group that you created.
Select Delete resource group.
Deleting the resource group also deletes all resources that you created in the designer.
Delete individual assets
In the designer where you created your experiment, delete individual assets by selecting them and then selecting the Delete button.
The compute target that you created here autoscales to zero nodes when it's not being used, which minimizes charges. If you want to delete the compute target, take these steps:
You can unregister datasets from your workspace by selecting each dataset and selecting Unregister.
To delete a dataset, go to the storage account by using the Azure portal or Azure Storage Explorer and manually delete those assets.
In part one of this tutorial, you learned how to:
- Create a pipeline
- Prepare the data
- Train the model
- Score and evaluate the model
In part two, you'll learn how to deploy your model as a real-time endpoint.