Tutorial: Use R to create a machine learning model (preview)


The Azure Machine Learning R SDK is currently in public preview. The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

In this tutorial you'll use the Azure Machine Learning R SDK (preview) to create a logistic regression model that predicts the likelihood of a fatality in a car accident. You'll see how the Azure Machine Learning cloud resources work with R to provide a scalable environment for training and deploying a model.

In this tutorial, you perform the following tasks:

  • Create an Azure Machine Learning workspace
  • Open RStudio from your workspace
  • Clone the files necessary to run this tutorial from https://github.com/Azure/azureml-sdk-for-r into your workspace
  • Load data and prepare for training
  • Upload data to a datastore so it is available for remote training
  • Create a compute resource to train the model remotely
  • Train a caret model to predict probability of fatality
  • Deploy a prediction endpoint
  • Test the model from R

If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning today.

Create a workspace

An Azure Machine Learning workspace is a foundational resource in the cloud that you use to experiment, train, and deploy machine learning models. It ties your Azure subscription and resource group to an easily consumed object in the service.

There are many ways to create a workspace. In this tutorial, you create a workspace via the Azure portal, a web-based console for managing your Azure resources.

  1. Sign in to the Azure portal by using the credentials for your Azure subscription.

  2. In the upper-left corner of the Azure portal, select + Create a resource.

  3. Use the search bar to find Machine Learning.

  4. Select Machine Learning.

  5. In the Machine Learning pane, select Create to begin.

  6. Provide the following information to configure your new workspace:

    Field | Description
    Workspace name | Enter a unique name that identifies your workspace. In this example, we use docs-ws. Names must be unique across the resource group. Use a name that's easy to recall and to differentiate from workspaces created by others.
    Subscription | Select the Azure subscription that you want to use.
    Resource group | Use an existing resource group in your subscription, or enter a name to create a new resource group. A resource group holds related resources for an Azure solution. In this example, we use docs-aml.
    Location | Select the location closest to your users and the data resources to create your workspace.
    Workspace edition | Select Basic as the workspace type for this tutorial. The workspace type determines the features to which you'll have access and pricing. Everything in this tutorial can be performed with either a Basic or Enterprise workspace.
  7. After you're finished configuring the workspace, select Review + Create.


    It can take several minutes to create your workspace in the cloud.

    When the process is finished, a deployment success message appears.

  8. To view the new workspace, select Go to resource.


Take note of your workspace and subscription. You'll need these to ensure you create your experiment in the right place.

Open RStudio

This example uses a compute instance in your workspace for an install-free and pre-configured experience. If you prefer to control the environment, packages, and dependencies yourself, use your own machine instead.

Use RStudio on an Azure ML compute instance to run this tutorial.

  1. Select Compute on the left.

  2. Add a compute resource if one does not already exist.

  3. Once the compute is running, use the RStudio link to open RStudio.

Clone the sample vignettes

Clone the https://github.com/Azure/azureml-sdk-for-r GitHub repository for a copy of the vignette files you will run in this tutorial.

  1. In RStudio, navigate to the "Terminal" tab and cd into the directory where you would like to clone the repository.

  2. Run git clone https://github.com/Azure/azureml-sdk-for-r in the terminal to clone the repository.

  3. In RStudio, navigate to the vignettes folder of the cloned azureml-sdk-for-r folder. Under vignettes, select the train-and-deploy-first-model.Rmd file to find the vignette used in this tutorial. The additional files used for the vignette are located in the train-and-deploy-first-model subfolder. Once you've opened the vignette, set the working directory to the file's location via Session > Set Working Directory > To Source File Location.


The rest of this article contains the same content as the train-and-deploy-first-model.Rmd file. If you are experienced with RMarkdown, feel free to use the code from that file. Alternatively, copy and paste the code snippets from that file or from this article into an R script or the command line.

Set up your development environment

The setup for your development work in this tutorial includes the following actions:

  • Install required packages
  • Connect to a workspace, so that your compute instance can communicate with remote resources
  • Create an experiment to track your runs
  • Create a remote compute target to use for training

Install required packages

The compute instance already has the latest version of the R SDK from CRAN installed. If you would like to install the development version from GitHub instead to pick up the latest bug fixes, please run the following:
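A minimal sketch of that installation, assuming the remotes package is already available on the compute instance. The call to install_azureml() then sets up the Python dependencies that the R SDK relies on:

```r
# Install the development version of the R SDK from GitHub
# (assumes the 'remotes' package is installed).
remotes::install_github("Azure/azureml-sdk-for-r")

# Install the Python dependencies used by the SDK.
azuremlsdk::install_azureml()
```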



During the installation process, if you get the prompt "Would you like to install Miniconda? [Y/n]:", please respond with "n" as the compute instance already has Anaconda installed and a Miniconda installation is not needed.

Now go ahead and import the azuremlsdk package.
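Assuming the package is installed, loading it is a single call:

```r
# Attach the Azure Machine Learning SDK for R.
library(azuremlsdk)
```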


The training and scoring scripts (accidents.R and accident_predict.R) have some additional dependencies. If you plan on running those scripts locally, make sure you have those required packages as well.

Load your workspace

Instantiate a workspace object from your existing workspace. The following code will load the workspace details from the config.json file. You can also retrieve a workspace using get_workspace().

ws <- load_workspace_from_config()
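If no config.json is available, a sketch of the get_workspace() alternative might look like this. The workspace name docs-ws and resource group docs-aml are the example values used earlier in this tutorial, and the subscription ID is a placeholder you would replace with your own:

```r
library(azuremlsdk)

# Placeholder values: replace with your own workspace details.
ws <- get_workspace(name = "docs-ws",
                    subscription_id = "<your-subscription-id>",
                    resource_group = "docs-aml")
```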

Create an experiment

An Azure ML experiment tracks a grouping of runs, typically from the same training script. Create an experiment to track the runs for training the caret model on the accidents data.

experiment_name <- "accident-logreg"
exp <- experiment(ws, experiment_name)

Create a compute target

By using Azure Machine Learning Compute (AmlCompute), a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create a single-node AmlCompute cluster as your training environment. The code below creates the compute cluster for you if it doesn't already exist in your workspace.

You may need to wait a few minutes for your compute cluster to be provisioned if it doesn't already exist.

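cluster_name <- "rcluster"
compute_target <- get_compute(ws, cluster_name = cluster_name)
if (is.null(compute_target)) {
  # Create a single-node cluster only if it doesn't already exist
  vm_size <- "STANDARD_D2_V2"
  compute_target <- create_aml_compute(workspace = ws,
                                       cluster_name = cluster_name,
                                       vm_size = vm_size,
                                       max_nodes = 1)

  # Block until the cluster is provisioned
  wait_for_provisioning_completion(compute_target, show_output = TRUE)
}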

Prepare data for training

This tutorial uses data from the US National Highway Traffic Safety Administration (with thanks to Mary C. Meyer and Tremika Finney). This dataset includes data from over 25,000 car crashes in the US, with variables you can use to predict the likelihood of a fatality. First, import the data into R and transform it into a new dataframe accidents for analysis, and export it to an Rdata file.

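# Assumption: read text columns as factors so the categorical variables
# can be used directly in the model
nassCDS <- read.csv("nassCDS.csv", stringsAsFactors = TRUE)
accidents <- na.omit(nassCDS[,c("dead","dvcat","seatbelt","frontal","sex","ageOFocc","yearVeh","airbag","occRole")])
accidents$frontal <- factor(accidents$frontal, labels=c("notfrontal","frontal"))
accidents$occRole <- factor(accidents$occRole)
# Impact speed categories, ordered from lowest to highest
accidents$dvcat <- ordered(accidents$dvcat,
                           levels=c("1-9km/h","10-24","25-39","40-54","55+"))

saveRDS(accidents, file="accidents.Rd")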

Upload data to the datastore

Upload data to the cloud so that it can be accessed by your remote training environment. Each Azure Machine Learning workspace comes with a default datastore that stores the connection information to the Azure blob container that is provisioned in the storage account attached to the workspace. The following code will upload the accidents data you created above to that datastore.

ds <- get_default_datastore(ws)

target_path <- "accidentdata"
# Upload the file saved in the previous step
upload_files_to_datastore(ds,
                          list("./accidents.Rd"),
                          target_path = target_path,
                          overwrite = TRUE)

Train a model

For this tutorial, fit a logistic regression model on your uploaded data using your remote compute cluster. To submit a job, you need to:

  • Prepare the training script
  • Create an estimator
  • Submit the job

Prepare the training script

A training script called accidents.R has been provided for you in the train-and-deploy-first-model directory. Notice the following details inside the training script that have been done to leverage Azure Machine Learning for training:

  • The training script takes an argument -d to find the directory that contains the training data. When you define and submit your job later, you point to the datastore for this argument. Azure ML will mount the storage folder to the remote cluster for the training job.
  • The training script logs the final accuracy as a metric to the run record in Azure ML using log_metric_to_run(). The Azure ML SDK provides a set of logging APIs for logging various metrics during training runs. These metrics are recorded and persisted in the experiment run record. The metrics can then be accessed at any time or viewed in the run details page in studio. See the reference for the full set of logging methods log_*().
  • The training script saves your model into a directory named outputs. The ./outputs folder receives special treatment by Azure ML. During training, files written to ./outputs are automatically uploaded to your run record by Azure ML and persisted as artifacts. By saving the trained model to ./outputs, you'll be able to access and retrieve your model file even after the run is over and you no longer have access to your remote training environment.
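The three points above can be illustrated with a small, self-contained sketch. It is not the provided accidents.R: it uses base R argument handling and glm() instead of optparse and caret, and the helper names parse_data_folder() and train_and_save() are hypothetical. Metric logging with log_metric_to_run() is left out because it needs a live Azure ML run.

```r
# Hypothetical helper: extract the value following -d or --data_folder
# from a character vector of command-line arguments.
parse_data_folder <- function(args) {
  idx <- which(args %in% c("-d", "--data_folder"))
  if (length(idx) == 0 || idx[1] == length(args)) {
    stop("expected -d <folder> or --data_folder <folder>")
  }
  args[idx[1] + 1]
}

# Hypothetical helper: fit a logistic regression on the data found in
# data_folder and save the fitted model under output_dir.
# Assumes the 'dead' column is a two-level factor.
train_and_save <- function(data_folder, output_dir = "outputs") {
  accidents <- readRDS(file.path(data_folder, "accidents.Rd"))
  model <- glm(dead ~ ., family = binomial, data = accidents)

  # Training accuracy at a 0.5 probability threshold. glm() models the
  # probability of any level other than the first factor level.
  predicted <- predict(model, accidents, type = "response") > 0.5
  actual <- accidents$dead != levels(accidents$dead)[1]
  accuracy <- mean(predicted == actual)

  # Files written to ./outputs are uploaded to the run record by Azure ML.
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  saveRDS(model, file.path(output_dir, "model.rds"))
  accuracy
}

# In a script executed with Rscript, the entry point would be:
# train_and_save(parse_data_folder(commandArgs(trailingOnly = TRUE)))
```

Keeping the argument parsing and the training step in separate functions makes it possible to try the training logic locally on a small file before submitting it to the cluster.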

Create an estimator

An Azure ML estimator encapsulates the run configuration information needed for executing a training script on the compute target. Azure ML runs are run as containerized jobs on the specified compute target. By default, the Docker image built for your training job will include R, the Azure ML SDK, and a set of commonly used R packages. See the full list of default packages included here.

To create the estimator, define:

  • The directory that contains your scripts needed for training (source_directory). All the files in this directory are uploaded to the cluster node(s) for execution. The directory must contain your training script and any additional scripts required.
  • The training script that will be executed (entry_script).
  • The compute target (compute_target), in this case the AmlCompute cluster you created earlier.
  • The parameters required from the training script (script_params). Azure ML will run your training script as a command-line script with Rscript. In this tutorial you specify one argument to the script, the data directory mounting point, which you can access with ds$path(target_path).
  • Any environment dependencies required for training. The default Docker image built for training already contains the three packages (caret, e1071, and optparse) needed in the training script. So you don't need to specify additional information. If you are using R packages that are not included by default, use the estimator's cran_packages parameter to add additional CRAN packages. See the estimator() reference for the full set of configurable options.

est <- estimator(source_directory = "train-and-deploy-first-model",
                 entry_script = "accidents.R",
                 script_params = list("--data_folder" = ds$path(target_path)),
                 compute_target = compute_target)

Submit the job on the remote cluster

Finally, submit the job to run on your cluster. submit_experiment() returns a Run object that you then use to interface with the run. In total, the first run takes about 10 minutes. For later runs, the same Docker image is reused as long as the script dependencies don't change; the image is cached, so the container starts much faster.

run <- submit_experiment(exp, est)

You can view the run's details in RStudio Viewer. Clicking the "Web View" link provided will bring you to Azure Machine Learning studio, where you can monitor the run in the UI.
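A sketch of opening the run details, assuming run is the Run object returned by submit_experiment():

```r
# Show the run details in the RStudio Viewer pane.
view_run_details(run)
```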


Model training happens in the background. Wait until the model has finished training before you run more code.

wait_for_run_completion(run, show_output = TRUE)

You, and colleagues with access to the workspace, can submit multiple experiments in parallel, and Azure ML will take care of scheduling the tasks on the compute cluster. You can even configure the cluster to automatically scale up to multiple nodes, and scale back when there are no more compute tasks in the queue. This configuration is a cost-effective way for teams to share compute resources.
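As a sketch of such an autoscaling setup, a cluster could be created with a node range instead of a single node. The cluster name rcluster-shared and the limits shown here are hypothetical choices, not values from this tutorial, and ws is assumed to be the workspace object loaded earlier:

```r
library(azuremlsdk)

# Hypothetical autoscaling cluster: scales between 0 and 4 nodes and
# releases idle nodes after 10 minutes.
shared_cluster <- create_aml_compute(workspace = ws,
                                     cluster_name = "rcluster-shared",
                                     vm_size = "STANDARD_D2_V2",
                                     min_nodes = 0,
                                     max_nodes = 4,
                                     idle_seconds_before_scaledown = 600)
wait_for_provisioning_completion(shared_cluster, show_output = TRUE)
```

Setting min_nodes to 0 lets the cluster release all nodes when the queue is empty, which is what keeps a shared cluster inexpensive while idle.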

Retrieve training results

Once your model has finished training, you can access the artifacts of your job that were persisted to the run record, including any metrics logged and the final trained model.

Get the logged metrics

In the training script accidents.R, you logged a metric from your model: the accuracy of the predictions in the training data. You can see metrics in the studio, or extract them to the local session as an R list as follows:

metrics <- get_run_metrics(run)

If you've run multiple experiments (say, using differing variables, algorithms, or hyperparameters), you can use the metrics from each run to compare and choose the model you'll use in production.
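A sketch of such a comparison, assuming each run logged a metric named Accuracy. The run labels and metric values below are invented purely for illustration; in practice each list element would come from a call to get_run_metrics():

```r
# Hypothetical metrics, shaped like the named lists returned by
# get_run_metrics(); the values are made up for illustration.
run_metrics <- list(
  baseline      = list(Accuracy = 0.91),
  extra_feature = list(Accuracy = 0.93),
  fewer_inputs  = list(Accuracy = 0.89)
)

# Extract one metric per run and pick the run with the highest value.
accuracy <- vapply(run_metrics, function(m) m$Accuracy, numeric(1))
best_run <- names(accuracy)[which.max(accuracy)]
print(best_run)
```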

Get the trained model

You can retrieve the trained model and look at the results in your local R session. The following code will download the contents of the ./outputs directory, which includes the model file.

download_files_from_run(run, prefix="outputs/")
accident_model <- readRDS("outputs/model.rds")
# Inspect the fitted coefficients and their significance
summary(accident_model)

You see some factors that contribute to an increase in the estimated probability of death:

  • higher impact speed
  • male driver
  • older occupant
  • passenger

You see lower probabilities of death with:

  • presence of airbags
  • presence of seatbelts
  • frontal collision

The vehicle year of manufacture does not have a significant effect.

You can use this model to make new predictions:

newdata <- data.frame( # valid values shown below
 dvcat="10-24",        # "1-9km/h" "10-24"   "25-39"   "40-54"   "55+"  
 seatbelt="none",      # "none"   "belted"  
 frontal="frontal",    # "notfrontal" "frontal"
 sex="f",              # "f" "m"
 ageOFocc=16,          # age in years, 16-97
 yearVeh=2002,         # year of vehicle, 1955-2003
 airbag="none",        # "none"   "airbag"   
 occRole="pass"        # "driver" "pass"
)

## predicted probability of death for these variables, as a percentage
as.numeric(predict(accident_model,newdata, type="response")*100)
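To make the prediction step concrete: for a logistic regression fitted with glm(), type="response" returns the inverse logit of the linear predictor. The sketch below checks this on a small simulated dataset; the data and variable names are illustrative and unrelated to the accidents model:

```r
set.seed(42)

# Simulated data: one numeric predictor and a binary outcome.
n <- 500
speed <- runif(n, 0, 60)
died <- rbinom(n, 1, plogis(-4 + 0.08 * speed))
toy <- data.frame(speed = speed, died = died)

toy_model <- glm(died ~ speed, family = binomial, data = toy)

# Predicted probability for a new observation...
new_obs <- data.frame(speed = 30)
p_response <- as.numeric(predict(toy_model, new_obs, type = "response"))

# ...equals the inverse logit of intercept + slope * speed.
coefs <- coef(toy_model)
p_manual <- plogis(coefs[["(Intercept)"]] + coefs[["speed"]] * 30)

print(all.equal(p_response, p_manual))
```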

Deploy as a web service

With your model, you can predict the danger of death from a collision. Use Azure ML to deploy your model as a prediction service. In this tutorial, you will deploy the web service in Azure Container Instances (ACI).

Register the model

First, register the model you downloaded to your workspace with register_model(). A registered model can be any collection of files, but in this case the R model object is sufficient. Azure ML will use the registered model for deployment.

model <- register_model(ws, 
                        model_path = "outputs/model.rds", 
                        model_name = "accidents_model",
                        description = "Predict probability of auto accident")

Define the inference dependencies

To create a web service for your model, you first need to create a scoring script (entry_script), an R script that will take as input variable values (in JSON format) and output a prediction from your model. For this tutorial, use the provided scoring file accident_predict.R. The scoring script must contain an init() method that loads your model and returns a function that uses the model to make a prediction based on the input data. See the documentation for more details.
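A sketch of what such a scoring script can look like, assuming the jsonlite package is available and that the model file is located through the AZUREML_MODEL_DIR environment variable. It illustrates the init() pattern described above and is not necessarily the exact contents of accident_predict.R:

```r
library(jsonlite)

# init() runs once when the service starts: it loads the model and
# returns the function that is called for every scoring request.
init <- function() {
  # Assumption: the service exposes the model location in this variable.
  model_dir <- Sys.getenv("AZUREML_MODEL_DIR")
  model <- readRDS(file.path(model_dir, "model.rds"))

  function(data) {
    # Parse the JSON request body into a data frame of predictors.
    vars <- as.data.frame(fromJSON(data))
    # Predicted probability, expressed as a percentage.
    prediction <- as.numeric(predict(model, vars, type = "response") * 100)
    toJSON(prediction)
  }
}
```

Loading the model inside init() rather than inside the returned function means the file is read once per service start instead of once per request.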

Next, define an Azure ML environment for your script's package dependencies. With an environment, you specify R packages (from CRAN or elsewhere) that are needed for your script to run. You can also provide the values of environment variables that your script can reference to modify its behavior. By default, Azure ML will build the same default Docker image used with the estimator for training. Since the tutorial has no special requirements, create an environment with no special attributes.

r_env <- r_environment(name = "basic_env")

If you want to use your own Docker image for deployment instead, specify the custom_docker_image parameter. See the r_environment() reference for the full set of configurable options for defining an environment.

Now you have everything you need to create an inference config for encapsulating your scoring script and environment dependencies.

inference_config <- inference_config(
  entry_script = "accident_predict.R",
  source_directory = "train-and-deploy-first-model",
  environment = r_env)

Deploy to ACI

In this tutorial, you will deploy your service to ACI. This code provisions a single container to respond to inbound requests, which is suitable for testing and light loads. See aci_webservice_deployment_config() for additional configurable options. (For production-scale deployments, you can also deploy to Azure Kubernetes Service.)

aci_config <- aci_webservice_deployment_config(cpu_cores = 1, memory_gb = 0.5)

Now you deploy your model as a web service. Deployment can take several minutes.

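# "accident-pred" is an assumed service name; choose your own
aci_service <- deploy_model(ws, "accident-pred", list(model), inference_config = inference_config, deployment_config = aci_config)

wait_for_deployment(aci_service, show_output = TRUE)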

Test the deployed service

Now that your model is deployed as a service, you can test the service from R using invoke_webservice(). Provide a new set of data to predict from, convert it to JSON, and send it to the service.


library(jsonlite)      # provides toJSON()

newdata <- data.frame( # valid values shown below
 dvcat="10-24",        # "1-9km/h" "10-24"   "25-39"   "40-54"   "55+"  
 seatbelt="none",      # "none"   "belted"  
 frontal="frontal",    # "notfrontal" "frontal"
 sex="f",              # "f" "m"
 ageOFocc=22,          # age in years, 16-97
 yearVeh=2002,         # year of vehicle, 1955-2003
 airbag="none",        # "none"   "airbag"   
 occRole="pass"        # "driver" "pass"
)

prob <- invoke_webservice(aci_service, toJSON(newdata))

You can also get the web service's HTTP endpoint, which accepts REST client calls. You can share this endpoint with anyone who wants to test the web service or integrate it into an application.
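A sketch of retrieving that endpoint, assuming the deployed service object is named aci_service and exposes the URL as its scoring_uri field:

```r
# Print the HTTP endpoint of the deployed web service.
cat(aci_service$scoring_uri)
```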


Clean up resources

Delete the resources once you no longer need them. Don't delete any resource you plan to still use.

Delete the web service:


Delete the registered model:


Delete the compute cluster:
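A sketch of the three deletions listed above, assuming the objects aci_service, model, and compute_target created earlier in this tutorial are still in your R session:

```r
# Delete the web service.
delete_webservice(aci_service)

# Delete the registered model.
delete_model(model)

# Delete the compute cluster.
delete_compute(compute_target)
```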


Delete everything


The resources that you created can be used as prerequisites to other Azure Machine Learning tutorials and how-to articles.

If you don't plan to use the resources that you created, delete them so you don't incur any charges:

  1. In the Azure portal, select Resource groups on the far left.

  2. From the list, select the resource group that you created.

  3. Select Delete resource group.

  4. Enter the resource group name. Then select Delete.

You can also keep the resource group but delete a single workspace. Display the workspace properties and select Delete.

Next steps

  • Now that you've completed your first Azure Machine Learning experiment in R, learn more about the Azure Machine Learning SDK for R.

  • Learn more about Azure Machine Learning with R from the examples in the other vignettes folders.