Train with datasets in Azure Machine Learning

APPLIES TO: Basic edition, Enterprise edition

In this article, you learn the two ways to consume Azure Machine Learning datasets in remote experiment training runs without worrying about connection strings or data paths.

  • Option 1: If you have structured data, create a TabularDataset and use it directly in your training script.

  • Option 2: If you have unstructured data, create a FileDataset and mount or download files to a remote compute for training.

Azure Machine Learning datasets provide seamless integration with Azure Machine Learning training products like ScriptRun, Estimator, HyperDrive, and Azure Machine Learning pipelines.


To create and train with datasets, you need:

  • An Azure subscription.

  • An Azure Machine Learning workspace.

  • The Azure Machine Learning SDK for Python installed.

Some Dataset classes have dependencies on the azureml-dataprep package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, Fedora, and CentOS.
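If azureml-dataprep is not already present in your environment, a typical installation looks like the following (the pandas extra shown here is an assumption; it supplies the support that to_pandas_dataframe() relies on):

```shell
pip install "azureml-dataprep[pandas]"
```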

Option 1: Use datasets directly in training scripts

In this example, you create a TabularDataset and use it as a direct input to your estimator object for training.

Create a TabularDataset

The following code creates an unregistered TabularDataset from a web URL. You can also create datasets from local files or paths in datastores. Learn more about how to create datasets.

from azureml.core.dataset import Dataset

web_path = ''
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path)

Access the input dataset in your training script

TabularDataset objects provide the ability to load the data into a pandas or Spark DataFrame so that you can work with familiar data preparation and training libraries. To leverage this capability, pass a TabularDataset as the input in your training configuration, and then retrieve it in your script.

To do so, access the input dataset through the Run object in your training script and use the to_pandas_dataframe() method.

%%writefile $script_folder/

from azureml.core import Dataset, Run

run = Run.get_context()
# get the input dataset by name
dataset = run.input_datasets['titanic']
# load the TabularDataset to pandas DataFrame
df = dataset.to_pandas_dataframe()
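Once retrieved, df is an ordinary pandas DataFrame, so standard preparation steps apply. A minimal sketch, using a synthetic frame in place of the Titanic data (the column names here are assumptions for illustration only):

```python
import pandas as pd

# Synthetic stand-in for the DataFrame returned by to_pandas_dataframe();
# 'Survived', 'Age', and 'Fare' are hypothetical column names.
df = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Age': [22.0, 38.0, 26.0],
    'Fare': [7.25, 71.28, 7.92],
})

# Split the label column from the features, as a training script typically would
y = df.pop('Survived')
X = df

print(X.shape, y.shape)  # (3, 2) (3,)
```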

Configure the estimator

An estimator object is used to submit the experiment run. Azure Machine Learning has pre-configured estimators for common machine learning frameworks, as well as a generic estimator.

This code creates a generic estimator object, est, that specifies

  • A script directory for your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
  • The training script to run.
  • The input dataset for training, titanic. as_named_input() is required so that the input dataset can be referenced by the assigned name in your training script.
  • The compute target for the experiment.
  • The environment definition for the experiment.
est = Estimator(source_directory=script_folder,
                entry_script=entry_script,  # filename of your training script (not shown above)
                # pass dataset object as an input with name 'titanic'
                inputs=[titanic_ds.as_named_input('titanic')],
                compute_target=compute_target,
                environment_definition=conda_env)

# Submit the estimator as part of your experiment run
experiment_run = experiment.submit(est)

Option 2: Mount files to a remote compute target

If you want to make your data files available on the compute target for training, use a FileDataset to mount or download the files it references.

Mount vs. Download

Mounting or downloading files of any format is supported for datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL.

When you mount a dataset, you attach the files referenced by the dataset to a directory (mount point) and make it available on the compute target. Mounting is supported for Linux-based computes, including Azure Machine Learning Compute, virtual machines, and HDInsight. When you download a dataset, all the files referenced by the dataset will be downloaded to the compute target. Downloading is supported for all compute types.

If your script processes all files referenced by the dataset, and your compute disk can fit your full dataset, downloading is recommended to avoid the overhead of streaming data from storage services. If your data size exceeds the compute disk size, downloading is not possible. For this scenario, we recommend mounting since only the data files used by your script are loaded at the time of processing.
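The rule of thumb above can be sketched as a small helper. This is an illustrative sketch only, not an Azure ML API; it assumes you know the dataset's total size in bytes (for example, from your storage account metrics):

```python
import shutil

def choose_access_mode(dataset_size_bytes, disk_path='/'):
    """Pick 'download' when the full dataset fits on the compute disk,
    otherwise 'mount' and stream only the files the script actually reads."""
    free_bytes = shutil.disk_usage(disk_path).free
    return 'download' if dataset_size_bytes < free_bytes else 'mount'
```

For example, choose_access_mode(500 * 1024**2) would typically return 'download' on a compute target with more than 500 MB of free disk.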

The following code mounts the dataset to the temporary directory at mounted_path:

import tempfile
mounted_path = tempfile.mkdtemp()

# mount dataset onto the mounted_path of a Linux-based compute
mount_context = dataset.mount(mounted_path)
mount_context.start()  # this will mount the file streams

import os
print(os.listdir(mounted_path))
print(mounted_path)

Create a FileDataset

The following example creates an unregistered FileDataset from web URLs. Learn more about how to create datasets from other sources.

from azureml.core.dataset import Dataset

web_paths = [
    # web URLs to the data files (omitted here)
]
mnist_ds = Dataset.File.from_files(path=web_paths)

Configure the estimator

Besides passing the dataset through the inputs parameter in the estimator, you can also pass the dataset through script_params and get the data path (mount point) in your training script via arguments. This way, you can keep your training script independent of azureml-sdk. In other words, you will be able to use the same training script for local debugging and remote training on any cloud platform.
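Because the script reads its data path from the command line, it can be exercised locally with the same interface. A minimal sketch of that azureml-free argument handling ('/tmp/mnist' is a hypothetical local path):

```python
import argparse

# Same argparse interface the remote training script uses; no azureml import needed
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder path')
parser.add_argument('--regularization', type=float, dest='reg', default=0.01)

# Local debugging: point --data-folder at a local copy of the data
args = parser.parse_args(['--data-folder', '/tmp/mnist', '--regularization', '0.5'])
print(args.data_folder, args.reg)  # /tmp/mnist 0.5
```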

An SKLearn estimator object is used to submit the run for scikit-learn experiments. Learn more about training with the SKlearn estimator.

from azureml.train.sklearn import SKLearn

script_params = {
    # mount the dataset on the remote compute and pass the mounted path as an argument to the training script
    '--data-folder': mnist_ds.as_named_input('mnist').as_mount(),
    '--regularization': 0.5
}

est = SKLearn(source_directory=script_folder,
              script_params=script_params,
              compute_target=compute_target,
              entry_script=entry_script)  # filename of your training script (not shown above)

# Run the experiment
run = experiment.submit(est)

Retrieve the data in your training script

After you submit the run, the data files referred to by the mnist dataset are mounted to the compute target. The following code shows how to retrieve the data in your script.

%%writefile $script_folder/

import argparse
import os
import numpy as np
import glob

from utils import load_data

# retrieve the 2 arguments configured through script_params in estimator
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--regularization', type=float, dest='reg', default=0.01, help='regularization rate')
args = parser.parse_args()

data_folder = args.data_folder
print('Data folder:', data_folder)

# get the file paths on the compute
X_train_path = glob.glob(os.path.join(data_folder, '**/train-images-idx3-ubyte.gz'), recursive=True)[0]
X_test_path = glob.glob(os.path.join(data_folder, '**/t10k-images-idx3-ubyte.gz'), recursive=True)[0]
y_train_path = glob.glob(os.path.join(data_folder, '**/train-labels-idx1-ubyte.gz'), recursive=True)[0]
y_test_path = glob.glob(os.path.join(data_folder, '**/t10k-labels-idx1-ubyte.gz'), recursive=True)[0]

# load train and test set into numpy arrays
X_train = load_data(X_train_path, False) / 255.0
X_test = load_data(X_test_path, False) / 255.0
y_train = load_data(y_train_path, True).reshape(-1)
y_test = load_data(y_test_path, True).reshape(-1)

Notebook examples

The dataset notebooks demonstrate and expand upon concepts in this article.

Next steps