Create Azure Machine Learning datasets

APPLIES TO: yesBasic edition yesEnterprise edition                    (Upgrade to Enterprise edition)

In this article, you learn how to create Azure Machine Learning datasets to access data for your local or remote experiments.

With Azure Machine Learning datasets, you can:

  • Keep a single copy of data in your storage, referenced by datasets.

  • Seamlessly access data during model training without worrying about connection strings or data paths.

  • Share data and collaborate with other users.

Prerequisites

To create and work with datasets, you need:

Note

Some dataset classes have dependencies on the azureml-dataprep package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, Fedora, and CentOS.

Dataset types

There are two dataset types, based on how users consume them in training:

  • TabularDataset represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a Pandas or Spark DataFrame. You can create a TabularDataset object from .csv, .tsv, and parquet files, and from SQL query results. For a complete list, see TabularDatasetFactory class.

  • The FileDataset class references single or multiple files in your datastores or public URLs. By this method, you can download or mount the files to your compute as a FileDataset object. The files can be in any format, which enables a wider range of machine learning scenarios, including deep learning.

To learn more about upcoming API changes, see Dataset API change notice.

Create datasets

By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost. You can create both TabularDataset and FileDataset data sets by using the Python SDK or workspace landing page (preview).

For the data to be accessible by Azure Machine Learning, datasets must be created from paths in Azure datastores or public web URLs.

Use the SDK

To create datasets from an Azure datastore by using the Python SDK:

  1. Verify that you have contributor or owner access to the registered Azure datastore.

  2. Create the dataset by referencing a path in the datastore:

    from azureml.core.workspace import Workspace
    from azureml.core.datastore import Datastore
    from azureml.core.dataset import Dataset
    
    datastore_name = 'your datastore name'
    
    # get existing workspace
    workspace = Workspace.from_config()
    
    # retrieve an existing datastore in the workspace by name
    datastore = Datastore.get(workspace, datastore_name)
    

Create a TabularDataset

You can create TabularDatasets through the SDK or by using Azure Machine Learning Studio. You can specify a time stamp from a column in the data or from the path pattern that the data is stored in to enable a time series trait. This specification allows for easy and efficient filtering by time.

Use the from_delimited_files() method on the TabularDatasetFactory class to read files in .csv or .tsv format, and to create an unregistered TabularDataset. If you're reading from multiple files, results will be aggregated into one tabular representation.

# create a TabularDataset from multiple paths in datastore
datastore_paths = [
                  (datastore, 'weather/2018/11.csv'),
                  (datastore, 'weather/2018/12.csv'),
                  (datastore, 'weather/2019/*.csv')
                 ]
weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)

By default, when you create a TabularDataset, column data types are inferred automatically. If the inferred types don't match your expectations, you can specify column types by using the following code. You can also learn more about supported data types.

from azureml.data.dataset_factory import DataType

# create a TabularDataset from a delimited file behind a public web url and convert column "Survived" to boolean
web_path ='https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path, set_column_types={'Survived': DataType.to_bool()})

# preview the first 3 rows of titanic_ds
titanic_ds.take(3).to_pandas_dataframe()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 False 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S
1 2 True 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 True 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S

Use the from_sql_query() method on the TabularDatasetFactory class to read from Azure SQL Database:


from azureml.core import Dataset, Datastore

# create tabular dataset from a SQL database in datastore
sql_datastore = Datastore.get(workspace, 'mssql')
sql_ds = Dataset.Tabular.from_sql_query((sql_datastore, 'SELECT * FROM my_table'))

In TabularDatasets, you can specify a time stamp from a column in the data or from wherever the path pattern data is stored to enable a time series trait. This specification allows for easy and efficient filtering by time.

Use the with_timestamp_columns() method on theTabularDataset class to specify your time stamp column and to enable filtering by time. For more information, see Tabular time series-related API demo with NOAA weather data.

# create a TabularDataset with time series trait
datastore_paths = [(datastore, 'weather/*/*/*/data.parquet')]

# get a coarse timestamp column from the path pattern
dataset = Dataset.Tabular.from_parquet_files(path=datastore_path, partition_format='weather/{coarse_time:yyy/MM/dd}/data.parquet')

# set coarse timestamp to the virtual column created, and fine grain timestamp from a column in the data
dataset = dataset.with_timestamp_columns(fine_grain_timestamp='datetime', coarse_grain_timestamp='coarse_time')

# filter with time-series-trait-specific methods
data_slice = dataset.time_before(datetime(2019, 1, 1))
data_slice = dataset.time_after(datetime(2019, 1, 1))
data_slice = dataset.time_between(datetime(2019, 1, 1), datetime(2019, 2, 1))
data_slice = dataset.time_recent(timedelta(weeks=1, days=1))

Create a FileDataset

Use the from_files() method on the FileDatasetFactory class to load files in any format and to create an unregistered FileDataset:

# create a FileDataset pointing to files in 'animals' folder and its subfolders recursively
datastore_paths = [
                  (datastore, 'animals')
                 ]
animal_ds = Dataset.File.from_files(path=datastore_paths)

# create a FileDataset from image and label files behind public web urls
web_paths = [
            'https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
            'https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz'
           ]
mnist_ds = Dataset.File.from_files(path=web_paths)

On the web

The following steps and animation show how to create a dataset in Azure Machine Learning Studio, https://ml.azure.com.

Create a dataset with the UI

To create a dataset in the studio:

  1. Sign in at https://ml.azure.com.
  2. Select Datasets in the Assets section of the left pane.
  3. Select Create Dataset to choose the source of your dataset. This source can be local files, a datastore, or public URLs.
  4. Select Tabular or File for Dataset type.
  5. Select Next to review the Settings and preview, Schema and Confirm details forms; they are intelligently populated based on file type. Use these forms to check your selections and to further configure your dataset prior to creation.
  6. Select Create to complete your dataset creation.

Register datasets

To complete the creation process, register your datasets with a workspace. Use the register() method to register datasets with your workspace in order to share them with others and reuse them across various experiments:

titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='titanic training data')

Note

Datasets created through Azure Machine Learning Studio are automatically registered to the workspace.

Create datasets with Azure Open Datasets

Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. Open Datasets are in the cloud on Microsoft Azure and are included in both the SDK and the workspace UI.

Use the SDK

To create datasets with Azure Open Datasets from the SDK, make sure you've installed the package with pip install azureml-opendatasets. Each discrete data set is represented by its own class in the SDK, and certain classes are available as either a TabularDataset, FileDataset, or both. See the reference documentation for a full list of classes.

Most classes inherit from and return an instance of TabularDataset. Examples of these classes include PublicHolidays, BostonSafety, and UsPopulationZip. To create a TabularDataset from these types of classes, use the constructor with no arguments. When you register a dataset created from Open Datasets, no data is immediately downloaded, but the data will be accessed later when requested (during training, for example) from a central storage location.

from azureml.opendatasets import UsPopulationZip

tabular_dataset = UsPopulationZip()
tabular_dataset = tabular_dataset.register(workspace=workspace, name="pop data", description="US population data by zip code")

You can retrieve certain classes as either a TabularDataset or FileDataset, which allows you to manipulate and/or download the files directly. Other classes can get a dataset only by using either the get_tabular_dataset() or get_file_dataset() functions. The following code sample shows a few examples of these types of classes:

from azureml.opendatasets import MNIST

# MNIST class can return either TabularDataset or FileDataset
tabular_dataset = MNIST.get_tabular_dataset()
file_dataset = MNIST.get_file_dataset()

from azureml.opendatasets import Diabetes

# Diabetes class can return ONLY return TabularDataset and must be called from the static function
diabetes_tabular = Diabetes.get_tabular_dataset()

Use the UI

You can also create datasets from Open Datasets classes through the UI. In your workspace, select the Datasets tab under Assets. On the Create dataset drop-down menu, select From Open Datasets.

Open Dataset with the UI

Select a dataset by selecting its tile. (You have the option to filter by using the search bar.) Select Next.

Choose dataset

Choose a name under which to register the dataset, and optionally filter the data by using the available filters. In this case, for the public holidays dataset, you filter the time period to one year and the country code to only the US. Select Create.

Set dataset params and create dataset

The dataset is now available in your workspace under Datasets. You can use it in the same way as other datasets you've created.

Version datasets

You can register a new dataset under the same name by creating a new version. A dataset version is a way to bookmark the state of your data so that you can apply a specific version of the dataset for experimentation or future reproduction. Learn more about dataset versions.

# create a TabularDataset from Titanic training data
web_paths = [
            'https://dprepdata.blob.core.windows.net/demo/Titanic.csv',
            'https://dprepdata.blob.core.windows.net/demo/Titanic2.csv'
           ]
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_paths)

# create a new version of titanic_ds
titanic_ds = titanic_ds.register(workspace = workspace,
                                 name = 'titanic_ds',
                                 description = 'new titanic training data',
                                 create_new_version = True)

Access datasets in your script

Registered datasets are accessible both locally and remotely on compute clusters like the Azure Machine Learning compute. To access your registered dataset across experiments, use the following code to access your workspace and registered dataset by name. By default, the get_by_name() method on the Dataset class returns the latest version of the dataset that's registered with the workspace.

%%writefile $script_folder/train.py

from azureml.core import Dataset, Run

run = Run.get_context()
workspace = run.experiment.workspace

dataset_name = 'titanic_ds'

# Get a dataset by name
titanic_ds = Dataset.get_by_name(workspace=workspace, name=dataset_name)

# Load a TabularDataset into pandas DataFrame
df = titanic_ds.to_pandas_dataframe()

Next steps