Create Azure Machine Learning datasets

In this article, you learn how to create Azure Machine Learning datasets to access data for your local or remote experiments with the Azure Machine Learning Python SDK. To understand where datasets fit in Azure Machine Learning's overall data access workflow, see the Securely access data article.

By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost and don't risk the integrity of your data sources. Datasets are also lazily evaluated, which aids workflow performance. You can create datasets from datastores, public URLs, and Azure Open Datasets.

For a low-code experience, create Azure Machine Learning datasets with the Azure Machine Learning studio.

With Azure Machine Learning datasets, you can:

  • Keep a single copy of data in your storage, referenced by datasets.

  • Seamlessly access data during model training without worrying about connection strings or data paths. Learn more about how to train with datasets.

  • Share data and collaborate with other users.

Prerequisites

To create and work with datasets, you need:

  • An Azure subscription. If you don't have one, try the free or paid version of Azure Machine Learning.

  • An Azure Machine Learning workspace.

  • The Azure Machine Learning SDK for Python installed, which includes the azureml-datasets package.

Note

Some dataset classes have dependencies on the azureml-dataprep package, which is only compatible with 64-bit Python. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux (7, 8), Ubuntu (14.04, 16.04, 18.04), Fedora (27, 28), Debian (8, 9), and CentOS (7). If you're using an unsupported distribution, follow this guide to install .NET Core 2.1 to proceed.

Compute size guidance

When creating a dataset, review your compute processing power and the size of your data in memory. The size of your data in storage is not the same as the size of data in a dataframe. For example, data in CSV files can expand up to 10x in a dataframe, so a 1 GB CSV file can become 10 GB in a dataframe.

If your data is compressed, it can expand further; 20 GB of relatively sparse data stored in compressed Parquet format can expand to ~800 GB in memory. Since Parquet files store data in a columnar format, if you only need half of the columns, then you only need to load ~400 GB in memory.

Learn more about optimizing data processing in Azure Machine Learning.

Dataset types

There are two dataset types, based on how users consume them in training: FileDatasets and TabularDatasets. Both types can be used in Azure Machine Learning training workflows involving estimators, AutoML, HyperDrive, and pipelines.

FileDataset

A FileDataset references single or multiple files in your datastores or public URLs. If your data is already cleansed, and ready to use in training experiments, you can download or mount the files to your compute as a FileDataset object.

We recommend FileDatasets for your machine learning workflows, since the source files can be in any format, which enables a wider range of machine learning scenarios, including deep learning.

Create a FileDataset with the Python SDK or the Azure Machine Learning studio.
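
As a minimal sketch of consuming a FileDataset (assuming an existing FileDataset named file_ds, a hypothetical variable), downloading and mounting look like this:

# a minimal sketch; assumes `file_ds` is an existing FileDataset
# download the referenced files to the compute
file_paths = file_ds.download(target_path='./data', overwrite=True)

# or mount the files (on Unix-based compute) and read them in place
with file_ds.mount() as mount_context:
    print(mount_context.mount_point)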

TabularDataset

A TabularDataset represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas or Spark DataFrame so you can work with familiar data preparation and training libraries without having to leave your notebook. You can create a TabularDataset object from .csv, .tsv, .parquet, .jsonl files, and from SQL query results.

With TabularDatasets, you can specify a time stamp from a column in the data, or from the path pattern the data is stored in, to enable a time series trait. This specification allows for easy and efficient filtering by time. For an example, see the Tabular time series-related API demo with NOAA weather data.
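
As a minimal sketch (assuming an existing TabularDataset named tabular_ds with a 'datetime' column, both hypothetical; in older SDK versions the parameter is named fine_grain_timestamp), enabling the trait and filtering by time might look like this:

from datetime import datetime

# assumes `tabular_ds` is an existing TabularDataset with a 'datetime' column
ts_ds = tabular_ds.with_timestamp_columns(timestamp='datetime')

# filter rows to a time window via the time series trait
january_ds = ts_ds.time_between(datetime(2019, 1, 1), datetime(2019, 1, 31))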

Create a TabularDataset with the Python SDK or Azure Machine Learning studio.

Note

AutoML workflows generated via the Azure Machine Learning studio currently only support TabularDatasets.

Access datasets in a virtual network

If your workspace is in a virtual network, you must configure the dataset to skip validation. For more information on how to use datastores and datasets in a virtual network, see Secure a workspace and associated resources.
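
For example, the following sketch (assuming an existing datastore variable; the file path is a placeholder) skips validation at creation time:

from azureml.core import Dataset

# sketch: skip validation when storage is behind a virtual network or firewall
ds = Dataset.Tabular.from_delimited_files(path=[(datastore, 'path/to/data.csv')],
                                          validate=False)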

Create datasets

For the data to be accessible by Azure Machine Learning, datasets must be created from paths in Azure datastores or public web URLs.

To create datasets from an Azure datastore with the Python SDK:

  1. Verify that you have contributor or owner access to the registered Azure datastore.

  2. Create the dataset by referencing paths in the datastore. You can create a dataset from multiple paths in multiple datastores. There is no hard limit on the number of files or data size that you can create a dataset from.

Note

For each data path, a few requests are sent to the storage service to check whether it points to a file or a folder. This overhead can lead to degraded performance or failure. A dataset referencing one folder with 1,000 files inside is considered to reference one data path. For optimal performance, we recommend creating datasets that reference fewer than 100 paths in datastores.

Create a FileDataset

Use the from_files() method on the FileDatasetFactory class to load files in any format and to create an unregistered FileDataset.

If your storage is behind a virtual network or firewall, set the parameter validate=False in your from_files() method. This bypasses the initial validation step, and ensures that you can create your dataset from these secure files. Learn more about how to use datastores and datasets in a virtual network.

from azureml.core import Workspace, Datastore, Dataset

# get the workspace and an existing datastore by name
workspace = Workspace.from_config()
datastore = Datastore.get(workspace, 'your datastore name')

# create a FileDataset pointing to files in 'animals' folder and its subfolders recursively
datastore_paths = [(datastore, 'animals')]
animal_ds = Dataset.File.from_files(path=datastore_paths)

# create a FileDataset from image and label files behind public web URLs
web_paths = ['https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
             'https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz']
mnist_ds = Dataset.File.from_files(path=web_paths)

To reuse and share datasets across experiments in your workspace, register your dataset.

Tip

Upload files from a local directory and create a FileDataset in a single step with the public preview method, upload_directory(). This method is an experimental preview feature and may change at any time.

This method uploads data to your underlying storage, and as a result incurs storage costs.
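
A minimal sketch of the preview method (assuming an existing datastore variable; the local folder and target path are placeholders):

from azureml.core import Dataset
from azureml.data.datapath import DataPath

# upload a local folder and create a FileDataset in one step (preview)
file_ds = Dataset.File.upload_directory(src_dir='./local_data',
                                        target=DataPath(datastore, 'uploaded_data'),
                                        overwrite=True)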

Create a TabularDataset

Use the from_delimited_files() method on the TabularDatasetFactory class to read files in .csv or .tsv format, and to create an unregistered TabularDataset. If you're reading from multiple files, results will be aggregated into one tabular representation.

If your storage is behind a virtual network or firewall, set the parameter validate=False in your from_delimited_files() method. This bypasses the initial validation step, and ensures that you can create your dataset from these secure files. Learn more about how to use datastores and datasets in a virtual network.

The following code gets the existing workspace and the desired datastore by name, and then passes the datastore and file locations to the path parameter to create a new TabularDataset, weather_ds.

from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'

# get existing workspace
workspace = Workspace.from_config()
    
# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from 3 file paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]

weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)

By default, when you create a TabularDataset, column data types are inferred automatically. If the inferred types don't match your expectations, you can specify column types by using the following code. The parameter infer_column_types is only applicable for datasets created from delimited files. Learn more about supported data types.

from azureml.core import Dataset
from azureml.data.dataset_factory import DataType

# create a TabularDataset from a delimited file behind a public web url and convert column "Survived" to boolean
web_path ='https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path, set_column_types={'Survived': DataType.to_bool()})

# preview the first 3 rows of titanic_ds
titanic_ds.take(3).to_pandas_dataframe()
| (Index) | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | False | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 1 | 2 | True | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | True | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |

To reuse and share datasets across experiments in your workspace, register your dataset.

Create a dataset from a pandas dataframe

To create a TabularDataset from an in-memory pandas dataframe, write the data to a local file, such as a .csv, and create your dataset from that file. The following code demonstrates this workflow.

# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required

import os
from azureml.core import Workspace, Dataset

# `dataframe` is your in-memory pandas dataframe; write it to a local file
os.makedirs('data', exist_ok=True)
local_path = 'data/prepared.csv'
dataframe.to_csv(local_path)

subscription_id = 'xxxxxxxxxxxxxxxxxxxxx'
resource_group = 'xxxxxx'
workspace_name = 'xxxxxxxxxxxxxxxx'

workspace = Workspace(subscription_id, resource_group, workspace_name)

# get the default datastore to upload the prepared data
datastore = workspace.get_default_datastore()

# upload the local file from src_dir to the target_path in the datastore
datastore.upload(src_dir='data', target_path='data')

# create a TabularDataset referencing the uploaded file in the cloud
dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'data/prepared.csv')])

Tip

Create and register a TabularDataset from an in-memory Spark or pandas dataframe in a single step with the public preview methods, register_spark_dataframe() and register_pandas_dataframe(). These register methods are experimental preview features, and may change at any time.

These methods upload data to your underlying storage, and as a result incur storage costs.
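
A minimal sketch of the pandas variant (assuming workspace and dataframe from the previous example; the target path and dataset name are placeholders):

from azureml.core import Dataset

# register an in-memory pandas dataframe as a TabularDataset in one step (preview)
datastore = workspace.get_default_datastore()
prepared_ds = Dataset.Tabular.register_pandas_dataframe(dataframe=dataframe,
                                                        target=(datastore, 'prepared_data'),
                                                        name='prepared_ds')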

Register datasets

To complete the creation process, register your datasets with a workspace. Use the register() method to register datasets with your workspace in order to share them with others and reuse them across experiments in your workspace:

titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='titanic training data')
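
Once registered, you can retrieve the dataset by name from any experiment in the workspace:

from azureml.core import Dataset

# retrieve the registered dataset by name; the latest version is returned by default
titanic_ds = Dataset.get_by_name(workspace=workspace, name='titanic_ds')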

Create datasets using Azure Resource Manager

Several templates at https://github.com/Azure/azure-quickstart-templates/tree/master/101-machine-learning-dataset-create-* can be used to create datasets.

For information on using these templates, see Use an Azure Resource Manager template to create a workspace for Azure Machine Learning.

Create datasets with Azure Open Datasets

Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. Open Datasets are in the cloud on Microsoft Azure and are included in both the SDK and the studio.

Learn how to create Azure Machine Learning Datasets from Azure Open Datasets.
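
As a minimal sketch (assuming the azureml-opendatasets package is installed), loading an Open Dataset might look like this:

from azureml.opendatasets import PublicHolidays

# load the curated public holidays Open Dataset into a pandas dataframe
holidays_df = PublicHolidays().to_pandas_dataframe()
print(holidays_df.head())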

Train with datasets

Use your datasets in your machine learning experiments to train ML models. Learn more about how to train with datasets.
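
As a minimal sketch (assuming an existing experiment, compute_target, and the registered titanic_ds; the script directory and name are placeholders), passing a dataset to a training run might look like this:

from azureml.core import ScriptRunConfig

# pass the dataset to the training script as a named input
src = ScriptRunConfig(source_directory='./src',
                      script='train.py',
                      arguments=['--input-data', titanic_ds.as_named_input('titanic')],
                      compute_target=compute_target)
run = experiment.submit(src)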

Version datasets

You can register a new dataset under the same name by creating a new version. A dataset version is a way to bookmark the state of your data so that you can apply a specific version of the dataset for experimentation or future reproduction. Learn more about dataset versions.

# create a TabularDataset from Titanic training data
web_paths = ['https://dprepdata.blob.core.windows.net/demo/Titanic.csv',
             'https://dprepdata.blob.core.windows.net/demo/Titanic2.csv']
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_paths)

# create a new version of titanic_ds
titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='new titanic training data',
                                 create_new_version=True)

Next steps