Access data in Azure storage services

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this article, learn how to easily access your data in Azure storage services via Azure Machine Learning datastores. Datastores store connection information, like your subscription ID and token authorization, so you can access your storage without having to hard-code connection information in your scripts. You can create datastores from these Azure storage solutions. For unsupported storage solutions, we recommend moving your data to a supported Azure storage solution to save data egress cost during machine learning experiments. Learn how to move your data.

This how-to shows examples of the following tasks:

  • Create and register datastores
  • Get datastores from your workspace
  • Upload and download data
  • Access your data during training

Prerequisites

Create and register datastores

When you register an Azure storage solution as a datastore, you automatically create that datastore in a specific workspace. You can create and register datastores to a workspace using the Python SDK or Azure Machine Learning studio.

Using the Python SDK

All the register methods are on the Datastore class and have the form register_azure_*.

The information you need to populate the register() method can be found via the Azure portal. Select Storage Accounts on the left pane and choose the storage account you want to register. The Overview page provides information such as the account name and container or file share name. For authentication information, like the account key or SAS token, navigate to Account Keys under the Settings pane on the left.

The following examples show you how to register an Azure Blob Container or an Azure File Share as a datastore.

  • For an Azure Blob Container Datastore, use register_azure_blob_container().

    The following code creates and registers the datastore, my_datastore, to the workspace, ws. This datastore accesses the Azure blob container, my_blob_container, on the Azure storage account, my_storage_account, using the provided account key.

       datastore = Datastore.register_azure_blob_container(workspace=ws, 
                                                          datastore_name='my_datastore', 
                                                          container_name='my_blob_container',
                                                          account_name='my_storage_account', 
                                                          account_key='your storage account key',
                                                          create_if_not_exists=True)
    

    If your storage account is in a virtual network (VNet), only Azure blob datastore creation is supported. Set the grant_workspace_access parameter to True to grant your workspace access to your storage account, as shown in the sketch after this list.

  • For an Azure File Share Datastore, use register_azure_file_share().

    The following code creates and registers the datastore, my_datastore, to the workspace, ws. This datastore accesses the Azure file share, my_file_share, on the Azure storage account, my_storage_account, using the provided account key.

       datastore = Datastore.register_azure_file_share(workspace=ws, 
                                                       datastore_name='my_datastore', 
                                                       file_share_name='my_file_share',
                                                       account_name='my_storage_account', 
                                                       account_key='your storage account key',
                                                       create_if_not_exists=True)
    

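If your storage account is behind a virtual network, the following is a minimal sketch of granting the workspace access while registering the blob container. The datastore and account names are placeholders, and an account key or SAS token is still supplied as usual.

from azureml.core import Datastore

# Register a blob container in a VNet-protected storage account and
# grant the workspace managed access to it
datastore = Datastore.register_azure_blob_container(workspace=ws,
                                                    datastore_name='my_vnet_datastore',
                                                    container_name='my_blob_container',
                                                    account_name='my_storage_account',
                                                    account_key='your storage account key',
                                                    grant_workspace_access=True)
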
Storage guidance

We recommend using an Azure Blob Container. Both standard and premium storage are available for blobs. Although premium storage is more expensive, its faster throughput speeds may improve the speed of your training runs, particularly if you train against a large data set. See the Azure pricing calculator for storage account cost information.

Using Azure Machine Learning studio

Create a new datastore in a few steps in Azure Machine Learning studio.

  1. Sign in to Azure Machine Learning studio.
  2. Select Datastores in the left pane under Manage.
  3. Select + New datastore.
  4. Complete the New datastore form. The form intelligently updates based on the Azure storage type and authentication type selections.

The information you need to populate the form can be found via the Azure portal. Select Storage Accounts on the left pane and choose the storage account you want to register. The Overview page provides information such as the account name and container or file share name. For authentication items, like the account key or SAS token, navigate to Account Keys under the Settings pane on the left.

The following example demonstrates what the form would look like for creating an Azure blob datastore.

Screenshot: New datastore form

Get datastores from your workspace

To get a specific datastore registered in the current workspace, use the get() static method on the Datastore class:

#get named datastore from current workspace
datastore = Datastore.get(ws, datastore_name='your datastore name')

To get the list of datastores registered with a given workspace, you can use the datastores property on a workspace object:

#list all datastores registered in current workspace
datastores = ws.datastores
for name, datastore in datastores.items():
    print(name, datastore.datastore_type)

When you create a workspace, an Azure Blob Container and an Azure File Share are registered to the workspace as workspaceblobstore and workspacefilestore, respectively. They store the connection information for the blob container and the file share that are provisioned in the storage account attached to the workspace. The workspaceblobstore is set as the default datastore.

To get the workspace's default datastore:

datastore = ws.get_default_datastore()

To define a different default datastore for the current workspace, use the set_default_datastore() method on the workspace object:

#define default datastore for current workspace
ws.set_default_datastore('your datastore name')

Upload & download data

The upload() and download() methods described in the following examples are specific to the AzureBlobDatastore and AzureFileDatastore classes, and operate identically for both.

Upload

Upload either a directory or individual files to the datastore using the Python SDK.

To upload a directory to a datastore object named datastore:

import azureml.data
from azureml.data.azure_storage_datastore import AzureFileDatastore, AzureBlobDatastore

datastore.upload(src_dir='your source directory',
                 target_path='your target path',
                 overwrite=True,
                 show_progress=True)

The target_path parameter specifies the location in the file share (or blob container) to upload to. It defaults to None, in which case the data is uploaded to the root. If overwrite=True, any existing data at target_path is overwritten.

Or upload a list of individual files to the datastore via the upload_files() method.
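
The following is a minimal sketch of uploading individual files with upload_files(); the file paths and target path are placeholders.

# upload a list of individual files to the datastore
datastore.upload_files(files=['./data/train.csv', './data/test.csv'],
                       target_path='your target path',
                       overwrite=True,
                       show_progress=True)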

Download

Similarly, download data from a datastore to your local file system.

datastore.download(target_path='your target path',
                   prefix='your prefix',
                   show_progress=True)

The target_path parameter is the location of the local directory to download the data to. To specify a path to the folder in the file share (or blob container) to download, provide that path to prefix. If prefix is None, all the contents of your file share (or blob container) will get downloaded.

Access your data during training

Important

Using Azure Machine Learning datasets is the new recommended way to access your data in training. Datasets provide functions that load tabular data into a pandas or Spark DataFrame, and the ability to download or mount files of any format from Azure Blob, Azure File, Azure Data Lake Gen 1, Azure Data Lake Gen 2, Azure SQL, and Azure PostgreSQL. Learn more about how to train with datasets.
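
For example, the following is a minimal sketch of loading a delimited file from a registered datastore into pandas with the Dataset class; the file path is a placeholder.

from azureml.core import Dataset

# Create a TabularDataset that points to a CSV file on the datastore
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'path/to/data.csv'))

# Load the tabular data into a pandas DataFrame
df = dataset.to_pandas_dataframe()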

The following table lists the methods that tell the compute target how to use the datastores during runs.

Way | Method | Description
Mount | as_mount() | Use to mount the datastore on the compute target.
Download | as_download() | Use to download the contents of your datastore to the location specified by path_on_compute. This download happens before the run.
Upload | as_upload() | Use to upload a file from the location specified by path_on_compute to your datastore. This upload happens after your run.

To reference a specific folder or file in your datastore and make it available on the compute target, use the datastore path() method.

#to mount the full contents in your storage to the compute target
datastore.as_mount()

#to download the contents of the `./bar` directory in your storage to the compute target
datastore.path('./bar').as_download()
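
To control where the data lands on the compute target, as_download() and as_upload() also accept the path_on_compute argument listed in the preceding table. A minimal sketch; the paths are placeholders.

#to download the contents of the `./bar` directory to a specific folder on the compute target
datastore.path('./bar').as_download(path_on_compute='data/bar')

#to upload a file from a specific location on the compute target to your datastore
datastore.as_upload(path_on_compute='./outputs/bar.pkl')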

Note

Any specified datastore or datastore.path object resolves to an environment variable name of the format "$AZUREML_DATAREFERENCE_XXXX", whose value represents the mount/download path on the target compute. The datastore path on the target compute might not be the same as the execution path for the training script.

Examples

The following code examples are specific to the Estimator class for accessing data during training.

script_params is a dictionary containing parameters to the entry_script. Use it to pass in a datastore and describe how data is made available on the compute target. Learn more from our end-to-end tutorial.

from azureml.train.estimator import Estimator

script_params = {
    '--data_dir': datastore.path('/bar').as_mount()
}

est = Estimator(source_directory='your code directory',
                entry_script='train.py',
                script_params=script_params,
                compute_target=compute_target
                )
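
On the script side, the value passed for --data_dir resolves to the mount path on the compute target, so train.py can read it as an ordinary argument. The following is a minimal, hypothetical sketch of that entry script (not taken from this article).

import argparse
import os

parser = argparse.ArgumentParser()
# Receives the resolved mount path of the datastore folder
parser.add_argument('--data_dir', type=str)
args = parser.parse_args()

# List the files made available from the datastore
print(os.listdir(args.data_dir))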

You can also pass in a list of datastores to the Estimator constructor inputs parameter to mount or copy data to/from your datastore(s). This code example:

  • Downloads all the contents in datastore1 to the compute target before your training script train.py is run
  • Downloads the folder './foo' in datastore2 to the compute target before train.py is run
  • Uploads the file './bar.pkl' from the compute target to datastore3 after your script has run

est = Estimator(source_directory='your code directory',
                compute_target=compute_target,
                entry_script='train.py',
                inputs=[datastore1.as_download(),
                        datastore2.path('./foo').as_download(),
                        datastore3.as_upload(path_on_compute='./bar.pkl')])

Compute and datastore matrix

Datastores currently support storing connection information to the storage services listed in the following matrix. This matrix displays the available data access functionalities for the different compute targets and datastore scenarios. Learn more about the compute targets for Azure Machine Learning.

Compute | AzureBlobDatastore | AzureFileDatastore | AzureDataLakeDatastore | AzureDataLakeGen2Datastore, AzurePostgreSqlDatastore, AzureSqlDatabaseDatastore
Local | as_download(), as_upload() | as_download(), as_upload() | N/A | N/A
Azure Machine Learning Compute | as_mount(), as_download(), as_upload(), ML pipelines | as_mount(), as_download(), as_upload(), ML pipelines | N/A | N/A
Virtual machines | as_download(), as_upload() | as_download(), as_upload() | N/A | N/A
HDInsight | as_download(), as_upload() | as_download(), as_upload() | N/A | N/A
Data transfer | ML pipelines | N/A | ML pipelines | ML pipelines
Databricks | ML pipelines | N/A | ML pipelines | N/A
Azure Batch | ML pipelines | N/A | N/A | N/A
Azure DataLake Analytics | N/A | N/A | ML pipelines | N/A

Note

There may be scenarios in which highly iterative, large data processes run faster using as_download() instead of as_mount(); this can be validated experimentally.

Accessing source code during training

Azure blob storage has higher throughput speeds than Azure file share and will scale to large numbers of jobs started in parallel. For this reason, we recommend configuring your runs to use blob storage for transferring source code files.

The following code example specifies in the run configuration which blob datastore to use for source code transfers.

# workspaceblobstore is the default blob storage
run_config.source_directory_data_store = "workspaceblobstore" 
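
The following is a minimal sketch of where this setting fits, assuming a RunConfiguration attached to a ScriptRunConfig; the directory and script names are placeholders.

from azureml.core import ScriptRunConfig
from azureml.core.runconfig import RunConfiguration

run_config = RunConfiguration()
# Transfer source code through the workspace's default blob datastore
run_config.source_directory_data_store = "workspaceblobstore"

src = ScriptRunConfig(source_directory='your code directory',
                      script='train.py',
                      run_config=run_config)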

Access data during scoring

Azure Machine Learning provides several ways to use your models for scoring. Some of these methods do not provide access to datastores. Use the following table to understand which methods allow you to access datastores during scoring:

Method | Datastore access | Description
Batch prediction | Yes | Make predictions on large quantities of data asynchronously.
Web service | No | Deploy model(s) as a web service.
IoT Edge module | No | Deploy model(s) to IoT Edge devices.

For situations where the SDK doesn't provide access to datastores, you may be able to create custom code using the relevant Azure SDK to access the data. For example, the Azure Storage SDK for Python is a client library that you can use to access data stored in blobs or files.
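
For example, the following is a minimal sketch of reading a single blob with the azure-storage-blob (v12) client library; the connection string, container, and blob names are placeholders.

from azure.storage.blob import BlobServiceClient

# Connect to the storage account using its connection string
service = BlobServiceClient.from_connection_string('your connection string')

# Download one blob's contents as bytes
blob_client = service.get_blob_client(container='your container', blob='path/to/data.csv')
data = blob_client.download_blob().readall()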

Move data to supported Azure storage solutions

Azure Machine Learning supports accessing data from Azure Blob, Azure File, Azure Data Lake Gen 1, Azure Data Lake Gen 2, Azure SQL, and Azure PostgreSQL. For unsupported storage, to save data egress cost during machine learning experiments, we recommend you move your data to these supported Azure storage solutions using Azure Data Factory. Azure Data Factory provides efficient and resilient data transfer with more than 80 prebuilt connectors, including Azure data services, on-premises data sources, Amazon S3 and Redshift, and Google BigQuery, at no additional cost. Follow the step-by-step guide to move your data using Azure Data Factory.

Next steps