Access data from your datastores

In the Azure Machine Learning service, datastores are compute-location-independent mechanisms for accessing storage without requiring changes to your source code. Whether you write training code that takes a path as a parameter, or provide a datastore directly to an estimator, Azure Machine Learning workflows ensure your datastore locations are accessible and made available to your compute context.

This how-to shows examples of the following tasks:

  • Choose a datastore
  • Register your own datastore with the workspace
  • Find and define datastores
  • Upload and download data
  • Access datastores during training

Prerequisites

To use datastores, you first need a workspace.

Start by either creating a new workspace or retrieving an existing one:

import azureml.core
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()
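
If you don't have a workspace yet, you can create one instead of loading it from a config file. The following is a minimal sketch using Workspace.create(); the workspace name, subscription ID, resource group, and region shown here are placeholders:

# Create a new workspace (all values below are placeholders -- replace with your own)
ws = Workspace.create(name='your workspace name',
                      subscription_id='your subscription id',
                      resource_group='your resource group',
                      create_resource_group=True,
                      location='eastus2')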

Choose a datastore

You can use the default datastore or bring your own.

Use the default datastore in your workspace

Each workspace has a registered default datastore that you can use right away.

To get the workspace's default datastore:

ds = ws.get_default_datastore()

Register your own datastore with the workspace

If you have existing Azure Storage, you can register it as a datastore on your workspace. All the register methods are on the Datastore class and have the form register_azure_*.

The following examples show you how to register an Azure Blob Container or an Azure File Share as a datastore.

  • For an Azure Blob Container Datastore, use register_azure_blob_container(). For example:

    ds = Datastore.register_azure_blob_container(workspace=ws, 
                                                 datastore_name='your datastore name', 
                                                 container_name='your azure blob container name',
                                                 account_name='your storage account name', 
                                                 account_key='your storage account key',
                                                 create_if_not_exists=True)
    
  • For an Azure File Share Datastore, use register_azure_file_share(). For example:

    ds = Datastore.register_azure_file_share(workspace=ws, 
                                             datastore_name='your datastore name', 
                                             file_share_name='your file share name',
                                             account_name='your storage account name', 
                                             account_key='your storage account key',
                                             create_if_not_exists=True)
    

Find & define datastores

To get a specific datastore registered in the current workspace, use get():

#get named datastore from current workspace
ds = Datastore.get(ws, datastore_name='your datastore name')

To get a list of all datastores in a given workspace, use this code:

#list all datastores registered in current workspace
datastores = ws.datastores
for name, ds in datastores.items():
    print(name, ds.datastore_type)

To define a different default datastore for the current workspace, use set_default_datastore():

#define default datastore for current workspace
ws.set_default_datastore('your datastore name')

Upload & download data

The upload() and download() methods described in the following examples are specific to the AzureBlobDatastore and AzureFileDatastore classes, and operate identically for both.

Upload

Upload either a directory or individual files to the datastore using the Python SDK.

To upload a directory to a datastore ds:

import azureml.data
from azureml.data.azure_storage_datastore import AzureFileDatastore, AzureBlobDatastore

ds.upload(src_dir='your source directory',
          target_path='your target path',
          overwrite=True,
          show_progress=True)

target_path specifies the location in the file share (or blob container) to upload to. It defaults to None, in which case the data is uploaded to the root. overwrite=True overwrites any existing data at target_path.

Or upload a list of individual files to the datastore via the datastore's upload_files() method.
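
For example, a minimal sketch of upload_files(); the file names and target path are placeholders:

# Upload two individual files to the 'data' folder of the datastore
# (file names and target path are placeholders)
ds.upload_files(files=['./features.csv', './labels.csv'],
                target_path='data',
                overwrite=True,
                show_progress=True)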

Download

Similarly, download data from a datastore to your local file system.

ds.download(target_path='your target path',
            prefix='your prefix',
            show_progress=True)

target_path is the local directory to download the data to. To download only a specific folder in the file share (or blob container), provide that folder's path to prefix. If prefix is None, all the contents of your file share (or blob container) are downloaded.
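
For example, the following sketch contrasts downloading a single folder with downloading everything; the folder and local paths are placeholders:

# Download only the 'data/train' folder from the datastore
ds.download(target_path='./local_data',
            prefix='data/train',
            show_progress=True)

# Download the entire contents of the datastore
ds.download(target_path='./local_data',
            prefix=None,
            show_progress=True)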

Access datastores during training

Once you make your datastore available on the compute target, you can access it during training runs (for example, to read training or validation data) by passing its path as a parameter to your training script.

The following table lists the DataReference methods that tell the compute target how to use the datastore during runs.

Way        Method           Description
Mount      as_mount()       Use to mount the datastore on the compute target.
Download   as_download()    Use to download the contents of your datastore to the location specified by path_on_compute. For a training run, this download happens before the run.
Upload     as_upload()      Use to upload a file from the location specified by path_on_compute to your datastore. For a training run, this upload happens after the run.

import azureml.data
from azureml.data.data_reference import DataReference

ds.as_mount()
ds.as_download(path_on_compute='your path on compute')
ds.as_upload(path_on_compute='yourfilename')

To reference a specific folder or file in your datastore and make it available on the compute target, use the datastore's path() function.

#download the contents of the `./bar` directory in ds to the compute target
ds.path('./bar').as_download()

Note

Any ds or ds.path object resolves to an environment variable name of the format "$AZUREML_DATAREFERENCE_XXXX" whose value represents the mount/download path on the target compute. The datastore path on the target compute might not be the same as the execution path for the training script.

Compute context and datastore type matrix

The following matrix displays the available data access functionalities for the different compute context and datastore scenarios. The term "Pipeline" in this matrix refers to the ability to use datastores as an input or output in Azure Machine Learning Pipelines.

Compute contexts covered: Local Compute, Azure Machine Learning Compute, Data Transfer, Databricks, HDInsight, Azure Batch, Azure DataLake Analytics, Virtual Machines.

AzureBlobDatastore
    Local Compute: as_download(), as_upload()
    Azure Machine Learning Compute: as_mount(), as_download(), as_upload(), Pipeline
    Data Transfer: Pipeline
    Databricks: Pipeline
    HDInsight: as_download(), as_upload()
    Azure Batch: Pipeline
    Virtual Machines: as_download(), as_upload()

AzureFileDatastore
    Local Compute: as_download(), as_upload()
    Azure Machine Learning Compute: as_mount(), as_download(), as_upload(), Pipeline
    HDInsight: as_download(), as_upload()
    Virtual Machines: as_download(), as_upload()

AzureDataLakeDatastore
    Azure Machine Learning Compute: Pipeline
    Data Transfer: Pipeline
    Databricks: Pipeline

AzureDataLakeGen2Datastore
    Data Transfer: Pipeline

AzureDataPostgresSqlDatastore
    Data Transfer: Pipeline

AzureSqlDatabaseDataDatastore
    Data Transfer: Pipeline

Note

There may be scenarios in which highly iterative, large data processes run faster using as_download() instead of as_mount(); this can be validated experimentally.
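
For the scenarios marked Pipeline in the matrix, the datastore is consumed as an input or output of a pipeline step rather than mounted or downloaded directly. The following is a minimal sketch, assuming an existing compute_target and a script named prepare.py (both placeholders):

from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

# Reference a folder in the datastore as a pipeline input (path is a placeholder)
input_data = DataReference(datastore=ds,
                           data_reference_name='raw_data',
                           path_on_datastore='raw')

# Intermediate output written back to the workspace's default datastore
processed_data = PipelineData('processed_data', datastore=ws.get_default_datastore())

step = PythonScriptStep(script_name='prepare.py',
                        source_directory='your code directory',
                        arguments=['--input', input_data, '--output', processed_data],
                        inputs=[input_data],
                        outputs=[processed_data],
                        compute_target=compute_target)

pipeline = Pipeline(workspace=ws, steps=[step])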

Examples

The following code examples show how to use the Estimator class to access your datastore during training.

This code creates an estimator that runs the training script, train.py, from the indicated source directory with the parameters defined in script_params, all on the specified compute target.

from azureml.train.estimator import Estimator

script_params = {
    '--data_dir': ds.as_mount()
}

est = Estimator(source_directory='your code directory',
                entry_script='train.py',
                script_params=script_params,
                compute_target=compute_target
                )
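
On the compute target, the mounted datastore path is passed to train.py through the --data_dir argument defined in script_params. The following is a minimal sketch of how the script might read that path; the script body is illustrative only:

# train.py (sketch): read the datastore path passed in by the estimator
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data_dir', type=str, help='path to the mounted datastore')
args = parser.parse_args()

# Use the resolved path instead of hard-coding a storage location
print('Data folder:', args.data_dir)
print('Files:', os.listdir(args.data_dir))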

You can also pass in a list of datastores to the Estimator constructor's inputs parameter to mount or copy data to and from your datastore(s). This code example:

  • Downloads all the contents in datastore ds1 to the compute target before your training script train.py is run
  • Downloads the folder './foo' in datastore ds2 to the compute target before train.py is run
  • Uploads the file './bar.pkl' from the compute target up to the datastore ds3 after your script has run

est = Estimator(source_directory='your code directory',
                compute_target=compute_target,
                entry_script='train.py',
                inputs=[ds1.as_download(), ds2.path('./foo').as_download(), ds3.as_upload(path_on_compute='./bar.pkl')])

Next steps