Datastore class

Definition

Represents a storage abstraction over an Azure Machine Learning storage account.

Datastores are attached to workspaces and are used to store connection information to Azure storage services. They let you refer to storage by name, so you don't need to remember the connection information and secrets used to connect to the storage services.

Examples of supported Azure storage services that can be registered as datastores are:

  • Azure Blob Container

  • Azure File Share

  • Azure Data Lake

  • Azure Data Lake Gen2

  • Azure SQL Database

  • Azure Database for PostgreSQL

  • Databricks File System

  • Azure Database for MySQL

Use this class to perform management operations, including register, list, get, and remove datastores. Datastores for each service are created with the register* methods of this class. When using a datastore to access data, you must have permission to access that data, which depends on the credentials registered with the datastore.

For more information on datastores and how they can be used in machine learning, see the Azure Machine Learning documentation on accessing and working with data.

Datastore(workspace, name=None)
Inheritance
builtins.object
Datastore

Remarks

To interact with data in your datastores for machine learning tasks, like training, create an Azure Machine Learning dataset. Datasets provide functions that load tabular data into a pandas or Spark DataFrame. Datasets also provide the ability to download or mount files of any format from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL. Learn more about how to train with datasets.
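
For example, the following is a minimal sketch of loading tabular data from a datastore into a pandas DataFrame; the datastore name 'my_blob_datastore' and the file path weather/2018/11.csv are hypothetical placeholders.

   from azureml.core import Workspace, Datastore, Dataset

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace
   datastore = Datastore.get(ws, 'my_blob_datastore') # Hypothetical datastore name

   # Build a TabularDataset from a CSV file on the datastore and load it into pandas.
   dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'weather/2018/11.csv'))
   df = dataset.to_pandas_dataframe()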

The following example shows how to create a Datastore connected to an Azure blob container.


   import os

   from azureml.core import Workspace, Datastore
   from azureml.data.data_reference import DataReference
   from msrest.exceptions import HttpOperationError

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   blob_datastore_name='MyBlobDatastore'
   account_name=os.getenv("BLOB_ACCOUNTNAME_62", "<my-account-name>") # Storage account name
   container_name=os.getenv("BLOB_CONTAINER_62", "<my-container-name>") # Name of Azure blob container
   account_key=os.getenv("BLOB_ACCOUNT_KEY_62", "<my-account-key>") # Storage account key

   try:
       # Reuse the datastore if it is already registered in the workspace.
       blob_datastore = Datastore.get(ws, blob_datastore_name)
       print("Found Blob Datastore with name: %s" % blob_datastore_name)
   except HttpOperationError:
       # Otherwise, register the blob container as a new datastore.
       blob_datastore = Datastore.register_azure_blob_container(
           workspace=ws,
           datastore_name=blob_datastore_name,
           account_name=account_name, # Storage account name
           container_name=container_name, # Name of Azure blob container
           account_key=account_key) # Storage account key
       print("Registered blob datastore with name: %s" % blob_datastore_name)

   blob_data_ref = DataReference(
       datastore=blob_datastore,
       data_reference_name="blob_test_data",
       path_on_datastore="testdata")

Full sample is available from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-data-transfer.ipynb

Methods

get(workspace, datastore_name)

Get a datastore by name. This is the same as calling the constructor.

get_default(workspace)

Get the default datastore for the workspace.

register_azure_blob_container(workspace, datastore_name, container_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False, blob_cache_timeout=None, grant_workspace_access=False, subscription_id=None, resource_group=None)

Register an Azure Blob Container to the datastore.

You can choose to use either a SAS token or a storage account key.

register_azure_data_lake(workspace, datastore_name, store_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, subscription_id=None, resource_group=None, overwrite=False)

Initialize a new Azure Data Lake Datastore.

Please see below for an example of how to register an Azure Data Lake Gen1 as a Datastore.


   adlsgen1_datastore_name='adlsgen1datastore'

   store_name=os.getenv("ADL_STORENAME", "<my_datastore_name>") # the ADLS name
   subscription_id=os.getenv("ADL_SUBSCRIPTION", "<my_subscription_id>") # subscription id of the ADLS
   resource_group=os.getenv("ADL_RESOURCE_GROUP", "<my_resource_group>") # resource group of ADLS
   tenant_id=os.getenv("ADL_TENANT", "<my_tenant_id>") # tenant id of service principal
   client_id=os.getenv("ADL_CLIENTID", "<my_client_id>") # client id of service principal
   client_secret=os.getenv("ADL_CLIENT_SECRET", "<my_client_secret>") # the secret of service principal

   adls_datastore = Datastore.register_azure_data_lake(
       workspace=ws,
       datastore_name=adlsgen1_datastore_name,
       subscription_id=subscription_id, # subscription id of ADLS account
       resource_group=resource_group, # resource group of ADLS account
       store_name=store_name, # ADLS account name
       tenant_id=tenant_id, # tenant id of service principal
       client_id=client_id, # client id of service principal
       client_secret=client_secret) # the secret of service principal
register_azure_data_lake_gen2(workspace, datastore_name, filesystem, account_name, tenant_id, client_id, client_secret, resource_url=None, authority_url=None, protocol=None, endpoint=None, overwrite=False)

Initialize a new Azure Data Lake Gen2 Datastore.

register_azure_file_share(workspace, datastore_name, file_share_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False)

Register an Azure File Share to the datastore.

You can choose to use either a SAS token or a storage account key.

register_azure_my_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, **kwargs)

Initialize a new Azure MySQL Datastore.

The MySQL datastore can only be used to create a DataReference as input and output to a DataTransferStep in Azure Machine Learning pipelines. More details can be found here.

Please see below for an example of how to register an Azure MySQL database as a Datastore.

register_azure_postgre_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, enforce_ssl=True, **kwargs)

Initialize a new Azure PostgreSQL Datastore.

Please see below for an example of how to register an Azure PostgreSQL database as a Datastore.

register_azure_sql_database(workspace, datastore_name, server_name, database_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, endpoint=None, overwrite=False, username=None, password=None, **kwargs)

Initialize a new Azure SQL database Datastore.

Please see below for an example of how to register an Azure SQL database as a Datastore.

register_dbfs(workspace, datastore_name)

Initialize a new Databricks File System (DBFS) datastore.

The DBFS datastore can only be used to create a DataReference as input and PipelineData as output to a DatabricksStep in Azure Machine Learning pipelines. More details can be found here.

set_as_default()

Set the default datastore.

unregister()

Unregisters the datastore. The underlying storage service will not be deleted.

get(workspace, datastore_name)

Get a datastore by name. This is the same as calling the constructor.

get(workspace, datastore_name)

Parameters

workspace
Workspace

The workspace.

datastore_name
str, optional

The name of the datastore, defaults to None, which gets the default datastore.

Returns

The corresponding datastore for that name.

Return type

Datastore

get_default(workspace)

Get the default datastore for the workspace.

get_default(workspace)

Parameters

workspace
Workspace

The workspace.

Returns

The default datastore for the workspace

Return type

Datastore
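
A minimal sketch showing both lookups, assuming a workspace loaded from a config.json file and a datastore registered under the hypothetical name 'MyBlobDatastore':

   from azureml.core import Workspace, Datastore

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   # Look up a datastore by name; equivalent to calling Datastore(ws, 'MyBlobDatastore').
   named_datastore = Datastore.get(ws, 'MyBlobDatastore')

   # Retrieve the workspace's current default datastore.
   default_datastore = Datastore.get_default(ws)
   print(named_datastore.name, default_datastore.name)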

register_azure_blob_container(workspace, datastore_name, container_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False, blob_cache_timeout=None, grant_workspace_access=False, subscription_id=None, resource_group=None)

Register an Azure Blob Container to the datastore.

You can choose to use SAS Token or Storage Account Key

register_azure_blob_container(workspace, datastore_name, container_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False, blob_cache_timeout=None, grant_workspace_access=False, subscription_id=None, resource_group=None)

Parameters

workspace
Workspace

The workspace.

datastore_name
str

The name of the datastore. The name is case insensitive and can contain only alphanumeric characters and underscores (_).

container_name
str

The name of the Azure blob container.

account_name
str

The storage account name.

sas_token
str, optional
default value: None

An account SAS token, defaults to None. For data read, we require a minimum of List & Read permissions for Containers & Objects; for data write, we additionally require Write & Add permissions.

account_key
str, optional
default value: None

Access keys of your storage account, defaults to None.

protocol
str, optional
default value: None

Protocol to use to connect to the blob container. If None, defaults to https.

endpoint
str, optional
default value: None

The endpoint of the storage account. If None, defaults to core.windows.net.

overwrite
bool, optional
default value: False

Whether to overwrite an existing datastore. If the datastore does not exist, one will be created. The default is False.

create_if_not_exists
bool, optional
default value: False

Whether to create the blob container if it does not exist. The default is False.

skip_validation
bool, optional
default value: False

Whether to skip validation of storage keys. The default is False.

blob_cache_timeout
int, optional
default value: None

When this blob is mounted, set the cache timeout to this many seconds. If None, defaults to no timeout (i.e. blobs will be cached for the duration of the job when read).

grant_workspace_access
bool, optional
default value: False

(Deprecated) This parameter is deprecated because the workspace managed identity no longer needs access to your storage account in order to register Azure Blob storage behind a VNet as a datastore. Defaults to False. Setting this to True will use your current identity to try to grant the workspace managed identity the Storage Blob Data Owner role on the storage account. With that role, the workspace managed identity can communicate with Azure Blob storage even when it is behind a VNet.

subscription_id
str, optional
default value: None

The subscription id of the storage account, defaults to None.

resource_group
str, optional
default value: None

The resource group of the storage account, defaults to None.

Returns

The blob datastore.

Return type

AzureBlobDatastore

Remarks

If you are attaching storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
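
For example, the same kind of registration can be performed with a SAS token instead of an account key. The following is a minimal sketch; the environment variable names and placeholder values are hypothetical.

   import os

   from azureml.core import Workspace, Datastore

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   sas_blob_datastore_name='MySasBlobDatastore'
   account_name=os.getenv("BLOB_ACCOUNTNAME", "<my-account-name>") # Storage account name
   container_name=os.getenv("BLOB_CONTAINER", "<my-container-name>") # Name of Azure blob container
   sas_token=os.getenv("BLOB_SAS_TOKEN", "<my-sas-token>") # Account SAS token

   sas_blob_datastore = Datastore.register_azure_blob_container(
       workspace=ws,
       datastore_name=sas_blob_datastore_name,
       container_name=container_name,
       account_name=account_name,
       sas_token=sas_token) # SAS token used instead of account_key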

register_azure_data_lake(workspace, datastore_name, store_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, subscription_id=None, resource_group=None, overwrite=False)

Initialize a new Azure Data Lake Datastore.

Please see below for an example of how to register an Azure Data Lake Gen1 as a Datastore.


   import os

   from azureml.core import Workspace, Datastore

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   adlsgen1_datastore_name='adlsgen1datastore'

   store_name=os.getenv("ADL_STORENAME", "<my_datastore_name>") # the ADLS name
   subscription_id=os.getenv("ADL_SUBSCRIPTION", "<my_subscription_id>") # subscription id of the ADLS
   resource_group=os.getenv("ADL_RESOURCE_GROUP", "<my_resource_group>") # resource group of ADLS
   tenant_id=os.getenv("ADL_TENANT", "<my_tenant_id>") # tenant id of service principal
   client_id=os.getenv("ADL_CLIENTID", "<my_client_id>") # client id of service principal
   client_secret=os.getenv("ADL_CLIENT_SECRET", "<my_client_secret>") # the secret of service principal

   adls_datastore = Datastore.register_azure_data_lake(
       workspace=ws,
       datastore_name=adlsgen1_datastore_name,
       subscription_id=subscription_id, # subscription id of ADLS account
       resource_group=resource_group, # resource group of ADLS account
       store_name=store_name, # ADLS account name
       tenant_id=tenant_id, # tenant id of service principal
       client_id=client_id, # client id of service principal
       client_secret=client_secret) # the secret of service principal
register_azure_data_lake(workspace, datastore_name, store_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, subscription_id=None, resource_group=None, overwrite=False)

Parameters

workspace
Workspace

The workspace this datastore belongs to.

datastore_name
str

The datastore name.

store_name
str

The ADLS store name.

tenant_id
str, optional
default value: None

The Directory ID/Tenant ID of the service principal used to access data.

client_id
str, optional
default value: None

The Client ID/Application ID of the service principal used to access data.

client_secret
str, optional
default value: None

The Client Secret of the service principal used to access data.

resource_url
str, optional
default value: None

The resource URL, which determines what operations will be performed on the Data Lake store. If None, defaults to https://datalake.azure.net/, which allows us to perform filesystem operations.

authority_url
str, optional
default value: None

The authority URL used to authenticate the user, defaults to https://login.microsoftonline.com.

subscription_id
str, optional
default value: None

The ID of the subscription the ADLS store belongs to.

resource_group
str, optional
default value: None

The resource group the ADLS store belongs to.

overwrite
bool, optional
default value: False

Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.

Returns

Returns the Azure Data Lake Datastore.

Return type

AzureDataLakeDatastore

Remarks

If you are attaching storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.

Note

Azure Data Lake Datastore supports data transfer and running U-SQL jobs using Azure Machine Learning pipelines.

You can also use it as a data source for an Azure Machine Learning dataset, which can be downloaded or mounted on any supported compute.

register_azure_data_lake_gen2(workspace, datastore_name, filesystem, account_name, tenant_id, client_id, client_secret, resource_url=None, authority_url=None, protocol=None, endpoint=None, overwrite=False)

Initialize a new Azure Data Lake Gen2 Datastore.

register_azure_data_lake_gen2(workspace, datastore_name, filesystem, account_name, tenant_id, client_id, client_secret, resource_url=None, authority_url=None, protocol=None, endpoint=None, overwrite=False)

Parameters

workspace
Workspace

The workspace this datastore belongs to.

datastore_name
str

The datastore name.

filesystem
str

The name of the Data Lake Gen2 filesystem.

account_name
str

The storage account name.

tenant_id
str

The Directory ID/Tenant ID of the service principal.

client_id
str

The Client ID/Application ID of the service principal.

client_secret
str

The secret of the service principal.

resource_url
str, optional
default value: None

The resource URL, which determines what operations will be performed on the data lake store. Defaults to https://storage.azure.com/, which allows us to perform filesystem operations.

authority_url
str, optional
default value: None

The authority URL used to authenticate the user, defaults to https://login.microsoftonline.com.

protocol
str, optional
default value: None

Protocol to use to connect to the blob container. If None, defaults to https.

endpoint
str, optional
default value: None

The endpoint of the storage account. If None, defaults to core.windows.net.

overwrite
bool, optional
default value: False

Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.

Returns

Returns the Azure Data Lake Gen2 Datastore.

Return type

AzureDataLakeGen2Datastore

Remarks

If you are attaching storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
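
Please see below for a minimal sketch of how to register an Azure Data Lake Gen2 filesystem as a Datastore using a service principal; the environment variable names and placeholder values are hypothetical.

   import os

   from azureml.core import Workspace, Datastore

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   adlsgen2_datastore_name='adlsgen2datastore'

   account_name=os.getenv("ADLSGEN2_ACCOUNTNAME", "<my_account_name>") # ADLS Gen2 storage account name
   filesystem=os.getenv("ADLSGEN2_FILESYSTEM", "<my_filesystem>") # Name of the ADLS Gen2 filesystem
   tenant_id=os.getenv("ADLSGEN2_TENANT", "<my_tenant_id>") # tenant id of service principal
   client_id=os.getenv("ADLSGEN2_CLIENTID", "<my_client_id>") # client id of service principal
   client_secret=os.getenv("ADLSGEN2_CLIENT_SECRET", "<my_client_secret>") # the secret of service principal

   adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
       workspace=ws,
       datastore_name=adlsgen2_datastore_name,
       filesystem=filesystem, # Name of the ADLS Gen2 filesystem
       account_name=account_name, # ADLS Gen2 account name
       tenant_id=tenant_id, # tenant id of service principal
       client_id=client_id, # client id of service principal
       client_secret=client_secret) # the secret of service principal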

register_azure_file_share(workspace, datastore_name, file_share_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False)

Register an Azure File Share to the datastore.

You can choose to use SAS Token or Storage Account Key

register_azure_file_share(workspace, datastore_name, file_share_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False)

Parameters

workspace
Workspace

The workspace this datastore belongs to.

datastore_name
str

The name of the datastore. The name is case insensitive and can contain only alphanumeric characters and underscores (_).

file_share_name
str

The name of the Azure file share.

account_name
str

The storage account name.

sas_token
str, optional
default value: None

An account SAS token, defaults to None. For data read, we require a minimum of List & Read permissions for Containers & Objects; for data write, we additionally require Write & Add permissions.

account_key
str, optional
default value: None

Access keys of your storage account, defaults to None.

protocol
str, optional
default value: None

The protocol to use to connect to the file share. If None, defaults to https.

endpoint
str, optional
default value: None

The endpoint of the file share. If None, defaults to core.windows.net.

overwrite
bool, optional
default value: False

Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.

create_if_not_exists
bool, optional
default value: False

Whether to create the file share if it does not exist. The default is False.

skip_validation
bool, optional
default value: False

Whether to skip validation of storage keys. The default is False.

Returns

The file datastore.

Return type

AzureFileDatastore

Remarks

If you are attaching storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
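
Please see below for a minimal sketch of how to register an Azure file share as a Datastore using a storage account key; the environment variable names and placeholder values are hypothetical.

   import os

   from azureml.core import Workspace, Datastore

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   file_datastore_name='MyFileDatastore'
   account_name=os.getenv("FILE_ACCOUNTNAME", "<my-account-name>") # Storage account name
   file_share_name=os.getenv("FILE_SHARENAME", "<my-fileshare-name>") # Name of the Azure file share
   account_key=os.getenv("FILE_ACCOUNT_KEY", "<my-account-key>") # Storage account key

   file_datastore = Datastore.register_azure_file_share(
       workspace=ws,
       datastore_name=file_datastore_name,
       file_share_name=file_share_name, # Name of the Azure file share
       account_name=account_name, # Storage account name
       account_key=account_key) # Storage account key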

register_azure_my_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, **kwargs)

Initialize a new Azure MySQL Datastore.

The MySQL datastore can only be used to create a DataReference as input and output to a DataTransferStep in Azure Machine Learning pipelines. More details can be found here.

Please see below for an example of how to register an Azure MySQL database as a Datastore.

register_azure_my_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, **kwargs)

Parameters

workspace
Workspace

The workspace this datastore belongs to.

datastore_name
str

The datastore name.

server_name
str

The MySQL server name.

database_name
str

The MySQL database name.

user_id
str

The User ID of the MySQL server.

user_password
str

The user password of the MySQL server.

port_number
str
default value: None

The port number of the MySQL server.

endpoint
str, optional
default value: None

The endpoint of the MySQL server. If None, defaults to mysql.database.azure.com.

overwrite
bool, optional
default value: False

Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.

Returns

Returns the MySQL database Datastore.

Return type

AzureMySqlDatastore

Remarks

If you are attaching storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.


   import os

   from azureml.core import Workspace, Datastore

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   mysql_datastore_name="mysqldatastore"
   server_name=os.getenv("MYSQL_SERVERNAME", "<my_server_name>") # FQDN name of the MySQL server
   database_name=os.getenv("MYSQL_DATABASENAME", "<my_database_name>") # Name of the MySQL database
   user_id=os.getenv("MYSQL_USERID", "<my_user_id>") # The user ID of the MySQL server
   user_password=os.getenv("MYSQL_USERPW", "<my_user_password>") # The user password of the MySQL server

   mysql_datastore = Datastore.register_azure_my_sql(
       workspace=ws,
       datastore_name=mysql_datastore_name,
       server_name=server_name,
       database_name=database_name,
       user_id=user_id,
       user_password=user_password)

register_azure_postgre_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, enforce_ssl=True, **kwargs)

Initialize a new Azure PostgreSQL Datastore.

Please see below for an example of how to register an Azure PostgreSQL database as a Datastore.

register_azure_postgre_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, enforce_ssl=True, **kwargs)

Parameters

workspace
Workspace

The workspace this datastore belongs to.

datastore_name
str

The datastore name.

server_name
str

The PostgreSQL server name.

database_name
str

The PostgreSQL database name.

user_id
str

The User ID of the PostgreSQL server.

user_password
str

The user password of the PostgreSQL server.

port_number
str
default value: None

The port number of the PostgreSQL server.

endpoint
str, optional
default value: None

The endpoint of the PostgreSQL server. If None, defaults to postgres.database.azure.com.

overwrite
bool, optional
default value: False

Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.

enforce_ssl
bool
default value: True

Indicates SSL requirement of PostgreSQL server. Defaults to True.

Returns

Returns the PostgreSQL database Datastore.

Return type

AzurePostgreSqlDatastore

Remarks

If you are attaching storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.


   import os

   from azureml.core import Workspace, Datastore

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   psql_datastore_name="postgresqldatastore"
   server_name=os.getenv("PSQL_SERVERNAME", "<my_server_name>") # FQDN name of the PostgreSQL server
   database_name=os.getenv("PSQL_DATABASENAME", "<my_database_name>") # Name of the PostgreSQL database
   user_id=os.getenv("PSQL_USERID", "<my_user_id>") # The database user id
   user_password=os.getenv("PSQL_USERPW", "<my_user_password>") # The database user password

   psql_datastore = Datastore.register_azure_postgre_sql(
       workspace=ws,
       datastore_name=psql_datastore_name,
       server_name=server_name,
       database_name=database_name,
       user_id=user_id,
       user_password=user_password)

register_azure_sql_database(workspace, datastore_name, server_name, database_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, endpoint=None, overwrite=False, username=None, password=None, **kwargs)

Initialize a new Azure SQL database Datastore.

Please see below for an example of how to register an Azure SQL database as a Datastore.

register_azure_sql_database(workspace, datastore_name, server_name, database_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, endpoint=None, overwrite=False, username=None, password=None, **kwargs)

Parameters

workspace
Workspace

The workspace this datastore belongs to.

datastore_name
str

The datastore name.

server_name
str

The SQL server name.

database_name
str

The SQL database name.

tenant_id
str
default value: None

The Directory ID/Tenant ID of the service principal.

client_id
str
default value: None

The Client ID/Application ID of the service principal.

client_secret
str
default value: None

The secret of the service principal.

resource_url
str, optional
default value: None

The resource URL, which determines what operations will be performed on the SQL database store. If None, defaults to https://database.windows.net/.

authority_url
str, optional
default value: None

The authority URL used to authenticate the user, defaults to https://login.microsoftonline.com.

endpoint
str, optional
default value: None

The endpoint of the SQL server. If None, defaults to database.windows.net.

overwrite
bool, optional
default value: False

Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.

username
str
default value: None

The username of the database user to access the database.

password
str
default value: None

The password of the database user to access the database.

skip_validation
bool, optional

Whether to skip validation of connecting to the SQL database. Defaults to False.

Returns

Returns the SQL database Datastore.

Return type

AzureSqlDatabaseDatastore

Remarks

If you are attaching storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.


   import os

   from azureml.core import Workspace, Datastore

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   sql_datastore_name="azuresqldatastore"
   server_name=os.getenv("SQL_SERVERNAME", "<my_server_name>") # Name of the Azure SQL server
   database_name=os.getenv("SQL_DATABASENAME", "<my_database_name>") # Name of the Azure SQL database
   username=os.getenv("SQL_USER_NAME", "<my_sql_user_name>") # The username of the database user
   password=os.getenv("SQL_USER_PASSWORD", "<my_sql_user_password>") # The password of the database user

   sql_datastore = Datastore.register_azure_sql_database(
       workspace=ws,
       datastore_name=sql_datastore_name,
       server_name=server_name,
       database_name=database_name,
       username=username,
       password=password)

register_dbfs(workspace, datastore_name)

Initialize a new Databricks File System (DBFS) datastore.

The DBFS datastore can only be used to create a DataReference as input and PipelineData as output to a DatabricksStep in Azure Machine Learning pipelines. More details can be found here.

register_dbfs(workspace, datastore_name)

Parameters

workspace
Workspace

The workspace this datastore belongs to.

datastore_name
str

The datastore name.

Returns

Returns the DBFS Datastore.

Return type

DBFSDatastore

Remarks

If you are attaching storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
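
Registration requires only the workspace and a datastore name. The following is a minimal sketch; the datastore name is a hypothetical placeholder.

   from azureml.core import Workspace, Datastore

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   # Register a DBFS datastore for use with DatabricksStep in Azure Machine Learning pipelines.
   dbfs_datastore = Datastore.register_dbfs(workspace=ws, datastore_name='mydbfsdatastore')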

set_as_default()

Set the default datastore.

set_as_default()

This method takes no parameters; calling it sets the datastore instance on which it is invoked as the default datastore of its workspace.
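
A minimal usage sketch, assuming a workspace loaded from a config.json file and a hypothetical datastore name:

   from azureml.core import Workspace, Datastore

   ws = Workspace.from_config() # An existing Azure Machine Learning workspace

   blob_datastore = Datastore.get(ws, 'MyBlobDatastore') # Hypothetical datastore name
   blob_datastore.set_as_default() # Make this datastore the workspace default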

unregister()

Unregisters the datastore. The underlying storage service will not be deleted.

unregister()