Datastore class
Definition
Represents a storage abstraction over an Azure Machine Learning storage account.
Datastores are attached to workspaces and are used to store connection information to Azure storage services, so you can refer to them by name without needing to remember the connection information and secrets used to connect to the storage services.
Examples of supported Azure storage services that can be registered as datastores are:
Azure Blob Container
Azure File Share
Azure Data Lake
Azure Data Lake Gen2
Azure SQL Database
Azure Database for PostgreSQL
Databricks File System
Azure Database for MySQL
Use this class to perform management operations, including registering, listing, getting, and removing datastores.
Datastores for each service are created with the register* methods of this class. When using a datastore
to access data, you must have permission to access that data, which depends on the credentials registered
with the datastore.
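For illustration, a minimal sketch of the management workflow follows; it assumes an existing Workspace object ws, and the datastore name 'my_datastore' is hypothetical.
import os

from azureml.core import Datastore

# Minimal management sketch; assumes an existing Workspace object `ws`,
# and 'my_datastore' is a hypothetical registered datastore name.

# List the names of the datastores registered in the workspace.
for name in ws.datastores:
    print(name)

# Get a datastore by name.
datastore = Datastore.get(ws, 'my_datastore')

# Remove the registration; the underlying storage service is not deleted.
datastore.unregister()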
For more information on datastores and how they can be used in machine learning, see the Azure Machine Learning documentation.
Datastore(workspace, name=None)
Inheritance
builtins.object → Datastore
Remarks
To interact with data in your datastores for machine learning tasks, like training, create an Azure Machine Learning dataset. Datasets provide functions that load tabular data into a pandas or Spark DataFrame. Datasets also provide the ability to download or mount files of any format from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL. Learn more about how to train with datasets.
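As an illustration of the dataset workflow, here is a hedged sketch of creating a tabular dataset from a registered datastore; it assumes an existing Workspace object ws, and the datastore name and file path are hypothetical.
from azureml.core import Dataset, Datastore

# Hedged sketch; `ws` is an existing Workspace, and the datastore name
# and file path below are hypothetical.
datastore = Datastore.get(ws, 'MyBlobDatastore')

# Create a TabularDataset from a delimited file stored on the datastore.
dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'testdata/train.csv')])

# Load the tabular data into a pandas DataFrame.
df = dataset.to_pandas_dataframe()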
The following example shows how to create a Datastore connected to an Azure Blob container.
import os

from azureml.core import Datastore
from azureml.data.data_reference import DataReference
from msrest.exceptions import HttpOperationError

blob_datastore_name='MyBlobDatastore'
account_name=os.getenv("BLOB_ACCOUNTNAME_62", "<my-account-name>") # Storage account name
container_name=os.getenv("BLOB_CONTAINER_62", "<my-container-name>") # Name of Azure blob container
account_key=os.getenv("BLOB_ACCOUNT_KEY_62", "<my-account-key>") # Storage account key

try:
    blob_datastore = Datastore.get(ws, blob_datastore_name)
    print("Found Blob Datastore with name: %s" % blob_datastore_name)
except HttpOperationError:
    blob_datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name=blob_datastore_name,
        account_name=account_name, # Storage account name
        container_name=container_name, # Name of Azure blob container
        account_key=account_key) # Storage account key
    print("Registered blob datastore with name: %s" % blob_datastore_name)

blob_data_ref = DataReference(
    datastore=blob_datastore,
    data_reference_name="blob_test_data",
    path_on_datastore="testdata")
Full sample is available from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-data-transfer.ipynb
Methods
| get(workspace, datastore_name) | Get a datastore by name. This is the same as calling the constructor. |
| get_default(workspace) | Get the default datastore for the workspace. |
| register_azure_blob_container(workspace, datastore_name, container_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False, blob_cache_timeout=None, grant_workspace_access=False, subscription_id=None, resource_group=None) | Register an Azure Blob Container to the datastore. You can choose to use a SAS token or a storage account key. |
| register_azure_data_lake(workspace, datastore_name, store_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, subscription_id=None, resource_group=None, overwrite=False) | Initialize a new Azure Data Lake Datastore. See below for an example of how to register an Azure Data Lake Gen1 store as a Datastore. |
| register_azure_data_lake_gen2(workspace, datastore_name, filesystem, account_name, tenant_id, client_id, client_secret, resource_url=None, authority_url=None, protocol=None, endpoint=None, overwrite=False) | Initialize a new Azure Data Lake Gen2 Datastore. |
| register_azure_file_share(workspace, datastore_name, file_share_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False) | Register an Azure File Share to the datastore. You can choose to use a SAS token or a storage account key. |
| register_azure_my_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, **kwargs) | Initialize a new Azure MySQL Datastore. A MySQL datastore can only be used to create a DataReference as input and output to a DataTransferStep in Azure Machine Learning pipelines. See below for an example of how to register an Azure MySQL database as a Datastore. |
| register_azure_postgre_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, enforce_ssl=True, **kwargs) | Initialize a new Azure PostgreSQL Datastore. See below for an example of how to register an Azure PostgreSQL database as a Datastore. |
| register_azure_sql_database(workspace, datastore_name, server_name, database_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, endpoint=None, overwrite=False, username=None, password=None, **kwargs) | Initialize a new Azure SQL database Datastore. See below for an example of how to register an Azure SQL database as a Datastore. |
| register_dbfs(workspace, datastore_name) | Initialize a new Databricks File System (DBFS) datastore. The DBFS datastore can only be used to create a DataReference as input and PipelineData as output to a DatabricksStep in Azure Machine Learning pipelines. |
| set_as_default() | Set this datastore as the default datastore for the workspace. |
| unregister() | Unregisters the datastore; the underlying storage service will not be deleted. |
get(workspace, datastore_name)
Get a datastore by name. This is the same as calling the constructor.
get(workspace, datastore_name)
Parameters
- workspace
- Workspace
The workspace.
- datastore_name
- str, optional
The name of the datastore, defaults to None, which gets the default datastore.
Returns
The corresponding datastore for that name.
Return type
Datastore
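For example, a minimal sketch; ws is an existing Workspace object and 'my_datastore' is a hypothetical datastore name.
from azureml.core import Datastore

# `ws` is an existing Workspace; 'my_datastore' is a hypothetical name.
datastore = Datastore.get(ws, 'my_datastore')
print(datastore.name)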
get_default(workspace)
Get the default datastore for the workspace.
get_default(workspace)
Parameters
- workspace
- Workspace
The workspace.
Returns
The default datastore for the workspace.
Return type
Datastore
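For example, a minimal sketch assuming an existing Workspace object ws.
from azureml.core import Datastore

# `ws` is an existing Workspace object. Unless changed with
# set_as_default(), the default is the workspace blob store.
default_datastore = Datastore.get_default(ws)
print(default_datastore.name)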
register_azure_blob_container(workspace, datastore_name, container_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False, blob_cache_timeout=None, grant_workspace_access=False, subscription_id=None, resource_group=None)
Register an Azure Blob Container to the datastore.
You can choose to use a SAS token or a storage account key.
register_azure_blob_container(workspace, datastore_name, container_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False, blob_cache_timeout=None, grant_workspace_access=False, subscription_id=None, resource_group=None)
Parameters
- workspace
- Workspace
The workspace.
- datastore_name
- str
The name of the datastore, case insensitive, can only contain alphanumeric characters and _.
- container_name
- str
The name of the Azure blob container.
- account_name
- str
The storage account name.
- sas_token
- str, optional
An account SAS token, defaults to None. For data read, we require a minimum of List & Read permissions for Containers & Objects; for data write, we additionally require Write & Add permissions.
- account_key
- str, optional
Access keys of your storage account, defaults to None.
- protocol
- str, optional
Protocol to use to connect to the blob container. If None, defaults to https.
- endpoint
- str, optional
The endpoint of the storage account. If None, defaults to core.windows.net.
- overwrite
- bool, optional
Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.
- create_if_not_exists
- bool, optional
Whether to create the blob container if it does not exist. The default is False.
- skip_validation
- bool, optional
Whether to skip validation of storage keys. The default is False.
- blob_cache_timeout
- int, optional
When this blob is mounted, set the cache timeout to this many seconds. If None, defaults to no timeout (i.e. blobs will be cached for the duration of the job when read).
- grant_workspace_access
- bool, optional
(Deprecated) This parameter is deprecated because the workspace managed identity no longer needs access to your storage account in order to register Azure Blob storage behind a virtual network (VNet) as a datastore. Defaults to False. Setting this to True uses your current identity to try to grant the workspace managed identity the Storage Blob Data Owner role on the storage account, which allows the service to communicate with Azure Blob storage even when the storage account is behind a VNet.
- subscription_id
- str, optional
The subscription id of the storage account, defaults to None.
- resource_group
- str, optional
The resource group of the storage account, defaults to None.
Returns
The blob datastore.
Return type
AzureBlobDatastore
Remarks
If you attach storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
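No SAS-token example is given in the original reference; a hedged sketch follows, assuming an existing Workspace object ws, with hypothetical names and a hypothetical SAS_TOKEN environment variable.
import os

from azureml.core import Datastore

# Hedged sketch: register a blob container with a SAS token instead of
# an account key. `ws` is an existing Workspace; the names and the
# SAS_TOKEN environment variable are hypothetical.
sas_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='my_sas_blob_datastore',
    container_name='my-container',
    account_name='mystorageaccount',
    sas_token=os.getenv("SAS_TOKEN", "<my-sas-token>"))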
register_azure_data_lake(workspace, datastore_name, store_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, subscription_id=None, resource_group=None, overwrite=False)
Initialize a new Azure Data Lake Datastore.
Please see below for an example of how to register an Azure Data Lake Gen1 as a Datastore.
import os

from azureml.core import Datastore

adlsgen1_datastore_name='adlsgen1datastore'

store_name=os.getenv("ADL_STORENAME", "<my_datastore_name>") # the ADLS name
subscription_id=os.getenv("ADL_SUBSCRIPTION", "<my_subscription_id>") # subscription id of the ADLS
resource_group=os.getenv("ADL_RESOURCE_GROUP", "<my_resource_group>") # resource group of ADLS
tenant_id=os.getenv("ADL_TENANT", "<my_tenant_id>") # tenant id of service principal
client_id=os.getenv("ADL_CLIENTID", "<my_client_id>") # client id of service principal
client_secret=os.getenv("ADL_CLIENT_SECRET", "<my_client_secret>") # the secret of service principal

adls_datastore = Datastore.register_azure_data_lake(
    workspace=ws,
    datastore_name=adlsgen1_datastore_name,
    subscription_id=subscription_id, # subscription id of ADLS account
    resource_group=resource_group, # resource group of ADLS account
    store_name=store_name, # ADLS account name
    tenant_id=tenant_id, # tenant id of service principal
    client_id=client_id, # client id of service principal
    client_secret=client_secret) # the secret of service principal
register_azure_data_lake(workspace, datastore_name, store_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, subscription_id=None, resource_group=None, overwrite=False)
Parameters
- workspace
- Workspace
The workspace this datastore belongs to.
- datastore_name
- str
The datastore name.
- store_name
- str
The ADLS store name.
- tenant_id
- str, optional
The Directory ID/Tenant ID of the service principal used to access data.
- client_id
- str, optional
The Client ID/Application ID of the service principal used to access data.
- client_secret
- str, optional
The Client Secret of the service principal used to access data.
- resource_url
- str, optional
The resource URL, which determines what operations will be performed on the Data Lake store. If None, defaults to https://datalake.azure.net/, which allows us to perform filesystem operations.
- authority_url
- str, optional
The authority URL used to authenticate the user. Defaults to https://login.microsoftonline.com.
- subscription_id
- str, optional
The ID of the subscription the ADLS store belongs to.
- resource_group
- str, optional
The resource group the ADLS store belongs to.
- overwrite
- bool, optional
Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.
Returns
Returns the Azure Data Lake Datastore.
Return type
AzureDataLakeDatastore
Remarks
If you attach storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
Note
Azure Data Lake Datastore supports data transfer and running U-SQL jobs using Azure Machine Learning pipelines.
You can also use it as a data source for an Azure Machine Learning dataset, which can be downloaded or mounted on any supported compute.
register_azure_data_lake_gen2(workspace, datastore_name, filesystem, account_name, tenant_id, client_id, client_secret, resource_url=None, authority_url=None, protocol=None, endpoint=None, overwrite=False)
Initialize a new Azure Data Lake Gen2 Datastore.
register_azure_data_lake_gen2(workspace, datastore_name, filesystem, account_name, tenant_id, client_id, client_secret, resource_url=None, authority_url=None, protocol=None, endpoint=None, overwrite=False)
Parameters
- workspace
- Workspace
The workspace this datastore belongs to.
- datastore_name
- str
The datastore name.
- filesystem
- str
The name of the Data Lake Gen2 filesystem.
- account_name
- str
The storage account name.
- tenant_id
- str
The Directory ID/Tenant ID of the service principal.
- client_id
- str
The Client ID/Application ID of the service principal.
- client_secret
- str
The secret of the service principal.
- resource_url
- str, optional
The resource URL, which determines what operations will be performed on the data lake store. If None, defaults to https://storage.azure.com/, which allows us to perform filesystem operations.
- authority_url
- str, optional
The authority URL used to authenticate the user. Defaults to https://login.microsoftonline.com.
- protocol
- str, optional
Protocol to use to connect to the blob container. If None, defaults to https.
- endpoint
- str, optional
The endpoint of the storage account. If None, defaults to core.windows.net.
- overwrite
- bool, optional
Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.
Returns
Returns the Azure Data Lake Gen2 Datastore.
Return type
AzureDataLakeGen2Datastore
Remarks
If you attach storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
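The original reference includes no example for this method; a hedged sketch follows, assuming an existing Workspace object ws, with all names and environment variables hypothetical.
import os

from azureml.core import Datastore

# Hedged sketch; `ws` is an existing Workspace, and all names and
# environment variables below are hypothetical.
adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name='adlsgen2datastore',
    filesystem='myfilesystem', # name of the Data Lake Gen2 filesystem
    account_name=os.getenv("ADLSGEN2_ACCOUNTNAME", "<my_account_name>"), # storage account name
    tenant_id=os.getenv("ADLSGEN2_TENANT", "<my_tenant_id>"), # tenant id of service principal
    client_id=os.getenv("ADLSGEN2_CLIENTID", "<my_client_id>"), # client id of service principal
    client_secret=os.getenv("ADLSGEN2_CLIENT_SECRET", "<my_client_secret>")) # secret of service principal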
register_azure_file_share(workspace, datastore_name, file_share_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False)
Register an Azure File Share to the datastore.
You can choose to use a SAS token or a storage account key.
register_azure_file_share(workspace, datastore_name, file_share_name, account_name, sas_token=None, account_key=None, protocol=None, endpoint=None, overwrite=False, create_if_not_exists=False, skip_validation=False)
Parameters
- workspace
- Workspace
The workspace this datastore belongs to.
- datastore_name
- str
The name of the datastore, case insensitive, can only contain alphanumeric characters and _.
- file_share_name
- str
The name of the Azure file share.
- account_name
- str
The storage account name.
- sas_token
- str, optional
An account SAS token, defaults to None. For data read, we require a minimum of List & Read permissions for Containers & Objects; for data write, we additionally require Write & Add permissions.
- account_key
- str, optional
Access keys of your storage account, defaults to None.
- protocol
- str, optional
The protocol to use to connect to the file share. If None, defaults to https.
- endpoint
- str, optional
The endpoint of the file share. If None, defaults to core.windows.net.
- overwrite
- bool, optional
Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.
- create_if_not_exists
- bool, optional
Whether to create the file share if it does not exist. The default is False.
- skip_validation
- bool, optional
Whether to skip validation of storage keys. The default is False.
Returns
The file datastore.
Return type
AzureFileDatastore
Remarks
If you attach storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
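The original reference includes no example for this method; a hedged sketch follows, assuming an existing Workspace object ws, with all names and environment variables hypothetical.
import os

from azureml.core import Datastore

# Hedged sketch; `ws` is an existing Workspace, and all names and
# environment variables below are hypothetical.
file_datastore = Datastore.register_azure_file_share(
    workspace=ws,
    datastore_name='my_file_datastore',
    file_share_name='myfileshare', # name of the Azure file share
    account_name=os.getenv("FILE_ACCOUNTNAME", "<my_account_name>"), # storage account name
    account_key=os.getenv("FILE_ACCOUNT_KEY", "<my_account_key>")) # storage account key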
register_azure_my_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, **kwargs)
Initialize a new Azure MySQL Datastore.
A MySQL datastore can only be used to create a DataReference as input and output to a DataTransferStep in Azure Machine Learning pipelines.
Please see below for an example of how to register an Azure MySQL database as a Datastore.
register_azure_my_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, **kwargs)
Parameters
- workspace
- Workspace
The workspace this datastore belongs to.
- datastore_name
- str
The datastore name.
- server_name
- str
The MySQL server name.
- database_name
- str
The MySQL database name.
- user_id
- str
The User ID of the MySQL server.
- user_password
- str
The user password of the MySQL server.
- port_number
- str, optional
The port number of the MySQL server.
- endpoint
- str, optional
The endpoint of the MySQL server. If None, defaults to mysql.database.azure.com.
- overwrite
- bool, optional
Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.
Returns
Returns the MySQL database Datastore.
Return type
AzureMySqlDatastore
Remarks
If you attach storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
import os

from azureml.core import Datastore

mysql_datastore_name="mysqldatastore"

server_name=os.getenv("MYSQL_SERVERNAME", "<my_server_name>") # FQDN name of the MySQL server
database_name=os.getenv("MYSQL_DATABASENAME", "<my_database_name>") # Name of the MySQL database
user_id=os.getenv("MYSQL_USERID", "<my_user_id>") # The User ID of the MySQL server
user_password=os.getenv("MYSQL_USERPW", "<my_user_password>") # The user password of the MySQL server

mysql_datastore = Datastore.register_azure_my_sql(
    workspace=ws,
    datastore_name=mysql_datastore_name,
    server_name=server_name,
    database_name=database_name,
    user_id=user_id,
    user_password=user_password)
register_azure_postgre_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, enforce_ssl=True, **kwargs)
Initialize a new Azure PostgreSQL Datastore.
Please see below for an example of how to register an Azure PostgreSQL database as a Datastore.
register_azure_postgre_sql(workspace, datastore_name, server_name, database_name, user_id, user_password, port_number=None, endpoint=None, overwrite=False, enforce_ssl=True, **kwargs)
Parameters
- workspace
- Workspace
The workspace this datastore belongs to.
- datastore_name
- str
The datastore name.
- server_name
- str
The PostgreSQL server name.
- database_name
- str
The PostgreSQL database name.
- user_id
- str
The User ID of the PostgreSQL server.
- user_password
- str
The user password of the PostgreSQL server.
- port_number
- str, optional
The port number of the PostgreSQL server.
- endpoint
- str, optional
The endpoint of the PostgreSQL server. If None, defaults to postgres.database.azure.com.
- overwrite
- bool, optional
Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.
- enforce_ssl
- bool
Indicates SSL requirement of PostgreSQL server. Defaults to True.
Returns
Returns the PostgreSQL database Datastore.
Return type
AzurePostgreSqlDatastore
Remarks
If you attach storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
import os

from azureml.core import Datastore

psql_datastore_name="postgresqldatastore"

server_name=os.getenv("PSQL_SERVERNAME", "<my_server_name>") # FQDN name of the PostgreSQL server
database_name=os.getenv("PSQL_DATABASENAME", "<my_database_name>") # Name of the PostgreSQL database
user_id=os.getenv("PSQL_USERID", "<my_user_id>") # The database user id
user_password=os.getenv("PSQL_USERPW", "<my_user_password>") # The database user password

psql_datastore = Datastore.register_azure_postgre_sql(
    workspace=ws,
    datastore_name=psql_datastore_name,
    server_name=server_name,
    database_name=database_name,
    user_id=user_id,
    user_password=user_password)
register_azure_sql_database(workspace, datastore_name, server_name, database_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, endpoint=None, overwrite=False, username=None, password=None, **kwargs)
Initialize a new Azure SQL database Datastore.
Please see below for an example of how to register an Azure SQL database as a Datastore.
register_azure_sql_database(workspace, datastore_name, server_name, database_name, tenant_id=None, client_id=None, client_secret=None, resource_url=None, authority_url=None, endpoint=None, overwrite=False, username=None, password=None, **kwargs)
Parameters
- workspace
- Workspace
The workspace this datastore belongs to.
- datastore_name
- str
The datastore name.
- server_name
- str
The SQL server name.
- database_name
- str
The SQL database name.
- tenant_id
- str, optional
The Directory ID/Tenant ID of the service principal.
- client_id
- str, optional
The Client ID/Application ID of the service principal.
- client_secret
- str, optional
The secret of the service principal.
- resource_url
- str, optional
The resource URL, which determines what operations will be performed on the SQL database store. If None, defaults to https://database.windows.net/.
- authority_url
- str, optional
The authority URL used to authenticate the user, defaults to https://login.microsoftonline.com.
- endpoint
- str, optional
The endpoint of the SQL server. If None, defaults to database.windows.net.
- overwrite
- bool, optional
Whether to overwrite an existing datastore. If the datastore does not exist, it will create one. The default is False.
- username
- str, optional
The username of the database user to access the database.
- password
- str, optional
The password of the database user to access the database.
- skip_validation
- bool, optional
Whether to skip validation of connecting to the SQL database. Defaults to False.
Returns
Returns the SQL database Datastore.
Return type
AzureSqlDatabaseDatastore
Remarks
If you attach storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
import os

from azureml.core import Datastore

sql_datastore_name="azuresqldatastore"

server_name=os.getenv("SQL_SERVERNAME", "<my_server_name>") # Name of the Azure SQL server
database_name=os.getenv("SQL_DATABASENAME", "<my_database_name>") # Name of the Azure SQL database
username=os.getenv("SQL_USER_NAME", "<my_sql_user_name>") # The username of the database user
password=os.getenv("SQL_USER_PASSWORD", "<my_sql_user_password>") # The password of the database user

sql_datastore = Datastore.register_azure_sql_database(
    workspace=ws,
    datastore_name=sql_datastore_name,
    server_name=server_name,
    database_name=database_name,
    username=username,
    password=password)
register_dbfs(workspace, datastore_name)
Initialize a new Databricks File System (DBFS) datastore.
The DBFS datastore can only be used to create a DataReference as input and PipelineData as output to a DatabricksStep in Azure Machine Learning pipelines.
register_dbfs(workspace, datastore_name)
Parameters
- workspace
- Workspace
The workspace this datastore belongs to.
- datastore_name
- str
The datastore name.
Returns
Returns the DBFS Datastore.
Return type
DBFSDatastore
Remarks
If you attach storage from a region other than the workspace region, it can result in higher latency and additional network usage costs.
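The original reference includes no example for this method; a minimal hedged sketch follows, assuming an existing Workspace object ws and a hypothetical datastore name.
from azureml.core import Datastore

# Hedged sketch; `ws` is an existing Workspace and the datastore name is
# hypothetical. Only a workspace and a name are required; data access
# happens inside the DatabricksStep run.
dbfs_datastore = Datastore.register_dbfs(workspace=ws, datastore_name='mydbfsdatastore')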
set_as_default()
Set this datastore as the default datastore for the workspace.
set_as_default()
This method takes no parameters; it sets the datastore on which it is called as the workspace default.
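For example, a minimal sketch, assuming an existing Workspace object ws and a hypothetical datastore name.
from azureml.core import Datastore

# Hedged sketch; `ws` and the datastore name are hypothetical.
datastore = Datastore.get(ws, 'my_datastore')
datastore.set_as_default() # subsequent Datastore.get_default(ws) calls return this datastore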
unregister()
Unregisters the datastore; the underlying storage service will not be deleted.
unregister()
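For example, a minimal sketch, assuming an existing Workspace object ws and a hypothetical datastore name.
from azureml.core import Datastore

# Hedged sketch; removes the registration from the workspace. The
# underlying storage service and its data are left untouched.
datastore = Datastore.get(ws, 'my_datastore')
datastore.unregister()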