Access data in Azure storage services

In this article, learn how to easily access your data in Azure storage services via Azure Machine Learning datastores. Datastores are used to store connection information, like your subscription ID and token authorization. Using datastores allows you to access your storage without having to hard-code connection information in your scripts. You can create datastores from these Azure storage solutions.

This how-to shows examples of the following tasks:

Prerequisites

Create and register datastores

When you register an Azure storage solution as a datastore, you automatically create that datastore in a specific workspace. You can create and register datastores to a workspace by using the Python SDK or the workspace landing page.

Using the Python SDK

All the register methods are on the Datastore class and have the form register_azure_*.

The information you need to populate the register() method can be found via the Azure portal. Select Storage Accounts on the left pane and choose the storage account you want to register. The Overview page provides information such as the account name and container or file share name. For authentication information, like an account key or SAS token, navigate to Account Keys under the Settings pane on the left.

The following examples show you how to register an Azure Blob container or an Azure file share as a datastore.

  • For an Azure Blob container datastore, use register_azure_blob_container().

    The following code creates the datastore my_datastore and registers it to the workspace ws. This datastore accesses the Azure blob container my_blob_container on the Azure storage account my_storage_account, using the provided account key.

       from azureml.core import Datastore

       datastore = Datastore.register_azure_blob_container(workspace=ws,
                                                          datastore_name='my_datastore', 
                                                          container_name='my_blob_container',
                                                          account_name='my_storage_account', 
                                                          account_key='your storage account key',
                                                          create_if_not_exists=True)
    
  • For an Azure file share datastore, use register_azure_file_share().

    The following code creates the datastore my_datastore and registers it to the workspace ws. This datastore accesses the Azure file share my_file_share on the Azure storage account my_storage_account, using the provided account key.

       datastore = Datastore.register_azure_file_share(workspace=ws, 
                                                      datastore_name='my_datastore', 
                                                      file_share_name='my_file_share',
                                                      account_name='my_storage_account',
                                                      account_key='your storage account key',
                                                      create_if_not_exists=True)
    

Storage guidance

We recommend Azure Blob containers. Both standard and premium storage are available for blobs. Although more expensive, we suggest premium storage due to faster throughput speeds that may improve the speed of your training runs, particularly if you train against a large data set. See the Azure pricing calculator for storage account cost information.

Using the workspace landing page

Create a new datastore in a few steps in the workspace landing page.

  1. Sign in to the workspace landing page.
  2. Select Datastores in the left pane under Manage.
  3. Select + New datastore.
  4. Complete the New datastore form. The form intelligently updates based on the Azure storage type and authentication type selections.

The information you need to populate the form can be found via the Azure portal. Select Storage Accounts on the left pane and choose the storage account you want to register. The Overview page provides information such as the account name and container or file share name. For authentication items, like an account key or SAS token, navigate to Account Keys under the Settings pane on the left.

The following example demonstrates what the form would look like for creating an Azure blob datastore.

(Screenshot: the New datastore form)

Get datastores from your workspace

To get a specific datastore registered in the current workspace, use the get() static method on the Datastore class:

#get named datastore from current workspace
datastore = Datastore.get(ws, datastore_name='your datastore name')

To get the list of datastores registered with a given workspace, use the datastores property on a workspace object:

#list all datastores registered in current workspace
datastores = ws.datastores
for name, datastore in datastores.items():
    print(name, datastore.datastore_type)

When you create a workspace, an Azure Blob container and an Azure file share are automatically registered to the workspace, named workspaceblobstore and workspacefilestore respectively. They store the connection information of the blob container and the file share that is provisioned in the storage account attached to the workspace. The workspaceblobstore is set as the default datastore.

To get the workspace's default datastore:

datastore = ws.get_default_datastore()

To define a different default datastore for the current workspace, use the set_default_datastore() method on the workspace object:

#define default datastore for current workspace
ws.set_default_datastore('your datastore name')

Upload & download data

The upload() and download() methods described in the following examples are specific to, and operate identically for, the AzureBlobDatastore and AzureFileDatastore classes.

Upload

Upload either a directory or individual files to the datastore by using the Python SDK.

To upload a directory to a datastore datastore:

import azureml.data
from azureml.data.azure_storage_datastore import AzureFileDatastore, AzureBlobDatastore

datastore.upload(src_dir='your source directory',
                 target_path='your target path',
                 overwrite=True,
                 show_progress=True)

The target_path parameter specifies the location in the file share (or blob container) to upload to. It defaults to None, in which case the data gets uploaded to root. When overwrite=True, any existing data at target_path is overwritten.

Or upload a list of individual files to the datastore via the upload_files() method.
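As a minimal sketch, uploading a few local files with upload_files() could look like the helper below; the helper name, file paths, and target folder are placeholders, and datastore is assumed to be an already-registered datastore object.

```python
# Hypothetical helper: upload a list of local files to a folder on the datastore.
# The default target folder 'data' is a placeholder, not a value from this article.
def upload_csvs(datastore, paths, target_path='data'):
    """Upload the given local files into target_path on the datastore."""
    return datastore.upload_files(files=paths,
                                  target_path=target_path,
                                  overwrite=True,
                                  show_progress=False)
```

You would call it with your registered datastore and local paths, for example upload_csvs(datastore, ['./train.csv', './test.csv']).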

Download

Similarly, download data from a datastore to your local file system.

datastore.download(target_path='your target path',
                   prefix='your prefix',
                   show_progress=True)

The target_path parameter is the location of the local directory to download the data to. To specify a path to the folder in the file share (or blob container) to download, provide that path to prefix. If prefix is None, all the contents of your file share (or blob container) will be downloaded.

Access your data during training

Important

Using Azure Machine Learning datasets (preview) is the new recommended way to access your data in training. Datasets provide functions that load tabular data into a pandas or Spark DataFrame, and the ability to download or mount files of any format from Azure Blob, Azure File, Azure Data Lake Gen 1, Azure Data Lake Gen 2, Azure SQL, and Azure PostgreSQL. Learn more about how to train with datasets.

The following table lists the methods that tell the compute target how to use the datastores during runs.

Way | Method | Description
Mount | as_mount() | Use to mount the datastore on the compute target.
Download | as_download() | Use to download the contents of your datastore to the location specified by path_on_compute. This download happens before the run.
Upload | as_upload() | Use to upload a file from the location specified by path_on_compute to your datastore. This upload happens after your run.

To reference a specific folder or file in your datastore and make it available on the compute target, use the datastore path() method.

#to mount the full contents in your storage to the compute target
datastore.as_mount()

#to download the contents of the `./bar` directory in your storage to the compute target
datastore.path('./bar').as_download()

Note

Any specified datastore or datastore.path object resolves to an environment variable name of the format "$AZUREML_DATAREFERENCE_XXXX", whose value represents the mount/download path on the target compute. The datastore path on the target compute might not be the same as the execution path for the training script.

Examples

The following code examples are specific to the Estimator class for accessing data during training.

script_params is a dictionary containing parameters to the entry_script. Use it to pass in a datastore and describe how data is made available on the compute target. Learn more from our end-to-end tutorial.

from azureml.train.estimator import Estimator

script_params = {
    '--data_dir': datastore.path('/bar').as_mount()
}

est = Estimator(source_directory='your code directory',
                entry_script='train.py',
                script_params=script_params,
                compute_target=compute_target
                )
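For illustration, the entry script can read the path handed over through script_params with a standard argument parser. This is a hypothetical fragment of what train.py might contain; the --data_dir name matches the dictionary key above, and the helper name is an assumption.

```python
# Hypothetical train.py fragment: parse the datastore mount/download path
# passed in via script_params as '--data_dir'.
import argparse

def parse_data_dir(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', type=str,
                        help='path where the datastore is mounted or downloaded')
    return parser.parse_args(argv).data_dir
```

Inside the run, parse_data_dir() returns the resolved path on the compute target, and the script can read its training files from that directory.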

You can also pass a list of datastores to the Estimator constructor's inputs parameter to mount or copy data to/from your datastore(s). This code example:

  • Downloads all the contents in datastore1 to the compute target before your training script train.py is run
  • Downloads the folder './foo' in datastore2 to the compute target before train.py is run
  • Uploads the file './bar.pkl' from the compute target to datastore3 after your script has run

est = Estimator(source_directory='your code directory',
                compute_target=compute_target,
                entry_script='train.py',
                inputs=[datastore1.as_download(), datastore2.path('./foo').as_download(), datastore3.as_upload(path_on_compute='./bar.pkl')])

Compute and datastore matrix

Datastores currently support storing connection information to the storage services listed in the following matrix. This matrix displays the available data access functionalities for the different compute targets and datastore scenarios. Learn more about the compute targets for Azure Machine Learning.

Compute | AzureBlobDatastore | AzureFileDatastore | AzureDataLakeDatastore | AzureDataLakeGen2Datastore, AzurePostgreSqlDatastore, AzureSqlDatabaseDatastore
Local | as_download(), as_upload() | as_download(), as_upload() | N/A | N/A
Azure Machine Learning Compute | as_mount(), as_download(), as_upload(), ML pipelines | as_mount(), as_download(), as_upload(), ML pipelines | N/A | N/A
Virtual machines | as_download(), as_upload() | as_download(), as_upload() | N/A | N/A
HDInsight | as_download(), as_upload() | as_download(), as_upload() | N/A | N/A
Data transfer | ML pipelines | N/A | ML pipelines | ML pipelines
Databricks | ML pipelines | N/A | ML pipelines | N/A
Azure Batch | ML pipelines | N/A | N/A | N/A
Azure DataLake Analytics | N/A | N/A | ML pipelines | N/A

Note

There may be scenarios in which highly iterative, large data processes run faster using as_download() instead of as_mount(); this can be validated experimentally.
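One way to keep that experiment cheap is to put the choice behind a small switch. The helper below is a sketch under the assumption that you toggle between the two access modes with a single flag; the helper name is hypothetical.

```python
# Hypothetical helper: build a data reference for a datastore folder, choosing
# as_download() for highly iterative access patterns and as_mount() otherwise.
def data_reference(datastore, folder, iterative=False):
    ref = datastore.path(folder)
    return ref.as_download() if iterative else ref.as_mount()
```

You can then rerun the same training configuration with iterative=True and iterative=False and compare run times.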

Accessing source code during training

Azure blob storage has higher throughput speeds than Azure file share and will scale to large numbers of jobs started in parallel. For this reason, we recommend configuring your runs to use blob storage for transferring source code files.

The following code example specifies, in the run configuration, which blob datastore to use for source code transfers.

from azureml.core.runconfig import RunConfiguration

run_config = RunConfiguration()
# workspaceblobstore is the default blob storage
run_config.source_directory_data_store = "workspaceblobstore"

Access data during scoring

Azure Machine Learning provides several ways to use your models for scoring. Some of these methods don't provide access to datastores. Use the following table to understand which methods allow you to access datastores during scoring:

Method | Datastore access | Description
Batch prediction | ✔ | Make predictions on large quantities of data asynchronously.
Web service | | Deploy model(s) as a web service.
IoT Edge module | | Deploy model(s) to IoT Edge devices.

For situations where the SDK doesn't provide access to datastores, you may be able to create custom code using the relevant Azure SDK to access the data. For example, the Azure Storage SDK for Python is a client library that you can use to access data stored in blobs or files.
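As a minimal sketch, assuming the azure-storage-blob package (v12) is installed and that you supply your own connection string, container name, and blob name (all placeholders here), reading a blob's contents could look like this:

```python
# Hypothetical example using the Azure Storage SDK for Python (azure-storage-blob, v12).
# The connection string, container, and blob names are caller-supplied placeholders.
def read_blob(connection_string, container_name, blob_name):
    from azure.storage.blob import BlobServiceClient  # requires azure-storage-blob
    service = BlobServiceClient.from_connection_string(connection_string)
    blob = service.get_blob_client(container=container_name, blob=blob_name)
    return blob.download_blob().readall()
```

Calling read_blob() with valid credentials returns the raw bytes of the blob, which your scoring code can then deserialize as needed.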

Next steps