Working with tables in Azure Machine Learning

04/19/2024

Applies to: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

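Azure Machine Learning supports a table type (mltable). It lets you create a blueprint that defines how to load data files into memory as a Pandas or Spark data frame. In this article, you learn:

When to use Azure Machine Learning tables instead of files or folders
How to install the mltable SDK
How to define a data loading blueprint using an MLTable file
Examples that show how mltable is used in Azure Machine Learning
How to use mltable during interactive development (for example, in a notebook)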

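Prerequisites

An Azure subscription. If you don't already have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning
The Azure Machine Learning SDK for Python
An Azure Machine Learning workspace

Important

Make sure you have the latest mltable package installed in your Python environment: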

pip install -U mltable azureml-dataprep[pandas]

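Clone the examples repository

The code snippets in this article are based on examples in the Azure Machine Learning examples GitHub repository. To clone the repository to your development environment, use this command: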

git clone --depth 1 https://github.com/Azure/azureml-examples

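Tip

Use --depth 1 to clone only the latest commit to the repository. This reduces the time needed to complete the operation.

You can find the examples relevant to Azure Machine Learning tables in this folder of the cloned repository: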

cd azureml-examples/sdk/python/using-mltable

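Introduction

Azure Machine Learning tables (mltable) let you define how to load data files into memory as Pandas and/or Spark data frames. Tables have two key features:

An MLTable file. A YAML-based file that defines the data loading blueprint. In the MLTable file, you can specify:
- The storage location or locations of the data: local, in the cloud, or on a public http(s) server.
- Globbing patterns over cloud storage. These locations can specify sets of file names with wildcard characters (*).
- Read transformations, for example the file format type (delimited text, Parquet, Delta, JSON), delimiters, headers, and so on.
- Column type conversions (to enforce a schema).
- New column creation using folder structure information, for example creating year and month columns from the {year}/{month} folder structure in the path.
- Subsets of data to load, for example filtering rows, keeping or dropping columns, or taking random samples.
A fast and efficient engine that loads the data into a Pandas or Spark data frame according to the blueprint defined in the MLTable file. The engine relies on Rust for high speed and memory efficiency.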
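To make the folder-structure idea concrete, the following is a minimal sketch in plain Python of what deriving columns from a path format means conceptually. It is only an illustration and not how the mltable engine is implemented; the helper names `partition_format_to_regex` and `extract_partition_values` are hypothetical, and the sample path is made up to resemble the layout used in the quickstart of this article.

```python
import re


def partition_format_to_regex(partition_format: str) -> re.Pattern:
    """Turn a format such as '/puYear={year}/puMonth={month}' into a regex
    with one named group per {placeholder}. Hypothetical helper."""
    # re.split with a capture group alternates literal text and placeholder names
    parts = re.split(r"\{(\w+)\}", partition_format)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)        # literal text between placeholders
        else:
            pattern += f"(?P<{part}>[^/]+)"   # one path segment per placeholder
    return re.compile(pattern)


def extract_partition_values(path: str, partition_format: str) -> dict:
    """Return the placeholder values found in the path, or {} if none match."""
    match = partition_format_to_regex(partition_format).search(path)
    return match.groupdict() if match else {}


# illustrative path only
example_path = "/green/puYear=2017/puMonth=3/part-XXX.snappy.parquet"
values = extract_partition_values(example_path, "/puYear={year}/puMonth={month}")
print(values)
```

In this sketch the extracted values are plain strings; in an MLTable, the equivalent step adds them to the table as new columns.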

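Azure Machine Learning tables are useful in these scenarios:

You need to glob over storage locations.
You need to create a table using data from different storage locations (for example, different blob containers).
The path contains relevant information that you want to capture in your data (for example, date and time).
The data schema changes frequently.
You want easy reproducibility of your data loading steps.
You only need a subset of large data.
Your data contains storage locations that you want to stream into your Python session. For example, you want to stream path in the following JSON lines structure: [{"path": "abfss://fs@account.dfs.core.windows.net/my-images/cats/001.jpg", "label":"cat"}].
You want to train ML models using Azure Machine Learning AutoML.

Tip

For tabular data, Azure Machine Learning doesn't require the use of Azure Machine Learning tables (mltable). You can use the Azure Machine Learning file (uri_file) and folder (uri_folder) types, and your own parsing logic loads the data into a Pandas or Spark data frame.

For a simple CSV file or Parquet folder, it's easier to use Azure Machine Learning files or folders instead of tables.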

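Azure Machine Learning tables quickstart

In this quickstart, you create a table (mltable) of the NYC Green Taxi data from Azure Open Datasets. The data is in parquet format and covers the years 2008-2021. On a publicly accessible blob storage account, the data files have this folder structure: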

/
└── green
    ├── puYear=2008
    │   ├── puMonth=1
    │   │   ├── _committed_2983805876188002631
    │   │   └── part-XXX.snappy.parquet
    │   ├── ...
    │   └── puMonth=12
    │       ├── _committed_2983805876188002631
    │       └── part-XXX.snappy.parquet
    ├── ...
    └── puYear=2021
        ├── puMonth=1
        │   ├── _committed_2983805876188002631
        │   └── part-XXX.snappy.parquet
        ├── ...
        └── puMonth=12
            ├── _committed_2983805876188002631
            └── part-XXX.snappy.parquet

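With this data, you need to load the following into a Pandas data frame:

Only the parquet files for the years 2015-19
A random sample of the data
Only rows with a trip distance greater than 0
The columns relevant for machine learning
New columns, year and month, using the path information (puYear=X/puMonth=Y)

Pandas code can handle this. However, achieving reproducibility becomes difficult because you must either:

Share code, which means that if the schema changes (for example, a column name changes), all users must update their code
Write an ETL pipeline, which has heavy overhead

Azure Machine Learning tables provide a lightweight mechanism to serialize (save) the data loading steps in an MLTable file. You and your team members can then reproduce the Pandas data frame. If the schema changes, you only update the MLTable file, instead of updating the many places that contain Python data loading code.

Clone the quickstart notebook or create a new notebook/script

If you use an Azure Machine Learning compute instance, create a new notebook. If you use an IDE, create a new Python script.

Additionally, the quickstart notebook is available in the Azure Machine Learning examples GitHub repository. Use this code to clone the repository and access the notebook: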

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/sdk/python/using-mltable/quickstart

Install the `mltable` Python SDK

To load the NYC Green Taxi data into an Azure Machine Learning table, you must have the mltable Python SDK and pandas installed in your Python environment. Use this command:

pip install -U mltable azureml-dataprep[pandas]

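Author an MLTable file

Use the mltable Python SDK to create an MLTable file that documents the data loading blueprint. To do this, copy and paste the following code into your notebook or script, and then run it: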

import mltable

# glob the parquet file paths for years 2015-19, all months.
paths = [
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2015/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2016/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2017/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2018/puMonth=*/*.parquet"
    },
    {
        "pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2019/puMonth=*/*.parquet"
    },
]

# create a table from the parquet paths
tbl = mltable.from_parquet_files(paths)

# take a random sample
tbl = tbl.take_random_sample(probability=0.001, seed=735)

# filter trips with a distance > 0
tbl = tbl.filter("col('tripDistance') > 0")

# Drop columns
tbl = tbl.drop_columns(["puLocationId", "doLocationId", "storeAndFwdFlag"])

# Create two new columns - year and month - where the values are taken from the path
tbl = tbl.extract_columns_from_partition_format("/puYear={year}/puMonth={month}")

# print the first 5 records of the table as a check
tbl.show(5)
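The five pattern entries above differ only in the year. As a minimal sketch, the same list could be generated with a list comprehension; the constant name `BASE_URI` is an assumption introduced here for illustration, and the resulting `paths` list has the same shape as the hand-written one.

```python
# Base location of the green taxi data, copied from the patterns above.
BASE_URI = "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green"

# One glob pattern per year for 2015-19, covering all months of each year.
paths = [
    {"pattern": f"{BASE_URI}/puYear={year}/puMonth=*/*.parquet"}
    for year in range(2015, 2020)
]

for entry in paths:
    print(entry["pattern"])
```

The list can then be passed to `mltable.from_parquet_files(paths)` exactly as in the snippet above.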

Optionally, you can load the MLTable object into Pandas with:

# You can load the table into a pandas dataframe
# NOTE: The data is in East US region and the data is large, so this will take several minutes (~7mins)
# to load if you are in a different region.

# df = tbl.to_pandas_dataframe()

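Save the data loading steps

Next, save all your data loading steps into an MLTable file. Saving the steps in an MLTable file lets you reproduce your Pandas data frame at a later point in time, without redefining the code each time.

You can save the MLTable YAML file to a cloud storage resource, or you can save it to a local path.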

# save the data loading steps in an MLTable file to a cloud storage resource
# NOTE: the tbl object was defined in the previous snippet.
tbl.save(path="azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/nyc_taxi", colocated=True, show_progress=True, overwrite=True)

# save the data loading steps in an MLTable file to a local resource
# NOTE: the tbl object was defined in the previous snippet.
# The later snippets load the table from ./nyc_taxi, so save it under that name.
tbl.save("./nyc_taxi")

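Important

If colocated == True, we copy the data to the same folder as the MLTable YAML file if they aren't currently colocated, and we use relative paths in the MLTable YAML.
If colocated == False, we don't move the data. We use absolute paths for cloud data and relative paths for local data.
We don't support this parameter combination: the data is stored in a local resource, colocated == False, and path targets a cloud directory. Upload your local data to the cloud, and use the cloud data paths for the MLTable instead.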

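Reproduce the data loading steps

Now that you have serialized the data loading steps into a file, you can reproduce them at any point in time with the load() method. This way, you don't need to redefine your data loading steps in code, and you can more easily share the file.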

import mltable

# load the previously saved MLTable file
tbl = mltable.load("./nyc_taxi/")
tbl.show(5)

# You can load the table into a pandas dataframe
# NOTE: The data is in East US region and the data is large, so this will take several minutes (~7mins)
# to load if you are in a different region.

# load the table into pandas
# df = tbl.to_pandas_dataframe()

# print the head of the data frame
# df.head()
# print the shape and column types of the data frame
# print(f"Shape: {df.shape}")
# print(f"Columns:\n{df.dtypes}")

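Create a data asset to aid sharing and reproducibility

You might have your MLTable file currently saved on disk, which makes it hard to share with team members. When you create a data asset in Azure Machine Learning, the MLTable is uploaded to cloud storage and "bookmarked." Your team members can then access the MLTable using a friendly name. Also, the data asset is versioned.

CLI

az ml data create --name green-quickstart --version 1 --path ./nyc_taxi --type mltable

Note

The path points to the folder that contains the MLTable file.

Python

Set your subscription, resource group, and workspace: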

subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

You can create a data asset in Azure Machine Learning with this Python code:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# set VERSION variable
VERSION="1"

# connect to the AzureML workspace
# NOTE: the subscription_id, resource_group, workspace variables are set
# in the previous code snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

my_data = Data(
    path="./nyc_taxi",
    type=AssetTypes.MLTABLE,
    description="A random sample of NYC Green Taxi Data between 2015-19.",
    name="green-quickstart",
    version=VERSION,
)

ml_client.data.create_or_update(my_data)

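Note

The path points to the folder that contains the MLTable artifact.

Read the data asset in an interactive session

Now that you have your MLTable stored in the cloud, you and team members can access it with a friendly name in an interactive session (for example, a notebook):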

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
# NOTE: the subscription_id, resource_group, workspace variables are set
# in a previous code snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
# Note: The version was set in the previous snippet. If you changed the version
# number, update the VERSION variable below.
VERSION="1"
data_asset = ml_client.data.get(name="green-quickstart", version=VERSION)

# create a table
tbl = mltable.load(f"azureml:/{data_asset.id}")
tbl.show(5)

# load into pandas
# NOTE: The data is in East US region and the data is large, so this will take several minutes (~7mins) to load if you are in a different region.
df = tbl.to_pandas_dataframe()

Read the data asset in a job

If you or a team member want to access the table in a job, your Python training script would contain:

# ./src/train.py
import argparse
import mltable

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--input', help='mltable to read')
args = parser.parse_args()

# load mltable
tbl = mltable.load(args.input)

# load into pandas
df = tbl.to_pandas_dataframe()
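Before submitting a job, it can help to check the argument handling of the training script locally. The following is a minimal sketch that exercises only the argparse part of the script above with an explicit argument list; the helper name `build_parser` is hypothetical, and `./nyc_taxi` stands in for a local folder that contains an MLTable file.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the same parser that train.py uses (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", help="mltable to read")
    return parser


# Simulate `python train.py --input ./nyc_taxi` without touching sys.argv.
args = build_parser().parse_args(["--input", "./nyc_taxi"])
print(args.input)
```

In a job, the value of `--input` is supplied by the job definition rather than typed by hand, so the script itself stays unchanged.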

Your job needs a conda file that includes the Python package dependencies:

# ./conda_dependencies.yml
dependencies:
  - python=3.10
  - pip=21.2.4
  - pip:
      - mltable
      - azureml-dataprep[pandas]

You can submit the job using:

CLI

Create the following job YAML file:

# mltable-job.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

code: ./src

command: python train.py --input ${{inputs.green}}
inputs:
    green:
      type: mltable
      path: azureml:green-quickstart:1

compute: cpu-cluster

environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
  conda_file: conda_dependencies.yml

In the CLI, create the job:

az ml job create -f mltable-job.yml

Python

from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
# Note: the VERSION was set in a previous cell.
data_asset = ml_client.data.get(name="green-quickstart", version=VERSION)

job = command(
    command="python train.py --input ${{inputs.green}}",
    inputs={"green": Input(type="mltable", path=data_asset.id)},
    compute="cpu-cluster",
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
        conda_file="./job-env/conda_dependencies.yml",
    ),
    code="./src",
)

ml_client.jobs.create_or_update(job)

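Author MLTable files

To directly create the MLTable file, we recommend that you use the mltable Python SDK to author your MLTable files, as shown in the Azure Machine Learning tables quickstart, instead of a text editor. In this section, we outline the capabilities in the mltable Python SDK.

Supported file types

You can create an MLTable with a range of different file types:

File type	`MLTable` Python SDK
Delimited text (for example, CSV files)	`from_delimited_files(paths=[path])`
Parquet	`from_parquet_files(paths=[path])`
Delta Lake	`from_delta_lake(delta_table_uri=<uri_pointing_to_delta_table_directory>,timestamp_as_of='2022-08-26T00:00:00Z')`
JSON lines	`from_json_lines_files(paths=[path])`
Paths (create a table with a column of paths to stream)	`from_paths(paths=[path])`

For more information, see the MLTable reference resource.

Define paths

For delimited text, parquet, JSON lines, and paths, define a list of Python dictionaries that defines the path or paths from which to read: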

import mltable

# A List of paths to read into the table. The paths are a python dict that define if the path is
# a file, folder, or (glob) pattern.
paths = [
    {
        "file": "<supported_path>"
    }
]

tbl = mltable.from_delimited_files(paths=paths)

# alternatively
# tbl = mltable.from_parquet_files(paths=paths)
# tbl = mltable.from_json_lines_files(paths=paths)
# tbl = mltable.from_paths(paths=paths)

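MLTable supports these path types:

Location	Examples
A path on your local computer	`./home/username/data/my_data`
A path on a public http(s) server	`https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv`
A path on Azure Storage	`wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>` `abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>`
A long-form Azure Machine Learning datastore	`azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<path>`

Note

mltable handles user credential passthrough for paths on Azure Storage and Azure Machine Learning datastores. If you don't have permission to the data on the underlying storage, you can't access the data.

A note on defining paths for Delta Lake tables

Compared to the other file types, defining paths to read Delta Lake tables is different. For Delta Lake tables, the path points to a single folder (typically on ADLS gen2) that contains the "_delta_log" folder and the data files. Time travel is supported. The following code shows how to define a path for a Delta Lake table: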

import mltable

# define the cloud path containing the delta table (where the _delta_log file is stored)
delta_table = "abfss://<file_system>@<account_name>.dfs.core.windows.net/<path_to_delta_table>"

# create an MLTable. Note the timestamp_as_of parameter for time travel.
tbl = mltable.from_delta_lake(
    delta_table_uri=delta_table,
    timestamp_as_of='2022-08-26T00:00:00Z'
)

To get the latest version of Delta Lake data, you can pass the current timestamp into timestamp_as_of.

import time

import mltable

# define the relative path containing the delta table (where the _delta_log file is stored)
delta_table_path = "./working-directory/delta-sample-data"

# get the current timestamp in the required format
current_timestamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
print(current_timestamp)
tbl = mltable.from_delta_lake(delta_table_path, timestamp_as_of=current_timestamp)
df = tbl.to_pandas_dataframe()
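The timestamp passed to timestamp_as_of above is a UTC time formatted as YYYY-MM-DDTHH:MM:SSZ. As a minimal sketch, the same string can be produced with the standard datetime module, which also makes it easy to format a specific point in time for time travel; the helper name `utc_timestamp` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def utc_timestamp(moment: Optional[datetime] = None) -> str:
    """Format a timezone-aware datetime (default: now) as 'YYYY-MM-DDTHH:MM:SSZ' in UTC."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A fixed point in time, matching the literal used in the earlier snippet.
fixed = datetime(2022, 8, 26, 0, 0, 0, tzinfo=timezone.utc)
print(utc_timestamp(fixed))

# The current time, equivalent to the time.strftime call above.
print(utc_timestamp())
```

Either string can be supplied as the timestamp_as_of argument of mltable.from_delta_lake.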

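Important

Limitation: mltable doesn't support partition key extraction when reading data from Delta Lake. The mltable transformation extract_columns_from_partition_format doesn't work when you read Delta Lake data via mltable.

Important

mltable handles user credential passthrough for paths on Azure Storage and Azure Machine Learning datastores. If you don't have permission to the data on the underlying storage, you can't access the data.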

Files, folders, and globs

Azure Machine Learning tables support reading from:

Files, for example: abfss://<file_system>@<account_name>.dfs.core.windows.net/my-csv.csv
Folders, for example: abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/
Glob patterns, for example: abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/*.csv
A combination of files, folders, and globbing patterns
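Each entry in the paths list is a dictionary whose key states whether the location is a file, a folder, or a glob pattern. The following is a minimal sketch of a helper that builds such entries from plain URI strings. The helper name `to_path_entry` and its heuristics are assumptions made for illustration (a `*` means a pattern, a trailing slash means a folder, anything else is a file); they are not part of the mltable SDK.

```python
def to_path_entry(uri: str) -> dict:
    """Wrap a URI in a path dictionary keyed by 'pattern', 'folder', or 'file'.

    Hypothetical helper: the classification rules are a simple heuristic.
    """
    if "*" in uri:
        return {"pattern": uri}   # contains a wildcard, treat as a glob pattern
    if uri.endswith("/"):
        return {"folder": uri}    # trailing slash, treat as a folder
    return {"file": uri}          # otherwise treat as a single file


# Placeholder storage location, as used in the examples above.
base = "abfss://<file_system>@<account_name>.dfs.core.windows.net"

# A combination of a file, a folder, and a glob pattern in one list.
paths = [
    to_path_entry(f"{base}/my-csv.csv"),
    to_path_entry(f"{base}/my-folder/"),
    to_path_entry(f"{base}/my-folder/*.csv"),
]

for entry in paths:
    print(entry)
```

A list built this way has the same shape as the paths argument shown in the "Define paths" snippet.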

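Supported data loading transformations

Find full, up-to-date details of the supported data loading transformations in the MLTable reference documentation.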

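Examples

The code snippets in this article are based on examples in the Azure Machine Learning examples GitHub repository. To clone the repository to your development environment, use this command: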

git clone --depth 1 https://github.com/Azure/azureml-examples

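Tip

Use --depth 1 to clone only the latest commit to the repository. This reduces the time needed to complete the operation.

This folder of the cloned repository hosts the examples relevant to Azure Machine Learning tables: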

cd azureml-examples/sdk/python/using-mltable

Delimited files

First, create an MLTable from a CSV file with this code:

import mltable
from mltable import MLTableHeaders, MLTableFileEncoding, DataType

# create paths to the data files
paths = [{"file": "wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv"}]

# create an MLTable from the data files
tbl = mltable.from_delimited_files(
    paths=paths,
    delimiter=",",
    header=MLTableHeaders.all_files_same_headers,
    infer_column_types=True,
    include_path_column=False,
    encoding=MLTableFileEncoding.utf8,
)

# filter out rows with undefined ages
tbl = tbl.filter("col('Age') > 0")

# drop PassengerId
tbl = tbl.drop_columns(["PassengerId"])

# ensure survived column is treated as boolean
data_types = {
    "Survived": DataType.to_bool(
        true_values=["True", "true", "1"], false_values=["False", "false", "0"]
    )
}
tbl = tbl.convert_column_types(data_types)

# show the first 5 records
tbl.show(5)

# You can also load into pandas...
# df = tbl.to_pandas_dataframe()
# df.head(5)

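Save the data loading steps

Next, save all your data loading steps into an MLTable file. When you save your data loading steps in an MLTable file, you can reproduce your Pandas data frame at a later point in time, without redefining the code each time.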

# save the data loading steps in an MLTable file
# NOTE: the tbl object was defined in the previous snippet.
tbl.save("./titanic")

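Reproduce the data loading steps

Now that the file has the serialized data loading steps, you can reproduce them at any point in time with the load() method. This way, you don't need to redefine your data loading steps in code, and you can more easily share the file.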

import mltable

# load the previously saved MLTable file
tbl = mltable.load("./titanic/")

Create a data asset to aid sharing and reproducibility

import time
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Update with your details...
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# set the version number of the data asset to the current UTC time
VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

my_data = Data(
    path="./titanic",
    type=AssetTypes.MLTABLE,
    description="The titanic dataset.",
    name="titanic-cloud-example",
    version=VERSION,
)

ml_client.data.create_or_update(my_data)

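Now that you have your MLTable stored in the cloud, you and team members can access it with a friendly name in an interactive session (for example, a notebook):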

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
# NOTE:  subscription_id, resource_group, workspace were set in a previous snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
# Note: The version was set in the previous code cell.
data_asset = ml_client.data.get(name="titanic-cloud-example", version=VERSION)

# create a table
tbl = mltable.load(f"azureml:/{data_asset.id}")

# load into pandas
df = tbl.to_pandas_dataframe()
df.head(5)

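You can also easily access your data asset in a job.

Parquet files

The Azure Machine Learning tables quickstart explains how to read parquet files.

Paths: Create a table of image files

You can create a table that contains paths on cloud storage. This example has several dog and cat images located in cloud storage, in the following folder structure: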

/pet-images
  /cat
    0.jpeg
    1.jpeg
    ...
  /dog
    0.jpeg
    1.jpeg

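mltable can construct a table that contains the storage paths of these images and their folder names (labels), which can be used to stream the images. This code creates the MLTable: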

import mltable

# create paths to the data files
paths = [{"pattern": "wasbs://data@azuremlexampledata.blob.core.windows.net/pet-images/**/*.jpg"}]

# create the mltable
tbl = mltable.from_paths(paths)

# extract useful information from the path
tbl = tbl.extract_columns_from_partition_format("{account}/{container}/{folder}/{label}")

tbl = tbl.drop_columns(["account", "container", "folder"])

df = tbl.to_pandas_dataframe()
print(df.head())

# save the data loading steps in an MLTable file
tbl.save("./pets")

This code shows how to open the storage locations in the Pandas data frame and plot the images:

# plot images on a grid. Note this takes ~1min to execute.
import matplotlib.pyplot as plt
from PIL import Image

fig = plt.figure(figsize=(20, 20))
columns = 4
rows = 5
for i in range(1, columns*rows +1):
    with df.Path[i].open() as f:
        img = Image.open(f)
        fig.add_subplot(rows, columns, i)
        plt.imshow(img)
        plt.title(df.label[i])

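Create a data asset to aid sharing and reproducibility

You might have your mltable file currently saved on disk, which makes it hard to share with team members. When you create a data asset in Azure Machine Learning, the mltable is uploaded to cloud storage and "bookmarked." Your team members can then access the mltable with a friendly name. Also, the data asset is versioned.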

import time
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# set the version number of the data asset to the current UTC time
VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())

# connect to the AzureML workspace
# NOTE: subscription_id, resource_group, workspace were set in a previous snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

my_data = Data(
    path="./pets",
    type=AssetTypes.MLTABLE,
    description="A sample of cat and dog images",
    name="pets-mltable-example",
    version=VERSION,
)

ml_client.data.create_or_update(my_data)

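Now that the mltable is stored in the cloud, you and your team members can access it with a friendly name in an interactive session (for example, a notebook):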

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
# NOTE: subscription_id, resource_group, workspace were set in a previous snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
# Note: the variable VERSION is set in the previous code
data_asset = ml_client.data.get(name="pets-mltable-example", version=VERSION)

# the table from the data asset id
tbl = mltable.load(f"azureml:/{data_asset.id}")

# load into pandas
df = tbl.to_pandas_dataframe()
df.head()

You can also load the data into jobs.
