CLI (v2) MLTable YAML schema

Article
03/19/2024

You can find the source JSON schema at https://azuremlschemas.azureedge.net/latest/MLTable.schema.json.

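Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed to work only with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.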

How to author `MLTable` files

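This article covers only the MLTable YAML schema. For more information about MLTable, including

MLTable file authoring
MLTable artifact creation
Consumption in Pandas and Spark
End-to-end examples

visit Working with tables in Azure Machine Learning.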

YAML syntax

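Key	Type	Description	Allowed values	Default value
`$schema`	string	The YAML schema. If you use the Azure Machine Learning Visual Studio Code extension to author the YAML file, you can invoke schema and resource completions if you include `$schema` at the top of your file
`type`	const	`mltable` abstracts the schema definition for tabular data. Data consumers can more easily materialize the table into a Pandas/Dask/Spark dataframe	`mltable`	`mltable`
`paths`	array	Paths can be a `file` path, a `folder` path, or a `pattern` for paths. `pattern` supports globbing patterns that specify sets of filenames with wildcard characters (`*`, `?`, `[abc]`, `[a-z]`). Supported URI types: `azureml`, `https`, `wasbs`, `abfss`, and `adl`. Visit Core yaml syntax for more information about use of the `azureml://` URI format	`file` `folder` `pattern`
`transformations`	array	A defined transformation sequence, applied to data loaded from the defined paths. Visit Transformations for more information	`read_delimited` `read_parquet` `read_json_lines` `read_delta_lake` `take` `take_random_sample` `drop_columns` `keep_columns` `convert_column_types` `skip` `filter` `extract_columns_from_partition_format`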
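
The keys and allowed values in the table above can be turned into a small structural check. The sketch below is only an illustration: `check_mltable` is a hypothetical helper, not part of the `mltable` SDK or the published JSON schema, and it assumes the YAML file was already parsed into a Python dict. It checks nothing beyond what the table lists.

```python
# Allowed values copied from the YAML syntax table.
PATH_KEYS = {"file", "folder", "pattern"}
TRANSFORMATIONS = {
    "read_delimited", "read_parquet", "read_json_lines", "read_delta_lake",
    "take", "take_random_sample", "drop_columns", "keep_columns",
    "convert_column_types", "skip", "filter",
    "extract_columns_from_partition_format",
}


def check_mltable(doc):
    """Return a list of problems found in an already-parsed MLTable mapping.

    Hypothetical helper for illustration; it is not part of the mltable SDK.
    """
    problems = []
    # `type` defaults to `mltable`, so a missing key is accepted.
    if doc.get("type", "mltable") != "mltable":
        problems.append("type must be 'mltable'")
    for i, entry in enumerate(doc.get("paths", [])):
        # Each entry is expected to hold exactly one of file/folder/pattern.
        if len(entry) != 1 or not set(entry) <= PATH_KEYS:
            problems.append(f"paths[{i}] must have exactly one of {sorted(PATH_KEYS)}")
    for i, step in enumerate(doc.get("transformations", [])):
        # A step is a one-key mapping of name -> parameters; a bare string
        # is also accepted here (assumption made for this sketch).
        name = step if isinstance(step, str) else next(iter(step), None)
        if name not in TRANSFORMATIONS:
            problems.append(f"transformations[{i}]: unknown transformation {name!r}")
    return problems
```

An empty list means the mapping uses only keys and transformation names that appear in the table; it does not validate the parameters of each transformation.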

Transformations

Read transformations

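Read transformation	Description	Parameters
`read_delimited`	Adds a transformation step to read the delimited text files provided in `paths`	`infer_column_types`: Boolean to infer column data types. Defaults to True. Type inference requires that the current compute can access the data source. Currently, type inference only pulls the first 200 rows. `encoding`: Specifies the file encoding. Supported encodings: `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom`, and `windows1252`. Default encoding: `utf8`. `header`: One of these options: `no_header`, `from_first_file`, `all_files_different_headers`, `all_files_same_headers`. Defaults to `all_files_same_headers`. `delimiter`: The separator that splits the columns. `empty_as_string`: Specifies whether empty field values should load as empty strings. The default value (False) reads empty field values as nulls. Passing this setting as True reads empty field values as empty strings. For values converted to numeric or datetime data types, this setting has no effect, because empty values are converted to nulls. `include_path_column`: Boolean to keep path information as a column in the table. Defaults to False. This setting helps when you read multiple files and want to know the originating file for a specific record. Additionally, you can keep useful information in the file path. `support_multi_line`: By default (`support_multi_line=False`), all line breaks, including line breaks in quoted field values, are interpreted as a record break. This approach to data reading increases speed, and it offers optimization for parallel execution on multiple CPU cores. However, it might silently produce more records with misaligned field values. Set this value to `True` when the delimited files are known to contain quoted line breaks
`read_parquet`	Adds a transformation step to read the Parquet formatted files provided in `paths`	`include_path_column`: Boolean to keep the path information as a table column. Defaults to False. This setting helps when you read multiple files and want to know the originating file for a specific record. Additionally, you can keep useful information in the file path. NOTE: MLTable only supports reads of parquet files that have columns consisting of primitive types. Columns containing arrays are not supported
`read_delta_lake`	Adds a transformation step to read a Delta Lake folder provided in `paths`. You can read the data at a particular timestamp or version	`timestamp_as_of`: String. Timestamp to be specified for time-travel on the specific Delta Lake data. To read data at a specific point in time, the datetime string should have an RFC-3339/ISO-8601 format (for example: "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00", "2022-10-01T01:30:00-08:00"). `version_as_of`: Integer. Version to be specified for time-travel on the specific Delta Lake data. You must provide a value for either `timestamp_as_of` or `version_as_of`
`read_json_lines`	Adds a transformation step to read the JSON files provided in `paths`	`include_path_column`: Boolean to keep path information as an MLTable column. Defaults to False. This setting helps when you read multiple files and want to know the originating file for a specific record. Additionally, you can keep useful information in the file path. `invalid_lines`: Determines how to handle lines that have invalid JSON. Supported values: `error` and `drop`. Defaults to `error`. `encoding`: Specifies the file encoding. Supported encodings: `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom`, and `windows1252`. Defaults to `utf8`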
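
The `support_multi_line` trade-off is easier to see on a tiny input. The sketch below uses only the Python standard library to contrast the two parsing strategies. It is an illustration of the behavior described in the table, not MLTable's reader; the sample text and both function names are made up for this example.

```python
import csv
import io

# A delimited file whose second field contains a quoted line break.
TEXT = 'id,comment\n1,"first line\nsecond line"\n2,ok\n'


def split_on_every_line_break(text):
    """Treat every line break as a record break (like support_multi_line=False)."""
    return [line.split(",") for line in text.splitlines()]


def split_respecting_quotes(text):
    """Keep line breaks inside quoted fields within one record."""
    return list(csv.reader(io.StringIO(text, newline="")))


naive = split_on_every_line_break(TEXT)
quoted = split_respecting_quotes(TEXT)
print(len(naive) - 1, "data records when every line break splits a record")
print(len(quoted) - 1, "data records when quoted line breaks are respected")
```

The first strategy can be parallelized by line, but it yields an extra record whose fields no longer line up with the header. The second keeps the quoted value intact at the cost of having to track quote state across lines.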

Other transformations

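Transformation	Description	Parameters	Examples
`convert_column_types`	Adds a transformation step to convert the specified columns into their respective specified new types	`columns` An array of column names to convert. `column_type` The type into which you want to convert (`int`, `float`, `string`, `boolean`, `datetime`)	`- convert_column_types: - columns: [Age] column_type: int` Converts the Age column to integer. `- convert_column_types: - columns: date column_type: datetime: formats: - "%d/%m/%Y"` Converts the date column to datetime, reading values in the format `dd/mm/yyyy`. Read `to_datetime` for more information about datetime conversion. `- convert_column_types: - columns: [is_weekday] column_type: boolean: true_values: ['yes', 'true', '1'] false_values: ['no', 'false', '0']` Converts the is_weekday column to a boolean; yes/true/1 values in the column map to `True`, and no/false/0 values in the column map to `False`. Read `to_bool` for more information about boolean conversion
`drop_columns`	Adds a transformation step to remove specific columns from the dataset	An array of column names to drop	`- drop_columns: ["col1", "col2"]`
`keep_columns`	Adds a transformation step to keep the specified columns, and remove all others from the dataset	An array of column names to keep	`- keep_columns: ["col1", "col2"]`
`extract_columns_from_partition_format`	Adds a transformation step that uses the partition information of each path, and extracts it into columns based on the specified partition format.	The partition format to use	`- extract_columns_from_partition_format: {column_name:yyyy/MM/dd/HH/mm/ss}` Creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract the year, month, day, hour, minute, and second values for the datetime type
`filter`	Filters the data, leaving only the records that match the specified expression.	An expression as a string	`- filter: 'col("temperature") > 32 and col("location") == "UK"'` Keeps only the rows where the temperature exceeds 32 and the location is UK
`skip`	Adds a transformation step to skip the first count rows of this MLTable.	A count of the number of rows to skip	`- skip: 10` Skips the first 10 rows
`take`	Adds a transformation step to select the first count rows of this MLTable.	A count of the number of rows to take from the top of the table	`- take: 5` Takes the first five rows.
`take_random_sample`	Adds a transformation step to randomly select each row of this MLTable with `probability` chance.	`probability` The probability of selecting an individual row. Must be in the range [0,1]. `seed` Optional random seed	`- take_random_sample: probability: 0.10 seed: 123` Takes a 10 percent random sample of rows, using a random seed of 123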
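
The row-selection steps (`skip`, `take`, `take_random_sample`) can be pictured as simple operations on a stream of rows. The following sketch mirrors the descriptions in the table with standard-library Python. The functions are illustrative stand-ins, not the MLTable implementation, and Python's `random` module is an assumption: a given seed will not reproduce the rows MLTable itself selects.

```python
import itertools
import random


def skip(rows, count):
    """Drop the first `count` rows."""
    return itertools.islice(rows, count, None)


def take(rows, count):
    """Keep only the first `count` rows."""
    return itertools.islice(rows, count)


def take_random_sample(rows, probability, seed=None):
    """Keep each row independently with the given probability."""
    if not 0 <= probability <= 1:
        raise ValueError("probability must be in the range [0, 1]")
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return (row for row in rows if rng.random() < probability)


rows = list(range(100))
print(list(take(skip(rows, 10), 5)))  # rows 10 through 14
sample = list(take_random_sample(rows, probability=0.10, seed=123))
print(len(sample), "rows sampled out of", len(rows))
```

Because each row is drawn independently, a probability of 0.10 gives roughly, not exactly, 10 percent of the rows.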

Examples

Examples of MLTable use. Find more examples at:

Working with tables in Azure Machine Learning
the examples GitHub repo

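Quickstart

This quickstart reads the famous iris dataset from a public https server. To proceed, you must place the MLTable file in a folder. First, create the folder and the MLTable file with: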

mkdir ./iris
cd ./iris
touch ./MLTable

Next, place this content in the MLTable file:

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json

type: mltable
paths:
    - file: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv

transformations:
    - read_delimited:
        delimiter: ','
        header: all_files_same_headers
        include_path_column: true
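
If a shell is not convenient, the same folder and file can be produced from Python. This is a minimal sketch using only `pathlib`; `write_mltable` is a hypothetical helper name, and the file content is the quickstart YAML reproduced unchanged.

```python
from pathlib import Path

# The quickstart MLTable content, reproduced as-is.
MLTABLE_CONTENT = """\
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json

type: mltable
paths:
    - file: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv

transformations:
    - read_delimited:
        delimiter: ','
        header: all_files_same_headers
        include_path_column: true
"""


def write_mltable(folder):
    """Create `folder` if needed and write the quickstart MLTable file into it."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    mltable_file = target / "MLTable"  # the file name has no extension
    mltable_file.write_text(MLTABLE_CONTENT, encoding="utf-8")
    return mltable_file
```

Calling `write_mltable("./iris")` produces the same layout as the `mkdir` and `touch` commands followed by a manual edit.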

You can then load the table into a Pandas data frame with:

Important

You must have the `mltable` Python SDK installed. Install this SDK with:

pip install mltable

import mltable

tbl = mltable.load("./iris")
df = tbl.to_pandas_dataframe()

Ensure that the data includes a new column named Path. This column contains the https://azuremlexamples.blob.core.windows.net/datasets/iris.csv data path.

The CLI can create a data asset:

az ml data create --name iris-from-https --version 1 --type mltable --path ./iris

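The folder that contains the MLTable file is automatically uploaded to cloud storage (the default Azure Machine Learning datastore).

Tip

An Azure Machine Learning data asset is similar to a web browser bookmark (favorite). Instead of remembering the long URIs (storage paths) that point to your most frequently used data, you can create a data asset and then access that asset with a friendly name.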

Delimited text files

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
  - file: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/ # a specific file on ADLS
  # additional options
  # - folder: ./<folder> # a specific folder
  # - pattern: ./*.csv # glob all the csv files in a folder

transformations:
    - read_delimited:
        encoding: ascii
        header: all_files_same_headers
        delimiter: ","
        include_path_column: true
        empty_as_string: false
    - keep_columns: [col1, col2, col3, col4, col5, col6, col7]
    # or you can drop_columns...
    # - drop_columns: [col1, col2, col3, col4, col5, col6, col7]
    - convert_column_types:
        - columns: col1
          column_type: int
        - columns: col2
          column_type:
            datetime:
                formats:
                    - "%d/%m/%Y"
        - columns: [col1, col2, col3] 
          column_type:
            boolean:
                mismatch_as: error
                true_values: ["yes", "true", "1"]
                false_values: ["no", "false", "0"]
    - filter: 'col("col1") > 32 and col("col7") == "a_string"'
    # create a column called timestamp with the values extracted from the folder information
    - extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
    - skip: 10
    - take_random_sample:
        probability: 0.50
        seed: 1394
    # or you can take the first n records
    # - take: 200
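
The `convert_column_types` entries in the file above can be approximated in plain Python to show what each conversion does to a single value. This is a sketch under stated assumptions: `parse_datetime` and `parse_boolean` are hypothetical names, matching is exact and case-sensitive, and only the `mismatch_as: error` behavior from the example is modeled. MLTable's own conversion rules may differ.

```python
from datetime import datetime

# Values copied from the boolean conversion in the example.
TRUE_VALUES = {"yes", "true", "1"}
FALSE_VALUES = {"no", "false", "0"}


def parse_datetime(value, formats=("%d/%m/%Y",)):
    """Parse `value` with the first format in `formats` that matches."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"{value!r} matches none of {formats}")


def parse_boolean(value):
    """Map a string to True/False; anything else is an error (mismatch_as: error)."""
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"{value!r} is in neither true_values nor false_values")


print(parse_datetime("25/12/2023"))
print(parse_boolean("yes"), parse_boolean("0"))
```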

Parquet

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
  - pattern: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>/*.parquet
  
transformations:
  - read_parquet:
        include_path_column: false
  - filter: 'col("temperature") > 32 and col("location") == "UK"'
  - skip: 1000 # skip first 1000 rows
  # create a column called timestamp with the values extracted from the folder information
  - extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
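
To make the idea behind `extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}` concrete, the sketch below pulls a datetime out of a folder path with a regular expression. This article does not describe how MLTable matches the format against a path, so the matching strategy, the `extract_datetime` name, and the sample paths in the usage line are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Format token -> (datetime field, number of digits), for the tokens named in this article.
TOKENS = {
    "yyyy": ("year", 4), "MM": ("month", 2), "dd": ("day", 2),
    "HH": ("hour", 2), "mm": ("minute", 2), "ss": ("second", 2),
}
TOKEN_RE = re.compile("yyyy|MM|dd|HH|mm|ss")


def format_to_regex(partition_format):
    """Turn a format such as "yyyy/MM/dd" into a regex with named groups."""
    pieces, position = [], 0
    for match in TOKEN_RE.finditer(partition_format):
        pieces.append(re.escape(partition_format[position:match.start()]))
        field, digits = TOKENS[match.group()]
        pieces.append(f"(?P<{field}>\\d{{{digits}}})")
        position = match.end()
    pieces.append(re.escape(partition_format[position:]))
    return "".join(pieces)


def extract_datetime(path, partition_format):
    """Return the datetime encoded in `path`, or None if the format is not found.

    The format must contain at least the year, month, and day tokens.
    """
    match = re.search(format_to_regex(partition_format), path)
    if match is None:
        return None
    return datetime(**{field: int(text) for field, text in match.groupdict().items()})


print(extract_datetime("sales/2023/07/15/part-0.parquet", "yyyy/MM/dd"))
```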

Delta Lake

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
- folder: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/

# NOTE: for read_delta_lake, you are *required* to provide either
# timestamp_as_of OR version_as_of.
# timestamp should be in RFC-3339/ISO-8601 format (for example:
# "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00",
# "2022-10-01T01:30:00-08:00")
# To get the latest, set the timestamp_as_of at a future point (for example: '2999-08-26T00:00:00Z')

transformations:
 - read_delta_lake:
      timestamp_as_of: '2022-08-26T00:00:00Z'
      # alternative:
      # version_as_of: 1

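Important

Limitation: `mltable` doesn't support extraction of partition keys when it reads data from Delta Lake. The `mltable` transformation `extract_columns_from_partition_format` won't work when you read Delta Lake data via `mltable`.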
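
Before putting a value into `read_delta_lake`, it can help to check that the parameters are well formed. The sketch below is a hypothetical pre-flight check, not part of `mltable`. It reads "either `timestamp_as_of` OR `version_as_of`" as "exactly one of the two", which is an assumption, and it parses the timestamp with the standard library rather than with whatever parser MLTable uses.

```python
from datetime import datetime


def parse_timestamp_as_of(value):
    """Parse a timestamp such as "2022-10-01T00:00:00Z" or "...+08:00"."""
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def check_read_delta_lake(params):
    """Validate read_delta_lake parameters and return the parsed time-travel value."""
    has_timestamp = "timestamp_as_of" in params
    has_version = "version_as_of" in params
    if has_timestamp == has_version:
        # Assumption: supplying both is treated the same as supplying neither.
        raise ValueError("provide exactly one of timestamp_as_of or version_as_of")
    if has_timestamp:
        return parse_timestamp_as_of(params["timestamp_as_of"])
    if not isinstance(params["version_as_of"], int):
        raise ValueError("version_as_of must be an integer")
    return params["version_as_of"]


print(check_read_delta_lake({"timestamp_as_of": "2022-08-26T00:00:00Z"}))
```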

JSON

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
paths:
  - file: ./order_invalid.jsonl
transformations:
  - read_json_lines:
        encoding: utf8
        invalid_lines: drop
        include_path_column: false
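
The difference between `invalid_lines: drop` and `invalid_lines: error` can be shown with a few lines of standard-library Python. This is only a sketch of the documented behavior: `parse_json_lines` is a hypothetical function, and skipping blank lines is an assumption of this example rather than something the schema states.

```python
import json


def parse_json_lines(text, invalid_lines="error"):
    """Parse one JSON document per line, handling bad lines as requested."""
    if invalid_lines not in ("error", "drop"):
        raise ValueError("invalid_lines must be 'error' or 'drop'")
    records = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped in this sketch (assumption)
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            if invalid_lines == "error":
                raise ValueError(f"line {number} is not valid JSON") from exc
            # invalid_lines == "drop": skip the bad line and keep going
    return records


SAMPLE = '{"id": 1}\n{"id": 2\n{"id": 3}\n'  # the second line is truncated
print(parse_json_lines(SAMPLE, invalid_lines="drop"))
```

With `drop`, the truncated second line is skipped and the other two records are returned. With `error`, the same input raises on line 2.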
