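Manage inputs and outputs of components and pipelines

Article
03/23/2024

In this article, you learn:

An overview of inputs and outputs in components and pipelines
How to promote component inputs/outputs to pipeline inputs/outputs
How to define optional inputs
How to customize output paths
How to download outputs
How to register outputs as named assets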

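Overview of inputs and outputs

Azure Machine Learning pipelines support inputs and outputs at both the component level and the pipeline level.

At the component level, inputs and outputs define the interface of a component. The output of one component can be used as the input of another component in the same parent pipeline, which lets data or models pass between components. These connections form a graph that describes the data flow within the pipeline.

At the pipeline level, inputs and outputs are useful for submitting pipeline jobs with different data inputs, or with different parameters that control the training logic (for example, learning_rate). They're especially useful when the pipeline is invoked through a REST endpoint: you can assign different values to the pipeline inputs, or access the outputs of the pipeline job, through that endpoint. To learn more, see Create jobs and input data for batch endpoints.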

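Types of inputs and outputs

The following types are supported as outputs of a component or pipeline.

Data types. To learn more about data types, see Data types in Azure Machine Learning.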
- uri_file
- uri_folder
- mltable
Model types.
- mlflow_model
- custom_model

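Using a data or model output essentially means serializing the output and saving it as files in a storage location. In later steps, this storage location can be mounted, downloaded, or uploaded to the file system of the compute target, so that the next step can access the files while the job runs.

This process requires the source code of the component to serialize the desired output object, which usually lives in memory, into files. For example, you could serialize a pandas DataFrame as a CSV file. Azure Machine Learning doesn't define any standardized method for object serialization, so you're free to choose whichever method you prefer. The downstream component can then deserialize and read these files independently. Here are a few examples for reference:

In the nyc_taxi_data_regression example, the prepare component has a uri_folder type output. The component source code reads the CSV files from the input folder, processes them, and writes the processed CSV files to the output folder.
In the nyc_taxi_data_regression example, the train component has an mlflow_model type output. The component source code saves the trained model by using the mlflow.sklearn.save_model method.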
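As a minimal sketch of this hand-off, the following standard-library Python mimics an upstream step that serializes in-memory rows to a CSV file in its output folder, and a downstream step that deserializes them. The function names, the file name, and the temporary folder that stands in for the storage behind a uri_folder output are illustrative assumptions; they aren't taken from the nyc_taxi_data_regression sample.

```python
import csv
import tempfile
from pathlib import Path


def write_output(rows, output_dir):
    """Upstream step: serialize in-memory rows to a CSV file in the output folder."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / "processed.csv"  # hypothetical file name
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return target


def read_input(input_dir):
    """Downstream step: deserialize every CSV file found in the input folder."""
    rows = []
    for csv_file in sorted(Path(input_dir).glob("*.csv")):
        with csv_file.open(newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# A temporary folder stands in for the storage location behind a uri_folder output.
with tempfile.TemporaryDirectory() as folder:
    write_output([{"distance": "1.5", "fare": "7.0"}], folder)
    restored = read_input(folder)
```

CSV values come back as strings, so in this sketch the downstream step is responsible for converting them to the types it needs.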

In addition to the data and model types above, pipeline or component inputs can also be the following primitive types.

string
number
integer
boolean

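In the nyc_taxi_data_regression example, the train component has a number input named test_split_ratio.

Note

Primitive type outputs aren't supported.

Paths and modes for data inputs/outputs

For data asset inputs/outputs, you must specify a path parameter that points to the data location. The following table shows the data locations that Azure Machine Learning pipelines support, along with example path parameters: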

Location	Example	Input	Output
Path on your local computer	`./home/username/data/my_data`	✓
Path on a public HTTP(S) server	`https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv`	✓
Path on Azure Storage	`wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>` `abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>`	Not recommended, because it might require extra identity configuration to read the data.
Path on an Azure Machine Learning datastore	`azureml://datastores/<data_store_name>/paths/<path>`	✓	✓
Path to a data asset	`azureml:<my_data>:<version>`	✓	✓

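Note

For inputs/outputs on storage, we strongly recommend using an Azure Machine Learning datastore path instead of a direct Azure Storage path. Datastore paths are supported across the various job types in a pipeline.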
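The location table can be read as a simple prefix rule. The following sketch classifies a path string into one of the five location categories and records which categories the table marks as usable for outputs. The function name and category labels are illustrative assumptions, not part of the Azure Machine Learning SDK.

```python
# Categories that the table marks with a check mark in the Output column.
OUTPUT_CAPABLE = {"datastore", "data asset"}


def classify_data_path(path):
    """Return the location category from the table that a path string falls into."""
    if path.startswith("azureml://datastores/"):
        return "datastore"
    if path.startswith("azureml:"):
        return "data asset"
    if path.startswith(("wasbs://", "abfss://")):
        return "azure storage"
    if path.startswith(("http://", "https://")):
        return "public http"
    return "local"


def usable_as_output(path):
    """True when the table lists this kind of location as a supported output."""
    return classify_data_path(path) in OUTPUT_CAPABLE
```

The datastore check runs before the data asset check because both kinds of path start with azureml:.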

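For data inputs/outputs, you can choose among several modes (download, mount, or upload) to define how the data is accessed on the compute target. The following table shows the modes available for each combination of type and input/output.

Type	Input/Output	`upload`	`download`	`ro_mount`	`rw_mount`	`direct`	`eval_download`	`eval_mount`
`uri_folder`	Input		✓	✓		✓
`uri_file`	Input		✓	✓		✓
`mltable`	Input		✓	✓		✓	✓	✓
`uri_folder`	Output	✓			✓
`uri_file`	Output	✓			✓
`mltable`	Output	✓			✓	✓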

Note

In most cases, we recommend using ro_mount or rw_mount mode. To learn more about modes, see data asset modes.
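If you want to catch an unsupported combination before you submit a job, the mode table can be encoded as a lookup. This is a sketch that only transcribes the table above; the SUPPORTED_MODES dictionary and the validate_mode helper are hypothetical names, not an SDK feature.

```python
# Transcription of the mode table: (type, direction) -> supported modes.
SUPPORTED_MODES = {
    ("uri_folder", "input"): {"download", "ro_mount", "direct"},
    ("uri_file", "input"): {"download", "ro_mount", "direct"},
    ("mltable", "input"): {"download", "ro_mount", "direct", "eval_download", "eval_mount"},
    ("uri_folder", "output"): {"upload", "rw_mount"},
    ("uri_file", "output"): {"upload", "rw_mount"},
    ("mltable", "output"): {"upload", "rw_mount", "direct"},
}


def validate_mode(data_type, direction, mode):
    """Return the mode if the table allows it, otherwise raise ValueError."""
    allowed = SUPPORTED_MODES.get((data_type, direction))
    if allowed is None:
        raise ValueError(f"Unknown combination: {data_type} {direction}")
    if mode not in allowed:
        raise ValueError(
            f"{mode} is not supported for {data_type} {direction}; "
            f"choose one of {sorted(allowed)}"
        )
    return mode
```

For example, validate_mode("uri_folder", "output", "rw_mount") returns the mode unchanged, while validate_mode("uri_file", "input", "rw_mount") raises ValueError.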

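Visual representation in Azure Machine Learning studio

The following screenshots show how inputs and outputs are displayed in a pipeline job in Azure Machine Learning studio. This particular job, named nyc-taxi-data-regression, can be found in azureml-example.

On the pipeline job page in studio, the data/model type inputs and outputs of a component appear as small circles, called input/output ports, on the corresponding component. These ports represent the data flow in the pipeline.

Pipeline-level outputs are displayed as purple boxes so that they're easy to identify.

When you hover over an input/output port, its type is displayed.

Primitive type inputs aren't displayed on the graph. You can find them on the Settings tab of the pipeline job overview panel (for pipeline-level inputs) or of the component panel (for component-level inputs). The following screenshot shows the Settings tab of a pipeline job, which you can open by selecting the Job Overview link.

To check the inputs of a component, double-click the component to open the component panel.

Similarly, when you edit a pipeline in the designer, you can find the pipeline inputs and outputs in the Pipeline interface panel, and the component inputs and outputs in the component's panel (double-click the component to open it).

How to promote component inputs and outputs to the pipeline level

Promoting a component's input/output to the pipeline level lets you override that input/output when you submit a pipeline job. It's also useful if you want to trigger the pipeline through a REST endpoint.

The following examples promote component inputs/outputs to pipeline-level inputs/outputs.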

Azure CLI
Python SDK

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 1b_e2e_registered_components
description: E2E dummy train-score-eval pipeline with registered components

inputs:
  pipeline_job_training_max_epocs: 20
  pipeline_job_training_learning_rate: 1.8
  pipeline_job_learning_rate_schedule: 'time-based'

outputs: 
  pipeline_job_trained_model:
    mode: upload
  pipeline_job_scored_data:
    mode: upload
  pipeline_job_evaluation_report:
    mode: upload

settings:
 default_compute: azureml:cpu-cluster

jobs:
  train_job:
    type: command
    component: azureml:my_train@latest
    inputs:
      training_data: 
        type: uri_folder 
        path: ./data      
      max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
      learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
      learning_rate_schedule: ${{parent.inputs.pipeline_job_learning_rate_schedule}}
    outputs:
      model_output: ${{parent.outputs.pipeline_job_trained_model}}
    services:
      my_vscode:
        type: vs_code
      my_jupyter_lab:
        type: jupyter_lab
      my_tensorboard:
        type: tensor_board
        log_dir: "outputs/tblogs"
    #  my_ssh:
    #    type: ssh
    #    ssh_public_keys: <paste the entire pub key content>
    #    nodes: all # Use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node.

  score_job:
    type: command
    component: azureml:my_score@latest
    inputs:
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
      test_data: 
        type: uri_folder 
        path: ./data
    outputs:
      score_output: ${{parent.outputs.pipeline_job_scored_data}}

  evaluate_job:
    type: command
    component: azureml:my_eval@latest
    inputs:
      scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
    outputs:
      eval_output: ${{parent.outputs.pipeline_job_evaluation_report}}

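The full example can be found in train-score-eval pipeline with registered components. This pipeline promotes three inputs and three outputs to the pipeline level. Take pipeline_job_training_max_epocs as an example. It's declared under the inputs section at the root level, which makes it a pipeline-level input. Under the jobs -> train_job section, the input named max_epocs is referenced as ${{parent.inputs.pipeline_job_training_max_epocs}}, which means that the max_epocs input of train_job takes its value from the pipeline-level input pipeline_job_training_max_epocs. Similarly, you can promote pipeline outputs by using the same schema.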
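To make the binding expression concrete, here's a simplified model of how a ${{parent.inputs.<name>}} reference maps a job input to a pipeline-level value. It's only an illustration of the lookup, not the service's actual resolution logic; the resolve_job_inputs function and its regular expression are assumptions introduced for this sketch.

```python
import re

# Matches a whole-value binding such as ${{parent.inputs.some_name}}.
BINDING = re.compile(r"^\$\{\{\s*parent\.inputs\.(\w+)\s*\}\}$")


def resolve_job_inputs(job_inputs, pipeline_inputs):
    """Replace each parent.inputs binding with the pipeline-level value it names."""
    resolved = {}
    for name, value in job_inputs.items():
        match = BINDING.match(value) if isinstance(value, str) else None
        resolved[name] = pipeline_inputs[match.group(1)] if match else value
    return resolved


# Values taken from the YAML example above.
pipeline_inputs = {
    "pipeline_job_training_max_epocs": 20,
    "pipeline_job_training_learning_rate": 1.8,
    "pipeline_job_learning_rate_schedule": "time-based",
}
train_job_inputs = {
    "max_epocs": "${{parent.inputs.pipeline_job_training_max_epocs}}",
    "learning_rate": "${{parent.inputs.pipeline_job_training_learning_rate}}",
    "learning_rate_schedule": "${{parent.inputs.pipeline_job_learning_rate_schedule}}",
}
resolved = resolve_job_inputs(train_job_inputs, pipeline_inputs)
```

Because train_job only references the pipeline-level names, overriding pipeline_job_training_max_epocs at submission time changes what max_epocs receives without editing the job definition.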

# import required libraries
from azure.identity import DefaultAzureCredential

from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component

# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# define the directory that stores the input data
parent_dir = ""

# Load components
prepare_data = load_component(source=parent_dir + "./prep.yml")
transform_data = load_component(source=parent_dir + "./transform.yml")
train_model = load_component(source=parent_dir + "./train.yml")
predict_result = load_component(source=parent_dir + "./predict.yml")
score_data = load_component(source=parent_dir + "./score.yml")

# Construct pipeline. 
# Below code snippet defines nyc_taxi_data_regression pipeline.
# The pipeline takes 1 input (pipeline_job_input) and generates 6 outputs as defined in return statement.
# The pipeline outputs are promoted from the child component using schema as <step_name.outputs.output_name>.
# for example `prepare_sample_data.outputs.prep_data`.  
@pipeline()
def nyc_taxi_data_regression(pipeline_job_input):
    """NYC taxi data regression example."""
    prepare_sample_data = prepare_data(raw_data=pipeline_job_input)
    transform_sample_data = transform_data(
        clean_data=prepare_sample_data.outputs.prep_data
    )
    train_with_sample_data = train_model(
        training_data=transform_sample_data.outputs.transformed_data
    )
    predict_with_sample_data = predict_result(
        model_input=train_with_sample_data.outputs.model_output,
        test_data=train_with_sample_data.outputs.test_data,
    )
    score_with_sample_data = score_data(
        predictions=predict_with_sample_data.outputs.predictions,
        model=train_with_sample_data.outputs.model_output,
    )
    return {
        "pipeline_job_prepped_data": prepare_sample_data.outputs.prep_data,
        "pipeline_job_transformed_data": transform_sample_data.outputs.transformed_data,
        "pipeline_job_trained_model": train_with_sample_data.outputs.model_output,
        "pipeline_job_test_data": train_with_sample_data.outputs.test_data,
        "pipeline_job_predictions": predict_with_sample_data.outputs.predictions,
        "pipeline_job_score_report": score_with_sample_data.outputs.score_report,
    }
# create the pipeline job, passing a local folder as the pipeline input
pipeline_job = nyc_taxi_data_regression(
    Input(type="uri_folder", path=parent_dir + "./data/")
)
# demo how to change pipeline output settings
pipeline_job.outputs.pipeline_job_prepped_data.mode = "rw_mount"

# set pipeline level compute
pipeline_job.settings.default_compute = "cpu-cluster"
# set pipeline level datastore
pipeline_job.settings.default_datastore = "workspaceblobstore"

The end-to-end notebook example is available in the azureml-example repository.

Studio

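You can promote a component's input to a pipeline-level input on the designer authoring page. Double-click the component to go to its settings panel -> find the input you want to promote -> select the three dots on the right -> select Add to pipeline input.

Optional inputs

By default, all inputs are required, and you must assign a value (or a default value) each time you submit a pipeline job. In some cases, however, you might need optional inputs, which give you the flexibility not to assign a value when you submit a pipeline job.

Optional inputs can be useful in the following two scenarios:

If you have an optional data/model type input and don't assign a value to it when you submit the pipeline job, the pipeline contains a component that lacks a preceding data dependency. In other words, the input port isn't linked to any component or data/model node. The pipeline service then invokes this component directly, without waiting for a preceding dependency to be ready.
The following screenshot provides a clear example of the second scenario. If you set continue_on_step_failure = True for the pipeline, and a second node (node2) uses the output of the first node (node1) as an optional input, node2 still runs even if node1 fails. However, if node2 uses a required input from node1, it doesn't run when node1 fails.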
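The second scenario can be modeled in a few lines. This is a simplified sketch of the rule described above, assuming continue_on_step_failure = True. The function name and the (source node, optional flag) pair layout are illustrative and aren't part of any Azure Machine Learning API.

```python
def node_can_run(upstream_bindings, failed_nodes):
    """Decide whether a node runs when continue_on_step_failure is True.

    upstream_bindings: list of (source_node_name, is_optional) pairs, one per
    input that is bound to another node's output.
    failed_nodes: set of names of nodes that have failed.
    """
    for source, is_optional in upstream_bindings:
        # A failed upstream node only blocks this node through a required input.
        if source in failed_nodes and not is_optional:
            return False
    return True


failed = {"node1"}
optional_case = node_can_run([("node1", True)], failed)   # node2 still runs
required_case = node_can_run([("node1", False)], failed)  # node2 does not run
```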

The following example shows how to define optional inputs.

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_data_component_cli
display_name: train_data
description: An example train component
tags:
  author: azureml-sdk-team
version: 9
type: command
inputs:
  training_data: 
    type: uri_folder
  max_epocs:
    type: integer
    optional: true
  learning_rate: 
    type: number
    default: 0.01
    optional: true
  learning_rate_schedule: 
    type: string
    default: time-based
    optional: true
outputs:
  model_output:
    type: uri_folder
code: ./train_src
environment: azureml://registries/azureml/environments/sklearn-1.0/labels/latest
command: >-
  python train.py 
  --training_data ${{inputs.training_data}} 
  $[[--max_epocs ${{inputs.max_epocs}}]]
  $[[--learning_rate ${{inputs.learning_rate}}]]
  $[[--learning_rate_schedule ${{inputs.learning_rate_schedule}}]]
  --model_output ${{outputs.model_output}}

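When an input is set to optional: true, you must wrap the part of the command line that uses it in $[[ ]]. In the example above, these are the --max_epocs, --learning_rate, and --learning_rate_schedule lines.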
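The effect of $[[ ]] can be pictured as follows: a wrapped segment is kept only when the input it references has a value. This is a simplified model written for illustration, not the service's actual implementation. The render_command function and its regular expressions are assumptions, and the sketch expects the caller to have already merged any default values into inputs.

```python
import re

OPTIONAL_BLOCK = re.compile(r"\$\[\[(.*?)\]\]")
INPUT_REF = re.compile(r"\$\{\{inputs\.(\w+)\}\}")


def render_command(template, inputs):
    """Drop $[[ ]] segments whose inputs are missing, then substitute values."""

    def keep_or_drop(match):
        block = match.group(1)
        names = INPUT_REF.findall(block)
        # Keep the segment only if every input it references has a value.
        return block if all(name in inputs for name in names) else ""

    command = OPTIONAL_BLOCK.sub(keep_or_drop, template)
    # A required input that is missing raises KeyError here.
    command = INPUT_REF.sub(lambda m: str(inputs[m.group(1)]), command)
    return " ".join(command.split())


template = (
    "python train.py --training_data ${{inputs.training_data}} "
    "$[[--max_epocs ${{inputs.max_epocs}}]] "
    "$[[--learning_rate ${{inputs.learning_rate}}]]"
)
rendered = render_command(template, {"training_data": "./data", "learning_rate": 0.01})
```

Here max_epocs has no value, so its whole segment disappears and train.py never sees a dangling --max_epocs flag.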

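Note

Optional outputs aren't supported.

In the pipeline graph, optional inputs of data/model type are represented by dotted circles. Optional inputs of primitive types are located under the Settings tab. Unlike required inputs, optional inputs don't have an asterisk next to them, which indicates that they aren't mandatory.

How to customize output paths

By default, the output of a component is stored in azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}}. {default_datastore} is the default datastore that you set for the pipeline; if it isn't set, it's the workspace blob storage. {name} is the job name, which is resolved when the job runs. {output_name} is the output name that you defined in the component YAML.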
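As a quick illustration of how the three placeholders combine, the following sketch fills in the default path template. The resolve_output_path helper and the sample job and output names are hypothetical; the datastore name workspaceblobstore is used here only as an example value.

```python
import re

DEFAULT_TEMPLATE = (
    "azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}}"
)


def resolve_output_path(values, template=DEFAULT_TEMPLATE):
    """Substitute each ${{placeholder}} in the template with its value."""
    return re.sub(r"\$\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)


example_path = resolve_output_path(
    {
        "default_datastore": "workspaceblobstore",  # example datastore name
        "name": "my_job_name",                      # hypothetical job name
        "output_name": "model_output",
    }
)
```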

You can also customize where an output is stored by defining the path of the output. The following are examples:

Azure CLI
Python SDK

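pipeline.yml defines a pipeline with three pipeline-level outputs. The full YAML can be found in the train-score-eval pipeline with registered components example. You can use the following command to set a custom output path for the pipeline_job_trained_model output.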

# define the custom output path using datastore uri
# add relative path to your blob container after "azureml://datastores/<datastore_name>/paths"
output_path="azureml://datastores/{datastore_name}/paths/{relative_path_of_container}"  

# create job and define path using --set outputs.<output_name>.path
az ml job create -f ./pipeline.yml --set outputs.pipeline_job_trained_model.path=$output_path

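from azure.ai.ml import Input, Output
from azure.ai.ml.dsl import pipeline

# train_model, score_data, and eval_model are components defined elsewhere in the notebook (not shown here)
cluster_name = "cpu-cluster"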
custom_path = "azureml://datastores/workspaceblobstore/paths/custom_path/${{name}}/"

# define a pipeline with component
@pipeline(default_compute=cluster_name)
def pipeline_with_python_function_components(input_data, test_data, learning_rate):
    """E2E dummy train-score-eval pipeline with components defined via python function components"""

    # Call component obj as function: apply given inputs & parameters to create a node in pipeline
    train_with_sample_data = train_model(
        training_data=input_data, max_epochs=5, learning_rate=learning_rate
    )
    score_with_sample_data = score_data(
        model_input=train_with_sample_data.outputs.model_output,
        test_data=test_data,
        model_file=train_with_sample_data.outputs.output,
    )
    # example how to change path of output on step level,
    # please note if the output is promoted to pipeline level you need to change path in pipeline job level
    score_with_sample_data.outputs.score_output = Output(
        type="uri_folder", mode="rw_mount", path=custom_path
    )
    eval_with_sample_data = eval_model(
        scoring_result=score_with_sample_data.outputs.score_output,
        scoring_file=score_with_sample_data.outputs.output,
    )

    # Return: pipeline outputs
    return {
        "eval_output": eval_with_sample_data.outputs.eval_output,
        "model_output": train_with_sample_data.outputs.model_output,
    }


pipeline_job = pipeline_with_python_function_components(
    input_data=Input(
        path="wasbs://demo@dprepdata.blob.core.windows.net/Titanic.csv", type="uri_file"
    ),
    test_data=Input(
        path="wasbs://demo@dprepdata.blob.core.windows.net/Titanic.csv", type="uri_file"
    ),
    learning_rate=0.1,
)
# example how to change path of output on pipeline level
pipeline_job.outputs.model_output = Output(
    type="uri_folder", mode="rw_mount", path=custom_path
)

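The end-to-end notebook example can be found in the Build pipeline with command_component decorated python functions notebook.

How to download outputs

You can download the output of a component or a pipeline by following the examples below.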

# Download all the outputs of the job
az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

# Download specific output
az ml job download --output-name <OUTPUT_PORT_NAME> -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

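Before you dive into the code, you need a way to reference your workspace. You create ml_client as a handle to the workspace. To initialize ml_client, see Create handle to workspace.

# job, tmp_path, and output_port_name are placeholders that you define yourself
# Download all the outputs of the job
output = ml_client.jobs.download(name=job.name, download_path=tmp_path, all=True)

# Download specific output
output = ml_client.jobs.download(name=job.name, download_path=tmp_path, output_name=output_port_name)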

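Download outputs of a child job

When you need to download the outputs of a child job (a component output that isn't promoted to the pipeline level), first list all the child job entities of the pipeline job, and then use similar code to download the outputs.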

Azure CLI
Python SDK

# List all child jobs in the job and print job details in table format
az ml job list --parent-job-name <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID> -o table

# Select the needed child job name from the list above to download its output
az ml job download --all -n <CHILD_JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

# List all child jobs in the job
child_jobs = ml_client.jobs.list(parent_job_name=job.name)
# Traverse and download all the outputs of child job
for child_job in child_jobs:
    ml_client.jobs.download(name=child_job.name, all=True)

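How to register outputs as named assets

You can register the output of a component or pipeline as a named asset by assigning a name and version to the output. The registered asset can be listed in your workspace through the studio UI, the CLI, or the SDK, and can also be referenced in your future jobs.

Register pipeline output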

Azure CLI
Python SDK

display_name: register_pipeline_output
type: pipeline
jobs:
  node:
    type: command
    inputs:
      component_in_path:
        type: uri_file
        path: https://dprepdata.blob.core.windows.net/demo/Titanic.csv
    component: ../components/helloworld_component.yml
    outputs:
      component_out_path: ${{parent.outputs.component_out_path}}
outputs:
  component_out_path:
    type: mltable
    name: pipeline_output  # Define name and version to register pipeline output
    version: '1'
settings:
  default_compute: azureml:cpu-cluster

from azure.ai.ml import Input, load_component
from azure.ai.ml.dsl import pipeline

# Load component functions
components_dir = "./components/"
helloworld_component = load_component(source=f"{components_dir}/helloworld_component.yml")

@pipeline()
def register_pipeline_output():
  # Call component obj as function: apply given inputs & parameters to create a node in pipeline
  node = helloworld_component(component_in_path=Input(
    type='uri_file', path='https://dprepdata.blob.core.windows.net/demo/Titanic.csv'))

  return {
      'component_out_path': node.outputs.component_out_path
  }

# Use a separate name so the `pipeline` decorator isn't shadowed
pipeline_job = register_pipeline_output()
pipeline_job.settings.default_compute = "azureml:cpu-cluster"
# Define name and version to register pipeline output
pipeline_job.outputs.component_out_path.name = 'pipeline_output'
pipeline_job.outputs.component_out_path.version = '1'

Register a child job's output

Azure CLI
Python SDK

display_name: register_node_output
type: pipeline
jobs:
  node:
    type: command
    component: ../components/helloworld_component.yml
    inputs:
      component_in_path:
        type: uri_file
        path: 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
    outputs:
      component_out_path:
        type: uri_folder
        name: 'node_output'  # Define name and version to register a child job's output
        version: '1'
settings:
  default_compute: azureml:cpu-cluster

from azure.ai.ml import Input, load_component
from azure.ai.ml.dsl import pipeline

# Load component functions
components_dir = "./components/"
helloworld_component = load_component(source=f"{components_dir}/helloworld_component.yml")

@pipeline()
def register_node_output():
  # Call component obj as function: apply given inputs & parameters to create a node in pipeline
  node = helloworld_component(component_in_path=Input(
    type='uri_file', path='https://dprepdata.blob.core.windows.net/demo/Titanic.csv'))

  # Define name and version to register node output
  node.outputs.component_out_path.name = 'node_output'
  node.outputs.component_out_path.version = '1'

# Use a separate name so the `pipeline` decorator isn't shadowed
pipeline_job = register_node_output()
pipeline_job.settings.default_compute = "azureml:cpu-cluster"
