Model training on serverless compute

11/16/2023

Applies to: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

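You no longer need to create and manage compute to train your model in a scalable way. Instead, you can submit your job to a new compute target type called serverless compute. Serverless compute is the easiest way to run training jobs on Azure Machine Learning. It is fully managed, on-demand compute: Azure Machine Learning creates, scales, and manages the compute for you. By training models on serverless compute, machine learning professionals can focus on their expertise in building machine learning models without having to learn about compute infrastructure or how to set it up.

Machine learning professionals can specify the resources the job needs. Azure Machine Learning manages the compute infrastructure and provides managed network isolation, reducing the burden on you.

Enterprises can also reduce costs by specifying optimal resources for each job. IT admins can still apply control by specifying core quota at the subscription and workspace levels and by applying Azure policies.

Serverless compute can be used to fine-tune models in the model catalog, such as LLAMA 2. It can run all types of jobs from Azure Machine Learning studio, the SDK, and the CLI. It can also be used for building environment images and for responsible AI dashboard scenarios. Serverless jobs consume the same quota as Azure Machine Learning compute quota. You can choose the standard (dedicated) tier or spot (low-priority) VMs. Managed identity and user identity are supported for serverless jobs. The billing model is the same as for Azure Machine Learning compute.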

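Advantages of serverless compute

- Azure Machine Learning manages creating, setting up, scaling, deleting, and patching the compute infrastructure, which reduces management overhead.
- You don't need to learn about compute, the various compute types, and their related properties.
- You don't need to repeatedly create a cluster for each VM size you need, using the same settings, and replicate it for each workspace.
- You can optimize costs by specifying the exact resources each job needs at runtime in terms of instance type (VM size) and instance count. You can monitor the utilization metrics of the job to optimize the resources it needs.
- Fewer steps are involved in running a job.
- To further simplify job submission, you can skip the resources altogether. Azure Machine Learning defaults the instance count and chooses an instance type (VM size) based on factors such as quota, cost, performance, and disk size.
- In some cases, wait times before jobs start executing are shorter.
- User identity and workspace user-assigned managed identity are supported for job submission.
- With managed network isolation, you can streamline and automate your network isolation configuration. Customer virtual networks are also supported.
- Admin control through quota and Azure policies.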

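How to use serverless compute

- You can fine-tune foundation models such as LLAMA 2 using notebooks, as shown below:
  - Fine-tune LLAMA 2
  - Fine-tune LLAMA 2 using multiple nodes
- When you create your own compute cluster, you use its name in the command job, such as compute="cpu-cluster". With serverless, you can skip creating a compute cluster and omit the compute parameter to use serverless compute instead. When compute isn't specified for a job, the job runs on serverless compute. Omit the compute name in your CLI or SDK jobs to use serverless compute for the following job types, and optionally specify the resources the job needs in terms of instance count and instance type:
  - Command jobs, including interactive jobs and distributed training
  - AutoML jobs
  - Sweep jobs
  - Parallel jobs
- For pipeline jobs through the CLI, use default_compute: azureml:serverless for the pipeline-level default compute. For pipeline jobs through the SDK, use default_compute="serverless". See "Pipeline job" for an example.
- When you submit a training job in studio (preview), select Serverless as the compute type.
- When you use Azure Machine Learning designer, select Serverless as the default compute.
- You can use serverless compute for the responsible AI dashboard:
  - AutoML image classification scenario with RAI dashboard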
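
The rules in this list for the CLI and SDK can be summarized as a small lookup. The following is a minimal sketch, not part of the azure-ai-ml SDK: the helper name `serverless_compute_fields` and its `job_type` and `interface` arguments are hypothetical and exist only for illustration. It returns the compute-related fields you would set so that a job runs on serverless compute.

```python
from typing import Dict

# Job types that run on serverless compute when no compute name is given.
SERVERLESS_BY_OMISSION = {"command", "automl", "sweep", "parallel"}


def serverless_compute_fields(job_type: str, interface: str) -> Dict[str, str]:
    """Return the compute-related fields that select serverless compute.

    job_type: "command", "automl", "sweep", "parallel", or "pipeline".
    interface: "cli" (YAML job file) or "sdk" (Python).
    This helper is hypothetical; it only encodes the rules described in the text.
    """
    if interface not in ("cli", "sdk"):
        raise ValueError(f"unknown interface: {interface}")
    if job_type in SERVERLESS_BY_OMISSION:
        # Leaving out `compute` entirely is what selects serverless compute.
        return {}
    if job_type == "pipeline":
        value = "azureml:serverless" if interface == "cli" else "serverless"
        return {"default_compute": value}
    raise ValueError(f"unknown job type: {job_type}")


print(serverless_compute_fields("command", "sdk"))
print(serverless_compute_fields("pipeline", "cli"))
print(serverless_compute_fields("pipeline", "sdk"))
```

The empty result for command, AutoML, sweep, and parallel jobs is the point of the sketch: for those job types there is nothing to set, only a compute name to leave out.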

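Performance considerations

Serverless compute can help speed up your training in the following ways:

Insufficient quota: When you create your own compute cluster, you're responsible for working out which VM size and node count to create. If you don't have sufficient quota for the cluster when your job runs, the job fails. Serverless compute uses information about your quota to select an appropriate VM size by default.

Scale-down optimization: When a compute cluster is scaling down, a new job has to wait for the scale-down to happen and then for the cluster to scale up again before the job can run. With serverless compute, you don't have to wait for scale-down, and your job can start running on another cluster/node (assuming you have quota).

Cluster-busy optimization: When a job is running on a compute cluster and another job is submitted, your job is queued behind the currently running job. With serverless compute, your job can start running on another node or cluster (assuming you have quota).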

Quota

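When you submit a job, you still need sufficient Azure Machine Learning compute quota to proceed (both workspace-level and subscription-level quota). The default VM size for serverless jobs is selected based on this quota. If you specify your own VM size/family:

- If you have some quota for the VM size/family but not enough for the number of instances, you see an error. The error recommends decreasing the number of instances to a valid number based on your quota limit, requesting a quota increase for this VM family, or changing the VM size.
- If you don't have quota for the VM size you specified, you see an error. The error recommends selecting a different VM size for which you do have quota, or requesting quota for this VM family.
- If you have sufficient quota for the VM family to run the serverless job, but other jobs are currently using that quota, you get a message that your job must wait in a queue until quota becomes available.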
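
The three cases above can be read as a simple decision procedure. The sketch below is only an illustrative model of that reasoning, not how the service is implemented. The function name `quota_outcome`, its parameters, the outcome labels, and the choice to count quota in cores per VM family are all assumptions made for the example.

```python
def quota_outcome(
    instance_count: int,
    cores_per_instance: int,
    family_quota_cores: int,
    family_cores_in_use: int,
) -> str:
    """Classify a serverless job request against a VM family quota.

    Hypothetical model: quota is counted in cores for one VM family.
    Returns one of "no_quota", "insufficient_quota", "queued", or "run".
    """
    requested_cores = instance_count * cores_per_instance

    # Not even one instance of this VM size fits in the quota: error.
    if family_quota_cores < cores_per_instance:
        return "no_quota"

    # Some quota exists, but not enough for the requested instance count: error.
    if requested_cores > family_quota_cores:
        return "insufficient_quota"

    # The quota is large enough, but other jobs are using part of it: wait.
    if requested_cores > family_quota_cores - family_cores_in_use:
        return "queued"

    return "run"


print(quota_outcome(4, 24, family_quota_cores=0, family_cores_in_use=0))
print(quota_outcome(4, 24, family_quota_cores=48, family_cores_in_use=0))
print(quota_outcome(2, 24, family_quota_cores=48, family_cores_in_use=24))
print(quota_outcome(2, 24, family_quota_cores=48, family_cores_in_use=0))
```

In this model the first two outcomes correspond to the errors described above, and "queued" corresponds to the message that the job must wait until quota becomes available.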

When you view your usage and quota in the Azure portal, you see the name "Serverless", which shows all the quota consumed by serverless jobs.

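Identity support and credential passthrough

User credential passthrough: Serverless compute fully supports user credential passthrough. The user token of the user who submits the job is used for storage access. These credentials come from Microsoft Entra ID.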

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient     # Handle to the workspace
from azure.identity import DefaultAzureCredential     # Authentication package
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import UserIdentityConfiguration 

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    identity=UserIdentityConfiguration(),
)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
identity:
  type: user_identity

User-assigned managed identity: When your workspace is configured with a user-assigned managed identity, you can use that identity with the serverless job for storage access.

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient     # Handle to the workspace
from azure.identity import DefaultAzureCredential    # Authentication package
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import ManagedIdentityConfiguration

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    identity=ManagedIdentityConfiguration(),
)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
identity:
  type: managed

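For information on attaching a user-assigned managed identity, see "Attach user-assigned managed identity".

Configure properties for command jobs

If no compute target is specified for command, sweep, and AutoML jobs, the compute defaults to serverless compute. For example, for this command job: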

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient # Handle to the workspace
from azure.identity import DefaultAzureCredential # Authentication package

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest

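The compute defaults to serverless compute with:

- A single node for this job. The default number of nodes is based on the type of job. See the following sections for other job types.
- A CPU virtual machine, determined based on quota, performance, cost, and disk size.
- Dedicated virtual machines
- The workspace location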

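You can override these defaults. To specify the VM type or the number of nodes for serverless compute, add resources to your job:

- Use instance_type to choose a specific VM. Use this parameter if you want a specific CPU/GPU VM size.

- Use instance_count to specify the number of nodes.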

Python SDK
Azure CLI

from azure.ai.ml import command 
from azure.ai.ml import MLClient # Handle to the workspace
from azure.identity import DefaultAzureCredential # Authentication package
from azure.ai.ml.entities import ResourceConfiguration 

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    resources=ResourceConfiguration(instance_type="Standard_NC24", instance_count=4),
)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
resources:
  instance_count: 4
  instance_type: Standard_NC24

To change the job tier, use queue_settings to choose between dedicated VMs (job_tier: Standard) and low-priority VMs (job_tier: Spot).

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient    # Handle to the workspace
from azure.identity import DefaultAzureCredential    # Authentication package
credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    queue_settings={
      "job_tier": "spot"  
    }
)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
component: ./train.yml 
queue_settings:
   job_tier: Standard #Possible Values are Standard (dedicated), Spot (low priority). Default is Standard.
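
As a rough illustration of how the defaults and overrides in this section fit together, the sketch below models them in plain Python. `ServerlessSettings` and `job_fields` are hypothetical helpers, not part of the azure-ai-ml SDK. The field names mirror the `resources` and `queue_settings` keys used in the examples above, and the default values reflect the defaults listed for a command job.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ServerlessSettings:
    """Hypothetical model of the per-job settings for a command job."""

    instance_count: int = 1              # single node by default
    instance_type: Optional[str] = None  # None: the service picks a CPU VM size
    job_tier: str = "Standard"           # dedicated VMs; "Spot" is low priority


def job_fields(settings: ServerlessSettings) -> Dict[str, Any]:
    """Return only the job fields needed to override the defaults."""
    defaults = ServerlessSettings()
    fields: Dict[str, Any] = {}

    resources: Dict[str, Any] = {}
    if settings.instance_type != defaults.instance_type:
        resources["instance_type"] = settings.instance_type
    if settings.instance_count != defaults.instance_count:
        resources["instance_count"] = settings.instance_count
    if resources:
        fields["resources"] = resources

    if settings.job_tier != defaults.job_tier:
        fields["queue_settings"] = {"job_tier": settings.job_tier}
    return fields


# Override all three defaults: GPU VM size, four nodes, low-priority tier.
gpu_spot = replace(
    ServerlessSettings(),
    instance_type="Standard_NC24",
    instance_count=4,
    job_tier="Spot",
)

print(job_fields(ServerlessSettings()))
print(job_fields(gpu_spot))
```

A job that keeps every default needs none of these fields, which matches the minimal command job shown earlier. Each override adds only the corresponding key.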

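Example for all fields with command jobs

Here's an example with all fields specified, including the identity the job should use. You don't need to specify virtual network settings, because workspace-level managed network isolation is used automatically.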

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient      # Handle to the workspace
from azure.identity import DefaultAzureCredential     # Authentication package
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import UserIdentityConfiguration 

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    identity=UserIdentityConfiguration(),
    queue_settings={
      "job_tier": "Standard"  
    }
)
job.resources = ResourceConfiguration(instance_type="Standard_E4s_v3", instance_count=1)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
queue_settings:
   job_tier: Standard #Possible Values are Standard, Spot. Default is Standard.
identity:
  type: user_identity #Possible values are Managed, user_identity
resources:
  instance_count: 1
  instance_type: Standard_E4s_v3

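For examples of training with serverless compute, see -

AutoML job

For AutoML jobs, you don't need to specify compute. Resources can optionally be specified. If the instance count isn't specified, it defaults based on the max_concurrent_trials and max_nodes parameters. If you submit an AutoML image classification or NLP task without specifying an instance type, a GPU VM size is selected automatically. You can submit AutoML jobs through the CLI, the SDK, or studio. To submit AutoML jobs with serverless compute in studio, first enable the "Submit a training job in Studio (preview)" feature in the preview panel.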

Python SDK
Azure CLI

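If you want to specify the type or instance count, use the ResourceConfiguration class.

# Create the AutoML classification job with the related factory-function.
# exp_name, my_training_data_input, and max_trials are assumed to be defined earlier.
from azure.ai.ml import automl
from azure.ai.ml.automl import ClassificationModels
from azure.ai.ml.entities import ResourceConfiguration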

classification_job = automl.classification(
    experiment_name=exp_name,
    training_data=my_training_data_input,
    target_column_name="y",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)

# Limits are all optional
classification_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=20,
    max_trials=max_trials,
    # max_concurrent_trials = 4,
    # max_cores_per_trial: -1,
    enable_early_termination=True,
)

# Training properties are optional
classification_job.set_training(
    blocked_training_algorithms=[ClassificationModels.LOGISTIC_REGRESSION],
    enable_onnx_compatible_models=True,
)

# Serverless compute resources used to run the job
classification_job.resources = ResourceConfiguration(
    instance_type="Standard_E4s_v3", instance_count=6
)

If you want to specify the type or instance count, add a resources section.

$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLJob.schema.json
type: automl
experiment_name: dpv2-cli-automl-classifier-experiment
description: A Classification job using bank marketing
# Serverless compute is used to run this AutoML job. 
# Through serverless compute, Azure Machine Learning takes care of creating, scaling, deleting, patching and managing compute, along with providing managed network isolation, reducing the burden on you.

task: classification
log_verbosity: debug
primary_metric: accuracy

target_column_name: "y"

#validation_data_size: 0.20
#n_cross_validations: 5
#test_data_size: 0.1

training_data:
  path: "./training-mltable-folder"
  type: mltable
validation_data:
  path: "./validation-mltable-folder"
  type: mltable
test_data:
  path: "./test-mltable-folder"
  type: mltable

limits:
  timeout_minutes: 180
  max_trials: 40
  max_concurrent_trials: 5
  trial_timeout_minutes: 20
  enable_early_termination: true
  exit_score: 0.92

featurization:
  mode: custom
  transformer_params:
    imputer:
      - fields: ["job"]
        parameters:
          strategy: most_frequent
  blocked_transformers:
    - WordEmbedding
training:
  enable_model_explainability: true
  allowed_training_algorithms:
    - gradient_boosting
    - logistic_regression
# Resources to run this serverless job
resources:
  instance_type: Standard_E4s_v3
  instance_count: 5

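Pipeline job

For a pipeline job, specify "serverless" as your default compute type to use serverless compute.

# Construct pipeline
# train_model, score_data, eval_model, and parent_dir are assumed to be defined earlier
# (for example, components loaded from YAML).
from azure.ai.ml import Input
from azure.ai.ml.dsl import pipeline


@pipeline()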
def pipeline_with_components_from_yaml(
    training_input,
    test_input,
    training_max_epochs=20,
    training_learning_rate=1.8,
    learning_rate_schedule="time-based",
):
    """E2E dummy train-score-eval pipeline with components defined via yaml."""
    # Call component obj as function: apply given inputs & parameters to create a node in pipeline
    train_with_sample_data = train_model(
        training_data=training_input,
        max_epochs=training_max_epochs,
        learning_rate=training_learning_rate,
        learning_rate_schedule=learning_rate_schedule,
    )

    score_with_sample_data = score_data(
        model_input=train_with_sample_data.outputs.model_output, test_data=test_input
    )
    score_with_sample_data.outputs.score_output.mode = "upload"

    eval_with_sample_data = eval_model(
        scoring_result=score_with_sample_data.outputs.score_output
    )

    # Return: pipeline outputs
    return {
        "trained_model": train_with_sample_data.outputs.model_output,
        "scored_data": score_with_sample_data.outputs.score_output,
        "evaluation_report": eval_with_sample_data.outputs.eval_output,
    }


pipeline_job = pipeline_with_components_from_yaml(
    training_input=Input(type="uri_folder", path=parent_dir + "/data/"),
    test_input=Input(type="uri_folder", path=parent_dir + "/data/"),
    training_max_epochs=20,
    training_learning_rate=1.8,
    learning_rate_schedule="time-based",
)

# set pipeline to use serverless compute
pipeline_job.settings.default_compute = "serverless"

For a pipeline job, specify azureml:serverless as your default compute type to use serverless compute.

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 1b_e2e_registered_components
description: E2E dummy train-score-eval pipeline with registered components
# Serverless compute is used to run this pipeline job. 
# Through serverless compute, Azure Machine Learning takes care of creating, scaling, deleting, patching and managing compute, along with providing managed network isolation, reducing the burden on you.
inputs:
  pipeline_job_training_max_epocs: 20
  pipeline_job_training_learning_rate: 1.8
  pipeline_job_learning_rate_schedule: 'time-based'

outputs: 
  pipeline_job_trained_model:
    mode: upload
  pipeline_job_scored_data:
    mode: upload
  pipeline_job_evaluation_report:
    mode: upload

settings:
  default_compute: azureml:serverless

jobs:
  train_job:
    type: command
    component: azureml:my_train@latest
    inputs:
      training_data: 
        type: uri_folder 
        path: ./data      
      max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
      learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
      learning_rate_schedule: ${{parent.inputs.pipeline_job_learning_rate_schedule}}
    outputs:
      model_output: ${{parent.outputs.pipeline_job_trained_model}}
    services:
      my_vscode:
        type: vs_code
      my_jupyter_lab:
        type: jupyter_lab
      my_tensorboard:
        type: tensor_board
        log_dir: "outputs/tblogs"
    #  my_ssh:
    #    type: ssh
    #    ssh_public_keys: <paste the entire pub key content>
    #    nodes: all # Use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node.

  score_job:
    type: command
    component: azureml:my_score@latest
    inputs:
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
      test_data: 
        type: uri_folder 
        path: ./data
    outputs:
      score_output: ${{parent.outputs.pipeline_job_scored_data}}

  evaluate_job:
    type: command
    component: azureml:my_eval@latest
    inputs:
      scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
    outputs:
      eval_output: ${{parent.outputs.pipeline_job_evaluation_report}}

You can also set serverless compute as the default compute in Designer.

Next steps

For examples of training with serverless compute, see -
