Run a CNTK training job using the Azure Python SDK

08/15/2018

This article shows how to use the Azure Python SDK to train a sample Microsoft Cognitive Toolkit (CNTK) model with the Batch AI service.

In this example, you use the MNIST database of handwritten images to train a convolutional neural network (CNN) on a single-node GPU cluster.

Prerequisites

Azure subscription - If you don't have an Azure subscription, create a free account before you begin.
Azure Python SDK - See the installation instructions. This article requires version 2.0.0 or later of the azure-mgmt-batchai package.
Azure Storage account - See the page on how to create an Azure Storage account.
Azure Active Directory service principal credentials - See the page on how to create a service principal with the CLI.
Register the Batch AI resource provider once for your subscription, using Azure Cloud Shell or the Azure CLI. Provider registration can take up to about 15 minutes.

az provider register -n Microsoft.BatchAI

Configure credentials

Add the following code to a script file, replacing FILL-IN-HERE with the appropriate values.

# credentials used for authentication
aad_client_id = 'FILL-IN-HERE'
aad_secret = 'FILL-IN-HERE'
aad_tenant = 'FILL-IN-HERE'
subscription_id = 'FILL-IN-HERE'

# credentials used for storage
storage_account_name = 'FILL-IN-HERE'
storage_account_key = 'FILL-IN-HERE'

# specify the credentials used to log in remotely to your GPU node
admin_user_name = 'FILL-IN-HERE'
admin_user_password = 'FILL-IN-HERE'

# specify the location in which to create Batch AI resources
mylocation = 'eastus'

Note that putting credentials in source code is not good practice; it is done here only to keep the quickstart simple. Consider using environment variables or a separate configuration file instead.
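As one possible way to follow that advice, the sketch below reads the same settings from environment variables. The variable names in REQUIRED_SETTINGS and the load_settings helper are assumptions made for illustration; they are not part of the Azure SDK, and any naming scheme would work.

```python
import os

# Hypothetical mapping from the variable names used in this article to
# environment variable names. These names are illustrative assumptions.
REQUIRED_SETTINGS = {
    'aad_client_id': 'AZURE_CLIENT_ID',
    'aad_secret': 'AZURE_CLIENT_SECRET',
    'aad_tenant': 'AZURE_TENANT_ID',
    'subscription_id': 'AZURE_SUBSCRIPTION_ID',
    'storage_account_name': 'AZURE_STORAGE_ACCOUNT',
    'storage_account_key': 'AZURE_STORAGE_KEY',
    'admin_user_name': 'BATCHAI_ADMIN_USER',
    'admin_user_password': 'BATCHAI_ADMIN_PASSWORD',
}


def load_settings(environ=os.environ):
    """Return a dict of settings read from environment variables.

    Raises KeyError naming every variable that is missing or empty, so a
    misconfigured environment fails before any Azure call is made.
    """
    missing = [var for var in REQUIRED_SETTINGS.values() if not environ.get(var)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(sorted(missing)))
    return {name: environ[var] for name, var in REQUIRED_SETTINGS.items()}
```

With this in place, a script could call settings = load_settings() and use settings['aad_client_id'] and the other entries in place of the hard-coded values.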

Create a Batch AI client

The following code creates a service principal credentials object and a Batch AI client.

from azure.common.credentials import ServicePrincipalCredentials
import azure.mgmt.batchai as batchai
import azure.mgmt.batchai.models as models

creds = ServicePrincipalCredentials(
    client_id=aad_client_id, secret=aad_secret, tenant=aad_tenant)

batchai_client = batchai.BatchAIManagementClient(
    credentials=creds, subscription_id=subscription_id)

Create a resource group

Batch AI clusters and jobs are Azure resources and must be placed in an Azure resource group. The following snippet creates a resource group.

from azure.mgmt.resource import ResourceManagementClient

resource_group_name = 'myresourcegroup'
resource_management_client = ResourceManagementClient(
        credentials=creds, subscription_id=subscription_id)
resource = resource_management_client.resource_groups.create_or_update(
        resource_group_name, {'location': mylocation})

For illustration, this quickstart uses an Azure file share to host the training data and scripts for the training job.

Create a file share named batchaiquickstart.

from azure.storage.file import FileService
azure_file_share_name = 'batchaiquickstart'
service = FileService(storage_account_name, storage_account_key)
service.create_share(azure_file_share_name, fail_on_exist=False)

Create a directory named mnistcntksample in the share.

mnist_dataset_directory = 'mnistcntksample'
service.create_directory(azure_file_share_name, mnist_dataset_directory, fail_on_exist=False)

Download the sample package and unzip it into the current directory. The following code uploads the required files to the Azure file share.

for f in ['Train-28x28_cntk_text.txt', 'Test-28x28_cntk_text.txt',
          'ConvNet_MNIST.py']:
    service.create_file_from_path(
        azure_file_share_name, mnist_dataset_directory, f, f)
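Because the upload expects the three sample files in the current directory, it can help to check for them first. The following is a minimal sketch; missing_files is a hypothetical helper, not an SDK function.

```python
import os

# The files that the upload loop in this article expects to find locally.
SAMPLE_FILES = ['Train-28x28_cntk_text.txt', 'Test-28x28_cntk_text.txt',
                'ConvNet_MNIST.py']


def missing_files(names, directory='.'):
    """Return the names that do not exist as regular files in `directory`."""
    return [name for name in names
            if not os.path.isfile(os.path.join(directory, name))]
```

If missing_files(SAMPLE_FILES) returns a non-empty list, unzip the sample package into the current directory before running the upload.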

Create a Batch AI workspace

A workspace is the top-level collection for all types of Batch AI resources. You create both Batch AI clusters and experiments in a workspace.

workspace_name='myworkspace'
batchai_client.workspaces.create(resource_group_name, workspace_name, mylocation)

Create a GPU cluster

Create a Batch AI cluster. In this example, the cluster consists of a single STANDARD_NC6 VM node. This VM size has one NVIDIA K80 GPU. Mount the file share at a folder named azurefileshare. The full path of this folder on the GPU compute node is $AZ_BATCHAI_MOUNT_ROOT/azurefileshare.

cluster_name = 'mycluster'

relative_mount_point = 'azurefileshare'

parameters = models.ClusterCreateParameters(
    # VM size. Use N-series for GPU
    vm_size='STANDARD_NC6',
    # Configure the ssh users
    user_account_settings=models.UserAccountSettings(
        admin_user_name=admin_user_name,
        admin_user_password=admin_user_password),
    # Number of VMs in the cluster
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=1)
    ),
    # Configure each node in the cluster
    node_setup=models.NodeSetup(
        # Mount shared volumes to the host
        mount_volumes=models.MountVolumes(
            azure_file_shares=[
                models.AzureFileShareReference(
                    account_name=storage_account_name,
                    credentials=models.AzureStorageCredentialsInfo(
                        account_key=storage_account_key),
                    azure_file_url='https://{0}/{1}'.format(
                        service.primary_endpoint, azure_file_share_name),
                    relative_mount_path=relative_mount_point)],
        ),
    ),
)
batchai_client.clusters.create(resource_group_name, workspace_name, cluster_name,
                               parameters).result()

Get cluster status

Use the following code to monitor the cluster status.

cluster = batchai_client.clusters.get(resource_group_name, workspace_name, cluster_name)
print('Cluster state: {0} Target: {1}; Allocated: {2}; Idle: {3}; '
      'Unusable: {4}; Running: {5}; Preparing: {6}; Leaving: {7}'.format(
    cluster.allocation_state,
    cluster.scale_settings.manual.target_node_count,
    cluster.current_node_count,
    cluster.node_state_counts.idle_node_count,
    cluster.node_state_counts.unusable_node_count,
    cluster.node_state_counts.running_node_count,
    cluster.node_state_counts.preparing_node_count,
    cluster.node_state_counts.leaving_node_count))

The preceding code prints basic cluster allocation information, similar to the following example.

Cluster state: AllocationState.steady Target: 1; Allocated: 1; Idle: 0; Unusable: 0; Running: 0; Preparing: 1; Leaving: 0

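The cluster is ready when the nodes are allocated and have finished preparation (see the nodeStateCounts attribute). If something goes wrong, the errors attribute contains a description of the error.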
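Rather than re-running the status check by hand, a script could poll until the node is prepared. The sketch below is one way to do that. The wait_for_ready_nodes helper is hypothetical, and it assumes that a node counted as idle or running has finished preparation.

```python
import time


def wait_for_ready_nodes(get_cluster, target=1, interval=15, timeout=1800,
                         sleep=time.sleep):
    """Poll get_cluster() until at least `target` nodes are prepared.

    get_cluster is a zero-argument callable that returns a cluster object,
    for example:
        lambda: batchai_client.clusters.get(
            resource_group_name, workspace_name, cluster_name)
    Assumption: idle and running nodes are treated as prepared.
    """
    waited = 0
    while True:
        cluster = get_cluster()
        # Stop early if the service reported errors for the cluster.
        errors = getattr(cluster, 'errors', None)
        if errors:
            raise RuntimeError('Cluster reported errors: {0}'.format(errors))
        counts = cluster.node_state_counts
        if counts.idle_node_count + counts.running_node_count >= target:
            return cluster
        if waited >= timeout:
            raise TimeoutError('Cluster not ready after {0} seconds'.format(waited))
        sleep(interval)
        waited += interval
```

The interval and timeout defaults are arbitrary choices for this sketch; adjust them to how long node preparation takes in your environment.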

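Create an experiment and a training job

After the cluster is created, create an experiment (a logical container for a group of related jobs). Then configure and submit a training job in the experiment.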

experiment_name='myexperiment'

batchai_client.experiments.create(resource_group_name, workspace_name, experiment_name)

job_name = 'myjob'

parameters = models.JobCreateParameters(
    # The cluster this job will run on
    cluster=models.ResourceId(id=cluster.id),
    # The number of VMs in the cluster to use
    node_count=1,
    # Write job's standard output and execution log to Azure File Share
    std_out_err_path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(
        relative_mount_point),
    # Configure location of the training script and MNIST dataset
    input_directories=[models.InputDirectory(
        id='SAMPLE',
        path='$AZ_BATCHAI_MOUNT_ROOT/{0}/{1}'.format(
            relative_mount_point, mnist_dataset_directory))],
    # Specify location where generated model will be stored
    output_directories=[models.OutputDirectory(
        id='MODEL',
        path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(relative_mount_point),
        path_suffix="Models")],
    # Container configuration
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(
            image='microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0')),
    # Toolkit specific settings
    cntk_settings=models.CNTKsettings(
        python_script_file_path='$AZ_BATCHAI_INPUT_SAMPLE/ConvNet_MNIST.py',
        command_line_args='$AZ_BATCHAI_INPUT_SAMPLE $AZ_BATCHAI_OUTPUT_MODEL')
)

# Create the job
batchai_client.jobs.create(resource_group_name, workspace_name, experiment_name, job_name, parameters).result()

Monitor the job

You can check the job's state using the following code.

job = batchai_client.jobs.get(resource_group_name, workspace_name, experiment_name, job_name)

print('Job state: {0} '.format(job.execution_state))

The output is similar to: Job state: running.

The job's current execution state is in executionState.

queued: the job is waiting for cluster nodes to become available
running: the job is running
succeeded (or failed): the job has completed; executionInfo contains details about the result
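To block until the job reaches one of the completed states listed above, a simple polling loop is enough. This is a minimal sketch; wait_for_job is a hypothetical helper, and it assumes execution_state is either a plain string or an enum whose value is the state name.

```python
import time

# States from the list above that mean the job has completed.
TERMINAL_STATES = {'succeeded', 'failed'}


def wait_for_job(get_job, interval=30, timeout=3600, sleep=time.sleep):
    """Poll get_job() until the job has succeeded or failed, then return it.

    get_job is a zero-argument callable that returns a job object, for example:
        lambda: batchai_client.jobs.get(
            resource_group_name, workspace_name, experiment_name, job_name)
    """
    waited = 0
    while True:
        job = get_job()
        # Accept either an enum member (use its value) or a plain string.
        state = getattr(job.execution_state, 'value', job.execution_state)
        if state in TERMINAL_STATES:
            return job
        if waited >= timeout:
            raise TimeoutError('Job still {0} after {1} seconds'.format(state, waited))
        sleep(interval)
        waited += interval
```

After the function returns, inspect the returned job's execution state to tell success from failure.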

List stdout and stderr output

Use the following code to list the generated stdout, stderr, and log files.

files = batchai_client.jobs.list_output_files(
    resource_group_name, workspace_name, experiment_name, job_name,
    models.JobsListOutputFilesOptions(outputdirectoryid="stdouterr"))

for file in (f for f in files if f.download_url):
    print('file: {0}, download url: {1}'.format(file.name, file.download_url))

List generated model files

Use the following code to list the generated model files.

files = batchai_client.jobs.list_output_files(
    resource_group_name, workspace_name, experiment_name, job_name,
    models.JobsListOutputFilesOptions(outputdirectoryid="MODEL"))

for file in (f for f in files if f.download_url):
    print('file: {0}, download url: {1}'.format(file.name, file.download_url))
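The listings above only print the download URLs. As a sketch of one way to fetch the files, the hypothetical helper below saves each listed entry to a local directory using the Python standard library. It assumes each download URL can be fetched with a plain HTTP GET, and it skips entries without a download URL, mirroring the filter used in the listing code.

```python
import os
import shutil
from urllib.request import urlopen


def download_outputs(files, dest_dir, opener=urlopen):
    """Save every entry in `files` that has a download URL into dest_dir.

    `files` is an iterable of objects with `name` and `download_url`
    attributes, such as the result of list_output_files. Returns the list
    of local paths that were written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for f in files:
        if not f.download_url:
            continue  # no URL to fetch for this entry
        local_path = os.path.join(dest_dir, os.path.basename(f.name))
        with opener(f.download_url) as response, open(local_path, 'wb') as out:
            shutil.copyfileobj(response, out)
        saved.append(local_path)
    return saved
```

For example, download_outputs(files, 'model_output') would place the listed files in a local model_output directory; the directory name is arbitrary.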

Delete resources

Use the following code to delete the job.

batchai_client.jobs.delete(resource_group_name, workspace_name, experiment_name, job_name)

Use the following code to delete the cluster.

batchai_client.clusters.delete(resource_group_name, workspace_name, cluster_name)

Use the following code to delete all allocated resources.

resource_management_client.resource_groups.delete('myresourcegroup')

Next steps

For details on using Batch AI with different frameworks, see the training recipes.
