Azure Machine Learning Experimentation Service configuration files

When you run a script in Azure Machine Learning (Azure ML) Workbench, the execution behavior is controlled by files in the aml_config folder, which is under your project folder root. Understanding the contents of these files is important for getting the desired outcome from your execution in an optimal way.

Following are the relevant files under this folder:

  • conda_dependencies.yml
  • spark_dependencies.yml
  • compute target files
    • <compute target name>.compute
  • run configuration files
    • <run configuration name>.runconfig

Note

You typically have a compute target file and run configuration file for each compute target you create. However, you can create these files independently and have multiple run configuration files pointing to the same compute target.
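For example, two hypothetical run configuration files, docker-python.runconfig and docker-spark.runconfig, could both point to the same docker.compute target by sharing the same Target value:

# in both docker-python.runconfig and docker-spark.runconfig
Target: "docker"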

conda_dependencies.yml

This file is a conda environment file that specifies the Python runtime version and the packages that your code depends on. When Azure ML Workbench executes a script in a Docker container or on an HDInsight cluster, it creates a conda environment for your script to run in.

In this file, you specify Python packages that your script needs for execution. Azure ML Experimentation Service creates the conda environment according to your list of dependencies. Packages listed here must be reachable by the execution engine through channels such as:

  • continuum.io
  • PyPI
  • a publicly accessible endpoint (URL)
  • a local file path
  • other locations reachable by the execution engine

Note

When running on an HDInsight cluster, Azure ML Workbench creates a conda environment for your specific run. This allows different users to run in different Python environments on the same cluster.

Here is an example of a typical conda_dependencies.yml file.

name: project_environment
dependencies:
  # Python version
  - python=3.5.2

  # some conda packages
  - scikit-learn
  - cryptography

  # use pip to install some more packages
  - pip:
     # a package in PyPI
     - azure-storage

     # a package hosted in a public URL endpoint
     - https://cntk.ai/PythonWheel/CPU-Only/cntk-2.1-cp35-cp35m-win_amd64.whl

     # a wheel file available locally on disk (this only works when executing against a local Docker target)
     - C:\temp\my_private_python_pkg.whl

Azure ML Workbench reuses the same conda environment without rebuilding it as long as conda_dependencies.yml remains unchanged. It rebuilds the environment only when your dependencies change.

Note

If you target execution against the local compute context, the conda_dependencies.yml file is not used. Package dependencies for your local Azure ML Workbench Python environment must be installed manually.
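
For example, you might install the packages listed in the sample conda_dependencies.yml above by hand (a sketch; install whichever packages your own scripts actually need):

# install dependencies into the local Workbench Python environment
$ pip install scikit-learn cryptography azure-storage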

spark_dependencies.yml

This file specifies the Spark application name that is used when you submit a PySpark script, along with the Spark packages that need to be installed. You can also specify public Maven repositories and the Spark packages found in them.

Here is an example:

configuration:
  # Spark application name
  "spark.app.name": "ClassifyingIris"

repositories:
  # Maven repository hosted in Azure CDN
  - "https://mmlspark.azureedge.net/maven"

  # Maven repository hosted in spark-packages.org
  - "https://spark-packages.org/packages"

packages:
  # MMLSpark package hosted in the Azure CDN Maven
  - group: "com.microsoft.ml.spark"
    artifact: "mmlspark_2.11"
    version: "0.5"

  # spark-sklearn package hosted in the spark-packages.org Maven repository
  - group: "databricks"
    artifact: "spark-sklearn"
    version: "0.2.0"

Note

Cluster tuning parameters, such as worker size and cores, should go into the "configuration" section of the spark_dependencies.yml file.
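
For instance, standard Spark tuning properties can be added alongside the application name. This is a sketch; the values shown are illustrative rather than recommendations:

configuration:
  # Spark application name
  "spark.app.name": "ClassifyingIris"

  # standard Spark tuning properties
  "spark.executor.instances": "4"
  "spark.executor.cores": "2"
  "spark.executor.memory": "4g"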

Note

If you are executing the script in a Python environment, the spark_dependencies.yml file is ignored. It is used only when you are running against Spark (either on Docker or on an HDInsight cluster).

Run configuration

To specify a particular run configuration, you need a .compute file and a .runconfig file. These files are typically generated using CLI commands. You can also clone existing ones, rename them, and edit them.

# create a compute target pointing to a VM via SSH
$ az ml computetarget attach remotedocker -n <compute target name> -a <IP address or FQDN of VM> -u <username> -w <password>

# create a compute target pointing to an HDI cluster head node via SSH
$ az ml computetarget attach cluster -n <compute target name> -a <IP address or FQDN of HDI cluster> -u <username> -w <password> 

Each of these commands creates a pair of files based on the compute target specified. Let's say you named your compute target foo. The command then generates foo.compute and foo.runconfig in your aml_config folder.
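
Assuming a script named myscript.py (a placeholder), you can then submit a run against the new target by referencing the run configuration name:

# run myscript.py using the foo run configuration
$ az ml experiment submit -c foo myscript.py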

Note

The local and docker names for the run configuration files are arbitrary; Azure ML Workbench adds these two run configurations for your convenience when you create a blank project. You can rename the ".runconfig" files that come with the project template, or create new ones with any name you want.

<compute target name>.compute

The <compute target name>.compute file specifies connection and configuration information for the compute target. It is a list of name-value pairs. The following settings are supported; a sample file appears after the list.

type: Type of the compute environment. Supported values are:

  • local
  • remote
  • docker
  • remotedocker
  • cluster

baseDockerImage: The Docker image used to run the Python/PySpark script. The default value is microsoft/mmlspark:plus-0.7.91. We also support one other image, microsoft/mmlspark:plus-gpu-0.7.91, which gives you access to the host machine's GPU (if one is present).

address: The IP address, or FQDN (fully qualified domain name) of the virtual machine, or HDInsight cluster head-node.

username: The SSH username for accessing the virtual machine or the HDInsight head-node.

password: The encrypted password for the SSH connection.

sharedVolumes: Flag signaling that the execution engine should use the Docker shared volume feature to ship project files back and forth. Turning this flag on can speed up execution, since Docker can access the project files directly without needing to copy them. Set it to false if the Docker engine is running on Windows, since volume sharing for Docker on Windows can be flaky; set it to true if it is running on macOS or Linux.

nvidiaDocker: This flag, when set to true, tells the Azure ML Experimentation Service to use the nvidia-docker command, as opposed to the regular docker command, to launch the Docker image. The nvidia-docker engine allows the Docker container to access GPU hardware, so this setting is required if you want to run GPU execution in the Docker container. Only Linux hosts support nvidia-docker; for example, the Linux-based DSVM in Azure ships with it. nvidia-docker is currently not supported on Windows.
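
For example, to enable GPU execution in a Docker container on a Linux host, a .compute file might pair the GPU base image with the nvidiaDocker flag (a sketch showing only these two settings):

# GPU-enabled Docker image plus the nvidia-docker launcher
baseDockerImage: microsoft/mmlspark:plus-gpu-0.7.91
nvidiaDocker: true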

nativeSharedDirectory: This property specifies the base directory (for example, ~/.azureml/share/) where files can be saved in order to be shared across runs on the same compute target. If this setting is used when running on a Docker container, sharedVolumes must be set to true; otherwise, execution fails.

userManagedEnvironment: This property specifies whether the compute target environment is managed directly by the user or managed by the Experimentation Service.

pythonLocation: This property specifies the location of the Python runtime on the compute target that is used to execute the user's program.
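
Putting these settings together, here is a minimal sketch of what a foo.compute file for a remote Docker VM might look like; the address, username, and password values are placeholders, and real files are generated by the attach commands shown earlier:

type: "remotedocker"
baseDockerImage: "microsoft/mmlspark:plus-0.7.91"
address: "<IP address or FQDN of VM>"
username: "<username>"
password: "<encrypted password written by the CLI>"
sharedVolumes: true
nvidiaDocker: false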

<run configuration name>.runconfig

The <run configuration name>.runconfig file specifies the Azure ML experiment execution behavior. You can configure behavior such as whether to track run history and which compute target to use, among many other settings. The names of the run configuration files are used to populate the execution context dropdown in the Azure ML Workbench desktop application.

ArgumentVector: This section specifies the script to run as part of this execution, along with the parameters for the script. For example, if you have the following snippet in your ".runconfig" file:

 "ArgumentVector":[
  - "myscript.py"
  - 234
  - "-v" 
 ] 

"az ml experiment submit foo.runconfig" automatically runs the command with myscript.py file passing in 234 as a parameter and sets the --verbose flag.

Target: This parameter is the name of the .compute file that the .runconfig file references. It generally points to the foo.compute file, but you can edit it to point to a different compute target.

EnvironmentVariables: This section enables you to set environment variables as part of your runs. You can specify environment variables using name-value pairs in the following format:

EnvironmentVariables:
  "EXAMPLE_ENV_VAR1": "Example Value1"
  "EXAMPLE_ENV_VAR2": "Example Value2"

These environment variables can be accessed in your code. For example, this Python code prints the environment variable named "EXAMPLE_ENV_VAR1":

import os

print(os.environ.get("EXAMPLE_ENV_VAR1"))

Framework: This property specifies whether Azure ML Workbench should launch a Spark session to run the script. The default value is PySpark. Set it to Python if you are not running PySpark code; doing so helps launch the job more quickly with less overhead.

CondaDependenciesFile: This property points to the file in the aml_config folder that specifies the conda environment dependencies. If set to null, the default conda_dependencies.yml file is used.

SparkDependenciesFile: This property points to the file in the aml_config folder that specifies the Spark dependencies. It is set to null by default, in which case the default spark_dependencies.yml file is used.

PrepareEnvironment: This property, when set to true, tells the Experimentation Service to prepare the conda environment based on the conda dependencies specified as part of your initial run. This property is effective only when you execute against a Docker environment. This setting has no effect if you are running against a local environment.

TrackedRun: This flag signals the Experimentation Service whether or not to track the run in Azure ML Workbench run history infrastructure. The default value is true.

UseSampling: This property specifies whether the active sample datasets for data sources are used for the run. If set to false, data sources ingest and use the full data read from the data store. If set to true, the active samples are used. To override the active sample with a specific sample dataset, use the DataSourceSettings section.
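
Pulling the properties above together, here is a minimal sketch of what a foo.runconfig file might look like; the values are illustrative:

ArgumentVector:
  - "myscript.py"
Target: "foo"
EnvironmentVariables:
  "EXAMPLE_ENV_VAR1": "Example Value1"
Framework: "Python"
CondaDependenciesFile: null
SparkDependenciesFile: null
PrepareEnvironment: true
TrackedRun: true
UseSampling: true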

DataSourceSettings: This configuration section specifies the data source settings. In this section, you specify which existing data sample is used for a particular data source as part of the run.

For example, the following configuration specifies that the sample named "MySample" is used for the data source named "MyDataSource":

DataSourceSettings:
  MyDataSource.dsource:
    Sampling:
      Sample: MySample

DataSourceSubstitutions: Data source substitutions can be used when you want to switch from one data source to another without changing your code. For example, you can switch from a sampled-down local file to the original, larger dataset stored in Azure Blob storage by changing the data source reference. When a substitution is used, Azure ML Workbench runs your data sources and data preparation packages against the substitute data source.

The following example replaces references to "mylocal.dsource" in Azure ML data sources and data preparation packages with "myremote.dsource".

DataSourceSubstitutions:
    mylocal.dsource: myremote.dsource

With the substitution above, the following code now reads from "myremote.dsource" instead of "mylocal.dsource", without any change to the code.

df = datasource.load_datasource('mylocal.dsource')

Next steps

Learn more about Experimentation Service configuration.