CLI (v2) Spark component YAML schema

05/23/2023

APPLIES TO: Azure CLI ml extension v2 (current)

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed to work only with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

YAML syntax

Key	Type	Description	Allowed values
`$schema`	string	The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including `$schema` at the top of your file enables you to invoke schema and resource completions.
`type`	const	Required. The type of component.	`spark`
`name`	string	Required. Name of the component. Must start with a lowercase letter. Allowed characters are lowercase letters, numbers, and underscores (`_`). Maximum length is 255 characters.
`version`	string	Version of the component. If omitted, Azure Machine Learning autogenerates a version.
`display_name`	string	Display name of the component in the studio UI. Can be nonunique within the workspace.
`description`	string	Description of the component.
`tags`	object	Dictionary of tags for the component.
`code`	string	Required. The location of the folder that contains source code and scripts for the component.
`entry`	object	Required. The entry point for the component. It can define a `file`.
`entry.file`	string	The path of the entry point Python script for the component, for example `titanic.py`.
`py_files`	object	A list of `.zip`, `.egg`, or `.py` files, to be placed in the `PYTHONPATH`, for successful execution of the job with this component.
`jars`	object	A list of `.jar` files to include on the Spark driver and executor `CLASSPATH`, for successful execution of the job with this component.
`files`	object	A list of files that should be copied to the working directory of each executor, for successful execution of the job with this component.
`archives`	object	A list of archives that should be extracted into the working directory of each executor, for successful execution of the job with this component.
`conf`	object	The Spark driver and executor properties. See Attributes of the `conf` key.
`environment`	string or object	The environment to use for the component. This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. To reference an existing environment, use the `azureml:<environment_name>:<environment_version>` syntax or `azureml:<environment_name>@latest` (to reference the latest version of an environment). To define an environment inline, follow the Environment schema. Exclude the `name` and `version` properties, because inline environments don't support them.
`args`	string	The command-line arguments to pass to the component entry point Python script. These arguments may contain the paths of input data and the location to write the output, for example `"--input_data ${{inputs.<input_name>}} --output_path ${{outputs.<output_name>}}"`.
`inputs`	object	Dictionary of component inputs. The key is a name for the input within the context of the component and the value is the input value. Inputs can be referenced in the `args` using the `${{ inputs.<input_name> }}` expression.
`inputs.<input_name>`	number, integer, boolean, string or object	One of a literal value (of type number, integer, boolean, or string) or an object containing a component input data specification.
`outputs`	object	Dictionary of output configurations of the component. The key is a name for the output within the context of the component and the value is the output configuration. Outputs can be referenced in the `args` using the `${{ outputs.<output_name> }}` expression.
`outputs.<output_name>`	object	The Spark component output. Output for a Spark component can be written to either a file or a folder location by providing an object containing the component output specification.
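
As an illustration of the optional dependency keys in the table above, the following fragment sketches how `py_files`, `jars`, `files`, and `environment` might appear in a component definition. The file names (`utils.zip`, `helper.jar`, `lookup.csv`) are hypothetical placeholders rather than files from the examples repository, and the fragment assumes that they are located where the component can resolve them. The environment name and version are left as placeholders.

```yaml
# Fragment of a Spark component YAML; all dependency file names are hypothetical.
code: ./src
entry:
  file: titanic.py

py_files:
  - utils.zip        # placed in the PYTHONPATH
jars:
  - helper.jar       # included on the driver and executor CLASSPATH
files:
  - lookup.csv       # copied to the working directory of each executor

# Reference an existing versioned environment in the workspace.
# Use azureml:<environment_name>@latest to reference the latest version instead.
environment: azureml:<environment_name>:<environment_version>
```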

Attributes of the `conf` key

Key	Type	Description	Default value
`spark.driver.cores`	integer	The number of cores for the Spark driver.
`spark.driver.memory`	string	Allocated memory for the Spark driver, in gigabytes (GB), for example, `2g`.
`spark.executor.cores`	integer	The number of cores for the Spark executor.
`spark.executor.memory`	string	Allocated memory for the Spark executor, in gigabytes (GB), for example `2g`.
`spark.dynamicAllocation.enabled`	boolean	Whether executors should be dynamically allocated, as a `True` or `False` value. If this property is set to `True`, define `spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`. If this property is set to `False`, define `spark.executor.instances`.	`False`
`spark.dynamicAllocation.minExecutors`	integer	The minimum number of Spark executor instances, for dynamic allocation.
`spark.dynamicAllocation.maxExecutors`	integer	The maximum number of Spark executor instances, for dynamic allocation.
`spark.executor.instances`	integer	The number of Spark executor instances.
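
The sample component later in this article uses dynamic allocation. For contrast, the following fragment is a minimal sketch of a `conf` block with dynamic allocation turned off, in which case `spark.executor.instances` is defined instead of the minimum and maximum executor counts. The core, memory, and instance values are illustrative only, not recommendations.

```yaml
# Static allocation: a fixed number of executors (values are illustrative).
conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.dynamicAllocation.enabled: False
  spark.executor.instances: 2
```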

Component inputs

Key	Type	Description	Allowed values	Default value
`type`	string	The type of component input. Specify `uri_file` for input data that points to a single file source, or `uri_folder` for input data that points to a folder source. Learn more about data access.	`uri_file`, `uri_folder`
`mode`	string	Mode of how the data should be delivered to the compute target. The `direct` mode passes in the URL of the storage location as the component input. You have full responsibility to handle storage access credentials.	`direct`

Component outputs

Key	Type	Description	Allowed values	Default value
`type`	string	The type of component output.	`uri_file`, `uri_folder`
`mode`	string	The mode of delivery of the output file(s) to the destination storage resource.	`direct`
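
To show the input and output keys together, the following fragment sketches a component that reads a folder and writes a single file, the reverse of the sample component in the Examples section. The names `raw_data` and `summary` are hypothetical. Both are referenced from `args` with the `${{inputs.<input_name>}}` and `${{outputs.<output_name>}}` expressions described in the YAML syntax table.

```yaml
# Fragment of a Spark component YAML; input and output names are hypothetical.
inputs:
  raw_data:
    type: uri_folder   # input data that points to a folder source
    mode: direct

outputs:
  summary:
    type: uri_file     # output written to a single file location
    mode: direct

args: >-
  --raw_data ${{inputs.raw_data}}
  --summary ${{outputs.summary}}
```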

Remarks

You can use the `az ml component` commands to manage Azure Machine Learning Spark components.

Examples

Examples are available in the examples GitHub repository. Several are shown next.

YAML: A sample Spark component

# spark-job-component.yaml
$schema: https://azuremlschemas.azureedge.net/latest/sparkComponent.schema.json
name: titanic_spark_component
type: spark
version: 1
display_name: Titanic-Spark-Component
description: Spark component for Titanic data

code: ./src
entry:
  file: titanic.py

inputs:
  titanic_data:
    type: uri_file
    mode: direct

outputs:
  wrangled_data:
    type: uri_folder
    mode: direct

args: >-
  --titanic_data ${{inputs.titanic_data}}
  --wrangled_data ${{outputs.wrangled_data}}

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.dynamicAllocation.enabled: True
  spark.dynamicAllocation.minExecutors: 1
  spark.dynamicAllocation.maxExecutors: 4

YAML: A sample pipeline job with a Spark component

# attached-spark-pipeline-user-identity.yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: Titanic-Spark-CLI-Pipeline-2
description: Spark component for Titanic data in Pipeline

jobs:
  spark_job:
    type: spark
    component: ./spark-job-component.yaml
    inputs:
      titanic_data:
        type: uri_file
        path: azureml://datastores/workspaceblobstore/paths/data/titanic.csv
        mode: direct

    outputs:
      wrangled_data:
        type: uri_folder
        path: azureml://datastores/workspaceblobstore/paths/data/wrangled/
        mode: direct

    identity:
      type: user_identity

    compute: <ATTACHED_SPARK_POOL_NAME>
