PipelineOutputFileDataset Class

Represents intermediate pipeline data promoted to an Azure Machine Learning File Dataset.

Once intermediate data is promoted to an Azure Machine Learning Dataset, it is consumed as a Dataset instead of a DataReference in subsequent steps.

Create intermediate data that will be promoted to an Azure Machine Learning Dataset.

Inheritance
PipelineOutputFileDataset

Constructor

PipelineOutputFileDataset(pipeline_data)

Parameters

Name Description
pipeline_data
Required

The PipelineData that represents the intermediate output which will be promoted to a Dataset.

Methods

as_direct

Set the consumption mode of the dataset to direct.

In this mode, your script receives the ID of the dataset; call Dataset.get_by_id to retrieve it, or use run.input_datasets['{dataset_name}'] to get the Dataset directly.

as_download

Set the consumption mode of the dataset to download.

as_mount

Set the consumption mode of the dataset to mount.

parse_delimited_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the delimited file(s) pointed to by the intermediate output.

parse_parquet_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the parquet file(s) pointed to by the intermediate output.

as_direct

Set the consumption mode of the dataset to direct.

In this mode, your script receives the ID of the dataset; call Dataset.get_by_id to retrieve it, or use run.input_datasets['{dataset_name}'] to get the Dataset directly.

as_direct()

Returns

Type Description

The modified PipelineOutputDataset.
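A minimal sketch of direct-mode consumption, split across the pipeline definition and the training script. All names (`datastore`, `"raw_data"`, the script structure) are placeholders, and running this requires a configured Azure Machine Learning workspace:

```python
# --- Pipeline definition side (illustrative) ---
from azureml.pipeline.core import PipelineData

raw_output = PipelineData("raw_data", datastore=datastore)  # datastore defined elsewhere
direct_input = raw_output.as_dataset().as_direct()

# --- Training script side (illustrative) ---
# Direct mode hands the script the dataset's ID rather than mounted files;
# run.input_datasets resolves it to the Dataset object for you.
from azureml.core import Run, Dataset

run = Run.get_context()
dataset = run.input_datasets["raw_data"]
same_dataset = Dataset.get_by_id(run.experiment.workspace, dataset.id)
```

This is a configuration fragment, not a standalone program; the two halves run in different processes (pipeline submission vs. the step's script).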

as_download

Set the consumption mode of the dataset to download.

as_download(path_on_compute=None)

Parameters

Name Description
path_on_compute
str

The path on the compute to download the dataset to. Defaults to None, which means Azure Machine Learning picks a path for you.

default value: None

Returns

Type Description

The modified PipelineOutputDataset.

as_mount

Set the consumption mode of the dataset to mount.

as_mount(path_on_compute=None)

Parameters

Name Description
path_on_compute
str

The path on the compute to mount the dataset to. Defaults to None, which means Azure Machine Learning picks a path for you.

default value: None

Returns

Type Description

The modified PipelineOutputDataset.
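A hedged sketch of mounting a promoted dataset for a downstream step. `datastore`, `compute`, and the script names are assumptions for the example; this fragment only makes sense inside a full pipeline definition with a live workspace:

```python
# Illustrative only: promote an intermediate output to a File Dataset and
# mount it at a chosen path for the next step.
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

prepared = PipelineData("prepared", datastore=datastore)
mounted_input = prepared.as_dataset().as_mount(path_on_compute="/tmp/prepared")
# as_download(path_on_compute=...) has the same call shape, but copies the
# files onto the compute instead of mounting them.

train = PythonScriptStep(
    name="train",
    script_name="train.py",
    inputs=[mounted_input],
    compute_target=compute,
)
```

Choosing mount vs. download is a trade-off: mounting streams data on demand, while downloading pays an upfront copy cost but gives local-disk read speed.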

parse_delimited_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the delimited file(s) pointed to by the intermediate output.

parse_delimited_files(include_path=False, separator=',', header=PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS, partition_format=None, file_extension='', set_column_types=None, quoted_line_breaks=False)

Parameters

Name Description
include_path

Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or to keep useful information that is encoded in the file path.

default value: False
separator
str

The separator used to split columns.

default value: ,
header

Controls how column headers are promoted when reading from files. Defaults to assume that all files have the same header.

default value: PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS
partition_format
str

Specify the partition format of path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of first partition key until the end of file path. For example, given the path '../Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.

default value: None
file_extension
str

The file extension of the files to read. Only files with this extension will be read from the directory. Default value is '.csv' when the separator is ',', '.tsv' when the separator is a tab, and None otherwise. If None is passed, all files will be read regardless of their extension (or lack of extension).

default value: ''

set_column_types

A dictionary to set column data type, where key is column name and value is DataType. Columns not in the dictionary will remain of type string. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored.

default value: None
quoted_line_breaks

Whether to handle new line characters within quotes. This option can impact performance.

default value: False

Returns

Type Description

An intermediate data object that will be consumed as a tabular dataset.

Remarks

This transformation is applied only when the intermediate data is consumed as the input of a subsequent step. It has no effect on the output itself, even if this object is passed to the output.
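The call above can be sketched as follows. This is an illustrative configuration fragment requiring a live workspace; `datastore` and the column name `"score"` are assumptions, and `DataType` comes from the dataset factory module:

```python
# Illustrative sketch: promote a delimited intermediate output to a tabular
# dataset before a downstream step consumes it.
from azureml.data.dataset_factory import DataType
from azureml.pipeline.core import PipelineData

csv_output = PipelineData("scores", datastore=datastore)
tabular_input = csv_output.as_dataset().parse_delimited_files(
    separator=",",
    include_path=True,                              # keep the source path column
    set_column_types={"score": DataType.to_float()},
)
# Per the remarks above, the parse applies only where this object is used as
# a step input; the upstream step still writes plain delimited files.
```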

parse_parquet_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the parquet file(s) pointed to by the intermediate output.

parse_parquet_files(include_path=False, partition_format=None, file_extension='.parquet', set_column_types=None)

Parameters

Name Description
include_path

Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or to keep useful information that is encoded in the file path.

default value: False
partition_format
str

Specify the partition format of path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of first partition key until the end of file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.

default value: None
file_extension
str

The file extension of the files to read. Only files with this extension will be read from the directory. Default value is '.parquet'. If this is set to None, all files will be read regardless of their extension (or lack of extension).

default value: .parquet
set_column_types

A dictionary to set column data type, where key is column name and value is DataType. Columns not in the dictionary will remain of type loaded from the parquet file. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored.

default value: None

Returns

Type Description

An intermediate data object that will be consumed as a tabular dataset.

Remarks

This transformation is applied only when the intermediate data is consumed as the input of a subsequent step. It has no effect on the output itself, even if this object is passed to the output.