Dataflow Class

A Dataflow represents a series of lazily-evaluated, immutable operations on data. It is only an execution plan. No data is loaded from the source until you get data from the Dataflow using one of head, to_pandas_dataframe, get_profile or the write methods.

Inheritance
builtins.object
Dataflow

Constructor

Dataflow(engine_api: azureml.dataprep.api.engineapi.api.EngineAPI, steps: typing.List[azureml.dataprep.api.step.Step] = None, meta: typing.Dict[str, str] = None)

Remarks

Dataflows are usually created by supplying a data source. Once the data source has been provided, operations can be added by invoking the different transformation methods available on this class. The result of adding an operation to a Dataflow is always a new Dataflow.

The actual loading of the data and execution of the transformations is delayed as much as possible and will not occur until a 'pull' takes place. A pull is the action of reading data from a Dataflow, whether by asking to look at the first N records in it or by transferring the data in the Dataflow to another storage mechanism (a Pandas Dataframe, a CSV file, or a Spark Dataframe).

The operations available on the Dataflow are runtime-agnostic. This allows the transformation pipelines contained in them to be portable between a regular Python environment and Spark.
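
For example, a minimal sketch of this lazy behavior (the path and column names are illustrative):


   import azureml.dataprep as dprep

   # Building the plan does not read any data yet.
   dataflow = dprep.read_csv(path='./some/path')
   dataflow = dataflow.keep_columns(['Region', 'Revenue'])

   # A pull happens here: the first 10 records are read and transformed.
   preview = dataflow.head(10)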

Methods

add_column

Adds a new column to the dataset. The values in the new column will be the result of invoking the specified expression.

add_step
append_columns

Appends the columns from the referenced dataflows to the current one. Duplicate columns will result in failure.

append_rows

Appends the records in the specified dataflows to the current one. If the schemas of the dataflows are distinct, this will result in records with different schemas.

assert_value

Ensures that values in the specified columns satisfy the provided expression. This is useful to identify anomalies in the dataset and avoid broken pipelines by handling assertion errors in a clean way.

cache

Pulls all the records from this Dataflow and caches the result to disk.

clip

Clips values so that all values are between the lower and upper boundaries.

convert_unix_timestamp_to_datetime

Converts the specified column to DateTime values by treating the existing value as a Unix timestamp.

derive_column_by_example

Inserts a column by learning a program based on a set of source columns and provided examples. Dataprep will attempt to achieve the intended derivation inferred from provided examples.

distinct

Filters out records that contain duplicate values in the specified columns, leaving only a single instance.

distinct_rows

Filters out records that contain duplicate values in all columns, leaving only a single instance.

drop_columns

Drops the specified columns.

drop_errors

Drops rows where all or any of the selected columns are an Error.

drop_nulls

Drops rows where all or any of the selected columns are null.

duplicate_column

Creates new columns that are duplicates of the specified source columns.

error

Creates errors in a column for values that match the specified search value.

execute_inspector
execute_inspectors
extract_error_details

Extracts the error details from error values into a new column.

fill_errors

Fills all errors in a column with the specified value.

fill_nulls

Fills all nulls in a column with the specified value.

filter

Filters the data, leaving only the records that match the specified expression.

from_json

Loads a Dataflow from its JSON string representation.

get_files

Expands the path specified by reading globs and files in folders and outputs one record per file found.

get_missing_secrets

Get a list of missing secret IDs.

get_partition_count

Calculates the partitions for the current Dataflow and returns their count. Partitioning is guaranteed to be stable for a specific execution mode.

get_profile

Requests the data profile which collects summary statistics on the full data produced by the Dataflow. A data profile can be very useful to understand the input data, identify anomalies and missing values, and verify that data preparation operations produced the desired result.

has_invalid_source

Verifies whether the Dataflow has an invalid source.

head

Pulls the number of records specified from the top of this Dataflow and returns them as a pandas.DataFrame.

join

Creates a new Dataflow that is a result of joining two provided Dataflows.

keep_columns

Keeps the specified columns and drops all others.

label_encode

Adds a column with encoded labels generated from the source column. For explicit label encoding, use LabelEncoderBuilder.

map_column

Creates a new column where matching values in the source column have been replaced with the specified values.

map_partition

Applies a transformation to each partition in the Dataflow.

min_max_scale

Scales the values in the specified column to lie within range_min (default=0) and range_max (default=1).

multi_split

Split a Dataflow into multiple other Dataflows, each containing a random but exclusive sub-set of the data.

new_script_column

Adds a new column to the Dataflow using the passed in Python script.

new_script_filter

Filters the Dataflow using the passed in Python script.

null_coalesce

For each record, selects the first non-null value from the columns specified and uses it as the value of a new column.

one_hot_encode

Adds a binary column for each categorical label from the source column values. For more control over categorical labels, use OneHotEncodingBuilder.

open

Opens a Dataflow with specified name from the package file.

parse_delimited

Adds step to parse CSV data.

parse_fwf

Adds step to parse fixed-width data.

parse_json_column

Parses the values in the specified column as JSON objects and expands them into multiple columns.

parse_json_lines

Creates a new Dataflow with the operations required to read JSON lines files.

parse_lines

Adds step to parse text files and split them into lines.

pivot

Returns a new Dataflow with columns generated from the values in the selected columns to pivot.

promote_headers

Sets the first record in the dataset as headers, replacing any existing ones.

quantile_transform

Performs quantile transformation on the source_column and outputs the transformed result in new_column.

random_split

Returns two Dataflows from the original Dataflow, with records randomly and approximately split by the percentage specified (using a seed). If the percentage specified is p (where p is between 0 and 1), the first returned dataflow will contain approximately p*100% records from the original dataflow, and the second dataflow will contain all the remaining records that were not included in the first. A random seed will be used if none is provided.

read_excel

Adds step to read and parse Excel files.

read_json

Creates a new Dataflow with the operations required to read JSON files.

read_npz_file

Adds step to parse npz files.

read_parquet_dataset

Creates a step to read a Parquet file.

read_parquet_file

Adds step to parse Parquet files.

read_postgresql

Adds step that can read data from a PostgreSQL database by executing the query specified.

read_preppy

Adds step to read a directory containing Preppy files.

read_sql

Adds step that can read data from an MS SQL database by executing the query specified.

reference

Creates a reference to an existing activity object.

rename_columns

Renames the specified columns.

replace

Replaces values in a column that match the specified search value.

replace_datasource

Returns new Dataflow with its DataSource replaced by the given one.

replace_na

Replaces values in the specified columns with nulls. You can choose to use the default list, supply your own, or both.

replace_reference

Returns new Dataflow with its reference DataSource replaced by the given one.

round

Rounds the values in the column specified to the desired number of decimal places.

run_local

Runs the current Dataflow using the local execution runtime.

run_spark

Runs the current Dataflow using the Spark runtime.

save

Saves the Dataflow to the specified file.

select_partitions

Selects specific partitions from the data, dropping the rest.

set_column_types

Converts values in specified columns to the corresponding data types.

skip

Skips the specified number of records.

sort

Sorts the dataset by the specified columns.

sort_asc

Sorts the dataset in ascending order by the specified columns.

sort_desc

Sorts the dataset in descending order by the specified columns.

split_column_by_delimiters

Splits the provided column and adds the resulting columns to the dataflow.

split_column_by_example

Splits the provided column and adds the resulting columns to the dataflow based on the provided example.

split_stype

Creates new columns from an existing column, interpreting its values as a semantic type.

str_replace

Replaces values in a string column that match a search string with the specified value.

summarize

Summarizes data by running aggregate functions over specific columns.

take

Takes the specified count of records.

take_sample

Takes a random sample of the available records.

take_stratified_sample

Takes a random stratified sample of the available records according to input fractions.

to_bool

Converts the values in the specified columns to booleans.

to_csv_streams

Creates streams with the data in delimited format.

to_dask_dataframe

Returns a Dask DataFrame that can lazily read the data in the Dataflow.

to_data_frame_directory

Creates streams with the data in dataframe directory format.

to_datetime

Converts the values in the specified columns to DateTimes.

to_json

Get the JSON string representation of the Dataflow.

to_long

Converts the values in the specified columns to 64 bit integers.

to_number

Converts the values in the specified columns to floating point numbers.

to_pandas_dataframe

Pulls all of the data and returns it as a pandas.DataFrame.

to_parquet_streams

Creates streams with the data in parquet format.

to_partition_iterator

Creates an iterable object that returns the partitions produced by this Dataflow in sequence. This iterable must be closed before any other executions can run.

to_record_iterator

Creates an iterable object that returns the records produced by this Dataflow in sequence. This iterable must be closed before any other executions can run.

to_spark_dataframe

Creates a pyspark.sql.DataFrame that can execute the transformation pipeline defined by this Dataflow.

to_string

Converts the values in the specified columns to strings.

transform_partition

Applies a transformation to each partition in the Dataflow.

  • This function has been deprecated and will be removed in a future version. Please use map_partition instead.
transform_partition_with_file

Transforms an entire partition using the Python script in the passed in file.

trim_string

Trims string values in specific columns.

use_secrets

Uses the passed in secrets for execution.

verify_has_data

Verifies that this Dataflow would produce records if executed. An exception will be thrown otherwise.

write_streams

Writes the streams in the specified column to the destination path. By default, the name of the files written will be the resource identifier of the streams. This behavior can be overridden by specifying a column which contains the names to use.

write_to_csv

Write out the data in the Dataflow in a delimited text format. The output is specified as a directory which will contain multiple files, one per partition processed in the Dataflow.

write_to_parquet

Writes out the data in the Dataflow into Parquet files.

write_to_preppy

Writes out the data in the Dataflow into Preppy files, a DataPrep serialization format.

write_to_sql

Adds a step that writes out the data in the Dataflow into a table in an MS SQL database.

zip_partitions

Appends the columns from the referenced dataflows to the current one. This is different from AppendColumns in that it assumes all dataflows being appended have the same number of partitions and same number of Records within each corresponding partition. If these two conditions are not true the operation will fail.

add_column

Adds a new column to the dataset. The values in the new column will be the result of invoking the specified expression.

add_column(expression: azureml.dataprep.api.expressions.Expression, new_column_name: str, prior_column: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

expression

The expression to evaluate to generate the values in the column.

new_column_name

The name of the new column.

prior_column

The name of the column after which the new column should be added. The default is to add the new column as the last column.

Returns

The modified Dataflow.

Remarks

Expressions are built using the expression builders in the expressions module and the functions in the functions module. The resulting expression will be lazily evaluated for each record when a data pull occurs and not where it is defined.
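
For instance (a hedged sketch; the column names are illustrative), an expression built by indexing the Dataflow, as shown in the filter remarks below, can supply the new column's values:


   # Adds a boolean column computed lazily from an existing column.
   dataflow = dataflow.add_column(
       expression=dataflow['Metric'] > 100,
       new_column_name='MetricAbove100',
       prior_column='Metric')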

add_step

add_step(step_type: str, arguments: typing.Dict[str, typing.Any], local_data: typing.Dict[str, typing.Any] = None) -> azureml.dataprep.api.dataflow.Dataflow

append_columns

Appends the columns from the referenced dataflows to the current one. Duplicate columns will result in failure.

append_columns(dataflows: typing.List[_ForwardRef('DataflowReference')], parallelize: bool = True) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

dataflows

The dataflows to append.

parallelize

Whether to parallelize the operation. If true, the data for all inputs will be loaded into memory.

Returns

The modified Dataflow.

append_rows

Appends the records in the specified dataflows to the current one. If the schemas of the dataflows are distinct, this will result in records with different schemas.

append_rows(dataflows: typing.List[_ForwardRef('DataflowReference')]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

dataflows

The dataflows to append.

Returns

The modified Dataflow.

assert_value

Ensures that values in the specified columns satisfy the provided expression. This is useful to identify anomalies in the dataset and avoid broken pipelines by handling assertion errors in a clean way.

assert_value(columns: MultiColumnSelection, expression: azureml.dataprep.api.expressions.Expression, policy: azureml.dataprep.api.engineapi.typedefinitions.AssertPolicy = <AssertPolicy.ERRORVALUE: 1>, error_code: str = 'AssertionFailed') -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

Columns to apply evaluation to.

expression

Expression that has to be evaluated to be True for the value to be kept.

policy

Action to take when expression is evaluated to False. Options are FAILEXECUTION and ERRORVALUE. FAILEXECUTION ensures that any data that violates the assertion expression during execution will immediately fail the job. This is useful to save computing resources and time. ERRORVALUE captures any data that violates the assertion expression by replacing it with error_code. This allows you to handle these error values by either filtering or replacing them.

error_code

Error message to use to replace values failing the assertion or failing an execution.

Returns

The modified Dataflow.
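
A hedged sketch (assuming the col expression builder from the expressions module and an illustrative 'Score' column); values that violate the expression become error values tagged with error_code:


   import azureml.dataprep as dprep

   # Replace any Score above 100 with an error value instead of failing the run.
   dataflow = dataflow.assert_value(
       columns='Score',
       expression=dprep.col('Score') <= 100,
       error_code='ScoreAbove100')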

cache

Pulls all the records from this Dataflow and caches the result to disk.

cache(directory_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

directory_path

The directory to save cache files.

Returns

The modified Dataflow.

Remarks

This is very useful when data is accessed repeatedly, as future executions will reuse the cached result without pulling the same Dataflow again.
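
A minimal sketch (the cache directory path is illustrative):


   # Subsequent pulls reuse the cached data on disk instead of re-executing prior steps.
   cached_dataflow = dataflow.cache(directory_path='./dataflow_cache')
   profile = cached_dataflow.get_profile()
   df = cached_dataflow.to_pandas_dataframe()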

clip

Clips values so that all values are between the lower and upper boundaries.

clip(columns: MultiColumnSelection, lower: typing.Union[float, NoneType] = None, upper: typing.Union[float, NoneType] = None, use_values: bool = True) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

lower

All values lower than this value will be set to this value.

upper

All values higher than this value will be set to this value.

use_values

If true, values outside boundaries will be set to the boundary values. If false, those values will be set to null.

Returns

The modified Dataflow.

convert_unix_timestamp_to_datetime

Converts the specified column to DateTime values by treating the existing value as a Unix timestamp.

convert_unix_timestamp_to_datetime(columns: MultiColumnSelection, use_seconds: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

use_seconds

Whether to use seconds as the resolution. Milliseconds are used if false.

Returns

The modified Dataflow.

derive_column_by_example

Inserts a column by learning a program based on a set of source columns and provided examples. Dataprep will attempt to achieve the intended derivation inferred from provided examples.

derive_column_by_example(source_columns: SourceColumns, new_column_name: str, example_data: ExampleData) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_columns

Names of the columns from which the new column will be derived.

new_column_name

Name of the new column to add.

example_data

Examples to use as input for program generation. If there is only one source column, each example can be a tuple of the source value and the intended target value; for example, example_data=[("2013-08-22", "Thursday"), ("2013-11-03", "Sunday")]. When multiple columns should be considered as source, each example should be a tuple of a dict-like source and the intended target value, where the source has column names as keys and column values as values.

Returns

The modified Dataflow.

Remarks

If you need more control of examples and generated program, create DeriveColumnByExampleBuilder instead.
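
A hedged sketch using the single-source-column example format described above (the 'Date' column name is illustrative):


   # Derive a day-of-week column from a date column using two examples.
   dataflow = dataflow.derive_column_by_example(
       source_columns='Date',
       new_column_name='DayOfWeek',
       example_data=[('2013-08-22', 'Thursday'), ('2013-11-03', 'Sunday')])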

distinct

Filters out records that contain duplicate values in the specified columns, leaving only a single instance.

distinct(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

Returns

The modified Dataflow.

distinct_rows

Filters out records that contain duplicate values in all columns, leaving only a single instance.

distinct_rows() -> azureml.dataprep.api.dataflow.Dataflow

Returns

The modified Dataflow.

drop_columns

Drops the specified columns.

drop_columns(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

Returns

The modified Dataflow.

drop_errors

Drops rows where all or any of the selected columns are an Error.

drop_errors(columns: MultiColumnSelection, column_relationship: azureml.dataprep.api.engineapi.typedefinitions.ColumnRelationship = <ColumnRelationship.ALL: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

column_relationship

Whether all or any of the selected columns must be an Error.

Returns

The modified Dataflow.

drop_nulls

Drops rows where all or any of the selected columns are null.

drop_nulls(columns: MultiColumnSelection, column_relationship: azureml.dataprep.api.engineapi.typedefinitions.ColumnRelationship = <ColumnRelationship.ALL: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

column_relationship

Whether all or any of the selected columns must be null.

Returns

The modified Dataflow.

duplicate_column

Creates new columns that are duplicates of the specified source columns.

duplicate_column(column_pairs: typing.Dict[str, str]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column_pairs

Mapping of the columns to duplicate to their new names.

Returns

The modified Dataflow.

error

Creates errors in a column for values that match the specified search value.

error(columns: MultiColumnSelection, find: typing.Any, error_code: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

find

The value to find, or None.

error_code

The error code to use in new errors, or None.

Returns

The modified Dataflow.

Remarks

The following types are supported for the find argument: str, int, float, datetime.datetime, and bool.

execute_inspector

execute_inspector(inspector: azureml.dataprep.api.inspector.BaseInspector) -> azureml.dataprep.api.engineapi.typedefinitions.ExecuteInspectorCommonResponse

execute_inspectors

execute_inspectors(inspectors: typing.List[azureml.dataprep.api.inspector.BaseInspector]) -> typing.Dict[azureml.dataprep.api.engineapi.typedefinitions.InspectorArguments, azureml.dataprep.api.engineapi.typedefinitions.ExecuteInspectorCommonResponse]

extract_error_details

Extracts the error details from error values into a new column.

extract_error_details(column: str, error_value_column: str, extract_error_code: bool = False, error_code_column: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The source column.

error_value_column

Name of a column to hold the original value of the error.

extract_error_code

Whether to also extract the error code.

error_code_column

Name of a column to hold the error code.

Returns

The modified Dataflow.

fill_errors

Fills all errors in a column with the specified value.

fill_errors(columns: MultiColumnSelection, fill_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

fill_with

The value to fill errors with, or None.

Returns

The modified Dataflow.

Remarks

The following types are supported for the fill_with argument: str, int, float, datetime.datetime, and bool.

fill_nulls

Fills all nulls in a column with the specified value.

fill_nulls(columns: MultiColumnSelection, fill_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

fill_with

The value to fill nulls with.

Returns

The modified Dataflow.

Remarks

The following types are supported for the fill_with argument: str, int, float, datetime.datetime, and bool.

filter

Filters the data, leaving only the records that match the specified expression.

filter(expression: azureml.dataprep.api.expressions.Expression) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

expression

The expression to evaluate.

Returns

The modified Dataflow.

Remarks

Expressions are started by indexing the Dataflow with the name of a column. They support a variety of functions and operators and can be combined using logical operators. The resulting expression will be lazily evaluated for each record when a data pull occurs and not where it is defined.


   dataflow['myColumn'] > dataflow['columnToCompareAgainst']
   dataflow['myColumn'].starts_with('prefix')
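
To combine conditions (a hedged sketch; f_and is assumed to be available from the functions module, and the column names are illustrative):


   import azureml.dataprep as dprep

   # Keep only records that satisfy both conditions.
   dataflow = dataflow.filter(dprep.f_and(
       dataflow['Metric'] > 100,
       dataflow['Name'].starts_with('prefix')))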

from_json

Loads a Dataflow from its JSON string representation.

from_json(dataflow_json: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

dataflow_json

JSON string representation of the Dataflow.

Returns

New Dataflow object constructed from the JSON string.
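
A hedged round-trip sketch, assuming from_json is invoked on the Dataflow class:


   import azureml.dataprep as dprep

   # Serialize the Dataflow and reconstruct it from its JSON representation.
   dataflow_json = dataflow.to_json()
   restored = dprep.Dataflow.from_json(dataflow_json)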

get_files

Expands the path specified by reading globs and files in folders and outputs one record per file found.

get_files(path: FilePath) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

path

The path or paths to expand.

Returns

A new Dataflow.

get_missing_secrets

Get a list of missing secret IDs.

get_missing_secrets() -> typing.List[str]

Returns

A list of missing secret IDs.

get_partition_count

Calculates the partitions for the current Dataflow and returns their count. Partitioning is guaranteed to be stable for a specific execution mode.

get_partition_count() -> int

Returns

The count of partitions.

get_profile

Requests the data profile which collects summary statistics on the full data produced by the Dataflow. A data profile can be very useful to understand the input data, identify anomalies and missing values, and verify that data preparation operations produced the desired result.

get_profile(include_stype_counts: bool = False, number_of_histogram_bins: int = 10) -> azureml.dataprep.api.dataprofile.DataProfile

Parameters

include_stype_counts
bool

Whether to check if values look like some well-known semantic types of information, for example "email address". Turning this on will impact performance.

number_of_histogram_bins
int

Number of bins in a histogram. If not specified, it will be set to 10.

Returns

A DataProfile object.

has_invalid_source

Verifies whether the Dataflow has an invalid source.

has_invalid_source(return_validation_error=False)

Parameters

return_validation_error

Behavior when the source is known to be invalid. If True, the validation error message is returned, which is useful for gathering more information about why the source failed validation. If False, True is returned.

Returns

Returns the following based on the parameter:

  • Returns the error message if return_validation_error is True and the source is known to be invalid.
  • Returns True if return_validation_error is False and the source is known to be invalid.
  • Returns False if the source is known to be valid or its validity is unknown.

head

Pulls the number of records specified from the top of this Dataflow and returns them as a pandas.DataFrame.

head(count: int = 10) -> pandas.core.frame.DataFrame

Parameters

count

The number of records to pull. The default is 10.

Returns

A pandas.DataFrame.

join

Creates a new Dataflow that is a result of joining two provided Dataflows.

join(left_dataflow: DataflowReference, right_dataflow: DataflowReference, join_key_pairs: typing.List[typing.Tuple[str, str]] = None, join_type: azureml.dataprep.api.dataflow.JoinType = <JoinType.MATCH: 2>, left_column_prefix: str = 'l_', right_column_prefix: str = 'r_', left_non_prefixed_columns: typing.List[str] = None, right_non_prefixed_columns: typing.List[str] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

left_dataflow

Left Dataflow or DataflowReference to join with.

right_dataflow

Right Dataflow or DataflowReference to join with.

join_key_pairs

Key column pairs. List of tuples of columns names where each tuple forms a key pair to join on. For instance: [('column_from_left_dataflow', 'column_from_right_dataflow')]

join_type

Type of join to perform. Match is default.

left_column_prefix

Prefix to use in result Dataflow for columns originating from left_dataflow. Needed to avoid column name conflicts at runtime.

right_column_prefix

Prefix to use in result Dataflow for columns originating from right_dataflow. Needed to avoid column name conflicts at runtime.

left_non_prefixed_columns

List of column names from left_dataflow that should not be prefixed with left_column_prefix. Every other column appearing in the data at runtime will be prefixed.

right_non_prefixed_columns

List of column names from right_dataflow that should not be prefixed with right_column_prefix. Every other column appearing in the data at runtime will be prefixed.

Returns

The new Dataflow.
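
A hedged sketch, assuming join is invoked on the Dataflow class with two existing Dataflows named customers and orders (illustrative):


   import azureml.dataprep as dprep

   # Join the two Dataflows on matching customer IDs.
   joined = dprep.Dataflow.join(
       left_dataflow=customers,
       right_dataflow=orders,
       join_key_pairs=[('CustomerId', 'CustomerId')])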

keep_columns

Keeps the specified columns and drops all others.

keep_columns(columns: MultiColumnSelection, validate_column_exists: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

validate_column_exists

Whether to validate the columns selected.

Returns

The modified Dataflow.

label_encode

Adds a column with encoded labels generated from the source column. For explicit label encoding, use LabelEncoderBuilder.

label_encode(source_column: str, new_column_name: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_column

Column name from which encoded labels will be generated.

new_column_name

The new column's name.

Returns

The modified Dataflow.

map_column

Creates a new column where matching values in the source column have been replaced with the specified values.

map_column(column: str, new_column_id: str, replacements: typing.Union[typing.List[azureml.dataprep.api.dataflow.ReplacementsValue], NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The source column.

new_column_id

The name of the mapped column.

replacements

The values to replace and their replacements.

Returns

The modified Dataflow.

map_partition

Applies a transformation to each partition in the Dataflow.

Parameters

fn

A callable that will be invoked to transform each partition.

data_format

An optional string specifying the input format to fn. Supported formats: 'dataframe', 'csr', 'lastProcessed'.

Returns

The modified Dataflow.

Remarks

The function passed in must take two parameters: a pandas.DataFrame or scipy.sparse.csr_matrix containing the data for the partition, and an index. The return value must be a pandas.DataFrame or a scipy.sparse.csr_matrix containing the transformed data. By default, the 'lastProcessed' format is passed into fn; that is, if the data coming into map_partition is sparse it will be sent as a scipy.sparse.csr_matrix, and if it is dense it will be sent as a pandas.DataFrame. The desired input data format can be set explicitly using the data_format parameter.

Note

df does not usually contain all of the data in the dataflow, but a partition of the data as it is being processed in the runtime. The number and contents of each partition are not guaranteed across runs.

The transform function can fully edit the passed in dataframe or even create a new one, but must return a dataframe. Any libraries that the Python script imports must exist in the environment where the dataflow is run.
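
A hedged sketch (assuming the data_format parameter name used above and an illustrative 'Metric' column), forcing each partition to arrive as a pandas.DataFrame:


   # 'df' holds one partition of the data; 'index' identifies the partition.
   def transform(df, index):
       df['MetricSquared'] = df['Metric'] ** 2
       return df

   dataflow = dataflow.map_partition(fn=transform, data_format='dataframe')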

min_max_scale

Scales the values in the specified column to lie within range_min (default=0) and range_max (default=1).

min_max_scale(column: str, range_min: float = 0, range_max: float = 1, data_min: float = None, data_max: float = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The column to scale.

range_min

Desired min of scaled values.

range_max

Desired max of scaled values.

data_min

Min of source column. If not provided, a scan of the data will be performed to find the min.

data_max

Max of source column. If not provided, a scan of the data will be performed to find the max.

Returns

The modified Dataflow.

multi_split

Split a Dataflow into multiple other Dataflows, each containing a random but exclusive sub-set of the data.

multi_split(splits: int, seed: typing.Union[int, NoneType] = None)

Parameters

splits

The number of splits.

seed

The seed to use for the random split.

Returns

A list containing one Dataflow per split.

new_script_column

Adds a new column to the Dataflow using the passed in Python script.

new_script_column(new_column_name: str, insert_after: str, script: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

new_column_name

The name of the new column being created.

insert_after

The column after which the new column will be inserted.

script

The script that will be used to create this new column.

Returns

The modified Dataflow.

Remarks

The Python script must define a function called newvalue that takes a single argument, typically called row. The row argument is a dict (key is column name, value is current value) and will be passed to this function for each row in the dataset. This function must return a value to be used in the new column.

Note

Any libraries that the Python script imports must exist in the environment where the dataflow is run.


   import numpy as np
   def newvalue(row):
       return np.log(row['Metric'])
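
The script is passed to new_script_column as a string; a hedged usage sketch for the example above (the 'Metric' column and new column name are illustrative):


   script = (
       "import numpy as np\n"
       "def newvalue(row):\n"
       "    return np.log(row['Metric'])\n")
   dataflow = dataflow.new_script_column(
       new_column_name='LogMetric', insert_after='Metric', script=script)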

new_script_filter

Filters the Dataflow using the passed in Python script.

new_script_filter(script: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

script

The script that will be used to filter the dataflow.

Returns

The modified Dataflow.

Remarks

The Python script must define a function called includerow that takes a single argument, typically called row. The row argument is a dict (key is column name, value is current value) and will be passed to this function for each row in the dataset. This function must return True or False depending on whether the row should be included in the dataflow. Any libraries that the Python script imports must exist in the environment where the dataflow is run.


   def includerow(row):
       return row['Metric'] > 100
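
A hedged usage sketch passing the script above as a string:


   script = (
       "def includerow(row):\n"
       "    return row['Metric'] > 100\n")
   dataflow = dataflow.new_script_filter(script)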

null_coalesce

For each record, selects the first non-null value from the columns specified and uses it as the value of a new column.

null_coalesce(columns: typing.List[str], new_column_id: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

new_column_id

The name of the new column.

Returns

The modified Dataflow.

one_hot_encode

Adds a binary column for each categorical label from the source column values. For more control over categorical labels, use OneHotEncodingBuilder.

one_hot_encode(source_column: str, prefix: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_column

Column name from which categorical labels will be generated.

prefix

An optional prefix to use for the new columns.

Returns

The modified Dataflow.

open

Opens a Dataflow with specified name from the package file.

open(file_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

file_path

Path to the package containing the Dataflow.

Returns

The Dataflow.

parse_delimited

Adds step to parse CSV data.

parse_delimited(separator: str, headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, quoting: bool, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode, comment: str, partition_size: typing.Union[int, NoneType] = None, empty_as_string: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

separator

The separator to use to split columns.

headers_mode

How to determine column headers.

encoding

The encoding of the files being read.

quoting

Whether to handle new line characters within quotes. This option will impact performance.

skip_rows

How many rows to skip.

skip_mode

The mode in which rows are skipped.

comment

Character used to indicate a line is a comment instead of data in the files being read.

partition_size

Desired partition size.

empty_as_string

Whether to keep empty field values as empty strings. The default is to read them as null.

Returns

A new Dataflow with Parse Delimited Step added.

parse_fwf

Adds step to parse fixed-width data.

parse_fwf(offsets: typing.List[int], headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

offsets

The offsets at which to split columns. The first column is always assumed to start at offset 0.

headers_mode

How to determine column headers.

encoding

The encoding of the files being read.

skip_rows

How many rows to skip.

skip_mode

The mode in which rows are skipped.

Returns

A new Dataflow with Parse FixedWidthFile Step added.

parse_json_column

Parses the values in the specified column as JSON objects and expands them into multiple columns.

parse_json_column(column: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The source column.

Returns

The modified Dataflow.

parse_json_lines

Creates a new Dataflow with the operations required to read JSON lines files.

parse_json_lines(encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, partition_size: typing.Union[int, NoneType] = None, invalid_lines: azureml.dataprep.api.engineapi.typedefinitions.InvalidLineHandling = <InvalidLineHandling.ERROR: 1>) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

invalid_lines

How to handle invalid JSON lines.

encoding

The encoding of the files being read.

partition_size

Desired partition size.

Returns

A new Dataflow with Read JSON line Step added.

parse_lines

Adds step to parse text files and split them into lines.

parse_lines(headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode, comment: str, partition_size: typing.Union[int, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

headers_mode

How to determine column headers.

encoding

The encoding of the files being read.

skip_rows

How many rows to skip.

skip_mode

The mode in which rows are skipped.

comment

Character used to indicate a line is a comment instead of data in the files being read.

partition_size

Desired partition size.

Returns

A new Dataflow with Parse Lines Step added.

pivot

Returns a new Dataflow with columns generated from the values in the selected columns to pivot.

pivot(columns_to_pivot: typing.List[str], value_column: str, summary_function: azureml.dataprep.api.dataflow.SummaryFunction = None, group_by_columns: typing.List[str] = None, null_value_replacement: str = None, error_value_replacement: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns_to_pivot

The columns whose values are used to generate the new dataflow's columns.

value_column

The column used to get the values that will populate the new dataflow's rows.

Remarks

The values of the new dataflow come from the value column selected. Additionally, there is an optional summarization that consists of an aggregation and a group-by.

promote_headers

Sets the first record in the dataset as headers, replacing any existing ones.

promote_headers() -> azureml.dataprep.api.dataflow.Dataflow

Returns

The modified Dataflow.

quantile_transform

Performs quantile transformation on the source_column and outputs the transformed result in new_column.

quantile_transform(source_column: str, new_column: str, quantiles_count: int = 1000, output_distribution: str = 'Uniform')

Parameters

source_column

The column which quantile transform will be applied to.

new_column

The column where the transformed data will be placed.

quantiles_count

The number of quantiles used. This will be used to discretize the CDF. Defaults to 1000.

output_distribution

The distribution of the transformed data. Defaults to 'Uniform'.

Returns

The modified Dataflow.

random_split

Returns two Dataflows from the original Dataflow, with records randomly and approximately split by the percentage specified (using a seed). If the percentage specified is p (where p is between 0 and 1), the first returned dataflow will contain approximately p*100% records from the original dataflow, and the second dataflow will contain all the remaining records that were not included in the first. A random seed will be used if none is provided.

random_split(percentage: float, seed: typing.Union[int, NoneType] = None) -> ('Dataflow', 'Dataflow')

Parameters

percentage

The approximate percentage to split the Dataflow by. This must be a number between 0.0 and 1.0.

seed

The seed to use for the random split.

Returns

Two Dataflows containing records randomly split from the original Dataflow. If the percentage specified is p, the first dataflow contains approximately p*100% of the records from the original dataflow.
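
A minimal sketch:


   # Roughly 80% of records go to 'train'; the rest go to 'test'.
   train, test = dataflow.random_split(percentage=0.8, seed=42)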

read_excel

Adds step to read and parse Excel files.

read_excel(sheet_name: typing.Union[str, NoneType] = None, use_column_headers: bool = False, skip_rows: int = 0) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

sheet_name

The name of the sheet to load. The first sheet is loaded if none is provided.

use_column_headers

Whether to use the first row as column headers.

skip_rows

Number of rows to skip when loading the data.

Returns

A new Dataflow with Read Excel Step added.

read_json

Creates a new Dataflow with the operations required to read JSON files.

read_json(json_extract_program: str, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding)

Parameters

json_extract_program

PROSE program that will be used to parse JSON.

encoding

The encoding of the files being read.

Returns

A new Dataflow with Read JSON Step added.

read_npz_file

Adds step to parse npz files.

read_npz_file() -> azureml.dataprep.api.dataflow.Dataflow

Returns

A new Dataflow with Read Npz File Step added.

read_parquet_dataset

Creates a step to read a Parquet file.

read_parquet_dataset(path: azureml.dataprep.api.datasources.FileDataSource) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

path

The path to the Parquet file.

Returns

A new Dataflow.

read_parquet_file

Adds step to parse Parquet files.

read_parquet_file() -> azureml.dataprep.api.dataflow.Dataflow

Returns

A new Dataflow with Parse Parquet File Step added.

read_postgresql

Adds step that can read data from a PostgreSQL database by executing the query specified.

read_postgresql(data_source: azureml.dataprep.api.datasources.PostgreSQLDataSource, query: str, query_timeout: int = 20) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

data_source

The details of the PostgreSQL database.

query

The query to execute to read data.

query_timeout

Sets the wait time (in seconds) before terminating the attempt to execute a command and generating an error. The default is 20 seconds.

Returns

A new Dataflow with Read SQL Step added.

read_preppy

Adds step to read a directory containing Preppy files.

read_preppy() -> azureml.dataprep.api.dataflow.Dataflow

Returns

A new Dataflow with Read Preppy Step added.

read_sql

Adds step that can read data from an MS SQL database by executing the query specified.

read_sql(data_source: azureml.dataprep.api.datasources.MSSQLDataSource, query: str, query_timeout: int = 30) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

data_source

The details of the MS SQL database.

query

The query to execute to read data.

query_timeout

Sets the wait time (in seconds) before terminating the attempt to execute a command and generating an error. The default is 30 seconds.

Returns

A new Dataflow with Read SQL Step added.

reference

Creates a reference to an existing activity object.

reference(reference: DataflowReference) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

reference

The reference activity.

Returns

A new Dataflow.

rename_columns

Renames the specified columns.

rename_columns(column_pairs: typing.Dict[str, str]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column_pairs

The columns to rename and the desired new names.

Returns

The modified Dataflow.

replace

Replaces values in a column that match the specified search value.

replace(columns: MultiColumnSelection, find: typing.Any, replace_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

find

The value to find, or None.

replace_with

The replacement value, or None.

Returns

The modified Dataflow.

Remarks

The following types are supported for both the find or replace arguments: str, int, float, datetime.datetime, and bool.

replace_datasource

Returns new Dataflow with its DataSource replaced by the given one.

replace_datasource(new_datasource: DataSource) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

new_datasource

DataSource to substitute into new Dataflow.

Returns

The modified Dataflow.

Remarks

The given 'new_datasource' must match the type of datasource already in the Dataflow. For example a MSSQLDataSource cannot be replaced with a FileDataSource.

replace_na

Replaces values in the specified columns with nulls. You can choose to use the default list, supply your own, or both.

replace_na(columns: MultiColumnSelection, use_default_na_list: bool = True, use_empty_string_as_na: bool = True, use_nan_as_na: bool = True, custom_na_list: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

use_default_na_list

Use the default list and replace 'null', 'NaN', 'NA', and 'N/A' with null.

use_empty_string_as_na

Replace empty strings with null.

use_nan_as_na

Replace NaNs with Null.

custom_na_list

Provide a comma separated list of values to replace with null.

columns

The source columns.

Returns

The modified Dataflow.

replace_reference

Returns new Dataflow with its reference DataSource replaced by the given one.

replace_reference(new_reference: DataflowReference) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

new_reference

Reference to be substituted for current Reference in Dataflow.

Returns

The modified Dataflow.

round

Rounds the values in the column specified to the desired number of decimal places.

round(column: str, decimal_places: int) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The source column.

decimal_places

The number of decimal places.

Returns

The modified Dataflow.

run_local

Runs the current Dataflow using the local execution runtime.

run_local() -> None

run_spark

Runs the current Dataflow using the Spark runtime.

run_spark() -> None

save

Saves the Dataflow to the specified file.

save(file_path: str)

Parameters

file_path

The path of the file.

select_partitions

Selects specific partitions from the data, dropping the rest.

select_partitions(partition_indices: typing.List[int]) -> azureml.dataprep.api.dataflow.Dataflow

Returns

The modified Dataflow.

set_column_types

Converts values in specified columns to the corresponding data types.

set_column_types(type_conversions: typing.Dict[str, TypeConversionInfo]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

type_conversions

A dictionary where the key is the column name and the value is either a FieldType, a TypeConverter, or a tuple of DATE (FieldType) and a list of format strings.

Returns

The modified Dataflow.

Remarks

The values in the type_conversions dictionary could be of several types:


   import azureml.dataprep as dprep

   dataflow = dprep.read_csv(path='./some/path')
   dataflow = dataflow.set_column_types({'MyNumericColumn': dprep.FieldType.DECIMAL,
                                      'MyBoolColumn': dprep.FieldType.BOOLEAN,
                                      'MyAutodetectDateColumn': dprep.FieldType.DATE,
                                      'MyDateColumnWithFormat': (dprep.FieldType.DATE, ['%m-%d-%Y']),
                                      'MyOtherDateColumn': dprep.DateTimeConverter(['%d-%m-%Y'])})

Note

If you choose to convert a column to DATE and do not provide format(s) to use, DataPrep will attempt to infer a format to use by pulling some of the data. If a format can be inferred from the data, it will be used to convert values in the corresponding column. However, if a format cannot be inferred, this step will fail and you will need to either provide the format to use or use the interactive builder ColumnTypesBuilder by calling 'dataflow.builders.set_column_types()'.

skip

Skips the specified number of records.

skip(count: int) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

count

The number of records to skip.

Returns

The modified Dataflow.

sort

Sorts the dataset by the specified columns.

sort(sort_order: typing.List[typing.Tuple[str, bool]]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

sort_order

The columns to sort by and whether to sort ascending or descending. True is treated as descending, false as ascending.

Returns

The modified Dataflow.

sort_asc

Sorts the dataset in ascending order by the specified columns.

sort_asc(columns: SortColumns) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The columns to sort in ascending order.

Returns

The modified Dataflow.

sort_desc

Sorts the dataset in descending order by the specified columns.

sort_desc(columns: SortColumns) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The columns to sort in descending order.

Returns

The modified Dataflow.

split_column_by_delimiters

Splits the provided column and adds the resulting columns to the dataflow.

split_column_by_delimiters(source_column: str, delimiters: Delimiters, keep_delimiters: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_column

Column to split.

delimiters

String or list of strings to be deemed as column delimiters.

keep_delimiters

Controls whether to keep or drop column with delimiters.

Returns

The modified Dataflow.

Remarks

This will pull a small sample of the data, determine the number of columns to expect as a result of the split, and generate a split program that ensures the expected number of columns is produced, so that there is a deterministic schema after this operation.
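
A hedged sketch (the 'FullName' column is illustrative):


   # Split 'FullName' on spaces into new columns appended to the Dataflow.
   dataflow = dataflow.split_column_by_delimiters(
       source_column='FullName', delimiters=' ', keep_delimiters=False)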

split_column_by_example

Splits the provided column and adds the resulting columns to the dataflow based on the provided example.

split_column_by_example(source_column: str, example: SplitExample = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_column

Column to split.

example

Example to use for program generation.

Returns

The modified Dataflow.

Remarks

This will pull a small sample of the data, determine the best program to satisfy the provided example, and generate a split program that ensures the expected number of columns is produced, so that there is a deterministic schema after this operation.

Note

If example is not provided, this will generate a split program based on common split patterns, such as splitting by space, punctuation, or date parts.

split_stype

Creates new columns from an existing column, interpreting its values as a semantic type.

split_stype(column: str, stype: azureml.dataprep.api.dataflow.SType, stype_fields: typing.Union[typing.List[str], NoneType] = None, new_column_names: typing.Union[typing.List[str], NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The source column.

stype

The semantic type used to interpret values in the column.

stype_fields

Fields of the semantic type to use. If not provided, all fields will be used.

new_column_names

Names of the new columns. If not provided, new columns will be named with the source column name plus the semantic type field name.

Returns

The modified Dataflow.

str_replace

Replaces values in a string column that match a search string with the specified value.

str_replace(columns: MultiColumnSelection, value_to_find: typing.Union[str, NoneType] = None, replace_with: typing.Union[str, NoneType] = None, match_entire_cell_contents: bool = False, use_special_characters: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

value_to_find

The value to find.

replace_with

The replacement value.

match_entire_cell_contents

Whether the value to find must match the entire value.

use_special_characters

If True, you can use '#(tab)', '#(cr)', or '#(lf)' to represent special characters in the find or replace arguments.

Returns

The modified Dataflow.

summarize

Summarizes data by running aggregate functions over specific columns.

summarize(summary_columns: typing.Union[typing.List[azureml.dataprep.api.dataflow._SummaryColumnsValue], NoneType] = None, group_by_columns: typing.Union[typing.List[str], NoneType] = None, join_back: bool = False, join_back_columns_prefix: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

summary_columns

List of SummaryColumnsValue, where each value defines the column to summarize, the summary function to use, and the name of the resulting column to add.

group_by_columns

Columns to group by.

join_back

Whether to append subtotals or replace current data with them.

join_back_columns_prefix

Prefix to use for subtotal columns when appending them to current data.

Returns

The modified Dataflow.

Remarks

The aggregate functions are independent and it is possible to aggregate the same column multiple times. Unique names have to be provided for the resulting columns. The aggregations can be grouped, in which case one record is returned per group; or ungrouped, in which case one record is returned for the whole dataset. Additionally, the results of the aggregations can either replace the current dataset or augment it by appending the result columns.


   import azureml.dataprep as dprep

   dataflow = dprep.read_csv(path='./some/path')
   dataflow = dataflow.summarize(
       summary_columns=[
           dprep.SummaryColumnsValue(
               column_id='Column To Summarize',
               summary_column_name='New Column of Counts',
               summary_function=dprep.SummaryFunction.COUNT)],
       group_by_columns=['Column 1', 'Column 2'])

take

Takes the specified count of records.

take(count: int) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

count

The number of records to take.

Returns

The modified Dataflow.

take_sample

Takes a random sample of the available records.

take_sample(probability: float, seed: typing.Union[int, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

probability

The probability of a record being included in the sample.

seed

The seed to use for the random generator.

Returns

The modified Dataflow.

take_stratified_sample

Takes a random stratified sample of the available records according to input fractions.

take_stratified_sample(columns: MultiColumnSelection, fractions: typing.Dict[typing.Tuple, int], seed: typing.Union[int, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The strata columns.

fractions

The strata to strata weights.

seed

The seed to use for the random generator.

Returns

The modified Dataflow.

to_bool

Converts the values in the specified columns to booleans.

to_bool(columns: MultiColumnSelection, true_values: typing.List[str], false_values: typing.List[str], mismatch_as: azureml.dataprep.api.dataflow.MismatchAsOption = <MismatchAsOption.ASERROR: 2>) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

true_values

The values to treat as true.

false_values

The values to treat as false.

mismatch_as

How to treat values that don't match the values in the true or false values lists.

Returns

The modified Dataflow.

to_csv_streams

Creates streams with the data in delimited format.

to_csv_streams(separator: str = ',', na: str = 'NA', error: str = 'ERROR') -> azureml.dataprep.api.dataflow.Dataflow

Returns

The modified Dataflow.

to_dask_dataframe

Returns a Dask DataFrame that can lazily read the data in the Dataflow.

to_dask_dataframe(sample_size: int = 10000, dtypes: dict = None, extended_types: bool = False, nulls_as_nan: bool = True, on_error: str = 'null', out_of_range_datetime: str = 'null')

Parameters

sample_size

The number of records to read to determine schema and types.

dtypes

An optional dict specifying the expected columns and their dtypes. sample_size is ignored if this is provided.

extended_types

Whether to keep extended DataPrep types such as DataPrepError in the DataFrame. If False, these values will be replaced with None.

nulls_as_nan

Whether to interpret nulls (or missing values) in number typed columns as NaN. This is done by pandas for performance reasons; it can result in a loss of fidelity in the data.

on_error

How to handle any error values in the Dataflow, such as those produced by an error while parsing values. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.

out_of_range_datetime

How to handle date-time values that are outside the range supported by Pandas. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.

Returns

A Dask DataFrame.

Remarks

Dask DataFrames allow for parallel and lazy processing of data by splitting the data into multiple partitions. Because Dask DataFrames don't actually read any data until a computation is requested, it is necessary to determine what the schema and types of the data will be ahead of time. This is done by reading a specific number of records from the Dataflow (specified by the sample_size parameter). However, it is possible for these initial records to have incomplete information. In those cases, it is possible to explicitly specify the expected columns and their types by providing a dict of the shape {column_name: dtype} in the dtypes parameter.

to_data_frame_directory

Creates streams with the data in dataframe directory format.

to_data_frame_directory(error: str = 'ERROR', rows_per_group: int = 5000) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

error

String to use for error values.

rows_per_group

Number of rows to use per row group.

Returns

The modified Dataflow.

to_datetime

Converts the values in the specified columns to DateTimes.

to_datetime(columns: MultiColumnSelection, date_time_formats: typing.Union[typing.List[str], NoneType] = None, date_constant: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

date_time_formats

The formats to use to parse the values. If none are provided, a partial scan of the data will be performed to derive one.

date_constant

If the column contains only time values, a date to apply to the resulting DateTime.

Returns

The modified Dataflow.
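
A minimal sketch, assuming a hypothetical 'OrderDate' column that stores strings such as '31.12.2020':

   # an explicit format avoids the partial scan used to derive one
   dflow = dflow.to_datetime(['OrderDate'], date_time_formats=['%d.%m.%Y'])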

to_json

Get the JSON string representation of the Dataflow.

to_json() -> str

to_long

Converts the values in the specified columns to 64 bit integers.

to_long(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

Returns

The modified Dataflow.

to_number

Converts the values in the specified columns to floating point numbers.

to_number(columns: MultiColumnSelection, decimal_point: azureml.dataprep.api.dataflow.DecimalMark = <DecimalMark.DOT: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

decimal_point

The symbol to use as the decimal mark.

Returns

The modified Dataflow.
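
A minimal sketch, assuming a hypothetical 'Price' column that uses a comma as its decimal mark; DecimalMark.COMMA is assumed to be the non-default member of the enum shown in the signature:

   from azureml.dataprep.api.dataflow import DecimalMark

   dflow = dflow.to_number(['Price'], decimal_point=DecimalMark.COMMA)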

to_pandas_dataframe

Pulls all of the data and returns it as a Pandas DataFrame (pandas.DataFrame).

Parameters

extended_types

Whether to keep extended DataPrep types such as DataPrepError in the DataFrame. If False, these values will be replaced with None.

nulls_as_nan

Whether to interpret nulls (or missing values) in number typed columns as NaN. This is done by pandas for performance reasons; it can result in a loss of fidelity in the data.

on_error

How to handle any error values in the Dataflow, such as those produced by an error while parsing values. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.

out_of_range_datetime

How to handle date-time values that are outside the range supported by Pandas. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.

Returns

A Pandas DataFrame (pandas.DataFrame).

Remarks

This method will load all the data returned by this Dataflow into memory.

Since Dataflows do not require a fixed, tabular schema but Pandas DataFrames do, an implicit tabularization step will be executed as part of this action. The resulting schema will be the union of the schemas of all records produced by this Dataflow.
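
A minimal sketch, assuming the data produced by dflow fits in memory on the machine running the code:

   # pulls all records into memory; parse errors are replaced with nulls
   df = dflow.to_pandas_dataframe(extended_types=False,
                                  nulls_as_nan=True,
                                  on_error='null')
   print(df.head())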

to_parquet_streams

Creates streams with the data in parquet format.

to_parquet_streams(error: str = 'ERROR', rows_per_group: int = 5000) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

error

String to use for error values.

rows_per_group

Number of rows to use per row group.

Returns

The modified Dataflow.

to_partition_iterator

Creates an iterable object that returns the partitions produced by this Dataflow in sequence. This iterable must be closed before any other executions can run.

to_partition_iterator(on_error: str = 'null') -> azureml.dataprep.api._partitionsreader.PartitionIterable

Parameters

on_error

How to handle any error values in the Dataflow, such as those produced by an error while parsing values. Valid values are 'null' which replaces them with null; and 'fail' which will result in an exception.

Returns

A PartitionIterable object that can be used to iterate over the partitions in this Dataflow.

to_record_iterator

Creates an iterable object that returns the records produced by this Dataflow in sequence. This iterable must be closed before any other executions can run.

to_record_iterator() -> azureml.dataprep.api._dataframereader.RecordIterable

Returns

A RecordIterable object that can be used to iterate over the records in this Dataflow.

to_spark_dataframe

Creates a Spark DataFrame (pyspark.sql.DataFrame) that can execute the transformation pipeline defined by this Dataflow.

Returns

A Spark DataFrame (pyspark.sql.DataFrame).

Remarks

Since Dataflows do not require a fixed, tabular schema but Spark Dataframes do, an implicit tabularization step will be executed as part of this action. The resulting schema will be the union of the schemas of all records produced by this Dataflow. This tabularization step will result in a pull of the data.

Note

The Spark Dataframe returned is only an execution plan and doesn't actually contain any data, since Spark Dataframes are also lazily evaluated.
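
A minimal sketch, assuming the code runs in an environment with an active Spark session:

   spark_df = dflow.to_spark_dataframe()
   # the Spark DataFrame is itself lazy; this triggers the actual execution
   spark_df.show(10)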

to_string

Converts the values in the specified columns to strings.

to_string(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

Returns

The modified Dataflow.

transform_partition

Applies a transformation to each partition in the Dataflow.

  • This function has been deprecated and will be removed in a future version. Please use map_partition instead.
transform_partition(script: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

script

A script that will be used to transform each partition.

Returns

The modified Dataflow.

Remarks

The Python script must define a function called transform that takes two arguments, typically called df and index. The df argument is a Pandas DataFrame that contains the data for the partition, and the index argument is a unique identifier of the partition.

Note

df does not usually contain all of the data in the dataflow, but a partition of the data as it is being processed in the runtime. The number and contents of each partition is not guaranteed across runs.

The transform function can fully edit the passed in dataframe or even create a new one, but must return a dataframe. Any libraries that the Python script imports must exist in the environment where the dataflow is run.


   # the script passed in should have this function defined
   def transform(df, index):
       # perform any partition level transforms here and return resulting `df`
       return df
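
A usage sketch showing how the script is passed as a string (remember that this method is deprecated in favor of map_partition, as noted above):

   import textwrap

   script = textwrap.dedent('''
       def transform(df, index):
           # runs once per partition; df is a pandas DataFrame for that partition
           return df
       ''')
   dflow = dflow.transform_partition(script)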

transform_partition_with_file

Transforms an entire partition using the Python script in the passed in file.

transform_partition_with_file(script_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

script_path

Relative path to script that will be used to transform the partition.

Returns

The modified Dataflow.

Remarks

The Python script must define a function called transform that takes two arguments, typically called df and index. The first argument df will be a Pandas Dataframe that contains the data for the partition and the second argument index will be a unique identifier for the partition.

Note

df does not usually contain all of the data in the dataflow, but a partition of the data as it is being processed in the runtime. The number and contents of each partition is not guaranteed across runs.

The transform function can fully edit the passed in dataframe or even create a new one, but must return a dataframe. Any libraries that the Python script imports must exist in the environment where the dataflow is run.


   # the script file passed in should have this function defined
   def transform(df, index):
       # perform any partition level transforms here and return resulting `df`
       return df

trim_string

Trims string values in specific columns.

trim_string(columns: MultiColumnSelection, trim_left: bool = True, trim_right: bool = True, trim_type: azureml.dataprep.api.dataflow.TrimType = <TrimType.WHITESPACE: 0>, custom_characters: str = '') -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

trim_left

Whether to trim from the beginning.

trim_right

Whether to trim from the end.

trim_type

Whether to trim whitespace or custom characters.

custom_characters

The characters to trim.

Returns

The modified Dataflow.
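
A minimal sketch, assuming a hypothetical 'Code' column wrapped in '#' characters; TrimType.CUSTOM is assumed to be the non-whitespace member of the enum shown in the signature:

   from azureml.dataprep.api.dataflow import TrimType

   dflow = dflow.trim_string(['Code'],
                             trim_type=TrimType.CUSTOM,
                             custom_characters='#')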

use_secrets

Uses the passed in secrets for execution.

use_secrets(secrets: typing.Dict[str, str])

Parameters

secrets

A dictionary of secret ID to secret value. You can get the list of required secrets by calling the get_missing_secrets method on Dataflow.
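
A minimal sketch combining use_secrets with get_missing_secrets as described above; the placeholder value must be replaced with the real secret for each ID:

   required = dflow.get_missing_secrets()   # IDs of the secrets this Dataflow needs
   secrets = {secret_id: '<secret value>' for secret_id in required}
   dflow.use_secrets(secrets)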

verify_has_data

Verifies that this Dataflow would produce records if executed. An exception will be thrown otherwise.

verify_has_data()

write_streams

Writes the streams in the specified column to the destination path. By default, the names of the files written will be the resource identifiers of the streams. This behavior can be overridden by specifying a column that contains the names to use.

write_streams(streams_column: str, base_path: azureml.dataprep.api.datasources.FileOutput, file_names_column: typing.Union[str, NoneType] = None, prefix_path: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

streams_column

The column containing the streams to write.

file_names_column

A column containing the file names to use.

base_path

The path under which the files should be written.

prefix_path

The prefix path that needs to be removed from the target paths.

Returns

The modified Dataflow.

write_to_csv

Writes out the data in the Dataflow in a delimited text format. The output is specified as a directory that will contain multiple files, one per partition processed in the Dataflow.

write_to_csv(directory_path: DataDestination, separator: str = ',', na: str = 'NA', error: str = 'ERROR') -> azureml.dataprep.api.dataflow.Dataflow

Parameters

directory_path

The path to a directory in which to store the output files.

separator

The separator to use.

na

String to use for null values.

error

String to use for error values.

Returns

The modified Dataflow. Every execution of the returned Dataflow will perform the write again.
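
A minimal sketch, assuming your version of azureml.dataprep exposes LocalFileOutput for local destinations and run_local for executing the Dataflow; both names are assumptions, and any other pull on the returned Dataflow would also trigger the write:

   import azureml.dataprep as dprep

   # the write only happens when the returned Dataflow is executed
   write_dflow = dflow.write_to_csv(directory_path=dprep.LocalFileOutput('./out'),
                                    separator=',',
                                    na='NA',
                                    error='ERROR')
   write_dflow.run_local()   # assumed local-execution helper; see note above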

write_to_parquet

Writes out the data in the Dataflow into Parquet files.

write_to_parquet(file_path: typing.Union[~DataDestination, NoneType] = None, directory_path: typing.Union[~DataDestination, NoneType] = None, single_file: bool = False, error: str = 'ERROR', row_groups: int = 0) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

file_path

The path in which to store the output file.

directory_path

The path in which to store the output files.

single_file

Whether to store the entire Dataflow in a single file.

error

String to use for error values.

row_groups

Number of rows to use per row group.

Returns

The modified Dataflow.
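
A minimal sketch, relying on the same LocalFileOutput and run_local assumptions as the write_to_csv example above:

   import azureml.dataprep as dprep

   write_dflow = dflow.write_to_parquet(directory_path=dprep.LocalFileOutput('./out_parquet'),
                                        error='ERROR',
                                        row_groups=5000)
   write_dflow.run_local()   # performs the write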

write_to_preppy

Writes out the data in the Dataflow into Preppy files, a DataPrep serialization format.

write_to_preppy(directory_path: DataDestination) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

directory_path

The path in which to store the output Preppy files.

Returns

The modified Dataflow.

write_to_sql

Adds a step that writes out the data in the Dataflow into a table in an MS SQL database.

write_to_sql(destination: DatabaseDestination, table: str, batch_size: int = 500, default_string_length: int = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

destination

The details of the MS SQL database or AzureSqlDatabaseDatastore.

table

Name of the table to write data to.

batch_size

Size of the batch of records committed to SQL Server in a single request.

default_string_length

The string column length to use when creating the table.

Returns

A new Dataflow with write step added.

zip_partitions

Appends the columns from the referenced dataflows to the current one. This is different from append_columns in that it assumes all dataflows being appended have the same number of partitions and the same number of records within each corresponding partition. If these two conditions are not true, the operation will fail.

zip_partitions(dataflows: typing.List[_ForwardRef('DataflowReference')]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

dataflows

The dataflows to append.

Returns

The modified Dataflow.
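
A minimal sketch, assuming dflow_a and dflow_b were produced from the same source so that their partitions line up exactly:

   combined = dflow_a.zip_partitions([dflow_b])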

Attributes

dtypes

Gets column data types for the current dataset by calling azureml.dataprep.Dataflow.get_profile and extracting just dtypes from the resulting DataProfile.

Returns

A dictionary, where key is the column name and value is FieldType.

row_count

Count of rows in this Dataflow.

Note

This will trigger a data profile calculation. To avoid calculating profile multiple times, get the profile first by calling get_profile() and then inspect it.

Note

If the current Dataflow contains a take_sample or take step, this will return the number of rows in the subset defined by those steps.

Returns

Count of rows.

Return type

int

shape

Shape of the data produced by the Dataflow.

Note

This will trigger a data profile calculation. To avoid calculating profile multiple times, get the profile first by calling get_profile() and then inspect it.

Note

If the current Dataflow contains a take_sample or take step, this will return the shape of the subset defined by those steps.

Returns

Tuple of row count and column count.
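
A minimal sketch of reading these attributes on a Dataflow named dflow; as the notes above recommend, call get_profile() once and inspect the resulting profile if you need several of these values, since each attribute access triggers its own profile calculation:

   print(dflow.dtypes)       # {column name: FieldType}
   print(dflow.row_count)    # int
   print(dflow.shape)        # (row count, column count)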

ExampleData

ExampleData = ~ExampleData

SortColumns

SortColumns = ~SortColumns

SourceColumns

SourceColumns = ~SourceColumns

TypeConversionInfo

TypeConversionInfo = ~TypeConversionInfo