Dataflow class

Definition

A Dataflow represents a series of lazily-evaluated, immutable operations on data. It is only an execution plan: no data is loaded from the source until you pull data from the Dataflow using one of head, to_pandas_dataframe, get_profile, or the write methods.

Dataflow(engine_api: azureml.dataprep.api.engineapi.api.EngineAPI, steps: typing.List[azureml.dataprep.api.step.Step] = None)
Inheritance
builtins.object
Dataflow

Remarks

Dataflows are usually created by supplying a data source. Once the data source has been provided, operations can be added by invoking the different transformation methods available on this class. The result of adding an operation to a Dataflow is always a new Dataflow.

The actual loading of the data and execution of the transformations is delayed as much as possible and will not occur until a 'pull' takes place. A pull is the action of reading data from a Dataflow, whether by asking to look at the first N records in it or by transferring the data in the Dataflow to another storage mechanism (a Pandas Dataframe, a CSV file, or a Spark Dataframe).

The operations available on the Dataflow are runtime-agnostic. This allows the transformation pipelines contained in them to be portable between a regular Python environment and Spark.
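
As a minimal sketch of this lazy behavior (assuming azureml.dataprep is installed and a local file ./data/input.csv exists), the following builds up a plan and only reads data at the final pull:


   import azureml.dataprep as dprep

   # Each call below only records a step in the execution plan; no data is read yet.
   dataflow = dprep.read_csv(path='./data/input.csv')
   dataflow = dataflow.keep_columns(['id', 'price'])
   dataflow = dataflow.take(1000)

   # This pull is what triggers execution of the whole plan.
   df = dataflow.to_pandas_dataframe()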

Methods

add_column(expression: azureml.dataprep.api.expressions.Expression, new_column_name: str, prior_column: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds a new column to the dataset. The values in the new column will be the result of invoking the specified expression.

add_step(step_type: str, arguments: typing.Dict[str, typing.Any], local_data: typing.Dict[str, typing.Any] = None) -> azureml.dataprep.api.dataflow.Dataflow
append_columns(dataflows: typing.List[_ForwardRef('DataflowReference')]) -> azureml.dataprep.api.dataflow.Dataflow

Appends the columns from the referenced dataflows to the current one. Duplicate columns will result in failure.

append_rows(dataflows: typing.List[_ForwardRef('DataflowReference')]) -> azureml.dataprep.api.dataflow.Dataflow

Appends the records in the specified dataflows to the current one. If the schemas of the dataflows are distinct, this will result in records with different schemas.

assert_value(columns: MultiColumnSelection, expression: azureml.dataprep.api.expressions.Expression, policy: azureml.dataprep.api.engineapi.typedefinitions.AssertPolicy = <AssertPolicy.ERRORVALUE: 1>, error_code: str = 'AssertionFailed') -> azureml.dataprep.api.dataflow.Dataflow

Ensures that values in the specified columns satisfy the provided expression. This is useful to identify anomalies in the dataset and avoid broken pipelines by handling assertion errors in a clean way.

cache(directory_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Pulls all the records from this Dataflow and caches the result to disk.

clip(columns: MultiColumnSelection, lower: typing.Union[float, NoneType] = None, upper: typing.Union[float, NoneType] = None, use_values: bool = True) -> azureml.dataprep.api.dataflow.Dataflow

Clips values so that all values are between the lower and upper boundaries.

convert_unix_timestamp_to_datetime(columns: MultiColumnSelection, use_seconds: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Converts the specified columns to DateTime values by treating the existing values as Unix timestamps.

derive_column_by_example(source_columns: SourceColumns, new_column_name: str, example_data: ExampleData) -> azureml.dataprep.api.dataflow.Dataflow

Inserts a column by learning a program based on a set of source columns and provided examples. DataPrep will attempt to achieve the derivation implied by the provided examples.

distinct(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Filters out records that contain duplicate values in the specified columns, leaving only a single instance.

distinct_rows() -> azureml.dataprep.api.dataflow.Dataflow

Filters out records that contain duplicate values in all columns, leaving only a single instance.

drop_columns(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Drops the specified columns.

drop_errors(columns: MultiColumnSelection, column_relationship: azureml.dataprep.api.engineapi.typedefinitions.ColumnRelationship = <ColumnRelationship.ALL: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Drops rows where all or any of the selected columns are an Error.

drop_nulls(columns: MultiColumnSelection, column_relationship: azureml.dataprep.api.engineapi.typedefinitions.ColumnRelationship = <ColumnRelationship.ALL: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Drops rows where all or any of the selected columns are null.

duplicate_column(column_pairs: typing.Dict[str, str]) -> azureml.dataprep.api.dataflow.Dataflow

Creates new columns that are duplicates of the specified source columns.

error(columns: MultiColumnSelection, find: typing.Any, error_code: str) -> azureml.dataprep.api.dataflow.Dataflow

Creates errors in a column for values that match the specified search value.

extract_error_details(column: str, error_value_column: str, extract_error_code: bool = False, error_code_column: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Extracts the error details from error values into a new column.

fill_errors(columns: MultiColumnSelection, fill_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Fills all errors in a column with the specified value.

fill_nulls(columns: MultiColumnSelection, fill_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Fills all nulls in a column with the specified value.

filter(expression: azureml.dataprep.api.expressions.Expression) -> azureml.dataprep.api.dataflow.Dataflow

Filters the data, leaving only the records that match the specified expression.

from_json(dataflow_json: str) -> azureml.dataprep.api.dataflow.Dataflow

Loads a Dataflow from the given JSON string representation.

fuzzy_group_column(source_column: str, new_column_name: str, similarity_threshold: float = 0.8, similarity_score_column_name: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Add a column where similar values from the source column are fuzzy-grouped to their canonical form.

get_files(path: FilePath) -> azureml.dataprep.api.dataflow.Dataflow

Expands the path specified by reading globs and files in folders and outputs one record per file found.

get_missing_secrets() -> typing.List[str]

Get a list of missing secret IDs.

get_profile(include_stype_counts: bool = False, number_of_histogram_bins: int = 10) -> azureml.dataprep.api.dataprofile.DataProfile

Requests the data profile which collects summary statistics on the full data produced by the Dataflow. A data profile can be very useful to understand the input data, identify anomalies and missing values, and verify that data preparation operations produced the desired result.


head

Pulls the number of records specified from the top of this Dataflow and returns them as a pandas.DataFrame.

join(left_dataflow: DataflowReference, right_dataflow: DataflowReference, join_key_pairs: typing.List[typing.Tuple[str, str]] = None, join_type: azureml.dataprep.api.dataflow.JoinType = <JoinType.MATCH: 2>, left_column_prefix: str = 'l_', right_column_prefix: str = 'r_', left_non_prefixed_columns: typing.List[str] = None, right_non_prefixed_columns: typing.List[str] = None) -> azureml.dataprep.api.dataflow.Dataflow

Creates a new Dataflow that is a result of joining two provided Dataflows.

keep_columns(columns: MultiColumnSelection, validate_column_exists: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Keeps the specified columns and drops all others.

label_encode(source_column: str, new_column_name: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds a column with encoded labels generated from the source column. For explicit label encoding, use LabelEncoderBuilder.

map_column(column: str, new_column_id: str, replacements: typing.Union[typing.List[azureml.dataprep.api.dataflow.ReplacementsValue], NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Creates a new column where matching values in the source column have been replaced with the specified values.

min_max_scale(column: str, range_min: float = 0, range_max: float = 1, data_min: float = None, data_max: float = None) -> azureml.dataprep.api.dataflow.Dataflow

Scales the values in the specified column to lie within range_min (default=0) and range_max (default=1).

new_script_column(new_column_name: str, insert_after: str, script: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds a new column to the Dataflow using the passed in Python script.

new_script_filter(script: str) -> azureml.dataprep.api.dataflow.Dataflow

Filters the Dataflow using the passed in Python script.

null_coalesce(columns: typing.List[str], new_column_id: str) -> azureml.dataprep.api.dataflow.Dataflow

For each record, selects the first non-null value from the columns specified and uses it as the value of a new column.

one_hot_encode(source_column: str, prefix: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Adds a binary column for each categorical label from the source column values. For more control over categorical labels, use OneHotEncodingBuilder.

open(file_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Opens a Dataflow with the specified name from the package file.

parse_delimited(separator: str, headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, quoting: bool, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode, comment: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds step to parse CSV data.

parse_fwf(offsets: typing.List[int], headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode) -> azureml.dataprep.api.dataflow.Dataflow

Adds step to parse fixed-width data.

parse_json_column(column: str) -> azureml.dataprep.api.dataflow.Dataflow

Parses the values in the specified column as JSON objects and expands them into multiple columns.

parse_lines(headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode, comment: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds step to parse text files and split them into lines.

pivot(columns_to_pivot: typing.List[str], value_column: str, summary_function: azureml.dataprep.api.dataflow.SummaryFunction = None, group_by_columns: typing.List[str] = None, null_value_replacement: str = None, error_value_replacement: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Returns a new Dataflow with columns generated from the values in the selected columns to pivot.

promote_headers() -> azureml.dataprep.api.dataflow.Dataflow

Sets the first record in the dataset as headers, replacing any existing ones.

quantile_transform(source_column: str, new_column: str, quantiles_count: int = 1000, output_distribution: str = 'Uniform')

Performs a quantile transformation on the source_column and outputs the transformed result in new_column.

random_split(percentage: float, seed: typing.Union[int, NoneType] = None) -> ('Dataflow', 'Dataflow')

Returns two Dataflows from the original Dataflow, with records randomly and approximately split by the percentage specified (using a seed). If the percentage specified is p (where p is between 0 and 1), the first returned dataflow will contain approximately p*100% records from the original dataflow, and the second dataflow will contain all the remaining records that were not included in the first. A random seed will be used if none is provided.

read_excel(sheet_name: typing.Union[str, NoneType] = None, use_column_headers: bool = False, skip_rows: int = 0) -> azureml.dataprep.api.dataflow.Dataflow

Adds step to read and parse Excel files.

read_json(json_extract_program: str, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding)

Creates a new Dataflow with the operations required to read JSON files.

read_parquet_dataset(path: azureml.dataprep.api.datasources.FileDataSource) -> azureml.dataprep.api.dataflow.Dataflow

Creates a step to read a Parquet file.

read_parquet_file() -> azureml.dataprep.api.dataflow.Dataflow

Adds step to parse Parquet files.

read_postgresql(data_source: azureml.dataprep.api.datasources.PostgreSQLDataSource, query: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds step that can read data from a PostgreSQL database by executing the query specified.

read_sql(data_source: azureml.dataprep.api.datasources.MSSQLDataSource, query: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds step that can read data from an MS SQL database by executing the query specified.

reference(reference: DataflowReference) -> azureml.dataprep.api.dataflow.Dataflow

Creates a reference to an existing activity object.

rename_columns(column_pairs: typing.Dict[str, str]) -> azureml.dataprep.api.dataflow.Dataflow

Renames the specified columns.

replace(columns: MultiColumnSelection, find: typing.Any, replace_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Replaces values in a column that match the specified search value.

replace_datasource(new_datasource: DataSource) -> azureml.dataprep.api.dataflow.Dataflow

Returns new Dataflow with its DataSource replaced by the given one.

replace_na(columns: MultiColumnSelection, use_default_na_list: bool = True, use_empty_string_as_na: bool = True, use_nan_as_na: bool = True, custom_na_list: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Replaces values in the specified columns with nulls. You can choose to use the default list, supply your own, or both.

replace_reference(new_reference: DataflowReference) -> azureml.dataprep.api.dataflow.Dataflow

Returns new Dataflow with its reference DataSource replaced by the given one.

round(column: str, decimal_places: int) -> azureml.dataprep.api.dataflow.Dataflow

Rounds the values in the column specified to the desired number of decimal places.

run_local() -> None

Runs the current Dataflow using the local execution runtime.

run_spark() -> None

Runs the current Dataflow using the Spark runtime.

save(file_path: str)

Saves the Dataflow to the specified file.

set_column_types(type_conversions: typing.Dict[str, TypeConversionInfo]) -> azureml.dataprep.api.dataflow.Dataflow

Converts values in specified columns to the corresponding data types.

skip(count: int) -> azureml.dataprep.api.dataflow.Dataflow

Skips the specified number of records.

sort(sort_order: typing.List[typing.Tuple[str, bool]]) -> azureml.dataprep.api.dataflow.Dataflow

Sorts the dataset by the specified columns.

sort_asc(columns: SortColumns) -> azureml.dataprep.api.dataflow.Dataflow

Sorts the dataset in ascending order by the specified columns.

sort_desc(columns: SortColumns) -> azureml.dataprep.api.dataflow.Dataflow

Sorts the dataset in descending order by the specified columns.

split_column_by_delimiters(source_column: str, delimiters: Delimiters, keep_delimiters: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Splits the provided column and adds the resulting columns to the dataflow.

split_column_by_example(source_column: str, example: SplitExample = None) -> azureml.dataprep.api.dataflow.Dataflow

Splits the provided column and adds the resulting columns to the dataflow based on the provided example.

split_stype(column: str, stype: azureml.dataprep.api.dataflow.SType, stype_fields: typing.Union[typing.List[str], NoneType] = None, new_column_names: typing.Union[typing.List[str], NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Creates new columns from an existing column, interpreting its values as a semantic type.

str_replace(columns: MultiColumnSelection, value_to_find: typing.Union[str, NoneType] = None, replace_with: typing.Union[str, NoneType] = None, match_entire_cell_contents: bool = False, use_special_characters: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Replaces values in a string column that match a search string with the specified value.

summarize(summary_columns: typing.Union[typing.List[azureml.dataprep.api.dataflow.SummaryColumnsValue], NoneType] = None, group_by_columns: typing.Union[typing.List[str], NoneType] = None, join_back: bool = False, join_back_columns_prefix: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Summarizes data by running aggregate functions over specific columns.

take(count: int) -> azureml.dataprep.api.dataflow.Dataflow

Takes the specified count of records.

take_sample(probability: float, seed: typing.Union[int, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Takes a random sample of the available records.

take_stratified_sample(columns: MultiColumnSelection, fractions: typing.Dict[typing.Tuple, int], seed: typing.Union[int, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Takes a random stratified sample of the available records according to input fractions.

to_bool(columns: MultiColumnSelection, true_values: typing.List[str], false_values: typing.List[str], mismatch_as: azureml.dataprep.api.dataflow.MismatchAsOption = <MismatchAsOption.ASERROR: 2>) -> azureml.dataprep.api.dataflow.Dataflow

Converts the values in the specified columns to booleans.

to_datetime(columns: MultiColumnSelection, date_time_formats: typing.Union[typing.List[str], NoneType] = None, date_constant: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Converts the values in the specified columns to DateTimes.

to_json() -> str

Get the JSON string representation of the Dataflow.

to_long(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Converts the values in the specified columns to 64 bit integers.

to_number(columns: MultiColumnSelection, decimal_point: azureml.dataprep.api.dataflow.DecimalMark = <DecimalMark.DOT: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Converts the values in the specified columns to floating point numbers.

to_pandas_dataframe

Pulls all of the data and returns it as a pandas.DataFrame.

to_spark_dataframe

Creates a pyspark.sql.DataFrame that can execute the transformation pipeline defined by this Dataflow.

to_string(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Converts the values in the specified columns to strings.

transform_partition(script: str) -> azureml.dataprep.api.dataflow.Dataflow

Transforms an entire partition using the passed in Python script.

transform_partition_with_file(script_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Transforms an entire partition using the Python script in the passed in file.

trim_string(columns: MultiColumnSelection, trim_left: bool = True, trim_right: bool = True, trim_type: azureml.dataprep.api.dataflow.TrimType = <TrimType.WHITESPACE: 0>, custom_characters: str = '') -> azureml.dataprep.api.dataflow.Dataflow

Trims string values in specific columns.

use_secrets(secrets: typing.Dict[str, str])

Uses the passed in secrets for execution.

verify_has_data()

Verifies that this Dataflow would produce records if executed. An exception will be thrown otherwise.

write_streams(streams_column: str, base_path: azureml.dataprep.api.datasources.FileOutput, file_names_column: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Writes the streams in the specified column to the destination path. By default, the name of the files written will be the resource identifier of the streams. This behavior can be overridden by specifying a column which contains the names to use.

write_to_csv(directory_path: DataDestination, separator: str = ',', na: str = 'NA', error: str = 'ERROR') -> azureml.dataprep.api.dataflow.Dataflow

Writes out the data in the Dataflow in a delimited text format. The output is specified as a directory which will contain multiple files, one per partition processed in the Dataflow.

write_to_parquet(file_path: typing.Union[~DataDestination, NoneType] = None, directory_path: typing.Union[~DataDestination, NoneType] = None, single_file: bool = False, error: str = 'ERROR', row_groups: int = 0) -> azureml.dataprep.api.dataflow.Dataflow

Writes out the data in the Dataflow into Parquet files.

add_column(expression: azureml.dataprep.api.expressions.Expression, new_column_name: str, prior_column: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds a new column to the dataset. The values in the new column will be the result of invoking the specified expression.

add_column(expression: azureml.dataprep.api.expressions.Expression, new_column_name: str, prior_column: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

expression

The expression to evaluate to generate the values in the column.

new_column_name

The name of the new column.

prior_column

The name of the column after which the new column should be added. The default is to add the new column as the last column.

Returns

The modified Dataflow.

Remarks

Expressions are built using the expression builders in the expressions module and the functions in the functions module. The resulting expression will be lazily evaluated for each record when a data pull occurs and not where it is defined.
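
A brief sketch (assuming a Dataflow with numeric columns 'a' and 'b'; col is the expression builder exposed by azureml.dataprep):


   import azureml.dataprep as dprep

   # Build a lazily-evaluated expression summing two existing columns.
   total = dprep.col('a') + dprep.col('b')
   dataflow = dataflow.add_column(expression=total, new_column_name='total',
                                  prior_column='b')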

add_step(step_type: str, arguments: typing.Dict[str, typing.Any], local_data: typing.Dict[str, typing.Any] = None) -> azureml.dataprep.api.dataflow.Dataflow

add_step(step_type: str, arguments: typing.Dict[str, typing.Any], local_data: typing.Dict[str, typing.Any] = None) -> azureml.dataprep.api.dataflow.Dataflow

append_columns(dataflows: typing.List[_ForwardRef('DataflowReference')]) -> azureml.dataprep.api.dataflow.Dataflow

Appends the columns from the referenced dataflows to the current one. Duplicate columns will result in failure.

append_columns(dataflows: typing.List[_ForwardRef('DataflowReference')]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

dataflows

The dataflows to append.

Returns

The modified Dataflow.

append_rows(dataflows: typing.List[_ForwardRef('DataflowReference')]) -> azureml.dataprep.api.dataflow.Dataflow

Appends the records in the specified dataflows to the current one. If the schemas of the dataflows are distinct, this will result in records with different schemas.

append_rows(dataflows: typing.List[_ForwardRef('DataflowReference')]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

dataflows

The dataflows to append.

Returns

The modified Dataflow.

assert_value(columns: MultiColumnSelection, expression: azureml.dataprep.api.expressions.Expression, policy: azureml.dataprep.api.engineapi.typedefinitions.AssertPolicy = <AssertPolicy.ERRORVALUE: 1>, error_code: str = 'AssertionFailed') -> azureml.dataprep.api.dataflow.Dataflow

Ensures that values in the specified columns satisfy the provided expression. This is useful to identify anomalies in the dataset and avoid broken pipelines by handling assertion errors in a clean way.

assert_value(columns: MultiColumnSelection, expression: azureml.dataprep.api.expressions.Expression, policy: azureml.dataprep.api.engineapi.typedefinitions.AssertPolicy = <AssertPolicy.ERRORVALUE: 1>, error_code: str = 'AssertionFailed') -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

Columns to apply evaluation to.

expression

Expression that has to be evaluated to be True for the value to be kept.

policy

Action to take when the expression evaluates to False. Options are FAILEXECUTION and ERRORVALUE. FAILEXECUTION ensures that any data that violates the assertion expression during execution will immediately fail the job. This is useful to save computing resources and time. ERRORVALUE captures any data that violates the assertion expression by replacing it with error_code. This allows you to handle these error values by either filtering or replacing them.

error_code

Error message to use to replace values failing the assertion or failing an execution.

Returns

The modified Dataflow.
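
For example, a sketch that flags negative values as errors (the 'amount' column name is an assumption):


   import azureml.dataprep as dprep

   # Values failing the assertion become error values with code 'NegativeAmount';
   # they can then be dropped with drop_errors or filled with fill_errors.
   dataflow = dataflow.assert_value('amount', dprep.col('amount') >= 0,
                                    error_code='NegativeAmount')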

cache(directory_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Pulls all the records from this Dataflow and caches the result to disk.

cache(directory_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

directory_path

The directory to save cache files.

Returns

The modified Dataflow.

Remarks

This is very useful when data is accessed repeatedly, as future executions will reuse the cached result without pulling the same Dataflow again.
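
A short sketch (the cache directory name is an assumption):


   # The cache step materializes the records on the first pull;
   # later pulls on this Dataflow reuse the cached files.
   dataflow = dataflow.cache(directory_path='./dataflow_cache')
   profile = dataflow.get_profile()   # executes the plan and fills the cache
   sample = dataflow.head(10)         # served from the cached result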

clip(columns: MultiColumnSelection, lower: typing.Union[float, NoneType] = None, upper: typing.Union[float, NoneType] = None, use_values: bool = True) -> azureml.dataprep.api.dataflow.Dataflow

Clips values so that all values are between the lower and upper boundaries.

clip(columns: MultiColumnSelection, lower: typing.Union[float, NoneType] = None, upper: typing.Union[float, NoneType] = None, use_values: bool = True) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

lower

All values lower than this value will be set to this value.

upper

All values higher than this value will be set to this value.

use_values

If true, values outside boundaries will be set to the boundary values. If false, those values will be set to null.

Returns

The modified Dataflow.
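
For instance (the 'price' column name is an assumption):


   # Values below 0 become 0 and values above 100 become 100.
   dataflow = dataflow.clip(columns='price', lower=0.0, upper=100.0)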

convert_unix_timestamp_to_datetime(columns: MultiColumnSelection, use_seconds: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Converts the specified columns to DateTime values by treating the existing values as Unix timestamps.

convert_unix_timestamp_to_datetime(columns: MultiColumnSelection, use_seconds: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

use_seconds

Whether to use seconds as the resolution. Milliseconds are used if false.

Returns

The modified Dataflow.

derive_column_by_example(source_columns: SourceColumns, new_column_name: str, example_data: ExampleData) -> azureml.dataprep.api.dataflow.Dataflow

Inserts a column by learning a program based on a set of source columns and provided examples. DataPrep will attempt to achieve the derivation implied by the provided examples.

derive_column_by_example(source_columns: SourceColumns, new_column_name: str, example_data: ExampleData) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_columns

Names of the columns from which the new column will be derived.

new_column_name

Name of the new column to add.

example_data

Examples to use as input for program generation. If there is only one source column, each example can be a tuple of source value and intended target value; for example, example_data=[("2013-08-22", "Thursday"), ("2013-11-03", "Sunday")]. When multiple columns should be considered as source, each example should be a tuple of a dict-like source and the intended target value, where the source has column names as keys and column values as values.

Returns

The modified Dataflow.

Remarks

If you need more control of examples and generated program, create DeriveColumnByExampleBuilder instead.
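
Using the example data above, a sketch that derives a day-of-week column (the 'date' column name is an assumption):


   dataflow = dataflow.derive_column_by_example(
       source_columns='date',
       new_column_name='day_of_week',
       example_data=[('2013-08-22', 'Thursday'), ('2013-11-03', 'Sunday')])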

distinct(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Filters out records that contain duplicate values in the specified columns, leaving only a single instance.

distinct(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

Returns

The modified Dataflow.

distinct_rows() -> azureml.dataprep.api.dataflow.Dataflow

Filters out records that contain duplicate values in all columns, leaving only a single instance.

distinct_rows() -> azureml.dataprep.api.dataflow.Dataflow

drop_columns(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Drops the specified columns.

drop_columns(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

Returns

The modified Dataflow.

drop_errors(columns: MultiColumnSelection, column_relationship: azureml.dataprep.api.engineapi.typedefinitions.ColumnRelationship = <ColumnRelationship.ALL: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Drops rows where all or any of the selected columns are an Error.

drop_errors(columns: MultiColumnSelection, column_relationship: azureml.dataprep.api.engineapi.typedefinitions.ColumnRelationship = <ColumnRelationship.ALL: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

column_relationship

Whether all or any of the selected columns must be an Error.

Returns

The modified Dataflow.

drop_nulls(columns: MultiColumnSelection, column_relationship: azureml.dataprep.api.engineapi.typedefinitions.ColumnRelationship = <ColumnRelationship.ALL: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Drops rows where all or any of the selected columns are null.

drop_nulls(columns: MultiColumnSelection, column_relationship: azureml.dataprep.api.engineapi.typedefinitions.ColumnRelationship = <ColumnRelationship.ALL: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

column_relationship

Whether all or any of the selected columns must be null.

Returns

The modified Dataflow.

duplicate_column(column_pairs: typing.Dict[str, str]) -> azureml.dataprep.api.dataflow.Dataflow

Creates new columns that are duplicates of the specified source columns.

duplicate_column(column_pairs: typing.Dict[str, str]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column_pairs

Mapping of the columns to duplicate to their new names.

Returns

The modified Dataflow.

error(columns: MultiColumnSelection, find: typing.Any, error_code: str) -> azureml.dataprep.api.dataflow.Dataflow

Creates errors in a column for values that match the specified search value.

error(columns: MultiColumnSelection, find: typing.Any, error_code: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

find

The value to find, or None.

error_code

The error code to use in new errors, or None.

Returns

The modified Dataflow.

Remarks

The following types are supported for the find argument: str, int, float, datetime.datetime, and bool.

extract_error_details(column: str, error_value_column: str, extract_error_code: bool = False, error_code_column: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Extracts the error details from error values into a new column.

extract_error_details(column: str, error_value_column: str, extract_error_code: bool = False, error_code_column: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The source column.

error_value_column

Name of a column to hold the original value of the error.

extract_error_code

Whether to also extract the error code.

error_code_column

Name of a column to hold the error code.

Returns

The modified Dataflow.

fill_errors(columns: MultiColumnSelection, fill_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Fills all errors in a column with the specified value.

fill_errors(columns: MultiColumnSelection, fill_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

fill_with

The value to fill errors with, or None.

Returns

The modified Dataflow.

Remarks

The following types are supported for the fill_with argument: str, int, float, datetime.datetime, and bool.

fill_nulls(columns: MultiColumnSelection, fill_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Fills all nulls in a column with the specified value.

fill_nulls(columns: MultiColumnSelection, fill_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

fill_with

The value to fill nulls with.

Returns

The modified Dataflow.

Remarks

The following types are supported for the fill_with argument: str, int, float, datetime.datetime, and bool.

filter(expression: azureml.dataprep.api.expressions.Expression) -> azureml.dataprep.api.dataflow.Dataflow

Filters the data, leaving only the records that match the specified expression.

filter(expression: azureml.dataprep.api.expressions.Expression) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

expression

The expression to evaluate.

Returns

The modified Dataflow.

Remarks

Expressions are started by indexing the Dataflow with the name of a column. They support a variety of functions and operators and can be combined using logical operators. The resulting expression will be lazily evaluated for each record when a data pull occurs and not where it is defined.


   dataflow['myColumn'] > dataflow['columnToCompareAgainst']
   dataflow['myColumn'].starts_with('prefix')
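
A complete filter call might then look like this (a sketch, assuming a numeric column named 'myColumn'):


   dataflow = dataflow.filter(dataflow['myColumn'] > 100)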

from_json(dataflow_json: str) -> azureml.dataprep.api.dataflow.Dataflow

Loads a Dataflow from the given JSON string representation.

from_json(dataflow_json: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

dataflow_json

JSON string representation of the Dataflow.

Returns

New Dataflow object constructed from the JSON string.

fuzzy_group_column(source_column: str, new_column_name: str, similarity_threshold: float = 0.8, similarity_score_column_name: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Add a column where similar values from the source column are fuzzy-grouped to their canonical form.

fuzzy_group_column(source_column: str, new_column_name: str, similarity_threshold: float = 0.8, similarity_score_column_name: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_column

Column with values to group.

new_column_name

Name of the new column to add.

similarity_threshold

The minimum similarity required between two values for them to be grouped together.

similarity_score_column_name

If provided, this transform will also add a column with calculated similarity score between every pair of original and canonical values. This information is helpful to determine similarity_threshold.

Returns

The modified Dataflow.

Remarks

This method will cluster similar values automatically. This is useful when working with data gathered from multiple sources or through human input, where the same entities are often represented by multiple values due to different spellings, varying capitalizations, or abbreviations. For example, you may have a column with the values Chicago, CHICAGO, SEATTLE, Seattle, St. Louis, St. Lois, and ST LOUIS, and you want to reduce them to the distinct values Chicago, Seattle, and St. Louis. If you need more control of grouping, create FuzzyGroupBuilder instead.
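
A sketch for the city example above (the column names are assumptions):


   dataflow = dataflow.fuzzy_group_column(source_column='city',
                                          new_column_name='city_canonical',
                                          similarity_threshold=0.8,
                                          similarity_score_column_name='similarity')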

get_files(path: FilePath) -> azureml.dataprep.api.dataflow.Dataflow

Expands the path specified by reading globs and files in folders and outputs one record per file found.

get_files(path: FilePath) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

path

The path or paths to expand.

Returns

A new Dataflow.

get_missing_secrets() -> typing.List[str]

Get a list of missing secret IDs.

get_missing_secrets() -> typing.List[str]

Returns

A list of missing secret IDs.

get_profile(include_stype_counts: bool = False, number_of_histogram_bins: int = 10) -> azureml.dataprep.api.dataprofile.DataProfile

Requests the data profile which collects summary statistics on the full data produced by the Dataflow. A data profile can be very useful to understand the input data, identify anomalies and missing values, and verify that data preparation operations produced the desired result.

get_profile(include_stype_counts: bool = False, number_of_histogram_bins: int = 10) -> azureml.dataprep.api.dataprofile.DataProfile

Parameters

include_stype_counts

Whether to include checking if values look like some well known semantic types of information, for example "email address". Turning this on will impact performance.

number_of_histogram_bins

Number of bins in a histogram. If not specified, will be set to 10.

Returns

A DataProfile object.

head

Pulls the number of records specified from the top of this Dataflow and returns them as a pandas.DataFrame.

Parameters

count

The number of records to pull. The default is 10.

Returns

A pandas.DataFrame.

join(left_dataflow: DataflowReference, right_dataflow: DataflowReference, join_key_pairs: typing.List[typing.Tuple[str, str]] = None, join_type: azureml.dataprep.api.dataflow.JoinType = <JoinType.MATCH: 2>, left_column_prefix: str = 'l_', right_column_prefix: str = 'r_', left_non_prefixed_columns: typing.List[str] = None, right_non_prefixed_columns: typing.List[str] = None) -> azureml.dataprep.api.dataflow.Dataflow

Creates a new Dataflow that is a result of joining two provided Dataflows.

join(left_dataflow: DataflowReference, right_dataflow: DataflowReference, join_key_pairs: typing.List[typing.Tuple[str, str]] = None, join_type: azureml.dataprep.api.dataflow.JoinType = <JoinType.MATCH: 2>, left_column_prefix: str = 'l_', right_column_prefix: str = 'r_', left_non_prefixed_columns: typing.List[str] = None, right_non_prefixed_columns: typing.List[str] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

left_dataflow

Left Dataflow or DataflowReference to join with.

right_dataflow

Right Dataflow or DataflowReference to join with.

join_key_pairs

Key column pairs. List of tuples of column names where each tuple forms a key pair to join on. For instance: [('column_from_left_dataflow', 'column_from_right_dataflow')]

join_type

Type of join to perform. Match is default.

left_column_prefix

Prefix to use in result Dataflow for columns originating from left_dataflow. Needed to avoid column name conflicts at runtime.

right_column_prefix

Prefix to use in result Dataflow for columns originating from right_dataflow. Needed to avoid column name conflicts at runtime.

left_non_prefixed_columns

List of column names from left_dataflow that should not be prefixed with left_column_prefix. Every other column appearing in the data at runtime will be prefixed.

right_non_prefixed_columns

List of column names from right_dataflow that should not be prefixed with right_column_prefix. Every other column appearing in the data at runtime will be prefixed.

Returns

The new Dataflow.
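
A sketch joining two dataflows on an id column (the dataflow variables and column names are assumptions; the signature above suggests join can be invoked as a static method, with JoinType coming from azureml.dataprep.api.dataflow):


   from azureml.dataprep.api.dataflow import Dataflow, JoinType

   # Match join on customer_id; non-key columns get the l_/r_ prefixes by default.
   joined = Dataflow.join(left_dataflow=customers,
                          right_dataflow=orders,
                          join_key_pairs=[('customer_id', 'customer_id')],
                          join_type=JoinType.MATCH)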

keep_columns(columns: MultiColumnSelection, validate_column_exists: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Keeps the specified columns and drops all others.

keep_columns(columns: MultiColumnSelection, validate_column_exists: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

validate_column_exists

Whether to validate the columns selected.

Returns

The modified Dataflow.

label_encode(source_column: str, new_column_name: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds a column with encoded labels generated from the source column. For explicit label encoding, use LabelEncoderBuilder.

label_encode(source_column: str, new_column_name: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_column

Column name from which encoded labels will be generated.

new_column_name

The new column's name.

Returns

The modified Dataflow.

map_column(column: str, new_column_id: str, replacements: typing.Union[typing.List[azureml.dataprep.api.dataflow.ReplacementsValue], NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Creates a new column where matching values in the source column have been replaced with the specified values.

map_column(column: str, new_column_id: str, replacements: typing.Union[typing.List[azureml.dataprep.api.dataflow.ReplacementsValue], NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The source column.

new_column_id

The name of the mapped column.

replacements

The values to replace and their replacements.

Returns

The modified Dataflow.
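
A sketch (the 'status' column is an assumption; ReplacementsValue is taken from azureml.dataprep.api.dataflow per the signature above, and its (source_value, new_value) argument order is also an assumption):


   from azureml.dataprep.api.dataflow import ReplacementsValue

   # Map coded values to readable labels in a new column.
   replacements = [ReplacementsValue('0', 'inactive'),
                   ReplacementsValue('1', 'active')]
   dataflow = dataflow.map_column(column='status',
                                  new_column_id='status_label',
                                  replacements=replacements)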

min_max_scale(column: str, range_min: float = 0, range_max: float = 1, data_min: float = None, data_max: float = None) -> azureml.dataprep.api.dataflow.Dataflow

Scales the values in the specified column to lie within range_min (default=0) and range_max (default=1).

min_max_scale(column: str, range_min: float = 0, range_max: float = 1, data_min: float = None, data_max: float = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The column to scale.

range_min

Desired min of scaled values.

range_max

Desired max of scaled values.

data_min

Min of source column. If not provided, a scan of the data will be performed to find the min.

data_max

Max of source column. If not provided, a scan of the data will be performed to find the max.

Returns

The modified Dataflow.
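
For instance (the 'age' column name is an assumption):


   # Without data_min/data_max, a scan of the data finds the actual min and max.
   dataflow = dataflow.min_max_scale(column='age', range_min=0, range_max=1)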

new_script_column(new_column_name: str, insert_after: str, script: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds a new column to the Dataflow using the passed in Python script.

new_script_column(new_column_name: str, insert_after: str, script: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

new_column_name

The name of the new column being created.

insert_after

The column after which the new column will be inserted.

script

The script that will be used to create this new column.

Returns

The modified Dataflow.

Remarks

The Python script must define a function called newvalue that takes a single argument, typically called row. The row argument is a dict (key is column name, value is current value) and will be passed to this function for each row in the dataset. This function must return a value to be used in the new column.

Note

Any libraries that the Python script imports must exist in the environment where the dataflow is run.


   import numpy as np
   def newvalue(row):
       return np.log(row['Metric'])
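
A sketch of the full call, passing the script as a string (the column names are assumptions):


   script = "import numpy as np\ndef newvalue(row):\n    return np.log(row['Metric'])"
   dataflow = dataflow.new_script_column(new_column_name='LogMetric',
                                         insert_after='Metric', script=script)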

new_script_filter(script: str) -> azureml.dataprep.api.dataflow.Dataflow

Filters the Dataflow using the passed in Python script.

new_script_filter(script: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

script

The script that will be used to filter the dataflow.

Returns

The modified Dataflow.

Remarks

The Python script must define a function called includerow that takes a single argument, typically called row. The row argument is a dict (key is column name, value is current value) and will be passed to this function for each row in the dataset. This function must return True or False depending on whether the row should be included in the dataflow. Any libraries that the Python script imports must exist in the environment where the dataflow is run.


   def includerow(row):
       return row['Metric'] > 100
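
A sketch of the full call, passing the script as a string:


   script = "def includerow(row):\n    return row['Metric'] > 100"
   dataflow = dataflow.new_script_filter(script)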

null_coalesce(columns: typing.List[str], new_column_id: str) -> azureml.dataprep.api.dataflow.Dataflow

For each record, selects the first non-null value from the columns specified and uses it as the value of a new column.

null_coalesce(columns: typing.List[str], new_column_id: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

new_column_id

The name of the new column.

Returns

The modified Dataflow.

one_hot_encode(source_column: str, prefix: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Adds a binary column for each categorical label from the source column values. For more control over categorical labels, use OneHotEncodingBuilder.

one_hot_encode(source_column: str, prefix: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_column

Column name from which categorical labels will be generated.

prefix

An optional prefix to use for the new columns.

Returns

The modified Dataflow.

open(file_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Opens a Dataflow with the specified name from the package file.

open(file_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

file_path

Path to the package containing the Dataflow.

Returns

The Dataflow.

parse_delimited(separator: str, headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, quoting: bool, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode, comment: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds step to parse CSV data.

parse_delimited(separator: str, headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, quoting: bool, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode, comment: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

separator

The separator to use to split columns.

headers_mode

How to determine column headers.

encoding

The encoding of the files being read.

quoting

Whether to handle new line characters within quotes. This option will impact performance.

skip_rows

How many rows to skip.

skip_mode

The mode in which rows are skipped.

comment

Character used to indicate a line is a comment instead of data in the files being read.

Returns

A new Dataflow with Parse Delimited Step added.

parse_fwf(offsets: typing.List[int], headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode) -> azureml.dataprep.api.dataflow.Dataflow

Adds step to parse fixed-width data.

parse_fwf(offsets: typing.List[int], headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

offsets

The offsets at which to split columns. The first column is always assumed to start at offset 0.

headers_mode

How to determine column headers.

encoding

The encoding of the files being read.

skip_rows

How many rows to skip.

skip_mode

The mode in which rows are skipped.

Returns

A new Dataflow with Parse FixedWidthFile Step added.

parse_json_column(column: str) -> azureml.dataprep.api.dataflow.Dataflow

Parses the values in the specified column as JSON objects and expands them into multiple columns.

parse_json_column(column: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The source column.

Returns

The modified Dataflow.

parse_lines(headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode, comment: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds step to parse text files and split them into lines.

parse_lines(headers_mode: azureml.dataprep.api.dataflow.PromoteHeadersMode, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding, skip_rows: int, skip_mode: azureml.dataprep.api.dataflow.SkipMode, comment: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

headers_mode

How to determine column headers.

encoding

The encoding of the files being read.

skip_rows

How many rows to skip.

skip_mode

The mode in which rows are skipped.

comment

Character used to indicate a line is a comment instead of data in the files being read.

Returns

A new Dataflow with Parse Lines Step added.

pivot(columns_to_pivot: typing.List[str], value_column: str, summary_function: azureml.dataprep.api.dataflow.SummaryFunction = None, group_by_columns: typing.List[str] = None, null_value_replacement: str = None, error_value_replacement: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Returns a new Dataflow with columns generated from the values in the selected columns to pivot.

pivot(columns_to_pivot: typing.List[str], value_column: str, summary_function: azureml.dataprep.api.dataflow.SummaryFunction = None, group_by_columns: typing.List[str] = None, null_value_replacement: str = None, error_value_replacement: str = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns_to_pivot

The columns used to get the values from which the new dataflow's new columns are generated.

value_column

The column used to get the values that will populate the new dataflow's rows.

Remarks

The values of the new dataflow come from the value column selected. Additionally, there is an optional summarization that consists of an aggregation and a group-by.
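
A sketch (the column names are assumptions; SummaryFunction is taken from azureml.dataprep.api.dataflow per the signature above, and the availability of SummaryFunction.SUM is also an assumption):


   from azureml.dataprep.api.dataflow import SummaryFunction

   # One new column per distinct 'category' value, holding the sum of 'amount'
   # within each 'region' group.
   dataflow = dataflow.pivot(columns_to_pivot=['category'],
                             value_column='amount',
                             summary_function=SummaryFunction.SUM,
                             group_by_columns=['region'])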

promote_headers() -> azureml.dataprep.api.dataflow.Dataflow

Sets the first record in the dataset as headers, replacing any existing ones.

promote_headers() -> azureml.dataprep.api.dataflow.Dataflow

quantile_transform(source_column: str, new_column: str, quantiles_count: int = 1000, output_distribution: str = 'Uniform')

Performs a quantile transformation on the source_column and outputs the transformed result in new_column.

quantile_transform(source_column: str, new_column: str, quantiles_count: int = 1000, output_distribution: str = 'Uniform')

Parameters

source_column

The column which quantile transform will be applied to.

new_column

The column where the transformed data will be placed.

quantiles_count

The number of quantiles used. This will be used to discretize the CDF. Defaults to 1000.

output_distribution

The distribution of the transformed data. Defaults to 'Uniform'.

Returns

The modified Dataflow.

random_split(percentage: float, seed: typing.Union[int, NoneType] = None) -> ('Dataflow', 'Dataflow')

Returns two Dataflows from the original Dataflow, with records randomly and approximately split by the percentage specified (using a seed). If the percentage specified is p (where p is between 0 and 1), the first returned dataflow will contain approximately p*100% records from the original dataflow, and the second dataflow will contain all the remaining records that were not included in the first. A random seed will be used if none is provided.

random_split(percentage: float, seed: typing.Union[int, NoneType] = None) -> ('Dataflow', 'Dataflow')

Parameters

percentage

The approximate percentage to split the Dataflow by. This must be a number between 0.0 and 1.0.

seed

The seed to use for the random split.

Returns

Two Dataflows containing records randomly split from the original Dataflow. If the percentage specified is p, the first dataflow contains approximately p*100% of the records from the original dataflow.
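
For example, an approximate 80/20 train/test split with a fixed seed for repeatability:


   train, test = dataflow.random_split(percentage=0.8, seed=42)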

read_excel(sheet_name: typing.Union[str, NoneType] = None, use_column_headers: bool = False, skip_rows: int = 0) -> azureml.dataprep.api.dataflow.Dataflow

Adds step to read and parse Excel files.

read_excel(sheet_name: typing.Union[str, NoneType] = None, use_column_headers: bool = False, skip_rows: int = 0) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

sheet_name

The name of the sheet to load. The first sheet is loaded if none is provided.

use_column_headers

Whether to use the first row as column headers.

skip_rows

Number of rows to skip when loading the data.

Returns

A new Dataflow with Read Excel Step added.

read_json(json_extract_program: str, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding)

Creates a new Dataflow with the operations required to read JSON files.

read_json(json_extract_program: str, encoding: azureml.dataprep.api.engineapi.typedefinitions.FileEncoding)

Parameters

json_extract_program

PROSE program that will be used to parse JSON.

encoding

The encoding of the files being read.

Returns

A new Dataflow with Read JSON Step added.

read_parquet_dataset(path: azureml.dataprep.api.datasources.FileDataSource) -> azureml.dataprep.api.dataflow.Dataflow

Creates a step to read a Parquet file.

read_parquet_dataset(path: azureml.dataprep.api.datasources.FileDataSource) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

path

The path to the Parquet file.

Returns

A new Dataflow.

read_parquet_file() -> azureml.dataprep.api.dataflow.Dataflow

Adds step to parse Parquet files.

read_parquet_file() -> azureml.dataprep.api.dataflow.Dataflow

Returns

A new Dataflow with Parse Parquet File Step added.

read_postgresql(data_source: azureml.dataprep.api.datasources.PostgreSQLDataSource, query: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds step that can read data from a PostgreSQL database by executing the query specified.

read_postgresql(data_source: azureml.dataprep.api.datasources.PostgreSQLDataSource, query: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

data_source

The details of the PostgreSQL database.

query

The query to execute to read data.

Returns

A new Dataflow with Read SQL Step added.

read_sql(data_source: azureml.dataprep.api.datasources.MSSQLDataSource, query: str) -> azureml.dataprep.api.dataflow.Dataflow

Adds step that can read data from an MS SQL database by executing the query specified.

read_sql(data_source: azureml.dataprep.api.datasources.MSSQLDataSource, query: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

data_source

The details of the MS SQL database.

query

The query to execute to read data.

Returns

A new Dataflow with Read SQL Step added.

reference(reference: DataflowReference) -> azureml.dataprep.api.dataflow.Dataflow

Creates a reference to an existing activity object.

reference(reference: DataflowReference) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

reference

The reference activity.

Returns

A new Dataflow.

rename_columns(column_pairs: typing.Dict[str, str]) -> azureml.dataprep.api.dataflow.Dataflow

Renames the specified columns.

rename_columns(column_pairs: typing.Dict[str, str]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column_pairs

The columns to rename and the desired new names.

Returns

The modified Dataflow.

replace(columns: MultiColumnSelection, find: typing.Any, replace_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Replaces values in a column that match the specified search value.

replace(columns: MultiColumnSelection, find: typing.Any, replace_with: typing.Any) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

find

The value to find, or None.

replace_with

The replacement value, or None.

Returns

The modified Dataflow.

Remarks

The following types are supported for both the find or replace arguments: str, int, float, datetime.datetime, and bool.

replace_datasource(new_datasource: DataSource) -> azureml.dataprep.api.dataflow.Dataflow

Returns new Dataflow with its DataSource replaced by the given one.

replace_datasource(new_datasource: DataSource) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

new_datasource

DataSource to substitute into new Dataflow.

Returns

The modified Dataflow.

Remarks

The given 'new_datasource' must match the type of datasource already in the Dataflow. For example, an MSSQLDataSource cannot be replaced with a FileDataSource.

replace_na(columns: MultiColumnSelection, use_default_na_list: bool = True, use_empty_string_as_na: bool = True, use_nan_as_na: bool = True, custom_na_list: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Replaces values in the specified columns with nulls. You can choose to use the default list, supply your own, or both.

replace_na(columns: MultiColumnSelection, use_default_na_list: bool = True, use_empty_string_as_na: bool = True, use_nan_as_na: bool = True, custom_na_list: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

use_default_na_list

Use the default list and replace 'null', 'NaN', 'NA', and 'N/A' with null.

use_empty_string_as_na

Replace empty strings with null.

use_nan_as_na

Replace NaNs with null.

custom_na_list

Provide a comma-separated list of values to replace with null.

columns

The source columns.

Returns

The modified Dataflow.
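
A sketch (the column names and custom tokens are assumptions):


   # Also treat the literal strings 'missing' and 'unknown' as nulls.
   dataflow = dataflow.replace_na(columns=['city', 'state'],
                                  custom_na_list='missing, unknown')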

replace_reference(new_reference: DataflowReference) -> azureml.dataprep.api.dataflow.Dataflow

Returns new Dataflow with its reference DataSource replaced by the given one.

replace_reference(new_reference: DataflowReference) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

new_reference

Reference to be substituted for current Reference in Dataflow.

Returns

The modified Dataflow.

round(column: str, decimal_places: int) -> azureml.dataprep.api.dataflow.Dataflow

Rounds the values in the column specified to the desired number of decimal places.

round(column: str, decimal_places: int) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The source column.

decimal_places

The number of decimal places.

Returns

The modified Dataflow.

run_local() -> None

Runs the current Dataflow using the local execution runtime.

run_local() -> None

run_spark() -> None

Runs the current Dataflow using the Spark runtime.

run_spark() -> None

save(file_path: str)

Saves the Dataflow to the specified file.

save(file_path: str)

Parameters

file_path

The path of the file.

set_column_types(type_conversions: typing.Dict[str, TypeConversionInfo]) -> azureml.dataprep.api.dataflow.Dataflow

Converts values in specified columns to the corresponding data types.

set_column_types(type_conversions: typing.Dict[str, TypeConversionInfo]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

type_conversions

A dictionary where each key is a column name and each value is either a FieldType, a TypeConverter, or a tuple of FieldType.DATE and a list of format strings.

Returns

The modified Dataflow.

Remarks

The values in the type_conversions dictionary could be of several types:


   import azureml.dataprep as dprep

   dataflow = dprep.read_csv(path='./some/path')
   dataflow = dataflow.set_column_types({
       'MyNumericColumn': dprep.FieldType.DECIMAL,
       'MyBoolColumn': dprep.FieldType.BOOLEAN,
       'MyAutodetectDateColumn': dprep.FieldType.DATE,
       'MyDateColumnWithFormat': (dprep.FieldType.DATE, ['%m-%d-%Y']),
       'MyOtherDateColumn': dprep.DateTimeConverter(['%d-%m-%Y'])})

Note

If you choose to convert a column to FieldType.DATE and do not provide format(s) to use, DataPrep will attempt to infer a format by pulling a sample of the data. If a format can be inferred from the data, it will be used to convert values in the corresponding column. However, if a format cannot be inferred, this step will fail and you will need to either provide the format to use or use the interactive builder ColumnTypesBuilder by calling 'dataflow.builders.set_column_types()'.

skip(count: int) -> azureml.dataprep.api.dataflow.Dataflow

Skips the specified number of records.

skip(count: int) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

count

The number of records to skip.

Returns

The modified Dataflow.

sort(sort_order: typing.List[typing.Tuple[str, bool]]) -> azureml.dataprep.api.dataflow.Dataflow

Sorts the dataset by the specified columns.

sort(sort_order: typing.List[typing.Tuple[str, bool]]) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

sort_order

The columns to sort by and whether to sort ascending or descending, given as (column, descending) tuples. True is treated as descending, False as ascending.

Returns

The modified Dataflow.
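
For example (column names hypothetical; True sorts descending, False ascending):

   # sort by 'Date' ascending, then by 'Score' descending within each date
   dataflow = dataflow.sort([('Date', False), ('Score', True)])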

sort_asc(columns: SortColumns) -> azureml.dataprep.api.dataflow.Dataflow

Sorts the dataset in ascending order by the specified columns.

sort_asc(columns: SortColumns) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The columns to sort in ascending order.

Returns

The modified Dataflow.

sort_desc(columns: SortColumns) -> azureml.dataprep.api.dataflow.Dataflow

Sorts the dataset in descending order by the specified columns.

sort_desc(columns: SortColumns) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The columns to sort in descending order.

Returns

The modified Dataflow.

split_column_by_delimiters(source_column: str, delimiters: Delimiters, keep_delimiters: False) -> azureml.dataprep.api.dataflow.Dataflow

Splits the provided column and adds the resulting columns to the dataflow.

split_column_by_delimiters(source_column: str, delimiters: Delimiters, keep_delimiters: False) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_column

Column to split.

delimiters

A string or list of strings to treat as column delimiters.

keep_delimiters

Controls whether the delimiters are kept in the resulting columns or dropped.

Returns

The modified Dataflow.

Remarks

This will pull a small sample of the data, determine the number of columns to expect as a result of the split, and generate a split program that ensures the expected number of columns is produced, so that the operation yields a deterministic schema.

split_column_by_example(source_column: str, example: SplitExample = None) -> azureml.dataprep.api.dataflow.Dataflow

Splits the provided column and adds the resulting columns to the dataflow based on the provided example.

split_column_by_example(source_column: str, example: SplitExample = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

source_column

Column to split.

example

Example to use for program generation.

Returns

The modified Dataflow.

Remarks

This will pull a small sample of the data, determine the program that best satisfies the provided example, and generate a split program that ensures the expected number of columns is produced, so that the operation yields a deterministic schema.

Note

If an example is not provided, this will generate a split program based on common split patterns, such as splitting by spaces, punctuation, or date parts.
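
A hedged sketch, assuming the example is given as a (source value, list of parts) pair; the column name and values are illustrative:

   # teach the split by pairing one source value with its desired parts
   dataflow = dataflow.split_column_by_example(
       source_column='timestamp',
       example=('2018-02-15 11:29:22', ['2018-02-15', '11:29:22']))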

split_stype(column: str, stype: azureml.dataprep.api.dataflow.SType, stype_fields: typing.Union[typing.List[str], NoneType] = None, new_column_names: typing.Union[typing.List[str], NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Creates new columns from an existing column, interpreting its values as a semantic type.

split_stype(column: str, stype: azureml.dataprep.api.dataflow.SType, stype_fields: typing.Union[typing.List[str], NoneType] = None, new_column_names: typing.Union[typing.List[str], NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

column

The source column.

stype

The semantic type used to interpret values in the column.

stype_fields

Fields of the semantic type to use. If not provided, all fields will be used.

new_column_names

Names of the new columns. If not provided, the new columns will be named with the source column name plus the semantic type field name.

Returns

The modified Dataflow.

str_replace(columns: MultiColumnSelection, value_to_find: typing.Union[str, NoneType] = None, replace_with: typing.Union[str, NoneType] = None, match_entire_cell_contents: bool = False, use_special_characters: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Replaces values in string columns that match the search string with the specified value.

str_replace(columns: MultiColumnSelection, value_to_find: typing.Union[str, NoneType] = None, replace_with: typing.Union[str, NoneType] = None, match_entire_cell_contents: bool = False, use_special_characters: bool = False) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

value_to_find

The value to find.

replace_with

The replacement value.

match_entire_cell_contents

Whether the value to find must match the entire value.

use_special_characters

If checked, you can use '#(tab)', '#(cr)', or '#(lf)' to represent special characters in the find or replace arguments.

Returns

The modified Dataflow.
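
A minimal sketch (the column name and values are hypothetical):

   # normalize an abbreviation across the whole column
   dataflow = dataflow.str_replace(columns='City',
                                   value_to_find='N.Y.',
                                   replace_with='New York')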

summarize(summary_columns: typing.Union[typing.List[azureml.dataprep.api.dataflow.SummaryColumnsValue], NoneType] = None, group_by_columns: typing.Union[typing.List[str], NoneType] = None, join_back: bool = False, join_back_columns_prefix: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Summarizes data by running aggregate functions over specific columns.

summarize(summary_columns: typing.Union[typing.List[azureml.dataprep.api.dataflow.SummaryColumnsValue], NoneType] = None, group_by_columns: typing.Union[typing.List[str], NoneType] = None, join_back: bool = False, join_back_columns_prefix: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

summary_columns

A list of SummaryColumnsValue, where each value defines the column to summarize, the summary function to use, and the name of the resulting column to add.

group_by_columns

Columns to group by.

join_back

Whether to append subtotals or replace current data with them.

join_back_columns_prefix

Prefix to use for subtotal columns when appending them to current data.

Returns

The modified Dataflow.

Remarks

The aggregate functions are independent and it is possible to aggregate the same column multiple times. Unique names have to be provided for the resulting columns. The aggregations can be grouped, in which case one record is returned per group; or ungrouped, in which case one record is returned for the whole dataset. Additionally, the results of the aggregations can either replace the current dataset or augment it by appending the result columns.
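
A hedged sketch of a grouped aggregation (the column names are illustrative; SummaryColumnsValue and SummaryFunction are the types referenced in the signature above):

   import azureml.dataprep as dprep

   # compute the mean of 'Score' for each value of 'Group'
   dataflow = dataflow.summarize(
       summary_columns=[
           dprep.SummaryColumnsValue(column_id='Score',
                                     summary_column_name='Score_Mean',
                                     summary_function=dprep.SummaryFunction.MEAN)],
       group_by_columns=['Group'])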

take(count: int) -> azureml.dataprep.api.dataflow.Dataflow

Takes the specified count of records.

take(count: int) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

count

The number of records to take.

Returns

The modified Dataflow.

take_sample(probability: float, seed: typing.Union[int, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Takes a random sample of the available records.

take_sample(probability: float, seed: typing.Union[int, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

probability

The probability of a record being included in the sample.

seed

The seed to use for the random generator.

Returns

The modified Dataflow.
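
For example:

   # keep roughly 10% of the records; fix the seed for a repeatable sample
   sampled = dataflow.take_sample(probability=0.1, seed=7)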

take_stratified_sample(columns: MultiColumnSelection, fractions: typing.Dict[typing.Tuple, int], seed: typing.Union[int, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Takes a random stratified sample of the available records according to input fractions.

take_stratified_sample(columns: MultiColumnSelection, fractions: typing.Dict[typing.Tuple, int], seed: typing.Union[int, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The strata columns.

fractions

A dictionary mapping each stratum (a tuple of values from the strata columns) to its weight.

seed

The seed to use for the random generator.

Returns

The modified Dataflow.
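
A hedged sketch, assuming strata keys are tuples of values from the strata columns (names and values hypothetical):

   # sample rows where Label == 'yes' with weight 1 and 'no' with weight 2
   sampled = dataflow.take_stratified_sample(columns=['Label'],
                                             fractions={('yes',): 1, ('no',): 2},
                                             seed=7)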

to_bool(columns: MultiColumnSelection, true_values: typing.List[str], false_values: typing.List[str], mismatch_as: azureml.dataprep.api.dataflow.MismatchAsOption = <MismatchAsOption.ASERROR: 2>) -> azureml.dataprep.api.dataflow.Dataflow

Converts the values in the specified columns to booleans.

to_bool(columns: MultiColumnSelection, true_values: typing.List[str], false_values: typing.List[str], mismatch_as: azureml.dataprep.api.dataflow.MismatchAsOption = <MismatchAsOption.ASERROR: 2>) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

true_values

The values to treat as true.

false_values

The values to treat as false.

mismatch_as

How to treat values that don't match the values in the true or false values lists.

Returns

The modified Dataflow.
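
For example (the column name and value lists are hypothetical):

   import azureml.dataprep as dprep

   # values outside both lists raise an error, per the default mismatch_as
   dataflow = dataflow.to_bool(columns='Approved',
                               true_values=['yes', 'y'],
                               false_values=['no', 'n'],
                               mismatch_as=dprep.MismatchAsOption.ASERROR)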

to_datetime(columns: MultiColumnSelection, date_time_formats: typing.Union[typing.List[str], NoneType] = None, date_constant: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Converts the values in the specified columns to DateTimes.

to_datetime(columns: MultiColumnSelection, date_time_formats: typing.Union[typing.List[str], NoneType] = None, date_constant: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

date_time_formats

The formats to use to parse the values. If none are provided, a partial scan of the data will be performed to derive one.

date_constant

If the column contains only time values, a date to apply to the resulting DateTime.

Returns

The modified Dataflow.
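
For example (the column name and format are hypothetical):

   # parse values such as '02/15/2018 11:29' into DateTimes
   dataflow = dataflow.to_datetime(columns='OrderDate',
                                   date_time_formats=['%m/%d/%Y %H:%M'])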

to_json() -> str

Get the JSON string representation of the Dataflow.

to_json() -> str

to_long(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Converts the values in the specified columns to 64 bit integers.

to_long(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

Returns

The modified Dataflow.

to_number(columns: MultiColumnSelection, decimal_point: azureml.dataprep.api.dataflow.DecimalMark = <DecimalMark.DOT: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Converts the values in the specified columns to floating point numbers.

to_number(columns: MultiColumnSelection, decimal_point: azureml.dataprep.api.dataflow.DecimalMark = <DecimalMark.DOT: 0>) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

decimal_point

The symbol to use as the decimal mark.

Returns

The modified Dataflow.

to_pandas_dataframe

Pulls all of the data and returns it as a pandas DataFrame (pandas.DataFrame).

Parameters

extended_types

Whether to keep extended DataPrep types such as DataPrepError in the DataFrame. If False, these values will be replaced with None.

nulls_as_nan

Whether to interpret nulls (or missing values) in number typed columns as NaN. This is done by pandas for performance reasons; it can result in a loss of fidelity in the data.

Returns

A pandas DataFrame (pandas.DataFrame).

Remarks

This method will load all the data returned by this Dataflow into memory.

Since Dataflows do not require a fixed, tabular schema but Pandas DataFrames do, an implicit tabularization step will be executed as part of this action. The resulting schema will be the union of the schemas of all records produced by this Dataflow.
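
For example:

   # executes the Dataflow and loads all resulting records into memory
   df = dataflow.to_pandas_dataframe()
   print(df.head())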

to_spark_dataframe

Creates a Spark DataFrame (pyspark.sql.DataFrame) that can execute the transformation pipeline defined by this Dataflow.

Returns

A Spark DataFrame (pyspark.sql.DataFrame).

Remarks

Since Dataflows do not require a fixed, tabular schema but Spark Dataframes do, an implicit tabularization step will be executed as part of this action. The resulting schema will be the union of the schemas of all records produced by this Dataflow. This tabularization step will result in a pull of the data.

Note

The Spark DataFrame returned is only an execution plan and doesn't actually contain any data, since Spark DataFrames are also lazily evaluated.

to_string(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Converts the values in the specified columns to strings.

to_string(columns: MultiColumnSelection) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

Returns

The modified Dataflow.

transform_partition(script: str) -> azureml.dataprep.api.dataflow.Dataflow

Transforms an entire partition using the passed in Python script.

transform_partition(script: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

script

The script that will be used to transform the partition.

Returns

The modified Dataflow.

Remarks

The Python script must define a function called transform that takes two arguments, typically called df and index. The df argument is a pandas DataFrame that contains the data for the partition, and the index argument is a unique identifier of the partition.

Note

df does not usually contain all of the data in the dataflow, but a partition of the data as it is being processed in the runtime. The number and contents of each partition are not guaranteed across runs.

The transform function can fully edit the passed-in dataframe or even create a new one, but must return a dataframe. Any libraries that the Python script imports must exist in the environment where the dataflow is run.


   # the script passed in should have this function defined
   def transform(df, index):
       # perform any partition level transforms here and return resulting `df`
       return df
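
A hedged usage sketch, passing the script as a string (the column names inside the script are hypothetical):

   script = """
   def transform(df, index):
       df['Total'] = df['Price'] * df['Quantity']
       return df
   """
   dataflow = dataflow.transform_partition(script)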

transform_partition_with_file(script_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Transforms an entire partition using the Python script in the passed in file.

transform_partition_with_file(script_path: str) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

script_path

Relative path to the script that will be used to transform the partition.

Returns

The modified Dataflow.

Remarks

The Python script must define a function called transform that takes two arguments, typically called df and index. The first argument, df, is a pandas DataFrame that contains the data for the partition, and the second argument, index, is a unique identifier for the partition.

Note

df does not usually contain all of the data in the dataflow, but a partition of the data as it is being processed in the runtime. The number and contents of each partition are not guaranteed across runs.

The transform function can fully edit the passed-in dataframe or even create a new one, but must return a dataframe. Any libraries that the Python script imports must exist in the environment where the dataflow is run.


   # the script file passed in should have this function defined
   def transform(df, index):
       # perform any partition level transforms here and return resulting `df`
       return df

trim_string(columns: MultiColumnSelection, trim_left: bool = True, trim_right: bool = True, trim_type: azureml.dataprep.api.dataflow.TrimType = <TrimType.WHITESPACE: 0>, custom_characters: str = '') -> azureml.dataprep.api.dataflow.Dataflow

Trims string values in specific columns.

trim_string(columns: MultiColumnSelection, trim_left: bool = True, trim_right: bool = True, trim_type: azureml.dataprep.api.dataflow.TrimType = <TrimType.WHITESPACE: 0>, custom_characters: str = '') -> azureml.dataprep.api.dataflow.Dataflow

Parameters

columns

The source columns.

trim_left

Whether to trim from the beginning.

trim_right

Whether to trim from the end.

trim_type

Whether to trim whitespace or custom characters.

custom_characters

The characters to trim.

Returns

The modified Dataflow.

use_secrets(secrets: typing.Dict[str, str])

Uses the passed in secrets for execution.

use_secrets(secrets: typing.Dict[str, str])

Parameters

secrets

A dictionary of secret ID to secret value. You can get the list of required secrets by calling the get_missing_secrets method on Dataflow.

verify_has_data()

Verifies that this Dataflow would produce records if executed. An exception will be thrown otherwise.

verify_has_data()

write_streams(streams_column: str, base_path: azureml.dataprep.api.datasources.FileOutput, file_names_column: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Writes the streams in the specified column to the destination path. By default, the names of the files written will be the resource identifiers of the streams. This behavior can be overridden by specifying a column that contains the names to use.

write_streams(streams_column: str, base_path: azureml.dataprep.api.datasources.FileOutput, file_names_column: typing.Union[str, NoneType] = None) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

streams_column

The column containing the streams to write.

file_names_column

A column containing the file names to use.

base_path

The path under which the files should be written.

Returns

The modified Dataflow.

write_to_csv(directory_path: DataDestination, separator: str = ',', na: str = 'NA', error: str = 'ERROR') -> azureml.dataprep.api.dataflow.Dataflow

Writes out the data in the Dataflow in a delimited text format. The output is specified as a directory that will contain multiple files, one per partition processed in the Dataflow.

write_to_csv(directory_path: DataDestination, separator: str = ',', na: str = 'NA', error: str = 'ERROR') -> azureml.dataprep.api.dataflow.Dataflow

Parameters

directory_path

The path to a directory in which to store the output files.

separator

The separator to use.

na

String to use for null values.

error

String to use for error values.

Returns

The modified Dataflow. Every execution of the returned Dataflow will perform the write again.
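
A hedged sketch, assuming a local destination via LocalFileOutput (the output path is hypothetical):

   import azureml.dataprep as dprep

   # one CSV file is written per partition under ./csv_out
   dataflow.write_to_csv(directory_path=dprep.LocalFileOutput('./csv_out')).run_local()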

write_to_parquet(file_path: typing.Union[~DataDestination, NoneType] = None, directory_path: typing.Union[~DataDestination, NoneType] = None, single_file: bool = False, error: str = 'ERROR', row_groups: int = 0) -> azureml.dataprep.api.dataflow.Dataflow

Writes out the data in the Dataflow into Parquet files.

write_to_parquet(file_path: typing.Union[~DataDestination, NoneType] = None, directory_path: typing.Union[~DataDestination, NoneType] = None, single_file: bool = False, error: str = 'ERROR', row_groups: int = 0) -> azureml.dataprep.api.dataflow.Dataflow

Parameters

file_path

The path in which to store the output file.

directory_path

The path in which to store the output files.

single_file

Whether to store the entire Dataflow in a single file.

error

String to use for error values.

row_groups

Number of rows to use per row group.

Returns

The modified Dataflow.
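
A hedged sketch along the same lines (the output path is hypothetical):

   import azureml.dataprep as dprep

   # write a single Parquet file instead of one file per partition
   dataflow.write_to_parquet(file_path=dprep.LocalFileOutput('./out.parquet'),
                             single_file=True).run_local()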

Attributes

dtypes

Gets column data types for the current dataset by calling get_profile() and extracting just the dtypes from the resulting DataProfile.

Returns

A dictionary, where key is the column name and value is FieldType.

row_count

Count of rows in this Dataflow.

Note

This will trigger a data profile calculation. To avoid calculating the profile multiple times, get the profile first by calling get_profile() and then inspect it.

Note

If the current Dataflow contains a take_sample or take step, this will return the number of rows in the subset defined by those steps.

Returns

Count of rows.

Return type

int

shape

Shape of the data produced by the Dataflow.

Note

This will trigger a data profile calculation. To avoid calculating the profile multiple times, get the profile first by calling get_profile() and then inspect it.

Note

If the current Dataflow contains a take_sample or take step, this will return the number of rows in the subset defined by those steps.

Returns

Tuple of row count and column count.

ExampleData

ExampleData = ~ExampleData

SortColumns

SortColumns = ~SortColumns

SourceColumns

SourceColumns = ~SourceColumns

TypeConversionInfo

TypeConversionInfo = ~TypeConversionInfo