Aggregator Class

Defines an aggregation against specified columns identified with join keys.

Inheritance
builtins.object
Aggregator

Constructor

Aggregator()

Remarks

Aggregators are typically not instantiated directly. Instead, specify the type of aggregator when using an enricher such as the HolidayEnricher object.

Derived aggregators include AggregatorAll, AggregatorAvg, AggregatorMax, AggregatorMin, AggregatorTop.

The process(env, customer_data, public_data, join_keys, debug) method performs the aggregation.

Methods

get_log_property

Get the log property tuple, or None if there is no property.

process

Left join customer_data with public_data on join_keys.

After the join, drop all columns in join_keys and all columns listed in to_be_cleaned_up_column_names.

process_public_dataset

Perform aggregation on specified public data columns.

get_log_property

Get the log property tuple, or None if there is no property.

get_log_property()

process

Left join customer_data with public_data on join_keys.

After the join, drop all columns in join_keys and all columns listed in to_be_cleaned_up_column_names.

process(env: SparkEnv | PandasEnv, customer_data: CustomerData, public_data: PublicData, join_keys: list, debug: bool)

Parameters

Name Description
env
Required

The runtime environment.

customer_data
Required

The customer data.

public_data
Required

The public data.

join_keys
Required

A list of join key pairs.

debug
Required

Indicates whether to print debug info.

Returns

Type Description

A tuple of (a new instance of class CustomerData, the unchanged instance of PublicData, a new joined instance of class CustomerData, join keys (list of tuples)).
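
The join-and-clean-up behavior described for process can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation. It assumes each join key pair is ordered as (customer column, public column) and that the columns from both sides are dropped after the join. The helper name left_join_and_clean and the sample column names are hypothetical.

```python
import pandas as pd


def left_join_and_clean(customer_df, public_df, join_keys,
                        to_be_cleaned_up_column_names=()):
    """Hypothetical helper: left join on join-key pairs, then drop helper columns.

    Assumption: each pair in join_keys is (customer column, public column).
    """
    left_on = [customer_col for customer_col, _ in join_keys]
    right_on = [public_col for _, public_col in join_keys]

    # Keep every customer row, matched or not (left join).
    joined = customer_df.merge(
        public_df, how="left", left_on=left_on, right_on=right_on
    )

    # Drop the join-key columns from both sides plus any temporary columns.
    to_drop = set(left_on) | set(right_on) | set(to_be_cleaned_up_column_names)
    return joined.drop(columns=[c for c in joined.columns if c in to_drop])


# Illustrative data only.
customer = pd.DataFrame(
    {"cust_date": ["2020-01-01", "2020-01-02"], "sales": [10, 20], "tmp": [0, 0]}
)
public = pd.DataFrame({"pub_date": ["2020-01-01"], "holiday": ["New Year's Day"]})

result = left_join_and_clean(
    customer, public, join_keys=[("cust_date", "pub_date")],
    to_be_cleaned_up_column_names=["tmp"],
)
print(result)
```

Customer rows without a match in the public data are kept, with missing values in the public columns, which is the defining property of a left join.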

process_public_dataset

Perform aggregation on specified public data columns.

process_public_dataset(env: RuntimeEnv, _public_dataset: object, cols: List[str] | None = None, join_keys: List[Tuple[str, str]] = []) -> object

Parameters

Name Description
env
Required

The runtime environment.

_public_dataset
Required

A public dataset dataframe.

cols

A list of column names to retrieve.

default value: None

join_keys

A list of join keys to use.

default value: []

Returns

Type Description

A new DataFrame of the public dataset.
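
The idea of aggregating public data columns per join key can be sketched with pandas. The page does not say which aggregation the base class applies, so this sketch uses the mean as a stand-in, in the spirit of the derived AggregatorAvg. The helper name average_public_columns, the (customer column, public column) ordering of each pair, and the sample data are assumptions for illustration only.

```python
import pandas as pd


def average_public_columns(public_df, cols=None, join_keys=None):
    """Hypothetical helper: average selected columns per public join key.

    Assumption: each pair in join_keys is (customer column, public column).
    """
    join_keys = join_keys or []
    group_cols = [public_col for _, public_col in join_keys]

    if cols is None:
        # Default to every numeric column that is not a join key.
        numeric_cols = public_df.select_dtypes("number").columns
        cols = [c for c in numeric_cols if c not in group_cols]

    if not group_cols:
        # No join keys: collapse the whole frame into a single row of means.
        return public_df[cols].mean().to_frame().T

    # One output row per distinct join-key value.
    return public_df.groupby(group_cols, as_index=False)[cols].mean()


# Illustrative data only.
public = pd.DataFrame(
    {
        "pub_date": ["d1", "d1", "d2"],
        "temperature": [10.0, 20.0, 30.0],
        "station": ["a", "b", "c"],
    }
)

agg = average_public_columns(
    public, cols=["temperature"], join_keys=[("cust_date", "pub_date")]
)
print(agg)
```

Reducing the public data to one row per join-key value before joining keeps the later left join from multiplying customer rows.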

Attributes

should_direct_join

should_direct_join = True