Aggregator Class

Reference

Defines an aggregation against specified columns identified with join keys.

Inheritance: builtins.object

Aggregator

Constructor

Aggregator()

Remarks

Aggregators are typically not instantiated directly. Instead, specify the type of aggregator when using an enricher such as the HolidayEnricher object.

Derived aggregators include AggregatorAll, AggregatorAvg, AggregatorMax, AggregatorMin, and AggregatorTop.

The process(env, customer_data, public_data, join_keys, debug) method performs the aggregation.

Methods

get_log_property

Get the log property tuple, or None if there is no property.

process

Left join customer_data with public_data on join_keys.

Afterward, drop all columns in join_keys and all columns listed in to_be_cleaned_up_column_names.

process_public_dataset

Perform aggregation on specified public data columns.

get_log_property

Get the log property tuple, or None if there is no property.

get_log_property()

process

Left join customer_data with public_data on join_keys.

Afterward, drop all columns in join_keys and all columns listed in to_be_cleaned_up_column_names.

process(env: SparkEnv | PandasEnv, customer_data: CustomerData, public_data: PublicData, join_keys: list, debug: bool)

Parameters

Name	Required	Type	Description
env	Yes	RuntimeEnv	The runtime environment.
customer_data	Yes	CustomerData	The customer data.
public_data	Yes	PublicData	The public data.
join_keys	Yes	list[tuple]	A list of join key pairs.
debug	Yes	bool	Indicates whether to print debug info.

Returns

Type	Description
tuple[CustomerData, PublicData, CustomerData, list[tuple[str, str]]]	A tuple of (a new instance of CustomerData, the unchanged instance of PublicData, a new joined instance of CustomerData, the join keys as a list of tuples).
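The join-then-drop behavior described for process can be illustrated with a minimal sketch. This is not the library's implementation: plain Python rows (lists of dicts) stand in for the customer and public DataFrames, and the function name left_join_and_drop, the sample column names, and the reading of each join key pair as (customer column, public column) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def left_join_and_drop(
    customer_rows: List[Row],
    public_rows: List[Row],
    join_keys: List[Tuple[str, str]],
    to_be_cleaned_up_column_names: List[str],
) -> List[Row]:
    """Left join customer rows with public rows, then drop helper columns.

    Hypothetical helper: each join key is assumed to be a
    (customer_column, public_column) pair.
    """
    customer_cols = [c for c, _ in join_keys]
    public_cols = [p for _, p in join_keys]

    # Index public rows by their join-key values for quick lookup.
    index: Dict[tuple, List[Row]] = {}
    for row in public_rows:
        index.setdefault(tuple(row[p] for p in public_cols), []).append(row)

    # Columns to remove once the join is done.
    drop = set(customer_cols) | set(public_cols) | set(to_be_cleaned_up_column_names)

    # Public column names, used to fill None when a customer row has no match.
    public_names = sorted({name for row in public_rows for name in row})

    joined: List[Row] = []
    for row in customer_rows:
        key = tuple(row[c] for c in customer_cols)
        # Left join: keep the customer row even when nothing matches.
        matches = index.get(key) or [dict.fromkeys(public_names)]
        for match in matches:
            merged = {**match, **row}  # customer values win on name clashes
            joined.append({k: v for k, v in merged.items() if k not in drop})
    return joined


customer = [
    {"id": 1, "date_key": "2020-01-01", "sales": 10},
    {"id": 2, "date_key": "2020-01-02", "sales": 20},
]
public = [{"pub_date": "2020-01-01", "holiday": "New Year's Day", "tmp": 0}]

result = left_join_and_drop(customer, public, [("date_key", "pub_date")], ["tmp"])
```

In this sketch every customer row survives the join, unmatched rows receive None for the public columns, and the join key columns and the cleanup column no longer appear in the result.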

process_public_dataset

Perform aggregation on specified public data columns.

process_public_dataset(env: RuntimeEnv, _public_dataset: object, cols: List[str] | None = None, join_keys: List[Tuple[str, str]] = []) -> object

Parameters

Name	Required	Type	Description
env	Yes	RuntimeEnv	The runtime environment.
_public_dataset	Yes	DataFrame	A public dataset dataframe.
cols	No	list	A list of column names to retrieve. Default value: None
join_keys	No	list	A list of join keys to use. Default value: []

Returns

Type	Description
object	A new DataFrame of the public dataset.
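As a rough illustration of what aggregating public data columns could look like, the sketch below groups rows by the public side of the join keys and averages the selected columns, loosely in the spirit of a derived aggregator such as AggregatorAvg. It is an assumption-laden example, not the library's code: the function name average_public_rows, the list-of-dicts data model, and the choice of the mean as the aggregate are all hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]


def average_public_rows(
    public_rows: List[Row],
    cols: Optional[List[str]] = None,
    join_keys: Optional[List[Tuple[str, str]]] = None,
) -> List[Row]:
    """Group rows by the public join-key columns and average the other columns.

    Hypothetical helper: the second element of each join key pair is
    assumed to name a column of the public dataset.
    """
    key_cols = [public_col for _, public_col in (join_keys or [])]
    if cols is None:
        # No explicit selection: aggregate every non-key column in the data.
        cols = sorted({n for row in public_rows for n in row if n not in key_cols})

    # Collect the rows that share the same join-key values.
    groups: Dict[tuple, List[Row]] = defaultdict(list)
    for row in public_rows:
        groups[tuple(row[k] for k in key_cols)].append(row)

    # Emit one output row per group: the key values plus the averaged columns.
    result: List[Row] = []
    for key, rows in groups.items():
        out: Row = dict(zip(key_cols, key))
        for col in cols:
            out[col] = mean(r[col] for r in rows)
        result.append(out)
    return result


weather = [
    {"station": "A", "temp": 10.0, "wind": 2.0},
    {"station": "A", "temp": 14.0, "wind": 4.0},
    {"station": "B", "temp": 20.0, "wind": 6.0},
]
averaged = average_public_rows(weather, cols=["temp"], join_keys=[("site", "station")])
```

The sketch uses None as the default for join_keys and substitutes an empty list inside the function, which avoids sharing one mutable default list across calls.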

Attributes

should_direct_join

should_direct_join = True
