
FabricDataFrame Class

A dataframe for storage and propagation of PowerBI metadata.

The elements of column_metadata can contain the following keys:

  • table: table name in originating dataset

  • column: column name

  • dataset: originating dataset name

  • workspace_id: string form of workspace GUID

  • workspace_name: friendly name of originating workspace

  • description: description of column (if one is present)

  • data_type: PowerBI data type for this column

  • data_category: PowerBI data category for this column

  • alignment: PowerBI visual alignment for this column
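As an illustration, a column_metadata mapping with these keys might look as follows. This is a hypothetical sketch: the table, dataset, and workspace names are placeholders, not values from any real model.

```python
# Hypothetical column_metadata entry for a single column named "Sales".
# Every key mirrors the list above; all values are illustrative placeholders.
column_metadata = {
    "Sales": {
        "table": "FactSales",                                    # table name in originating dataset
        "column": "Sales",                                       # column name
        "dataset": "Contoso Sales",                              # originating dataset name
        "workspace_id": "00000000-0000-0000-0000-000000000000",  # string form of workspace GUID
        "workspace_name": "My Workspace",                        # friendly workspace name
        "description": "Total sales amount",                     # column description
        "data_type": "Double",                                   # PowerBI data type
        "data_category": "Uncategorized",                        # PowerBI data category
        "alignment": "Default",                                  # PowerBI visual alignment
    }
}
```

Such a mapping can then be passed to the constructor below via the column_metadata keyword argument.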

Inheritance
sempy.functions._dataframe._sdataframe._SDataFrame
FabricDataFrame

Constructor

FabricDataFrame(data: ndarray | Iterable | dict | DataFrame | None = None, *args: Any, column_metadata: Dict[str, Any] | None = None, dataset: str | UUID | None = None, workspace: str | UUID | None = None, verbose: int = 0, **kwargs: Any)

Parameters

Name Description
data

Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself.

If data is a list of dicts, column order follows insertion-order.

default value: None
*args
Required

Remaining arguments to be passed to standard pandas constructor.

column_metadata

Information about dataframe columns to be stored and propagated.

default value: None
dataset
str or UUID

Name or UUID of the originating dataset.

default value: None
workspace
str or UUID

The Fabric workspace name or UUID object containing the workspace ID. Defaults to None which resolves to the workspace of the attached lakehouse or if no lakehouse attached, resolves to the workspace of the notebook.

default value: None
verbose
int

Verbosity. 0 means no verbosity.

default value: 0
**kwargs
Required

Remaining kwargs to be passed to standard pandas constructor.

Keyword-Only Parameters

column_metadata, dataset, workspace and verbose must be passed by keyword; see their descriptions and default values above.

Methods

add_measure

Join measures from the same dataset to the dataframe.

drop_dependency_violations

Drop rows that violate a given functional constraint.

Enforces a functional constraint between the determinant and dependent columns provided. For each value of the determinant, the most common value of the dependent is picked, and all rows with other values are dropped. For example, given

  ZIP    CITY
  12345  Seattle
  12345  Boston
  12345  Boston
  98765  Baltimore
  00000  San Francisco

the row with CITY=Seattle would be dropped, and the functional dependency ZIP -> CITY holds in the output.

find_dependencies

Detect functional dependencies between the columns of a dataframe.

Columns that map 1:1 will be represented as a list.

Uses a threshold on conditional entropy to discover approximate functional dependencies. Low conditional entropy means strong dependence (i.e. conditional entropy of 0 means complete dependence). Therefore a lower threshold is more selective.

The function tries to prune the potential dependencies by removing transitive edges.

When dropna=True is specified, rows that have a NaN in either column are eliminated from the evaluation. This may result in dependencies being non-transitive, as in the following example. Even though A maps 1:1 with B and B maps 1:1 with C, A does not map 1:1 with C, because the comparison of A and C includes additional NaN rows that are excluded when comparing A and C with B:

  A    B    C
  1    1    1
  1    1    1
  1    NaN  9
  2    NaN  2
  2    2    2

In some dropna=True cases the dependency chain can form cycles. In the following example, NaN values mask the pairwise mappings in such a way that A->B, B->C and C->A:

  A    B    C
  1    1    NaN
  2    1    NaN
  NaN  1    1
  NaN  2    1
  1    NaN  1
  1    NaN  2

list_dependency_violations

Show violating values assuming a functional dependency.

Assuming that there's a functional dependency between column A (determinant) and column B (dependent), show the values that violate the functional dependency, along with the count of their respective occurrences.

This allows inspecting approximate dependencies and finding data quality issues.

For example, given a dataset with zipcodes and cities, we would expect the zipcode to determine the city. However, if the dataset looks like this (where ZIP is the determinant and CITY is the dependent):

  ZIP    CITY
  12345  Seattle
  12345  Boston
  12345  Boston
  98765  Baltimore
  00000  San Francisco

Running this function would output the following violations:

  ZIP    CITY     count
  12345  Boston   2
  12345  Seattle  1

The same zipcode is attached to multiple cities, which means there is some data quality issue within the dataset.

plot_dependency_violations

Show functional dependency violations in graphical format.

to_lakehouse_table

Write the data to OneLake as a Delta table with V-Order enabled.

to_parquet

Write the DataFrame, including its metadata, to the parquet file specified by the path parameter using Arrow.

add_measure

Join measures from the same dataset to the dataframe.

add_measure(*measures: List[str], dataset: str | UUID | None = None, workspace: str | UUID | None = None, use_xmla: bool = False, verbose: int = 0) -> FabricDataFrame

Parameters

Name Description
*measures
Required

List of measure names to join.

dataset
str or UUID

Name or UUID of the dataset to list the measures for. If not provided it will be auto-resolved from column metadata.

default value: None
workspace
str or UUID

The Fabric workspace name or UUID object containing the workspace ID. Defaults to None which resolves to the workspace of the attached lakehouse or if no lakehouse attached, resolves to the workspace of the notebook.

default value: None
use_xmla

Whether or not to use XMLA as the backend for the client. If there are any issues using the default client, set this argument to True.

default value: False
verbose
int

Verbosity. 0 means no verbosity.

default value: 0

Returns

Type Description

A new FabricDataFrame with the joined measures.

drop_dependency_violations

Drop rows that violate a given functional constraint.

Enforces a functional constraint between the determinant and dependent columns provided. For each value of the determinant, the most common value of the dependent is picked, and all rows with other values are dropped. For example, given

  ZIP    CITY
  12345  Seattle
  12345  Boston
  12345  Boston
  98765  Baltimore
  00000  San Francisco

the row with CITY=Seattle would be dropped, and the functional dependency ZIP -> CITY holds in the output.

drop_dependency_violations(determinant_col: str, dependent_col: str, verbose: int = 0) -> FabricDataFrame

Parameters

Name Description
determinant_col
Required
str

Determining column name.

dependent_col
Required
str

Dependent column name.

verbose
int

Verbosity; 0 means no messages, 1 shows the number of dropped rows, and values greater than 1 also show the full content of the dropped rows.

default value: 0

Returns

Type Description

New dataframe with constraint determinant -> dependent enforced.
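The drop logic described above (keep the most common dependent value per determinant) can be sketched with plain pandas. This illustrates the documented semantics and is not sempy's implementation:

```python
import pandas as pd

# The ZIP/CITY example from above.
df = pd.DataFrame({
    "ZIP": ["12345", "12345", "12345", "98765", "00000"],
    "CITY": ["Seattle", "Boston", "Boston", "Baltimore", "San Francisco"],
})

# For each determinant value, find the most common dependent value...
most_common = df.groupby("ZIP")["CITY"].agg(lambda s: s.mode().iloc[0])

# ...and keep only the rows that agree with it.
cleaned = df[df["CITY"] == df["ZIP"].map(most_common)].reset_index(drop=True)
```

After this, the lone ZIP=12345/CITY=Seattle row is gone and the dependency ZIP -> CITY holds in the result.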

find_dependencies

Detect functional dependencies between the columns of a dataframe.

Columns that map 1:1 will be represented as a list.

Uses a threshold on conditional entropy to discover approximate functional dependencies. Low conditional entropy means strong dependence (i.e. conditional entropy of 0 means complete dependence). Therefore a lower threshold is more selective.

The function tries to prune the potential dependencies by removing transitive edges.

When dropna=True is specified, rows that have a NaN in either column are eliminated from the evaluation. This may result in dependencies being non-transitive, as in the following example. Even though A maps 1:1 with B and B maps 1:1 with C, A does not map 1:1 with C, because the comparison of A and C includes additional NaN rows that are excluded when comparing A and C with B:

  A    B    C
  1    1    1
  1    1    1
  1    NaN  9
  2    NaN  2
  2    2    2

In some dropna=True cases the dependency chain can form cycles. In the following example, NaN values mask the pairwise mappings in such a way that A->B, B->C and C->A:

  A    B    C
  1    1    NaN
  2    1    NaN
  NaN  1    1
  NaN  2    1
  1    NaN  1
  1    NaN  2

find_dependencies(dropna: bool = False, threshold: float = 0.01, verbose: int = 0) -> FabricDataFrame

Parameters

Name Description
dropna

Ignore rows where either column is NaN in dependency calculations.

default value: False
threshold

Threshold on conditional entropy to consider a pair of columns a dependency. Lower thresholds result in fewer dependencies (higher selectivity).

default value: 0.01
verbose
int

Verbosity. 0 means no verbosity.

default value: 0

Returns

Type Description

A dataframe with dependencies between columns and groups of columns. To better visualize the 1:1 groupings, columns that belong to a single group are put into a single cell. If no suitable candidates are found, returns an empty DataFrame.
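The conditional-entropy criterion can be sketched in a few lines of plain Python with pandas: H(dependent | determinant) is 0 when the determinant fully determines the dependent, and grows with the number of violations. This mirrors the idea described above, not sempy's exact implementation:

```python
import math
import pandas as pd

def conditional_entropy(df: pd.DataFrame, a: str, b: str) -> float:
    """H(b | a) in nats, estimated from the empirical joint distribution."""
    joint = df.groupby([a, b]).size() / len(df)      # P(a, b)
    marginal = df.groupby(a).size() / len(df)        # P(a)
    return -sum(
        p_ab * math.log(p_ab / marginal[idx[0]])     # -sum P(a,b) * log P(b|a)
        for idx, p_ab in joint.items()
    )

# ZIP fully determines CITY here, so H(CITY | ZIP) is 0...
clean = pd.DataFrame({"ZIP": ["1", "1", "2"], "CITY": ["A", "A", "B"]})

# ...while a violating row (12345 -> Seattle vs. Boston) pushes it above 0.
noisy = pd.DataFrame({
    "ZIP": ["12345", "12345", "12345", "98765", "00000"],
    "CITY": ["Seattle", "Boston", "Boston", "Baltimore", "San Francisco"],
})
```

A pair of columns would be reported as a dependency when this value falls below the threshold parameter.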

list_dependency_violations

Show violating values assuming a functional dependency.

Assuming that there's a functional dependency between column A (determinant) and column B (dependent), show the values that violate the functional dependency, along with the count of their respective occurrences.

This allows inspecting approximate dependencies and finding data quality issues.

For example, given a dataset with zipcodes and cities, we would expect the zipcode to determine the city. However, if the dataset looks like this (where ZIP is the determinant and CITY is the dependent):

  ZIP    CITY
  12345  Seattle
  12345  Boston
  12345  Boston
  98765  Baltimore
  00000  San Francisco

Running this function would output the following violations:

  ZIP    CITY     count
  12345  Boston   2
  12345  Seattle  1

The same zipcode is attached to multiple cities, which means there is some data quality issue within the dataset.

list_dependency_violations(determinant_col: str, dependent_col: str, *, dropna: bool = False, show_feeding_determinants: bool = False, max_violations: int = 10000, order_by: str = 'count') -> FabricDataFrame

Parameters

Name Description
determinant_col
Required
str

Candidate determinant column.

dependent_col
Required
str

Candidate dependent column.

dropna

Whether to drop rows with NaN values in either column.

default value: False
show_feeding_determinants

Show values in the determinant column that are mapped to violating values in the dependent column, even if none of these values themselves violate the functional constraint.

default value: False
max_violations
int

The number of violations to return.

default value: 10000
order_by
str

Primary column to sort results by ("count" or "determinant"). With "count", results are sorted so that the determinant with the highest number of dependent occurrences comes first (grouped by determinant); with "determinant", results are sorted alphabetically by the determinant column.

default value: "count"

Returns

Type Description

FabricDataFrame containing all violating instances of functional dependency. If there are no violations, returns an empty DataFrame.
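The violation listing described above can also be sketched with plain pandas, by counting determinant/dependent value pairs and keeping determinants that map to more than one dependent value. This illustrates the semantics, not sempy's implementation:

```python
import pandas as pd

# The ZIP/CITY example from above.
df = pd.DataFrame({
    "ZIP": ["12345", "12345", "12345", "98765", "00000"],
    "CITY": ["Seattle", "Boston", "Boston", "Baltimore", "San Francisco"],
})

# Count occurrences of each (determinant, dependent) pair.
pairs = df.groupby(["ZIP", "CITY"]).size().rename("count").reset_index()

# A determinant is violating when it maps to more than one dependent value.
violating = pairs[pairs.groupby("ZIP")["ZIP"].transform("size") > 1]

# Sort with the most frequent dependent value first, as order_by="count" does.
violations = violating.sort_values(["ZIP", "count"], ascending=[True, False])
```

This reproduces the table shown above: ZIP 12345 with Boston (count 2) and Seattle (count 1).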

plot_dependency_violations

Show functional dependency violations in graphical format.

plot_dependency_violations(determinant_col: str, dependent_col: str, *, dropna: bool = False, show_feeding_determinants: bool = False, max_violations: int = 10000, order_by: str = 'count') -> graphviz.Graph

Parameters

Name Description
determinant_col
Required
str

Candidate determinant column.

dependent_col
Required
str

Candidate dependent column.

dropna

Whether to drop rows with NaN values in either column.

default value: False
show_feeding_determinants

Show values in the determinant column that are mapped to violating values in the dependent column, even if none of these values themselves violate the functional constraint.

default value: False
max_violations
int

The number of violations to return.

default value: 10000
order_by
str

Primary column to sort results by ("count" or "determinant"). With "count", results are sorted so that the determinant with the highest number of dependent occurrences comes first (grouped by determinant); with "determinant", results are sorted alphabetically by the determinant column.

default value: "count"

Returns

Type Description

Graph of violating values.

to_lakehouse_table

Write the data to OneLake as a Delta table with V-Order enabled.

to_lakehouse_table(name: str, mode: str | None = 'error', spark_schema: StructType | None = None, delta_column_mapping_mode: str = 'name') -> None

Parameters

Name Description
name
Required
str

The name of the table to write to.

mode
str

Specifies the behavior when the table already exists; "error" by default. Details of the modes are available in the Spark docs.

default value: "error"
spark_schema
<xref:pyspark.sql.types.StructType>

Specifies the schema of the Spark table to which the dataframe will be written in the lakehouse. If not provided, it will be auto-generated via the _pandas_to_spark_schema function.

default value: None
delta_column_mapping_mode
str

Specifies the column mapping mode to be used for the delta table. By default, it is set to "name".

default value: "name"

to_parquet

Write the DataFrame, including its metadata, to the parquet file specified by the path parameter using Arrow.

to_parquet(path: str, *args, **kwargs) -> None

Parameters

Name Description
path
Required
str

String containing the filepath to where the parquet should be saved.

*args
Required

Other args to be passed to PyArrow write_table.

**kwargs
Required

Other kwargs to be passed to PyArrow write_table.

Attributes

column_metadata

Information for the columns in the table.