Tutorial: Discover relationships in a semantic model, using semantic link

Straipsnis
04/28/2024

This tutorial illustrates how to interact with Power BI from a Jupyter notebook and detect relationships between tables with the help of the SemPy library.

In this tutorial, you learn how to:

Discover relationships in a semantic model (Power BI dataset), using semantic link's Python library (SemPy).
Use components of SemPy that support integration with Power BI and help to automate data quality analysis. These components include:
- FabricDataFrame - a pandas-like structure enhanced with additional semantic information.
- Functions for pulling semantic models from a Fabric workspace into your notebook.
- Functions that automate the evaluation of hypotheses about functional dependencies and that identify violations of relationships in your semantic models.

Prerequisites

Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
Use the experience switcher on the left side of your home page to switch to the Synapse Data Science experience.

Select Workspaces from the left navigation pane to find and select your workspace. This workspace becomes your current workspace.
Download the Customer Profitability Sample.pbix and Customer Profitability Sample (auto).pbix semantic models from the fabric-samples GitHub repository and upload them to your workspace.

Follow along in the notebook

The powerbi_relationships_tutorial.ipynb notebook accompanies this tutorial.

To open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science tutorials, to import the notebook to your workspace.

If you'd rather copy and paste the code from this page, you can create a new notebook.

Be sure to attach a lakehouse to the notebook before you start running code.

Set up the notebook

In this section, you set up a notebook environment with the necessary modules and data.

Install SemPy from PyPI using the %pip in-line installation capability within the notebook:
```
%pip install semantic-link
```

Perform necessary imports of SemPy modules that you'll need later:

import sempy.fabric as fabric

from sempy.relationships import plot_relationship_metadata
from sempy.relationships import find_relationships
from sempy.fabric import list_relationship_violations

Import pandas for enforcing a configuration option that helps with output formatting:
```
import pandas as pd
pd.set_option('display.max_colwidth', None)
```

Explore semantic models

This tutorial uses a standard sample semantic model Customer Profitability Sample.pbix. For a description of the semantic model, see Customer Profitability sample for Power BI.

Use SemPy's list_datasets function to explore semantic models in your current workspace:
```
fabric.list_datasets()
```

For the rest of this notebook you use two versions of the Customer Profitability Sample semantic model:

Customer Profitability Sample: the semantic model as it comes from Power BI samples with predefined table relationships
Customer Profitability Sample (auto): the same data, but relationships are limited to those ones that Power BI would autodetect.

Extract a sample semantic model with its predefined semantic model

Load relationships that are predefined and stored within the Customer Profitability Sample semantic model, using SemPy's list_relationships function. This function lists from the Tabular Object Model:
```
dataset = "Customer Profitability Sample"
relationships = fabric.list_relationships(dataset)
relationships
```
Visualize the relationships DataFrame as a graph, using SemPy's plot_relationship_metadata function:
```
plot_relationship_metadata(relationships)
```

This graph shows the "ground truth" for relationships between tables in this semantic model, as it reflects how they were defined in Power BI by a subject matter expert.

Complement relationships discovery

If you started with relationships that Power BI autodetected, you'd have a smaller set.

Visualize the relationships that Power BI autodetected in the semantic model:
```
dataset = "Customer Profitability Sample (auto)"
autodetected = fabric.list_relationships(dataset)
plot_relationship_metadata(autodetected)
```
Power BI's autodetection missed many relationships. Moreover, two of the autodetected relationships are semantically incorrect:
- Executive[ID] -> Industry[ID]
- BU[Executive_id] -> Industry[ID]
Print the relationships as a table:
```
autodetected
```
Incorrect relationships to the Industry table appear in rows with index 3 and 4. Use this information to remove these rows.
Discard the incorrectly identified relationships.
```
autodetected.drop(index=[3,4], inplace=True)
autodetected
```
Now you have correct, but incomplete relationships.
Visualize these incomplete relationships, using plot_relationship_metadata:
```
plot_relationship_metadata(autodetected)
```

Load all the tables from the semantic model, using SemPy's list_tables and read_table functions:

tables = {table: fabric.read_table(dataset, table) for table in fabric.list_tables(dataset)['Name']}

tables.keys()

Find relationships between tables, using find_relationships, and review the log output to get some insights into how this function works:

suggested_relationships_all = find_relationships(
    tables,
    name_similarity_threshold=0.7,
    coverage_threshold=0.7,
    verbose=2
)

Visualize newly discovered relationships:
```
plot_relationship_metadata(suggested_relationships_all)
```
SemPy was able to detect all relationships.

Use the exclude parameter to limit the search to additional relationships that weren't identified previously:

additional_relationships = find_relationships(
    tables,
    exclude=autodetected,
    name_similarity_threshold=0.7,
    coverage_threshold=0.7
)

additional_relationships

Validate the relationships

First, load the data from the Customer Profitability Sample semantic model:

dataset = "Customer Profitability Sample"
tables = {table: fabric.read_table(dataset, table) for table in fabric.list_tables(dataset)['Name']}

tables.keys()

Check for overlap of primary and foreign key values by using the list_relationship_violations function. Supply the output of the list_relationships function as input to list_relationship_violations:
```
list_relationship_violations(tables, fabric.list_relationships(dataset))
```
The relationship violations provide some interesting insights. For example, one out of seven values in Fact[Product Key] isn't present in Product[Product Key], and this missing key is 50.

Exploratory data analysis is an exciting process, and so is data cleaning. There's always something that the data is hiding, depending on how you look at it, what you want to ask, and so on. Semantic link provides you with new tools that you can use to achieve more with your data.

Check out other tutorials for semantic link / SemPy:

Bendrinti naudojant

Tutorial: Discover relationships in a semantic model, using semantic link

Prerequisites

Follow along in the notebook

Set up the notebook

Explore semantic models

Extract a sample semantic model with its predefined semantic model

Complement relationships discovery

Validate the relationships

Atsiliepimai

Atsiliepimai

Papildomi ištekliai

Bendrinti naudojant

Tutorial: Discover relationships in a semantic model, using semantic link

Prerequisites

Follow along in the notebook

Set up the notebook

Explore semantic models

Extract a sample semantic model with its predefined semantic model

Complement relationships discovery

Validate the relationships

Related content

Atsiliepimai

Atsiliepimai

Papildomi ištekliai