Redaguoti

Bendrinti naudojant


Tutorial: Discover relationships in a semantic model, using semantic link

This tutorial illustrates how to interact with Power BI from a Jupyter notebook and detect relationships between tables with the help of the SemPy library.

In this tutorial, you learn how to:

  • Discover relationships in a semantic model (Power BI dataset), using semantic link's Python library (SemPy).
  • Use components of SemPy that support integration with Power BI and help to automate data quality analysis. These components include:
    • FabricDataFrame - a pandas-like structure enhanced with additional semantic information.
    • Functions for pulling semantic models from a Fabric workspace into your notebook.
    • Functions that automate the evaluation of hypotheses about functional dependencies and that identify violations of relationships in your semantic models.

Prerequisites

  • Select Workspaces from the left navigation pane to find and select your workspace. This workspace becomes your current workspace.

  • Download the Customer Profitability Sample.pbix and Customer Profitability Sample (auto).pbix semantic models from the fabric-samples GitHub repository and upload them to your workspace.

Follow along in the notebook

The powerbi_relationships_tutorial.ipynb notebook accompanies this tutorial.

To open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science tutorials, to import the notebook to your workspace.

If you'd rather copy and paste the code from this page, you can create a new notebook.

Be sure to attach a lakehouse to the notebook before you start running code.

Set up the notebook

In this section, you set up a notebook environment with the necessary modules and data.

  1. Install SemPy from PyPI using the %pip in-line installation capability within the notebook:

    %pip install semantic-link
    
  2. Perform necessary imports of SemPy modules that you'll need later:

    import sempy.fabric as fabric
    
    from sempy.relationships import plot_relationship_metadata
    from sempy.relationships import find_relationships
    from sempy.fabric import list_relationship_violations
    
  3. Import pandas for enforcing a configuration option that helps with output formatting:

    import pandas as pd
    pd.set_option('display.max_colwidth', None)
    

Explore semantic models

This tutorial uses a standard sample semantic model Customer Profitability Sample.pbix. For a description of the semantic model, see Customer Profitability sample for Power BI.

  • Use SemPy's list_datasets function to explore semantic models in your current workspace:

    fabric.list_datasets()
    

For the rest of this notebook you use two versions of the Customer Profitability Sample semantic model:

  • Customer Profitability Sample: the semantic model as it comes from Power BI samples with predefined table relationships
  • Customer Profitability Sample (auto): the same data, but relationships are limited to those ones that Power BI would autodetect.

Extract a sample semantic model with its predefined semantic model

  1. Load relationships that are predefined and stored within the Customer Profitability Sample semantic model, using SemPy's list_relationships function. This function lists from the Tabular Object Model:

    dataset = "Customer Profitability Sample"
    relationships = fabric.list_relationships(dataset)
    relationships
    
  2. Visualize the relationships DataFrame as a graph, using SemPy's plot_relationship_metadata function:

    plot_relationship_metadata(relationships)
    

    Screenshot showing a plot of the relationships between tables in the semantic model.

This graph shows the "ground truth" for relationships between tables in this semantic model, as it reflects how they were defined in Power BI by a subject matter expert.

Complement relationships discovery

If you started with relationships that Power BI autodetected, you'd have a smaller set.

  1. Visualize the relationships that Power BI autodetected in the semantic model:

    dataset = "Customer Profitability Sample (auto)"
    autodetected = fabric.list_relationships(dataset)
    plot_relationship_metadata(autodetected)
    

    Screenshot showing the relationships that Power BI autodetected in the semantic model.

    Power BI's autodetection missed many relationships. Moreover, two of the autodetected relationships are semantically incorrect:

    • Executive[ID] -> Industry[ID]
    • BU[Executive_id] -> Industry[ID]
  2. Print the relationships as a table:

    autodetected
    

    Incorrect relationships to the Industry table appear in rows with index 3 and 4. Use this information to remove these rows.

  3. Discard the incorrectly identified relationships.

    autodetected.drop(index=[3,4], inplace=True)
    autodetected
    

    Now you have correct, but incomplete relationships.

  4. Visualize these incomplete relationships, using plot_relationship_metadata:

    plot_relationship_metadata(autodetected)
    

    Screenshot that shows a visualization of relationships after removing incorrect ones.

  5. Load all the tables from the semantic model, using SemPy's list_tables and read_table functions:

    tables = {table: fabric.read_table(dataset, table) for table in fabric.list_tables(dataset)['Name']}
    
    tables.keys()
    
  6. Find relationships between tables, using find_relationships, and review the log output to get some insights into how this function works:

    suggested_relationships_all = find_relationships(
        tables,
        name_similarity_threshold=0.7,
        coverage_threshold=0.7,
        verbose=2
    )
    
  7. Visualize newly discovered relationships:

    plot_relationship_metadata(suggested_relationships_all)
    

    Screenshot that shows visualization of newly discovered relationships.

    SemPy was able to detect all relationships.

  8. Use the exclude parameter to limit the search to additional relationships that weren't identified previously:

    additional_relationships = find_relationships(
        tables,
        exclude=autodetected,
        name_similarity_threshold=0.7,
        coverage_threshold=0.7
    )
    
    additional_relationships
    

Validate the relationships

  1. First, load the data from the Customer Profitability Sample semantic model:

    dataset = "Customer Profitability Sample"
    tables = {table: fabric.read_table(dataset, table) for table in fabric.list_tables(dataset)['Name']}
    
    tables.keys()
    
  2. Check for overlap of primary and foreign key values by using the list_relationship_violations function. Supply the output of the list_relationships function as input to list_relationship_violations:

    list_relationship_violations(tables, fabric.list_relationships(dataset))
    

    The relationship violations provide some interesting insights. For example, one out of seven values in Fact[Product Key] isn't present in Product[Product Key], and this missing key is 50.

Exploratory data analysis is an exciting process, and so is data cleaning. There's always something that the data is hiding, depending on how you look at it, what you want to ask, and so on. Semantic link provides you with new tools that you can use to achieve more with your data.

Check out other tutorials for semantic link / SemPy: