Best practices for creating a data warehouse using dataflows
Designing a data warehouse is one of the most common scenarios for dataflows. This article highlights some of the best practices for creating a data warehouse by using dataflows.
One of the key points in any data integration system is to reduce the number of reads from the source operational system. In the traditional data warehouse architecture, this reduction is done by creating a new database called a staging database. The purpose of the staging database is to receive data as-is from the data source on a regular schedule.
The rest of the data integration then uses the staging database as the source for further transformation and conversion to the data warehouse model structure.
We recommend that you follow the same approach by using dataflows. Create a set of dataflows that are responsible for just loading data as-is from the source system (and only for the tables you need). The result is then stored in the storage structure of the dataflow (either Azure Data Lake Storage or Dataverse). This change ensures that the read operation from the source system is minimal.
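A staging entity can be as simple as a straight load of a source table with no transformations. The following sketch shows what such an entity might look like in Power Query M; the server, database, and table names are hypothetical:

```powerquery-m
// Staging entity: load the Orders table as-is from the source system.
// "SqlServer01", "SalesDb", and "Orders" are hypothetical names.
let
    Source = Sql.Database("SqlServer01", "SalesDb"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data]
in
    Orders
```

Because there are no transformation steps, the read from the source system stays as light as possible.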
Next, you can create other dataflows that source their data from staging dataflows. The benefits of this approach include:
- Reducing the number of read operations from the source system, and reducing the load on the source system as a result.
- Reducing the load on data gateways if an on-premises data source is used.
- Having an intermediate copy of the data for reconciliation purposes, in case the source system data changes.
- Making the transformation dataflows source-independent.
Image emphasizing staging dataflows and staging storage, and showing the data being accessed from the data source by the staging dataflow, and entities being stored in either Dataverse or Azure Data Lake Storage. The entities are then shown being transformed along with other dataflows, which are then sent out as queries.
When you've separated your transformation dataflows from the staging dataflows, the transformation will be independent from the source. This separation helps if you're migrating the source system to a new system. All you need to do in that case is to change the staging dataflows. The transformation dataflows are likely to work without any problem, because they're sourced only from the staging dataflows.
This separation also helps in case the source system connection is slow. The transformation dataflow won't need to wait for a long time to get records coming through a slow connection from the source system. The staging dataflow has already done that part, and the data will be ready for the transformation layer.
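A transformation dataflow reads its data from the staged entities rather than from the source system. The sketch below follows the navigation shape that the Power Platform dataflows connector typically generates; the workspace, dataflow, and entity names are hypothetical:

```powerquery-m
// Transformation entity sourced from a staging dataflow.
// "Sales", "Staging - Sales", and "Orders" are hypothetical names.
let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id = "Workspaces"]}[Data],
    Workspace = Workspaces{[workspaceName = "Sales"]}[Data],
    Dataflow = Workspace{[dataflowName = "Staging - Sales"]}[Data],
    Orders = Dataflow{[entity = "Orders", version = ""]}[Data],
    // Transformation logic runs against the staged copy, not the source system.
    Cleaned = Table.SelectRows(Orders, each [OrderStatus] <> "Cancelled")
in
    Cleaned
```

If the source system later changes, only the staging dataflow needs to be updated; queries like this one continue to work unchanged.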
Layered architecture
A layered architecture is one in which you perform actions in separate layers. The staging and transformation dataflows can be two layers of a multilayered dataflow architecture. Performing actions in separate layers helps minimize the required maintenance. When you want to change something, you just need to change it in the layer in which it's located. The other layers will continue to work as usual.
The following image shows a multilayered architecture for dataflows whose entities are then used in Power BI datasets.
Use a computed entity as often as possible
When you use the result of a dataflow in another dataflow, you're using the concept of the computed entity, which means getting data from an "already-processed-and-stored" entity. The same thing can happen inside a dataflow. When you reference an entity from another entity, you can use the computed entity. This approach is helpful when a set of transformations, known as common transformations, needs to be applied to multiple entities.
In the previous image, the computed entity gets the data directly from the source. However, in the architecture of staging and transformation dataflows, it's likely that the computed entities are sourced from the staging dataflows.
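Inside a dataflow, referencing another entity is just a matter of using its query name as the source. In the sketch below, `OrdersStaged` is a hypothetical entity in the same dataflow; referencing it makes the new query a computed entity that reuses the already-processed result:

```powerquery-m
// Computed entity: starts from OrdersStaged (a hypothetical entity in the
// same dataflow) instead of re-reading the source system.
let
    Source = OrdersStaged,
    // A common transformation applied on top of the staged result:
    Summarized = Table.Group(Source, {"CustomerId"},
        {{"TotalSales", each List.Sum([SalesAmount]), type number}})
in
    Summarized
```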
Build a star schema
The best data warehouse model is a star schema, with dimension and fact tables designed to minimize the time it takes to query data from the model and to make the model easy for the data visualizer to understand.
It isn't ideal to bring data into a BI system in the same layout as the operational system. The data tables should be remodeled. Some of the tables should take the form of a dimension table, which keeps the descriptive information. Some of the tables should take the form of a fact table, to keep the aggregable data. The best layout for fact tables and dimension tables is a star schema. More information: Understand star schema and the importance for Power BI
Use a unique key value for dimensions
When building dimension tables, make sure you have a key for each one. This ensures that there are no many-to-many (or in other words, "weak") relationships among dimensions. You can create the key by applying some transformation to make sure a column or a combination of columns returns unique rows in the dimension. Then that combination of columns can be marked as a key in the entity in the dataflow.
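The transformation might look like the following sketch, where duplicates are removed so that a column uniquely identifies each row, and a surrogate key is optionally added. The entity and column names are hypothetical:

```powerquery-m
// Build a Customer dimension with a guaranteed-unique key.
// "CustomerStaged" and the column names are hypothetical.
let
    Source = CustomerStaged,
    // Keep only the descriptive columns the dimension needs.
    Selected = Table.SelectColumns(Source,
        {"CustomerCode", "CustomerName", "Region"}),
    // Remove duplicates so CustomerCode returns unique rows.
    Distinct = Table.Distinct(Selected, {"CustomerCode"}),
    // Optionally add a surrogate key column.
    WithKey = Table.AddIndexColumn(Distinct, "CustomerKey", 1, 1, Int64.Type)
in
    WithKey
```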
Do an incremental refresh for large fact tables
Fact tables are always the largest tables in the data warehouse. We recommend that you reduce the number of rows transferred for these tables. If you have a very large fact table, ensure that you use incremental refresh for that entity. An incremental refresh can be done in the Power BI dataset and also in the dataflow entities.
You can use incremental refresh to refresh only part of the data, the part that has changed. There are multiple options to choose which part of the data to be refreshed and which part to be persisted. More information: Using incremental refresh with Power BI dataflows
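With incremental refresh configured, the entity filters its rows between the `RangeStart` and `RangeEnd` date/time parameters that the incremental refresh policy supplies at refresh time. A minimal sketch, assuming a hypothetical staged fact entity with an `OrderDateTime` column:

```powerquery-m
// Incremental refresh: only rows in the [RangeStart, RangeEnd) window are
// refreshed. "FactSalesStaged" and "OrderDateTime" are hypothetical names.
let
    Source = FactSalesStaged,
    Filtered = Table.SelectRows(Source,
        each [OrderDateTime] >= RangeStart and [OrderDateTime] < RangeEnd)
in
    Filtered
```

Note the asymmetric comparison (`>=` on one end, `<` on the other), which keeps refresh windows from overlapping at the boundary.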
Use referencing to create dimensions and fact tables
In the source system, you often have a table that you use for generating both fact and dimension tables in the data warehouse. These tables are good candidates for computed entities and also intermediate dataflows. The common part of the process, such as data cleaning and removing extra rows and columns, can be done once. By using a reference from the output of those actions, you can produce the dimension and fact tables. This approach uses the computed entity for the common transformations.
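As a sketch, a shared base query can hold the common cleaning, and the dimension and fact queries can both reference it. The listing below shows three separate queries in one place; all entity and column names are hypothetical:

```powerquery-m
// Query 1 - SalesBase (computed entity): common cleaning done once.
let
    Source = SalesStaged,
    RemovedColumns = Table.RemoveColumns(Source, {"LegacyCode", "Notes"}),
    Cleaned = Table.SelectRows(RemovedColumns, each [SalesAmount] <> null)
in
    Cleaned

// Query 2 - DimProduct: references SalesBase instead of the source.
let
    Source = SalesBase,
    Products = Table.Distinct(
        Table.SelectColumns(Source, {"ProductCode", "ProductName"}),
        {"ProductCode"})
in
    Products

// Query 3 - FactSales: also references SalesBase.
let
    Source = SalesBase,
    Facts = Table.SelectColumns(Source,
        {"ProductCode", "OrderDate", "SalesAmount"})
in
    Facts
```

Because `DimProduct` and `FactSales` both read from `SalesBase`, the cleaning steps run once rather than once per table.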