Lesson 3: Optimizing the Database Design by Denormalizing
Estimated lesson time: 20 minutes
You implement normalization with a specific purpose: to maintain data integrity. However, in a real-life project, you typically have to bring back some data redundancy either for performance reasons or to maintain a history.
A fully normalized schema shows current state only. For example, in an invoicing application that is in a fully normalized design, you keep a customer’s address only in the Customers table. Suppose that a customer moves and you update that customer’s address with the new one. A few days later, the same customer reports that he accidentally lost some printed invoices during the move and asks you to print copies of lost invoices. It would be impossible to print exact copies of the invoices if you do not maintain some history about the customer’s address. You can solve this problem by maintaining a copy of customer address information on the invoice date in the Invoices table. Similarly, customers can change their name, so you also need to maintain a copy of customer names as part of the invoice data in the Invoices table.
During the normalization process, you decompose tables into more tables. The more tables you have, the more joins you have to perform in your queries, and joins have a negative impact on performance. To help, you can replicate a foreign key from the first child table to the second one. Take the invoicing system example again. Say you have an Employees parent table, and you add the EmployeeId column to the Customers table to serve as the foreign key because every customer has an account manager. Now, your typical queries must find invoices together with customer account manager data, requiring you to join three tables—Invoices, Customers, and Employees. However, if you replicate the EmployeeId column in the Invoices table, you can achieve the same goal and satisfy the same query by joining only the Invoices and Employees tables.
The best way to boost performance in a normalized system is to use derived data. For example, before creating an invoice item, you must check whether a product is in stock. You can always calculate levels and states from events, but what if you had to aggregate all the events for the current year just to tell a customer the product he or she wants to buy is out of stock? The customer would probably never return. You can solve this problem by maintaining total quantity in stock in a column in the Products table and by maintaining a separate ProductsIn-Warehouses table that holds the quantity of products in stock in your warehouses (if you have multiple warehouses).
Finally, you might have many queries that aggregate sales across customers. In a fully normalized system, you would have to aggregate all invoice details for a single customer multiple times. You can greatly improve performance by maintaining the year-to-date sales summary in a column in the Customers table. Figure 2-3 shows ER diagrams of the invoicing database before normalization, and Figure 2-4 shows ER diagrams after denormalization.
Figure 2-3 Part of the invoicing database before denormalization
Figure 2-4 Part of the invoicing database after denormalization
Still, denormalization brings the danger of update anomalies back to the database. Therefore, you have to do it deliberately. You should document any denormalization thoroughly. To make sure your application correctly maintains denormalized data, you need to use transactions appropriately. A transaction is the smallest unit of work that must either complete entirely or not at all. For example, in the Invoices system, a transaction might mean an insertion to a table such as InvoiceDetails, which must be followed immediately by an update of the derived-level column TotalInStock in the Product table or a StockLevel column in the ProductsIn-Warehouses table, depending where you maintain the stock level information. If one of those actions fails, the entire transaction is rolled back to ensure data consistency. An even better choice for maintaining denormalized data, however, is to leave that task to the RDBMS. You can do this by introducing data manipulation language (DML) triggers. A RDBMS fires a trigger automatically as part of a transaction. In the invoices example, a DML trigger for insert, update, and delete on the InvoiceDetails table can maintain the TotalInStock column and the ProductsInWarehouses table. In addition, you should have procedures in place to rebuild the derived data from scratch. You can always rebuild this data from events tables in case of inconsistency between the sum of events and states.
IMPORTANT Maintain denormalized data in transactions
After a denormalization, you have to make sure to maintain the denormalized data in transactions. For example, you have to correct the stock level any time you sell a product in a single transaction.
For reports and analysis, a better practice than maintaining aggregate information in an OLTP database is to introduce a data warehouse, which is based on a somewhat denormalized multidimensional model and a Microsoft Online Analytical Processing (OLAP) system. An OLAP database management system typically takes care of the aggregations for you.
Practice: Denormalizing the Database
In the previous practice, you prepared a fully normalized model for the Projects database. In this practice, you will denormalize this model to improve performance. Your current design is:
Projects(ProjectId, ProjectName, CustomerId) ProjectDetails(ProjectId, ItemId, ActivityId, EmployeeId, WorkDate, TimeSpent) Customers(CustomerId, CustomerName) Activities(ActivityId, ActivityName) Employees(EmployeeId, EmployeeName)
Exercise 1: Denormalize to Maintain History
In this exercise, you will perform the denormalization to maintain history.
Your company requires your system to maintain a customer’s name from the time that you started a project for a customer. A customer can later change the name. How would you achieve this requirement?
In the Customers table, you maintain the current name only. Therefore, you have to add the customer name to the Projects table.
In addition, it is always a good practice to have the information about the date when the name was valid. Therefore, you should add the start date in the Projects table as well. Note that you would probably already have the start date of a project in a real-life design. Your improved design should look similar to the following. (The denormalized attributes are StartDate and CustomerName.)
Projects(ProjectId, ProjectName, CustomerID, StartDate, CustomerName)
Exercise 2: Denormalize for Performance
In this exercise, you will perform the denormalization to improve performance.
You have performance problems with a report that calculates total time spent on a project. Because you are tracking multiple projects, you run this report multiple times a day. How would you improve report performance?
The problem is that you have to aggregate project detail rows, in which you have the information about time spent or each activity on a project. You have to add a column to hold the total time spent in the Projects table.
Your improved design should look similar to the following. (Denormalized attribute that you should add in this exercise is TotalTimeSpent.)
Projects(ProjectId, ProjectName, CustomerId, StartDate, CustomerName, TotalTimeSpent)
- What are some benefits of denormalization?
- What problems can you encounter when you decide to introduce some denormal-ization into your model?
Quick Check Answers
- Some benefits of denormalization are improved performance, the need for fewer joins, and the ability to maintain history information.
- When you denormalize a database, you can encounter problems with data consistency, especially with aggregated data, which must be updated for any single event.
© Microsoft. All Rights Reserved.