Project REAL: Analysis Services Technical Drilldown
By Dave Wickert, Microsoft Corporation
SQL Server Technical Article
Published: September 2005
Applies To: SQL Server 2005
Summary: This white paper is a detailed technical discussion of designs and best practices for Analysis Services that were developed in Project REAL. It drills down into each of the different types of objects, such as data sources, data source views, dimensions, hierarchies, attributes, measure groups, and partitions. It shows how to write a SQL Server 2005 Integration Services package that automates the creation of measure group partitions based on the relational database partitioning scheme.
On This Page
About Project REAL
Three Ways to Interact With Analysis Services
Data Sources and Data Source Views
Appendix A: Automating Partition Creation
About Project REAL
Project REAL is an effort to discover best practices for creating business intelligence applications that are based on Microsoft® SQL Server™ 2005 by creating reference implementations that are based on actual customer scenarios. This means that customer data is brought in-house and is used to work through the same issues that the customers face during deployment. These issues include:
The design of schemas—both relational and for Analysis Services.
The implementation of a data extraction, transformation, and loading (ETL) process.
The design and deployment of client front-end systems, both for reporting and for interactive analysis.
The sizing of systems for production.
The management and maintenance of the systems on an ongoing basis, including incremental updates to the data.
By working with real deployment scenarios, we gain a complete understanding of how to work with the tools. Our goal is to address the full gamut of concerns that a large company would face during their own real-world deployment.
This white paper offers a detailed technical discussion of the Analysis Services designs and best practices that were developed in Project REAL. We drill down into each of the different types of objects, such as data sources, data source views, dimensions, hierarchies, attributes, measure groups, and partitions, pointing out important issues as we go along.
For an overview of Project REAL, see the Project REAL: Technical Overview white paper. A number of papers, tools, and samples will be produced over the lifetime of Project REAL. To find the latest information, check the Project REAL Web site.
This paper reviews the technical design of Project REAL Analysis Services and discusses various issues that impact this design. We assume that the reader is familiar with Analysis Services design and has examined the schema that ships with Project REAL. For example, we assume that you already know that the many-to-many Vendor dimension exists. We focus our discussion on why it exists (and the alternative we considered before finalizing that design).
In this paper, we examine each type of Analysis Services object in the multidimensional design. We start with physical schema objects, such as data sources and data source views. We then discuss logical objects such as dimensions, user-defined hierarchies, attribute hierarchies, and measure groups. We then drill down into the measure group characteristics, such as partitioning, aggregate design, and proactive caching. The section closes with a discussion of the remainder of the logical design, including calculations, Key Performance Indicators (KPIs), actions, perspectives, custom assemblies, user-defined functions (UDFs), and MDX scripts.
In the last section we take a detailed look at two design alternatives that were considered during the design of the Analysis Services schema. We present the goal, what we considered doing, and what we implemented.
The paper closes with information about the server-wide settings that we changed and a discussion of why we had to modify them.
The Project REAL design relies heavily on partitioning. There are several hundred partitions defined across all of the measure groups. In Appendix A, we show you how we solved the management problem of creating and managing all of those partitions in the various databases that we created.
Three Ways to Interact With Analysis Services
There are three ways to interact with Analysis Services: SQL Server Management Studio, Business Intelligence (BI) Development Studio in project mode, and BI Development Studio in directly connected mode. Let us examine each of these and then describe how you can get into trouble if you improperly combine these methods.
SQL Server Management Studio
When you start SQL Server Management Studio, it connects directly to the Analysis Services server and immediately performs the requested operation.
For example, if you process a cube or a dimension, internally the system uses the new managed API called Analysis Management Objects (AMO) to look up information and request management functions, such as object processing. Some operations are done via AMO and some use XMLA scripting. In either case, the operations are performed immediately against the live Analysis Services server. Since Analysis Services supports concurrent access, the system maintains locking structures that block multiple operators from performing conflicting operations. For example, these structures prevent an operation from deleting a dimension at the same time that another operation is processing the dimension.
SQL Server Management Studio is designed as a management utility. It is not a development environment. It allows you to browse the Analysis Services server and perform various operations-related activities, such as performing backups and restores, synchronizing servers, and configuring servers. While SQL Server Management Studio can perform basic object manipulation, such as changing minor properties or deleting objects, it relies on other programs to design and deploy Analysis Services objects.
BI Development Studio in project mode (default)
A developer or database administrator (DBA) who wants to design and implement an Analysis Services database uses BI Development Studio to interact with Analysis Services. BI Development Studio is built using the Microsoft Visual Studio® shell. Performing operations in BI Development Studio is similar to performing operations with other SQL Server tools (such as creating an Integration Services package or a Reporting Services report) and with other programming tools (such as writing a Visual Basic® .NET or Visual C#® program). One of the tenets of the Visual Studio shell is that development is an iterative edit, build, and deploy cycle.
For Analysis Services, this means that once you start working with an Analysis Services project, you are disconnected from the Analysis Services server. You initially build the project by either: 1) creating dimensions, cubes, and other objects from relational sources and then deploying the project to the Analysis Services server; or 2) taking an existing Analysis Services database from a server and creating a project from it. Once the project is created, it exists disconnected from the Analysis Services server (and the database that it represents). You can take the project offline with you on an airplane, home with you on your laptop, or however you wish. Only when you deploy the project does it reconnect to the Analysis Services server.
When you deploy the project, BI Development Studio detects the differences between the objects in the project and what exists on the Analysis Services server. These differences are then applied to the Analysis Services server so that it is synchronized with the project. This might mean creating new objects that you created while offline or editing properties that you changed while offline or even deleting objects if they exist on the Analysis Services server but not in the project.
This is a very powerful paradigm. Developers develop an application and then compile and deploy it. From then on, the project exists offline on the developer’s workstation, where it can be edited and developed, then recompiled and deployed again. This is the same paradigm that is used for Analysis Services projects.
How do developers ensure that no two developers make a conflicting edit to the same piece of application code? They use a source control system. When everyone is working through the same source control system, no two developers can make conflicting changes. The same thing happens with BI Development Studio project files. This also applies to project subfiles, such as the dimension and cube files which make up the project, not unlike the C# program source files that make up an application.
BI Development Studio in directly connected mode
You might be thinking that all is right in the world as we’ve outlined two tools: SQL Server Management Studio for operators and BI Development Studio for developers. However, there are circumstances where DBAs and operators need the functionality of BI Development Studio to perform their work. This would be the case, for example, when they need to find out which measures come from which cubes or what the overall schema looks like. To facilitate this, BI Development Studio has a direct connect mode. To access this mode, start BI Development Studio, select Open from the File menu, and then select Analysis Services database.
In this mode, you are not using an offline project file. Instead, you are directly connected to the Analysis Services server. If you create a partition, a cube, or a dimension, you immediately create, update, or delete the object. You are working against the live database. There is no offline editing and thus no deployment or check-in of the underlying files.
How to get into trouble . . .
Clearly you can see the problem. Developers use a source control system to guarantee that the project file is consistent at any point in time. Operators who are using SQL Server Management Studio or BI Development Studio in directly connected mode (and application programs that use AMO) rely on the running service to ensure that two operators don’t make incompatible changes. The two mechanisms work within their own groups. However, they are unaware of each other’s changes. The running service doesn’t know about the source control system and the source control system doesn’t know about the running service. The project file does not always match the live database. Following are several examples of how this might get you into trouble.
A developer creates a project by importing an existing Analysis Services database into a project. Then an operator deletes a cube in the database by using SQL Server Management Studio. When the developer deploys the project, the cube comes back. This is announced by a message similar to that shown in Figure 1.
Figure 1: BI Development Studio detects that the database has changed since the last deployment
Notice that the message does not tell the developer which objects have changed. It only reports that the versions of the database no longer match. Since the developer does not know if the differences are major or minor, his or her only real option is to click the Yes button. Then the cube comes back.
The Analysis Services project is kept in a source control system. You create a new partition on the Analysis Services server by executing an XMLA script in a SQL Server 2005 Integration Services package. For example, a daily SQL Agent job might run the package to process incoming data files. When a new month’s worth of data starts to appear, the package creates a new monthly partition. Then a developer makes a change to the project under source control and deploys it. All of the monthly partitions that were created by the SQL Agent job disappear.
The live server has a system crash and loses a disk drive. Operators restore the database to the state it was in three months ago and then reprocess all of the databases. A developer deploys some changes pending in QA (under source control) into production causing the application to break. Operators think that the application broke because the restored backup was corrupt. They restore the system from the backup of four months ago. When the developer looks at production, he notices that none of his changes took. So the developer deploys the project again . . . which breaks production . . . so the operators then go back to a backup of five months ago . . . (you get the picture).
How to fix the problem . . .
Endless combinations of these events can occur. This brings up several questions:
How do you detect that a problem has occurred?
How do you fix the problem?
Detecting a problem is usually easy. Even a minor change such as changing the property of an object will cause the error in Figure 1 to appear. Notice that it is only the developers who are told that the versions are no longer valid in the offline project files. Operators never know that their online changes are impacting developers unless the developers tell them. If the developers do not notify the operators and if they continue to deploy their projects even if the database has changed, the operators will see their changes come and go seemingly at random.
The real challenge is how to fix the situation. Unfortunately there is no mechanism for automatically synchronizing a live Analysis Services database with an offline Analysis Services project. The only way to do this is by using the following steps.
Re-import the Analysis Services database into a new project.
Use the import function to re-create all of the files.
Use the source control system to check out the project (and its subfiles).
Restore the new files over the checked-out files.
Check in the new versions to the source control system. If you’ve done it all right, the source control system now contains all new versions of everything.
Deploy the new project so the database version numbers match.
The alternative approach is to compare the newly imported project files and the current files that are checked into source control. Once you know the differences, modify each object and property one-by-one to get the files to match. You are in effect doing a manual synchronization between the Analysis Services database and the source control system. This approach is error prone and there is no guarantee that all differences have been identified and properly synchronized.
During the reconciliation process, whether this is done manually or by replacing entire files, operators should not make any changes to the online system. Any changes they make at this time will be overridden during the final deployment. All developers must now get the latest version from the source control system, or else check out the entire project to update their local copies of the files.
Needless to say, this method involves a lot of moving parts and it is easy to get it wrong. If you are going to mix operations as we’ve described in this section, you need to publish (and ensure compliance with) detailed guidelines on how developers and operators will interact.
The Project REAL Data Warehouse is implemented in a series of multidimensional Analysis Services databases called:
REAL_Warehouse_V<n> (where <n> represents the data-masked version of the data) is the full multi-terabyte version of the database.
REAL_Warehouse_Development_V<n> is the development version of the database. It contains the full set of dimension members, but with 15 partitions (Nov 27, 2004 through March 3, 2005) in 5 stores. It is used within the REAL development team as a dataset (~15 GB) which could easily be built and tested. Its small size made sharing among the geographically dispersed development team easier.
REAL_Warehouse_Sample_V<n> is the sample version of the database. It contains a subset of the dimensional structures and a restricted number of facts. However, it covers a longer period of time (52 weeks of data), so that all of the data exists for a good demo of the system itself. It is designed for external distribution as part of the Project REAL system. This means that you will have enough data to constructively work with the system. It is also the basis for various Project REAL demos.
All of the databases have the same objects in them with the same structures. Within each database is a single cube, 11 dimensions, and 4 measure groups.
Data Sources and Data Source Views
Data sources are the starting point for your modeling activity within Analysis Services. For Project REAL, we consolidated three SQL Server 2005 relational databases into a single RDBMS, which serves as the data source for the cubes and structures used by the system.
Having a single data source for all data is not necessarily a commonly used configuration. We knew that ultimately we wanted to test what access was like between Analysis Services and the source RDBMS when they resided on servers in different domains, possibly on servers with firewalls between them. Thus, we decided early on that we would use SQL Server standard security rather than integrated security for accessing the RDBMS.
We created a dedicated SQL login for access by Analysis Services. This account name was REAL_User and its password was password. This did not create a high-security environment, but it was complex enough for our uses. In SQL Server databases, this login was only given data reader permissions.
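The account setup described above can be sketched in Transact-SQL. The login name and password are the ones used in Project REAL; the database name is illustrative:

```sql
-- Create the dedicated SQL login used by Analysis Services
-- (SQL Server standard security; deliberately low-security password).
CREATE LOGIN REAL_User WITH PASSWORD = 'password';
GO

-- In each source database (REAL_Warehouse is an illustrative name),
-- map the login to a database user and grant read-only access.
USE REAL_Warehouse;
CREATE USER REAL_User FOR LOGIN REAL_User;
EXEC sp_addrolemember 'db_datareader', 'REAL_User';
GO
```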
As part of creating and working with this data source, we noticed an interesting SQL Server 2005 Analysis Services behavior. It turns out that the data source password is encrypted internally (that is, it is not stored in clear text) and is removed whenever the data source is scripted or copied. Thus, whenever we made a copy of the database (by selecting Scripting and then Create Script on the Task menu in SQL Server Management Studio), the password for the data source in the XMLA script was set to blank. In fact, we did not find a way to copy the password into another object by using scripting, copying and pasting, or any other mechanism. No matter how we made the copy, the system always complained that the password was missing or that the wrong password was set for the data source.
Best Practice: When working with standard security, remember your password. As objects get moved around, you will be frequently asked to re-enter it.
Removing the password has both advantages and disadvantages. It makes theft of the password much harder. Even an administrator cannot copy an object and continue to use it. Administrators must re-enter the password when the copy is used for the first time. After that first use, an option on the data source specifies whether the password is stored (encrypted) inside the data source itself or must be re-entered every time the data source is opened. A malicious user who gains access as an OLAP administrator cannot copy an object and use it in some other context, or determine what the password is by examining the data source.
Unfortunately, this makes it difficult to construct an automated nightly (or weekly) rebuild of the system from scripts that are stored in a source control system. This is because the object that is stored in the source control system does not have a valid password contained within it. If your development shop runs nightly, automated rebuilds from source (for example, we do this at Microsoft), then objects that are built from that data source cannot be processed by using Analysis Services until you have set a proper password. After you extract and re-create all of the Analysis Services objects from source control, you must set a valid password for the data source before you can automate the processing of the dimensions, attributes, partitions, and other objects. This entails a security risk since the password will probably be set in clear text. So be careful how you set the password and how you control access for users who need to read those procedures.
Data source views
We created one data source view for use by the system. The data source view is an abstraction layer that is used to extend the objects (relational tables and views) that are exposed by the data source to a collection of objects from which Analysis Services objects are created. In keeping with typical relational best practices, we did not create any objects from the raw tables themselves, but always used views on the relational side.
Within the data source view we included all of the relational views that were used to create dimensions, hierarchies, and attributes. We did not include the relational views that were used to create the partitions. We also included one view as a “template table” for each measure group—this would be the basis for each of the measure group partitions. The template did not contain any data—it was just a definition of what each of the partition tables, views, query bindings, and other objects would return.
From this work, a number of best practices emerged.
Where possible, continue to use views to present a good quality star / snowflake schema to Analysis Services.
While data source views allow a DBA to construct complex structures, they are not necessarily the right place to add business value to the data. Data source views cannot be shared across databases, servers, and applications. Nor can they be shared across other BI components within SQL Server, such as between Analysis Services, Integration Services, and Reporting Services. If you add business knowledge through views, you can better reuse that expertise across all applications.
We found that extending a system by using a data source view is an excellent decision in two instances. One is when the RDBMS will not allow designers to create (or change) views. Another is when the business knowledge is related to just the multidimensional nature of the cubes and not to the relational objects and reuse is not important. (See the discussion on the Months Since Opened named query later in this paper.)
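As a sketch of this practice, a relational view can present a clean star-schema shape to Analysis Services. All object and column names here are hypothetical:

```sql
-- Hypothetical view presenting a dimension table as a clean star-schema
-- source: renames cryptic columns and hides physical/control columns
-- that Analysis Services should not see.
CREATE VIEW dbo.vTbl_Dim_Store
AS
SELECT  store_num    AS [Store Key],
        store_nm     AS [Store Name],
        open_date    AS [Open Date],
        district_num AS [District Key]
FROM    dbo.Tbl_Dim_Store;
```

Because the view lives in the relational database, the same cleaned-up shape is reusable by Integration Services, Reporting Services, and any other consumer, which a data source view cannot offer.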
Use data source views to create foreign key (FK) relationships that might not exist in the underlying data sources.
There are many reasons why FK relationships might not exist in the data sources. For example, if the data is coming from a data warehouse, it is not uncommon for data warehouses to relax referential integrity constraints even if they are known in the data model. This might be done to speed the processing of large amounts of data or to include questionable data while the cleansing phase is running. Likewise, if the data source is a production database, then referential integrity constraints might not be declared because the underlying relational database does not support them. For example, the tables might be in different databases or on different servers. Referential integrity constraints across databases and servers are not allowed in SQL. However, foreign key relationships are very useful when Analysis Services is constructing the SQL statements to gather data from the data sources when processing (and/or querying if the storage is ROLAP). Fortunately, data source views can be used to add FK relationships for Analysis Services, over and above what is defined within the data source itself.
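For example, an attempt to declare a cross-database constraint like the following fails in SQL Server (the table and column names are hypothetical), which is one reason the relationship must instead be described in the data source view:

```sql
-- Hypothetical cross-database foreign key; SQL Server rejects this,
-- reporting that cross-database foreign key references are not supported.
ALTER TABLE Sales.dbo.Tbl_Fact_Store_Sales
ADD CONSTRAINT FK_Sales_Store
    FOREIGN KEY (store_num)
    REFERENCES Personnel.dbo.Tbl_Dim_Store (store_num);
```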
It is easy to get carried away with data source views and create too many relationships.
It is important to realize that if your data is already contained in a star schema, then foreign key relationships are not needed. The relationships are typically used so that Analysis Services knows how to construct the underlying SQL statements for processing. The relationships are not needed for browsing or manipulating objects or queries. In the Project REAL database, we have no data source view relationships defined and our system builds quite nicely.
Extending metadata by using a data source view
Extending metadata by using a data source view is easy and straightforward. For Project REAL, we used extended metadata in a number of ways.
We used extended metadata to create computed columns that convert the data used in discretization grouping. For example, the Item dimension has an attribute named Publish Year. However, the year itself has few analysis capabilities. What differentiates a book published in 1934 from one published in 1992? A better attribute for analysis is the number of years since publication. So, we created the following computed column in the Item table (using the data source view).
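A minimal sketch of such a computed column follows, assuming the publication year is stored in an integer column named publish_year (the column name is an assumption):

```sql
-- Computed column added in the data source view. Because the Item table
-- includes books scheduled for future publication, the result can be
-- negative (for example, -3 for a book due three years from now).
YEAR(GETDATE()) - publish_year AS [Years Since Published]
```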
This statement uses standard SQL functions to convert the year of publication to a variable covering the number of years since publication.
Analysis Services supports the automatic discretization of a continuous variable (such as the years since publication). Using data mining algorithms, Analysis Services clusters values into statistically significant groupings. In this case, we ended up with the following ten groupings.
-3 to 1 years (because the Item table includes books that are scheduled for publication 3 years into the future, -3 is valid data.)
We used this statistical approach for the following attributes.
Years Since Published grouping
Retail Price grouping
Total Num of Pages grouping
Linear Ft grouping (the amount of shelf space)
Square Ft grouping (the size of the store)
For discretization groups with known thresholds (rather than statistical), we used a slightly different approach. For these, we created computed columns with thresholds hard coded in a Transact-SQL CASE statement. For example, here is the computed column called Age Of Store in the Store data source view. It uses the CASE statement to “bucketize” the source attribute (open_date).
CASE
    WHEN DATEDIFF(mm, open_date, GETDATE()) BETWEEN 0 AND 12   THEN 'Less than a year old'
    WHEN DATEDIFF(mm, open_date, GETDATE()) BETWEEN 13 AND 24  THEN '1 - 2 years old'
    WHEN DATEDIFF(mm, open_date, GETDATE()) BETWEEN 25 AND 60  THEN '2 - 5 years old'
    WHEN DATEDIFF(mm, open_date, GETDATE()) BETWEEN 61 AND 120 THEN '5 - 10 years old'
    WHEN DATEDIFF(mm, open_date, GETDATE()) > 120              THEN 'Older than 10 years'
    ELSE 'Unknown'
END
While the statistical technique doesn’t require hard coding, it also doesn’t allow the business user to identify how they want the groupings calculated. Here we can embed business knowledge directly within the data source view.
To sort Age Of Store correctly, we created a computed column called Months Since Opened and defined it as:
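Consistent with the DATEDIFF calls in the Age Of Store CASE expression, a minimal sketch of this computed column is:

```sql
-- Whole months elapsed since the store opened; serves as the sort key
-- for the Age Of Store attribute.
DATEDIFF(mm, open_date, GETDATE())
```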
Then we added both fields to the Store dimension as attributes. We created a relationship between Months Since Opened and Age of Store and set the Order By properties of Months Since Opened to Key. We set the Order By properties of Age of Store to Attribute Key. This configuration allows Age of Store to sort properly (by Months Since Opened) and it allows both attributes to be included in queries.
Omitting objects from a data source view
While the data source view is a powerful abstraction, there are times when you should not include objects in the data source view. For example, you should not include the following objects in your data source view.
Unnecessary tables and views (particularly for partition tables). If a table or view is not to be used in dimensions, or as a template table for a measure group, then don’t include it. Doing so just makes the data source view diagram more cluttered and difficult to navigate.
Business object tables that are not used by the system. For example, if the Personnel data source is included in the data source view so that it can be used, then add the Employee table but do not add anything else. Don’t include all of the other reference tables and entity tables that are used by the Personnel application.
Control tables used by other applications. For example, in the Personnel data source you may have tables that are used to control batches or the ordering of ETL processes and update streams for the Personnel application. Do not include those tables.
Modifying data source views
At some point you will want to modify a data source view. For example, you might change a field in one of the data source tables from an integer to a varchar, or you might increase the size of a varchar from 20 characters to 50 characters. Or, you might change the dimension definition, which then needs to be propagated up to the cube.
As you build upon objects such as dimensions from the data source view, and then add the dimension to a cube, and then add partitions to the cube, the existing metadata is propagated up the dependency object tree. Making a change at a lower level does not necessarily mean that the change will be automatically moved up to higher objects. For example, suppose you change the data type of a field in the RDBMS dimension table. The data source view will not change automatically. The data source view must be refreshed to pick up the changes.
Likewise, if the data source view changes, objects that are built upon it will not change automatically. If a dimension or cube attribute is built upon the field, then you must re-create the object from the data source view. In fact, you can assume that when you make a change, you will have to manually update the higher objects. How you make the change up through the system depends on what the change is. Here are some guidelines.
If the change is a simple change in data type size, such as changing a database field from varchar(20) to varchar(50), right-click the background of the data source view and select Refresh. This updates the data source view with the new definition. There may be objects that still have the old value of 20 embedded in them. For fields that are used as measures in a measure group, edit the cube and look at the data type and size of the measure. For fields that are used as attributes, edit the dimension and change the attribute values.
If the change is the addition of a field, click Refresh on the data source view. This makes the field available so that Analysis Services objects can be created from it (that is, attributes for dimensions; measures for measure groups).
If the change is the deletion of a field, what happens when the data source view is refreshed depends on whether or not the field was used to create Analysis Services objects. If the field was not used to create Analysis Services objects, click Refresh on the data source view and the field is removed. Otherwise, when you click Refresh on the data source view, the system tells you which objects will be deleted (since they were based on this field).
If the change is the addition or deletion of a table, right-click the data source view and select Add/Remove Tables. Deleting a table that is in use by Analysis Services is by far the most disruptive change. Before you delete the table, first go to the higher objects in the cube. If the table is used in a measure group, delete the measure group. If it is a dimension table, then first remove the dimension from any cubes and then delete the dimension.
Making changes incrementally allows you to assess the impact of delete operations in a methodical way. If you just remove the table from the data source view, the system will tell you which higher-level objects will also be removed. (This might be a long list if the table was used in many places.) When you click OK, the table and the specified objects are removed. While this is an easy one-step operation, it does not let you assess the impact of far-reaching changes, so it is better to proceed slowly and in stages.
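The first guideline above, widening a character field, corresponds to a relational change along these lines (the table and column names are hypothetical); after running it, refresh the data source view so the new size is picked up:

```sql
-- Hypothetical widening of a dimension attribute from varchar(20)
-- to varchar(50); the data source view must then be refreshed.
ALTER TABLE dbo.Tbl_Dim_Store
ALTER COLUMN store_name varchar(50) NOT NULL;
```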
The change might be to the logical structure rather than to the physical structure. For example, you might change the key structure of an attribute. The physical data source view hasn’t changed, but anything that uses this new logical structure will have to be changed, either by re-creating it or by refreshing the structures.
For example, it so happens that when you use a dimension on the grain of a cube, the key structure is embedded in the cube. To reset the structures, just change the grain of the cube to some other attribute, and then reset it back to the original attribute. This rebuilds the structures so the new key definitions are used. We encountered this when we had to adjust the key collection for the Project REAL time dimension.
When making attribute keys unique, it is common to create a key collection rather than to use a single field. In the early versions of the Project REAL schema, we had a much simpler time dimension that had nonunique keys for the Month attribute (that is, 1-12). To make the Month attribute key unique, we added the Year field to the Month key. When we did this, we found that we could no longer deploy the cube. The dimension built without error and browsed with the proper structure, but we got an error when we tried to deploy the cube. This was because the cube’s time dimension had one field configured for the key of the Month attribute, whereas the new key in the dimension had two fields. To fix this, we synchronized the Month attribute in the cube dimension.
To synchronize, we changed the grain of the cube from Time.Day to Time.Week. Then we saved the cube and reset it so that it used Time.Day again. The reset of the embedded dimension also reset the key structure so it now had two fields for the Month attribute. The problem was fixed.
In summary, data source views have two important roles in Analysis Services.
First, they are a layer of abstraction between the objects that are used by Analysis Services and the data source. This allows you to create objects such as named queries and computed columns, which could otherwise be created in the data source itself (for example, in relational views). This is important because Analysis Services administrators might not have the rights necessary to make metadata changes in the source system. For example, the source system may be a corporate SAP system that maintains a master personnel roster. As a corporate resource, the DBA administering the Analysis Services server would not have rights to add, remove, or change views in the SAP system. By placing the changes in the data source views, the DBA gains the benefits of a layer of abstraction without having to change the source system itself.
Second, data source views allow you to create relationships between tables and views that do not physically exist in the database, or that are impossible to create because the tables span databases. For example, you cannot create foreign key constraints between tables that are stored in two different databases. These relationships are important for dimensions that are constructed from 3NF or OLTP databases.
Don’t get carried away with data source views. They are a tool—not a panacea. If you develop a good multidimensional design, you will probably find that they are really just not that interesting.
In Project REAL we have a wide range of rich and complex dimensions. This was one of the things that drew us to this particular application. There is a lot of analysis based on the roles that vendors play. There are also a lot of semiadditive measures, such as quantity on-hand and other types of inventory data. This multidimensional design will be very familiar to anyone who has developed or maintained a retail database.
Dimensions contain the data that is typically used to filter or look at the measures in the fact table. Dimensions within the cube map to relational dimension tables.
The logical schema has two types of keys: surrogate keys and business keys. Surrogate keys are used for internal relationships between tables. Business keys (sometimes referred to as natural keys) serve as the identifiers to entities from the business itself. So, for example, while vendors are internally specified by their surrogate key (an integer with an identity property, such as 1-n), to the outside world their natural key attribute is the vendor number—for example, Vendor #7233703. For fact tables, the ETL process maps business keys to surrogate keys so that when records are exposed to Analysis Services, the translation has already occurred.
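The business-key-to-surrogate-key translation that the ETL process performs can be sketched as follows. This is a minimal Python illustration, not the actual Project REAL implementation (which is an Integration Services package); the table and field names here are hypothetical.

```python
# Hypothetical dimension lookup: business (natural) key -> surrogate key.
# The ETL process performs this translation so that fact records reach
# Analysis Services already carrying surrogate keys.
vendor_dim = {
    "7233703": 1,   # Vendor #7233703 maps to surrogate key 1
    "8841002": 2,
}

def translate_fact(fact_rows):
    """Replace the business key in each fact row with its surrogate key."""
    out = []
    for row in fact_rows:
        sk = vendor_dim[row["vendor_number"]]  # business key -> surrogate key
        out.append({"vendor_sk": sk, "sales": row["sales"]})
    return out

facts = [{"vendor_number": "7233703", "sales": 12.50}]
print(translate_fact(facts))  # fact rows now reference surrogate keys
```

With Type 2 slowly changing dimensions, the real lookup also has to pick the change record that was current when the fact occurred, which is what makes the production ETL considerably more involved than this sketch.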
In our case, surrogate keys are used to track Type 2 slowly changing dimensions (SCDs). You will notice that all of the dimension key attributes use the surrogate key fields. We cannot use the natural business keys for them because multiple Type 2 change records will have the same business key. While the various types and uses of SCDs are beyond the scope of this white paper, the data model itself uses Type 1 and Type 2 SCDs with inferred members for inserts. This makes the ETL process complex and very interesting. For more information on SCDs, see the Project REAL: Business Intelligence ETL Design Practices white paper. We also recommend Ralph Kimball's book The Data Warehouse Toolkit (Wiley, 2nd edition, http://search.barnesandnoble.com/booksearch/isbnInquiry.asp?isbn=0471200247).
While dimensions give context and meaning to measures in reporting, the real analysis objects used by end users are hierarchies. SQL Server 2005 Analysis Services has two types of hierarchies: attribute hierarchies and user-defined hierarchies.
A hierarchy contains a collection of levels; each drilldown step in the hierarchy is a level. For example, the Time hierarchy has the Year, Quarter, Month, Week, and Day levels. The Book hierarchy (or what we call Item in Project REAL) might have Type, Subject, Category, SubCategory, and Item levels.
There are two types of user-defined hierarchies (although both types are built in the same way and look exactly the same).
The first type of user-defined hierarchy is a reporting hierarchy. The purpose of this hierarchy is only for reporting. The underlying data does not support a real relationship between the various levels. For example, we have a reporting hierarchy that was set up as All -> Author -> Item. This is for reporting purposes only because the item determines the author, not the other way around. Placing the author above the item in the hierarchy results in a many-to-many relationship.
Another type of hierarchy is the natural hierarchy, sometimes called a strong hierarchy. In a natural hierarchy, the members at each level have a many-to-one relationship with the members at the level above. Days roll up to weeks; weeks roll up to months; months roll up to quarters; quarters roll up to years. This allows the system to pre-calculate rollups, or aggregations: Store Sales can be subtotaled by city, the city subtotals by state, the state subtotals by region, and the region subtotals by country (subtotaled up the hierarchy). You can break and subdivide a natural hierarchy at any point because each member has one and only one parent.
Natural user-defined hierarchies are not new in SQL Server 2005; they exist in earlier versions of Analysis Services. What is new is the reporting hierarchy and the way hierarchies are built, using attribute hierarchies.
An attribute hierarchy is a new type of hierarchy in SQL Server 2005. It is based on attributes or fields within the dimension table. Virtually any field in the database could be an attribute of the dimension. When an attribute is defined for a dimension, an attribute hierarchy is created for it. This means that if the time dimension table has a Holiday field that uses the values 0 or 1 to indicate if a day is a holiday, then you can create a Holiday attribute hierarchy on the field. Attribute hierarchies are flat. A hierarchy is flat when there are only two levels: the ALL level (which sums up to get a total for the attribute) and then the attribute level itself. Since Holiday exists as a stand-alone hierarchy independent of its use in a user-defined hierarchy, end users can analyze sales by Holiday. SQL Server 2000 Analysis Services has a similar concept called a virtual dimension. Using virtual dimensions you can create a dimension off of a member property, but virtual dimensions are limited in scope and flexibility.
Hierarchy-based vs. attribute-based systems
This is an important and fundamental difference between the two releases of SQL Server.
SQL Server 2000 Analysis Services is a hierarchy-based system. Internal structures are oriented around hierarchies and levels. This includes things like aggregates, which are a combination of levels, one for each dimension; and virtual dimensions, which promote member properties to dimensions. A dimension can have only one hierarchy. Multiple hierarchies are a naming convention. Two Time hierarchies, such as Time.Calendar and Time.Fiscal, might look as if they are related to Time, but internally these are two dimensions that just happen to have Time in their names. A dimension named Product was also a hierarchy named Product. There is no real distinction between dimensions and hierarchies.
SQL Server 2005 Analysis Services is an attribute-based system. Attributes are the fundamental building block of objects. Aggregates are combinations of attribute subtotals. By default, attribute hierarchies are automatically created for attributes. A dimension can also have multiple hierarchies. Time.Calendar and Time.Fiscal are two hierarchies (Calendar and Fiscal) of a single time dimension. User-defined hierarchies (such as Calendar and Fiscal) are purely navigational entities. They exist only to assist in the organization of attributes. As you will soon see, this concept will come up again and again as we go through this section on dimensions.
So much for the logical structures used in dimensions. Now let us look at a physical characteristic of dimensions: their use of memory.
Dimension cache in Analysis Services 2005
SQL Server 2000 Analysis Services and SQL Server 2005 Analysis Services handle dimension members differently. In SQL Server 2000, all dimension members in all databases had to be loaded into the server's address space at startup. Memory for other uses, such as process buffers and the data cache, then came over and above dimension memory. This was very limiting. It meant that on a 32-bit system (which is limited to 3 GB of process virtual address space, since Analysis Services is not AWE-aware), the maximum number of total dimension members was typically a few million. If you limited the number of member properties and kept names and keys short, you could have perhaps 3 to 4 million members. Beyond that you were forced to use a 64-bit server to get the larger virtual address space. The original database, as it came to Project REAL, was from a 64-bit server because the Item dimension consisted of almost 7 million members and the Customer dimension of almost 5 million members. This is far beyond what is possible in SQL Server 2000 Analysis Services on 32-bit hardware.
With SQL Server 2005, Analysis Services uses a dynamic dimension cache rather than keeping all of its members statically mapped into memory. The system brings members into memory as they are needed. It then releases them when other dimension members are needed. We successfully built the entire Project REAL system (in fact, many versions of the entire system) on both 32-bit and 64-bit hardware. This bodes well for larger systems running on 32-bit hardware.
We have found dimensions which cannot be built using 32-bit hardware; but the threshold is much larger than with SQL Server 2000. The key determining characteristic is the hash table that is constructed to process the attribute hierarchy for the key attribute of the dimension. As you get more attributes (or if the size of the attributes is too large), then ultimately the available memory is exceeded and the dimension cannot be processed.
For dimensions that fall just at the threshold of being too large to fit into memory, there are two ways to reduce the memory required to process them.
First, make sure that all of the attributes are truly needed for analysis. If they are not needed, then remove them from the dimension. Then the space needed for the hash table is smaller.
Second, use cascading attribute relationships to minimize the number of attributes which are directly dependent on the key. Every attribute must be directly or indirectly related to the key of the dimension. When you first create the dimension, you will notice that all of the attributes are directly related to the key. This is the nature of the key attribute: all of the attributes are related to it. However, as natural user-defined hierarchies are defined for the dimension, some of the attributes pick up additional relationships. They become both directly related to the key and also indirectly related to the key attribute through the natural hierarchy. For example, if you know the Day in the time dimension, then you also know the Year (the year is implied directly by the day). The Day also implies the Week, which implies the Month, which implies the Quarter, which implies the Year. Thus, by defining this natural user-defined hierarchy you can remove Week, Month, Quarter, and Year from being directly related to Day. If indirect relationships exist, then delete the direct relationships.
Because these dependent attributes are also related to the key through the attribute relationships of the natural hierarchy, their direct relationships to the key attribute are redundant and can be removed. For a good example of this, look at the time dimension in Figure 2.
Figure 2: A good design for a time dimension
Compare how the attribute relationships are defined for the calendar attributes and how they are defined for the fiscal attributes. Most of the calendar attributes are not directly related to the key attribute (Day); only the week is. Instead, most of the calendar attributes cascade up the calendar natural hierarchy. The fiscal attribute relationships are also in place, but the attributes are directly related to the key attribute.
The approach taken for calendar attributes is the preferred approach. If you have defined natural user-defined hierarchies, or other attribute relationships which have a cascading effect, then you must remove the direct relationships. There are two reasons for this. The first is that it consumes less memory. The second is that when the Aggregation Design Wizard runs, it uses the indirect relationships to decide which aggregates to design for the database. Having the redundant direct relationships will result in sub-optimal (slower) aggregate designs.
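The redundancy that makes the direct key relationships removable can be illustrated with a short sketch (plain Python, not part of the Project REAL implementation; the attribute names follow the time dimension example):

```python
# Cascading attribute relationships expressed as parent links:
# each attribute points to the next attribute up the natural hierarchy.
cascading = {"Day": "Week", "Week": "Month", "Month": "Quarter", "Quarter": "Year"}

def indirectly_related(key_attr):
    """Attributes reachable from the key attribute through the cascading
    chain. Direct relationships from the key to these attributes are
    redundant and can be deleted, shrinking the processing hash table."""
    reachable, cur = set(), key_attr
    while cur in cascading:
        cur = cascading[cur]
        reachable.add(cur)
    return reachable

# With Day as the key attribute, Week, Month, Quarter, and Year are all
# implied indirectly, so none of them need a direct relationship to Day.
print(indirectly_related("Day"))
```

The same walk is, in effect, what the Aggregation Design Wizard relies on: it follows the indirect relationships to decide which attribute subtotals are worth materializing.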
In Figure 2, note the relationships that are defined between attributes. Note the attribute relationships between day to week; week to month; month to quarter; and quarter to year. What does it mean when you define relationships like that? Interestingly, it means that there is a many-to-one natural hierarchy in place. Given the week, you know the one and only one month that it rolls up to. Given the month, you know the one and only one quarter that it rolls up to. Given the quarter, you know the one and only one year that it rolls up to. The system can then optimize its calculations so that rollups are done up the hierarchy.
Best Practice: Spend time with your dimension design to capture the attribute relationships within the dimensions.
Important: You must define attribute relationships if you want to design effective aggregates, if you want effective run-time calculations from the formula engine, or if you want valid results in MDX time functions.
In the hierarchy-based nature of SQL Server 2000 Analysis Services (which only supports natural hierarchies), aggregates are designed around hierarchies. In SQL Server 2005 Analysis Services, aggregates are combinations of attributes. User-defined hierarchies are not used. The Storage Design Wizard uses attribute relationships to determine when combinations of attribute rollups will be useful (and thus aggregates will be designed for those attributes). Without relationships, one attribute is as significant as any other attribute, so the Storage Design Wizard simply ignores the attribute and uses the ALL level for the dimension. Thus if you want to design effective aggregates, you must define attribute relationships. Without them the system still returns the proper number, but values must be calculated at run time and aggregates are not useful.
Attribute relationships are also used by the formula engine when calculating complex MDX expressions. Without attribute relationships, many operations such as non-empty cross joins cannot be optimized or handled effectively. Thus if you want effective run-time calculations from the formula engine, you must define attribute relationships.
Lastly, attribute relationships are critical with time dimensions. Many of the MDX functions that are related to time return valid results only if the attribute relationships and attribute types have been set up properly. Normally you can rely on the BI Time Intelligence Wizard to set up the proper structures and relationships. However, if you are manually defining a time dimension and you are using MDX time functions, then you must define attribute relationships within your time dimensions if you want valid results.
There are two types of attribute relationships. The first type of relationship, which we have been discussing so far, is used by the system to express where rollups are useful, that is, when to design aggregates. The reader who is familiar with SQL Server 2000 Analysis Services will notice that there is another type of relationship as well. This is the classical member property.
For example, we have an attribute in the Project REAL logical schema that is the regional manager’s name and another for his or her phone number. These attributes are denormalized into the Store dimension table. These attributes differ from other attributes, such as region, district, and store since they are used for display purposes and not for analysis. Analysis is done for the region and not for the manager’s phone number. While the manager’s name and phone number are not typically used for analysis, they are related to the region. If you know the manager’s name or phone number, then you automatically know what region the manager is in. Thus, you should express a relationship between these two attributes. While this kind of relationship isn’t useful for analysis, it is useful for end-user client tools. Since member properties are expressed as relationships, the relationships must be in place so that the client tool can display a list of member properties when the end user right-clicks the member.
In summary, to ensure that you design and implement a system that performs well, specify the attribute relationships. Ignore them at your peril.
SQL Server 2005 requires that an attribute be identified as the key attribute in the dimension. This key must be unique. The system will force you to guarantee uniqueness when it processes the dimension. This is particularly challenging for DBAs who are familiar with SQL Server 2000 Analysis Services.
The hierarchy-based nature of SQL Server 2000 Analysis Services allows a DBA to express key uniqueness in several ways. Member keys can be set as unique across the whole dimension. For example, there might be only one member with key #42 among all members in the dimension. Or, uniqueness can be set across a level. For example, the Subject level can have a member with key #42, but so can a member at the Item level. Or, uniqueness can be set to off. In this case, there can be multiple members with key #42; the only restriction is that no two members with key #42 can have the same parent. Internally, the system uses the fact that there is only one hierarchy for the dimension to create a unique key even if you didn't specify one. The system constructs a unique key from the combination of keys going up the hierarchy and uses that to guarantee uniqueness.
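The way a single fixed hierarchy lets the system manufacture a unique key can be sketched roughly as follows (illustrative Python; the level names echo the Item hierarchy, and the key values are hypothetical):

```python
# Members identified as (level, key); parents maps each member to its
# parent in the dimension's single hierarchy.
def path_key(member, parents):
    """Build a unique key from the member's own key plus the keys of
    all of its ancestors up the hierarchy."""
    path = [member]
    while member in parents:
        member = parents[member]
        path.append(member)
    return tuple(reversed(path))

parents = {("Item", 42): ("Subject", 42), ("Subject", 42): ("Type", 7)}

# Key #42 appears at two different levels, yet the full ancestor path
# is unique, which is how SQL Server 2000 got away without real keys.
print(path_key(("Item", 42), parents))
```

Once a dimension can have many hierarchies, or none at all, there is no single path to walk, which is why SQL Server 2005 pushes the uniqueness requirement onto the attribute keys themselves.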
In SQL Server 2005 Analysis Services, a dimension can have many hierarchies. Or, it might have no user-defined hierarchies and only attribute hierarchies. What does this mean for key uniqueness? Since an attribute must be able to stand on its own without a structured hierarchy, such as in a flat attribute hierarchy, what does key #42 mean? Ultimately, it means that keys must be real keys. That is, they must uniquely identify one and only one member.
Besides the unique key attribute, the system also looks at key uniqueness for every attribute. While this is not enforced, as we will soon see, it can lead to unexpected results if you don’t set up the dimension to guarantee attribute key uniqueness.
Best Practice (required): You must always ensure attribute key uniqueness.
First, we will explain how you specify a unique key for an attribute, then we will demonstrate what happens if you do it wrong.
Assume that you create a standard time dimension for your project (as we did with Project REAL). As an attribute in that dimension, consider the Month field. It contains a number from 1 through 12 in the association that you would expect: 1 = January, 2 = February, and so on until 12 = December. On the surface this looks as if the uniqueness requirement has been met. But how does the month for Jan 2003 compare to the month for Jan 2005? In reality, the key is not 1 through 12. The key is a combination of the month and year. To a relational expert this is not a new concept; it is simply a concatenated (composite) primary key. However, for some who are used to SQL Server 2000 Analysis Services, it is a new idea. With SQL Server 2000, you could set the uniqueness at a level and then forget about it. With SQL Server 2005, we need to undo some old habits.
Fortunately, it is easy to specify a unique key. Simply look at the Key property of the attribute. You will see that you can specify a collection of fields in addition to a single field. Each field that you add to the collection increases its uniqueness (as you build the concatenated key). For the Month key, we need to create a collection of the Month and Year fields as shown in Figure 3.
Figure 3: Specifying a key collection for month and year
Let us see what happens before and after making these changes.
Figure 4 shows what the Project REAL Time.Calendar hierarchy looks like when you use nonunique keys for Quarter, Month, and Week. In the underlying RDBMS, the keys range from 1 through 4, 1 through 12, and 1 through 52 respectively. The problem is that these keys are not unique across the dimension. For example, how do you differentiate between January 2001 and January 2002, or Q2 2003 and Q4 2004? To make the keys unique, you need to create a concatenated key with Year. Thus, the Quarter key should be (Year, Quarter), the Month key should be (Year, Month), and the Week key should be (Year, Week). In this case we did not create concatenated keys. You can see the results of this in Figure 4: years without quarters, wrong quarters in years, and other incorrect results.
Figure 4: Time dimension prior to setting up month key uniqueness
Figure 5 shows what the same hierarchy looks like once we add Calendar_Year_ID to the respective key collections for Quarter, Month, and Week. The results are now correct.
Figure 5: Time dimension after setting up month key uniqueness (by creating the key collection)
It is also interesting to look at the row counts in the processing dialog box. Before we created the concatenated key with Year, the Month attribute hierarchy returned 13 records. This is what you would expect when using a SELECT DISTINCT statement against just the month keys plus an unknown member. After the Year key was added, 219 records were processed since the SELECT DISTINCT now includes years.
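The difference in processed record counts is just the difference between two distinct-key queries, which a small sketch makes concrete (illustrative Python over a synthetic two-year calendar; the counts here are not the actual Project REAL row counts):

```python
# Month alone (1-12) is not a unique key across years; (Year, Month) is.
# A synthetic two-year calendar stands in for the time dimension table.
days = [(year, month) for year in (2003, 2004) for month in range(1, 13)]

# What processing effectively runs is a SELECT DISTINCT over the key
# collection, so the key definition determines how many members exist.
months_only = {month for (_, month) in days}   # nonunique key: Month
year_month = {(y, m) for (y, m) in days}       # concatenated key: (Year, Month)

print(len(months_only), len(year_month))  # 12 distinct vs. 24 distinct keys
```

Scaling this up, 12 month keys (plus the unknown member) against a concatenated key spanning many years is exactly the 13-versus-219 jump observed in the processing dialog box.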
Several comments are in order. First, if you do this exercise using the sample database provided with Project REAL, you will notice that no errors are reported (warning or fatal) when the dimension is processed. You have to scan your dimensions and notice that members are not coming out where you expected.
Note that key uniqueness is tied to related attributes. If you have not specified related attributes, then all attributes are related to the key. Since the key must be unique, all is well. However, when you start specifying related attributes, you are saying something implicitly about those attributes. You are saying that the keys are unique. Notice that in the previous example, the related attribute relationships were in place when we found that key nonuniqueness was causing a problem.
Had we been unable to generate a unique key, our only alternative would have been to remove the related attribute. Had we done that, the attributes would be directly related to the key attribute—and key attributes must be unique. Not having the related attributes makes it impossible to produce a good quality aggregation design. So the bottom line is that you should respect the word “key” in Analysis Services. Routinely ensure throughout your design process that keys are unique across the dimension.
Best Practice: Always scan your dimensions to ensure that they have the wide, even distribution of members that you expect.
If you see a lopsided distribution, then in all probability one of two things has occurred. Either unique keys have not been identified for the underlying attributes, or your hierarchy is ordered incorrectly. For example, you inadvertently specified Year, Month, Quarter rather than Year, Quarter, Month. Both of these situations result in unusual hierarchies even though no errors are reported during processing.
If you are going to define attribute relationships, they must be based on valid patterns in the data. Attribute relationships are critical to designing and implementing a high performance system. What happens if the key structure directs the system to perform rollups which are wrong? For example, what if you reverse two of the levels in the natural hierarchy?
Suppose you define the Store natural hierarchy in the order (ALL, Region, District, City, Store) when the order of the data is really (ALL, District, Region, City, Store). Natural hierarchies are built from attribute relationships. In this case, the single-parent assumption that is built into an attribute relationship is not supported by the data. An example of this might be when a region rolls up to many districts. Thus, the calculations are wrong. Numbers are counted twice and miscalculated in the hierarchy at run time. The system runs poorly and returns the wrong answers. So be careful when defining attribute relationships, because no error is returned if they are incorrectly defined.
Best Practice: If you tell the system that attributes are related, then they must be related.
The data must support the relationships. In addition, the uniqueness of the keys must allow these relationships to be executed.
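Because no error is raised when a declared relationship contradicts the data, it is worth checking the single-parent assumption yourself before declaring an attribute relationship. A minimal sketch of such a check (plain Python; the column names are hypothetical, and in practice you would run the equivalent query against the dimension table):

```python
from collections import defaultdict

# An attribute relationship asserts that each child member has exactly
# one parent. Validating that against the data before declaring the
# relationship avoids silent double-counting at query time.
def violates_single_parent(pairs):
    """Return the child members that appear under more than one parent."""
    parents = defaultdict(set)
    for child, parent in pairs:
        parents[child].add(parent)
    return {child for child, ps in parents.items() if len(ps) > 1}

# (Region, District) pairs: region 'West' rolling up to two districts
# breaks the many-to-one assumption of a Region -> District relationship.
data = [("West", "D1"), ("West", "D2"), ("East", "D3")]
print(violates_single_parent(data))  # {'West'}
```

An empty result means the data supports the relationship; any member in the result is one that would be counted twice if the relationship were declared anyway.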
This raises an interesting question. What happens if you simply don’t define related attributes? Surprisingly, this results in a valid hierarchy that returns the correct answers. See Figure 6.
Figure 6: Time with no attribute relationships defined
The hierarchy in Figure 6 looks valid. When you query the system, the numbers add up correctly. The problem is that without related attributes, the system is unaware of the underlying strong relationship between attributes. This relationship indicates that rollups can be precalculated up the hierarchy. (SQL Server 2000 Analysis Services uses aggregates for this purpose.) Instead, the system does the rollups by using its run-time system at query calculation time. Since aggregates are not designed, you haven’t told the system that the attributes in the hierarchy are related. So you get a clean hierarchy and good numbers up the hierarchy, but you cannot use precalculated aggregates to calculate the subtotals—and performance suffers.
Converting virtual dimensions to attribute hierarchies
Unfortunately, we cannot include in our Project REAL documentation the original SQL Server 2000 Analysis Services design that we started with before we converted it to SQL Server 2005. The final design after conversion is much simpler than the original. Almost half a dozen dimensions have been eliminated. In their place are attribute hierarchies, mostly in the Item dimension. For example, the system carried five virtual dimensions that were built on the vendor relationships in the Item dimension. So in the original design there were dimensions for Source Vendor, Return Vendor, and Purchase Vendor, among others. In the final Project REAL design for SQL Server 2005, these have been replaced with the Item.Source Vendor, Item.Return Vendor, and other attribute hierarchies. The original design included a Department regular dimension. (This dimension was built from a view on top of the Item dimension table.) In the final design for SQL Server 2005, this has been replaced with the Item.Department attribute hierarchy.
Best Practice: Convert your SQL Server 2000 Analysis Services virtual dimensions to attribute hierarchies.
If a virtual dimension has more than one level, then convert it to a user-defined hierarchy with the corresponding attributes as levels. The Migration Wizard will do this for you automatically. However, you should be aware of the automatic conversion when you are either doing migrations manually or redesigning the system in SQL Server 2005.
If there is just a single level in the virtual dimension, then delete the virtual dimension from your data model (and application code) and use the attribute hierarchy directly.
When you look at your 2005 designs, consider each and every dimension and ask yourself whether or not the entity is really an attribute of another dimension. If so, you can replace it with an attribute hierarchy. This reduces complexity and has more meaning for end users. They see the attribute in the context of the dimension it belongs to, which adds to their understanding of the underlying multidimensional design.
For example, the original Project REAL design included a virtual dimension called Department. It identified the store department (such as hardback or bargain books) in which the book would have been purchased. The Department virtual dimension was based on a member property of the Item dimension. Rather than keeping it as a stand-alone dimension, we converted it to an attribute hierarchy in the Item dimension. Thus, end users clearly see that it is related to Item and not to Stores or another dimension.
Possible naming conflicts
You could run into naming conflicts when working with user-defined hierarchies. This is because user-defined hierarchies share the same namespace as attribute hierarchies.
In SQL Server 2000 Analysis Services, there were two types of names: dimension/hierarchy names and level names. In Analysis Services 2005, there are three namespaces: dimensions, user-defined hierarchies, and attribute hierarchies. A potential naming conflict arises because user-defined hierarchies and attribute hierarchies have the same namespace.
For example, you have a dimension named Buyer. Since many front-end tools only expose hierarchy names, you are tempted to create a hierarchy that is also named Buyer. You could do this in Analysis Services 2000. The challenge is that if you create a hierarchy named Buyer, you cannot have an attribute hierarchy named Buyer. In Project REAL, we renamed the attribute hierarchy to Buyer Name. All of the objects that are circled in Figure 7 must be uniquely named.
Figure 7: Naming conflicts between attribute hierarchies and user-defined hierarchies
You cannot have an attribute hierarchy named Return Vendor and a user-defined hierarchy named Return Vendor. Thus in our design, we extended the typical naming convention that is used in SQL Server 2000 Analysis Services. In complex situations where names are frequently reused, we named user-defined hierarchies "By <name>" (where name represents the attribute that is being analyzed in the user-defined hierarchy). There are exceptions to this rule. Sometimes a well-known relationship exists that does not need clarification. A good example of this is Time.Calendar and Time.Fiscal. Following the "By <name>" convention here does not feel right; Time.By Calendar and Time.By Fiscal seem unusual. But the naming convention is a useful trick for working with user-defined and attribute hierarchies, which share the same namespace.
Promoting fields in relational tables to attributes in the multidimensional design
As we went through the Project REAL multidimensional design, we constantly asked ourselves whether or not to include a particular relational field from the dimension table in the data source view as an attribute hierarchy. We called this promoting the field in the data source view.
Figure 8: Promoting a field in the data source view to an attribute of the dimension.
For example, the Item dimension has more than 144 available fields in the dimension table in the data source view. Creating attributes for all of these fields is not an efficient use of resources since many of the fields will never be used for analysis. Ultimately we decided to include 44 of these fields as attributes in the dimension.
An interesting design question is whether every field in the dimension table should be added (or promoted) as an attribute hierarchy. By default, the Dimension Wizard does this: it promotes every field to an attribute hierarchy. You must manually clear the check box next to a field to prevent that field from being promoted. Whether declining to promote a field is a good choice depends on the complexity and richness of the relational data source.
We came up with some general guidelines for when and when not to promote a field to an attribute. You might find this information useful.
First, decide whether or not you need to do analysis on the field. Attribute hierarchies are the principal method that end users use to perform analysis. They click and drag, and they slice and pivot, using attribute hierarchies as the objects they select. If a field is not an attribute, then it cannot be manipulated and analyzed.
The true power of the Unified Dimensional Model (UDM) is that it assumes a rich set of attribute hierarchies that are available for analysis. This analysis may use either predefined objects, such as user-defined hierarchies, or objects that are available for ad hoc manipulation. Attribute hierarchies are used for ad hoc analysis. For example, you might want to analyze the regional selling patterns of hardcover books compared to paperback books. The user-defined Store.ByGeography hierarchy allows you to drill down to the organization level you want, that is, into regions and districts. On the other hand, you could drag the Item.Department attribute hierarchy onto the grid and nest it inside a region to see subtotals based on hardcover versus paperback books. If you do not promote the Department field to an attribute hierarchy, you cannot perform this kind of ad hoc analysis.
If you might wish to later perform an analysis on a field, then you should promote it.
For example, the Item dimension table includes a field called Original EAN. This is the first EAN number assigned to a book. (This number might later be changed.) We decided that end users would not want to do analysis on this field; since the current EAN number is all that is required, we did not promote it to be an attribute of the dimension. But we might be wrong. If there is a good chance that users will want to use this field for analysis in the future, then we should promote it.
With a large table, you cannot promote everything. When there are many fields that could potentially be promoted, deciding which ones to promote can be challenging. For example, the Project REAL Item table has more than 144 fields that might be promoted to attribute status. This is far more than is reasonable to expect users to keep track of.
If an attribute is only conceptually useful in the context of a hierarchy (in other words, it has little or no analysis potential by itself), then promote it to be an attribute hierarchy (so that you can add it to a user-defined hierarchy), but set it to hidden. That way it doesn’t confuse end users.
For example, we set the Buyer.Buyer Alpha attribute hierarchy to hidden. This attribute is the first character of the buyer’s name and has only navigational usefulness within the Buyer hierarchy. We would not expect end users to perform stand-alone analysis based on the first character of the buyer’s name.
The following guidelines apply to instances when you should not promote a relational field to an attribute hierarchy.
If a field has no functional dependency (in a 3NF context) on the dimension, do not promote it. For example, the Item dimension table has 15 or so fields that are not functionally dependent on Item. They are really denormalized fields (that is, properties) of the distribution center that is used for that item. Thus, these denormalized fields should not be included as attribute hierarchies for Item. Ultimately we may build a distribution center dimension from this data, but currently that is not part of the Project REAL design. Just because a field is included in the source RDBMS tables does not mean that it always needs to be included in the multidimensional data model.
Do not promote fields that do not have a business relationship with the dimension. It is not uncommon for tables to include fields that exist for administrative purposes. These fields do not really describe an entity (such as an item or store). Fields of this type might include a field that stores the date on which the data was last modified, or a field containing the name of the user who modified the data, or the date the record was created. Since they don’t have a relationship to the dimension, these fields should not be promoted.
Data type mismatches with tinyint keys
A well-known best practice in relational design is that you should always use the smallest data type possible to meet the domain of the object you are modeling. Thus if you know (for business reasons) that a particular value can never exceed a certain threshold, do not use a larger data type. Doing so wastes storage for the values.
While in general this is good advice, watch out in SQL Server 2005 for data type mismatches when dealing with tinyint keys. A tinyint is stored as a single byte (0 through 255). When the data source view is created, if the tinyint key is an identity field (as is common for a surrogate key), the system converts it to an integer. Thus the dimension key is an integer. But in your fact tables the value is carried as a tinyint, resulting in a data type mismatch.
For example, in the original Project REAL design in SQL Server 2000 format, the Department dimension was a full dimension table. The Department key was a tinyint (1 through 13), which contained the type of the item, such as hardcover or paperback. The Department key had an Identity property, so the next department would be 14, then 15, etc. Because, relationally, a best practice is to always select the smallest data type that can naturally satisfy the requirement, tinyint was selected. In this case, the original designers didn’t expect to have more than 255 departments. Naturally, in the fact table there was a foreign key, which was also a tinyint. This resulted in a data type mismatch when we migrated the design to SQL Server 2005 Analysis Services.
The workaround in this situation is to create a computed column in the data source view on the dimension table, which specifically casts the key field to a tinyint. Then use that field rather than the original key field for the dimension key attribute hierarchy. While you could also convert the tinyint in the fact table to an integer, this results in wasted space.
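For example, the computed column in the data source view is equivalent to the following view definition (a sketch only; the Department table and column names here are hypothetical):

```sql
-- Sketch of the workaround (hypothetical table and column names).
-- The computed column casts the identity key back to tinyint so that the
-- dimension key's data type matches the tinyint foreign key in the fact table.
CREATE VIEW dbo.vTbl_Dim_Department AS
SELECT SK_Department_ID                                       -- identity key; the DSV treats it as integer
      ,CAST(SK_Department_ID AS tinyint) AS SK_Department_Key -- use this for the key attribute
      ,Department_Desc
FROM dbo.Tbl_Dim_Department
```

Bind the dimension key attribute to the cast column (SK_Department_Key in this sketch) rather than to the original identity column.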
While you might think this is a bug (we certainly did), it turns out that this is actually the result of some much earlier development work with Microsoft XML Core Services (MSXML). In MSXML there is the notion of an implicit data type conversion from unknown data types to their destination data types (not unlike what SQL Server RDBMS does). It turns out that tinyint is an old legacy data type for SQL Server and that MSXML actually does the data type conversion to integer. At this point there is too much existing code for MSXML to change its behavior so we have to live with the workaround.
Deciding how a system deals with unknown values is always an interesting design decision. In SQL Server 2000 this was easy: you created your own application-level unknown members. This involves adding an unknown member to your dimension and adding ISNULL or some other relational technique to the source for the fact table. New in SQL Server 2005 is the ability to let the system generate its own unknown member and automatically assign data to it.
In Project REAL we did both. We created application-level unknown members (which we called “missing”) on most of the keys. Most of this work was done in the views and typically involved returning a surrogate key of 0 to indicate that a lookup was missing. This approach allows the system to be more graceful when it sees unexpected data. We can distinguish logically “missing” data (something we expect) from unexpected, invalid data (that is, dirty data). The problem with using system-generated unknown members is that you can’t detect the difference between logically inconsistent data and dirty, unexpected data.
Best Practice: Where possible, create your own unknown members. Use the system-generated unknown member sparingly.
The following code is an example of how an unknown member is handled in the relational views.
CREATE VIEW [dbo].[vTbl_Dim_Store] AS
SELECT store.SK_Store_ID
      ,store.Store_Num
      ,store.Order_Status
      ,store.Store_Desc
      ,COALESCE(store.Division_Num, 0) as Merch_Div_Num
      ,store.Status
      ,COALESCE(store.Status_Desc, 'Unknown') as Status_Desc
      ,store.Open_Date
      ,store.Close_Date
      ,store.Division_Num
      ,COALESCE(store.Division, 'Unknown') as Division
      ,store.Region_Num
      ,COALESCE(store.Region, 'Unknown') as Region
      ,store.District_Num
      ,COALESCE(store.District, 'Unknown') as District
      ,store.Market_Area_Num
      ,COALESCE(store.Market_Area, 'Unknown') as Market_Area
      ,store.Market_Type
      ,store.Ad_Area_Num
      ,COALESCE(store.Ad_Area_Desc, 'Unknown') as Ad_Area
      ,store.Center_Desc
      ,COALESCE(store.City, 'Unknown') as City
      ,store.Zip_Code
      ,store.State
      ,store.Mgr_Name
      . . .
FROM dbo.Tbl_Dim_Store store
The COALESCE functions cast NULL values in the database to the logical unknown member within the dimension. In the case of the Merch_Div_Num column in the previous Transact-SQL code sample, the key value of 0 is the application convention for the unknown member. These are logical cleanup steps: we expect some NULLs to be present, and we cast them to values that we have pre-established for those semantics (for example, the key value of 0).
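The same convention carries over to the fact table source, where a failed surrogate key lookup can be mapped to the application-level unknown member. The following is a hedged sketch; the staging table and column names are hypothetical, not part of the Project REAL schema:

```sql
-- Map a failed dimension lookup to surrogate key 0, the application-level
-- "missing" member (all table and column names are hypothetical).
SELECT COALESCE(d.SK_Store_ID, 0) AS SK_Store_ID
      ,f.Sales_Amt
      ,f.Sales_Qty
FROM dbo.Stg_Store_Sales f
LEFT OUTER JOIN dbo.Tbl_Dim_Store d
    ON d.Store_Num = f.Store_Num
```

Because the “missing” member is an ordinary dimension row with key 0, these fact rows load cleanly and remain distinguishable from truly dirty data.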
The system-generated unknown member is new in SQL Server 2005. In the Project REAL design, we used it to handle missing customers. As it turns out, this particular dimension involved a lot of dirty data. After we designed the entire system, we noticed that roughly 10% of the sales data was failing because the surrogate key lookup for Customers was missing. This was caused by the natural delay between the time a sale occurred and the time a new customer showed up in the loyalty card system. It sometimes took a month or more for new customers to show up in the data feed from the Customer data source. A sale showed up immediately in the Sales data feed, but the corresponding customer was delayed in the Customer data feed. We felt that this data feed mismatch was not an uncommon situation; you probably recognize it immediately. So we decided not to adjust the entire ETL and Analysis Services processing in order to track this type of activity. Rather, we defined a system-generated unknown member. We then changed the error configuration for the Store Sales partitions so that foreign key lookup errors are converted to assignments to the unknown member.
Figure 9 shows how to configure the dimension.
Figure 9: How to configure a system-generated unknown member
Figure 10 shows how we changed the error configuration on the Store Sales partitions.
Figure 10: Changing error configuration settings on partitions
Figure 11 shows what it looks like when the system-generated unknown member is used in a query.
Figure 11: System-generated unknown members in a query
The Time Intelligence Wizard
We encountered a few challenges when using the Time Intelligence Wizard to create new time dimensions. New in SQL Server 2005 Analysis Services is the server-side time dimension. While it is tempting to use server-side time dimensions because of their ease of use (just click a couple of times and, bingo, you have a time dimension), we don’t necessarily recommend this as a general best practice.
Best Practice: Where possible, create a stand-alone time dimension table.
Having a time dimension table gives you a lot more flexibility when dealing with complex time situations.
You can add your own time properties. For example, does the day fall on the weekend or a weekday? Is the day a company holiday? Is it an in-season or out-of-season day? Is it in the Christmas season (this is very important for many retailers)?
It is easy to identify and build multiple hierarchies. For example, when two companies merge, there may be a period of time when the new merged company needs to run two different fiscal calendars (one for each company).
Unusual hierarchies can be constructed. For example, if a company’s manufacturing month always starts on the first Monday of the calendar month, it can have an unusual day numbering system.
In complex situations, the business model may require that entire levels be skipped. For example, a special calendar might only have years, weeks, and days. Or, it might have only years and days.
It is possible to have calculated members that are added as needed, such as values for a logical “current day.” In some companies, the day the books close is independent of the data feed. The data flows every day, but the company needs a logical current day so that it can freeze the books before the books close.
In general, as a best practice we recommend always using surrogate keys between entities in your relational design. However, this is one place where this practice might not make sense. If you already keep a timestamp or some other representation of a date (for example, the integer 20050321 to indicate March 21, 2005), then it probably makes sense to continue to use that technique. There are two points to consider in this case. First, make sure that you use a date stamp and not a datetime stamp (remove time of day from your foreign key). Second, you can use the SQL Server RDBMS DATEPART function to automatically construct the foreign key value from a date stamp.
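For example, an integer date key of the form yyyymmdd can be built from a datetime column as follows (a sketch; the Tran_Date column and staging table name are hypothetical):

```sql
-- Construct an integer yyyymmdd foreign key from a datetime column,
-- discarding the time-of-day portion (hypothetical names).
SELECT DATEPART(yyyy, Tran_Date) * 10000
     + DATEPART(mm,   Tran_Date) * 100
     + DATEPART(dd,   Tran_Date) AS SK_Date_ID
FROM dbo.Stg_Store_Sales
```

A transaction stamped 2005-03-21 14:37 would yield the key 20050321, independent of the time of day.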
If your application needs a drilldown that is deeper than day, consider separating time into two dimensions. One dimension records time down to the day level and a second dimension records the time of day.
For example, you want to keep track of item sales minute by minute. At first glance you might be tempted to create a dimension with a user-defined hierarchy such as the following:
Year -> Quarter -> Month -> Week -> Day -> Hour -> Minute
Assuming that the dimension is implemented this way, how many members would there be in this dimension if you kept five years of history? The answer might surprise you. It turns out that this requires 2,675,795 members. Not only is that a large number, but having hours and minutes modeled like this makes it difficult to spot trends across weeks and months. Instead, consider using two dimensions such as the following:
Time: Year -> Quarter -> Month -> Week -> Day
Time of Day: Period -> Hour -> Minute
(period might be off-hours, morning, mid-afternoon, late afternoon, evening)
Five years of history results in fewer members (3,640 members). This approach also allows for a much richer analysis. You can slice sales by just the morning hours or you can do a comparison for the first week of the month that compares morning sales with afternoon sales (such as the morning after monthly paychecks).
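A Time of Day dimension table of this shape can be generated entirely in Transact-SQL. The following is a sketch; the table name, column names, and period boundaries are hypothetical choices, not part of the Project REAL design:

```sql
-- Generate 1,440 minute-level rows for a hypothetical Time of Day dimension.
CREATE TABLE dbo.Tbl_Dim_Time_Of_Day (
    SK_Minute_ID smallint    NOT NULL PRIMARY KEY,  -- 0 through 1439
    Minute_Num   tinyint     NOT NULL,              -- 0 through 59
    Hour_Num     tinyint     NOT NULL,              -- 0 through 23
    Period_Desc  varchar(20) NOT NULL
);

DECLARE @m smallint;
SET @m = 0;
WHILE @m < 1440
BEGIN
    INSERT INTO dbo.Tbl_Dim_Time_Of_Day
    SELECT @m,
           @m % 60,          -- minute within the hour
           @m / 60,          -- hour of the day
           CASE              -- hypothetical period boundaries
               WHEN @m / 60 < 6  THEN 'Off-hours'
               WHEN @m / 60 < 12 THEN 'Morning'
               WHEN @m / 60 < 15 THEN 'Mid-afternoon'
               WHEN @m / 60 < 18 THEN 'Late afternoon'
               ELSE 'Evening'
           END;
    SET @m = @m + 1;
END;
```

Because the table is tiny and static, it can be built once and never reprocessed.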
There are two techniques for creating a time dimension table. One technique is to use the standard Dimension Wizard. Use the wizard to create all of the attributes, hierarchies, and other objects. Then go back and change the type of the various attributes to match the appropriate time semantics. For example, label your month attributes to indicate whether they are calendar, fiscal, reporting, or another type. Use this approach if most of your attributes and hierarchies do not have special time semantics, or if you don’t use time-sensitive functions at all.
Another technique is to use the Time Intelligence Wizard. To start this wizard, open the standard Dimension Wizard and indicate that this is a time dimension. The Time Intelligence Wizard takes over and allows you to enter time-specific information as shown in Figure 12.
Figure 12: Using the Time Intelligence Wizard
You will notice that this wizard is missing something. You can enter many time attributes, but you don’t have the opportunity to enter both a key and a member name at the same time. You must choose whether to enter key fields or name fields.
Best Practice: In the Time Dimension Wizard, enter key fields (not names) for the various attributes, such as year, month, and week.
The advantage of selecting keys is that later on, you only have to change the member names. By default, the wizard sets the Order By property to Key. Thus, if you enter the key field here, then all you need to change later on is the name field.
While we are on the subject of fixing up definitions after you run the wizard, remember the requirement for member key uniqueness. With time, it is very easy to be seduced into using traditional keys and names. For example, you might have names and keys such as Q1, Q2, Q3, Q4. Or you might use keys such as 1 for January and 12 for December. While DBAs frequently remember that they need month keys of 1-12 to order the months properly (otherwise April becomes the first month of the year), we don’t always remember that a quarter key of Q2 or a month key of 4 is not enough to uniquely identify the member. You almost always have to go back, create a key column collection for the attribute, and add the year to the collection to guarantee uniqueness.
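An alternative to a multi-column key collection is to bake the year into the key when populating the time dimension table. The following sketch uses hypothetical column and table names:

```sql
-- A quarter number (1-4) repeats every year, so it cannot serve as a unique
-- member key by itself. Combining it with the year makes it unique
-- (Calendar_Date and Tbl_Dim_Date are hypothetical names).
SELECT DATEPART(yyyy, Calendar_Date) * 10
     + DATEPART(qq, Calendar_Date) AS Quarter_Key   -- e.g., 20052 for Q2 of 2005
     ,'Q' + CAST(DATEPART(qq, Calendar_Date) AS char(1)) AS Quarter_Name
FROM dbo.Tbl_Dim_Date
```

The member name stays the familiar Q1-Q4, while the key is unique across the whole dimension and still sorts chronologically.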
Do not forget to create attribute relationships in the various hierarchies that you created. For example, create an attribute relationship from day to week, week to month, month to quarter, quarter to year. Like key uniqueness, this is critical if you are to have a well-running system.
Once you have fixed member names, ensured the key uniqueness of the time attribute hierarchies, and established the attribute relationships to support the hierarchies that you have defined, you can add the other time-related attributes, such as weekends, holidays, and season indicators.
Thus, while time seems to be an easy concept (after all, it is just a timestamp), it actually will take a considerable amount of your design time to construct all of the tables and Analysis Services objects, and to make sure that you have them configured properly.
The relational data model implemented for Project REAL has three fact tables: Store Sales, Store Inventory, and Distribution Center Inventory. Physically, these three logical fact tables are partitioned weekly to handle the large volume of data (for example, 140-200 million inventory records per week). Partitioning is also important because:
It provides better performance. Only those partitions that cover the time period needed by the query are scanned. In general, the more partitions you have, the smaller each partition is and the more quickly queries execute.
It is easier to delete data. Simply delete the partition and the data is removed. You do not have to reprocess any cubes. Normally, deleting data from a fact table requires reprocessing the cube.
It makes it easy to implement rolling periods of time. For example, suppose a system must keep the last three years’ worth of data. Rather than using a fixed start date, you might keep a rolling set of 36 monthly partitions. New partitions are created as the weeks or months move forward; older partitions beyond the three years are simply deleted.
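On the relational side of such a design, rolling the window forward amounts to creating the incoming period’s table and dropping the one that has aged out. The following sketch assumes a hypothetical weekly table-naming convention:

```sql
-- Roll a weekly partitioned fact schema forward (hypothetical names).
-- Create the incoming week's table with the same structure but no rows...
SELECT *
INTO dbo.Tbl_Fact_Store_Sales_WE_2005_10_01
FROM dbo.Tbl_Fact_Store_Sales_WE_2005_09_24
WHERE 1 = 0;

-- ...then drop the week that has fallen out of the rolling window.
DROP TABLE dbo.Tbl_Fact_Store_Sales_WE_2002_10_05;
```

The corresponding Analysis Services partitions are created and deleted separately; Appendix A describes how Project REAL automates the Analysis Services side.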
A reader who is familiar with best practices in SQL Server 2000 Analysis Services will notice an interesting fact in the Project REAL design: all of the measure groups are contained in a single cube. A measure group represents what SQL Server 2000 Analysis Services called a cube: a fact table linked to a subset of the dimensions that are available in the database, at a given grain. The grain of a fact table is the lowest level at which it joins to each of the various dimensions. SQL Server 2000 Analysis Services had a feature called a virtual cube, which allowed you to group disparate cubes together into a single framework. In SQL Server 2005 Analysis Services, the cube is the single framework. Measure groups now capture the subsets of dimensions at various grains.
If you want to use a virtual cube, you can still create one by using linked measure groups between multiple cubes with a single measure group per cube (an old-style cube). In fact, this is what the Analysis Services Migration Wizard creates. It does this so that the number (and type) of objects is the same between a SQL Server 2000 Analysis Services database and a SQL Server 2005 database.
There are pros and cons to the metacube approach, in which everything is thrown into a single cube. The most important advantage is that it allows an end user to connect and query a single structure without any artificial boundaries imposed because of the way the cube was constructed. End users just see a collection of measures. Depending on how they query the cube, the system dynamically accesses the necessary measure groups.
Calculations are similar. You just create a single cube to contain all of your calculations and the system adjusts itself accordingly. To get a feeling for this kind of rich object space (where everything is thrown into a single “pot”), browse the sample AdventureWorksDW Analysis Services database. This database contains 18 dimensions, 154 attribute hierarchies, 18 user-defined hierarchies, over 100 calculations, and 9 measure groups (all at different grains)—all part of a single cube. Wow!
However, there can be a cost to this approach. The most obvious cost is that there are many objects to navigate. You need a client application that can handle the new SQL Server 2005 objects such as display folders and perspectives. These objects allow you to organize objects into functional groups so that end users are not overwhelmed with the multitude of different options available to them.
The second cost of this approach is that doing a full process of a dimension resets the processed state of any measure groups that use that dimension. This includes both the measure groups and all of their partitions. We found this out in a test case where we inadvertently did a full process of the Vendor Type dimension (which contains only five members). Doing so reset all of the Sales partitions. We had to go back and do a full reprocess of all 153 partitions in the Sales and Item Vendor measure groups. Needless to say, about ten hours later when this process had completed, we decided not to do that again unless we really had to. It could have been worse. If we had done a full process of the Measures dimension (with just one member), this would have affected all of the measure groups. Two terabytes and three days later, we would have had the cube back.
As mentioned earlier, fact tables contain three types of information.
The granularity of a fact table is represented in its dimensionality. Each dimension has a foreign key that points to the dimension table at the appropriate level/attribute. In Project REAL, these are surrogate keys that are generated by the system so that we can implement Type 2 slowly changing dimensions. The level at which these surrogate keys join represents the grain in that dimension. For example, in one fact table the time dimension key might be at the Month level; in another fact table it might be at the Day level. For the other dimensions, the keys might be at the Item and Store levels.
Not all dimensions need to be represented in a measure group. For example, the Store Inventory and DC Inventory measure groups do not do analysis based on vendors. Thus, they do not include the Vendor or Vendor Type dimensions.
For the Project REAL cube, the grain shown in Figure 13 is used.
Figure 13: The dimension-to-measure group mapping panel in the cube editor
Fact tables contain the measures we typically want to report on for our solution. In the example, the Sales fact table has measures such as sales amount, sales quantity, discount amount, and coupon amount.
Optionally, fact tables may also contain additional information (called degenerate keys) that does not point to a dimension table. Rather, this information is the dimension itself. For example, for Store Sales, the fact table may contain the register number for the item sales. There were no degenerate keys in the Project REAL fact tables.
Internally, Analysis Services stores a lot of information in each measure group. The next section describes a subset of the information that is tied to a measure group (and included in the design details).
The system stores the fact tables in partitions. A partition has the same structure as the measure group. It contains the same fields with the same data types. However, a partition contains a subset of the data. For example, it might contain only a week’s worth of data.
Unless otherwise stated in this section, assume that partitions are kept at the week level. For example, there is a relational view named vTbl_Fact_Store_Sales_WE_2004_11_27 that covers a single week of Store Sales data.
The partitions in the Project REAL database seem to violate one of the basic best practices of SQL Server 2000. There is no data slice set for the partitions. In SQL Server 2000, partitions must have the data slice set so that the run-time engine knows which partition to access. This is similar to specifying a hint to a relational query optimizer. In SQL Server 2005, this is no longer necessary. Processing the partition now automatically builds a histogram-like structure in the MOLAP storage. This structure identifies which members from all dimensions are included in the partition. Thus, so long as the storage method is MOLAP, the data slice is an optional (and unused) property. However, the data slice is used with ROLAP storage or when proactive caching involves a ROLAP access phase. In both of these circumstances, the actual fact data is never moved so the system does not have a chance to identify a member. In this case, setting the data slice for the partition remains a necessary and critical step if you expect the system to perform well.
Because the MOLAP structures dynamically determine the data slice, a new type of partitioning technique is possible with SQL Server 2005. The best way to describe this technique is via a simple example.
Suppose a system that you are designing has a product dimension of 1,000 products. Of these, the top 5 products account for 80% of the sales (roughly evenly distributed). The remaining 995 products account for the other 20% of the sales. An analysis of the end-user query patterns shows that analysis based on product is common, making product an effective partitioning scheme. For example, most of the reports include a breakdown by product. Based on this analysis, you create six partitions: one partition for each of the top 5 products and one “catchall” partition for the remainder. It is easy to create a catchall partition. In the query binding, add a WHERE clause to the SQL statement as in the following code.
In the top five partitions (1 through 5) use the following code.
SELECT * FROM <fact table> WHERE SK_Product_ID = <SK_TopNthProduct#>
In the catchall partition use the following code.
SELECT * FROM <fact table> WHERE SK_Product_ID NOT IN (<SK_TopProduct#>, <SK_2ndTopProduct#>, <SK_3rdTopProduct#>, <SK_4thTopProduct#>, <SK_5thTopProduct#>)
This technique requires a lot of administrative overhead in SQL Server 2000 Analysis Services. In SQL Server 2000, the data slice must identify each and every member in the partition—even if there are thousands and thousands of members. To implement the example, you would need to create the catchall partition data slice with 995 members in it. This is in addition to the administrative challenge of updating that list as new members are added to the dimension. In SQL Server 2005 Analysis Services, the automatic building of the data slice in the partition eliminates the administrative overhead.
Another challenge when creating a system with a large number of partitions is simply creating them all. In SQL Server 2000, partitions were created either one at a time by using Analysis Manager (a very time-consuming and potentially error-prone process) or by having a developer write an application program to automate the creation.
New in SQL Server 2005, SQL Server Management Studio can create multiple partitions, even hundreds of partitions, at the same time. Normally, when you create a partition you would expect to use the tables that are defined in the data source view. After all, that is why the data source view exists. However, you don’t have to. Create a partition in the Store Sales measure group (make sure that at least one partition is already present and processed) and then examine the dialog box, such as the one shown in Figure 14.
Figure 14: Creating a new partition
Notice that, while the dialog box defaults to using the data source view, you can also scan and locate tables and views in the data source. Select the data source and then click Find as in Figure 15.
Figure 15: Finding partition fact tables
The dialog box lists the tables in the data source that match the metadata (fields and data types) of the measure group “template” table (or view). If you select multiple items, as shown in Figure 16, the system creates a partition for each table checked.
Figure 16: Selecting multiple partition fact tables
This has two important side effects. First, it means that not all objects have to come from the data source view. In Project REAL, this meant that we did not have to include all 231 weekly partition tables in the data source view; we included just the one “template” object used to create the measure group. Second, and this is the time-saving part, you can create hundreds of partitions in a single step. Just select multiple objects and BANG you get lots of partitions from just one operation. The only requirement is that at least one partition must already be processed so that the Find operation can locate matching tables. But other than that, we can create hundreds of partitions with a click.
At this stage in our Project REAL work, all of our partitions use MOLAP storage. We are not actively testing this area, although our testing may be extended depending on what we find with our normal testing.
Aggregations and aggregation designs
Aggregations are the subtotals that are calculated by the system to speed up query time. Along with the aggregation itself (the subtotal), the system also stores the design of the aggregation. This is the internal representation of what the aggregate is. The design describes the combination of dimensions and levels for which subtotals should be calculated. Aggregations and aggregation designs are kept within the partition, which is within the measure group.
Unlike SQL Server 2000 Analysis Services, which keeps the aggregate design within the partition itself, in SQL Server 2005 Analysis Services, the aggregation design is a first-class citizen. If you look at the XMLA script that represents the database, you will see that the aggregation design is kept at the level of the measure group. You can have multiple aggregation designs present at the same time. Within a measure group, each partition points to the aggregation design that it uses. This could be no aggregation design (in which case aggregates will not be created for the partition), the same aggregation design for all partitions, or different aggregation designs across the partitions. Most applications use the first or second approach (either no aggregation design, or all partitions using the same aggregation design). In a more complicated design you might use the third approach (different aggregation designs across the partitions). For example, you might have a large number of aggregates for the current year partitions, a mildly aggregated design for the last three years partitions, and no aggregates for any partitions older than three years. It is easy to script out the partitions and point them to different designs—just edit the XMLA script.
Processing settings and error configurations
Along with keeping partitions in the measure group, the system also records the processing settings and error configurations for those partitions. For the Store Inventory and DC Inventory measure groups, we were able to use the default error configuration: the system stops processing when it detects an invalid foreign key, NULLs, or other data inconsistencies. This is appropriate because all of the data in those feeds is assumed to be physically correct.
For the Store Sales measure group we had to use a different error configuration. As discussed earlier in the measure group section, we found dirty data between the Customer dimension and the Store Sales fact table. Because this turned out to be a natural inconsistency between data feeds, we changed the default error configuration.
Proactive caching settings
One way of looking at MOLAP structures is that they represent a cache, or image, of the relational data as of the time it was processed. SQL Server 2005 has a new capability that controls when the MOLAP cache is reprocessed. Several settings are available. For example, you can set a hard-coded interval (such as every 15 minutes), or a stored procedure return value (and then execute the stored procedure every 15 minutes). Or, you can have the system wait for data to be updated and then process it. These settings are kept within a partition; thus, they are also part of the measure group.
In the current Project REAL system, we use automatic MOLAP proactive cache settings. We plan to adjust these settings when we conduct our performance tests.
Other cube objects
Besides the dimension usage, measure groups, and partitions, the cube has other objects in it. This section covers calculations, KPIs, and other objects that are stored within the Project REAL cube.
Calculations
Currently the Project REAL system does calculations for various business measures, such as calculating average sales amounts and quantities on-hand. We are not actively testing this area, although our testing may be extended depending on what we find with our normal testing.
Key Performance Indicators
Currently the Project REAL design does not include any Key Performance Indicators (KPIs). We are not actively testing this area, although our testing may be extended depending on what we find with our normal testing.
Actions
Currently the Project REAL design includes only one action. This action is used as a bridge between the Analysis Services client tool and Reporting Services. The cell-level action invokes a Reporting Services report by passing the current coordinates as parameters (such as a specific item or vendor) to the report when the end user selects the action. We are not actively testing this area, although our testing may be extended depending on what we find with our normal testing.
Perspectives
Currently the Project REAL design does not include any perspectives. We are not actively testing this area, although our testing may be extended depending on what we find with our normal testing.
Custom assemblies, user-defined functions, and MDX scripts
Currently the Project REAL design does not include any custom assemblies, UDFs, or MDX scripts. We are not actively testing this area, although our testing may be extended depending on what we find with our normal testing.
Server-wide settings
Currently the Project REAL design does not change system-wide settings from their default values. The only exception is that on some of our smaller systems we set the value of the CoordinatorExecutionMode server property to 4, so that no more than four parallel operations are attempted on a single-CPU or dual-CPU system. You may encounter problems because Analysis Services attempts to parallelize many administrative operations, such as dimension and partition processing, that were sequential in SQL Server 2000 Analysis Services. In SQL Server 2005 Analysis Services there are two ways to limit the degree of parallelization.
You can limit the degree of parallelization on a request-by-request basis. First select your object, such as a partition or dimension. You can select multiple objects in both BI Development Studio and in SQL Server Management Studio: hold down the CONTROL key or the SHIFT key and select multiple items for an operation. Once you have the objects selected and the Process Object(s) dialog box is displayed as shown in Figure 17, click Change Settings.
Figure 17: Parallel processing of multiple selected objects
You can then limit the system, for example, to only process four requests at the same time as shown in Figure 18. Although the drop-down list contains only numbers in powers of two, you can enter any integer.
Figure 18: Setting the degree of parallelization in the processing dialog box
You can also control the degree of parallelization in an XMLA script that is used for processing. Figure 19 shows an example.
Figure 19: Setting the degree of parallelization in an XMLA script
The problem with this approach is that you have to remember to set it every time you process an object. This can be very difficult if there are many subobjects. For example, processing a cube with 231 partitions is quite difficult.
In fact, if you set the degree of parallelization to eight by using the processing dialog box and then view the contents of the batch statement, you will see the MaxParallel setting reflected in the XMLA batch script that is being executed.
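For reference, a minimal sketch of such a batch might look like the following (the database, cube, measure group, and partition IDs below are placeholders, not from Project REAL; the relevant part is the MaxParallel attribute on the Parallel element):

```xml
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel MaxParallel="4">
    <Process>
      <Object>
        <DatabaseID>MyDatabase</DatabaseID>
        <CubeID>MyCube</CubeID>
        <MeasureGroupID>Store Sales</MeasureGroupID>
        <PartitionID>Store Sales WE 2003 12 27</PartitionID>
      </Object>
      <Type>ProcessFull</Type>
    </Process>
  </Parallel>
</Batch>
```

Any Process commands placed inside the same Parallel element are scheduled together, with at most MaxParallel of them running at once.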
Alternatively, you can set the maximum number of concurrent requests system-wide. This helps in some circumstances (for example, you no longer need to remember to set it each time). However, this is not helpful in other circumstances such as when you are processing a smaller number of concurrent requests or when the SQL statements that are being executed have little overhead.
To change the system-wide limit
In SQL Server Management Studio right-click the server and select Properties.
Change the value of the CoordinatorExecutionMode property.
Figure 20: Setting server properties using SQL Server Management Studio
You will notice that the Restart column in the Properties dialog box does not indicate that a restart of the service is needed for this parameter. The new value takes effect immediately.
This section examines two challenging design issues that we encountered. The first was how to represent vendors in the design. The modeling technique that was used in the original SQL Server 2000 Analysis Services design as it came to Project REAL had some serious defects. We will review these and outline five approaches to solving them.
Another challenge was how to perform calculations on the inventory data. We noticed that many of the inventory calculations involved quantity on-hand and on-order which are not additive (not based on SUM, MIN, MAX, COUNT). We solved this challenge by using the new semiadditive measures in SQL Server 2005 Analysis Services.
How vendors are represented
In the original design at the start of Project REAL, vendors were modeled in five SQL Server 2000 Analysis Services virtual dimensions called:
Return Vendor – the vendor to which items are returned.
Purchase Vendor – the vendor from which items are purchased.
Original Purchase Vendor – the vendor used for purchasing when the item was first sold.
Source Vendor – the vendor from which the item is shipped (which might not be where the item is purchased, nor where it would be returned).
DC Vendor – the vendor that supplies the distribution centers for the item.
In the next five sections we compare and contrast five ways to model these entities using SQL Server 2005 Analysis Services. What we found is that there is not a single best way to model these kinds of relationships in all circumstances. We found that the best method to use depends on your available storage, tolerance for performing calculations at run time, and the flexibility needed for performing vendor analysis.
Approach #1 – Create separate physical dimensions
The first approach we implemented was to create five physical dimensions (one for each vendor type) and load them from the Item dimension table. This technique is straightforward and easy to implement. However, it has several drawbacks. The first is that it adds complexity to the cube: there are five additional dimensions for the end user to work with. It also requires five times more storage for the dimensions (since each dimension is totally independent of the others). But the most important drawback of this technique is that it is impossible to do any cross-vendor analysis. For example, take the vendor “Abrams, Harry N., Inc.” Because there are five members for Abrams, there is no way to directly slice by that vendor, unless you assume that Abrams is spelled the same way in all five dimensions and you make sure that the same slicers are selected for all five dimensions.
Approach #2 – Use attribute hierarchies instead of physical dimensions
This approach makes the relationship between item and vendor more direct than in the previous one. In this approach, we remove the physical dimensions and replace them with five attribute or user-defined hierarchies in the Item dimension. Thus the five dimensions are now Item hierarchies: Item.Return Vendor, Item.Purchase Vendor, Item.Original Purchase Vendor, Item.Source Vendor, and Item.DC Vendor. There are several advantages to this approach. First, since all five hierarchies are built from the same attribute key, there is only one storage location: the Item dimension. Thus, the dimension requires less storage. Second, this approach reduces the complexity (dimensionality) of the final cube, since we changed five dimensions into five attribute hierarchies (or user-defined hierarchies). For some front-end tools this may be important, making it easier for end users to navigate the cube. However, like the first approach, this approach does not allow cross-vendor analysis.
Approach #3 – Create a many-to-many dimension for Vendor and Vendor Type
A third approach is to create a many-to-many dimension between three dimensions, Item, Vendor, and Vendor Type. This is an unusual approach and it requires a lot of overhead, but it has much richer semantics for analysis. The Item dimension is exactly the same as we’ve seen all along. The Vendor dimension is a union of all possible vendors (removing duplicates) in the five different types. Lastly, Vendor Type has five members; one each for Return, Purchase, Original Purchase, Source, and DC. Once the dimensions are in place, create a measure group that represents one record for each item and its five types. For example, the measure group looks like the one shown in Figure 21.
Figure 21: A many-to-many dimension between the Item, Vendor, and Vendor Type dimension tables
As you can see, this measure group can be fairly large. In the full dimensions, we have roughly 6 million items, five types for each item, and ~40,000 vendors. This means that there are 30 million records in the many-to-many measure group (called Item Vendor in the cube design).
The principal drawback to this approach is that to do vendor analysis, you must perform a run-time join between the partition data being analyzed (millions of records) and the many-to-many measure group (which also has millions of records). This means that a query which takes several seconds under the second approach is significantly slower under this approach (by tens of seconds). However, this approach comes with a great benefit. Now that a single member represents each vendor, it is possible to easily slice by that vendor and directly do comparisons, such as shown in Figure 22.
Figure 22: Viewing sales by vendor type for a single vendor
This is impossible with any of the other techniques unless you manually set all five slicers to the same value.
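As a toy illustration of why the bridge makes this possible, the following Python sketch (hypothetical items, vendors, and sales values, not Project REAL data) resolves one vendor member across its roles at query time:

```python
# Toy bridge table: (item, vendor, vendor_type) -- one row per item per role,
# mirroring the Item Vendor many-to-many measure group.
bridge = [
    ("Book A", "Abrams, Harry N., Inc.", "Purchase"),
    ("Book A", "Abrams, Harry N., Inc.", "Return"),
    ("Book A", "Ingram",                 "Source"),
    ("Book B", "Abrams, Harry N., Inc.", "Source"),
]

# Toy fact data: sales amount per item.
sales = {"Book A": 100.0, "Book B": 40.0}

def sales_by_vendor_type(vendor):
    """Roll up sales for one vendor, broken out by vendor type,
    via a run-time join between the facts and the bridge."""
    result = {}
    for item, v, vtype in bridge:
        if v == vendor:
            result[vtype] = result.get(vtype, 0.0) + sales[item]
    return result

print(sales_by_vendor_type("Abrams, Harry N., Inc."))
# {'Purchase': 100.0, 'Return': 100.0, 'Source': 40.0}
```

Because “Abrams, Harry N., Inc.” is a single member of the Vendor dimension, one slicer covers all of its roles, which is exactly what the five-dimension and attribute-hierarchy approaches cannot do.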
Approach #4 – Use reference or role-playing dimensions
A fourth technique is to create a single cube dimension called Vendor, with one member per unique vendor (the union of the vendors in the Item dimension table, as in the many-to-many approach). Then add the dimension to the cube five times (once for each role) as a reference or role-playing dimension. The first time, the Vendor dimension is called Return Vendor; the second time it is called Purchase Vendor; and so on. You still have the cross-vendor analysis drawback, but this approach requires minimal storage and is very straightforward to implement.
If you use this technique, be aware that like the many-to-many approach, the actual calculation is done at run time. There is a join formed between the appropriate partitioned data and the role-playing dimension. Then the rollups are performed. Thus, this approach has the same performance issues as the many-to-many approach but without the additional analysis capability.
Approach #5 – Use reference or role-playing dimensions but make them materialized
The fifth technique for modeling vendors is to create role-playing dimensions as in approach #4, but when you specify the reference dimension set the type as “materialized.” When this setting is put into place, a separate copy of the dimension is kept for each role. This improves performance (there is no run-time join) at the cost of dimension memory.
Summary of techniques for modeling vendors
In summary, the five techniques for modeling vendors are as follows.
Implement five real dimensions—one for each type of vendor. This increases the number of dimensions, but this approach is very fast and aggregates can be precalculated.
Implement five attribute hierarchies in the Item dimension. This reduces dimensional complexity and associates vendors with the Item dimension (which is more clear than in the first approach). Approach #1 and approach #2 both have drawbacks in the area of cross-vendor analysis because there is no single member which represents a unique vendor, regardless of its role.
Implement a many-to-many dimension between Item, Vendor, and Vendor Type. This is slower, but conceptually cleaner. It allows you to increase the number of vendor roles without changing the metadata. It results in more flexible structures, which allow cross-vendor analysis. This is the slowest approach because of the large many-to-many measure group, which requires large query-time joins.
Implement five reference (or role-playing) dimensions with a single Vendor dimension. This is easy to do and it is conceptually cleaner than the first two approaches. This approach has performance implications since the joins that do rollups must be done at run time. There are no physical aggregations to precalculate.
Implement approach #4 but make the role-playing dimensions materialized. This has the same ease of implementation as approach #4, but each dimension member is actually created. This means that aggregations can be designed and precalculated. Approaches #4 and #5 both have the same cross-vendor analysis drawbacks as #1 and #2 since no single member represents a unique vendor, regardless of its role. If you will frequently want to add vendor types (such as adding a sixth or seventh vendor type), then approaches #4 and #5 are more straightforward than #1 and #2 since the base objects are already there. All you need to do is add a sixth or seventh reference, or role-playing, dimension. The Vendor dimension is already in place. But all four approaches require metadata changes and this will impact your existing reports and MDX queries. Only approach #3 allows you to add new vendor types without any metadata changes.
In Project REAL, we implemented approaches #2 and #3. We used this combination approach so that end users can select the right approach depending on the tradeoff they require between performance and rich analysis.
Semiadditive inventory measures
There is a well-known problem embedded in the Store and DC inventory data. Normally when dealing with multidimensional data such as sales amount and sales quantity, measures are naturally additive. To get the total sales amount for the WA district, all you do is add the sales amounts for each of its cities and stores. We call this calculation additive. SQL Server 2000 Analysis Services supports four additive functions. These are Sum, Count, Min, and Max. The arithmetic mean, or average, can be calculated as Sum divided by Count.
For example, let us look at how the data in the Sale Amt column in Figure 23 rolls up over time for the Kirkland, WA store. The aggregate function for Sales Amt is Sum.
Figure 23: The Kirkland store sales and inventory data
Notice that the total sales for the week is the sum of the days and that the total sales for the month is the sum of the weeks. Thus we call total sales an additive measure. Is this true with a store’s On-Hand Qty measure? Obviously, the answer is no. Quantity on-hand, quantity on-order, and similar measures are examples of semiadditive measures. The value reported over time is not based on the sum, but rather on the last value (or some other calculation, which varies by semiadditive function) that is reported within a period of time. In this example, store inventories are taken on Saturdays. If the number of books in the Kirkland store is 10,000 at the start of the month, and the inventory remains the same 10,000 books for each of four Saturdays until the end of the month (we will assume that every book sold during the week is replenished on Friday evening), what is the inventory at the end of the month? Is it 40,000 (the sum) or 10,000 (the last reported inventory)?
To solve this problem we use a new semiadditive aggregate function in SQL Server 2005 Analysis Services. Since we already have a time dimension, all we do is change the aggregate function from Sum to LastNonEmpty and the rollups act as we expect.
Since SQL Server 2000 Analysis Services supports only additive measures, the original Project REAL design had complex calculations to manually perform these rollups. With SQL Server 2005 Analysis Services, semiadditive operations are supported natively. All we had to do was change the aggregate function as shown in Figure 24.
Figure 24: Specifying the aggregation function for a measure by making it additive (the left measure) or semiadditive (the right measure)
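The behavior of the two aggregate functions can be sketched in a few lines of Python (a toy model of the rollup using the Kirkland numbers above, not how Analysis Services implements it):

```python
def rollup_sum(values):
    """Additive rollup: total over the period."""
    return sum(v for v in values if v is not None)

def rollup_last_non_empty(values):
    """Semiadditive rollup: the last reported (non-empty) value
    in the period, as with LastNonEmpty in Analysis Services."""
    non_empty = [v for v in values if v is not None]
    return non_empty[-1] if non_empty else None

# Four Saturday inventory snapshots for the month: 10,000 books each time
on_hand = [10000, 10000, 10000, 10000]

print(rollup_sum(on_hand))             # 40000 -- wrong for inventory
print(rollup_last_non_empty(on_hand))  # 10000 -- the answer we want
```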
Best Practice: If you use semiadditive measures, be sure that cubes have no more than one time dimension.
There are some limitations on the use of semiadditive measures. First, you must have a dimension marked as type Time to use semiadditive measures. See Figure 25 for an example. The Time Intelligence Wizard sets the attribute and dimension type properties appropriately when it runs. If you create your time dimensions using the standard Dimension Wizard, then you must manually set the appropriate type properties.
Figure 25: Specifying the time dimension as Time
The LastNonEmpty function cannot be calculated over all types of dimensions; it can be computed only over a time dimension. And the cube must contain only one time dimension. You can have many hierarchies in that time dimension, but only one time dimension in the cube. If the cube has more than one time dimension, the system picks the first one that it finds. That is not necessarily a good, predictable choice, but that is what it does.
Conclusion
This white paper provides a detailed technical discussion of a number of SQL Server 2005 Analysis Services designs and best practices that were developed in Project REAL. We reviewed many best practices and observations that were part of the work in Project REAL and presented information on different types of objects, such as data sources, data source views, dimensions, hierarchies, attributes, measure groups, and partitions.
For a review of a SQL Server 2005 Integration Services package that is used to synchronize the relational partitioning scheme with the Analysis Services measure group partitioning schema, see Appendix A.
For more information:
Appendix A: Automating Partition Creation
The Project REAL design uses partitioning quite heavily. The production system has more than 220 extremely large partitions. The sample data uses over 125 partitions of only tens of thousands of records each, while the full production system has 180 to 200 million records per partition. With so many partitions, creating each one by hand every time we generated a new schema required extensive typing.
So, as the saying goes, “When the going gets rough, a programmer writes a program.”
This appendix documents the BuildASPartition SQL Server 2005 Integration Services package that we created to automate the building of Analysis Services measure group partitions in SQL Server 2005 Analysis Services databases. This package synchronizes the relational partition scheme with the Analysis Services partition scheme. It loops through the relational database looking for weekly fact table partitions (by using a table naming convention). If a relational table is found, it looks to see if an Analysis Services measure group partition already exists (using the same naming convention). If not, it constructs and executes an XMLA script that creates it.
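The naming convention the package relies on can be illustrated with a short sketch (Python here for readability; the package itself does this in the VB.NET Script task listed at the end of this appendix):

```python
# Tables follow the convention vTbl_Fact_<mgname>_WE_YYYY_MM_DD,
# e.g. vTbl_Fact_Store_Sales_WE_2003_12_27.
PREFIX = "vTbl_Fact_"

def measure_group_name(table: str) -> str:
    """vTbl_Fact_<mgname>_WE_YYYY_MM_DD -> '<mgname>' with spaces."""
    mg = table[len(PREFIX):table.index("_WE_")]
    return mg.replace("_", " ")

def partition_name(table: str) -> str:
    """vTbl_Fact_<mgname>_WE_YYYY_MM_DD -> '<mgname> WE YYYY MM DD'."""
    return table[len(PREFIX):].replace("_", " ")

t = "vTbl_Fact_Store_Sales_WE_2003_12_27"
print(measure_group_name(t))  # Store Sales
print(partition_name(t))      # Store Sales WE 2003 12 27
```

Given a table name, the package derives both the measure group to look in and the partition name to look for, so the relational and Analysis Services schemes stay synchronized by construction.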
Figure 26 shows what the package looks like.
Figure 26: The BuildASPartition package (as seen in the package editor)
Figure 27 shows the variables and their default values.
Figure 27: The list of variables in the BuildASPartition package
The following table lists each variable and explains how it is used.
Mask – A Transact-SQL LIKE fragment that identifies the fact table weekly partitions.
Table – Contains one of the table records as the ForEach Partition loop iterates through the Tables rowset.
Tables – A rowset containing the list of tables in the RDBMS connection that match the Mask variable.
XMLA_Script – The XMLA script that the Partition Already There? task generates.
IsNotPresent – A Boolean that the Partition Already There? task sets to true if the partition does not already exist. If it is true, the Execute DDL task runs; if not, the ForEach Partition loop gets another table to check.
The package consists of the following four tasks.
Get Partition List (an Execute SQL task)
This task takes the Mask variable and uses it as a Transact-SQL LIKE pattern match against the views that are defined in the RDBMS database connection. The output rowset is bound to the Tables variable. Figure 28 shows how the Get Partition List task is configured within the package editor.
Figure 28: The Get Partition List task properties
ForEach Partition (a For Each Loop task)
This task loops through the tables, setting the Table variable with each table name as it iterates through the rowset. Figure 29 shows how the ForEach task is configured within the package editor.
Figure 29: The ForEach task properties
Partition Already There? (a Script task)
This script has two outputs. It sets the IsNotPresent Boolean variable to true if the partition does not exist, and it constructs an XMLA script that can be used to create the partition (saving it in the XMLA_Script variable).
There are several interesting things about this script.
It shows how to reference AMO and use it to determine if a partition already exists in the Analysis Services database.
It shows how to extract the server and database name from a connection (so these can be used later in the script to create an AMO command against the connection).
Figure 30 shows how the Partition Already There? task is configured within the package editor.
Figure 30: The Partition Already There? task properties
Create Partition (an Analysis Services Execute DDL task)
This task executes the XMLA script that was constructed by the Partition Already There? task. Figure 31 shows how the Create Partition task is configured within the package editor.
Figure 31: The Create Partition task properties
There is a precedence constraint between the Partition Already There? and the Create Partition tasks. This constraint ensures that the Execute DDL task only runs if the IsNotPresent Boolean variable is set to true by the script in the Partition Already There? task. To view the precedence constraint, double-click the arrow between the two tasks in the package editor. Figure 32 shows how the precedence constraint is configured within the package editor.
Figure 32: Precedence constraint configuration
The package uses the following two connection objects.
AS database – This connection object contains the SQL Server 2005 Analysis Services database in which measure group partitions are checked, and created if they do not already exist.
RDBMS – This connection object contains the SQL Server 2005 relational database in which the weekly fact tables are looked for.
Finally, we want to comment on aggregation design. After you run the package, you will see that no aggregations have been designed for the new partitions. It is quite simple to create aggregation designs for many partitions at the same time. We have done this with hundreds of partitions at a time.
First, select the partitions listed in the cube window by holding down the CONTROL key or the SHIFT key while selecting. Then right-click and run the Design Storage Wizard to create the aggregations. The wizard defines the aggregations in all of the selected partitions. Do this once for Store Inventory, once for Store Sales, and once for DC Inventory, and you are done.
Note: While this package assumes that the relational database partitioning scheme is done by week and is encoded in the name of the tables, it is easy to see how you could modify the package to adjust to a different scheme. Simply modify the Get Partition List task (or extend it into a series of tasks) and build the Tables rowset against a different scheme. So long as the Tables rowset (however constructed) is in the same format, then the rest of the package can be re-used as is.
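For instance, the filtering step can be mimicked outside of Transact-SQL. This Python sketch (hypothetical view names; the real task issues a Transact-SQL LIKE query) applies a LIKE-style mask to a list of candidate views:

```python
import re

def like_to_regex(mask: str) -> str:
    """Convert a minimal Transact-SQL LIKE mask (% and _ wildcards)
    to a Python regular expression."""
    out = ""
    for ch in mask:
        if ch == "%":
            out += ".*"
        elif ch == "_":
            out += "."          # LIKE's single-character wildcard
        else:
            out += re.escape(ch)
    return "^" + out + "$"

# Hypothetical candidate views, as the Tables rowset might contain
views = [
    "vTbl_Fact_Store_Sales_WE_2003_12_27",
    "vTbl_Fact_DC_Inventory_WE_2003_12_27",
    "vTbl_Dim_Item",
]

mask = "vTbl_Fact_%_WE_%"       # the same kind of mask the package uses
pattern = like_to_regex(mask)
tables = [v for v in views if re.match(pattern, v)]
print(tables)                   # only the two weekly fact partitions
```

Swapping in a monthly mask (or any other convention) changes only the mask, not the downstream tasks, which is the re-use point the note above makes.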
The following is the source code for the Partition Already There? script.
' Microsoft Data Transformation Services (DTS) Script Task
' Write scripts using Microsoft Visual Basic
' The ScriptMain class is the entry point of the Script Task.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime

' NOTE:
' In order to get AMO to be imported you must copy the following
' file from:
'     C:\Program Files\Microsoft SQL Server\90\SDK\Assemblies\
'         Microsoft.AnalysisServices.dll
' to the folder
'     <windows folder>\Microsoft.NET\Framework\v2.0.<build#>
' And then you must add a reference to the assembly in the project
Imports Microsoft.AnalysisServices

Public Class ScriptMain

    Public Sub Main()
        ' Get Server and Database name from DTS connection object
        Dim oConn As ConnectionManager
        oConn = Dts.Connections("AS database")
        Dim sServer As String = _
            CStr(oConn.Properties("ServerName").GetValue(oConn))
        Dim sDatabase As String = _
            CStr(oConn.Properties("InitialCatalog").GetValue(oConn))

        ' By convention, we know that the database, data source and
        ' cube are the same name
        Dim sDataSource As String = sDatabase
        Dim sCube As String = sDatabase

        Dim oTable As Variable = Dts.Variables("Table")
        Dim sTable As String = CStr(oTable.Value)
        Dim sPartition As String = GetPartition(sTable)
        Dim sMeasureGroup As String = GetMeasureGroup(sTable)

        ' We have all of the information about the partition -- use AMO
        ' to see if it is present. Save in a variable for later reference
        Dts.Variables("IsNotPresent").Value = _
            Not IsPartitionThere(sServer, sDatabase, sCube, _
                sMeasureGroup, sPartition)

        ' Generate and save the XMLA script (will be executed in an
        ' Execute Analysis Services DDL task later on). Save in a
        ' variable for later reference
        Dim sXMLA As String = GenerateXMLAScript(sDatabase, sDataSource, _
            sCube, sMeasureGroup, sPartition, sTable)
        Dts.Variables("XMLA_Script").Value = sXMLA
        MsgBox(sXMLA, MsgBoxStyle.OkOnly, "XMLA Script")

        Dts.TaskResult = Dts.Results.Success
    End Sub

    Private Function GetMeasureGroup(ByVal sMG As String) As String
        ' All tables are in the format: vTbl_Fact_<mgname>_WE_YYYY_MM_DD
        '     e.g. vTbl_Fact_Store_Sales_WE_2003_12_27
        ' Measure group names (from this) are: <mgname>, e.g. Store Sales
        ' and the name may have embedded underscores (_) which need to
        ' be replaced with spaces
        Dim i_WE_location As Integer = InStr(sMG, "_WE_")
        Dim i_FACT_location As Integer = Len("vTbl_Fact_")
        sMG = Left(sMG, i_WE_location - 1)
        sMG = Right(sMG, (Len(sMG) - i_FACT_location))
        sMG = Replace(sMG, "_", " ")
        Return sMG
    End Function

    Private Function GetPartition(ByVal sPart As String) As String
        ' All tables are in the format: vTbl_Fact_<mgname>_WE_YYYY_MM_DD
        '     e.g. vTbl_Fact_Store_Sales_WE_2003_12_27
        ' Partition names (from this) are: <mgname> WE YYYY MM DD,
        '     e.g. Store Sales WE 2003 12 27
        ' and the name may have embedded underscores (_) which need to
        ' be replaced with spaces
        Dim i_FACT_location As Integer = Len("vTbl_Fact_")
        sPart = Right(sPart, (Len(sPart) - i_FACT_location))
        sPart = Replace(sPart, "_", " ")
        Return sPart
    End Function

    Public Function GenerateXMLAScript(ByVal sDatabase As String, _
        ByVal sDataSource As String, ByVal sCube As String, _
        ByVal sMeasureGroup As String, ByVal sPartition As String, _
        ByVal sTable As String) As String

        Dim sX As String
        sX = ""
        '
        ' XMLA script is missing the following clauses:
        '     1) annotations (not needed)
        '     2) physical storage, e.g. proactive caching settings, etc.
        '        Note: not needed since default values used
        '     3) linkage to aggregation designs (TBD) -- we need a management
        '        utility to control aggregation usage
        '
        ' Note: because of quoting rules conflict between VB.NET and
        ' XMLA scripts, all double-quotes (") are replaced with
        ' uparrows (^)
        '
        sX = sX & "<Create xmlns=^http://schemas.microsoft.com/"
        sX = sX & "analysisservices/2003/engine^>" & vbCrLf
        sX = sX & "  <ParentObject>" & vbCrLf
        sX = sX & "    <DatabaseID>" & sDatabase & _
            "</DatabaseID>" & vbCrLf
        sX = sX & "    <CubeID>" & sCube & "</CubeID>" & vbCrLf
        sX = sX & "    <MeasureGroupID>" & sMeasureGroup & _
            "</MeasureGroupID>" & vbCrLf
        sX = sX & "  </ParentObject>" & vbCrLf
        sX = sX & "  <ObjectDefinition>" & vbCrLf
        sX = sX & "    <Partition xmlns:xsd=^http://www.w3.org/"
        sX = sX & "2001/XMLSchema^ xmlns:xsi=^http://www.w3.org/2001/"
        sX = sX & "XMLSchema-instance^>" & vbCrLf
        sX = sX & "      <ID>" & sPartition & "</ID>" & vbCrLf
        sX = sX & "      <Name>" & sPartition & "</Name>" & vbCrLf
        sX = sX & "      <Source xsi:type=^TableBinding^>" & vbCrLf
        sX = sX & "        <DataSourceID>" & sDataSource & _
            "</DataSourceID>" & vbCrLf
        sX = sX & "        <DbSchemaName>dbo</DbSchemaName>" & vbCrLf
        sX = sX & "        <DbTableName>" & sTable & _
            "</DbTableName>" & vbCrLf
        sX = sX & "      </Source>" & vbCrLf
        sX = sX & "    </Partition>" & vbCrLf
        sX = sX & "  </ObjectDefinition>" & vbCrLf
        sX = sX & "</Create>" & vbCrLf
        sX = sX & " " & vbCrLf

        ' replace all up-arrows with double-quotes
        sX = Replace(sX, "^", """")
        Return sX
    End Function

    Public Function IsPartitionThere(ByVal sServer As String, _
        ByVal sDatabase As String, ByVal sCube As String, _
        ByVal sMeasureGroup As String, _
        ByVal sPartition As String) As Boolean

        ' By default, we will assume that it isn't there
        ' Only if we get all of the way to the end and everything is
        ' found, will we set it true
        Dim bIsPartitionThere As Boolean = False
        Dim oServer As New Microsoft.AnalysisServices.Server

        ' connect to the server and start scanning down the
        ' object hierarchy
        oServer.Connect(sServer)
        Dim oDB As Database = oServer.Databases.FindByName(sDatabase)
        If oDB Is Nothing Then
            MsgBox("Did not find expected database: " & sDatabase, _
                MsgBoxStyle.OkOnly, "Error looking for partition")
            GoTo Done
        Else
            Dim oCube As Cube = oDB.Cubes.FindByName(sCube)
            If oCube Is Nothing Then
                MsgBox("Did not find expected cube: " & sCube, _
                    MsgBoxStyle.OkOnly, "Error looking for partition")
                GoTo Done
            Else
                Dim oMG As MeasureGroup = _
                    oCube.MeasureGroups.FindByName(sMeasureGroup)
                If oMG Is Nothing Then
                    MsgBox("Did not find expected Measure Group: " & _
                        sMeasureGroup, _
                        MsgBoxStyle.OkOnly, _
                        "Error looking for partition")
                    GoTo Done
                Else
                    '-- This is the real test -- to see if the partition
                    '   is really there
                    Dim oPart As Partition = _
                        oMG.Partitions.FindByName(sPartition)
                    If Not oPart Is Nothing Then
                        ' This is the only place to set the value to TRUE
                        bIsPartitionThere = True
                    End If
                End If
            End If
        End If

Done:
        oServer.Disconnect()    ' disconnect from the server -- we are done
NoConnection:
        Return bIsPartitionThere
    End Function

End Class