Data Warehousing 2.0 and SQL Server: Architecture and Vision
SQL Server Technical Article
Writer: W.H. Inmon
Published: October 2009
Applies to: SQL Server 2008 R2
Summary: Architecture and data warehousing are not static. From the first notion of a data warehouse to a full-blown analytical processing architecture that includes data marts, ETL, near line storage, exploration warehouses, and other constructs, data warehousing and its associated architecture continue to evolve. In 2008, the book on the latest evolution of data warehousing appeared – DW 2.0: The Architecture for the Next Generation of Data Warehousing (Morgan Kaufman). In that book the general architecture for data warehousing in its highest evolved form appeared.
Architecture and data warehousing are not static. From the first notion of a data warehouse to a full-blown analytical processing architecture that includes data marts, ETL, near line storage, exploration warehouses, and other constructs, data warehousing and its associated architecture continue to evolve. In 2008, the book on the latest evolution of data warehousing appeared – DW 2.0: The Architecture for the Next Generation of Data Warehousing (Morgan Kaufman). In that book the general architecture for data warehousing in its highest evolved form appeared.
Figure 1: Diagram of the granular data found in DW 2.0
Among other things, DW 2.0 recognizes the life cycle of data within the data warehouse, recognizes the need for including textual data in the data warehouse, and recognizes that metadata is an essential component of the data warehouse environment. Along the way, DW 2.0 recognizes that data warehouses attract large amounts of data, store that data over a lengthy period of time, support a wide variety of processing, and finally – data warehouses can become very costly if you choose to make design and infrastructure decisions that are expensive.
SQL Server in Evolution
While architecture has been evolving, so Microsoft® SQL Server® has also been evolving. From the humble origins as a database that served small amounts of data on a personal computer with very basic functions, SQL Server now is prepared to serve as a database foundation for mid-size and very large amounts of data for data warehousing.
It is said that to increase the capacity or performance of a system that tuning the system can get up to a 10% improvement. But to get an order or two magnitude of performance and capacity improvement, a change in architecture of the system is required. And indeed that is what SQL Server has undergone – a fundamental change in architecture from the early days of SQL Server.
Just as data warehouse and architecture has evolved, so SQL Server has also been evolving. And whether by chance or by design, SQL Server has turned into the preferred technology platform for the most advanced form of data warehouse architecture – DW 2.0.
This means that SQL Server has advanced mightily up the evolutionary curve for serving the data warehouse community as the database foundation for large and sophisticated data warehouses. No longer is SQL Server limited to small amounts of data and personal computers. With the architectural enhancements of SQL Server it is ready to become the infrastructure of choice when implementing advanced data warehouse and analytical architectures such as DW 2.0.
Aspects of DW 2.0
There are many aspects to the DW 2.0 architecture. Not all of them can be addressed in the space of this white paper. However, some of the more profound and more important aspects of DW 2.0 will be discussed in the context of SQL Server.
Basic Access of Data
For years the preferred storage medium for data has been disk storage. Disk storage appeared at the time that online transaction processing was first being done. In fact, in many ways it was the advent of disk storage that allowed online transaction processing to become a reality. The way that online transaction processing enables access of data for transaction processing is to access online storage randomly. To this end data is loaded onto disk storage either by hashing the data as it is placed on storage or by the creation of an index (or both). In accessing data for online transactions, a random and rapid access of small amounts of data is required.
For many online applications and for many usages of data, a pattern of rapid, random access of small amounts of data on disk storage works just fine. But when it comes to DSS, analytic processing, the basic pattern of access of data is quite different. Most DSS, analytical processing is done by means of SQL. SQL operates on sets of data, not records of data. Therefore, for analytical processing, an access mode where the first record being sought is accessed randomly and then the remaining records in the set are accessed sequentially fit optimally with data warehouse analytical processing. Furthermore the sets of data accessed by DSS processing may not be small at all. On many occasions very large sets of data are accessed. In other words, a sequential access mode for most of the data to be accessed in a data warehouse is optimal, not a random mode of access for every record of data to be sought. Figure 2 shows this difference in the fundamental mode of access in the OLTP environment and the data warehouse analytic environment.
Figure 2: Illustrating the difference between random and sequential I/O
In the latest release of SQL Server the first random record then a sequential mode of access is the one that is supported. This means that at the most basic level of processing, SQL Server holds a major performance advantage over their competition.
A Data Mart Migration Path
Another recurring problem with data warehouses and analytic processing is the fact that many organizations prefer to build data marts first, before they build an actual data warehouse. Then one day the organization wakes up and discovers that in addition to their data marts, they need a data warehouse. It is at this point that there is no easy or graceful migration plan to go from multiple data marts to a data-warehouse-centric environment. Many organizations start out in the hopes that a data mart or two are going to meet their analytical needs. But over time the problems with a data mart centric architecture start to appear – there is no definitive source of corporate data, there is the need to build every data mart from scratch when a new need for data appears, data marts are terribly brittle and need to be destroyed and rewritten when basic business requirements change, and so forth.
In DW 2.0, the granular data found in DW 2.0 forms what is called the “system of record” and becomes the “single version of the truth” for the organization. And from the system of record, data marts are created just as they were created in classical first-generation data warehouses.
Now SQL Server supports the easy migration from data marts to data warehouses. SQL Server offers the ability to build small data warehouses or data marts in its FastTrack option. Then, when the volume of data grows and the need to create a full scale data warehouse arises, SQL Server offers their SQL Server 2008 R2 Parallel Data Warehouse (previously codenamed Project “Madison”) option.
Figure 3: Parallel Data Warehouse offers an easy migration path from data marts and smaller data warehouses to enterprise-scale data warehouses
In the SQL Server 2008 R2 Parallel Data Warehouse edition of SQL Server there is the opportunity to manage as much data as is possible – up to petabytes of data. There is the opportunity to support this volume of data in a parallel fashion. There is the opportunity for redundancy of components so that the system can process in an efficient and in a failsafe manner.
But perhaps most importantly there is the opportunity to synchronize – automatically – data residing in the FastTrack data mart or mini data warehouse with the data managed centrally by SQL Server 2008 R2 Parallel Data Warehouse. While there are other aspects to the conversion of a data mart to a data warehouse, SQL Server has solved some of the hardest aspects of the problem very nicely. If an organization commits to SQL Server as a basis for data warehouse processing, many of the problems of migration from a data-mart-centric environment to a real data warehouse are mitigated.
Data Warehouse Costs
Another sensitivity of DW 2.0 is the recognition that the cost of the data warehouse is an issue. And if it is not an issue today, it will be an issue tomorrow. As data warehouse volumes of data grow, so grows the cost associated with data warehousing. And the costs of the data warehouse grow along with the rise in the volume of data.
When the discussion of cost arises, it must be noted that the more centralized the components of technology become, the more expensive they become. For example, suppose that an organization needs a total of n units of processing power. The most expensive thing an organization can do is to buy one central processor that provides n units of power. The most cost-effective thing an organization can do is to break up the n units of power into many different units. n units of power are needed. Suppose that the cost of n units in a single processor is X. Now suppose that the n units are divided into ten units – n/10. Suppose that each n/10 unit costs Y. Then:
10 x Y < X.
In fact, 10 x Y is FAR less than X. The most expensive processing cycles are those that are found in the largest machines. The more the workload can be divided, the less expensive the processing cycles become. Using the above equation, it is not unreasonable that Y would be 1/100th of X. Using this equation, 10 x Y = 1/10X. Therefore, from an economic standpoint, it makes sense to take work that needs to be done and distribute that work over many different processors. This has the effect of greatly reducing the costs of the data warehouse environment.
DW 2.0 is aware of this general cost equation. In fact DW 2.0 starts with this basic hypothesis as the basis for all following architectural decisions. And SQL Server is also aware of this fundamental fact of life regarding technology costs.
SQL Server accommodates the need for the distribution of processing across the data warehousing environment. Figure 4 shows this basic understanding of the costs of technology.
Figure 4: The unique hub-and-spoke architecture for SQL Server’s Parallel Data Warehouse
SQL Server distributes the processing workload in several ways. The first way that SQL Server supports the distribution of work across multiple locations is in support of the hub-and-spoke architecture. Basic data management is done in the data warehouse hub, where massive amounts of data can be handled. And end-user analytical processing is handled in the different spokes of the architecture. In fact, within the hub processor there is a distribution of the workload. The hub processes data differently in different places, thus avoiding a large queue that can and will negatively impact performance.
In doing so, the costs of the infrastructure for SQL Server are held to a minimum, thus allowing the organization to easily and cost effectively grow their data warehouse and to achieve good and consistent performance at the same time.
Another aspect of data warehouses – recognized by both DW 2.0 and SQL Server - is the need to store and manage a large volume of data. There are many ways that large volumes of data can be managed. A simple way to manage volumes of data is through compression. In compression, extraneous data is removed or stored in a minimized fashion. The techniques for compression are especially applicable for a data warehouse because data warehouses – built properly – do not allow data to be updated. Compression actually harms performance when update of data is allowed because it is costly for the system to go and find data, decompress it, update it, recompress the data and then try and replace the data in the database efficiently. But because updating of data does not occur in the data warehouse environment, compression of data makes a lot of sense. And indeed SQL Server allows data to be compressed.
But the biggest gain in the management of volumes of data that is now part of SQL Server (SQL Server 2008 R2 Parallel Data Warehouse) is that of the parallel management of volumes of data. In parallel processing of data, data is stored on more than one device so that more than one processor can access and manage data at the same time. In order to understand the value of parallel management of data, consider what a buggy driver must do when the weight of the buggy becomes too heavy for the horse. One alternative is to go from a regular sized horse to an oversized horse such as a Percheron or a Clydesdale. Percherons were bred years ago to allow knights in armor to ride them into battle or jousting contests. And a knight in armor weighs a lot. This strategy works well as long as there is a Percheron that is available and is less than ten years old. But what happens if there is no Percheron available? Or what happens if the load is too heavy for a Clydesdale to pull? Then, a team of horses – not a single horse - is needed. And up to a point more horses can be added as the load to be pulled grows.
The same analogy applies to managing a lot of data. If a single server is overwhelmed by its load of data, then multiple servers can be used at the same time and the load of data can be divided over more than one server. Such an approach is called a parallel approach because different sets of data are operated on in parallel independently. In doing so adding more parallel servers increases the total throughput that a system can handle.
And the SQL Server 2008 R2 Parallel Data Warehouse option handles data in a parallel fashion.
Figure 5: Parallel management of data in the SQL Server hub
Probability of Access of Data
But compression and parallel approaches to the management of data are not the only way that large volumes of data can be managed. DW 2.0 calls for the physical separation of data based on the probability of access of the data. Very highly accessed data needs to be placed in high performance storage. In this regard, a data warehouse built under SQL Server is like any other database management system. But as the volume of data grows and the probability of access of data drops, it no longer makes sense to store all data on high-performance storage. Not only is data that is not accessed very expensive to place on high-performance storage, the unused data gets in the way of accessing data whose probability of access is indeed high. By placing all data on high-performance storage, the organization has the worst of all worlds – large expense and poor performance.
In order to understand why data with a low probability of access should be removed from high-performance storage, consider that an information system is in many ways like the blood pumping through the human body. In a young athlete running a marathon, there is very little cholesterol. The heart pumps blood efficiently through the blood vessels of the athlete. But now consider a lethargic couch potato. The couch potato has a lot of cholesterol in his/her body. The heart has to work hard to pump blood through the cholesterol-clogged arteries of the couch potato.
Dormant, unused data in a data warehouse is like cholesterol in the body of an athlete. The less cholesterol there is, the more efficiently the heart pumps. The less unused data there is in high-performance storage, the more efficient it is to find data that is being looked for in high-performance storage.
DW 2.0 recognizes this basic fact of life and SQL Server also recognizes this fact.
Figure 6: Logical separation of data based on probability of access
SQL Server allows data to be divided according to its probability of access. In SQL Server data can be hot, warm, or cold. By physically dividing data into different sectors, performance of data is greatly enhanced.
But there is another very important feature of SQL Server that sets it apart from any other database management system. That capability is the ability to handle streaming data.
DW 2.0 calls for the separation of online high performance from integrated data processing. DW 2.0 recognizes what is termed the “interactive sector”. It is in the interactive layer that data can be entered in a continuous, high-performance fashion. Figure 7 shows that there are two basic divisions of data – static data and streamed data.
Figure 7: Static and streamed data
Most database management systems manage static data. Static data is data that is recorded as a byproduct of some event occurring on an event-by-event basis. The event that occurs usually happens in a relaxed manner. At 10:01 am there is a bank deposit. At 11:03 am there is an ATM activity. Throughout the day, static data enters the system in a random, somewhat relaxed manner. Streamed data differs from static data in that streamed data occurs and enters the database system very rapidly and very predictably.
One way to think of streamed data is that a single event occurs, then multiple fast measurements are made because of the occurrence of the event.
As an example of streamed data, consider an electronic eye that measures the output of steel in a steel manufacturing environment. The event that occurs is the production of a batch of steel. The output steel from the batch is rolled out in a bar very rapidly as the smelting process is finished. An electronic eye takes a measurement of the steel every 12 inches, or about 1,000 times a minute. As long as the mill is producing steel, the measurements are made quickly and consistently.
The electronic eye measures lots of variables as the steel passes by, such as:
- temperature of the steel
- chemical composition of the steel
- width of the steel bar, and so forth
In such a manner, very large volumes of data are captured in a rapid and predictable manner triggered by a single event. This mode of capturing data into a database is called the streaming mode. Streaming mode data fits very nicely with the interactive layer of data within DW 2.0. Static data does not fit well or comfortably within the interactive layer of data within DW 2.0.
Historical Data - What Does That Mean?
One of the database variables that must be addressed in database technology is the meaning of historical data or archival data. When most people think of archival data, they often think of large archives of data that are 5 to 10 years old. Indeed that kind of data is archival data. But when one considers the meaning of historical data, one is faced with a dilemma. Data that is one second old is historical data. But is data that is one second old also archival data? The answer is that if all data that is historical data is also archival data, then indeed data that is one second old is archival data.
But in most environments calling data that is one second old archival data simply is very misleading. So let us call data that is still very freshly newly created historical data and let us call data that is older than that true archival data.
Figure 8 shows that historical data can be divided up into two classes – newly created historical data and true archival data.
Figure 8: Two classes of historical data
This distinction of what is meant by historical data is necessary for understanding what kind of data needs to be placed in the interactive sector. In DW 2.0 there is a sector of data called the interactive sector. The interactive sector contains newly created historical data such a streamed data, not archival data.
This distinction is important the face of understanding the true nature of DW 2.0’s interactive sector data and processing. One vendor has created what is called the “active data warehouse”. This vendor claims that up-to-the second processing and analysis can be done with the active data warehouse in the data warehouse. Unfortunately the data that is being managed under the active data warehouse is static data, not streamed data. And therein lies the problem. When static data in the interactive sector is treated as streamed data, much confusion and much inefficiency occurs. Static data requires an infrastructure that demands integrity – integrity of transaction processing and integrity of database processing. By trying to treat static data as if it were streamed data in the interactive sector, there is a very high and unnecessary cost of the infrastructure. In addition the vendor that supports the active data warehouse tries to query its static/streamed data using the query tools designed for static data. Streamed data requires an entirely different means of query and analysis. The net result of active data warehousing is confusion, waste, and the inability to handle streamed data as it needs to be handled. Active data warehousing is not a form of data warehousing at all. Only when the data that is streamed is managed by techniques fit for streaming is the interactive sector actually created.
An active data warehouse with all of its very expensive costs and technological limitations is no substitute for handling streamed data. It is streamed data that belongs in the interactive sector of DW 2.0, not static data.
Unlike the vendor that promulgates active data warehousing, SQL Server handles streamed data. Figure 9 shows how SQL Server handles streamed data.
Figure 9: How SQL Server handles streamed data
Figure 9 shows that streamed data is handled separately from static data. Figure 9 shows that streamed data is captured in the form of newly created historical data. The data found in the steaming component/the DW 2.0 interactive sector is accessed and manipulated by a language suited for accessing and manipulating streams of data, not static data. The output of processing can be a report, an answer, or the movement of streamed data into a static environment. Microsoft will release its new streaming engine, StreamInsight, in the same timeframe as SQL Server 2008 R2.
The Fit Between DW 2.0 and SQL Server
These then are merely the highlights of how there is a very good architectural fit between the architecture of the future of data warehouse – DW 2.0 – and SQL Server. DW 2.0 and SQL Server recognize –
- the need to handle very large volumes of data
- the need to be constantly aware the costs of data warehousing
- the need to separate interactive streamed data and processing from other parts of data and processing
- the need to manage data in a parallel manner
- the need to divide the workload onto as many smaller components as possible
- the need for basic sequential access of sets of data
- the need to have a rational migration path from data marts and mini data warehouses to a large centralized data warehouse
- the place and position of streamed data
- the need to physically separate data based on the differences in the probability of access of the data.
DW 2.0 – Architecture for the Next Generation of Data Warehousing, Morgan Kaufman, 2008
For more information:
http://www.microsoft.com/madison:SQL Server 2008 R2 Parallel Data Warehouse
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to 5 (excellent), how would you rate this paper and why have you given it this rating? For example:
Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release.