The CAP Theorem: Why “Everything is Different” with the Cloud and Internet

This is a continuation of last Wednesday's post, which featured the intro section to my forthcoming white paper, "Data Architecture in a Cloudy World".

In that overview, I briefly mentioned the bewildering array of choices facing developers today in the area of data architecture. The actual list of choices, however, is much longer than I was able to cover in a brief overview.

A Bewildering Array of Choices

There is not only a wide variety of possible data stores, but there are also a number of choices for physical architecture.

Here is a short list of the possible data stores that could be used in an application:

  • A relational database such as Microsoft SQL Server, MySQL, or another relational database product

  • Data Warehouses such as SQL Server Analysis Services

  • Different kinds of “NoSQL” databases:

    • Key/Value stores such as Windows Azure Tables
    • Document Databases
    • Graph Databases
    • Column Stores
  • “Big Data” solutions such as HDInsight (aka Hadoop on Windows), with its many components built on top of Hadoop

  • Messaging Solutions:

    • Windows Azure Storage Queues
    • Windows Azure Service Bus
    • SQL Server 2012 StreamInsight for real-time “push” event processing
  • Windows Azure Blobs

And here is a list of possible choices for physical architecture:

  • Cloud services, such as Windows Azure, which include several modes of operation:
    • Platform as a Service (“PaaS”)
    • Infrastructure as a Service (“IaaS”)
  • On-premises: in your own data center, which may or may not be a “private cloud”
  • Hybrid combinations of the above

The above is only a partial listing. PaaS is somewhat circumscribed by the services offered by the cloud service provider: with Windows Azure, for example, you have choices like Windows Azure Table Storage or SQL Azure Database. The IaaS architecture, on the other hand, is virtually unlimited, in that you are free to install whatever you want on a VM.

In the interest of full disclosure: while this paper addresses architecture in general, its cloud service provider focus is mainly on Windows Azure.

Another dimension to consider is that for some products both cloud and on-premises versions exist, but the feature sets might not be identical across the two platforms. Two examples are Microsoft SQL Server and HDInsight: each has a Windows Azure version and an on-premises version. Over the long run you can expect the feature sets to converge, but at present you must take the differences into account.

Why “Everything is Different” with the Cloud and Internet

Two major differences have emerged:

  • Data is distributed across many nodes that are structurally independent and geographically dispersed. This has led to the “death” of the classical two-phase commit method of maintaining database consistency across transactions, and to the emergence of a whole category of non-relational “NoSQL” databases.
  • The Internet has made enormous amounts of data, in the petabyte range and up, available to applications, hence the emergence of “Big Data” and “Internet-scale” applications such as Facebook. The data used by these applications is so large that it cannot realistically be processed with a relational database, which has led to the emergence of “Big Data” solutions, notably Hadoop.

Several consequences flow from the above:

  • Geographic dispersion of nodes implies the need to deal with latency. One way this is done is horizontal partitioning of relational databases (see the sketch after this list).
  • The large number of nodes means that any “Internet-scale” application faces the inevitability of random node failure at any time, which leads to the need for redundancy and resilience to the effects of node failure.
  • The need to scale out horizontally, the effects of latency, and the requirement for node redundancy all combine to make commodity hardware more cost-effective than the highest-performance premium hardware.
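
To make horizontal partitioning a little more concrete, here is a minimal sketch of routing each customer’s rows to one of several smaller, regional databases by a partition key. The region names, host names, and hashing scheme are invented for illustration; they are not any particular product’s API.

```python
import hashlib

# Hypothetical shard map: each entry is a separate, smaller relational
# database hosted in (or near) the named region. Host names are invented.
SHARDS = {
    "us":   ["us-db-0.example.com", "us-db-1.example.com"],
    "eu":   ["eu-db-0.example.com", "eu-db-1.example.com"],
    "asia": ["asia-db-0.example.com"],
}

def shard_for(home_region: str, customer_id: str) -> str:
    """Pick the shard that holds this customer's rows: first by the
    customer's home region (to keep data close to the user), then by
    hashing the customer id across that region's shards."""
    region_shards = SHARDS[home_region]
    digest = hashlib.sha1(customer_id.encode("utf-8")).hexdigest()
    return region_shards[int(digest, 16) % len(region_shards)]

# All of this customer's rows live on exactly one (nearby) database, so a
# query scoped to a single customer touches one smaller node.
print(shard_for("eu", "customer-42"))
```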

Two-Phase Commit and the “CAP” Theorem

In a distributed system, two-phase commit, the basic mechanism that enables complex transactions in relational databases, no longer works well at extreme scale. This has been summed up by the “CAP Theorem”, which says that in a large distributed system you can provide only two of the following guarantees, and not all three:

  • Consistency
  • Availability
  • Partition Tolerance

Consistency refers to the internal consistency of a database’s data. Relational database transactions are expected to satisfy the ACID properties:

  • Atomicity, meaning that transactions are “all or nothing”: if any part of a transaction fails, the entire transaction fails and the database is rolled back to the state it was in before the transaction began (see the sketch after this list).
  • Consistency requires that a successful transaction takes the database from one valid state to another.
  • Isolation requires that concurrently executing transactions do not interfere with one another: the result is the same as if they had been executed one after another.
  • Durability requires that once a transaction is committed, it remains so, even in the event of system failures.

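As a small illustration of the atomicity and consistency properties, here is a sketch that uses SQLite (from Python’s standard library) purely as a stand-in for a relational database; the table and constraint are invented for illustration. A transaction that violates a constraint is rolled back in its entirety, not just the failing statement:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, "
            "quantity INTEGER CHECK (quantity >= 0))")
con.execute("INSERT INTO inventory VALUES ('widget', 3)")
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("INSERT INTO inventory VALUES ('gadget', 5)")  # succeeds
        con.execute("UPDATE inventory SET quantity = quantity - 5 "
                    "WHERE item = 'widget'")  # violates CHECK, raises an error
except sqlite3.IntegrityError:
    pass  # the whole transaction rolls back, including the 'gadget' insert

# The database is back in its previous valid state:
# widget still has quantity 3 and there is no 'gadget' row.
print(list(con.execute("SELECT * FROM inventory")))
```
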
With the two-phase commit protocol you can ensure that your data stays consistent and satisfies the ACID properties even when a transaction spans more than one database or node. The stereotypical example is moving $10 from your checking account to your savings account: you first subtract $10 from checking, and then add $10 to savings. If the system goes down partway through, you don’t want the $10 to have been subtracted from checking only. The two-phase commit process ensures that the transaction either completely succeeds or is discarded.
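
Here is a deliberately simplified sketch of the two-phase commit idea itself, with in-memory objects standing in for the checking and savings databases; the class and method names are invented for illustration and are not a real transaction coordinator’s API. The coordinator first asks every participant to prepare and vote, and only commits if every vote is “yes”; otherwise everything is rolled back.

```python
class Account:
    """A toy participant: one account held on its own 'node'."""
    def __init__(self, balance):
        self.balance = balance
        self.pending = None

    def prepare(self, delta):
        """Phase 1: stage the change and vote yes/no."""
        if self.balance + delta < 0:
            return False          # vote "no": would overdraw the account
        self.pending = delta
        return True               # vote "yes": change is staged

    def commit(self):
        """Phase 2a: make the staged change permanent."""
        self.balance += self.pending
        self.pending = None

    def rollback(self):
        """Phase 2b: discard the staged change."""
        self.pending = None


def transfer(amount, source, target):
    """Two-phase commit across both participants: all or nothing."""
    participants = [(source, -amount), (target, +amount)]
    if all(acct.prepare(delta) for acct, delta in participants):
        for acct, _ in participants:
            acct.commit()
        return True
    for acct, _ in participants:
        acct.rollback()
    return False


checking, savings = Account(50), Account(0)
transfer(10, checking, savings)           # succeeds: balances become 40 / 10
transfer(100, checking, savings)          # fails: both accounts unchanged
print(checking.balance, savings.balance)  # 40 10
```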

Availability means that every transaction or request receives a response indicating whether it succeeded or failed.

Partition tolerance means that the overall system keeps working when parts of it can no longer communicate, for example because a disk, a computer, a network link, or an entire rack in the data center has failed. This is ensured by means of massive hardware redundancy: Windows Azure, for example, maintains at least three copies of every resource and transparently recovers from hardware failures.

The guarantee with the highest performance cost is “Consistency”, and particularly the two-phase commit process that relational databases use to enforce it. Consequently it is the one most commonly dropped in large distributed systems, in favor of availability and partition tolerance, since you can only have two of the three. Typically, strict ACID consistency is relaxed to “eventual consistency”: sooner or later the database is guaranteed to reach a consistent state.
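
As a rough sketch of what eventual consistency looks like from the application’s point of view, consider two in-memory “replicas” (invented for illustration) where a write is acknowledged by one replica and propagated to the other asynchronously: a read from the second replica may briefly return stale data, but the copies converge once replication catches up.

```python
import time
import threading

class Replica:
    """A toy replica holding a single value."""
    def __init__(self):
        self.value = None

    def write(self, value, peers, delay=0.1):
        """Acknowledge locally, then propagate to peers in the background."""
        self.value = value
        def propagate():
            time.sleep(delay)          # simulated network/replication lag
            for peer in peers:
                peer.value = value
        threading.Thread(target=propagate, daemon=True).start()

primary, secondary = Replica(), Replica()
primary.write("v2", peers=[secondary])

print(secondary.value)   # likely still None (stale): not yet consistent
time.sleep(0.2)          # wait out the replication lag
print(secondary.value)   # 'v2': the replicas have converged, "eventually"
```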

For more information on Distributed Two-Phase Commit and the CAP theorem see: