There is No Spoon - How Cassandra Shatters Relational Beliefs Held Dear
The world of NoSQL has moved ahead with tremendous speed. It has shattered Relational beliefs we held so dearly once upon a time. It is, in fact, a true innovation. Challenged with wrapping their head around this new super concept, Relational Junkies are having a hard time. But think again: how hard can it really be? All you need is an open mind and a good fireside chat. I can provide the latter. Can you bring the former?
The intended audience for this article are leaders/ business evangelists/ strategists/ decision makers. Developers who have already crossed the RDBMS-NOSQL bridge may derive some philosophical value though!
Is that a table? Is that a row? is that a column?
Are you the kind of person whose blood pressure returns to normal when you see this?
If you are, let me explain to you why this is no longer enough for the internet world.
Imagine an ad recommendation engine crawling through your browsing habits (3rd party cookies created from websites that you visit regularly). It is collecting whatever data it can - about your hobbies and interests. This data is worth its weight in gold, as the key to monetization in the internet world is serving personalized content. However, there is absolutely no guarantee that all the fields above will be available for every website found. Sometimes, unexpected fields will be found (like "MAC address" or "products browsed") because there are always custom data fields exchanged between ad companies and host websites. The list of visited websites could be 1, 100 or 1000 or even millions. There will be comma separated strings. There will be statistics about those visits like frequency, click counts, wish list/ cart items, etc. There may be SO MUCH INFORMATION that boggles the mind - and good luck trying to fit that in a table like that. In addition to real time ad-serving, that crawler may also need to get all of that back, store it somewhere and feed it to a downstream system which then parses and makes sense of it. Do you think these applications will exchange the connection strings for a SQL Server or MySQL database? It will be more like one of the several No-SQL data stores.
There are, broadly, four kinds of NoSQL databases serving different kinds of use cases:
- Key Value Databases (like Redis or Couchbase)
- Document Databases (like MongoDB or Azure DocumentDB)
- Column Family Databases (like Cassandra or HBase) and
- Graph Databases (like Neo4j or GraphDB which is built on .NET)
Explaining all the above kinds of No-SQL databases is out of scope for this article. That is textbook stuff and is readily available on the internet. Instead, I want to gear this article towards my relationally inclined friends out there and establish equivalency/ point out differences in approach between RDBMS and No-SQL. For that purpose, I have picked Cassandra as my representative of NoSQL databases. Let us see how the above table (or its equivalent) will look like in Cassandra.
In the world of Cassandra, a table is called a "Column Family" (or a "Super Column Family" if it is more complex). The equivalent Cassandra column family (read "table") will look like this in JSON representation (we can choose any representation we want - XML would do as well):
What do you notice? There are some glaring differences:
Every column has a name, but that name is not across the top any more. Why? Because every row can have different columns! The first row has the column "macAddress". The 2nd row has "productsBrowsed". The 3rd row has "foo" which none of the others have. For this reason, the neat tabular structure disappears, as you can no longer arrange the data in symmetric real estate with uniform column names declared ahead of time at the top
A more subtle observation: if so many of these fields are optional, then that could be very harmful on relations - as In the relational world, a missing primary key that has a reference in a related table (as a foreign key) would be an integrity violation - dirty data. But the whole premise of NoSQL is to get away from rigid relations. You do your data modeling in such a way that relations only exist loosely as a concept, not as a sacrosanct rule whose violation would throw exceptions
Have you noticed something about the rows? They are beginning to look more like free text - there is still some structure to it (they are made of column names and values), but there is really no saying how long a particular row value would be - what would be the delimiters, what would be the parsing rules. Does that turn you off? Do you prefer order instead of chaos? Welcome to the internet world of social media. Facebook exposes API-s which let 3rd parties do sentiment analysis. Rows are used to store all of our Facebook comments! Do you think we can store Facebook comments in an RDBMS table? No, that would be a disaster. Why? Well, you just encountered the first reason: you essentially need a shapeless container that can fit almost anything. But that is not the most important reason - which happens to be scale. The rows can be very big, and there can be billions of such rows. Facebook collects 500 TB of data a day (and that is a modest 2012 estimate). RDBMS cannot scale to those limits. We will come back and talk about scale later in this article
Data Modeling - Throwing Conventional Wisdom Out of the Door
When you are modeling your data for Cassandra, your model should be based on queries even if data ends up being de-normalized. Normalization is not the goal. (Normalization means removing duplicate data, i.e., storing data only once). As opposed to that, normalization of data is usually an important goal while modeling for RDBMS.
A simple example should make this clear. Say you have an Employees table with fields like Employee Id (Primary Key), Title, City, State, etc.
When modeling this for Cassandra, the first question you should ask is what are the lookup queries going to look like? Is my app going to query the database to retrieve Employees by Title? City? State?
Say the answer is - yes, the app is going to query by title (Find all the employees whose Title is 'Analyst') and also by City (Find all the employees whose City is 'Chicago').
In that case, the Cassandra data model should look like this (using CQL aka Cassandra Query Language):
CREATE TABLE employees_by_city ( employee_city text PRIMARY KEY, employee_id uuid, employee_title text, employee_state text ) CREATE TABLE employees_by_title ( employee_title text PRIMARY KEY, employee_city text, employee_id uuid, employee_state text )
See how there is significant duplicate data across these two tables? That violates the established RDBMS concept of normalizing data as a best practice.
Let us see what advantage this gives us. In Cassandra, we have a cluster of nodes and each node stores some records (similar to "sharding" in RDBMS). Rows are spread around the cluster based on a hash of the "partition key", which is the first part of the PRIMARY KEY. In the above example, the entire Primary Key is the partition key, as there is only one column per primary key. Reading from multiple partitions during query execution is very costly in Cassandra. That is why the data model should analyze the query first, asking: how can I arrange the data so that I read only minimum number of partitions during this query?
With the above data model, this is how the rows are stored: they are spread across partitions in such a way that all records with the same title go to the same partition, and all records with the same city go to the same partition as well. Therefore, when we query
Select * from employees_by_title where employee_title='Analyst'
It hits only one partition - the one that contains all the records for the title 'Analyst'. Similarly, when the app queries by City, that hits one partition as well. Hence reads are very fast.
Obviously, though read performance is now optimized, storage is not. The same employee record now exists in duplicate partitions - one for its City and one for its Title. It is redundant data, and to an RDBMS Junkie, plain sloppy.
This has two potential issues: (1) Costs more storage (2) When something gets updated/ deleted/ inserted, the write operations will have to write in multiple places, so write latency will be higher - and it is also error-prone, possibly leading to inconsistent data if someone forgets to change one of the places
I could have gone ahead and tried to solve the same problem in the traditional SQL Server way - using a clustered index on either title or city. The read performance would not have been terrible - and we could have avoided the potential issues listed above. But I will not do that. Because this is not the perfect example case study for Cassandra. The simplicity of this example almost makes it more suitable for RDBMS.
Instead, let us talk about the true goals Casandra wants to achieve, and how it achieves them. Once that is out of the way, we will come back to this example. We will explore Cassandra's answer to the above issues.
Availability versus Consistency - the Paradigm Shift
The internet has made many people very rich. There is a lot of money in eyeballs and clicks. It pays for our cars and homes and air conditioning (and roads and schools too - when Mark Zuckerberg pays his taxes). What do all these eyeballs want?
A V A I L A B I L I T Y + P E R F O R M A N C E
Yes, a website must be always available. What happens if clicking on that nice pair of glasses on Walmart.com takes too long to show you the details? You go and buy it on Amazon.com. What happens when you try to add an item to the cart, it tells you: "Partition detected across Atlantic. I could not decrement the quantity of this gumball machine by 1 in our European Data Center. Try again in 2 minutes", will you calmly say "Oh okay" and try again in 2 minutes?
The point is - in today's world, Availability (the ability to respond successfully to your user's HTTP requests) is of prime importance, and directly equates with money. But in today's world of geo-distributed data centers and inter-continental latencies, there are always partitions (read network failures) - either present or perceived. According to Eric Brewer's widely known (and widely misunderstood) CAP Theorem, when a partition (network failure) occurs in a distributed system, you must choose between Availability and Consistency.
In the Availability versus Consistency battle, Availability wins in today's world because availability means money.
Following natural evolution of the free market, products like Cassandra have evolved to favor Availability over Consistency. When you add something to the cart and the inventory database quantity must be decremented by 1, the application tries to keep the replicas in USA and Europe consistent by replicating the change everywhere. But if a partition appears, and it cannot ensure Consistency of the data (i.e., cannot update the remote replica), it then faces a choice - it can either show that friendly error message to the user (favoring Consistency), or let the user complete the transaction and, under the hood, keep on trying to reach an eventually consistent state later on (favoring Availability over Strong Consistency). In essence, I just explained Brewer's CAP theorem in a simple manner.
This is exactly where the paradigm shifts. Yesterday's reality suddenly disappears today - the perceived need for consistency to the degree that ACID purists stand for is gone in the internet world. The very basis of RDBMS is ACID, where the sacred second alphabet stands for Consistency. Rules have changed - and the cost is not zero. Due to this temporary inconsistency that we allowed, someone is probably going to add a gumball machine to the cart in Europe only to find that it is actually not in stock. It will be one pissed off customer if it at all happens, versus a lot of pissed off customers within a 30 second network partition window if we kick them all out because we could not update our replicas. The market rules have favored many at the cost of few. We have evolved from snobbish purists who would scorn at an error-allowing design to "that's a trade-off" mentality! Our priorities have changed!
Availability and scalability/ throughput together pose a massive engineering problem (one that Microsoft Azure solves very well). Partitions is the answer to the scalability problem (think of it this way: if there is too much of anything, you will need many containers, and it is easy to add containers as needed). It is not new - we used to apply sharding to scale an RDBMS solution as well. Sharding is nothing but horizontal partitioning. In Cassandra, partitioning makes each node responsible for only a portion of the keys, increasing write throughput drastically. With proper data modeling (partitioning strategy), Cassandra's inherently high write and read performance guarantees will be realized - and the website it is backing up will be considered "highly available"!
So what if that recommended data model results in duplicate (de-normalized) data? Programmers will be extra careful to update everything on writes. So what if we use some extra space? Disk is cheap now-a-days. Cassandra serves a greater purpose - and these compromises are like collateral damage. Cassandra does these 4 things REALLY well:
- Fast read / write (read "Highly Available" at the cost of "Strongly Consistent" - you now know what they mean when they say "Cassandra is eventually consistent")
- Ability to add more nodes/ clusters as you need more capacity (read "Incrementally Scalable", "mean throughput-churning-machine")
- Reliable cross-datacenter replication (read "Disaster Recovery")
- Configurable Consistency Model (yes it is possible to turn the knob and make Cassandra behave with Strong Consistency in the "Quorum" mode where we can specify how many nodes must have replicated data)
That's what the eyeball-crazy world wants! Whatever we have to sacrifice at the altar of speed is worth it - and good architects and developers never give up, they will bend their back to improve that consistency that we sacrificed. It is not a Boolean, it is a sliding scale. When Eric Brewer explained his CAP theorem as it is applicable in today's world, he sounds frustrated by how misunderstood the theorem is. The theorem is widely (and incorrectly) understood to claim, "There! You have chosen A and P (Availability and Partition Tolerance), so no C (Consistency) for you! See ya!". It is hardly like that. We can still make things consistent pretty quickly with the right design. The theorem merely describes that Perfect Consistency is not achievable if you choose A and P. But nothing is perfect anyway!
Microsoft Azure offers a Document DB storage option along with its blobs and SQL Servers. This is very nice and keeping in perfect pace with today's crazy world. Any enterprise should, in my opinion, adopt the strategy of using different kinds of storage for different use cases. Trying to adopt an "one-size-fits-all" strategy may backfire. There are use cases where RDBMS fits well. There are use cases for Cassandra (High Availability, Eventual Consistency), HBase (Strong Consistency, serialized updates), Aerospike (Immediate consistency in-memory key value store) and other No-SQL databases. Some data stores need replication, some don't. Some need backups, some don't. The hybrid approach to persistence has a new name - Polyglot Persistence.
RDBMS purists are known to be particularly fond of banks and financial institutions. They usually consider the financial domain as the use case bringing out how important consistency is - after all, money matters are so important that should not all transactions result in a consistent state immediately? Eventual consistency models must not be good enough? Rigid ACID concepts must be followed by the financial domain? Therefore, they must be using RDBMS?
Turns out that even that assumption may not be true. I will not explain this myself as Eric Brewer himself does the lucid explanation of ATM machines and banks in general. Check this out (it is an excerpt from the longer video which is linked on that page as well).
In conclusion, like the Matrix spoon, we must bend ourselves with the bending rules of today's Big Data world. We must preserve the sanctity of the purist mindset but also focus on getting more done by willing to bend the rules.
My views are my own, not my employer's.