6 - Implementing a Column-Family Database
|On this page:||Download:|
|What is a Column Family? | Designing a Column-Family Database | Designing Column Families for Efficient Data Storage and Retrieval | Using Wide Column Families | Indexing Columns | Analyzing Data | How Adventure Works Plan to Store Information about Website Traffic | Implementing a Column-Family Database to Maximize Scalability, Availability, and Consistency | Maximizing Scalability | Ensuring Availability | Maximizing Consistency | Summary | More Information|
Key/value stores and document databases are very row focused. What this means is that they tend to be optimized to enable an application to retrieve the data for complete entities that match one or more criteria. However, sometimes an application needs to retrieve data from a subset of fields across a collection of documents, in a manner comparable to performing a projection operation in SQL. Column-family databases enable you to store information in a more column-centric approach than is possible with most key/value stores or document databases.
In a column-family database, you can structure the rows as collections of columns. A single row in a column-family database can contain many columns. You group related columns together into column families and you can retrieve columnar data for multiple entities by iterating through a column family. The flexibility of column families gives your applications a wide scope to perform complex queries and data analyses, similar in many ways to the functionality supported by a relational database.
Column-family databases are designed to hold vast amounts of data (hundreds of millions, or billions of rows containing hundreds of columns), while at the same time providing very fast access to this data coupled with an efficient storage mechanism. A well-designed column-family database is inherently faster and more scalable that a relational database that holds an equivalent volume of data. However, this performance comes at a price; a typical column-family database is designed to support a specific set of queries and as a result it tends to be less generalized than a relational database. Therefore, to make best use of a column-family database, you should design the column families to optimize the most common queries that your applications run.
|When you design a column-family database, focus on the queries that your application needs to perform and optimize the structures to support these queries.|
This chapter provides information on how to design and implement the schema for a column-family database to best meet the needs of your applications.
What is a Column Family?
As the name implies, the column family is the feature that distinguishes a column-family database from other forms of NoSQL databases. A column family is simply a collection of columns that hold the data for a set of entities. Chapter 1, "Data Storage for Modern High-Performance Business Applications" described how, in the simplest of cases, you can think of a column family as being conceptually similar to a table in a relational database because data is organized as a set of rows and columns. However, unlike a table in relational database, the columns in a column family do not have to conform to a rigidly defined schema for every row. In fact, it is preferable to think of a column family as a map of name/value pairs where the contents of this map can vary on a row by row basis. For example, if you need to store the names of customers in a column-family database, you could create the Identity column family shown in Figure 1. This column family holds the title, first name, middle names, and last name of each customer. In this diagram, note that each customer is identified and sorted by using a unique key (CustomerID), but other than that the structure of the data for each customer can be different. If a row does not require a particular column, you can simply omit it from that row. This approach enables a columnfamily to store sparse data very efficiently compared to a table in a relational database.
Figure 1 - A simple column family with a fixed set of column names holding customer information
Figure 1 uses Title, FirstName, MiddleName1, MiddleName2 (continuing this sequence to as many middle names a particular customer might have), and LastName as a set of domain-specific column names. Column-family databases do not enforce any type of schema on the rows in any given column family, and you can give the columns in each row completely different names. However, depending on the queries that your applications need to perform, it is advisable to establish a convention for the column names otherwise it can be difficult to interrogate the database (if an application does not know the names of the columns, it will need to discover them somehow).
Some scenarios are more suited to using generated names for columns rather than domain-specific identifiers. As an example, Figure 2 shows a column family created as part of a shares portfolio management system for a financial institution that monitors the stocks and shares held by its customers. The portfolio for a customer can contain stocks and shares for any number of publicly listed companies. The Portfolio column family uses stock tickers as the column names and records the number of shares held as the value of each column. If a customer does not hold shares in a particular stock, then it is not listed as part of that customer's portfolio. If a customer invests in a new stock, it is easy to add a column identified by using the appropriate stock ticker symbol to the list of stocks and shares for that customer.
Figure 2 - A column family with a dynamic set of column names based on stock ticker symbols
Most column-family databases only support simple, scalar types as the values of columns. However, a few column-family databases allow an individual column be structured as a collection of child columns, leading to a hierarchical arrangement. Columns that contain collections of child columns are sometimes referred to as super columns.
As described earlier, the data for an entity in a column-family database can span a large number of columns, and you should use column families to group logically related data together. In the examples shown in Figures 1 and 2, both column families could be part of the same database, each column family storing data about a different aspect of a customer (identity information and the portfolio of shares that a customer holds). Both column families use the same row key (CustomerID), and an application can use this key to retrieve the data for a specific customer. The collections of columns in a column family are also ordered by using this row key to enable the database to find data for a specific row quickly.
Many column-family databases also sort the columns within a row by the column name. In Figure 2, the stock ticker symbol columns for each row are stored in alphabetical order. A few column-family databases enable you to customize the way in which columns are sorted.
In most column-family databases, a column family also defines the physical storage for the data; different column families are usually stored as separate files, possibly on different disks (the actual implementation depends on the column-family database that you are using). What this means in practical terms is that when an application queries the data for an entity, the database software only needs to read data from the column families that actually contain the data being retrieved. In contrast, if you are using a document database or a key/value store, the database software typically either has to fetch an entire document or value
|Remember that in most NoSQL databases, the unit of I/O is the aggregate being read or written.|
Many document databases implement projections which enable an application to specify the fields that a query should return. However, the document database still needs to read the entire document from storage in order to extract this information and return it to the application. Similarly, updating the data in a document may necessitate rewriting the entire document to disk.
A few document databases enable you to create covering indexes that can remove the requirement to read an entire document for queries that only fetch data in indexed fields. Refer to Chapter 5, "Implementing a Document Database" for more information.
Similarly, if an application changes the value in a column, the database software only needs to write data back to the affected column family; data for the same row held in other column-families remains unaffected. If you are using a document database or a key/value store, modifying the information in a single field in a row may necessitate writing the entire document or value back to disk. Figure 3 illustrates the process of querying data in a column-family database compared to a document database.
|The physical structure of a column-family database enables you to partition your data vertically. This strategy helps to minimize the amount of data that the database software actually needs to read from disk (or memory) to satisfy a query, or that it needs to save to disk (or memory) to update information.|
Figure 3 - Retrieving data from a column-family database compared to a document database
The syntax shown for QUERY1, QUERY2, QUERY3, and QUERY4 in Figure 3 is for illustrative purposes only. It is not intended to be an example of any particular query language for a column-family database or a document database.
In Figure 3, QUERY 1 retrieves customer names from the column-family database and only accesses data in the Identity column family. Similarly, QUERY 2 fetches information about the shares that a customer holds and only accesses data in the Portfolio column family (the term Portfolio.\* means all columns in the Portfolio column family). Only QUERY 3 that combines the data to retrieve the names of customers and the shares that they hold needs to access both column-families. QUERY 4 that retrieves customer names from a document database holding identity and portfolio information actually has to read the entire document (including the shares data) to fetch just the customer details.
Most column-family databases and document databases cache data in memory, so the amount of physical I/O performed by a database to retrieve data may be less than that indicated by the previous discussion. However, the same principles apply to cached data. If you need to cache an entire document in memory rather than just the relevant parts of the data defined by a column family, you may be wasting precious memory resources which will, in turn, affect the performance of the database.
Designing a Column-Family Database
A column-family database is a suitable repository for capturing large amounts of sparse, volatile information very quickly, and providing efficient query access to this information. However, to make the best use of a column-family database you need to design your column families carefully. You need to understand the data that your applications capture, and the queries that they need to perform over this data. The following sections describe some of the factors that you need to think about.
Designing Column Families for Efficient Data Storage and Retrieval
Remember that in a typical column-family database, a column family fulfills two purposes:
- At the logical level, it groups related columns for an entity together.
- At the physical level, it defines the file and storage structure of the database.
A well-designed column-family database enables an application to satisfy the majority of its queries by visiting as few column families as possible, while ensuring that the data held by each column family is relevant to the queries that reference them. In other words, you should partition the data that constitutes an entity vertically into sets of columns, where each set fully satisfies one or more queries but does not contain columns that are not required by most queries. Each set of columns is a candidate to become a column family. Ideally each column family should be non-overlapping (you should try and avoid duplicating the same data for the same row across different column families), although if the data is relatively static storing multiple copies of a column in different column families may help to optimize some queries.
|Design your column families to optimize the I/O and caching performed by the database.|
As an example, consider a column-family database that stores census information. A typical census database contains information such as the name, address, gender, date of birth, occupation, ethnicity, and religion for every member of the population, and it could contain hundreds of millions of rows. Additionally, a census database has to support a varied set of queries such as "How many people are aged between 40 and 50?," "What is the most common occupation for women in their 20s?," or "What is the ethnic mix of the population?" It would be possible to store all this data in a single column family as shown in Figure 4 below.
Figure 4 - Implementing the census database as a single column family
A column-family database often sorts the columns by column name within a row, so the order of the columns in the column family for the census database are likely to be different from that shown in Figure 4.
This strategy results in a structure that is very similar to that of a document database. A query that only needs to find the age of a person requires the database software to retrieve the entire set of columns for that person, possibly resulting in a significant amount of disk I/O and negating an important advantage of using a column-family database as described in the previous section. The same is true for queries that simply need to find the name or occupation of a person.
|If a column-family database caches column families in memory, then caching data for a column family containing a lot of irrelevant or infrequently accessed data is a waste of resources.|
An alternative (and probably impractical) approach is to store each column in its own column family, as shown in Figure 5. This time, the database software only needs to read the data that satisfies each query performed by the application, but because the data is spread across multiple column families retrieving this data may require performing a separate I/O operation for each column.
Figure 5 - Implementing the census database as a large collection of column families
In the census database, it is more optimal to divide this data into separate column families based on the query requirements of the application. Figure 6 shows the same data segregated into Name, Profile, and Address column families. Each column family groups data that is regularly queried together into the same vertical partition, but isolates it from other data that is not commonly required by these queries.
Figure 6 - Implementing the census database as a Name, Profile, and Address column families
It is important to establish the column families that your applications require early on in the development cycle. Adding a new column family to an existing database may require you to take the database offline while the operation is in progress (information about column families has to be propagated across all database servers if you are implementing sharding and replication). Migrating columns between column families is also a non-trivial issue, especially if you have hundreds of millions of rows.
Using Wide Column Families
The census database illustrates an example where there may be many hundreds of millions of rows, but each row only consists of column families that contain only a few columns. These types of column families are referred to as narrow column families.
The purpose of defining the column families in this way was to avoid the I/O overhead of retrieving irrelevant data from disk. However, performing a few large I/O requests that fetch relevant data is always going to be more efficient than performing lots of small requests for the same information, and you can exploit this mechanism to good advantage if your applications frequently need to process entities that contain significant amounts of information. The Portolio column-family example shown earlier in this chapter described how the columns in a column family can be dynamically generated to list the volume of stock that each customer has purchased. There is effectively no limit to the number of different stocks that a customer can hold, so each row could potentially contain a large number of columns. If the customer is an institution representing a large pension fund rather than an individual person, it is possible that it could hold shares in many hundreds or thousands of different stocks. If an application needs to find the shares held by a given customer, all of these items are located in the same row in the same column family, and they can be retrieved very quickly by performing a small number of contiguous I/O operations. Column families where the rows contain a large number of columns are called wide column families.
|Wherever possible, you should design a column-family database to take advantage of wide column families.|
As a second example, suppose that you are creating a database that needs to hold information about the prices of stocks as they are traded on the stock market. This information can change with great rapidity. You might also need to retain a history of previous prices for applications that analyze trends in stock market data. Figure 7 shows how you can store this data in a wide column family.
Figure 7 - Storing stock price data in a wide column family
In this column family, the stock ticker symbol is the row key, and the columns themselves consist of the date and time that a price change occurred as the column name, and the new price as the value. If the columns in the StockPrices column family are sorted, the prices are listed in date/time order (remember that this might happen automatically, depending on the way in which the column-family database stores the data for columns). As new prices are captured for a stock item, they can be quickly added as a column to the row for the stock item. Similarly, because each row holds the entire stock price history for an item, an application can quickly retrieve the historical data for any given stock.
Some column-family databases also enable you to add an expiration time or time to live (TTL) to a column. When this time elapses, the column can be removed (the database does this in the background). TTLs are useful if the data in a column has a natural lifetime beyond which it is no longer relevant or useful. For example, in the StockPrices column family, the second-by-second movements in the value of a stock become less important as time passes, and a stock market analyst is going to be more interested in the trend in the value of that stock over a weekly or monthly period. An application that captures and stores the prices could specify a TTL for each price of two months. At the end of each month, the application could sweep through the database and archive summary stock price information (maybe just capture the daily or weekly prices to a different data store) for prices that are due to expire in the following month.
Each row in column-family database is identified by using a unique key, and all column families that contain data for a row share the same key value. Keys are automatically indexed and rows in a column family are ordered by the key values. It is therefore very quick for an application to retrieve data by using the row key and you should pay special attention to the values that you use for row keys. For example, if your data has a natural unique identifier (such as a social security number in a personnel database), and your applications regularly query data by using this identifier, then use this identifier as the row key.
In many circumstances, it is advisable to avoid using a monotonic increasing value as the row key in a column-family database as this can lead to hotspots and I/O contention resulting in poor scalability. For example, if you are building a database for a retail application that creates and stores information about orders using the order number as the row key, and the order number is generated by using a monotonic increasing sequence (the first order is order 1, the second is order 2, and so on), then the most recent orders are all likely to be stored together in the same physical region of the database. These are the orders that are most likely to be the focus of attention and will be accessed frequently by users running the application. As more orders are created, the focus shifts, but the phenomenon is of users hitting one region in the database very hard before moving on to the next. In these circumstances, it is better to generate a random (and unique) value such as a GUID for the row key and store the order number as a column value.
Some column-family databases provide the option to hash the row keys, which can help to avoid hotspots in your column families but at the cost of losing the ordering of rows. This mechanism is highly suitable for applications that perform point queries (queries that need to retrieve data from a single row quickly), but it can slow the performance of range queries (queries that fetch a contiguous set of rows) as logically adjacent rows are physically distant from each other.
However, an important strength of a column-family database is the ability to query data based on the values in the various columns other than the row key. As an example, in the census database, a query such as "What is the most common occupation for women in their 20s?" requires that an application can quickly find all rows in the Profile column family where the value in the DOB (Date Of Birth) column falls within a given range and the value in the Gender column is Female, and then extract the value in the Occupation column. To support this type of processing, many column-family databases enable you to create secondary indexes over columns (distinct from the primary index over the row key).
A secondary index spans the data held by one or more columns in a single column family. In the census database, you could create an index over the DOB column in the Profile column family to support queries that retrieve data based on the date of birth. In many column-family databases, you can create composite indexes that span more than one column. A composite index can further improve the speed of a query if it covers the entire set of columns required by that query. For example, if you create a composite index that spans the DOB, Gender, and Occupation columns (in that order), an application that needs to determine the most common occupation for women in their 20s can find the information directly from the index and may not need to access the underlying column family.
Indexes have to be maintained as rows are added to and removed from a column family. This can be an expensive and time-consuming operation, especially if the column family contains a very large number of highly dynamic rows. To minimize this overhead, many column-family databases maintain secondary indexes by using background threads that do not block any read or write requests performed by applications running in the foreground. Therefore, it is possible that a query that retrieves data by using an index might not find a row that has been recently added.
The factors that determine whether creating a secondary index over a column will improve query performance depend on the number of distinct values that the column can contain and how the column-family database actually implements secondary indexes. For example, some column-family databases implement indexes as balanced B-Trees (or they use a similar hierarchical strategy). In these cases you should consider creating indexes over columns that have a large number of distinct values. The hierarchical nature of a B-Tree index enables the database to quickly home in on rows where the indexed column contains a specific value.
Figure 8 shows an example of a column family containing 100,000,000 rows of information about car drivers and their licenses. The rows are identified by DriverID, but in this scenario it is common for queries to retrieve information by driver’s License number (for reasons outside the scope of this example, the database designers chose not to use the driver’s License number as the row key). Without an index, these queries would necessitate scanning through all 100,000,000 rows. If the column-family database provides balanced B-Tree indexes, then it is possible to find a row that contains a specified driver’s license number by starting at the root node of the index and following the path where the license number falls in the range of values indicated by the LicenseRange data in each node. The leaf nodes of the index contain the row keys. The I/O and processing associated with accessing data through the index should be significantly less than reading all 100,000,000 rows of the data and performing a linear search.
Figure 8 - A balanced B-Tree index over unique data values
Other column-family databases implement flat indexes, often by creating a separate column family that holds the index data. Remember that column families are especially suitable for retrieving data from wide column families. In this case, indexes over columns that have a relatively small number of different values will work better than indexes over columns that are highly distinct.
Figure 9 shows the License index from Figure 8 implemented as a column family. The row keys are the driving license numbers, and the values are the driver IDs. This is a very narrow column family. The index is useful because the data is sorted by driving license number and it is quicker for the database to search through this data than to perform a linear search in the DriverDetails column family, but the repeated piecemeal I/O overhead associated with retrieving and searching through a large set of narrow columns might actually be more costly than simply fetching every row from the DriverDetails column family.
Figure 9 - An index implemented as a narrow column family over unique values
In contrast, suppose an application needs to query drivers' licenses according to the city that issued them. Figure 10 shows a flat index over the CityObtained column in the DriverDetails column family (this column contains the city in which the driver obtained their license). In this case, assuming that there are 20,000 city authorities that can issue licenses, each city will account for 5,000 driving licenses on average. Creating an index over this column results in the index utilizing a wide column family. The index is sorted by CityObtained (the columns in each index entry may also be sorted by DriverID), and an application can quickly use this index to retrieve the list of drivers that obtained their license in a specific city. Furthermore, because the list of drivers for each city is all held in the same row, this data will likely be physically grouped together on disk optimizing the I/O effort required to retrieve the information.
Figure 10 - An index implemented as a wide column family over non-unique values
|If your column-family database does not support secondary indexes, you can create your own column families that simulate them. However, it is your responsibility to maintain these indexes as rows are added to and removed from the various column families that they reference.|
Even with an appropriate index in place, locating and retrieving data can consume some processing power and take a little time. In a large database that has to support many thousands of concurrent users, this effort can add up and become a significant factor that affects the performance of the system. Therefore, many column-family databases also implement Bloom filters to help determine whether a row that matches a specified set of criteria actually exists before attempting to retrieve it.
A Bloom filter is essentially an array of bits and an associated set of hash functions. The bits in the array are all initialized to 0. When the database stores a value in a column family, it hashes a copy of this value by using each associated hash function in turn to calculate a set of locations in the bit array. The database stores a 1 in each specified location in this array. To determine whether the column family contains a specific value, the database runs the same set of hash functions over that value, and uses the results as a collection of indexes into the bit array. If the value at any location is a 0, then there is no matching value in the column family and the search can finish immediately. If the value at every location is a 1, then the column family might contain a corresponding value. However, bear in mind that two values can hash to the same location, so you cannot be sure until you actually retrieve the data, but if you select the hash functions carefully, and have a big enough array of bits, then these cases should be a tiny minority.
An index can help an application to locate rows that match a given set of criteria, but many applications also need to analyze the data held in these rows. In the census database query "What is the most common occupation for women in their 20s?," a composite index over the DOB, Gender, and Occupation columns can help to find a list of occupations for females in their 20s, but it does not help the application determine which is the most common occupation. If the column-family database supports a map/reduce framework, you can use this functionality to generate summary information.You write code that extracts the data values from each column that you wish to summarize (the map phase), and then perform whatever summary functions you application requires over this data (the reduce phase) and store the results in the database. The map/reduce code runs automatically whenever rows are added to or removed from a column family. When your application needs to summarize data, it simply fetches the appropriate summary value from the database.
For more information about using map/reduce frameworks, visit the MapReduce.org website.
Some column-family databases support materialized views that you can use to store more comprehensive summary information. Depending on the way in which the column-family database implements them, a materialized view can either be generated each time it is queried by an application by using code stored in the database, or the data that matches the view can be identified and copied to a separate location when the view is first defined, and as rows are added to or removed from a column family. The first approach is suitable for volatile data, while the second is more efficient if the data is relatively static.
If your column-family database does not support materialized views, you can simulate them by using a separate column family (you can write application code that extracts the data for the view from the database, and then processes and stores the data in the column family.) How up to date you keep the materialized view depends on the nature of the data in the view. Maintaining the information in a materialized view implemented in this way can be an expensive process if the data is dynamic, and it may be sufficient to refresh the view on a daily or even weekly basis rather than every time a change occurs in the underlying data.
Some column-family databases enable you to store and access different versions of data in a column. This can be useful if you need to analyze trends in data, and examine how values have been modified over time.
In these databases, each value can have a version number (sometimes implemented as a date/time field). When you modify the data in a column, the database saves the new value together with a new version number, but leaves the old value in the database as well. In this way, you can think of a column family as being a three-dimensional structure of rows, columns, and versions.
How Adventure Works Plan to Store Information about Website Traffic
To help assess the performance of the Shopping application, and also to gather information about how customers use the application, the developers at Adventure Works wish to analyze the website traffic for their system. This enhancement is not currently implemented, but to perform this analysis, they need to collect information about the location of each user browsing the website and the frequency with which users access each page on the website. They do not need to capture any personal information about users. To support this requirement, the developers designed a column-family database containing the PageAccess column family shown in Figure 11.
Figure 11 - Storing information about website traffic
The PageAccess column family records the location from which each page was accessed, using the date and time that the page was read as the column name. The column family stores the data for each page as a separate row, keyed by using the relative URL of the page within the website.
The functionality that captures this information is a future development and is not included in the sample application provided with this guide
Implementing a Column-Family Database to Maximize Scalability, Availability, and Consistency
Historically, column-family databases were explicitly designed to store large amounts of data and enable applications to read and process this data very quickly. As a result, they are inherently scalable and provide high availability, but often at the cost of immediate consistency. This section summarizes the ways in which most column-family databases implement scalability and availability, and describes the tradeoffs commonly made against consistency to help maximize performance.
In common with most NoSQL databases, the underlying structure of the majority of column-family databases lend themselves to partitioning, both vertically by placing different column-families on different servers, and horizontally by sharding a column family based on the row key. For example, in the census database described earlier in this chapter, it is possible to shard the rows and distribute the data horizontally across servers, based on the PersonID key. Additionally, each shard could be partitioned vertically, and the data for the Name, Address, and Profile column families for each shard could be stored on separate disks to reduce I/O contention, as shown in Figure 12. In some cases, it may be possible to distribute the column families for a single shard across multiple servers, spreading the load and increasing scalability even further.
Figure 12 - Sharding and partitioning the data for the census database
Different column-family databases provide varying strategies for determining in which shard the data for a particular row is stored. By default, many column-family databases enforce a strict ordering, so that the data for row 2 is placed after row 1, and the data for row 3 is placed after row 2, and so on. This mechanism populates a single shard, and then overflows onto the next when the first shard has reached its maximum capacity. The rationale behind this strategy is that it preserves the key order of adjacent rows and optimizes applications that perform range queries based on the row key. If you need to analyze and summarize data, using a map/reduce framework can help to optimize calculations by isolating updates to summary values to the shard in which rows are inserted; the summary information in other shards will remain unaffected. However, as described earlier, storing data in row key order can lead to hotspots occurring. In this case, a single shard becomes the focus of all new inserts and this can result in that shard having to handle a disproportionate volume of requests while other shards remain relatively inactive. For this reason, some column-family databases hash the row key, effectively distributing the data randomly across shards. This mechanism spreads the load, but can reduce the performance of applications that perform range queries.
|If possible, select the partitioning mechanism that your database uses to shard data to match the type of operations that your application performs. If your application uses range queries extensively, partition data by row key. If your application is insert intensive, then partition data by hashing the row key.|
Most column-family databases implement auto sharding, where the sharding strategy and location is transparent to the applications storing and retrieving data. This approach requires configuring each server with information that it can use to locate other servers participating in the sharding cluster. An application can send requests to store to any servers in the cluster, and the servers negotiate with each other to determine where the data should be saved. Similarly, an application can send requests to retrieve data to any server, which communicates with the other servers to determine the location of the data and return it to the application. In this scheme, an administrator can scale the database out to additional servers simply by installing the database system software on these servers and adding the details of these servers to the cluster configuration. If the column-family database software supports automatic detection, new servers may be discovered by existing servers in the cluster.
To prevent data loss and help ensure that the entire database is always available, most column-family databases implement replication. Each server holding one or more partitions should be replicated, and the servers in a single replication cluster should ideally be located on different sites.
Due to the size of the rows (remember that a single row can contain many hundreds of columns), most column-family databases adopt the peer-to-peer replication strategy. In the alternative primary/secondary replication model, the primary server would be responsible for transmitting every insert to all other servers in the replication cluster. Each row could easily be several megabytes or even gigabytes in size, and this approach could result in the primary server becoming a bottleneck. The peer-to-peer model divides the responsibility across all servers in the replication cluster. However, inserts, updates, and deletes may take a little longer to propagate across the entire cluster (the servers may use the Gossip Protocol to share information with each other periodically) and there is consequently more scope for inconsistency.
To optimize insert and update operations many column-family databases attempt to cache changes in memory and use append-only forward logging to record the changes (this form of I/O is much faster than performing the various seek and write operations required to save a row to a specific location in the file for a column family). If a new or modified row has to be flushed from cache, then the data can be persisted to the column family on disk.
To ensure that delete operations are performed quickly, many column-family databases use tombstoning. The row is not actually deleted, but simply marked as no longer present. A compaction process can remove tombstoned rows as a batch job when the column family needs to be defragmented or cleaned-up.
In common with other types of NoSQL databases, to maximize performance and scalability most column-family databases only provide a very limited form of transactional consistency. In nearly all column-family databases, only write operations that affect a single row in a single column family are atomic. However, depending on the database software, you might also find that you can group writes that affect multiple rows in a single column family together as an atomic operation (as long as all the affected rows are located within the same shard), and you might even be able to write the data for a single row that spans multiple column-families as a transaction (again, if the column-families are all located on the same server).
Some column-family databases provide row-level locking; an application can lock an entire row to prevent it from being changed by another application. However, row locking also requires that locks are released in a timely manner otherwise concurrency may be severely limited,
If the database implements caching and forward-only logging as described in the previous section, a write operation completes successfully only when the log record for that operation has been saved to disk.
If you require transactional consistency across column-families and rows, you may need to implement this functionality as part of the application code that accesses the database or use a third-party library synchronization service. Chapter 8, "Building a Polyglot Solution," provides more information.
If you are replicating data to ensure availability, the peer-to-peer replication model typically depends on the servers in a cluster being grouped into appropriate read and write quorums to maximize consistency. However, if applications connect to different servers in a cluster to read and write data, write conflicts are highly probable because the more servers that have been added to a replication cluster, the more likely it is that write conflicts will occur. For this reason, many column-family databases also implement vector clocks or other forms of versioning information that can be used to detect and resolve conflicts. Several popular column-family databases automatically add timestamp information to individual columns, and automatically detect and resolve conflicts (the database designer may be able to specify the strategy to use to resolve conflicts). To maximize the performance of write operations, conflict detection and resolution is often deferred and only performed when an application reads data (or data is exchanged between servers during the Gossip Protocol). The rationale behind this approach is that if an application writes conflicting data, but that data is never read subsequently, then there is no need to detect or resolve the conflict.
The section "Improving Consistency" in Chapter 1 provides more information about how read and write quorums work, and the common forms of versioning strategies that many NoSQL databases use.
This chapter has described how to use a column-family database to store and retrieve structured information. You have seen how the data is organized as a tabular structure with data divided into rows and the columns for each row spanning one or more column families. The data for a row in a column family can comprise hundreds or even thousands of columns, and the structure of different rows in the same column family can vary (different rows do not have to contain the same columns).
Column families define the logical and physical structure of the database; the data for a single column family is typically stored together in the same file, and different column families are contained within different files. You should define the column families for a database to optimize the I/O performed by the database.
Each row is identified by a row key, and all rows for an entity that spans multiple column families share the same row key. The row key is usually indexed, enabling an application to retrieve information by using the row key very quickly. Most column-family databases also enable a developer to create indexes over the other columns in a column family, speeding the performance of queries that reference these columns as query criteria.
In common with most other NoSQL databases, column-family databases support sharding, enabling the rows in a column family to be distributed across servers to spread the load and improve scalability. Column families also provide natural vertical partitioning, and you can place different column families on different physical disks, again to spread the load and improve performance.
Many column-family databases also support replication to help ensure that the data for each column family is always available. Column-family databases tend to provide minimal transactional consistency and in most cases they only support atomic writes for operations that store data in a single column family for a single row. Eventual consistency is typically provided by using read quorums and row versioning with conflict resolution.
All links in this book are accessible from the book's online bibliography on MSDN at: http://msdn.microsoft.com/en-us/library/dn320459.aspx.
- You can find more information about using map/reduce techniques to summarize data on the MapReduce.org website at http://mapreduce.org/.
- MSDN Magazine published a series of articles on getting started with Cassandra, a popular column-family database. You can find these articles online at: