July 2009

Volume 24 Number 07

Scale Out - Distributed Caching On The Path To Scalability

By Iqbal Khan | July 2009

This article discusses:

  • Distributed caching
  • Scalability
  • Database synchronization

This article uses the following technologies:
ASP.NET, Web services, HPC applications

Contents

Distributed Caching
Must-Have Features in Distributed Cache
Expirations
Evictions
Caching Relational Data
Synchronizing a Cache with Other Environments
Database Synchronization
Read-Through
Write Through, Write Behind
Cache Query
Event Propagation
Cache Performance and Scalability
High Availability
Performance
Gaining Popularity

If you're developing an ASP.NET application, Web services, or a high-performance computing (HPC) application, you're likely to encounter major scalability issues as you put more load on your application. With an ASP.NET application, bottlenecks occur in two data stores. The first is the application data that resides in the database, and the other is ASP.NET session state data, which is typically stored in one of the three modes (InProc, StateServer, or SqlServer) provided by Microsoft. All three have major scalability issues.

Web services typically do not use session state, but they do have scalability bottlenecks when it comes to application data. Just like ASP.NET applications, Web services can be hosted in IIS and deployed in a Web farm for scalability.

HPC applications that are designed to perform massive parallel processing also have scalability problems because the data store does not scale in the same manner. HPC (also called grid computing) has traditionally used Java, but as .NET gains market share, it is becoming more popular for HPC applications as well. HPC applications are deployed to hundreds and sometimes thousands of computers for parallel processing, and they often need to operate on large amounts of data and share intermediate results with other computers. HPC applications use a database or a shared file system as a data store, and neither of these scales very well.

Distributed Caching

Caching is a well-known concept in both the hardware and software worlds. Traditionally, caching has been a stand-alone mechanism, but that is not workable anymore in most environments because applications now run on multiple servers and in multiple processes within each server.

In-memory distributed caching is a form of caching that allows the cache to span multiple servers so that it can grow in size and in transactional capacity. Distributed caching has become feasible now for a number of reasons. First, memory has become very cheap, and you can stuff computers with many gigabytes at throwaway prices. Second, network cards have become very fast, with 1Gbit now standard everywhere and 10Gbit gaining traction. Finally, unlike a database server, which usually requires a high-end machine, distributed caching works well on lower cost machines (like those used for Web servers), which allows you to add more machines easily.

Distributed caching is scalable because of the architecture it employs. It distributes its work across multiple servers but still gives you a logical view of a single cache. For application data, a distributed cache keeps a copy of a subset of the data in the database. This is meant to be a temporary store, which might mean hours, days or weeks. In a lot of situations, the data being used in an application does not need to be stored permanently. In ASP.NET, for example, session data is temporary and needed for maybe a few minutes to a few hours at most.

Similarly, in HPC, large portions of the processing require storing intermediate data in a data store, and this data is also temporary in nature. The final outcome of HPC might be stored in a database, however. Figure 1 shows a typical configuration of a distributed cache in an enterprise.

Figure 1 Distributed Cache Shared by Various Apps in an Enterprise

Must-Have Features in Distributed Cache

Traditionally, developers have considered caching only static data, meaning data that never changes throughout the life of the application. But that data is usually a very small subset—maybe 10%—of the data that an application processes. Although you can keep static data in the cache, the real value comes if you can cache dynamic or transactional data—data that keeps changing every few minutes. You still cache it because within that time span, you might fetch it tens of times and save that many trips to the database. If you multiply that by thousands of users who are trying to perform operations simultaneously, you can imagine how many fewer reads you have on the database.

But if you cache dynamic data, the cache has to have certain features designed to avoid data integrity problems. A typical cache should have features for expiring and evicting items, as well as other capabilities. I'll explore these features in the following sections.

Expirations

Expirations are at the top of the list. They let you specify how long data should stay in the cache before the cache automatically removes it. You can specify two types of expirations: absolute-time expiration and sliding-time (or idle-time) expiration.

If the data in your cache also exists in a master data source, you know that this data can be changed in the database by users or applications that might not have access to the cache. When that happens, the data in the cache becomes stale. If you're able to estimate how long this data can be safely kept in the cache, you can specify absolute-time expiration—something like, "Expire this item 10 minutes from now" or "Expire this item at midnight tonight."

One interesting variation to absolute expiration is whether your cache can reload an updated copy of the cached item directly from your data source. This is possible only if your cache provides a read-through feature (see later sections) and allows you to register a read-through handler that reloads cached items when absolute expiration occurs. Except for a few commercial caches, most caches do not support this feature.

You can use idle-time (sliding-time) expiration to expire an item if it is not used for a given period of time. You can specify something like, "Expire this item if nobody reads or updates it for 10 minutes." This approach is useful when your application needs the data temporarily, but once the application is finished using it, you want the cache to automatically expire it. ASP.NET session state is a good candidate for idle-time expiration.

Absolute-time expiration helps you avoid situations in which the cache has a stale copy of the data or a copy that is older than the master copy in the database. Idle-time expiration is meant to clean up the cache after your application no longer needs certain data. Instead of having your application keep track of necessary cleanup, you let the cache take care of it.
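
Concretely, here is how the two expiration types look with the ASP.NET Cache API, which most distributed caches mimic in their own APIs; the keys and objects in this sketch are illustrative only:

using System;
using System.Web.Caching;
...

public void CacheWithExpirations(object product, object session)
{
    // Absolute-time expiration: remove this item 10 minutes from
    // now, whether or not anybody is still using it.
    Cache.Insert("Product:1001", product, null,
        DateTime.UtcNow.AddMinutes(10), Cache.NoSlidingExpiration);

    // Idle-time (sliding) expiration: remove this item if nobody
    // reads or updates it for 10 minutes.
    Cache.Insert("Session:User1", session, null,
        Cache.NoAbsoluteExpiration, TimeSpan.FromMinutes(10));
}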

Evictions

Most distributed caches are in-memory and do not persist the cache to disk. This means that in most situations, memory is limited and the cache size cannot grow beyond a certain limit, which could be the total memory available or much less than that if you have other applications running on the same machine.

Either way, a distributed cache should allow you to specify a maximum cache size (ideally, in terms of memory size). When the cache reaches this size, it should start removing cached items to make room for new ones, a process usually referred to as evictions.

Evictions are made based on various algorithms. The most popular is least recently used (LRU), where those cached items that have not been touched for the longest time are removed. Another algorithm is least frequently used (LFU). Here, those items that have been touched the least number of times are removed. There are a few other variations, but these two are the most popular.
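
To make LRU concrete, here is a minimal, single-threaded sketch of the bookkeeping an LRU policy performs; a production cache would do this thread-safely, per partition, and typically by memory size rather than item count:

using System.Collections.Generic;

// A dictionary gives O(1) lookup; a linked list keeps items ordered
// from most recently used (head) to least recently used (tail).
public class LruCache<TKey, TValue>
{
    private readonly int _maxItems;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>
        _map = new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _lru =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int maxItems) { _maxItems = maxItems; }

    public void Put(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_map.TryGetValue(key, out node))
            _lru.Remove(node);
        else if (_map.Count >= _maxItems)
        {
            // Evict the least recently used item (the tail).
            _map.Remove(_lru.Last.Value.Key);
            _lru.RemoveLast();
        }
        _map[key] = _lru.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
    }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (!_map.TryGetValue(key, out node))
        {
            value = default(TValue);
            return false;
        }
        // Touching an item moves it to the front (most recently used).
        _lru.Remove(node);
        _lru.AddFirst(node);
        value = node.Value.Value;
        return true;
    }
}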

Caching Relational Data

Most data comes from a relational database, but even if it does not, it is relational in nature, meaning that different pieces of data are related to one another—for example, a customer object and an order object.

When you have these relationships, you need to handle them in a cache. That means the cache should know about the relationship between a customer and an order. If you update or remove a customer from the cache, you might want the cache to also automatically update or remove related order objects from the cache. This helps maintain data integrity in many situations.

But again, if a cache cannot keep track of these relationships, you have to do it, and that makes your application much more cumbersome and complex. It's a lot easier if you just tell the cache at the time data is added that an order is associated with a customer, and the cache then knows that if that customer is updated or removed, related orders also have to be updated or removed from the cache.

ASP.NET Cache has a really cool feature called CacheDependency, which allows you to keep track of relationships between different cached items. Some commercial caches also have this feature. Figure 2 shows an example of how ASP.NET Cache works.

Figure 2 Keeping Track of Relationships Among Cached Items

using System.Web.Caching;
...

public void CreateKeyDependency()
{
    Cache["key1"] = "Value 1";

    // Make key2 dependent on key1.
    String[] dependencyKey = new String[1];
    dependencyKey[0] = "key1";
    CacheDependency dep1 = new CacheDependency(null, dependencyKey);

    Cache.Insert("key2", "Value 2", dep1);
}

Key dependencies can be multilayered, meaning that A can depend on B and B can depend on C. So, if your application removes C, A and B have to be removed from the cache as well.

Synchronizing a Cache with Other Environments

Expirations and cache dependency features are intended to help you keep the cache fresh and correct. You also need to synchronize your cache with data sources that are modified by applications that don't have access to your cache, so that changes in those data sources are reflected in your cache to keep it fresh.

For example, let's say your cache is written using the Microsoft .NET Framework, but you have Java or C++ applications modifying data in your master data source. You want these applications to notify your cache when certain data changes in the master data sources so that your cache can invalidate a corresponding cached item.

Ideally, your cache should support this capability. If it doesn't, this burden falls on your application. ASP.NET Cache provides this feature through CacheDependency, as do some commercial caching solutions. It allows you to specify that a certain cached item depends on a file and that whenever this file is updated or removed, the cache discovers this and invalidates the cached item. Invalidating the item forces your application to fetch the latest copy of this object the next time your application needs it and does not find it in the cache.

If you had 100,000 items in the cache, 10,000 of them might have file dependencies, and for that you might have 10,000 files in a special folder. Each file has a special name associated with that cached item. When some other application—whether written in .NET or not—changes the data in the master data source, that application signals the cache simply by updating the corresponding file's time stamp.
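
Here is a minimal sketch of this file-based signaling with the ASP.NET Cache; the folder and the file-per-key naming convention are hypothetical:

using System.Web.Caching;
...

public void InsertWithFileDependency(string key, object value)
{
    // Hypothetical convention: one signal file per cached item,
    // named after the item's key.
    string signalFile = @"C:\CacheSignals\" + key + ".sig";

    // Any application, .NET or not, that touches this file's
    // time stamp causes the cache to invalidate the item.
    Cache.Insert(key, value, new CacheDependency(signalFile));
}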

Database Synchronization

The need for database synchronization arises because the database is being shared across multiple applications, and not all those applications have access to your cache. If your application is the only application updating the database and it can also easily update the cache, you probably don't need the database synchronization capability. But in a real-life environment, that's not always the case. Whenever a third party or any other application changes the database data, you want the cache to reflect that change. The cache reflects changes by reloading the data, or at least by not having the older data in the cache.

If the cache has an old copy and the database a new copy, you now have a data integrity problem because you don't know which copy is right. Of course, the database is always right, but you don't always go to the database. You get data from the cache because your application trusts that the cache will always be correct or that the cache will be correct enough for its needs.

Synchronizing with the database can mean invalidating the related item in the cache so that the next time your application needs it, it will fetch it from the database. One interesting variant to this process is having the cache automatically reload an updated copy of the object when the data changes in the database. However, this is possible only if your cache allows you to provide a read-through handler (see the next section) and then uses it to reload the cached item from the database. Only some of the commercial caches support this feature, and none of the free ones do.

ASP.NET Cache has a SqlCacheDependency feature (see Figure 3) that allows you to synchronize the cache with a SQL Server 2005/2008 or Oracle 10g R2 or later version database—basically any database that has the .NET CLR built into it. Some of the commercial caches also provide this capability.

Figure 3 Using SqlCacheDependency to Synchronize with a Relational Database

using System.Web.Caching;
using System.Data.SqlClient;
...

public void CreateSqlDependency(Customers cust, SqlConnection conn)
{
    // Make cust dependent on a corresponding row in the
    // Customers table in Northwind database
    string sql = "SELECT CustomerID FROM Customers WHERE ";
    sql += "CustomerID = @ID";
    SqlCommand cmd = new SqlCommand(sql, conn);
    cmd.Parameters.Add("@ID", System.Data.SqlDbType.VarChar);
    cmd.Parameters["@ID"].Value = cust.CustomerID;

    SqlCacheDependency dep = new SqlCacheDependency(cmd);

    // The command must actually execute for the change
    // notification to be registered with the database.
    cmd.ExecuteReader().Close();

    string key = "Customers:CustomerID:" + cust.CustomerID;
    Cache.Insert(key, cust, dep);
}

ASP.NET Cache SqlCacheDependency allows you to specify a SQL string to match one or more rows in a table in the database. If this row is ever updated, the DBMS fires a .NET event that your cache catches. It then knows which cached item is related to this row in the database and invalidates that cached item.

One capability that ASP.NET Cache does not provide but that some commercial solutions do is polling-based database synchronization. This capability is helpful in two situations. First, if your DBMS does not have the .NET CLR built into it, you cannot benefit from SqlCacheDependency. In that case, it would be nice if your cache could poll your database on configurable intervals and detect changes in certain rows in a table. If those rows have changed, your cache invalidates their corresponding cached items.

The second situation is when data in your database is frequently changing and .NET events are becoming too chatty. This occurs because a separate .NET event is fired for each SqlCacheDependency change, and if you have thousands of rows that are updated frequently, this could easily crowd your cache. In such cases, it is much more efficient to rely on polling, where with one database query you can fetch hundreds or thousands of rows that have changed and then invalidate corresponding cached items. Of course, polling creates a slight delay in synchronization (maybe 15–30 seconds), but this is acceptable in many cases.
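
ASP.NET Cache doesn't provide polling, but the mechanism is easy to sketch. Assume a hypothetical ChangeLog table (ChangeId identity, CacheKey) that database triggers populate; the poller keeps a high-water mark and invalidates only the items that changed since its last pass. The caller must hold on to the returned timer so it is not garbage collected:

using System;
using System.Data.SqlClient;
using System.Threading;
using System.Web.Caching;
...

public Timer StartPollingSync(Cache cache, string connString)
{
    long lastSeenId = 0; // high-water mark into the change log

    // Poll every 15 seconds; the slight delay is the price of
    // avoiding one .NET event per changed row.
    return new Timer(delegate(object state)
    {
        using (SqlConnection conn = new SqlConnection(connString))
        {
            conn.Open();
            SqlCommand cmd = new SqlCommand(
                "SELECT ChangeId, CacheKey FROM ChangeLog " +
                "WHERE ChangeId > @last ORDER BY ChangeId", conn);
            cmd.Parameters.AddWithValue("@last", lastSeenId);
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    lastSeenId = reader.GetInt64(0);
                    cache.Remove(reader.GetString(1)); // invalidate item
                }
            }
        }
    }, null, TimeSpan.Zero, TimeSpan.FromSeconds(15));
}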

Read-Through

In a nutshell, read-through is a feature that allows your cache to directly read data from your data source, whatever that may be. You write a read-through handler and register it with your cache, and then your cache calls this handler at appropriate times. Figure 4 shows an example.

Figure 4 Example of a Read-Through Handler for SQL Server

using System.Collections;
using System.Data.SqlClient;
using System.Web.Caching;
using Company.MyDistributedCache;
...

public class SqlReadThruProvider : IReadThruProvider
{
    private SqlConnection _connection;

    // Called upon startup to initialize connection
    public void Start(IDictionary parameters)
    {
        _connection = new SqlConnection((string)parameters["connstring"]);
        _connection.Open();
    }

    // Called at the end to close connection
    public void Stop()
    {
        _connection.Close();
    }

    // Responsible for loading object from external data source
    public object Load(string key, ref CacheDependency dep)
    {
        string sql = "SELECT * FROM Customers WHERE ";
        sql += "CustomerID = @ID";
        SqlCommand cmd = new SqlCommand(sql, _connection);
        cmd.Parameters.Add("@ID", System.Data.SqlDbType.VarChar);

        // Let's extract actual customerID from "key"
        int keyFormatLen = "Customers:CustomerID:".Length;
        string custId = key.Substring(keyFormatLen,
            key.Length - keyFormatLen);
        cmd.Parameters["@ID"].Value = custId;

        // Specify a SqlCacheDependency for this object; the dependency
        // must be attached before the command executes
        dep = new SqlCacheDependency(cmd);

        // Fetch the row in the table
        SqlDataReader reader = cmd.ExecuteReader();

        // Copy data from "reader" to "cust" object
        Customers cust = new Customers();
        FillCustomers(reader, cust);
        reader.Close();

        return cust;
    }
}

Because a distributed cache usually lives outside your application, it is shared across multiple instances of your application or even multiple applications. One important capability of a read-through handler is that the data you cache is fetched by the cache directly from the database. Hence, your applications don't have to have database code. They can just fetch data from the cache, and if the cache doesn't have it, the cache goes and takes it from the database.

You gain even more important benefits if you combine read-through capabilities with expirations. Whenever an item in the cache expires, the cache automatically reloads it by calling your read-through handler. You save a lot of traffic to the database with this mechanism. The cache uses only one thread, one database trip, to reload that data from the database, whereas you might have thousands of users trying to access that same data. If you did not have read-through capability, all those users would be going directly to the database, inundating the database with thousands of parallel requests.

Read-through allows you to establish an enterprise-level data grid, meaning a data store that not only is highly scalable, but can also refresh itself from master data sources. This provides your applications with an alternate data source from which to read data and relieves a lot of pressure on your databases.

As mentioned earlier, databases are always the bottleneck in high-transaction environments, and they become bottlenecks due mostly to excessive read operations, which also slow down write operations. Having a cache that serves as an enterprise-level data grid above your database gives your applications a major performance and scalability boost.

However, keep in mind that read-through is not a substitute for performing some complex joined queries in the database. A typical cache does not let you do these types of queries. A read-through capability works well for individual object read operations but not in operations involving complex joined queries, which you always need to perform on the database.

Write Through, Write Behind

Write-through is just like read-through: you provide a handler, and the cache calls the handler, which writes the data to the database whenever you update the cache. One major benefit is that your application doesn't have to write directly to the database because the cache does it for you. This simplifies your application code because the cache, rather than your application, has the data access code.

Normally, your application issues an update to the cache (for example, Add, Insert, or Remove). The cache updates itself first and then issues an update call to the database through your write-through handler. Your application waits until both the cache and the database are updated.

What if you want to wait for the cache to be updated, but you don't want to wait for the database to be updated because that slows down your application's performance? That's where write-behind comes in, which uses the same write-through handler but updates the cache synchronously and the database asynchronously. This means that your application waits for the cache to be updated, but you don't wait for the database to be updated.

You know that the database update is queued up and that the database is updated fairly quickly by the cache. This is another way to improve your application performance. You have to write to the database anyway, but why wait? If the cache has the data, you don't even suffer the consequences of other instances of your application not finding the data in the database because you just updated the cache, and the other instances of your application will find the data in the cache and won't need to go to the database.
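
The Company.MyDistributedCache namespace in this article is hypothetical, and so is the IWriteThruProvider interface below; it simply mirrors the read-through provider of Figure 4. The cache would call Save synchronously for write-through, or from a background queue for write-behind:

using System.Collections;
using System.Data.SqlClient;
using Company.MyDistributedCache;
...

public class SqlWriteThruProvider : IWriteThruProvider
{
    private SqlConnection _connection;

    // Called upon startup to initialize connection
    public void Start(IDictionary parameters)
    {
        _connection = new SqlConnection((string)parameters["connstring"]);
        _connection.Open();
    }

    // Called at the end to close connection
    public void Stop()
    {
        _connection.Close();
    }

    // Persists the updated object to the master data source
    public void Save(string key, object value)
    {
        Customers cust = (Customers)value;
        SqlCommand cmd = new SqlCommand(
            "UPDATE Customers SET City = @City WHERE CustomerID = @ID",
            _connection);
        cmd.Parameters.AddWithValue("@ID", cust.CustomerID);
        cmd.Parameters.AddWithValue("@City", cust.City);
        cmd.ExecuteNonQuery();
    }
}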

Cache Query

Normally, your application finds objects in the cache based on a key, just like a hash table, as you've seen in the source code examples above. You have the key, and the value is your object. But sometimes you need to search for objects based on attributes other than the key. Therefore, your cache needs to provide the capability for you to search or query the cache.

There are a couple of ways you can do this. One is to search on the attributes of the object. The other involves situations in which you've assigned arbitrary tags to cached objects and want to search based on the tags. Attribute-based searching is currently available only in some commercial solutions through object query languages, but tag-based searching is available in commercial caches and in Microsoft Velocity.

Let's say you've saved a customer object. You could say, "Give me all the customers where the city is San Francisco," when you want only customer objects, even though your cache has employees, customers, orders, order items, and more. When you issue a SQL-like query such as the one shown in Figure 5, it finds the objects that match your criteria.

Figure 5 Using a LINQ Query to Search Items in the Cache

using System.Collections.Generic;
using System.Linq;
using Company.MyDistributedCache;
...

public List<Customers> FindCustomersByCity(Cache cache, string city)
{
    // Search cache with a LINQ query
    var custs = from cust in cache.Customers
                where cust.City == city
                select cust;
    return custs.ToList();
}

Tagging lets you attach multiple arbitrary tags to a specific object, and the same tag can be associated with multiple objects. Tags are usually string-based, and tagging also allows you to categorize objects into groups and then find the objects later through these tags or groups.
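
As a sketch of what a tag-based API might look like (the Insert overload and GetByTag method here are hypothetical, not taken from any particular product):

using System.Collections.Generic;
using Company.MyDistributedCache;
...

public void TagCustomer(Cache cache, Customers cust)
{
    // Attach arbitrary string tags to the object at insert time.
    cache.Insert("Customers:CustomerID:" + cust.CustomerID, cust,
        new string[] { "Customer", "City:" + cust.City });
}

public ICollection<object> FindByCity(Cache cache, string city)
{
    // Later, fetch every object carrying a given tag.
    return cache.GetByTag("City:" + city);
}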

Event Propagation

You might not always need event propagation in your cache, but it is an important feature that you should know about. It's a good feature to have if you have distributed applications, HPC applications, or multiple applications sharing data through a cache. With event propagation, the cache fires events when certain things happen in it, and your applications can capture these events and take appropriate actions in response.

Say your application has fetched some object from the cache and is displaying it to the user. You might be interested to know if anybody updates or removes this object from the cache while it is displayed. In this case, your application will be notified, and you can update the user interface.

This is, of course, a very simple example. In other cases, you might have a distributed application where some instances of your application are producing data and other instances need to consume it. The producers can inform the consumers when data is ready by firing an event through the cache that the consumers receive. There are many examples of this type, where collaboration or data sharing through the cache can be achieved through event propagation.
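
Product APIs for event propagation differ, but the ASP.NET Cache's removal callback hints at the pattern. In this sketch, RefreshUserInterface is a hypothetical helper that reacts to the notification:

using System.Web.Caching;
...

public void InsertWithNotification(string key, object value)
{
    Cache.Insert(key, value, null,
        Cache.NoAbsoluteExpiration, Cache.NoSlidingExpiration,
        CacheItemPriority.Default,
        delegate(string k, object v, CacheItemRemovedReason reason)
        {
            // Fires when the item is removed, replaced, or expired.
            // A distributed cache would raise a similar event on every
            // client interested in this key.
            RefreshUserInterface(k, reason); // hypothetical helper
        });
}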

Cache Performance and Scalability

When considering the caching features discussed in the previous sections, you must not forget the main reasons you're thinking of using a distributed cache: to improve performance and, more important, to improve the scalability of your application. Also, because your cache runs in a production environment as a server, it must also provide high availability.

Scalability is the fundamental problem a distributed cache addresses. A scalable cache is one that can maintain performance even when you increase the transaction load on it. So, if you have an ASP.NET application in a Web farm and you grow your Web farm from five Web servers to 15 or even 50 Web servers, you should be able to grow the number of cache servers proportionately and keep the same response time. This is something you cannot do with a database.

A distributed cache avoids the scalability problems that a database usually faces because it is much simpler in nature than a DBMS and also because it uses different storage mechanisms (also known as caching topologies) than a DBMS. These include replicated, partitioned, and client cache topologies.

In most distributed cache situations, you have two or more cache servers hosting the cache. I'll use the term "cache cluster" to indicate two or more cache servers joined together to form one logical cache. A replicated cache copies the entire cache on each cache server in the cache cluster. This means that a replicated cache provides high availability. If any one cache server goes down, you don't lose any data in the cache because another copy is immediately available to the application. It's also an extremely efficient topology and provides great scalability if your application needs to do a lot of read-intensive operations. As you add more cache servers, you add that much more read-transaction capacity to your cache cluster. But a replicated cache is not the ideal topology for write-intensive operations. If you are updating the cache as frequently as you are reading it, don't use the replicated topology.

A partitioned cache breaks up the cache into partitions and then stores one partition on each cache server in the cluster. This topology is the most scalable for transactional data caching (when writes to the cache are as frequent as reads). As you add more cache servers to the cluster, you increase not only the transaction capacity but also the storage capacity of the cache, since all those partitions together form the entire cache.
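
At the heart of a partitioned topology is a deterministic key-to-partition mapping that every client computes identically. Here is a minimal sketch using an FNV-1a hash; real products often use consistent hashing so that adding a server moves only a fraction of the keys:

using System;
...

public static int GetPartition(string key, int partitionCount)
{
    // Every client must map a key to the same partition, so use a
    // well-defined hash rather than String.GetHashCode, whose value
    // is not guaranteed to be identical across processes.
    uint hash = 2166136261u;              // FNV-1a offset basis
    foreach (char c in key)
        hash = (hash ^ c) * 16777619u;    // FNV-1a prime
    return (int)(hash % (uint)partitionCount);
}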

Many distributed caches provide a variant of a partitioned cache for high availability, where each partition is also replicated so that one cache server contains a partition and a copy or a backup of another server's partition. This way, you don't lose any data if any one server goes down. Some caching solutions allow you to create more than one copy of each partition for added reliability.

Another very powerful caching topology is client cache (also called near cache), which is very useful if your cache resides in a remote dedicated caching tier. The idea behind a client cache is that each client keeps a working set of the cache close by (even within the application's process) on the client machine. However, just as a distributed cache has to be synchronized with the database through different means (as discussed earlier), a client cache needs to be synchronized with the distributed cache. Some commercial caching solutions provide this synchronization mechanism, but most provide only a stand-alone client cache without any synchronization.

In the same way that a distributed cache reduces traffic to the database, a client cache reduces traffic to the distributed cache. It is not only faster than the distributed cache because it is closer to the application (and can also be InProc), it also improves the scalability of the distributed cache by reducing trips to the distributed cache. Of course, a client cache is a good approach only when you are performing many more reads than writes. If the number of reads and writes are equal, don't use a client cache. Writes will become slower because you now have to update both the client cache and the distributed cache.
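
In sketch form, the read path of a client cache checks the local working set before making a network trip. The distributed Cache API here is the article's hypothetical one, and locking and the event-based synchronization discussed above are omitted for brevity:

using System.Collections.Generic;
using Company.MyDistributedCache;
...

public class NearCache
{
    private readonly Dictionary<string, object> _local =
        new Dictionary<string, object>();  // InProc working set
    private readonly Cache _distributed;   // remote caching tier

    public NearCache(Cache distributed) { _distributed = distributed; }

    public object Get(string key)
    {
        object value;
        // Cheapest path: the local working set, no network trip.
        if (_local.TryGetValue(key, out value))
            return value;

        // Otherwise go to the distributed cache and keep a local
        // copy; a real client cache also registers for invalidation
        // events so this copy stays synchronized.
        value = _distributed.Get(key);
        if (value != null)
            _local[key] = value;
        return value;
    }
}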

High Availability

Because a distributed cache runs as a server in your production environment, and in many cases serves as the only data store for your application (for example, ASP.NET session state), the cache must provide high availability. This means that your cache must be very stable, so that it never crashes, and must allow you to make configuration changes without stopping the cache.

Most users of a distributed cache require the cache to run without any interruptions for months at a time. Whenever they have to stop the cache, it is usually during a scheduled down time. That is why high availability is so critical for a distributed cache. Here are a few questions to keep in mind when evaluating whether a caching solution provides high availability.

  • Can you bring one of the cache servers down without stopping the entire cache?
  • Can you add a new cache server without stopping the cache?
  • Can you add new clients without stopping the cache?

In most caches, you specify a maximum cache size so that the amount of cached data doesn't exceed a set limit. The cache size is based on how much memory you have available on that system. Can you change that capacity at run time? Let's say you initially set the cache size to be 1GB but now want to make it 2GB. Can you do that without stopping the cache?

Those are the types of questions you want to consider. How many of these configuration changes really require the cache to be restarted? The fewer, the better. Other than the caching features, the first criterion for a cache that can run in a production environment is how much uptime the cache is going to give you.

Performance

Simply put, if accessing the cache is not faster than accessing your database, there is no need to have it. Having said that, what should you expect in terms of performance from a good distributed cache?

The first thing to remember is that a distributed cache is usually OutProc or remote, so access time will never be as fast as that of a stand-alone InProc cache (for example, ASP.NET Cache). In an InProc stand-alone cache, you can probably read 150,000 to 200,000 items per second (1KB object size). With an OutProc or a remote cache, this number drops significantly. In terms of performance, you should expect about 20,000 to 30,000 reads per second (1KB object size) as the throughput of an individual cache server (from all clients hitting it). You can recover some of this InProc performance by using a client cache (in InProc mode), but that helps only read operations, not write operations. You sacrifice some performance in order to gain scalability, but the slower performance is still much faster than database access.

Gaining Popularity

A distributed cache as a concept and as a best practice is gaining more popularity. Only a few years ago, very few people in the .NET space knew about it, although the Java community has been ahead of .NET in this area. With the explosive growth in application transactions, databases are stressed beyond their limits, and distributed caching is now accepted as a vital part of any scalable application architecture.

Iqbal Khan is president and technology evangelist at Alachisoft. Alachisoft provides NCache, an industry-leading .NET distributed cache for boosting performance and scalability in enterprise applications. Iqbal has an MS in computer science from Indiana University, Bloomington. You can reach him at iqbal@alachisoft.com.