Best Practices for Using Windows Azure Cache/Windows Server AppFabric Cache
We frequently get asked about best practices for using Windows Azure Cache and Windows Server AppFabric Cache. The compilation below is an initial list of those best practices; I'll publish updates in the future if needed.
I've broken the best practices down by topic.
Using Cache APIs
1. Have retry wrappers around cache API calls
Calls into the cache client can occasionally fail for a number of reasons, such as transient network errors, cache servers being unavailable due to maintenance or upgrades, or cache servers running low on memory. In these cases the cache client raises a DataCacheException with an error code that indicates the reason for the failure. There is a good overview on MSDN of how an application should handle these exceptions.
It is a good practice for the application to implement a retry policy. You can implement a custom policy or use a framework such as the Transient Fault Handling Application Block.
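To make the pattern concrete, here is a minimal retry-with-backoff sketch. The actual cache APIs are .NET, so this is written as a language-neutral illustration in Python; `TransientCacheError` stands in for a transient DataCacheException, and the attempt counts and delays are illustrative, not recommendations.

```python
import random
import time

class TransientCacheError(Exception):
    """Stand-in for a transient DataCacheException (hypothetical)."""

def with_retries(operation, max_attempts=3, base_delay=0.01):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientCacheError:
            if attempt == max_attempts:
                raise  # out of attempts; surface the error to the caller
            # back off exponentially, with jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))

# usage sketch: a flaky operation that succeeds on the third try
attempts = []
def flaky_get():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientCacheError("server temporarily unavailable")
    return "cached-value"

print(with_retries(flaky_get))  # prints "cached-value" after two retries
```

In a real application you would retry only on error codes the MSDN guidance identifies as transient, and let non-transient failures propagate immediately.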
2. Keep static instances of DataCache/DataCacheFactory
Instances of DataCacheFactory (and hence, indirectly, DataCache instances) maintain TCP connections to the cache servers. These objects are expensive to create and destroy. In addition, you want as few of them as possible so that the cache servers are not overwhelmed with too many connections from clients.
You can find more details on connection management here. Please note that the ability to share connections across factories is currently available only in the November 2011 release of the Windows Azure SDK (and later versions). Windows Server AppFabric 1.1 does not have this capability yet.
The overhead of creating new factory instances is lower if connection pooling is enabled. In general, though, it is a good practice to pre-create an instance of DataCacheFactory/DataCache and use it for all subsequent API calls. Avoid creating a DataCacheFactory/DataCache instance on each of your request processing paths.
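The "create once, reuse everywhere" advice is essentially a thread-safe lazy singleton. Here is a minimal sketch of that pattern in Python; `CacheFactory` is a hypothetical stand-in for DataCacheFactory, with a counter simulating the costly connection setup.

```python
import threading

class CacheFactory:
    """Stand-in for DataCacheFactory: expensive to build (opens connections)."""
    instances_created = 0
    def __init__(self):
        CacheFactory.instances_created += 1  # simulate the costly setup
    def get_cache(self, name):
        return f"cache:{name}"

_factory = None
_factory_lock = threading.Lock()

def get_factory():
    """Return the single shared factory, creating it on first use."""
    global _factory
    if _factory is None:            # fast path: already created
        with _factory_lock:         # slow path: create exactly once
            if _factory is None:    # re-check after acquiring the lock
                _factory = CacheFactory()
    return _factory

# every request path reuses the same factory instead of constructing its own
caches = [get_factory().get_cache("default") for _ in range(100)]
print(CacheFactory.instances_created)  # prints 1
```

In .NET the same effect is usually achieved with a static field initialized once (or a Lazy&lt;T&gt;), rather than constructing a factory per request.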
3. WCF Services using Cache Client
It is common for WCF services to use a cache to improve performance. However, unlike ASP.NET web applications, WCF services are susceptible to IO-thread starvation when making blocking calls (such as cache API calls) that themselves require IO threads to receive responses (such as responses from cache servers).
This issue is described in detail in the following KB article. The typical symptom is that cache API calls time out when you get a sudden burst of load. You can confirm whether you are running into this situation by plotting thread count against incoming requests/second, as shown in the KB article.
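The usual mitigation described in this class of KB article is to raise the thread-pool minimums so the pool does not have to ramp up slowly during a burst. A configuration sketch is below; the element lives in machine.config, `autoConfig` must be disabled for the minimums to take effect, and the values shown are purely illustrative, not recommendations:

```xml
<!-- machine.config sketch: raise thread-pool minimums so a burst of
     blocking cache calls does not starve the IO threads that receive
     the cache servers' responses. Values are illustrative only. -->
<processModel autoConfig="false"
              minIoThreads="100"
              minWorkerThreads="100" />
```

Verify the exact attribute names and recommended values against the KB article for your framework version before applying this.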
4. If your app uses the lock APIs, handle ObjectLocked and ObjectNotLocked exceptions
GetAndLock can fail with the error "<ERRCA0011>:SubStatus<ES0001>:Object being referred to is currently locked, and cannot be accessed until it is unlocked by the locking client. Please retry later." if another caller has acquired a lock on the object.
The code should handle this error and implement an appropriate retry policy.
PutAndUnlock can fail with the error "<ERRCA0012>:SubStatus<ES0001>:Object being referred to is not locked by any client".
This typically means that the lock timeout specified when the lock was acquired was not long enough: the application request took longer than that to process, so the lock expired before the call to PutAndUnlock and the cache server returns this error code.
The typical fix is to review your request processing time and to set a higher lock timeout when acquiring the lock.
You can also run into this error when using the session state provider for cache. In that case, the typical solution is to set a higher executionTimeout for your web app.
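The two lock errors above can be modeled with a toy in-memory cache. The sketch below is Python (the real APIs are .NET); `ToyLockingCache`, `ObjectLockedError`, and `ObjectNotLockedError` are hypothetical stand-ins for the GetAndLock/PutAndUnlock semantics and the ERRCA0011/ERRCA0012 errors, and the retry loop shows the recommended handling of ObjectLocked.

```python
import time

class ObjectLockedError(Exception):
    """Stand-in for ERRCA0011: another client holds the lock."""

class ObjectNotLockedError(Exception):
    """Stand-in for ERRCA0012: object is not locked by any client."""

class ToyLockingCache:
    """Minimal in-memory model of GetAndLock/PutAndUnlock semantics."""
    def __init__(self):
        self._data, self._locks = {}, {}
    def get_and_lock(self, key, owner):
        if key in self._locks and self._locks[key] != owner:
            raise ObjectLockedError(key)
        self._locks[key] = owner
        return self._data.get(key)
    def put_and_unlock(self, key, value, owner):
        if key not in self._locks:
            raise ObjectNotLockedError(key)  # e.g. the lock already expired
        self._data[key] = value
        del self._locks[key]

def get_and_lock_with_retry(cache, key, owner, attempts=5, delay=0.01):
    """Retry GetAndLock until the competing lock is released."""
    for i in range(attempts):
        try:
            return cache.get_and_lock(key, owner)
        except ObjectLockedError:
            if i == attempts - 1:
                raise
            time.sleep(delay)  # wait for the other client to unlock

cache = ToyLockingCache()
cache.get_and_lock("order:1", owner="A")          # client A takes the lock
cache.put_and_unlock("order:1", {"qty": 2}, "A")  # ...updates and releases it
print(get_and_lock_with_retry(cache, "order:1", owner="B"))  # prints {'qty': 2}
```

Note the toy model omits lock timeouts entirely; in the real API the ObjectNotLocked case typically arises because the timeout chosen at GetAndLock time elapsed before PutAndUnlock was called.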
Session State Provider Usage
The session state provider has an option to store the entire session as one blob (useBlobMode="true", which is the default), or to store the session as individual key/value pairs.
useBlobMode="true" incurs fewer round trips to the cache servers and works well for most applications.
If you have a mix of small and large objects in the session, useBlobMode="false" (a.k.a. granular mode) might work better, since it avoids fetching the entire (large) session object on every request. The cache should also be marked as a non-evictable cache when useBlobMode="false" is used. Because Windows Azure Shared Cache does not give you the ability to mark a cache as non-evictable, please note that useBlobMode="true" is the only supported option against Windows Azure Shared Cache.
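As a sketch, the blob-vs-granular choice is made in the web.config session state registration. The provider name, cache name, and type below are illustrative (the exact type name varies by product/SDK version); the key point is the useBlobMode attribute:

```xml
<!-- web.config sketch: session state backed by the cache, granular mode.
     Names/type are illustrative; cache must be non-evictable for this mode. -->
<sessionState mode="Custom" customProvider="CacheSessionStoreProvider">
  <providers>
    <add name="CacheSessionStoreProvider"
         type="Microsoft.ApplicationServer.Caching.DataCacheSessionStoreProvider"
         cacheName="NonEvictableSessionCache"
         useBlobMode="false" />
  </providers>
</sessionState>
```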
Performance Tuning and Monitoring
Connection management between cache clients and servers is described in more detail here. Consider tuning the MaxConnectionsToServer setting, which controls the number of connections from a client to the cache servers. (MaxConnectionsToServer * number of DataCacheFactory instances * number of application processes) is a rough estimate of the number of connections that will be opened to each cache server. So if you have 2 instances of your web role, each with 1 cache factory and MaxConnectionsToServer set to 3, there will be 3 * 1 * 2 = 6 connections opened to each cache server.
Setting this to the number of cores on the application machine is a good starting point. If you set it too high, a large number of connections can be opened to each cache server, which can hurt throughput.
If you are using Azure Cache SDK 1.7, MaxConnectionsToServer defaults to the number of cores on the application machine. On-premises AppFabric Cache (v1.0/v1.1) has a default of one, so that value may need to be tuned.
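The estimate above can be captured in a one-line helper; this Python sketch just encodes the article's arithmetic so you can plug in your own deployment numbers.

```python
def connections_per_cache_server(max_connections_to_server,
                                 factories_per_process,
                                 process_count):
    """Rough estimate of connections opened to EACH cache server:
    MaxConnectionsToServer * factory instances * application processes."""
    return max_connections_to_server * factories_per_process * process_count

# the article's example: 2 web role instances, 1 factory each, setting of 3
print(connections_per_cache_server(3, 1, 2))  # prints 6
```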
Adjust Security Settings
By default, on-premises AppFabric Cache runs with security on, at the EncryptAndSign protection level. If you are running in a trusted environment and don't need this protection, you can turn it off by explicitly setting security to off.
The security model for Azure Cache is different, and the above adjustment is not needed there.
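On the client side, turning security off is a sketch like the following in the dataCacheClient section of the application's config (the host name is a placeholder; the cluster itself must also be configured to allow it):

```xml
<!-- app/web.config sketch: disable transport security for a trusted
     environment. Host/port values are placeholders. -->
<dataCacheClient>
  <securityProperties mode="None" protectionLevel="None" />
  <hosts>
    <host name="CacheServer1" cachePort="22233" />
  </hosts>
</dataCacheClient>
```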
There is also a good set of performance counters on the cache servers that you can monitor to get a better understanding of cache performance issues. Some of the counters that are typically useful to troubleshoot issues include:
1) % CPU used by the cache service
2) % time spent in GC by the cache service
3) Total cache misses/sec – A high value here can indicate that your application's performance might suffer because it is not able to fetch data from the cache. Possible causes include eviction and/or expiry of items from the cache.
4) Total object count – Gives an idea of how many items are in the cache. A big drop in object count could mean eviction or expiry is taking place.
5) Total client reqs/sec – Useful for gauging how much load the application is generating on the cache servers. A low value here usually means there is a bottleneck outside of the cache servers (perhaps in the application or the network), so very little load is being placed on them.
6) Total evicted objects – If the cache servers are constantly evicting items to make room for newer objects, it is usually a good indication that you need more memory on the cache servers to hold your application's dataset.
7) Total failure exceptions/sec and total retry exceptions/sec
Lead host vs Offloading
This applies only to on-premises AppFabric Cache deployments. There is a good discussion of the tradeoffs and options in this blog. As noted there, with v1.1 you can use SQL Server just to store configuration information and use the lead-host model for the cluster runtime. This option is attractive if setting up a highly available SQL Server for offloading purposes is difficult.
Here is a set of blogs/articles that provide more information on some of the topics covered above.
1) Jason Roth and Jaime Alva have written an article providing additional guidance for developers using Windows Azure Cache.
2) Jaime Alva's blog post on logging/counters for on-premises AppFabric Cache.
3) MSDN article about connection management between cache clients and servers.
4) Amit Yadav and Kalyan Chakravarthy's blog on lead-host vs. offloading options for cache clusters.