Health Monitoring Tools (AppFabric 1.1 Caching)

Article
10/26/2012

This section describes the various tools and commands available for monitoring the health of a Microsoft AppFabric 1.1 for Windows Server cache cluster. These tools include the following.

Performance Monitor
Event Tracing for Windows (ETW)
System Center Operations Manager
Windows PowerShell

Performance Monitor

The AppFabric Caching features install several Performance Monitor counters. For more information about the available counters, see Performance Counters for AppFabric Caching. You can observe or log counter values to establish a baseline of typical cache cluster behavior. For example, in the AppFabric Caching:Cache category, you might observe that the Total Client Requests / sec value stays within general ranges that vary with the time of day. You can use this baseline to identify a trend of increasing client requests to the cache cluster that might necessitate adding cache hosts.
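As a rough sketch of how a baseline could be collected from Windows PowerShell, the following example summarizes a set of counter samples. The Get-BaselineSummary function is a hypothetical helper written for this illustration, and the counter path is an assumption based on the category and counter names mentioned above. Confirm the exact path on a cache host before relying on it.

```powershell
# Summarize counter samples into a simple baseline (minimum, maximum, average).
# Get-BaselineSummary is a hypothetical helper, not part of AppFabric.
function Get-BaselineSummary {
    param([double[]]$Values)
    $stats = $Values | Measure-Object -Minimum -Maximum -Average
    New-Object PSObject -Property @{
        Minimum = $stats.Minimum
        Maximum = $stats.Maximum
        Average = $stats.Average
    }
}

# Assumed counter path; list the real paths with:
#   Get-Counter -ListSet 'AppFabric Caching:*' | Select-Object -ExpandProperty Paths
$counterPath = '\AppFabric Caching:Cache(*)\Total Client Requests /sec'

# Example usage on a cache host (12 samples, 5 seconds apart).
# The (*) wildcard returns one sample per counter instance, so the
# summary below mixes all instances together.
# $samples = Get-Counter -Counter $counterPath -SampleInterval 5 -MaxSamples 12
# $values  = $samples | ForEach-Object { $_.CounterSamples } |
#                ForEach-Object { $_.CookedValue }
# Get-BaselineSummary -Values $values
```

Collecting the same summary at different times of day gives a set of ranges to compare later measurements against.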

For more general information about using Performance Monitor, see Using Performance Monitor.

Event Tracing for Windows (ETW)

The AppFabric Caching features use Event Tracing for Windows (ETW) to provide status and error information related to the cache cluster. You can use the Event Viewer to examine the ETW logs for AppFabric Caching features.

Open the Event Viewer on a cache host. For instructions on how to launch the Event Viewer, see Start Event Viewer.
In the left navigation pane, expand the Applications and Services Logs folder.
Then expand Microsoft, Windows, and Application Server-System Services.
Select the Admin log.

The Admin log contains informational updates, such as when the AppFabric Caching Service starts or stops. It also contains warnings and errors. Note that these logs can contain events from other AppFabric features, such as hosting and monitoring. You can choose to filter the log to just the Microsoft-Windows Server AppFabric Caching source to focus on events related to the AppFabric Caching features.

The Application Server-System Services folder also contains an Operational log. By default, this log is disabled. To enable it, right-click the Operational log in the navigation pane, and then click Enable Log. The Operational log contains other events, such as low memory conditions.
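The Operational log can probably also be enabled from an elevated command prompt with the wevtutil tool that ships with Windows. This is a minimal sketch. The channel name below is an assumption derived from the folder names in Event Viewer, so list the actual channel names first and adjust as needed.

```powershell
# Assumed channel name; list the actual names with:
#   wevtutil el | Select-String 'Application Server-System Services'
$logName = 'Microsoft-Windows-Application Server-System Services/Operational'

# Enable the log (requires an elevated prompt), then display its settings
# to confirm that it is now enabled.
wevtutil sl $logName /e:true
wevtutil gl $logName
```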

When evaluating the health of the cache cluster, it is important to examine the event logs on each of the cache hosts that belong to the cluster. A problem with one cache host can have a negative effect on the entire cache cluster.
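Checking every cache host by hand can be tedious, so the following sketch gathers recent warnings and errors from the Admin log on several hosts with the Get-WinEvent cmdlet. The New-CacheEventFilter function and the host names are hypothetical, and the channel name is an assumption based on the Event Viewer folder names. Remote queries also depend on firewall and permission settings that are not covered here.

```powershell
# Build a Get-WinEvent filter for critical, error, and warning events
# in the given log over the last N hours.
# New-CacheEventFilter is a hypothetical helper, not part of AppFabric.
function New-CacheEventFilter {
    param([string]$LogName, [int]$Hours = 24)
    @{
        LogName   = $LogName
        Level     = 1, 2, 3          # 1 = Critical, 2 = Error, 3 = Warning
        StartTime = (Get-Date).AddHours(-$Hours)
    }
}

# Assumed channel name; verify with:
#   Get-WinEvent -ListLog '*Application Server*'
$logName = 'Microsoft-Windows-Application Server-System Services/Admin'

# Example usage (placeholder host names):
# $filter = New-CacheEventFilter -LogName $logName -Hours 24
# foreach ($cacheHost in 'CacheServer1', 'CacheServer2', 'CacheServer3') {
#     # SilentlyContinue suppresses the error raised when no events match,
#     # but it also hides connection failures.
#     Get-WinEvent -ComputerName $cacheHost -FilterHashtable $filter -ErrorAction SilentlyContinue |
#         Select-Object MachineName, TimeCreated, LevelDisplayName, ProviderName, Message
# }
```

The ProviderName column in the output makes it easier to separate caching events from events raised by other AppFabric features.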

Event Viewer is useful for regularly monitoring the health of the cache cluster. However, when troubleshooting an error, you can capture a more detailed log of cache cluster activity with the tracelog.exe tool, which creates a detailed ETL trace log from the command line. You can download the tracelog utility as part of the Windows Software Development Kit. The following command begins logging to the cachedebugtrace.etl file:

tracelog -start debugtrace -f cachedebugtrace.etl -guid "C:\Program Files\Windows Server AppFabric\Manifests\ProviderGUID.txt" -level 5 -cir 512

The following command stops the logging:

tracelog -stop debugtrace

The following command converts the cachedebugtrace.etl log file into a CSV text file named cachedebugtrace.csv:

tracerpt .\cachedebugtrace.etl -o cachedebugtrace.csv -of CSV

Note

Although the tracerpt tool enables you to view the contents of the log file generated by tracelog, you may need to work with Microsoft support to fully interpret the information.

System Center Operations Manager

You can use System Center Operations Manager to monitor the health of the AppFabric cache cluster. For more information, see Windows Server AppFabric Management Pack for Operations Manager 2007.

Windows PowerShell

There are several Windows PowerShell commands that indicate the current status and health of a cache cluster. This section demonstrates how to use the following commands.

Get-CacheHost
Get-CacheClusterHealth
Get-CacheStatistics

Note that these commands provide dynamic information based on the current state of the cache cluster. It is often useful to also look at the configuration details with the following commands: Get-CacheConfig, Get-CacheHostConfig, and Export-CacheClusterConfig. These commands are covered in the section Common Cache Cluster Management Tasks (AppFabric 1.1 Caching).

Note

For more information about how to get started with Windows PowerShell, see Common Cache Cluster Management Tasks (Windows Server AppFabric Caching). For a complete list of commands, see Using Windows PowerShell with AppFabric Caching.

Get-CacheHost

Use the Get-CacheHost command without any parameters to quickly view the status of the cache hosts in the cache cluster. Some problems occur when one or more cache hosts in a cluster are not running. For example, consider the following output from Get-CacheHost.

PS C:\> Get-CacheHost

HostName : CachePort      Service Name            Service Status Version Info
--------------------      ------------            -------------- ------------
CacheServer1:22233        AppFabricCachingService UP             1 [1,1][1,1]
CacheServer2:22233        AppFabricCachingService DOWN           1 [1,1][1,1]
CacheServer3:22233        AppFabricCachingService UP             1 [1,1][1,1]

This output shows that there are three cache hosts in the cluster: CacheServer1, CacheServer2, and CacheServer3. The Service Status column indicates that the cache cluster is running, because at least one cache host has a status of UP. However, CacheServer2 is currently stopped with a status of DOWN. This could indicate a problem with CacheServer2, or you might simply need to start the cache host with the Start-CacheHost command. The Get-CacheHost command is often the first command you should run to get a high-level overview of the state of the cache cluster.
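To pick out the stopped hosts from a larger cluster, the output of Get-CacheHost can be filtered. The following is a minimal sketch. Select-StoppedCacheHost is a hypothetical helper, and it assumes that each object returned by Get-CacheHost exposes HostName, PortNo, and Status properties. Verify those names with Get-CacheHost | Get-Member before using it.

```powershell
# Return only the cache hosts whose service status is not UP.
# Select-StoppedCacheHost is a hypothetical helper; the HostName, PortNo,
# and Status property names are assumptions about the Get-CacheHost output.
function Select-StoppedCacheHost {
    param([object[]]$CacheHosts)
    # The comparison is case-insensitive, so "UP" and "Up" both match.
    $CacheHosts | Where-Object { "$($_.Status)" -ne 'Up' }
}

# Example usage on a cache host:
# Import-Module DistributedCacheAdministration
# Use-CacheCluster
# Select-StoppedCacheHost -CacheHosts (Get-CacheHost) |
#     ForEach-Object { Start-CacheHost -HostName $_.HostName -CachePort $_.PortNo }
```

Restarting a host automatically is only appropriate when you already know why it stopped. Otherwise, review the event logs on that host first.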

Get-CacheClusterHealth

Use the Get-CacheClusterHealth command to get detailed information about the health of the cache hosts and the caches residing on those cache hosts. For example, consider the following sample output from the Get-CacheClusterHealth command.

Cluster health statistics
=========================

HostName = CacheServer1
-------------------------

    NamedCache = default
        Healthy              = 0.00
        UnderReconfiguration = 0.00
        NotPrimary           = 0.00
        NoWriteQuorum        = 0.00
        Throttled            = 25.00

    NamedCache = Cache1
        Healthy              = 0.00
        UnderReconfiguration = 0.00
        NotPrimary           = 0.00
        NoWriteQuorum        = 0.00
        Throttled            = 25.00


HostName = CacheServer2
-------------------------

    NamedCache = Cache1
        Healthy              = 25.00
        UnderReconfiguration = 0.00
        NotPrimary           = 0.00
        NoWriteQuorum        = 0.00
        Throttled            = 0.00

    NamedCache = default
        Healthy              = 25.00
        UnderReconfiguration = 0.00
        NotPrimary           = 0.00
        NoWriteQuorum        = 0.00
        Throttled            = 0.00


Unallocated named cache fractions
---------------------------------

Internally, the cache cluster uses a concept of partitions to organize and manage memory. The numbers displayed in the output of the Get-CacheClusterHealth command are the percentages of the total number of cache cluster partitions. For example, on CacheServer2 the named cache Cache1 is using 25.00 percent of the total partitions and all of those partitions are healthy. However, the specific percentages are not as important as the categories in which those percentages reside. Adding more caches or cache hosts may reduce Cache1 from 25.00 percent to 10.00 percent, but as long as that 10.00 percent is still in the Healthy category, the cache is still healthy.

In the previous example, note that CacheServer1 is showing both caches as Throttled. This is a low-memory condition on that server. For more information about how to troubleshoot this low-memory condition, see Throttling Troubleshooting (Windows Server AppFabric Caching).

The following table describes each category in the Get-CacheClusterHealth output.

Health Category	Description
`Healthy`	The cache is operating normally. This is the target state for all caches.
`UnderReconfiguration`	The cache is under reconfiguration. This is an internal state that may have several causes, but it should be temporary and resolve to healthy.
`NotPrimary`	The cache is not currently available. This can happen when secondary copies are promoted to primary. During this transition, the cache may temporarily have a state of `NotPrimary`. This state should typically resolve to healthy.
`NoWriteQuorum`	The cache is read-only, because the cache is unable to create the required number of replicas on secondary cache hosts. This occurs when the cache has the high availability option enabled (`Secondaries` = 1). In this scenario, there must be at least two running cache hosts in the cluster, one for the primary copy of the cached item and another for the secondary copy.
`Throttled`	The cache is read-only, because the cache host is in a throttled memory state. This is a low-memory condition.

The Unallocated named cache fractions section shows the percentage of cache partitions that have not yet been allocated to a specific cache host. This state normally appears when the cache cluster is started or when a cache host is started or stopped on a running cluster. It should typically resolve to healthy.
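Because the categories matter more than the exact percentages, it can help to list only the entries that are not in the Healthy category. The following sketch parses text in the layout of the sample output shown earlier. Get-UnhealthyFraction is a hypothetical helper. It assumes that the values use a period as the decimal separator, and that any entries under Unallocated named cache fractions follow the same name = value layout.

```powershell
# Parse Get-CacheClusterHealth text output and return every non-Healthy
# category whose percentage is greater than zero.
# Get-UnhealthyFraction is a hypothetical helper, not part of AppFabric.
function Get-UnhealthyFraction {
    param([string[]]$Lines)
    $hostName = $null
    $cacheName = $null
    foreach ($line in $Lines) {
        if ($line -match '^\s*HostName\s*=\s*(\S+)') {
            $hostName = $Matches[1]
        } elseif ($line -match '^\s*Unallocated named cache fractions') {
            # Entries after this heading are not tied to a cache host.
            $hostName = '(unallocated)'
            $cacheName = $null
        } elseif ($line -match '^\s*NamedCache\s*=\s*(\S+)') {
            $cacheName = $Matches[1]
        } elseif ($line -match '^\s*(\w+)\s*=\s*(\d+(\.\d+)?)\s*$') {
            $category = $Matches[1]
            $percent = [double]::Parse($Matches[2],
                [Globalization.CultureInfo]::InvariantCulture)
            if ($category -ne 'Healthy' -and $percent -gt 0) {
                New-Object PSObject -Property @{
                    HostName   = $hostName
                    NamedCache = $cacheName
                    Category   = $category
                    Percent    = $percent
                }
            }
        }
    }
}

# Example usage on a cache host (assumes Out-String renders the same
# text that the console displays):
# $text = Get-CacheClusterHealth | Out-String
# Get-UnhealthyFraction -Lines ($text -split "\r?\n")
```

For the sample output above, this approach would report the two Throttled entries on CacheServer1 and nothing for CacheServer2.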

Get-CacheStatistics

The Get-CacheStatistics Windows PowerShell command provides basic information about the contents of a cache. The following example demonstrates how to display the cache statistics for a cache named Cache1.

Get-CacheStatistics Cache1

This is sample output from the previous command.

Size         : 12408186
ItemCount    : 1200
RegionCount  : 714
RequestCount : 1200
MissCount    : 1200

The previous output shows that there are 1200 items in Cache1 with a total size of 12408186 bytes. There are 714 regions, which could be user-created or system-created. There have been 1200 requests and the same number of misses. However, it is important not to treat MissCount as a problem indicator in isolation. When the cache cluster is restarted, applications must repopulate the cache. This involves checking whether the cached item exists, which increments MissCount. Similarly, if you use the Put method to add an item that is not in the cache, it increments MissCount, but this is not an error condition. A high MissCount could indicate that items in the cache have been unexpectedly evicted or that the expiration time on cached items is too low, but these conditions cannot be confirmed with the cache statistics alone.
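One way to put MissCount in context is to express it as a fraction of RequestCount and track that value over time. The following is a small sketch that uses the property names shown in the sample output. Get-MissRatio is a hypothetical helper written for this illustration.

```powershell
# Compute the fraction of requests that were misses.
# Returns $null when no requests have been recorded yet.
# Get-MissRatio is a hypothetical helper, not part of AppFabric.
function Get-MissRatio {
    param($Statistics)
    if ($Statistics.RequestCount -eq 0) { return $null }
    [double]$Statistics.MissCount / [double]$Statistics.RequestCount
}

# Example usage on a cache host:
# Get-MissRatio -Statistics (Get-CacheStatistics Cache1)
```

For the sample statistics above the ratio is 1, which is what you would expect right after an application has populated an empty cache. A ratio that stays high long after startup is more likely to be worth investigating.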

This command can be used together with the Get-CacheConfig command. For example, if the Get-CacheStatistics command showed that Cache1 had an unexpectedly large size of 1 GB, you could examine the cache configuration with Get-CacheConfig to see the eviction and expiration settings.