Monitoring Applications using Windows Server AppFabric

01/20/2010

This post aims to provide you with an introduction to monitoring the health and activity of your WCF- and WF-based applications with Windows Server AppFabric. More specifically, the post will outline the AppFabric tooling features that are built into IIS Manager and describe some basic strategies for using these features to monitor your applications.

Introduction

At the center of AppFabric’s monitoring tools is the Dashboard, which provides a centralized gateway to view the health of WCF and WF services deployed locally or to a server farm. It exposes real-time data for durable WF services and historic data for both WCF and WF services. The Dashboard is designed to provide a holistic summary of all positive and negative metrics on your services in a hierarchical form, starting from a high level and allowing you to drill down incrementally to an atomic level via one of the queryable enumeration pages. Consistent with other IIS Manager features, the Dashboard can be viewed from the server, site or application scope via the tree view in the Connections pane on the left-hand side of the IIS Manager UI.

Before I explain further, it is important to note that the Dashboard sources data from one or more persistence and monitoring databases. For the metrics of a particular service to be surfaced on the Dashboard, the service needs to be configured to utilize persistence (storage of persistence data in one or more Persistence databases) and/or event collection (storage of events in one or more Monitoring databases).

Dashboard Structure and Navigation Flow

The Dashboard is divided into three primary sections: Persisted WF Instances, WCF Call History, and WF Instance History. Each section provides a summary of a particular data pivot and drilling down within each section will lead you to the section’s own respective enumeration page. The first section (Persisted WF Instances) presents ‘live’ data while the subsequent sections provide historic metrics that are constrained to a particular time period. The time period can be modified via the ‘Time Period’ drop-down on the Dashboard menu with both predefined and custom options available.

Each section within the Dashboard can be collapsed or expanded. The collapsed view shows only the section’s summary bar, providing users with aggregate counts of all positive and negative metrics associated with the subject area (e.g. WCF Call History). Expanding the section will display a series of metrics that break down the aggregate counts shown on the section’s summary bar into key contributing factors/sources. For example, expanding the WF Instance History section will display a breakdown of activations and failures by the top 5 services as well as a count of the number of instance failures that have been recovered versus unrecovered. All metrics on the Dashboard are clickable, allowing you to drill down into the counts to see details on each enumerated item via each section’s respective query page.

Monitoring the health of WCF Services

This release of AppFabric supports persistence only for WF services. As such, the health of WCF services is monitored through AppFabric’s event collection capabilities. With event collection enabled, the Dashboard provides visibility into WCF calls and service exceptions via the WCF Call History section.

The summary bar of the WCF Call History section within the Dashboard is aimed at providing an aggregate count of all successfully completed calls and WCF service exceptions over a given period of time. Expanding the section provides some key breakdowns that allow you to:

1. Identify services in high demand: The first column lists the top 5 services (when applicable) with the highest number of completed calls over a given period.

2. Identify top exception-causing services: The center column lists the top 5 services (when applicable) that have encountered the highest number of WCF service exceptions over a given period.

3. Gain breakdown of key causes of WCF service exceptions: The purpose of the third column is to provide a numeric breakdown on the key causes of service exceptions: faulted calls and failed calls. It is important to note that service exceptions can also be caused by issues other than failed or faulted calls, such as service activation errors.

All metrics within the WCF Call History section can be clicked on, allowing you to drill down into the aggregate count to view an enumerated list via the Tracked Events enumeration page. Depending on the metric you selected, the Tracked Events enumeration page will display the corresponding items by running a prepopulated query.
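To make the shape of these breakdowns concrete, here is a minimal Python sketch of how a summary like the WCF Call History section could be computed from a list of tracked call records. It is purely illustrative and is not AppFabric's API or query logic: the CallEvent record, the outcome labels ("completed", "faulted", "failed") and the wcf_call_history function are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallEvent:
    """Hypothetical tracked call record (not an AppFabric type)."""
    service: str         # name of the service that handled the call
    timestamp: datetime  # when the call was recorded
    outcome: str         # assumed labels: "completed", "faulted", "failed"


def wcf_call_history(events, start, end, top_n=5):
    """Summarize calls recorded in the half-open time period [start, end)."""
    window = [e for e in events if start <= e.timestamp < end]
    completed = Counter(e.service for e in window if e.outcome == "completed")
    exceptions = Counter(e.service for e in window if e.outcome != "completed")
    causes = Counter(e.outcome for e in window if e.outcome != "completed")
    return {
        # Aggregate counts, as on the section's summary bar.
        "completed_total": sum(completed.values()),
        "exception_total": sum(exceptions.values()),
        # First column: services with the most completed calls.
        "top_completed": completed.most_common(top_n),
        # Center column: services with the most exceptions.
        "top_exceptions": exceptions.most_common(top_n),
        # Third column: exceptions broken down by cause.
        "exception_causes": dict(causes),
    }
```

Changing start and end plays the role of the Time Period drop-down: the same records produce different counts depending on the window selected.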

Monitoring the health of WF Services

The Dashboard provides varying levels of monitoring capabilities for WF services. All WF services regardless of durability can be configured to utilize AppFabric’s event collection capabilities, allowing data at varying verbosity to be collected for monitoring and troubleshooting purposes. This data is surfaced on the Dashboard via the WCF Call History and WF Instance History sections. Durable WF services can also utilize AppFabric’s persistence infrastructure which will allow the Dashboard to also provide live visibility into the health of persisted workflow instances. This feature is provided by the Dashboard’s Persisted WF Instances section.

Using historic data for Health Monitoring

Any WF-based service that is configured to utilize AppFabric’s event collection capabilities at the ‘Health Monitoring’ level or above will be visible in all historic metrics on the Dashboard. Since WF-based services also use WCF for communication, the WCF Call History section will also expose monitoring data on these services. For the purpose of this sub-topic, I will focus on the WF Instance History section, as the WCF Call History section has already been discussed.

The purpose of the WF Instance History section is to provide a historic overview of all workflow instance activations, failures and completions over a given period. These three key metrics are presented in the summary bar of the section. Expanding the section provides some key breakdowns that allow you to:

1. Identify WF services in high demand: The first column lists the top 5 services (when applicable) with the highest number of instance activations over a given period.

2. Identify WF services with most instance failures: The center column lists the top 5 services (when applicable) that have experienced the greatest number of instance failures over a given period.

3. Understand recovered versus unrecovered instances: The third column puts the aggregate failure count in context by separating the instance failures that have been recovered from those that are unrecovered and therefore potentially still actionable.

All metrics within the WF Instance History section can be clicked on, allowing you to drill down and view an enumerated list via the Tracked WF Instances page. In addition to the instance information available on the page, you can also navigate to and view all tracked events for a given instance, assuming that event collection is enabled for the parent service.

Using Persistence data for Health Monitoring

For durable WF services that are configured to utilize AppFabric’s persistence capabilities, the Dashboard provides live visibility into running and suspended persisted instances via the Persisted WF Instances section. Sourced by one or more persistence databases, the section offers an overview of what is happening with your durable workflows.

The summary bar of the Persisted WF Instances section contains a numeric breakdown of all running (Active or Idle) and suspended instances currently associated with your environment. When further context is required, expanding the section provides some key breakdowns that allow you to:

1. Identify durable WF services with highest current demand: The first column lists the top 5 services (when applicable) that currently have the largest number of active or idle instances.

2. Identify services with most suspended instances: The center column lists the top 5 services (when applicable) that currently have the largest number of suspended instances.

As with the other sections, all metrics within the Persisted WF Instances section can be clicked on, allowing you to drill down and view an enumerated list via the Persisted WF Instances page. The enumeration page not only provides details on each persisted WF instance that satisfies the query conditions, but also supports instance control operations (e.g. resuming a suspended instance). Similar to the Tracked WF Instances page, you can also navigate to and view all tracked events for a given persisted instance, assuming that event collection is enabled for the parent service.
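As a rough mental model of this section, the following Python sketch groups persisted instances into running (Active or Idle) and suspended, and models a resume operation as a simple state change. None of this is AppFabric code: InstanceState, PersistedInstance, summarize_persisted and resume are hypothetical names, and the assumption that a resumed instance becomes Active is a simplification made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class InstanceState(Enum):
    ACTIVE = "Active"
    IDLE = "Idle"
    SUSPENDED = "Suspended"


# The summary bar counts Active and Idle instances together as running.
RUNNING_STATES = {InstanceState.ACTIVE, InstanceState.IDLE}


@dataclass
class PersistedInstance:
    """Hypothetical persisted workflow instance (not an AppFabric type)."""
    instance_id: str
    service: str
    state: InstanceState


def summarize_persisted(instances, top_n=5):
    """Count running and suspended instances overall and per service."""
    running = Counter(i.service for i in instances if i.state in RUNNING_STATES)
    suspended = Counter(
        i.service for i in instances if i.state is InstanceState.SUSPENDED
    )
    return {
        "running_total": sum(running.values()),
        "suspended_total": sum(suspended.values()),
        "top_running": running.most_common(top_n),
        "top_suspended": suspended.most_common(top_n),
    }


def resume(instance):
    """Simplified control operation: only a suspended instance can be resumed."""
    if instance.state is not InstanceState.SUSPENDED:
        raise ValueError(f"instance {instance.instance_id} is not suspended")
    instance.state = InstanceState.ACTIVE  # assumption: resumed means Active
```

Because the section shows live data, resuming an instance changes the summary the next time it is computed: the suspended count drops by one and the running count rises by one.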

Summary and Additional Resources

AppFabric’s monitoring tooling is predominantly delivered via four features within IIS Manager: Dashboard, Persisted WF Instances enumeration, Tracked WF Instances enumeration and Tracked Events enumeration. Starting from the Dashboard, AppFabric’s feature set aims to surface the health of WCF and WF services and provide incremental drill-downs via queryable enumeration pages to assist in investigation and problem diagnosis.

Next week’s post will focus in more detail on using AppFabric tools to troubleshoot applications. For more information on AppFabric monitoring and troubleshooting tools in general, see the endpoint.tv episode, which includes a demonstration.
