Windows Server AppFabric Monitoring – Purging of AppFabric monitoring data, don’t get caught
Recently, while working on a customer performance lab with Windows Server AppFabric, I came across an interesting behavior after purging the AppFabric monitoring data that I thought was worth sharing.
In our previous post on AppFabric Monitoring, we briefly described how tracking data gets from ETW to the ASWcfEventsTable and ASWfEventsTable tables located in the AppFabric monitoring database. These two tables, along with a few other supporting look-up (or reference) tables, represent the source of the aggregated information displayed on the AppFabric Dashboard. One of these supporting tables is ASEventSourcesTable. Let's have a look at what data is stored in each of these tables:
- ASEventSourcesTable – the name of the table and its fields (Computer, Site, VirtualPath, ApplicationVirtualPath, and ServiceVirtualPath) make it easy to guess what data we can expect to find in it. Each record describes a service that had generated one or more tracking records
- ASWcfEventsTable – this table stores records that represent the actual tracked events captured from the execution of various WCF services
- ASWfEventsTable – this table stores tracked records generated from executing workflow instances, such as workflow instantiation, activity start and completion events, etc.
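To get a feel for the shape of this data, you can inspect the row counts of the three tables directly. Here's a quick sketch (the table names come from the post above; the dbo schema is an assumption, so adjust it to match your monitoring database):

```sql
-- Run against the AppFabric monitoring database.
-- Compare the small, static source table to the growing event tables.
SELECT 'ASEventSourcesTable' AS TableName, COUNT(*) AS RowCnt FROM dbo.ASEventSourcesTable
UNION ALL
SELECT 'ASWcfEventsTable', COUNT(*) FROM dbo.ASWcfEventsTable
UNION ALL
SELECT 'ASWfEventsTable', COUNT(*) FROM dbo.ASWfEventsTable;
```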
After a period of activity with monitoring turned on, at the Health Monitoring level for example, let's look at the content of the above tables. We'll see that the number of records in the ASWcfEventsTable and ASWfEventsTable tables is rather large and growing dynamically (assuming active processing of WCF service operations and WF instances). The ASEventSourcesTable, on the other hand, holds a small record set that doesn't change dynamically – this record set is metadata describing the services that have generated tracking events since the database was last purged, or since it was created in the case of a fresh deployment.
If we issue a purge command using the Clear-ASMonitoringSqlDatabase PowerShell cmdlet and flush all the data from the monitoring database, all rows from the above tables will be deleted, resulting in 0 rows in each table.
Given continuous load of the WCF/WF services in the environment, the tables will be populated again – the ASEventSourcesTable with metadata describing the executing services (same as before as we haven't deployed new services to the environment), and the ASWcfEventsTable and ASWfEventsTable capturing the newly generated tracked events data.
And Soon After…
As tracking data accumulates again in the WCF and WF events tables, refreshing the AppFabric Dashboard will start taking longer and longer, and eventually it will start timing out.
Hmm, what's going on here? To answer this question, we'll take a step sideways and look at what all this means to SQL Server, especially with regard to updating statistics. In summary, SQL Server uses the following rules to "decide" when to automatically update statistics on a table:
- For a table that has up to 500 rows in it, statistics are updated after every 500 changes in it (inserts, updates, and deletes)
- For tables that have more than 500 rows, statistics are updated after 500 + 20% of the rows have changed
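If you want to check whether (and when) SQL Server last refreshed the statistics on a table, here's a quick diagnostic sketch using the standard STATS_DATE function (the dbo schema is an assumption):

```sql
-- When was each statistics object on ASEventSourcesTable last updated?
SELECT s.name AS StatName,
       STATS_DATE(s.[object_id], s.stats_id) AS LastUpdated
FROM sys.stats AS s
WHERE s.[object_id] = OBJECT_ID('dbo.ASEventSourcesTable');
```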
Let's assume that we have 10 services in our environment. After each of these services is called at least once, the ASEventSourcesTable will have that many records (10), each describing the metadata for one of our services. One month later, it's time to purge. This takes us down to 0 records in ASEventSourcesTable. By now we've accumulated a total of 20 changes in the ASEventSourcesTable – the initial 10 inserts plus the 10 deletes – well short of the 500-change threshold. No auto-update for us this time around, even though the content of the table has changed 100%.
For the purposes of this article, without going further into the SQL Server auto-update statistics logic, it's sufficient to say that at this point the AppFabric Dashboard queries will get a new and, unfortunately, less efficient query plan for joining the ASEventSourcesTable to the ASWcfEventsTable and ASWfEventsTable. With a large and quickly growing number of records in the WCF and WF events tables, executing a dashboard query may take 60 seconds or longer.
The Side Effects
The dashboard queries are taking longer, the administrator is patient (hopefully), and they can wait… eventually they'll get the data. Or will they? The AppFabric Dashboard UI (in IIS Manager) uses the monitoring connection string to run the dashboard queries. By default, the connection string does not define a Connect Timeout value, which leaves us with the SQL client provider's default of 15 seconds. So, if a dashboard query takes longer than 15 seconds (whether due to the update-statistics issue described above or a different problem), it will time out and we'll get 0's for the dashboard monitoring statistics – these are the stats in the "WCF Call History" and "WF Instance History" groups.
How Do I Fix This?
It's easy. And I'll keep it short and sweet:
To fix the slowly executing dashboard queries after a purge, simply run an UPDATE STATISTICS T-SQL statement against the ASEventSourcesTable table immediately after running the Clear-ASMonitoringSqlDatabase cmdlet – for the copy/paste fans out there:
UPDATE STATISTICS ASEventSourcesTable
(please don't get used to it though!)
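Put together, a post-purge maintenance step might look something like the sketch below (the server and database names are placeholders for your environment, and shelling out to sqlcmd.exe is just one convenient way to run the T-SQL):

```powershell
# Load the AppFabric cmdlets
Import-Module ApplicationServer

# Purge the monitoring database (placeholder server/database names)
Clear-ASMonitoringSqlDatabase -Server "MySqlServer" -Database "ApplicationServerMonitoring"

# Immediately refresh statistics on ASEventSourcesTable so the
# dashboard queries get a sensible join plan again
sqlcmd -S "MySqlServer" -d "ApplicationServerMonitoring" -Q "UPDATE STATISTICS ASEventSourcesTable"
```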
To fix the dashboard timing out on you, add a Connect Timeout=300 parameter to the monitoring connection string, at whatever web.config level you've defined it - by default, it's the root web.config located in C:\Windows\Microsoft.NET\Framework\v4.0.30319\Config for x86 systems, or C:\Windows\Microsoft.NET\Framework64\v4.0.30319\Config for the x64 platform.
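For example, the resulting entry in the root web.config might look something like this (the connection string name and the server/database values below are illustrative – keep whatever your AppFabric configuration already uses and just append the Connect Timeout parameter):

```xml
<connectionStrings>
  <add name="ApplicationServerMonitoringConnectionString"
       connectionString="Data Source=MySqlServer;Initial Catalog=ApplicationServerMonitoring;Integrated Security=True;Connect Timeout=300"
       providerName="System.Data.SqlClient" />
</connectionStrings>
```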
Note: For the observant reader, if you're wondering why the Connect Timeout would influence the execution timeout of a query (which in the ADO.NET world is controlled via the CommandTimeout property of the SqlCommand class), here's the answer – the AppFabric dashboard sets the SQL commands' timeout property to the timeout of the connection, which based on the guidance above, is now defined as part of the connection string.
Voila, now you have correct query execution plans for the dashboard at all times, even after a purge, and a more reliable dashboard that doesn't show 0's every time a query takes a bit longer than usual!