No Traffic Cops Here?

Keith Pleas and I were discussing strategies for modeling SLAs in Management Models. Our goal was to enable monitoring of specific SLAs defined by the business.

In our discussion, we wanted to know the time between receiving an order from a customer via a web site and making that order available for processing to the order fulfillment company. We thought that would be easy to implement with a performance counter: we would record the time we received the order and the time the order was sent to the fulfillment service, and report the difference via a performance counter. Simple?

We quickly started brainstorming, and uncovered a whole bunch of questions.

When do we consider the order received? Is it when the web site posts the request? Is it when our web service receives the request? Or is it when the web site receives a response to the posting of the order? Are we to measure A-B, A-C, A-D, B-D, or B-C?

Our thought was that the SLA would be defined to mean exactly one of those things, but it may be that we would have to report on all measurable points and use an external tool to calculate the response time from those measures. This led us to our next thought.

Should the application report how well it is doing itself?

Put another way, should the application police itself?

It would be easy to put together some code that recorded the start and end time of the request, calculated the duration, and exposed that information via a performance counter. The counter could be called "our service response" and would show up in the standard Windows performance monitoring tools. Simple, but would users trust it enough?
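To give a feel for how little code that is, here is a minimal sketch in Python. It is not the Windows performance counter API: DurationCounter is a hypothetical in-process stand-in that only shows the "record start, record end, publish the duration" idea, and a real version would publish the value to the operating system's counters instead of keeping it in memory.

```python
import time
from contextlib import contextmanager


class DurationCounter:
    """Hypothetical in-process stand-in for a performance counter."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.total_seconds = 0.0
        self.last_seconds = 0.0

    def record(self, seconds):
        # Keep the latest sample plus a running total for the average.
        self.count += 1
        self.total_seconds += seconds
        self.last_seconds = seconds

    @property
    def average_seconds(self):
        return self.total_seconds / self.count if self.count else 0.0

    @contextmanager
    def measure(self, clock=time.perf_counter):
        # Time whatever runs inside the "with" block, even if it raises.
        start = clock()
        try:
            yield
        finally:
            self.record(clock() - start)


counter = DurationCounter("our service response")
with counter.measure():
    pass  # receive the order and hand it to the fulfillment service here
```

The clock parameter is only there so the timing source can be swapped out, for example in a test.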

The other method would be to simply write these "auditing" events out to an application-specific log, which a third-party tool could use to figure out the response time. That tool would need a way to correlate the "start" and "end" events from the log. This can be easily achieved via the new Windows Eventing technology in Windows Vista and Longhorn, using the CorrelationID property of the event. The problem I have with this is that we now need to log a kazillion (that is, a very large number of) events for a finite duration, just so some lazy monitoring tool can have a look at how fast the app may be running. I also have a problem with having to write that third-party tool to figure out the execution time.
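For what it's worth, the correlation step itself is not much code. The sketch below is an assumption-laden illustration in Python: it does not read the Windows event log, and the (correlation_id, kind, timestamp) tuple format is made up for the example. It only shows pairing "start" and "end" events that share an ID.

```python
def durations_by_correlation_id(events):
    """Pair "start" and "end" events that share a correlation ID.

    events is an iterable of (correlation_id, kind, timestamp_seconds)
    tuples; this layout is hypothetical, not a real log format.
    """
    started = {}
    durations = {}
    for correlation_id, kind, timestamp in events:
        if kind == "start":
            started[correlation_id] = timestamp
        elif kind == "end" and correlation_id in started:
            # Matching start found: the difference is the response time.
            durations[correlation_id] = timestamp - started.pop(correlation_id)
    return durations


log = [
    ("order-1", "start", 0.0),
    ("order-2", "start", 1.0),
    ("order-1", "end", 4.0),
    ("order-2", "end", 3.5),
]
print(durations_by_correlation_id(log))  # {'order-1': 4.0, 'order-2': 2.5}
```

The pairing is the easy part; the cost is that every start and end event has to be written and kept somewhere before anything can read it.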

Another way could be to use WMI events, which are never collected unless something "subscribes" to them. This would again require a custom tool to collect the events, figure out the running time of particular actions, and report them. Again, I have trouble with the "having to write a monitoring tool" aspect.
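The "nothing is collected unless something subscribes" idea can be sketched in a few lines of Python. This is not the WMI API; EventSource and its methods are hypothetical names used only to show an event that costs nothing until a listener shows up.

```python
class EventSource:
    """Hypothetical event source that does no work without subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def emit(self, make_event):
        # make_event is a zero-argument function, so the event is only
        # built when at least one subscriber is listening.
        if not self._subscribers:
            return False
        event = make_event()
        for callback in self._subscribers:
            callback(event)
        return True


source = EventSource()
received = []

# No subscriber yet: the event is never built or delivered.
source.emit(lambda: {"order": "order-1", "kind": "start"})

# Once a monitoring tool subscribes, events start flowing to it.
source.subscribe(received.append)
source.emit(lambda: {"order": "order-1", "kind": "end"})
```

After this runs, received holds only the second event, which is the point: the application pays for an event only while someone is watching.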

So I'm now back to the performance counter. Why? The code is easy to implement. I already have a viewer tool built into the operating system. Enterprise management tools already know how to "use" performance counters. Nothing has to maintain gigabytes of data to monitor me.