Troubleshooting 101 - Defining the Problem

This is the first in a series of blog posts I am developing that will revolve mainly around the art of Troubleshooting. While these topics will not be limited to Exchange Server, most of the examples I give will be, since that is my primary specialty.

Why Troubleshooting?

For many IT Professionals and Support Engineers, the ability to effectively troubleshoot an issue is becoming a lost art. And why shouldn't it be? Exchange 2003 has become a very mature product where the odds of running into a problem that no one has seen are slim. Exchange 2007 is quickly approaching maturity as well, with SP2 and Exchange 2010 on the horizon. In truth, most problems can be solved by a quick search through the Knowledge Base or a Bing query. After all, that's what they are there for. So what happens when those queries come up blank? It could be that you've come across a brand new problem. It could be that you didn't query on the right words. Most often, though, the answer is that you don't really know the nature of the problem you are dealing with.

The Importance of a Clear Problem Statement

When a support incident is opened at Microsoft, the most important question that gets asked is: "What is the problem you are experiencing?"

Everything that happens with the call, direction and troubleshooting from that point on is directly related to that answer. A vague problem statement can, at best, get you routed to the wrong group where you have to go through the whole definition again. At worst, it can completely confuse the people who are there to help and send the case in the wrong direction. When that happens you can expect the time to solution to at least double or triple. Nobody wants that.

I recently worked a long-running, high-profile case where the problem statement was "Issue with Exchange Server." I jumped on a conference call with the customer, and the topics of conversation ranged from message traffic to server performance to Outlook service packs. There was no real plan or direction for the troubleshooting to go because the initial statement was so vague. Of course, without a plan there's no attack. With no attack, there's no victory! (points to whoever gets that reference!)

The very first thing we did was to hash out the actual problem the customer was experiencing. In this case, it was similar to "All Outlook Clients Randomly Disconnect From 3 Exchange 2003 Active/Passive Clusters."

Things to note in that problem statement:

1. ALL Outlook Clients. Not a particular version. Not a particular segment on the network. Not OWA clients.

2. "Randomly Disconnect." - The issue is not reproducible at will and is transient in nature.

3. This affects multiple Exchange 2003 servers.

4. No mention of errors or warnings because there weren’t any.

This is the kind of issue where you can’t just plug an error code into a browser and hope for the best. For this issue you have to know how to troubleshoot.

Based on that problem statement we were able to draw up a solid action plan.

Since it's all clients, that narrows us down to Network or something Server Side.

Since the clients randomly disconnect, that further indicates Network or something Server Side.

Since it affects multiple servers, that indicates Network or Active Directory Connectivity.
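To make that reasoning a little more concrete, here is a minimal sketch in Python that tallies how many of the three observations point at each area. The dictionary keys, the area labels and the `tally_suspects` helper are my own illustrative names, not part of any tool we used on the case.

```python
from collections import Counter

# Each observation from the problem statement, mapped to the areas it points at.
# The mapping mirrors the three deductions above; the labels are illustrative.
observations = {
    "all Outlook clients affected": ["network", "server side"],
    "clients disconnect randomly": ["network", "server side"],
    "multiple servers affected": ["network", "Active Directory connectivity"],
}


def tally_suspects(obs):
    """Count how many observations point at each area."""
    counts = Counter()
    for areas in obs.values():
        counts.update(areas)
    return counts


suspects = tally_suspects(observations)
for area, hits in suspects.most_common():
    print(f"{area}: {hits}")
```

A tally like this only helps decide what data to collect first. It doesn't rule anything out, so every area that shows up at least once stays on the list.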

After knowing these details the action plan became simple:

1. Concurrent Netmon traces of a client and server while the issue was occurring.

2. All performance counters from the servers, both while the issue was occurring and beforehand, to establish a baseline.

3. Memory dumps of Store.exe and Lsass.exe while the issue was occurring.

I won’t get into specifics here since this blog is focused on defining the problem. However, from that data we were quickly able to identify the problem as an overworked Global Catalog server and implement some changes to increase performance.

I know this whole blog sounds simple and a bit like a self-help book, but my experience is that long-running problems are most often ill-defined at the outset. When you define a problem statement, you have to be precise and clear.

Who or What is being specifically affected?

How are they being specifically affected?

How often are they being specifically affected?

It’s fine if you don’t immediately know the answers to these questions. Many times you won’t. Getting the answers is part of the troubleshooting process. The point is that when troubleshooting a problem, the absolute first thing you have to do is agree on a clear, well-defined problem statement.
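If it helps, the three questions above can be treated as a small checklist. The sketch below is one hypothetical way to do that in Python. The `ProblemStatement` class and its field names are made up for illustration, and the two examples reuse the vague and the refined statements from the case described earlier.

```python
from dataclasses import dataclass


@dataclass
class ProblemStatement:
    who_or_what: str  # Who or what is being affected?
    how: str          # How are they being affected?
    how_often: str    # How often are they being affected?

    def missing(self):
        """Return the names of the questions that are still unanswered."""
        return [name for name, value in vars(self).items() if not value.strip()]


# The original statement only hints at "how" and says nothing else.
vague = ProblemStatement(who_or_what="", how="Issue with Exchange Server", how_often="")

# The refined statement answers all three questions.
clear = ProblemStatement(
    who_or_what="All Outlook clients on 3 Exchange 2003 Active/Passive clusters",
    how="disconnect",
    how_often="randomly",
)

print(vague.missing())
print(clear.missing())
```

Anything returned by `missing()` is simply a question to go and answer before the real troubleshooting starts.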

More to come later!