The Expert Problem Solver

I've spent years watching some of the best problem solvers in action. They never cease to amaze me with their ability to slice and dice a complex problem down to its root cause in the most efficient way possible. When I started thinking about what personality traits make them such experts, a pattern started to emerge:

-Ability to think holistically – See the big picture in terms of impact, time constraints and value.

-Ability to visualize the workflow – To see how systems are built upon each other to accomplish a goal.

-Familiarity with features – You need to know what you are troubleshooting.

-Knowledge of Support Boundaries – Knowing what you can and can’t do.

-Knowledge of Resources – Knowledge Bases, Repro Labs, etc.

-Not Ego Driven – Knows when to ask for help or bring in other resources.

-Natural Curiosity and Skepticism – Gets that feeling when something doesn’t sound right. Double checks things.

An expert troubleshooter uses their experience before breaking down a problem. They first act on any hunches gathered from the problem statement that relate to previous experience or a similar issue. This is a very important step, since a seasoned troubleshooter may be able to solve the problem quickly and efficiently.

This is a best case scenario, but limited time should be allocated to it if the quick fix doesn’t work. A common pitfall troubleshooters hit is spending too much time focusing on something familiar and no longer listening to the customer, who may be giving them valuable data.

Important Tips:

-Don’t hold onto a hunch when other options are presented.

-Don’t try to solve multiple problems at once. Pick one and work it until you hit a roadblock.

-What works for one customer may not work for all customers. Every IT environment is different and has different needs.

-Don’t give a definitive statement without data to back it up.

-Hit up your resources whenever possible. Just because you’ve never seen something doesn’t mean they haven’t.

The first question that needs to be asked after you have your problem statement is “What is the Expected Behavior?”

When dealing with a technical issue, you are dealing with one of three things:

1. It’s Broken

2. It’s Misconfigured

3. It’s Slow

A broken application/server doesn’t work at all. The hardware or software process constantly and consistently crashes or generates a hard fault. A blue screen is a good example.

A misconfigured application/server continues to run, but doesn’t generate the expected behavior. These are the hardest to troubleshoot.

A slow application/server gives the expected results, but does so in an unacceptable amount of time.

It’s Broken

When an application or piece of hardware is truly broken that means there is a bug in the software or a malfunctioning piece of hardware. The most common symptom that is associated with a broken application is an exception. An exception is the response you see when an error has occurred. There are 2 types of exceptions: 1st and 2nd Chance Exceptions.

Think of it like taking a test at school. You spend days studying for the test and programming your mind. When you take the test, that programming successfully gets you through a few questions until you get to one that is completely unfamiliar.

At that point you can either skip the question and come back to it or completely freak out, quit the test, quit school and move back in with your parents. A crash is a second chance exception.

A first chance exception would be skipping the question and coming back to it. You may get the question the second time around or you may freak out again.

A 2nd chance exception can also be referred to as an “Unhandled Exception.” When these occur, we would generally expect either a crash dump to be created or an error to be thrown. These are crashes, and they need a specific kind of troubleshooting.
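To put the test-taking analogy in code, here is a minimal Python sketch (my own illustration, not tied to any particular application): the same error is either caught by a handler, which is the first chance exception the program survives, or left unhandled, which is the second chance exception that ends the process.

```python
def answer_question(question):
    # Simulates a step that may hit an unfamiliar case and raise an error.
    if question == "unfamiliar":
        raise ValueError("no idea how to answer this")
    return "answered"

def take_test_with_handler(questions):
    # "First chance": the exception is raised, but a handler catches it,
    # so the process keeps running (skip the question, come back later).
    skipped = []
    for q in questions:
        try:
            answer_question(q)
        except ValueError:
            skipped.append(q)
    return skipped

def take_test_without_handler(questions):
    # "Second chance"/unhandled: nothing catches the exception, so the
    # process terminates (the crash) and the runtime reports the error.
    for q in questions:
        answer_question(q)

if __name__ == "__main__":
    print(take_test_with_handler(["easy", "unfamiliar", "easy"]))  # survives
    take_test_without_handler(["easy", "unfamiliar"])              # crashes here
```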

How to determine if you are dealing with an Exception:

1. Look for Error Events. You could see an error in various logs, including application-specific logs or the default Windows Logs.

2. Does the Process ID of the application change? If the PID of a process changes, that means it was stopped and restarted (see the sketch after this list).

3. Look for a dump file. Under many circumstances, a crashing application will dump its memory out to Dr. Watson.

4. There are rare circumstances where you see a “silent exit,” in which a process terminates without generating any errors or dump files. These are the most difficult to find.

5. Is there a memory leak crashing an application?
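As a rough illustration of the PID check in step 2, the sketch below polls the process list and reports when the set of PIDs for a given image name changes. It assumes the third-party psutil package is installed, and the process name shown is purely a placeholder.

```python
import time
import psutil  # third-party: pip install psutil

def pids_for(image_name):
    """Return the set of PIDs currently running under the given image name."""
    return {p.pid for p in psutil.process_iter(["name"])
            if p.info["name"] and p.info["name"].lower() == image_name.lower()}

def watch_for_restart(image_name, interval=30):
    """Poll the process list; a changed PID set means the process died and came back."""
    baseline = pids_for(image_name)
    print(f"Baseline PIDs for {image_name}: {baseline}")
    while True:
        time.sleep(interval)
        current = pids_for(image_name)
        if current != baseline:
            print(f"PID change detected: {baseline} -> {current}")
            baseline = current

# Example with a hypothetical image name:
# watch_for_restart("someapp.exe")
```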

Troubleshooting an Exception or a Crash

Once it has been determined that there is a genuine exception or crash, there is only so much troubleshooting that can be done outside of code debugging. This debugging can be done through very specific logging or memory dumps. Memory dumps are the preferred method for Root Cause Analysis. Therefore, the first goal is to generate a dump file if one wasn’t automatically generated. You can also get an iDNA trace if the process is crashing on startup.

Configure Dr. Watson to generate full memory dumps.

Start -> Run -> drwtsn32.exe on 32-bit machines.

Download the Debugging Tools for Windows if you need to take specific dumps.

http://www.microsoft.com/whdc/devtools/debugging/default.mspx

You also NEED Performance Monitor data when debugging in many situations. It is always better to have too much information than not enough, and often this data is critical for finding root cause. Take the extra time to set up a Counter Log with All Counters and Instances.
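The sketch below is not a substitute for a full Perfmon counter log, but it illustrates the kind of baseline data you want collected over time. It again assumes psutil is available; the sample count, interval, and file name are arbitrary.

```python
import csv
import time
import psutil  # third-party: pip install psutil

def collect_baseline(path="baseline.csv", samples=60, interval=5):
    """Sample a few basic counters on an interval and write them to a CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "cpu_percent", "mem_percent",
                         "disk_read_bytes", "disk_write_bytes"])
        for _ in range(samples):
            cpu = psutil.cpu_percent(interval=None)
            mem = psutil.virtual_memory().percent
            disk = psutil.disk_io_counters()
            writer.writerow([time.time(), cpu, mem,
                             disk.read_bytes, disk.write_bytes])
            time.sleep(interval)

# collect_baseline()  # run the same collection on the problem server and a healthy one
```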

Other Options

If the crash is reproducible, you may be able to find the problem based on what is happening right before the crash.

Example:

In the Exchange world, it is sometimes possible for a malformed message to crash Transport. The message will either be sitting in the Mailroot\VSI\1\Queue directory in Exchange 2003 or in the Poison Queue in Exchange 2007. Turning up Diagnostic Logging will generate a warning or error pointing to this message. Thus, this is an exception you can control by finding a way to prevent that message from reaching the servers. You can remove that message from the queue and stop further occurrences through a firewall or through built-in spam prevention.

The downside to that prevention technique is that it doesn’t give us full Root Cause. It tells us what caused the crash, but not why. To get the why, you may need the memory dumps as well.

Misconfigured

A configuration error is the most common cause of outages in an IT environment. Often there are so many people making changes that one hand doesn’t know what the other hand is doing. Configuration errors can cause exceptions or crashes, but they generally result in the application/server not generating the desired outcome.

Problems are most easily identified as configuration related by asking the questions:

Has this ever worked?

What changed?

If the server or application has never given the expected results, then you are probably dealing with a setup issue. These problems simply require retracing the steps the customer has taken to find the problem.

If the application/server has worked in the past, then the next step is to determine what changed. This is where the bulk of IT troubleshooting is done. There are two approaches to tackling this issue: discover what changed, or troubleshoot based on symptoms. Both of these approaches have distinct advantages and problems.

Finding what changed is a big picture approach. It involves gathering all the involved parties and running through change control data. There are utilities like MOM, MPS Reports and ExBPA to help gather change information. The goal of the discovery process is that someone will know what changes have been made and you can determine if they are possible causes. The problem is that, under most circumstances, the people driving the problem simply don’t know what has changed. If they did, they probably wouldn’t need your help fixing it.

So the main advantage here is that this approach can very quickly narrow down the issue if the right people are involved. The main disadvantage is that you can spend a huge amount of time looking for changes and not finding any.

Examples of Big Picture troubleshooting techniques are the 5 Whys and Kepner-Tregoe.

Symptom based troubleshooting can be extremely efficient depending on the knowledge level of the people troubleshooting and access to resources. In Symptom based troubleshooting, the goal is to take the symptoms of the issue at hand and rule them out one by one.

For example, take the age-old Help Desk call of “My computer won’t start.”

Possible causes:

-It’s not plugged in.

-The monitor isn’t hooked up.

-There’s a BIOS problem

-Probably a lot more causes that I can’t think of right now

The help desk engineer would generally start eliminating causes. They would make sure the computer is plugged in, and then make sure the monitor is plugged in and turned on. If the problem still persisted, they would look at the BIOS. If they found it was a problem with the BIOS, they would then start eliminating possible causes from the BIOS until they found root cause and fixed the issue.
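In code form, symptom-based elimination is just an ordered list of checks where you stop at the first one that fails. The Python sketch below is purely illustrative; the checks and their results are placeholders.

```python
def power_cable_connected():
    return True   # placeholder: in reality you would ask or check

def monitor_connected_and_on():
    return True   # placeholder

def bios_posts_correctly():
    return False  # placeholder: pretend the BIOS check fails

CHECKS = [
    ("Power cable plugged in", power_cable_connected),
    ("Monitor connected and powered on", monitor_connected_and_on),
    ("BIOS completes POST", bios_posts_correctly),
]

def eliminate_causes(checks):
    """Rule out possible causes in order; stop at the first one that fails."""
    for description, check in checks:
        if check():
            print(f"Ruled out: {description}")
        else:
            print(f"Possible root cause area: {description}")
            return description  # drill into this area next
    print("All listed causes ruled out; widen the search.")
    return None

eliminate_causes(CHECKS)
```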

Most engineers take this approach because it allows them quicker access to the immediate problem. The main drawback to Symptom based troubleshooting is that it is dependent on the knowledge level of the person doing the troubleshooting. A lot of time can be lost gathering data for things that aren’t related to the issue at hand.

Let’s say, for example, we take the “My computer won’t start” problem and say it was caused by a hardware update that is documented in the company’s change control records. Because the engineer immediately went to troubleshooting the symptoms, they have potentially wasted a huge amount of time looking at symptoms that aren’t related to the issue.

As you can imagine, the best way to troubleshoot is a combination of the Big Picture and Symptom based approaches.

 

It’s Slow

A slow application/server is generating the correct results in an unacceptable amount of time. These problems are rarely caused by an exception but can be caused by a configuration error. For the purposes of this paper, we will assume we are dealing with purely performance-based issues.

Before you can start to troubleshoot slow applications/servers, it’s best to ask what it is being compared to. Slow and fast are both perceptions that are relative without concrete benchmarks.

However, troubleshooting slow systems has a specific troubleshooting path:

Establish a Performance Monitor baseline on the problem server and on another server to isolate differences. Through this, you can determine whether you are dealing with a disk I/O issue, a network issue, a memory issue, and so forth.
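As one way to make that comparison concrete, the sketch below averages the counters in two baseline CSVs (for example, ones produced by the earlier collect_baseline sketch) and flags the counters that differ most between the problem server and a healthy reference server. The file names and the 2x threshold are placeholders.

```python
import csv
from statistics import mean

def column_averages(path):
    """Average every numeric column in a baseline CSV, keyed by counter name."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return {col: mean(float(r[col]) for r in rows)
            for col in rows[0] if col != "time"}

def compare_baselines(problem_csv, healthy_csv):
    """Print each counter's average on both servers and flag large differences."""
    problem = column_averages(problem_csv)
    healthy = column_averages(healthy_csv)
    for counter, value in problem.items():
        reference = healthy.get(counter)
        if not reference:
            continue
        ratio = value / reference
        flag = "  <-- investigate" if ratio > 2 or ratio < 0.5 else ""
        print(f"{counter}: problem={value:.1f} healthy={reference:.1f} "
              f"ratio={ratio:.2f}{flag}")

# compare_baselines("problem_server.csv", "healthy_server.csv")
```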