How a Simple IT Operations Checklist Can Prevent a Load of Grief

Written by Kip Ng, Principal Microsoft Premier Field Engineer, based in Canada.

6:05 p.m.: I was called and informed that the system had gone down. I rushed back to the office.

7:10 p.m.: I arrived and found that the drive containing the database had failed. How could it be? It was running on RAID-5. Well, let’s figure that out later and let’s fix the problem first.

7:30 p.m.: We decided to restore data from the backup.

9:00 p.m.: After we managed to locate the backup personnel and looked through the backups, we discovered that the backup had not been successful for the past seven days. Dear Lord, why is it that no one knew? We went back to the conference bridge to decide the next course of action.

9:56 p.m.: We decided to restore the week-old database and then attempt to replay the transaction logs to recover the missing transactions.

11:35 p.m.: The database restore was successful. However, the attempt to replay the transaction logs failed. What else can go wrong now? We went back to the conference bridge to provide status.

12:30 a.m.: We found out that the transaction log replay failed because someone had turned on circular logging. Whose bright idea was THAT?!?

Does any of this sound remotely familiar? If it does, I can tell you that you’re not alone. In my line of work, I deal with lots of customers and a ton of critical situations, and many of these are similar to what’s outlined above. In every post-mortem, I continue to preach the following: “To help ensure the availability and reliability of your IT systems, you must actively monitor the physical platform, the operating system, and essential services and components.”

Preventive maintenance and monitoring, combined with disaster recovery planning, help prevent problems and minimize downtime when problems do occur. Let me give you some examples from the scenario above:

  • Storage failure. A simple event log check on a regular basis might have detected the drive failure. Having SAN or RAID technology does not mean storage will never fail. RAID-5, for example, cannot survive the failure of more than one hard drive. It does, however, give you the opportunity to detect single-drive issues early, which can potentially prevent the catastrophic failure outlined above.
  • Unsuccessful backup. A simple daily check could have detected this, and action could have been taken far in advance of any critical situation.
  • Circular logging turned on. A configuration check would have detected this earlier.
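
As an illustration of how small such a check can be, here is a minimal Python sketch of the backup item. It assumes you can already obtain the time of the last successful backup from your backup tool; the function name `backup_is_stale`, the 24-hour threshold, and the example dates are all hypothetical choices, not part of any product.

```python
from datetime import datetime, timedelta


def backup_is_stale(last_success, now, max_age=timedelta(hours=24)):
    """Return True if the last successful backup is older than max_age."""
    return (now - last_success) > max_age


# Hypothetical example: the last good backup finished seven days ago,
# as in the incident described above.
now = datetime(2024, 1, 8, 18, 0)
last_success = now - timedelta(days=7)

if backup_is_stale(last_success, now):
    print("WARNING: no successful backup within the allowed window")
```

Run daily, a check like this would have raised the alarm on the first missed backup rather than the seventh.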

Yes, in a perfect world, everything is automated and monitored and that’s all wonderful. Unfortunately, most real-world environments are not. That’s why it’s so important for every company to have a daily, weekly and monthly Operations Checklist.

Why do you need it? Do I really need to say more? It’s to prevent what happened above, and:

  • To help you to meet the performance requirements of your service level agreements (SLAs).
  • To ensure that your overall system is in good health: the daily backup succeeded, there are no disk errors, server temperature is normal, disk usage is within limits, the network is not congested, and so on.
  • To allow you to take preventive action.
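
Some of these health items are easy to script. The following is a rough Python sketch of the disk usage item only, using the standard library's `shutil.disk_usage`. The helper names and the 85% warning threshold are assumptions made for illustration; pick limits that match your own environment and SLAs.

```python
import shutil


def disk_usage_percent(path):
    """Return how full the volume holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_disk(path, warn_at=85.0):
    """Return ("OK" or "WARN", percent used) for the volume holding `path`."""
    percent = disk_usage_percent(path)
    status = "OK" if percent < warn_at else "WARN"
    return status, percent


# Check the volume that holds the current directory.
status, percent = check_disk(".")
print(f"{status}: volume holding '.' is {percent:.1f}% full")
```

The same pattern (measure, compare against a limit, report) applies to most items on a daily checklist, whether a script or a person performs the check.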

So, do you have an Operations Checklist for your environment? If not, start thinking about one and working on it. For example, here’s an Operations Checklist for Microsoft Exchange Server 2007 that I hope you find useful both for its content and as a model for other checklists you might build.