To hotfix or not to hotfix, that is the question

Let me start this blog post by making it clear that this is my opinion, not an official stance. Everyone is entitled to their opinion and this is just mine. Feel free to ignore it if you like.

I often have discussions with customers about applying hotfixes and, more recently, cumulative updates. The discussion revolves around whether these should be applied proactively or not. Many folks in the support organizations here at Microsoft want to make sure they are looking after our customers, and so they recommend applying all the latest and greatest fixes. The thought is that doing so avoids fighting issues that are already known and fixed. This can be a big time and frustration saver compared with battling for days to get something to work, only to find out there was already a fix for the issue.

Change is risk. This is why most medium and large companies with mature IT processes have some form of change control in place. Any change means a deviation from the status quo, and while that may improve things, it has the potential to cause problems as well. The standard way to mitigate that risk is to test changes before committing to them in production. This is a good practice for IT groups as well as internally at Microsoft. Everything that Microsoft releases is tested before it is let out “into the wild”. As some of you may have noticed, occasionally that testing has missed a scenario or two and a release has had unforeseen side effects. Fortunately, I think such cases are becoming rarer as process and diligence improve. Unfortunately, we aren't at a zero occurrence rate just yet.

I was a test lead/manager for several years in the SCCM development group and was part of many hard discussions on how much testing is the right amount for a given problem. With only a few quick tests, the possibility of a regression or unforeseen problem caused by the fix was high. At the other end of the spectrum, I could sit down and dream up an infinite number of potential tests, meaning that the product could never be released.

To give you an off-the-wall example of how far these tests can go, think of a screwdriver. It is a fairly simple and straightforward thing that most of us use without really thinking about it. Possible tests include: can it turn to the right, turn to the left, fit screw size A, fit screw size B, be comfortable in my hand, be comfortable in my kid's hand, not break under normal use, not break in a deadly way under more extreme use, not melt in my hand, not melt out in sunlight, not melt in my garage with the heater left on during a 100-degree day, go to space, not wear down too quickly, look good on a store shelf to sell better, etc.

So, how much testing is enough testing? Well, the balance point changes. The more critical and time-sensitive something is, the more risk we take by doing less testing. Hotfixes are generally at the higher end of the risk scale. We test them in the lab, maybe with some internal folks, and usually with at least a few customers before they become available for everyone to download. The testing is limited because we want to be able to get them out fairly quickly. Cumulative updates get a little more testing, especially of the interaction between the multiple fixes. The full cycle of development, testing, and release happens in only a few months with a limited number of people, and is in addition to the testing that some of the included hotfixes may have already been through. So while this is more coverage than a typical hotfix gets, it isn’t really what I personally would consider “low risk” just yet (although it is, arguably, getting close). SCCM CUs hold less risk than a typical standalone hotfix would, and typically only include items whose risk is clearly understood and was covered by internal testing. Service packs get much more rigorous testing across many different in-house and external scenarios. The chances of a problem arising from a service pack are very low (or at least well understood and documented), and thus on a personal level I consider it a “low risk” type of deployment.

So, why do I write all this up? The advice I have always heard is “if it isn’t broke, don’t fix it”, and in general I think that applies to software patching. Don’t apply a hotfix or a CU unless you are experiencing the symptoms that it is meant to address. Yes, you might waste a weekend battling a problem only to find out that a fix already existed. Compared to wasting a weekend applying the fix and then battling an issue caused by it, I think not applying a fix that wasn’t truly needed is a good trade-off. There is one caveat I make to this statement, however, and that is for “invisible” problems. These are problems that you may actually be having but not know about. A good example is a memory leak. Sure, you might have a leak in your admin console (as an example), but if you close it at least once per day then you never notice it. The fact that it is sluggish every Monday after you left it open all weekend, until you restart it, has just become a habit that you have never bothered to investigate. A fix that solves admin UI memory leaks might help, or might be completely unrelated and do nothing for you, but it is worth considering applying proactively.

So now I shall get down off my soapbox. There are many smart people whom I respect who disagree with me on this stance, and in the end what works best for one company may not work best for all companies. Make the choice you deem appropriate for your company and your role. I hope that it works out well for you in any case.


8/15 - Minor updates to clarify CU