What happens when a customer has hit a bug and wants it fixed?

Well, the first thing to do is to verify that this is our bug. One discussion that crops up over and over is:


Customer: “Your heap manager sucks. Look, it is crashing in the heap free routine”

Me: “Well, it could be our problem but I haven’t seen a heap manager bug since NT4. That crash is normally associated with a corrupt heap”

Customer: “But it is crashing in your routine!”

Me: “Yes, but it looks like the heap has been overwritten. We need to find out who did that”


In a perfect world, all of our APIs would be bulletproof. Pass them a bad block of memory and they spit it back with an error code. Do a bad cast and the API spots that this is an ANSI string, not a Unicode one like you said. In a perfect world, there would always be metadata available and no-one would mind that Windows ran at half the speed. This is not a perfect world and it is possible to blow your application up pretty good if you do the wrong things. If that happens, we will still do everything we can to help out but we can't fix our OS to make a bad app run.


Let us assume that it is something that we have done wrong as per the previous blog. I may be 100% sure that it is our bug but rules are rules. Development get to make the call on whether this is a bug. I log a bug of a special type which asks our developers to perform a specific action, in this case to confirm this is a bug or tell me why it isn’t.


Access violations and hangs are pretty much always bugs.

Memory leaks are always bugs – large caches and heap fragmentation generally are not.

Performance issues are generally not bugs unless they are really bad. Some things do take a long time.

Failures to perform as per the documentation are debatable. Sometimes the documentation is wrong.

A change in undocumented behaviour is hardly ever a bug. If we didn’t say what it was supposed to do, whatever it does (other than crash/hang/leak) is probably OK.


In the case described in my last blog, it is pretty clear that this would have to be a bug. Let us assume that the dev who looks at it agrees. The developer may suggest some workaround that I hadn’t thought of – after all, he or she is a smart cookie.


The next stage is to assess the business impact and the possibility that we can get a solution to the customer without fixing this as a hotfix. This might seem like a strange thing to do. You would expect us to want to fix all bugs. The honest answer is that we do want to fix all bugs. No-one likes having bugs in their code. We are also very aware that every fix has a risk associated with it. Fixing bugs before a product is released is low risk and cheap. Bugs are constantly fixed during development. It is part of the process for us and pretty much every other software developer. Fixing bugs after a product is released is high risk and expensive. A released product is basically stable. We have to be very sure that a fix to a bug will improve the stability overall. Before we release a product, we have an army of Beta testers hammering away at the product. It gets many thousands of hours of testing under a wide range of conditions. Fixes before the Beta get really, really well tested. Fixes to a released product get tested by us and by the customer who asks for the fix. Even with the best programmers and best testers, that makes us a little nervous. People rely on our runtimes for their business. People rely on our runtimes in the military or in medical systems sometimes. We need to be very sure that the risk of reducing the stability is lower than the risk of leaving the bug as it is for now. If it is possible to leave the bug until the next service pack then that is what we will do. Service packs get a Beta program all of their very own. A bug fixed in a service pack is better tested.


So, we need to consider how much the bug is hurting the customer. In the case above, it might be possible to work around the problem by having the customer insert some checks in their code to avoid the error case. If they could do that, then we would certainly want them to do so. That involves changes to one application rather than a runtime used by millions of people. The smaller the change, the lower the risk. I know that it causes a lot of frustration when we ask a developer to work around something that is clearly our error. I understand the frustration completely. I was a developer myself for many years. There is a good reason for the request though.


In this hypothetical scenario, let us assume that the error comes from invalid viewstate passed in as part of the URL. There isn’t much that the customer can do about checking that so it isn’t going to be practical to work around this problem. Worse yet, it opens up the customer to denial of service attacks. We should fix that if we possibly can.


Cost to us is less of a factor. Hotfixes are expensive because we have to pay Devs and testers and a build lab. Some teams have dedicated “sustained engineering” teams and some have the main dev teams do fixes as well. I am not going to give the numbers here but my car cost less than a lot of the fixes that we do. Sometimes we just have to grin and bear it.


After reviewing the bug with development, we decide that we have a bug with limited risk and a currently high impact which could become much worse if anyone discovers it as a DoS exploit. In this case, Dev will give me the OK to request a hotfix. If it had been riskier or less severe then I would have logged the bug and we would have looked at getting it fixed in the next service pack or next version.


When I file a fix request, it is very likely that the request will be granted. The only real chance that it will be rejected is if there is an unforeseen complication. I will file the fix with a priority that is based on how much it is hurting the customer. If it is really causing major problems for a live server then it will get a high priority. If it is going to be a problem for a system that doesn’t go live for 3 months then it gets a lower priority. That doesn’t mean it is less important, just less urgent.


At this point, most of the activity is in Redmond rather than here. The first step is for development to reproduce the problem (if possible) and review the conclusions already drawn. Depending on what the bug is, we may have a detailed understanding of what code changes are needed or we may not. In this case, we know exactly what needs doing. We need to release the critical section in the error case so this is an ideal case.


So, the developer creates a test version of the fix which we call a “private” or a “buddy build”. This is for internal testing – if we can reproduce the problem inside Microsoft. If not then we ask the customer to help us test. The rules on private fixes are very clear. They should never be put on any machine other than a test machine which can be rebuilt if need be. They are virtually untested. Running them is a high risk thing to do and there is a considerable risk of side-effects because the test version might not have been created in a controlled build environment. Still, it should address the problem and so provide proof of concept. Privates don’t come with installers either and so have to be manually copied onto the machine and registered or GACed by hand. Sometimes a customer will run a private in a production environment if the alternative is a completely dead system but we don't recommend it at all.


If the private works well then we proceed to proper QA, packaging (building the installer and EULA) and testing. It is about now that a knowledge base article number is associated with the fix. The actual article doesn’t generally get written until rather later. If testing goes well then we have a shiny new hotfix that can be delivered to the customer.


Next blog, I will talk about licensing and localisation.