Random and unexpected EXCEPTION_FLT_DIVIDE_BY_ZERO and EXCEPTION_FLT_INVALID_OPERATION
Your application is running fine, then one day it starts to fail with EXCEPTION_FLT_DIVIDE_BY_ZERO (0xC000008E) or EXCEPTION_FLT_INVALID_OPERATION (0xC0000090) exceptions at seemingly random places.
One reason might be the current value of the floating point control word (fpcw). This is a bit mask used to control whether Intel 8087 and later CPUs raise exceptions or not when certain types of floating point errors occur. There is a good article about this over on the openwatcom.org site.
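To make the "masked" case concrete, here is a minimal C++ sketch. It uses the standard <cfenv> facilities rather than anything Windows-specific, and it assumes the calling thread still has its default floating point environment, in which the zero-divide exception is masked. The struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical result type: what a masked divide-by-zero leaves behind.
struct MaskedDivideResult {
    bool quotient_is_infinite;   // the operation completed and produced +inf
    bool divide_by_zero_flagged; // the sticky status flag was set, but no trap
};

// Divides 1.0 by 0.0 at run time. Assumes the zero-divide exception is
// masked on this thread (the default), so no hardware exception is raised.
MaskedDivideResult divide_one_by_zero_masked() {
    std::feclearexcept(FE_ALL_EXCEPT);
    assert(std::fetestexcept(FE_DIVBYZERO) == 0); // flag starts out clear

    // volatile keeps the compiler from folding the division away or
    // moving it past the status check below.
    volatile double zero = 0.0;
    volatile double quotient = 1.0 / zero;

    return { std::isinf(quotient),
             std::fetestexcept(FE_DIVBYZERO) != 0 };
}
```

With the exception masked, the division quietly yields infinity and only a status flag records that anything happened. With the corresponding mask bit cleared, the same x87 division would trap instead, which is the kind of fault Windows surfaces as EXCEPTION_FLT_DIVIDE_BY_ZERO.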
In Windows applications the usual value for fpcw is 027F. For example, just fire up notepad, attach WinDBG and do the following:
0:001> ~*e r@fpcw
The value of this register can be set in many ways; for example, the C runtime functions _control87, _controlfp and __control87_2 can all modify it (the related _clear87, _clearfp, _status87, _statusfp and _statusfp2 functions work on the floating point status word instead). But if it is modified, it fundamentally changes the ground rules for how some floating point operations will behave on that thread until the original value is restored.
As 027F is the "normal" value on Windows, almost all Microsoft and third party applications, components, frameworks and libraries are written and tested with the expectation that this register will have this value and that floating point operations will behave in a particular way, either raising or not raising exceptions. Therefore any code that needs to modify this register has a duty to change it back again when finished, unless it is running on its own private thread. If it does not, mayhem will result.
And that is exactly what we see sometimes here at Microsoft support.
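For code that genuinely needs a different floating point setup, the save-and-restore duty described above could look roughly like the following. This is only a sketch: it swaps in the portable <cfenv> calls (std::fegetenv / std::fesetenv) in place of the _controlfp family, it changes the rounding mode purely as a stand-in for "some non-default setting", and FpEnvGuard and run_with_upward_rounding are hypothetical names.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical RAII helper: snapshots the calling thread's floating point
// environment on construction and restores it on destruction, so the
// caller never sees the temporary change.
class FpEnvGuard {
public:
    FpEnvGuard() { std::fegetenv(&saved_); }
    ~FpEnvGuard() { std::fesetenv(&saved_); }
    FpEnvGuard(const FpEnvGuard&) = delete;
    FpEnvGuard& operator=(const FpEnvGuard&) = delete;

private:
    std::fenv_t saved_;
};

// Hypothetical library routine that wants a non-default rounding mode.
// Returns the rounding mode that was in effect while it ran.
int run_with_upward_rounding() {
    FpEnvGuard guard; // restores the caller's settings on every exit path

    const int rc = std::fesetround(FE_UPWARD);
    assert(rc == 0 && "FE_UPWARD is assumed to be supported here");
    (void)rc;

    // ... floating point work that depends on the changed setting ...
    return std::fegetround();
}
```

Tying the restore to a destructor means an early return or an exception inside the routine still puts the host thread's settings back.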
I had a case not so long ago from a systems integrator. They were delivering a solution to an end customer using a system developed by another company which in turn used applications from different vendors. After an update to one of the components was deployed these applications began to fail with floating point exceptions. Troubleshooting was particularly difficult because the end customer was in an isolated environment without access to the Internet so remote access was out of the question. Every time we wanted to do some troubleshooting someone from one of the vendors had to go on site and we'd have a phone conference where I would talk them through a series of debug steps "blind" (you develop good visualisation skills in my job).
From the outset I suspected something was modifying the fpcw, so the first thing I had them check was the value of fpcw on all threads at a point where the exceptions had started to happen but the process was still up (unhandled, these exceptions will take down a process). Sure enough, the "normal" value had somehow been changed on thread 0, the main UI thread of the application:
0:000> ~*e rfpcw
So this was the cause of the exceptions, but the harder question to answer was: who was changing it?
The lack of direct access to the system in question limited the complexity of the debug steps I could use, as I would be talking someone else (who was not familiar with debugging) through them. I decided to start by running the process under the debugger from the beginning and setting the debugger to break every time a module loaded into the process, dump out the value of fpcw on every thread and then continue execution. All this output would be captured into a log file which could be brought back from the onsite visit. This was on the assumption that whatever DLL was making the change was likely to be doing it when it first loaded into the process. To do this we used the following commands just after launching the process under the debugger:
0:001> .logopen c:\debug_session.log
Closing open log file debug_1318_2008-09-30_15-06-38-527.log
Opened log file 'c:\debug_session.log'
0:001> sxe -c "~*e r @fpcw;g" ld
(The @ before the register name tells the debugger that it is a register name and should not be evaluated as a symbol. Doing this can make debug sessions a bit more responsive.)
The output looked something like the following:
ModLoad: 053a0000 053b1000 C:\libraries\thirdpart.dll <<< start of loading of third party module
ModLoad: 77760000 778cc000 C:\WINDOWS\system32\shdocvw.dll <<< start of loading of shdocvw.dll (a Windows component)
fpcw=00001372 <<< incorrect value of fpcw on thread 0
Sometime between the start of loading thirdpart.dll and the start of the load of the next DLL into the process the wrong fpcw value was set. Therefore we can now say with a reasonable degree of certainty that it is this module that is responsible.
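To see why 1372 matters when 027F does not, it helps to pull the control word apart. The sketch below decodes the two values using the x87 control word layout as documented in the Intel manuals (bits 0-5 are the exception mask bits, bits 8-9 the precision control field). The function and type names are hypothetical, and this is an illustration rather than a complete decoder.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

struct X87MaskBit {
    std::uint32_t bit;
    const char* name;
};

// x87 control word, bits 0-5: exception MASK bits. A set bit means the
// exception is masked (no trap); a clear bit means it will be raised.
const X87MaskBit kX87MaskBits[] = {
    {0x01, "invalid-operation"},
    {0x02, "denormal-operand"},
    {0x04, "zero-divide"},
    {0x08, "overflow"},
    {0x10, "underflow"},
    {0x20, "precision"},
};

// Lists the exceptions that a given fpcw value leaves unmasked.
std::string unmasked_x87_exceptions(std::uint32_t fpcw) {
    assert(fpcw <= 0xFFFF); // the control word itself is only 16 bits wide
    std::string out;
    for (const X87MaskBit& m : kX87MaskBits) {
        if ((fpcw & m.bit) == 0) {
            if (!out.empty()) out += ", ";
            out += m.name;
        }
    }
    return out;
}

// Precision control field, bits 8-9:
// 0 = 24-bit, 2 = 53-bit, 3 = 64-bit significand (1 is reserved).
int x87_precision_control(std::uint32_t fpcw) {
    return static_cast<int>((fpcw >> 8) & 0x3);
}
```

Under this reading, unmasked_x87_exceptions(0x027F) returns an empty string (everything masked), while unmasked_x87_exceptions(0x1372) returns "invalid-operation, zero-divide, overflow", which lines up with the two exception codes the applications were failing with. The precision control field also differs: 2 for 027F and 3 for 1372.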
After discussions between all parties involved we eventually established that this module was being injected into the process to fulfil a hooking/monitoring function. Unfortunately the change to the fpcw value appeared to be a side effect of the non-Microsoft compiler the DLL was compiled with. Certain compilers seem to generate code that does this, possibly as a legacy side effect of targeting non-Microsoft operating systems in the past. The vendor was not in a position to recompile this module, so in the end they had to redesign things to avoid using it.
[A little tip, based on my experience, for spotting components compiled with certain non-Microsoft compilers. A clue lies in the link timestamp in the module's PE header (do lmvm thirdpart in the debugger):
2A425E19 time date stamp Fri Jun 19 23:22:17 1992
Now I remember that when I joined Microsoft Developer Support in 1995 we were in the beta of Windows 95, and although there was a thing called Win32s that gave some kind of 32-bit implementation on the 16-bit Windows platform, we were only just at the beginning of 32-bit computing. So I was fairly sure this component was not really compiled in 1992. I've seen this 1992 thing a few times now and I think it has usually been with PE binaries produced by a non-Microsoft compiler. ]
I've also seen cases where modification of the mxcsr register has led to very unexpected errors:
Microsoft VBScript error 800a000b
Division by Zero
This was occurring on this line of ASP code:
Imagine how confusing that was!
In that case a third party ASP.NET charting component was changing the mxcsr register to 00001fa0 or 00001fa4 instead of its "normal" Windows value of 00001f80. (The customer was hosting ASP.NET and ASP applications in the same application pool.)
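As a rough aid for reading such mxcsr values, the sketch below decodes them using the MXCSR layout from the Intel manuals: bits 0-5 are sticky exception status flags and bits 7-12 are the exception mask bits. The names are hypothetical and the decoder deliberately ignores the remaining fields (denormals-are-zero, rounding control, flush-to-zero).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

struct MxcsrFlag {
    std::uint32_t bit;
    const char* name;
};

// MXCSR bits 0-5: sticky exception STATUS flags (set when the condition
// has occurred; they stay set until something clears them).
const MxcsrFlag kMxcsrStickyFlags[] = {
    {0x01, "invalid-operation"},
    {0x02, "denormal"},
    {0x04, "zero-divide"},
    {0x08, "overflow"},
    {0x10, "underflow"},
    {0x20, "precision"},
};

// MXCSR bits 7-12: exception MASK bits (set = masked, no trap).
const std::uint32_t kMxcsrMaskField = 0x1F80;

// Lists the sticky status flags that are set in an mxcsr value.
std::string mxcsr_sticky_flags(std::uint32_t mxcsr) {
    assert((mxcsr >> 16) == 0); // bits 16-31 are reserved and must be zero
    std::string out;
    for (const MxcsrFlag& f : kMxcsrStickyFlags) {
        if ((mxcsr & f.bit) != 0) {
            if (!out.empty()) out += ", ";
            out += f.name;
        }
    }
    return out;
}

// True when all six SSE exceptions are masked.
bool mxcsr_all_exceptions_masked(std::uint32_t mxcsr) {
    return (mxcsr & kMxcsrMaskField) == kMxcsrMaskField;
}
```

Going by that layout, 1f80 has no flags set, 1fa0 has the precision flag set and 1fa4 has the zero-divide and precision flags set, while all three values keep every exception masked. In other words, these values differ from the "normal" one in leftover status flags rather than in the mask bits.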
In another case we saw a customer getting a VBScript error 6, overflow on this line:
x = 1 + 2.0
Again, confusion reigned. This time it was caused by a component that was using MMX/SSE2/SSE3 instructions.
I'm not against code altering the fpcw or mxcsr registers. But if you are a library component that is going to be used by arbitrary threads in some foreign host process then your documentation needs to have a big red warning sticker on it and you certainly shouldn't go around injecting yourself into other processes and changing the way the CPU behaves. That's just bad manners!