Efficient logging and exception handling in Windows and Web services : Part 1 – Raising exceptions, writing dumps

There are many articles discussing exception best practices, and they are almost all recommendable.  They usually discuss about code patterns; here I will discuss about application design patterns.

Reading logs: most often a journey in hell

Didn’t you spent hours or days investigating issues just because you had irrelevant, or incomplete information in the log files or your service, and you were desperate about how poor the logging of this tool or service was done? 

Did it happen that you had to plug a complex set of debuggers and tweak them with underground settings and sonic exception detection, just in order to reveal the hidden error that the application does not want to show you, hidden like an ancient treasure in the tomb of software errors?

Did it ever happen to you, that you were confused by cryptic logging information, trying to find your Rosetta stone towards those hieroglyphs of ineffable events, sniggering your ignorance about what sounds to you like another language?

We have all faced this.  And wait, let’s be honest, we have all written stuff like this.

Why should I write efficient logs and exceptions?

- Works on my machine

Writing efficient logging and exception handling is not the sexiest part of your application or service.  A good or a poor logging does not change what the application does, so why bothering?  However, it will be one of the most important features when someone will have to investigate an issue.

- I have high quality coding standards, my application is fully tested, why do I need logging and exception handling?

An application is almost never self-contained, it always lives inside an environment.  When the environment does not behave like your application expects, then an error will probably happen.  Your application will not so often diagnose itself what is different or wrong in the environment; but it can tell you what it expected and what it got.  If your application is executed in a more complex environment that the cocoon it was developed in, or if someone takes over your codebase and add features, probably he will need to analyze what he must correct.

Sometimes it will be very difficult to plug a debugger on the production and iterate through source code.  Sometimes identifying the pattern of the issue in order to reproduce it will be a challenge in itself.  Race conditions such as deadlocks are an example.

And yes, probably you will have to investigate your own bugs too.

So let’s gets started.  As a preliminary reading I advise the following documentation section: https://msdn.microsoft.com/en-us/library/ms229014

1. Raise an exception when you cannot decide what to do next

Think about your code like a company.  The individual functions and methods are the individual contributors, the classes are the managers, the libraries are the chief officers, and of course the Main() entry point is your CEO.

In all companies, employees sometimes face issue they are not prepared, or which are too big for them to take the responsibility of any decision.  Your pizza delivery boy is not able to repair his motorcycle if it does not want to start.  What will he do in such situation?  He will ask his manager.

Do the same in your code.  When you face an error, but you cannot decide what to do, because it will depend of things broader than the context of the method or function, throw an exception.  Do not be shy about this.  Your method should either do what it is supposed to do, or fail.  There is no problem in failing if there is no way to handle the problem in a relevant manner.

Example: you have written a function GetImageSize(string fileName) which computes the size of an image given a file name on the disk.  It may happen that you use this function in many situations in your application.  If this function fails (because the file is not an image, for instance), then do not try to do anything.  You miss context.  Within this function, you have no clue about who asks you this and what for.  So raise a NotAnImageException or a BadFormatException, and that’s all.

 if( header.FileFormat != ”JPEG” ) throw new NotAnImageException();

Tracing is even not necessary.  It will not hurt to log here, but as we will see, it will not help very much anyway.

2. Document your exception: action, object, expectation, result

Now you know that you have to raise an exception, let’s do it.  And we should do it properly.  Even if your pizza delivery boy is not able to handle the situation, you would like him some precisions about what is wrong about his motorcycle.  How many lonely IT support guys were told that “my computer does not work” for only information, while the customer was upset and expecting an immediate resolution?

Compare the following:

 // Exception #1: "I have a problem"
if( header.FileFormat != "JPEG" ) throw new NotAnImageException();

// Exception #2: "I have a problem with X"
if( header.FileFormat != "JPEG" )
{
    string error = string.Format("Cannot compute the image dimensions " +
        "of file {0}.", path)
    throw new NotAnImageException(error);
}

// Exception #3: "I have a problem with X, I got Z."
if( header.FileFormat != ”JPEG” )
{
    string error = string.Format("Cannot compute the image dimensions " +
        "of file {0}: the format {1} is not a recognized image.",
        path, header.Format);
    throw new NotAnImageException(error);
}

//  Exception #4: "I have a problem with X, I was expecting Y and I got Z."
if( header.FileFormat != "JPEG" )
{
    string error = string.Format("Cannot compute the image dimensions " +
        "of file {0}: the format {1} is not a recognized image. Ensure " +
        "the image is in Jpeg format.",
        path, header.Format);
    throw new NotAnImageException(error);
}

In all cases, assuming the exception is uncaught, the application will crash.  But assuming the error is captured and written in some event log or service log, the time spent to understand and correct the issue is much lower for the last case than for the first one.  There is no need of thousands of lines of log or error description.  Only one relevant line is necessary to locate, understand, and resolve the issue.

As much as possible, describe:

  • What action is failing, what you cannot do.
  • On what object you were trying to perform this action.
  • What you were expecting.
  • What you observed instead.    

Note: it is even better if those elements are properties of your exception class, but it is sometimes difficult to have properties for everything that can go wrong.

 public class NotAnImageException : Exception
{
     public string ImagePath { get; set; }
     public string ActualImageFormat { get; set; }
     // ... 
}

3. Gather knowledge and context

Knowledge and context need to be present together.  Here 1 + 1 = 3.  If one of the two is missing, the user of your application or service will loose time in understanding and locating the problem.  Whether you present an error message to a user, or you write to a log file, make sure that you combine the two ingredients or the user will not understand what is wrong in your application.

The Knowledge is what was wrong, as explained in Rule 2.

The Context is how it impacts the business.  The pizza delivery boy can have a problem with his motorcycle during his commute to work, or during a delivery to a customer.  The manager will not react to the problem in the same way.  In the first case, he will probably accept less orders; in the second, he will also have to call the customer because the pizza will not make it on time.

Compare the following lines of logs in a file:

1. Knowledge without context: “Access denied on file C:\application\rev34en\2011-12-05.txt” .  What it this file for? Does this error matter? Is it a fatal exception, or something that the system is supposed to manage?  Will the system retry to open the file automatically?

2. Context without knowledge: “Unexpected error when importing the list of new users, task canceled.” .  Great… if you have to investigate this, you will prepare yourself for a Kung-Fu debugging session and/or a mystic try and error sessions given what the spirits will inspire you this day.

3. Context and knowledge: “Unexpected error when importing the list of new users, task canceled: access denied on file C:\application\rev34en\2011-12-05.txt” .  By concatenating those two elements, the person supporting the application will know instantly what component in the service failed to function properly, how it impacts the overall application, and what is the source of the problem.

You need to put together Knowledge and Context when you handle the exception.  If you write to a log file, if you crash, if you present a message box to the user or wait for an input of any kind, this is the secret rule for applications which remain user-friendly even when they fail to perform what they are supposed to do.

4. Handle the exception at the proper level in the stack

Let’s go back to our pizza delivery company. You would expect from every employee to not report all issues to their managers. There are issues they are supposed to handle, it is part of their job. For instance, we expect from the manager of a local retail point to listen to an unhappy customer complaint and react accordingly.  You do not want the problem to bubble up to the regional director, who will miss context and not spend the proper time for this customer.  You do not want the delivery guy to manage the communication himself, because he is not trained for that.  But you have to answer this customer, or a bad reputation will be upon you.

On the other hand, if the pizzas are always cold when delivered because the regional director gave to the local retail point a too wide range of distribution, this issue will probably not be solved efficiently if the local retail point tries by himself and does not bubble up the problem.  But the regional director will have to do something about it.

It is the same with your application. When you are supposed to handle the error, do it.

Handling exception means that the application will opt for another code path because of the error (it can be failing/exiting, retry, waiting for some user input), or that you decide the error is acceptable and move on. But you will take a decision. If you don’t know what to do with the exception because you lack context, you are probably too low in the stack, so don’t catch it. If the exception causes a much larger process to fail than necessary, then probably you have caught it too high in the stack.

 try
{     UploadUserImageToWebProfile();
}
catch(NotAnImageException exception)
{
     Log("Cannot upload the image to the web profile of user {0}. " +
       "The web profile will show a default image instead.  " +
       "The source error is: {1}", userName, exception);
}

Of course we have enriched the context by adding which user it concerns.  Let’s give clues and pointers to the people who will read the log and try to reproduce the issue.

5. Do not swallow errors

 try
{
     ShowListOfBlogEntries();
}
catch {}

Some developers think that it is bad to expose error messages to the user or the service, and they adopt a policy of “catch all” or “continue on all errors”.  Well, except that in sensitive cases you should not present to potential attackers the internal logic of your application, this is a big myth.  No application went more stable or gave more customer satisfaction because it was hiding the problem.  Instead of that, the users of the service or the framework are confused and lose precious time trying to understand what is wrong.  As a customer, you hate dealing with a company which obviously has an issue on delivering you a service, but continues acting like you are too stupid to notice, denies the reality or ignores you, and does not present you a plan for recovery.  When you are waiting for your pizza, you appreciate to be contacted because the delivery guy had a problem. 

Even a end-user, who is no IT, needs to know what was wrong, how it impacted him, and how he can workaround.  Surfacing your failures is not a symptom of lack of professionalism, it is not even a symptom that your coding or architecture skills are not great; it is the evidence of a mature error management as long as it comes with proper communication and follow-up.

6. Do not catch errors if you can’t add any value

In your every day life, didn’t you ever face a problem, which resolution took much longer than necessary, because you were speaking to a middle man who was not fully aware of the details and could do nothing for you, except repeating what you said to the real person in charge? I am sure all of you have already names in mind.

 try
{
    OpenFile(path);
}
catch(FileNotFoundException exception)
{
    throw new Exception("File was not found.");
}

This code is not only useless.  It also hides the original exception which could contain more information (the path to the file).  However some developers sometimes think that an error can happen here or there, and they feel more comfortable if they log it or do something about it; and they forget to wonder if this is really useful.  Learn to resist to that temptation.  Do not catch an error if you can’t add anything to it.  If the FileNotFoundException is sufficient, do not add anything to it, you will be this annoying and useless middle man.

There is a way to be a useful middle man though, as we will see in rule 7.

7. Wrap the low-level exception, but do not break the original stack

We all know, our bosses are always in hurry.  They do not have the time to be interested by what we do.  They only want simple facts.  They do not understand the reality of the little problems of our little condition, we humble mortals.  Sometimes you put less efforts explaining what you do to your child than to your manager – reason is, your child is really deeply interested because you are his proudness.  Your manager has little time to hear your gibberish and wants his back, and other parts of his body, covered before anything else.  The manager of the pizza company does not care if you ran out of fuel, or if the ignition fails, or if you had a crash.  He will not solve the problem.  He will only evaluate the importance of the issue, take the decisions to ensure his business will continue to run.  The details of why the motorcycle is failing will only interest the mechanic.

For this purpose, it is sometimes interesting to translate the Exception if the native will not be meaningful when exposed to the caller.  But someone should always be able to look into the first problem, and for this .NET proposes Exception encapsulation.  The original Exception stays available through the InnerException property.

 public string ReadAccount(string accountName)
{
    try
    {
        var file = File.OpenRead(accountName + ".txt");
        return file.ReadToEnd();
    }
    catch(FileNotFoundException exception)
    {
        throw new UserNotFoundException("The account " + accountName +
            " does not exist.",
            exception);  // Embed the original FileNotFoundException
                         //so the issue can be debugged
    }
}

In that case, the caller of the ReadAccount() wants to access an account.  The caller does not know that you have chosen to implement your account management over the file system.  You could have used a database.  He will probably be confused by a raw FileNotFoundException.  By throwing a UserNotFoundException, you give him a better level of information; but you do not do it at the expense of hiding the initial problem.

Knowing what the the inner exception was will definitely help you when you will have to support your application.

8. Crash properly

Following “rule 5 – Do not swallow exceptions”, some developers assume that the application should always be up, thus it is better to catch everything and continue the process, they value it as a reliability pattern in the application.  And it may sound contra-intuitive to stand that crashing an application can enhance its reliability.

Well, it depends and most important this is incomplete.  What is really valuable is how easy or painful it is to investigate application crashes.  Choosing to continue the execution or terminate the program is a less priority concern.

There are situations where it is better to not continue the application and crash.  Here can be the reasons:

  • The application is in a state where further operations can damage the data or the internal logic.  It is necessary to perform a “clean crash” rather than a “dirty continue”.  A crash allows the application the be restarted from a well-known point.
  • The ages where crashing meant a obscure message box (“Access memory violation at 0x00054357”) with a useless stack dump are gone.  There are cases where you cannot attach a debugger on the application, because you ship your application to the customer, or because you do not have access to production environment.   Hopefully stack dumps are easy to use and highly valuable nowadays.
  • If your application is a Windows Service on Windows 2008, the uncaught exceptions will be logged in the Windows Event with the stack trace, and mini-dumps will be created automatically with Windows Error Reporting.   You may even configure your Windows Service to be restarted automatically.
  • .Net provides the API Environment.FailFastwhich also generates a stack dump and exception log.
  • Getting more errors during the development and test phase enhances the general quality of the application.  If the errors are always caught, or worse, silently ignored, they may stay unknown, or at the bottom of a “known warnings” list, that nobody had the time to investigate.  If issues are reported and fixing them is easy, then you will end up with a higher quality application, without changing the skills of the development team.  If errors are ignored or require painful investigations, then only the highest priority issues will be solved, and the product will ship with more bugs. 
  • In large production and distributed environments, the failing node will be identified faster if the application is down.  This has pros and cons and it will probably be either an advantage or a disadvantage depending on your company processes and teams.

However:

  • Think about degrading before failing.  Crashing should be the last resort solution, a reaction to an event that your application definitely cannot handle.  If some feature is not vital for your business, it is not worth it that it brings your entire application down.  However, it has to be controlled in the code logic.  “Continue on all errors” is not an acceptable way to implement degraded mode, because the behavior of the application is not predictable.
  • “The configuration file is unreadable”, “the database does not exist”, “the disk is full”, can be valid scenarios for crashing your application.  But your application may also assume to continue with default factory configuration, to create the missing database, and to serve read-only queries even if the disk is full.  This really depends on what your application requirements and functionalities.
  • Crashing may be the end of your application instance; but it must not be the end of the lifecycle.  The business will have to understand the impact of the problem.  The service engineering team will have to understand how they can workaround and the development team will have to fix the issue.  This must be part of application error handling design as well.

Windows Error Reporting is a nice feature provided on Windows 2008.  By default, Windows Error Reporting tries to upload the crash info to Microsoft and removes it when done.  You may tweak the registry keys to keep the crash dump on disk in all cases for your own investigation.

 

https://msdn.microsoft.com/en-us/library/bb787181.aspx

9. Write dumps

The Win32 API MiniDumpWriteDump allows you to write a memory dump and continue the application.  It is a very useful pattern to enhance problem resolution, in completion to the logs.  This is obviously an expensive process, so it should be used very seldom, but it is also very valuable.

Here is a way to do it in C: https://blogs.msdn.com/b/joshpoley/archive/2008/05/19/prolific-usage-of-minidumpwritedump-automating-crash-dump-analysis-part-0.aspx

Here is an example in C#.  You will find many examples on the Web.  Once the dump is generated, you just have to open it with Visual Studio, and have the source code available.

It is important that you call this method within a catch block.  Doing this, the exception will be embedded in the dump, and the instruction pointer will be pointing to the line in error, not the line which created the dump.

 using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

namespace Sample
{
    ///  /// Creates a mini dump of the current process /// 
    internal static class MiniDump
    {
        public static void CreateDump(string path, MiniDumpType miniDumpType)
        {
            var exceptionInfo = new MiniDumpExceptionInfo
                                    {
                                        ThreadId = GetCurrentThreadId(),
                                        ExceptionPointers = Marshal.GetExceptionPointers(),
                                        ClientPointers = false // false because own process
                                    };
            Process process = Process.GetCurrentProcess();

            using (var stream = new FileStream(path, FileMode.Create))
            {
                Debug.Assert(stream.SafeFileHandle != null);

                // The problem Marshal.GetExceptionPointers can return null on x86 machines due to differences
                // in low-level exception handling.
                // Then passing a MiniDumpExceptionInfo structure with a NULL ExceptionPointers members causes an
                // access violation. So we only pass this structure if we got a valid ExceptionPointers member.
                // It will probably result that x86 machines will see the instruction pointer to the MiniDumpWriteDump
                // line and not the exception itself.
                IntPtr exceptionInfoPtr = Marshal.AllocHGlobal(Marshal.SizeOf(exceptionInfo));
                Marshal.StructureToPtr(exceptionInfo, exceptionInfoPtr, false);

                try
                {
                    MiniDumpWriteDump(
                                      process.Handle,
                                      process.Id,
                                      stream.SafeFileHandle.DangerousGetHandle(),
                                      miniDumpType,
                                      exceptionInfo.ExceptionPointers == IntPtr.Zero ? IntPtr.Zero : exceptionInfoPtr,
                                      IntPtr.Zero,
                                      IntPtr.Zero);
                }
                catch (Exception exception)
                {
                    Trace("Cannot generate the mini-dump because of exception " + exception);
                }

                Marshal.FreeHGlobal(exceptionInfoPtr);
            }
        }

        [DllImport("kernel32.dll")]
        private static extern int GetCurrentThreadId();

        [DllImport("DbgHelp.dll", SetLastError = true, CallingConvention = CallingConvention.Winapi)]
        private static extern Boolean MiniDumpWriteDump(
            IntPtr hProcess,
            Int32 processId,
            IntPtr fileHandle,
            MiniDumpType dumpType,
            IntPtr excepInfo,
            IntPtr userInfo,
            IntPtr extInfo);

        [StructLayout(LayoutKind.Sequential, Pack = 4)]
        private struct MiniDumpExceptionInfo
        {
            public Int32 ThreadId;
            public IntPtr ExceptionPointers;

            [MarshalAs(UnmanagedType.Bool)]
            public bool ClientPointers;
        }
    }

    [Flags]
    internal enum MiniDumpType
    {
        Normal = 0x00000000,
        WithDataSegs = 0x00000001,
        WithFullMemory = 0x00000002,
        WithHandleData = 0x00000004,
        FilterMemory = 0x00000008,
        ScanMemory = 0x00000010,
        WithUnloadedModules = 0x00000020,
        WithIndirectlyReferencedMemory = 0x00000040,
        FilterModulePaths = 0x00000080,
        WithProcessThreadData = 0x00000100,
        WithPrivateReadWriteMemory = 0x00000200,
        WithoutOptionalData = 0x00000400,
        WithFullMemoryInfo = 0x00000800,
        WithThreadInfo = 0x00001000,
        WithCodeSegs = 0x00002000,
        WithoutAuxiliaryState = 0x00004000,
        WithFullAuxiliaryState = 0x00008000
    }
}