No MSG For You!

Whenever I find myself repeating the same message over and over again, I have to ask why I haven't blogged it yet. This is one of those cases. :)

I've seen quite a few issues over the years with MSG files. The issues range from "it takes too long to write properties" to "the properties on the MSG don't match what I see in the store" to "I get such and such error trying to copy this message to an MSG". The root cause of most of these issues is one of expectations. People are trying to use MSG files as an archival format, and that's not their intended purpose. If you really want to archive mail, you should develop your own format for persisting the data. You'll gain advantages in versatility, speed, and fidelity.

To understand why I make this recommendation, we first need to realize that not all messages can be copied over to the MSG format. This is noted at the end of https://support.microsoft.com/kb/171907. Since MAPI is transacted, the underlying MSG file has to be opened with STGM_TRANSACTED, meaning nothing is committed to disk until SaveChanges is called on the message. Couple that with the quirk in the MAPI specification that pretty much forces you to create a new transaction each time you add a recipient or attachment and you quickly run into the limit on open root storage files noted in https://support.microsoft.com/kb/163202. This OS-imposed limit on open root storage objects isn't likely to ever change, as it's an artifact of the implementation. Likewise, the need for a new transaction for each recipient and attachment won't ever change. Neither the MSG format nor structured storage has seen active development in years. This limit is going to be hit whenever a message has a large number of recipients or attachments, or when there is a deep level of embedded messages.

The next issue is speed. Writing a message to MSG can be quite slow. There is a huge performance penalty working with a structured storage file in STGM_TRANSACTED mode. And this penalty is multiplied by the number of open root storage objects. So not only do you run into a limit trying to add all those recipients and attachments, but each subsequent recipient and attachment is that much slower to add. For instance, I recently worked on an issue where the repro required that I have 5000 recipients on a message that I then copied over to MSG format. It took over an hour to write the file. And none of that delay was actually in the MAPI code - it was all at the COM level.

Next - not every MSG file you can write can be opened by Outlook. Over the years folks have tried various tricks to squeeze performance out of the code writing their MSG files. In many cases, they succeeded in writing the file faster, or allowing more recipients and attachments on the message. But the downside was they wrote a file that Outlook didn't know how to open! One variation of this issue surfaced with Outlook 2007. Given the performance problems working with MSG files, in Outlook 2007 we decided to check the number of recipients and attachments when opening the file. If either was over 2048, then we refused to open the file at all. The main reasoning for this was a number of corrupt MSG files that had surfaced in the wild with astronomical counts of recipients and attachments - on the order of millions. But a side effect was to block Outlook 2007 from opening some MSG files that Outlook 2003 could open. We've had some customers complain about this one and a fix is in the works. I'll report here when it's done. However, that fix will only cover this one variation of the problem. It won't fix the large number of other scenarios out there.

That covers the mechanics of reading and writing to MSG. Now we discuss fidelity. This isn't about whether the MSG format is out partying with the EML format, but rather how faithfully the MSG represents the source message. This is where MSG being a MAPI based format gets you in trouble. For instance, in archival scenarios, especially when the archive is used for legal discovery, properties such as PR_LAST_MODIFICATION_TIME and PR_LAST_MODIFIER_NAME are very important as they indicate who modified the message and when. But since MSG is itself a MAPI message and as such has to obey all the rules of MAPI, those properties will only reflect the time the MSG was written and the name of the account that wrote it, neither of which is likely to match the original message. This problem can extend to the body properties as well: no matter how you do it, you're likely to end up converting the body from one format to another when storing it in the MSG file. And every conversion carries with it the possibility of a loss of data. Perhaps some line spacing is subtly changed, or font choices aren't preserved exactly. In some messages, these subtle textual differences could have huge semantic ramifications.

Fidelity also figures in when discussing Unicode. In a large organization, messages will be written in a variety of languages. The only way to preserve these messages into MSG format without converting half the characters to question marks or boxes is to use the Unicode format. Unfortunately, this format is only understood by Outlook 2003 and Outlook 2007. Exchange's MAPI doesn't understand this format at all. So if you're relying on MSG files to save out Unicode data, your solution is stuck using Outlook's implementation of MAPI for all processing of the archive. This severely hampers your ability to build a server based application.

Workaround

So, we've got messages that cannot be copied to the archive, a painfully slow API, messages that cannot be opened once archived, and a format that's not capable of representing the actual message being archived. Clearly, these are not the attributes we want in an archive.

Fortunately, the workaround is simple: don't use MSG to archive messages. Instead, develop your own file format to preserve the important properties on a message. Here's one approach using the file system and XML files:

  • Each message, not counting embedded messages or attachments, is stored as a single XML file, the elements of which map back to the various properties from the source message.
  • Properties to archive are chosen by using GetPropList. Additionally, the various body properties should be read individually.
  • The recipient collection is also stored in this XML file.
  • Information about the attachment collection is stored in this XML file, but the actual attachments are stored in separate files. If the attachment is an embedded message, it is stored in an XML file just like the parent message.
  • Embedded message XML files may be marked in the file name or by an attribute to distinguish them from the set of parent messages. Alternatively, all attachments could live in a subfolder.
  • All file names would be autogenerated to avoid conflicts.
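
To make that layout a little more concrete, here's a minimal sketch in Python. Everything in it is an assumption made for illustration: the element and attribute names, the dictionary shape used to stand in for a message, and the helper names new_file_name and write_message are all hypothetical. A real archiver would pull its data from MAPI (GetPropList, the recipient table, the attachment table) rather than from a dictionary.

```python
import uuid
import xml.etree.ElementTree as ET
from pathlib import Path


def new_file_name(suffix: str) -> str:
    """Autogenerate a file name that will not collide with existing files."""
    return uuid.uuid4().hex + suffix


def write_message(folder: Path, message: dict, embedded: bool = False) -> str:
    """Write one message as an XML file and return the generated file name.

    `message` is a hypothetical stand-in for a MAPI message:
    {"properties": {...}, "recipients": [{...}], "attachments": [{...}]}
    """
    root = ET.Element("message", {"embedded": "true" if embedded else "false"})

    # Message properties become child elements of <properties>.
    props = ET.SubElement(root, "properties")
    for name, value in message.get("properties", {}).items():
        ET.SubElement(props, "property", {"name": name}).text = str(value)

    # The recipient collection lives in the same XML file.
    recips = ET.SubElement(root, "recipients")
    for recipient in message.get("recipients", []):
        row = ET.SubElement(recips, "recipient")
        for name, value in recipient.items():
            ET.SubElement(row, "property", {"name": name}).text = str(value)

    # Attachment metadata lives here; the content goes to separate files.
    attachments = ET.SubElement(root, "attachments")
    for attachment in message.get("attachments", []):
        if "message" in attachment:
            # Embedded message: its own XML file, marked by an attribute.
            stored = write_message(folder, attachment["message"], embedded=True)
        else:
            stored = new_file_name(".bin")
            (folder / stored).write_bytes(attachment["data"])
        ET.SubElement(attachments, "attachment",
                      {"name": attachment.get("name", ""), "file": stored})

    file_name = new_file_name(".xml")
    ET.ElementTree(root).write(str(folder / file_name), encoding="utf-8",
                               xml_declaration=True)
    return file_name
```

Calling write_message on a message with one ordinary attachment and one embedded message produces three files: the parent XML, a second XML file marked embedded="true", and one binary file holding the attachment data.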

The only really hard part about this format is determining how to store each of the possible MAPI property types. However, when we look closely, we see there are only 13 types to consider, most of which can be represented as just a simple number or string. Even binary data is easy to store if it's first converted to hex. Multivalued properties, large binary and string properties, and named properties all add additional wrinkles, but are easily addressed. I figure a junior programmer could complete a reasonable first draft of the required code to both read and write a MAPI message to and from XML in an afternoon. In fact, most of the code for writing the XML format is already present in MFCMAPI - check out dumpstore.cpp.
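
As a rough illustration of the type handling, here's a minimal Python sketch that turns a property tag and value into an XML element. The property type constants are the standard MAPI values, but the element layout and the helper names encode_value and property_to_xml are assumptions for this example, and only a handful of types are handled. This is not how dumpstore.cpp does it; it's just one way the idea could look.

```python
import xml.etree.ElementTree as ET

# Standard MAPI property type values (low 16 bits of a property tag).
PT_LONG = 0x0003
PT_BOOLEAN = 0x000B
PT_STRING8 = 0x001E
PT_UNICODE = 0x001F
PT_BINARY = 0x0102
MV_FLAG = 0x1000  # set on the type when the property is multivalued


def encode_value(prop_type: int, value) -> str:
    """Render a single value as text, based on its MAPI property type."""
    if prop_type == PT_BINARY:
        return bytes(value).hex().upper()  # binary is stored as hex
    if prop_type == PT_BOOLEAN:
        return "true" if value else "false"
    # Numbers and strings can be written more or less as they are.
    return str(value)


def property_to_xml(prop_tag: int, value) -> ET.Element:
    """Build a <property> element for one property (hypothetical layout)."""
    prop_type = prop_tag & 0xFFFF
    elem = ET.Element("property", {"tag": f"0x{prop_tag:08X}"})
    if prop_type & MV_FLAG:
        # Multivalued: one <value> child per entry.
        base_type = prop_type & ~MV_FLAG
        for item in value:
            ET.SubElement(elem, "value").text = encode_value(base_type, item)
    else:
        elem.text = encode_value(prop_type, value)
    return elem
```

For example, property_to_xml(0x0FFF0102, b"\x00\xab") (the tag for PR_ENTRYID) yields a property element whose text is 00AB. ElementTree escapes markup characters in string values when the element is serialized, so subjects containing & or < need no special handling here.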

Objections

Hopefully I've convinced most of you not to use the MSG file format for archiving. Some of you might not be convinced though. You might think you've got that one special case that requires you to use MSG. I don't believe such a case exists. I've anticipated a few of the common objections:

  • "I have to use MSG for legal discovery." - See the above section on fidelity. MSG is a poor copy, totally unsuited for answering legal questions. You can do much better on your own.
  • "It's too hard/complex to write my own file format" - Good thing I just spec'd one out for you. :) If you're thinking this you probably haven't tried to sit down and do it.
  • "How would I know which properties to persist?" - That's what GetPropList gets you. It's exactly the same way CopyTo works to determine which properties to copy in the MSG code, with the advantage that you can archive everything now, not just what MSG is capable of storing.
  • "XML is too bloated" - You don't have to use XML - you could use columns in a SQL table, or any other storage medium. It's your choice.
  • "Reading, writing and parsing text is too slow" - Text processing can be quite fast if you approach it correctly. And anything is better than the speed of MSG.
  • "I shouldn't have to fix this - Microsoft should fix it" - Fair enough, but on the other hand, we've never encouraged anyone to use MSG files for archiving. It's just not what they're intended for. Additionally, consider how such a "fix" would be deployed. You'd be requiring all of your customers to install a new build of Outlook or Exchange, and they'd never be able to use an older build. You'd never get this sort of approval from most customers.
  • "I need to be able to open the messages in Outlook" - Most archival solutions include some sort of client side component, even if it's just a web page. If yours does, then from your client side component you can create a message on the fly, read the properties from your archive, and populate it. This would be no slower than opening an MSG file, and in many cases would actually be faster. Additionally, you'd be free to create viewers for your archive that do not depend on Outlook!
  • "I need MSG so I can index the data" - You must be using an IFilter extension that supports MSG. If you used XML, all your data would be text to begin with, so the native filters would already work. Plus, if you want more granular search, it would be easy to write an IFilter for your own format.
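
The reading side, which the last two objections lean on, can be sketched just as briefly. The Python below assumes a hypothetical archive layout in which each property element carries its property tag as a hex attribute. The sample document and the helper names load_properties and contains_text are made up for illustration. A client component would hand the resulting values to MAPI to populate a new message, and an indexer would only need the text.

```python
import xml.etree.ElementTree as ET

# Standard MAPI property type values (low 16 bits of a property tag).
PT_LONG = 0x0003
PT_BINARY = 0x0102

# Hypothetical archived message, in an assumed layout.
SAMPLE = """<message>
  <properties>
    <property tag="0x0037001F">Quarterly report</property>
    <property tag="0x0E080003">1024</property>
    <property tag="0x0FFF0102">00AB</property>
  </properties>
</message>"""


def load_properties(xml_text: str) -> dict:
    """Parse archived XML back into {property tag: value}."""
    root = ET.fromstring(xml_text)
    props = {}
    for elem in root.iterfind("./properties/property"):
        tag = int(elem.get("tag"), 16)
        text = elem.text or ""
        prop_type = tag & 0xFFFF
        if prop_type == PT_BINARY:
            props[tag] = bytes.fromhex(text)  # undo the hex encoding
        elif prop_type == PT_LONG:
            props[tag] = int(text)
        else:
            props[tag] = text  # treat everything else as a string here
    return props


def contains_text(xml_text: str, needle: str) -> bool:
    """Naive case-insensitive search over the string-valued properties."""
    needle = needle.casefold()
    return any(isinstance(value, str) and needle in value.casefold()
               for value in load_properties(xml_text).values())
```

With the sample above, load_properties(SAMPLE) maps 0x0037001F (PR_SUBJECT_W) to the string "Quarterly report", and contains_text(SAMPLE, "quarterly") finds it without any MSG-aware filter involved.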

The final objection is my favorite: "But I've never had a problem with MSG files" - Bully for you! This article isn't addressed to you then. However, I had one customer who also made this claim when I found they were using MSG to archive messages. Not quite believing them though, I outlined each of the problems listed above. It turns out they had encountered or were encountering every single one of them. They just hadn't connected the problems back to their choice to use MSG to archive their data.