Large messages in BizTalk 2004, what's the deal?
The large message support story in BizTalk Server 2004 is a complex one, mainly because the definition of large message varies significantly. This, in turn, is complicated by the fact that our customers expect everything to work with all of the possible variations of "large message". So, how large a message can BizTalk Server 2004 really handle?
The real answer is "it depends", and there are about as many considerations as there are large message cases, so below I give some rules of thumb to go by. This post is an attempt to characterize the major classes of scenarios we have seen that require transfer of significant amounts of data and, for each, which features in the core engine are well suited to manage that data and, more importantly, which are not.
The major classes of large message that we have seen are the following:
- Large flat file document, containing many independent records. The records themselves are small, but they were batch delivered occasionally, and the documents need to be processed. Sizes here vary from 100k to 100MB.
- Large flat file document, wrapped in a single CDATA section node in an XML document in order to carry the data through the system. This has typically come in the form of an exposed web service that carries a large flat file document which needs to be extracted and processed. Sizes of the flat data are similar to case 1, but I've only seen them go to about 1MB.
- Large XML document that is effectively the structural equivalent of the flat file, with hundreds of thousands to millions of "rows" that were batched together. A variant of this case is data coming from the BizTalk SQL Adapter, where the execution pulls back a number of records that are converted to XML for internal processing.
- EDI - style interchanges where the file or data contains medium size documents (10K - 100K) that are intended to be processed independently or in aggregate.
- Large flat document with a header and trailer at the start and end of the file, which is really considered one report but has potentially millions of records. Each record could be processed separately from the others, but the entire sequence must be processed in order to complete properly.
Of the above cases, the hardest to deal with in BizTalk in general is case 2. The problem here is that our internal processing of the data is based on .NET and, in particular, the XmlReader class. For CDATA, and text in general, this class does not give a mechanism to access the data in a form that is friendly to large message processing. Basically, we get to this node and effectively have to ask for the entire string to be materialized into memory in order to process it. This happens in all of the native BizTalk Disassembler and Assembler components (xml, flatfile and btf) because the data is streamed through the components via the XmlReader interface. If possible, it is best to avoid this style of XML because of what materializing that single string implies. This is aggravated in Web Services scenarios, because the data is materialized into a complete .NET object and then de-serialized into a message structure before processing by BizTalk, leaving at a minimum three copies of the data in process memory. If you must work with this data, the best we can recommend is not to send more than 1MB of it into BizTalk without some form of custom processing or large memory machines. Custom processing would be difficult here as well, because most mechanisms for dealing with the XML data will load the entire string into memory, defeating the purpose of streaming the data.
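To make the difference between case 2 and case 3 concrete, here is a small Python sketch. It is only an analogy: Python's ElementTree stands in for the XmlReader-based pipeline, and the record layout and element names (Batch, Row) are made up for illustration. It stream-parses the same rows once as individual elements and once wrapped in a single CDATA section, and reports the largest single string the parser had to hand back.

```python
import io
import xml.etree.ElementTree as ET


def largest_text_node(xml_bytes: bytes) -> int:
    """Stream-parse the document and return the length of the longest text node seen."""
    longest = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        longest = max(longest, len(elem.text or ""))
        elem.clear()  # drop the element's content once it has been inspected
    return longest


# Hypothetical batch: 10,000 small comma-separated records of 15 characters each.
rows = [f"{i:06d},WIDGET,3" for i in range(10_000)]

# Case 3 style: one small element per record.
as_rows = ("<Batch>" + "".join(f"<Row>{r}</Row>" for r in rows) + "</Batch>").encode()

# Case 2 style: the whole flat file wrapped in a single CDATA section.
as_cdata = ("<Batch><![CDATA[" + "\n".join(rows) + "]]></Batch>").encode()

print("largest string, row per element:", largest_text_node(as_rows))
print("largest string, CDATA wrapped:  ", largest_text_node(as_cdata))
```

With one element per record, the biggest string is a single record. With the CDATA wrapper, the biggest string is the entire payload, no matter how carefully the rest of the pipeline streams.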
What is required of BizTalk when processing these documents pretty much breaks into two flavors, pass-thru routing (potentially with tracking) or mapping along with the routing.
The really easy case is pure routing, so let me start with that. The desire here tends to be to use BizTalk as a pure message routing infrastructure and do minimal processing on the message itself. What may be required is to promote several key fields that are important for routing purposes, but nothing more than that. In this scenario, everything but case 2 works well, because as we use the XmlReader interface internally, we can "stream" the processing of the nodes into the database without loading any single part into memory. For pure pass-thru cases with property promotion, we have tested messages of up to 1GB, getting them into and out of the processing server. This takes not seconds or minutes, but it can be done. I believe we saw something like 4 hours for the 1GB case.
The major consideration for time in this case is the chunk size we use to fragment the data into the database. The default size is 100KB, and it is controlled in the group settings for BizTalk. In the 1GB case, that means we will take 10,000 round trips to the database to store all of the data incoming on the stream. Keeping a transaction open this long will cause internal processing issues, so raising this value to somewhere between 1MB and 10MB allows the data to be chunked into the db faster, improving the overall execution time. This requires a larger memory footprint, but the machines required for this kind of processing tend to be high memory machines (multi-GB). In order to keep processing consistent, we use (when necessary) a distributed transaction to lock all of the resources. We have seen, however, that at around 300K to 400K messages per submitted batch we keep so many locks open at a time that SQL Server 2000 sometimes gives us "out of locks" errors. We've also tried this on SQL 2000 64bit, and that has worked much better for that large a number of documents in a single submission.
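The round-trip arithmetic is easy to check. The sketch below assumes one database round trip per fragment, as described above, and uses decimal units (1KB = 1,000 bytes) so that the numbers line up with the 10,000 figure. The function name is made up for illustration and is not a BizTalk setting.

```python
KB = 1_000
MB = 1_000 * KB
GB = 1_000 * MB


def round_trips(message_bytes: int, fragment_bytes: int) -> int:
    """Fragments needed to store a message, assuming one round trip per fragment."""
    return -(-message_bytes // fragment_bytes)  # integer ceiling division


for label, fragment in [("100KB", 100 * KB), ("1MB", MB), ("10MB", 10 * MB)]:
    print(f"{label:>5} fragments: {round_trips(GB, fragment):>6} round trips for a 1GB message")
```

Going from the default to 10MB fragments cuts the number of round trips by a factor of 100, at the cost of holding up to 10MB per fragment in memory.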
Mapping, a much harder case to support, is unfortunately what people most frequently do with our product, sometimes with disastrous effect. Given the statement "we can handle large messages", the first thing people try to do is process messages through maps. Unfortunately, this is still a huge problem for BizTalk, primarily because we lack a good means of doing large message transformations. The issue is that although we pass a stream object to the .NET XslTransform class, the stream is loaded into an XPathDocument that XslTransform then processes. As the transform executes, the XPathDocument caches information about the nodes of the XML along with the data itself to allow faster access, but this causes severe performance penalties because of the redundant data that sits in the objects. This is where 90%+ of the Out Of Memory (OOM) exceptions that cause orchestrations and receive/send ports to fail come from. The blow-up factor can reach 10x or more, easily consuming all of the memory on the machine. The only recommendation here is to see whether the reason for mapping can be accomplished some other way, because even a 10MB document (together with JITted product and user code assemblies and other messages flowing through the process) may be enough to blow the process up to 200-500MB in memory.
One important consideration here is that if you are dealing with non-XML messages (usually called flat files), you have even more to worry about. Flat files seem to have evolved at a time when you paid by the character to transmit data, so they were designed to be as compact as possible in order to minimize cost. XML explicitly stated this as a non-goal, with readability as a much higher priority. As BizTalk converts the business data (not attachments) to XML for internal processing, you have gone from a space-efficient format to one that, well, isn't. What that effectively comes to is that a 1K flat file message becomes 4K or 5K in XML, simply by adding the tags. For example, fields in flat files typically aren't named, so a 3 alone in its particular position would define, say, a quantity. If you use elements, that becomes <Quantity>3</Quantity>. Those tags take up space as well, and between the compression from missing optional elements and the XML expansion, we typically see a 3-4x increase in message size. So your flat files don't need to be very big, given reasonably detailed names, to make a huge stealth XML document. Because it never materializes on disk, most people never see it or think about it as such, but it blows up in memory and causes the system to consume more memory than you think it should.
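Here is a rough Python sketch of that expansion. The fixed-width layout, field names and the LineItem tag are all hypothetical, and this is not how the flat file disassembler is implemented. It only shows how much the tags add to one small record.

```python
from xml.sax.saxutils import escape

# Hypothetical fixed-width layout: (field name, width in characters).
LAYOUT = [("OrderId", 6), ("Sku", 8), ("Quantity", 3), ("UnitPrice", 7)]


def parse_fixed_width(line: str) -> dict:
    """Slice one positional record into named fields, trimming the padding."""
    fields = {}
    pos = 0
    for name, width in LAYOUT:
        fields[name] = line[pos:pos + width].strip()
        pos += width
    return fields


def to_xml(fields: dict, record_tag: str = "LineItem") -> str:
    """Wrap every field in an element named after it."""
    inner = "".join(f"<{name}>{escape(value)}</{name}>" for name, value in fields.items())
    return f"<{record_tag}>{inner}</{record_tag}>"


flat = "000042WIDGET01  3  19.99"  # 24 characters, no field names anywhere
xml = to_xml(parse_fixed_width(flat))

print(xml)
print(f"{len(flat)} bytes flat -> {len(xml)} bytes as XML ({len(xml) / len(flat):.1f}x)")
```

The exact ratio depends on how long the element names are relative to the values, which is why descriptive names on short fields are the worst case.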
We've seen maps used in several distinct cases:
Extract/Set properties necessary for performing business logic.
In this case, the best approach is to use distinguished fields and/or property promotion in your process. Orchestration does not load the data of the message stream unless required, and that will happen with map execution. If instead the field you want to read or update is marked as a distinguished field, orchestration will "grab" the right value without loading the whole message into memory, and update the value during persistence, also without loading the message. This is a powerful means to manipulate key fields without loading the whole document into memory.
The condition is that each record is independent, but the requirement is that the messages be processed in order, so a map is used. This tends to be about half of the cases where the large document styles 1, 3 and 5 have been used. In this case, the best approach is to couple the in-order capabilities of orchestration with the ability to disassemble the flat file into records. What happens is that the flat file disassembler takes the records one by one and publishes a separate message for each one. Each message is received into an orchestration that maps each message separately and sends it out using DeliveryNotification = true, where, assuming the destination is a file, it is a dynamic send where the filename is set and the mode is Append. This will complete processing of all of the messages and maintain their order.
The sticking point here is how the orchestration finishes, because a convoy subscription is required to correlate all of the messages together. A solution is to use a modified FixMsg component. On every call to Execute, it generates a GUID and promotes it on the message. In the stream processing, it prepends a "FixMSGStart" string to the stream and appends a "FixMSGEnd" string to it. What this does is attach a starting message and an ending message to the stream, allowing the orchestration to have definite message types that define the start and end of a particular file. This avoids zombie instances.
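The stream-wrapping idea can be sketched in a few lines. This is a Python illustration only, not the FixMsg pipeline component itself: the function name and chunk size are assumptions, and how the real component delimits the markers from the data is not shown here.

```python
import io
import uuid
from typing import BinaryIO, Iterator, Tuple

START_MARKER = b"FixMSGStart"
END_MARKER = b"FixMSGEnd"


def wrap_stream(body: BinaryIO, chunk_size: int = 64 * 1024) -> Tuple[str, Iterator[bytes]]:
    """Return a fresh batch id plus the body bracketed by start and end markers.

    The body is read lazily in chunks, so it is never held in memory as a whole.
    """
    batch_id = str(uuid.uuid4())  # stands in for the GUID promoted on the message

    def chunks() -> Iterator[bytes]:
        yield START_MARKER
        while True:
            chunk = body.read(chunk_size)
            if not chunk:
                break
            yield chunk
        yield END_MARKER

    return batch_id, chunks()


batch_id, wrapped = wrap_stream(io.BytesIO(b"rec1\nrec2\n"), chunk_size=4)
print(batch_id)
print(b"".join(wrapped))
```

Every message disassembled from the wrapped stream can carry the same batch id, which gives the convoy something to correlate on, and the end marker gives it an unambiguous signal that the file is complete.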
The simplest form of this is the ability to accumulate some results across all of the records. For example, across a batch of line items, accumulate a total price. If the requirements are also simple, this can be an augmentation of the previous case (in-order processing of independent records), where a custom variable is used to accumulate the value required. Again, distinguished fields or property promotion work here to extract the field in a memory-efficient manner.
This is where more complicated logic falls: combinations of the above in a single transform that cannot easily be factored into separate processing. The guidance here is pretty grim, as it comes down to the fundamental issue stated above for map execution. For this, the only thing we can really say is "good luck", because as we can't control the amount of caching that is done by .NET, we don't have any control over the growth of memory. At this point, the best we can say is that messages above 10MB of XML data (this may translate to smaller flat files -- remember that the XML tags around each field count toward the total message size) are unsupported, as the memory requirements are not easily predictable and hence could cause OOM exceptions even when, on the surface, they shouldn't.
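As an illustration of accumulating across records without a map, here is a minimal Python sketch. The comma-separated record layout and the function names are made up; in an orchestration the running total would live in a variable that is updated as each record message arrives.

```python
import io
from decimal import Decimal
from typing import Iterable, Iterator, TextIO


def read_records(stream: TextIO) -> Iterator[str]:
    """Yield one record at a time so the whole batch is never in memory."""
    for line in stream:
        line = line.rstrip("\n")
        if line:
            yield line


def line_total(record: str) -> Decimal:
    """Price of one line item in a hypothetical 'sku,quantity,unit_price' layout."""
    _sku, quantity, unit_price = record.split(",")
    return int(quantity) * Decimal(unit_price)


def batch_total(records: Iterable[str]) -> Decimal:
    """Accumulate a running total, touching only the fields it needs."""
    total = Decimal("0")
    for record in records:
        total += line_total(record)
    return total


batch = io.StringIO("A100,3,19.99\nB200,1,5.00\n")
print(batch_total(read_records(batch)))
```

Memory use stays flat regardless of how many records are in the batch, because only the current record and the running total are ever held.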
It is difficult to describe the above in a single support statement, because of the considerations involved. When asked if we support large messages, the answer always comes to "it depends", and the above should give an idea on what the issues and potholes are.
- If you are looking at a total size of 1MB or higher and require mapping, investigate the style of mapping required to see if it can be done without using the mapping tool. If orchestration coupled with in-order processing will accomplish your needs, go with that, taking care to avoid races that force the instance to complete prematurely (so-called "zombied" instances, as Lee earlier discussed).
- Mapping works up to about 10MB, at which point the complexity of the map or the number of nodes can cause significant memory inflation, 10x or more. Beyond that, given the current architecture, it becomes much harder to get working.
- Without mapping, you can still get burned by flat files wrapped in CDATA sections. If that data gets past 1MB, you are going to have issues -- 10MB or more is not recommended and is not going to be reasonably supported.
- Otherwise, the major consideration is the fragment size that is controlled at the group level. Please see the performance whitepaper for details on the messaging fragmentation threshold. These cases fall under almost pure routing and can handle far bigger messages, up to 1GB.
Hopefully, this helps clear up some of the confusion (or at the very least, explain why there is much confusion) around the large message story. Interestingly, this is actually significantly improved from BTS 2000/2002, since we could never handle messages on the order of a GB even in pure routing scenarios. And yes, we are working on this for future releases. We understand that this is painful, and are working very hard to try and make more large message scenarios easy with BizTalk Server.