Understanding and Recovering From a Mail Storm in Exchange 2003
One of the more common calls that we receive in the Transport Specialty is when a customer has a very large number of messages queued and they don’t seem to be processing. Even though the queues are in an Active State they continue to grow and no warnings or errors are generated. This scenario is what we refer to as a Mail Storm.
A Mail Storm occurs when Exchange cannot process the mail it has queued fast enough to also deal with the incoming load. The queues will show thousands of messages destined for various remote and local recipients. You may see all, or any combination, of the following queues backed up: Pre-Submission, Messages Awaiting Directory Lookup, Local Delivery, Messages waiting to be routed, and Remote Delivery Queues. The basic, overall symptom is that while you do see some mail processing, you will also see these queues continue to grow. You can think of this as a situation where more load is put on an Exchange Server than it is supposed to handle, i.e. an overloaded Exchange Server.
There are numerous things that can cause a Mail Storm. Here is a quick list of some of the more prevalent ones:
-A Spam attack or Open Relay within your Organization: Spammers aren’t nice people. If you unwittingly leave an Open Relay that is discovered you can expect hundreds of thousands of spam messages to be bounced off your servers. Also, always make sure your Spam definitions are up to date on whatever type of prevention system you are using.
-A looping message: A misconfigured Outlook Rule, Contact or Transport configuration is often the cause. A quick look at the actual messages in the queue usually determines if this is the case.
-A Public Folder Replication Storm: In an Organization with a large number of Public Folders, if a new Public Folder Store is brought up with replicas added then data is going to be sent via email.
-An unknown internal application sending mass amounts of mail: Even in some of the best IT departments I have worked with, sometimes an eager developer can configure an application that starts flooding the Exchange Server.
There are many other causes, but these are the ones we see most frequently.
The bottom line is that the Exchange Server doesn’t have the resources to handle the load and the queues will grow. I like to explain this in terms of a bell curve. Let’s say an Exchange server has ten 512 KB messages being processed. I would expect Exchange to process each message in 0.1 seconds or faster. Now let’s say that your server has 8,000 512 KB messages being processed with more coming in. In that scenario I would expect Exchange to slow to about 0.5 seconds per message. Why? The more messages that are queued, the more resources Exchange has to use to keep track of them all. Of course, actual processing time will vary based on hundreds of variables such as hardware and message content.
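To put rough numbers on that example, here is a minimal sketch in Python. It assumes messages are processed one at a time at a fixed cost per message, which is a simplification of how the transport really works, and the arrival rate of 5 messages per second is a made-up figure for illustration. The function names are mine, not part of any Exchange tool.

```python
def drain_seconds(queued, seconds_per_message):
    """Time to empty a queue if nothing new arrives (serial processing assumed)."""
    return queued * seconds_per_message


def net_growth_per_second(arrivals_per_second, seconds_per_message):
    """Positive result means the queue grows; negative means it shrinks."""
    service_rate = 1.0 / seconds_per_message  # messages processed per second
    return arrivals_per_second - service_rate


# Figures from the example above.
print(drain_seconds(10, 0.1))    # light load
print(drain_seconds(8000, 0.5))  # storm load

# Hypothetical arrival rate of 5 messages per second, for illustration only.
print(net_growth_per_second(5, 0.1))
print(net_growth_per_second(5, 0.5))
```

Under these assumptions the light load clears in about a second, while the 8,000 queued messages need 4,000 seconds (over an hour) even if nothing else arrives. At 0.5 seconds per message the server handles only 2 messages per second, so the same incoming rate that was harmless before now makes the queue grow.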
There are several common bottlenecks that we see that contribute to the overall speed of SMTP transfer. Those are Disk I/O, Memory and LDAP latency.
When you remember that Exchange is just an Application at its core, it becomes easier to understand why Disk Latency has become the #1 bottleneck in SMTP throughput. Exchange writes to or reads from a disk in almost every action that it takes during message delivery:
- Inbound mail is written to the “mailroot\vsi 1\queue” directory.
- Message Tracking Logs are written to.
- Transaction logs and databases are read and written to.
- The mysterious Information Store “Working Directory.”
Everything that Exchange does relies on the disk subsystem. To check the health of the disks, simply use Performance Monitor to look at the following counters: Object: PhysicalDisk, Counters: Avg. Disk sec/Read and Avg. Disk sec/Write. The value that you get there is the average time, in seconds, that a single read or write operation takes to complete on that drive. This value should never be over 0.02 (20 ms), and if it is, you have a bottleneck. How long would it take you to write a “1” on a piece of paper?
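If you log those counters and export the samples, a check against the 0.02 second threshold can be sketched in a few lines of Python. This is only an illustration: the readings below are invented, and the function name is hypothetical rather than part of Performance Monitor.

```python
DISK_LATENCY_LIMIT = 0.02  # seconds, the threshold given above


def slow_samples(samples, limit=DISK_LATENCY_LIMIT):
    """Return the (index, value) pairs whose latency exceeds the limit."""
    return [(i, value) for i, value in enumerate(samples) if value > limit]


# Made-up disk latency readings in seconds, for illustration only.
readings = [0.004, 0.011, 0.035, 0.019, 0.048]
for index, value in slow_samples(readings):
    print(f"sample {index}: {value:.3f} s exceeds {DISK_LATENCY_LIMIT} s")
```

With these sample values, the third and fifth readings are flagged.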
Memory only becomes a significant bottleneck when the server runs out and begins to page. Memory allocation is generally thought of in these terms:
Each member of a flat Distribution List requires 1 KB of Inetinfo.exe RAM and more if nested.
The Scalability Guide references that vanilla Exchange 2003 will use the following memory allocation for SMTP messages:
· 1,000 open messages=10 MB of InetInfo memory
· 1,000 open messages + 20,000 closed messages = 80 MB of InetInfo memory
· 1,000 open messages + 89,000 closed messages = 366 MB of InetInfo memory
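Those three figures imply a rough per-message cost, which the short Python sketch below works out. It assumes the memory beyond the first data point is entirely due to closed messages; that is my reading of the numbers, not something the guide states, and the function name is illustrative.

```python
# (open messages, closed messages, InetInfo memory in MB) from the list above.
DATA_POINTS = [
    (1000, 0, 10),
    (1000, 20000, 80),
    (1000, 89000, 366),
]


def kb_per_closed_message(opened, closed, total_mb, mb_per_open):
    """Implied KB per closed message after subtracting the open-message share."""
    closed_mb = total_mb - opened * mb_per_open
    return closed_mb * 1024 / closed


open_count, _, open_only_mb = DATA_POINTS[0]
mb_per_open = open_only_mb / open_count  # 0.01 MB, about 10 KB per open message

for opened, closed, total_mb in DATA_POINTS[1:]:
    cost = kb_per_closed_message(opened, closed, total_mb, mb_per_open)
    print(f"{closed} closed messages: about {cost:.1f} KB each")
```

Read this way, an open message costs about 10 KB of InetInfo memory and a closed message roughly 3.6 to 4.1 KB, so it is the sheer number of queued messages that drives memory use during a storm.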
The number of open messages is configurable through the MaxPendingCat registry key.
On a heavily used server, AD connectivity can easily become a bottleneck. Exchange makes dozens of LDAP queries to DC/GCs during mail routing, so you need to make sure that those servers are not only responding, but responding quickly. Again use Perfmon and look at: Object: MSExchangeDSAccess Process, Counters: LDAP Read Time and LDAP Search Time. These values shouldn’t average over 20 ms and you should not see consistent spikes over that.
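Both halves of that rule, the average and the consistent spikes, can be checked against exported samples with a small Python sketch. Treating three consecutive samples over the limit as a "consistent spike" is my own assumption, and the function name is hypothetical; adjust run_length to suit your sampling interval.

```python
LDAP_LIMIT_MS = 20  # milliseconds, the threshold given above


def ldap_health(samples_ms, limit=LDAP_LIMIT_MS, run_length=3):
    """Summarize a non-empty list of LDAP latency samples in milliseconds.

    Returns (average, sustained), where sustained is True if at least
    run_length consecutive samples exceed the limit.
    """
    average = sum(samples_ms) / len(samples_ms)
    longest = current = 0
    for value in samples_ms:
        # Extend the current run of slow samples, or reset it.
        current = current + 1 if value > limit else 0
        longest = max(longest, current)
    return average, longest >= run_length
```

An average over 20 or a sustained result of True would both suggest looking at the DC/GCs before blaming Exchange itself.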
Knowing your system bottlenecks is a good way to help speed up message flow, but once you are in a Mail Storm you have 2 primary goals: 1. Find out what caused this sudden influx of messages and 2. Get the mail delivered.
So what can be done to recover from a Mail storm? Here is a list of the procedures I have developed over the years:
1. Regardless of the Exchange version the very first thing you need to know is what is causing the mail storm. The most effective way to do this is to simply look at the messages that are queued and look for commonality.
- Are they to or from the same person? If so, disable or disconnect that mailbox. Side Note: NEVER disconnect an SMTP or System Attendant Mailbox.
- Are they Spam? If so, configure your gateways to catch it.
- Are they Public Folder replication messages? If so, disable replication.
- Do they have the same size or subject? Just another way to track it down.
You have to stop the incoming messages.
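When there are thousands of messages, eyeballing the queue gets tedious. The Python sketch below tallies the most common senders, subjects, and sizes in a folder of message files. It assumes the files can be parsed as ordinary RFC 822 text, which you should verify on a sample first, and it only reads the files. The function name and the 1 KB size buckets are my own choices.

```python
import os
from collections import Counter
from email import message_from_binary_file


def summarize_queue(queue_dir, top=5):
    """Count the most common senders, subjects, and sizes among queued files."""
    senders, subjects, sizes = Counter(), Counter(), Counter()
    for name in os.listdir(queue_dir):
        path = os.path.join(queue_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            message = message_from_binary_file(handle)
        # str() keeps the keys hashable even for unusually encoded headers.
        senders[str(message.get("From", "(none)"))] += 1
        subjects[str(message.get("Subject", "(none)"))] += 1
        sizes[os.path.getsize(path) // 1024] += 1  # bucket by whole KB
    return {
        "from": senders.most_common(top),
        "subject": subjects.most_common(top),
        "size_kb": sizes.most_common(top),
    }
```

Point summarize_queue at a copy of the queue folder and look for any sender, subject, or size that dominates the counts.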
2. Your next goal is to free up as many resources on the affected servers as possible. Old mail can always be replayed when the storm is over. Remember, the fewer messages Exchange has queued up, the quicker it will play them.
Here is my tried and true action plan for Exchange 2003:
Assuming you have identified what is causing the Mail Storm you can now deal with the messages that are queued.
A. Free up resources. Disable Anti-virus by setting all of the AV services to “disabled” and reboot. That is the best way to clear it out of memory. Since calling anti-virus is the last thing Exchange does before submitting a piece of mail, often Exchange 2003 has to wait for AV to hand it back. Plus, it takes up considerable amounts of RAM and Disk usage. If you are protected at the gateway and client levels the risk is usually acceptable. However, every situation is different so use your best judgment.
B. Give Exchange a fresh queue. Any mail that comes into Exchange 2003 off the wire (SMTP) is first written to the file system. Hopefully, the vast majority of the queued mail is in that directory. Stop the SMTP Service and navigate to the Exchsrvr\mailroot\vsi 1 directory. Rename the Queue folder to Queue.old and create a new Queue folder. When you start the SMTP service, monitor it and make sure it is keeping up with incoming messages.
C. Delete unwanted mail from the Queue.old folder. If you have identified the problem message, you can do a simple Windows search within the messages of that folder to bulk select them. Then either delete them or move them somewhere else. You should be left with only the good production mail.
D. Replay the old messages in smaller chunks. I usually start with 100 messages and then adjust from there. There are 2 methods for this. First, you can simply cut and paste from Queue.old into the Pickup directory. This is the fastest, but you lose any recipients that were on the BCC line. Secondly, you can stop the SMTP service and paste messages into the new Queue folder. When you start SMTP it will resend those messages preserving all of the headers.
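Steps B through D are mostly file shuffling, so they lend themselves to a small script. The Python sketch below is one way to do it, under these assumptions: the SMTP service is already stopped whenever the Queue folder is touched, the problem mail can be recognized by a byte string such as a subject line, and all folders sit on the same volume so that each move is a plain rename. The function names and the holding folder are hypothetical.

```python
import os
import shutil


def rotate_queue(vsi_dir):
    """Rename Queue to Queue.old and create an empty Queue (step B).

    The SMTP service must already be stopped before this is called.
    """
    queue = os.path.join(vsi_dir, "Queue")
    old = os.path.join(vsi_dir, "Queue.old")
    os.rename(queue, old)
    os.mkdir(queue)
    return old


def set_aside(old_dir, holding_dir, marker):
    """Move files whose raw bytes contain marker out of Queue.old (step C)."""
    os.makedirs(holding_dir, exist_ok=True)
    moved = 0
    for name in sorted(os.listdir(old_dir)):
        path = os.path.join(old_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            hit = marker in handle.read()
        if hit:
            shutil.move(path, os.path.join(holding_dir, name))
            moved += 1
    return moved


def replay_batch(old_dir, target_dir, batch_size=100):
    """Move up to batch_size files from Queue.old to target_dir (step D)."""
    names = [
        name for name in sorted(os.listdir(old_dir))
        if os.path.isfile(os.path.join(old_dir, name))
    ]
    for name in names[:batch_size]:
        shutil.move(os.path.join(old_dir, name), os.path.join(target_dir, name))
    return min(len(names), batch_size)
```

Here target_dir is either the Pickup directory or the new Queue folder, matching the two replay methods described in step D. Moving the unwanted mail to a holding folder rather than deleting it leaves you a way back if the marker turns out to match good mail.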
It’s all about staying in control of the mail flow. There is a lot of busy work, but you will make it through.
BEST PRACTICES TO PREVENT A FUTURE OCCURRENCE
Verify the following hotfixes are installed on your Exchange 2000/2003 Servers:
894795 - A message that exceeds the configured size limit is sent to a server that is running Exchange Server 2003
885917 - A message that exceeds the configured size limit is sent to a server that is running Exchange 2000 Server
- Upgrade your Outlook 2003 clients to Office 2003 Service Pack 3 or to the Outlook 2003 post-Service Pack 2 Hotfix Package included in KB 898457. Please note that the two KB articles below cover both Exchange 2000 Server and Exchange Server 2003.
908507 - Performance issues and excessive database growth occur on a server that is running Exchange Server 2003 or Exchange 2000 Server when a user adds a large attachment or a set of attachments to a message in Outlook 2003
898457 - Description of the Outlook 2003 post-Service Pack 2 hotfix package: November 7, 2005
- Until you upgrade your Outlook clients to the Outlook 2003 post-Service Pack 2 Hotfix Package included in KB 898457, you can use this article to block older Outlook clients:
288894 - How to disable MAPI client access to a computer that is running Exchange Server
- Configure MaxDSNSize regkey as referenced above.
817299 - XADM: A Requested Delivery Receipt Contains the Message Attachment That You Sent
308303 - Option to strip attachments for messages that generate an NDR
Additionally, you should follow and implement the following best practice recommendations:
- Configure global Message size limits and consider configuring size limits at the User & SMTP VS or Connector levels in the environment
- Make sure you have a good Anti-Virus solution to protect servers from virus attacks, malware, etc.
- Make sure you have a good Anti-Spam solution in place to protect against Spam problems
- If you have Entourage clients, make sure you have the latest Exchange SP & hotfixes installed on your Exchange Server
Developed and Written By David Michael and Mohammed Nadeem