How Can a Disk Bottleneck Affect TMG Performance?

1. Introduction

Troubleshooting performance issues is not easy regardless of the product, and TMG is no different. As a matter of fact, it is tougher, because so many elements external to TMG can play a big role in the overall performance experience.

2. General Considerations

One of the biggest misconceptions about disk utilization on TMG is that the disk is a subsystem you shouldn’t worry too much about. I’m sorry to say, but this is wrong. Although TMG doesn’t depend on disk as much as an Exchange Server or a SQL Server does, it does rely on disk for several purposes. Here are the main factors:

· Logging – although you can choose another location for the log, the disk activity will still exist unless you choose to log to a remote SQL Server. Remote logging can be a good choice, but it introduces another layer of potential latency. In other words, during sizing you need to consider how fast the connection between TMG and the SQL Server is, how much logging traffic will move across the network to the SQL Server, and whether the remote SQL Server has enough capacity to receive the logs from TMG on top of whatever other work it is probably doing.

Note: even when you use remote logging, the failover mechanism (used when TMG can’t access the remote server) relies on the Large Logging Queue (LLQ) feature, which uses the local disk. For details, read the article Overview of the Logging Improvements in Forefront Threat Management Gateway (TMG).

· Caching – if you are using TMG as a proxy server, you will probably want to enable caching. In this case you need to consider the cache size and where the cache file will be located.

· Malware Inspection Temp Scan Storage Folder – when you enable the Malware Inspection feature, TMG temporarily stores the files it is downloading and inspecting. Consider your traffic profile (for example, whether you have heavy users who download large files) to decide whether to relocate this folder to another disk.
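
For the remote logging consideration above, a back-of-the-envelope estimate of the sustained log traffic can help during sizing. The sketch below is only an illustration: the function name, the request rate, and the record size are hypothetical values chosen for the example, not measurements from TMG.

```python
def remote_logging_mbps(requests_per_sec, bytes_per_record):
    """Approximate sustained log traffic to the remote SQL Server in Mbit/s.

    Both inputs are assumptions that you would replace with figures
    measured in your own environment.
    """
    bits_per_sec = requests_per_sec * bytes_per_record * 8
    return bits_per_sec / 1_000_000


# Hypothetical load: 2,000 logged requests per second, about 1,000 bytes each.
print(f"{remote_logging_mbps(2000, 1000):.1f} Mbit/s")
```

The result is only the raw log payload. Protocol overhead and the other activity on the SQL Server come on top of it, so treat the number as a lower bound.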

3. Sizing

The TMG Product Team did a great job with the new TMG Capacity Planning tool, and you should always use it to size TMG for your scenario. Here is the link to the tool:

http://www.microsoft.com/downloads/details.aspx?FamilyID=01b2f7a5-8165-4ead-9693-994504f66449&displaylang=en

Another great article, which I recommend reading, is the general hardware recommendations for Forefront TMG at:

http://blogs.technet.com/isablog/archive/2010/01/12/hardware-recommendations-for-forefront-tmg-2010.aspx

4. A Troubleshooting Example: ISA stops responding twice a day

I don’t have a good set of data from TMG for this type of issue; however, when it boils down to disk issues, ISA and TMG are very similar. The only difference is that TMG is more disk intensive, because its new features use the disk more than ISA did. The problem I’m going to cover here was very interesting: twice a day ISA simply stopped responding to web requests, and the firewall administrator had to restart the Firewall Service to get it back into production.

The initial troubleshooting ruled out potential issues in the following areas:

· DNS

· Network binding order

· Re-injection/Backlog

· Authentication

· NIC Issues

All of that checked out, so we needed to move to the next level of troubleshooting, which was done by monitoring the server and gathering data while the issue was happening.

4.1. Data Gathering

To do that, the following preparation was done on the ISA Server:

1) Installed DebugDiag (download from the link below):

http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=28bd5941-c458-46f1-b24d-f60151d875a3

2) Opened Perfmon and added the following objects:

> ISA Server Firewall Packet Engine/*

> ISA Server Firewall Service/*

> ISA Server Web Proxy/*

> Memory/*

> Processor/*

> Network Interface/*

> Process/*

> Physical Disk/*

- Configured the maximum file size to 200 MB (creating a new file when it gets full) and the sample interval to 15 seconds.

- Started the perfmon capture.

3) Installed ISABPA (downloaded it from www.isabpa.com).
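
The Perfmon capture in step 2 was configured through the UI, but the same settings can be sketched as a logman command line. This is a minimal sketch under stated assumptions: the collector name and output path are hypothetical, which objects expose instances is my assumption, and the object names on a given server should be checked with typeperf -q. The file rollover setting is left out here and can be set in the UI or reviewed in logman create counter /?.

```python
import subprocess

# (object name, has instances) pairs based on the Perfmon list in step 2.
# Which objects expose instances is an assumption; verify with `typeperf -q`.
# The disk object is registered on Windows as "PhysicalDisk" (no space).
COUNTER_OBJECTS = [
    ("ISA Server Firewall Packet Engine", False),
    ("ISA Server Firewall Service", False),
    ("ISA Server Web Proxy", False),
    ("Memory", False),
    ("Processor", True),
    ("Network Interface", True),
    ("Process", True),
    ("PhysicalDisk", True),
]


def counter_path(obj, has_instances):
    """Build a wildcard counter path such as \\Processor(*)\\*."""
    instance = "(*)" if has_instances else ""
    return f"\\{obj}{instance}\\*"


def build_logman_args(name, output, interval_sec=15, max_mb=200):
    """Return the argument list for `logman create counter`."""
    args = ["logman", "create", "counter", name, "-f", "bin",
            "-si", str(interval_sec), "-max", str(max_mb),
            "-o", output, "-c"]
    args += [counter_path(obj, inst) for obj, inst in COUNTER_OBJECTS]
    return args


# Hypothetical collector name and output path.
args = build_logman_args("isa_perf", r"C:\PerfLogs\isa_perf")
print(subprocess.list2cmdline(args))
```

The script only prints the command so that it can be reviewed before it is run on the server.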

When the issue was happening, the following actions were done:

1) Went to Start / Programs / Microsoft ISA Server / ISA Tools / ISA Data Packager

2) On the option "Collect data using one of the following repro scenarios", selected "Web Proxy and Web Publishing" and clicked Next;

3) Clicked Modify Options;

4) In addition to the options that are already selected, included the following ones:

- ISA BPA

- ISA Info

5) Clicked Start data collection.

6) The Data Packager started to run. When the option “Press spacebar to start the capture” appeared, pressed the spacebar and reproduced the issue by trying to connect from the client workstation to any web site.

7) After finishing the test, pressed the spacebar again in the ISA Data Packager console.

8) Opened DebugDiag.

9) On the first DebugDiag screen (Select Rule Type), clicked the Cancel button.

10) Went to the Processes tab and looked for the wspsrv.exe process.

11) While this window was open, went back to the workstation and tried to connect again.

12) While the workstation was trying to connect, went back to the DebugDiag window, right-clicked the wspsrv.exe process and selected the option Create Full User Dump.

Note: depending on the frequency of the problem, you might want to take multiple dumps, for example two dumps 10 seconds apart.

13) Stopped the Perfmon capture.

At this point we had the following files:

· 1 DMP File created by DebugDiag

· 1 CAB File created by ISA Data Packager

· 1 (or more) BLG files created by Perfmon.

Note: in some cases the user mode dump is not enough to determine the root cause, mainly in scenarios where the disk is the potential root cause of the problem. In this case you will also need to prepare the server to capture a complete memory dump. You can use KB972110 to prepare your server for a complete memory dump, or you can download DumpConfigurator.hta from Codeplex and it will prepare the box for you. If you decide to open a case with Microsoft CSS, make sure to review your dump before sending it by following the article Save Support Dollars by Checking Your Memory Dump before Calling Support.

4.2. Data Analysis

When the issue was happening perfmon showed the following trend:

The Avg. Disk Queue Length is considered good when it is up to 2 per spindle (physical disk). Notice that in this case it was stuck at 42.038 for about 2 minutes, which is a huge value considering that this server has just one disk. This will definitely cause ISA to start queuing up requests to write log data to disk, and as a side effect Log Buffer Failed Due To Full Queue also starts to grow. This also explains the warning below, which appeared while the issue was happening:

In this particular scenario the perfmon data was enough to prove the disk bottleneck.
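
The rule of thumb used above (up to 2 per spindle) can also be checked mechanically once the queue length samples are exported from the BLG file. The sketch below is one possible way to do it, not the method used in this case. The function name, the 60-second minimum duration, and all sample values other than 42.038 are illustrative assumptions.

```python
def sustained_queue_runs(samples, interval_sec=15, spindles=1,
                         per_spindle_limit=2.0, min_duration_sec=60):
    """Find runs of consecutive samples above spindles * per_spindle_limit.

    Returns a list of (start_index, duration_sec, peak) tuples for runs
    that last at least min_duration_sec.
    """
    limit = spindles * per_spindle_limit
    runs = []
    start = None
    # A trailing 0.0 sentinel closes a run that reaches the last sample.
    for i, value in enumerate(list(samples) + [0.0]):
        if value > limit:
            if start is None:
                start = i
        elif start is not None:
            duration = (i - start) * interval_sec
            if duration >= min_duration_sec:
                runs.append((start, duration, max(samples[start:i])))
            start = None
    return runs


# Illustrative series at a 15-second interval: eight samples at 42.038
# approximate the roughly 2 minutes described above; other values are made up.
queue_length = [0.5, 1.2] + [42.038] * 8 + [0.9]
print(sustained_queue_runs(queue_length, spindles=1))
```

A short spike above the limit is ignored, while a run like the one in this case is reported with its start position, duration, and peak value.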

Author

Yuri Diogenes

Sr. Support Escalation Engineer

Microsoft CSS Forefront Edge Team

Technical Reviewers

Vic Singh Shahid

Sr. Escalation Engineer

Microsoft CSS Forefront Edge Team

Thomas Detzner

Sr. Escalation Engineer

Microsoft CSS Forefront Edge Team