Troubleshooting an unresponsive web server (IIS) – Part 1 of 2, gathering the data
A web server is deemed to be unresponsive if it’s not providing a response at all or it’s not meeting the response time (performance) expectations of its users.
In my “Troubleshooting 101” post, I mentioned that after the problem has been defined (i.e. basic facts collected), the next step in the troubleshooting process is to gather data relevant to diagnosing the issue. I’m planning to cover an introduction to analysing the data in a future post.
Following is a summary of my recommended action plan:
1) Ensure that the appropriate troubleshooting tools are available and/or optimally configured on the affected server(s)
2) Collect data capturing the current configuration of the server(s)
3) At the time of the problem, collect data necessary to capture the problem state
4) Provide the data gathered to an appropriate resource for analysis
Here are some details for the action plan above (steps 1 & 2 can be done now, whilst the remainder of the action plan is for when the problem next occurs):
1) Troubleshooting tools/configuration:
a. Configure a binary format Performance System Monitor (aka Perfmon) counter log for the following objects. Choose a sample interval and other settings that are appropriate for your environment (e.g. disk space availability):
Active Server Pages
Thread (this object can be particularly helpful when troubleshooting high CPU; however, including it adds a reasonable amount of overhead, so you might like to omit it for the initial data gathering attempt)
* means all objects beginning with the prefix.
By default, the process id (PID) doesn’t appear in the Perfmon Process instance names. The PID can be helpful during analysis of the Perfmon log so make the following registry change before you start the counter logging. Simply add a DWORD named “ProcessNameFormat” and give it a value of 2 under:
281884 The Process object in Performance Monitor can display Process IDs (PIDs)
b. Ensure that you have time-taken enabled in the IIS logging for the affected website(s). Note that it’s not enabled by default, and it can be very useful as an objective measure of responsiveness.
c. Ensure that you have the “Debugging Tools for Windows” available on the affected server(s). Note that, if you prefer, you can simply copy the “Debugging Tools for Windows” folder to the affected server(s) rather than running the install:
Download “Debugging Tools for Windows”
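As a quick check of step 1b, the following sketch reads the #Fields directive that IIS writes near the top of a W3C extended format log and reports whether time-taken is among the logged fields. It assumes the website uses W3C extended logging; the function names and the sample log lines are illustrative only and were not taken from a real server.

```python
def fields_in_log(lines):
    """Return the field names from the most recent #Fields directive."""
    fields = []
    for line in lines:
        # IIS can rewrite the header mid-file (e.g. after a restart or a
        # logging change), so keep the last directive seen.
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
    return fields


def has_time_taken(lines):
    """True if the log's current field list includes time-taken."""
    return "time-taken" in fields_in_log(lines)


# Illustrative header lines only (hypothetical sample, not real data).
SAMPLE_WITH = [
    "#Software: Microsoft Internet Information Services 6.0",
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
]
SAMPLE_WITHOUT = [
    "#Software: Microsoft Internet Information Services 6.0",
    "#Fields: date time cs-method cs-uri-stem sc-status",
]
```

To check a real log, pass an open file instead of a list, for example has_time_taken(open(path, encoding="ascii", errors="replace")), where path points at the newest file in the website’s log folder.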
2) Current configuration:
a. Gather general configuration information from the affected web server(s) via MPSReports:
b. Gather a copy of the IIS Metabase (%windir%\system32\inetsrv\Metabase.xml)
3) Capture the problem state:
a. The next time the server is considered unresponsive, capture hang dump(s) via the following command:
cscript.exe adplus.vbs -hang -iis -quiet -o <output path>
ADPlus comes with the “Debugging Tools for Windows” mentioned above in 1c).
How to use ADPlus to troubleshoot "hangs" and "crashes"
b. (optional) Repeat a). In some situations (e.g. high CPU), it can be helpful to capture a 2nd hang dump. The 2nd dump should only be initiated after the 1st dump has completed. Note that the 1st dump hasn’t completed when the .VBS script finishes; it launches CDB.EXE instances, so wait for them to conclude before initiating the 2nd dump. The 2nd dump can be helpful for determining which specific threads are responsible for the CPU usage, etc.
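The sequencing in 3a and 3b (run ADPlus, wait for the CDB.EXE instances to conclude, then run ADPlus again) could be scripted roughly as follows. This is only a sketch under a few assumptions: it is run on the affected server from the “Debugging Tools for Windows” folder, the standard Windows tasklist command is available, and no unrelated CDB.EXE session is running at the same time. The function names and the polling interval are hypothetical choices, not part of ADPlus.

```python
import subprocess
import time


def build_adplus_command(output_dir):
    """Build the hang-dump command line from step 3a as an argument list."""
    return ["cscript.exe", "adplus.vbs", "-hang", "-iis", "-quiet",
            "-o", output_dir]


def cdb_running(tasklist_output):
    """True if the given tasklist output still lists a cdb.exe process."""
    return "cdb.exe" in tasklist_output.lower()


def wait_for_cdb(poll_seconds=10):
    """Poll tasklist until no cdb.exe instance remains (assumed interval)."""
    while True:
        result = subprocess.run(
            ["tasklist", "/FI", "IMAGENAME eq cdb.exe"],
            capture_output=True, text=True)
        if not cdb_running(result.stdout):
            return
        time.sleep(poll_seconds)


def capture_hang_dump(output_dir):
    """Run ADPlus once, then block until its CDB.EXE instances have ended."""
    subprocess.run(build_adplus_command(output_dir), check=True)
    wait_for_cdb()
```

Calling capture_hang_dump twice in a row with the same output path would then give the 1st and (optional) 2nd dumps without the 2nd starting early.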
4) Gather the following and provide to an appropriate resource for analysis:
a. Perfmon logs covering the period leading up to the problem,
b. IIS logs covering the period leading up to the problem,
i. IIS activity log (%windir%\system32\LogFiles\W3SVCx\*.log).
ii. HTTPERR log (%windir%\system32\LogFiles\HTTPERR\*.log).
c. Hang dump(s),
d. Event logs (both Application and System in .evt format). The event logs are included in the data gathered by MPSReports (step 2a) so you might find re-running MPSReports to be a convenient way to gather a copy of the event logs.
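If it’s convenient to hand everything over as one file, the items listed in step 4 could be bundled with a small script along these lines. It’s a minimal sketch: the function name is hypothetical, and the patterns you pass in (Perfmon log folder, dump output path, and so on) depend entirely on where you chose to write those files.

```python
import glob
import os
import zipfile


def collect_to_zip(patterns, zip_path):
    """Add every file matching the glob patterns to a new zip archive.

    Environment variables in the patterns are expanded first (%windir%
    style on Windows). Returns the list of files added, in order.
    """
    added = []
    seen = set()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for pattern in patterns:
            for path in sorted(glob.glob(os.path.expandvars(pattern))):
                # Skip directories and files already matched by an
                # earlier pattern.
                if os.path.isfile(path) and path not in seen:
                    archive.write(path)
                    seen.add(path)
                    added.append(path)
    return added
```

For example, a pattern such as r"%windir%\system32\LogFiles\HTTPERR\*.log" would pick up the HTTPERR logs from 4b. Write the zip file somewhere that none of the patterns match, so the archive doesn’t try to include itself.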
Troubleshooting is typically an iterative process. In other words, repeat steps 3-4 for each occurrence of the issue until resolution is achieved.