Be kind, rewind (but don’t reboot)
One very common belief I have come across is that rebooting Windows somehow “cleans” the system and returns it to normal speed after some performance degradation (and further that reinstalling the OS periodically does some magical cleaning too).
For the most part, this is complete nonsense.
Shutting down Windows terminates all processes and services and empties the system cache. Starting from clean triggers a system initialization check (was the previous shutdown clean, does the disk need checking, is there a crash dump to extract, etc.), followed by a mass struggle for domination as the various components try to fire up.
See the previous blog entry where I talked about contention – a system startup is a bottleneck, luckily one-off, where all the various parts of the OS plus 3rd party services will want to get started (and then often sit in an idle state for the majority of their lives).
Once this startup procedure is over and the OS is sitting at the authentication prompt (or the desktop, if the user added to the contention by logging on and incurring the additional load of their logon processes), we have an empty system cache.
In some (client) environments Superfetch can kick in after the system has been idle a while and start to load in files that it has observed the user requesting in a pattern – this starts to pre-populate the system cache again to remove the delay caused by disk I/O when the file is actually requested.
As file I/O is done, the Cache Manager works with the Memory Manager to keep virtual blocks of files in memory. This allows files to be reused efficiently without incurring further I/O.
Windows caches on file sections, not entire files, for efficiency – and processes reading the same file will use pointers to the same cached file sections.
So a populated system cache is a good thing – unused memory is wasted.
Pages in the system cache age, and pages get trimmed (paged to disk or freed) based on how long ago they were last accessed (i.e. a cache “hit”). If the system needs to free physical memory to satisfy requests for memory allocations, the cache is checked before processes get their working sets trimmed.
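To make the idea of caching file sections and trimming the oldest ones concrete, here is a minimal sketch in Python. It is a toy model, not how the Cache Manager is implemented: the SectionCache class, the fixed SECTION_SIZE and the strict least-recently-used policy are all assumptions made for illustration.

```python
from collections import OrderedDict

SECTION_SIZE = 256 * 1024  # assumed section size, for illustration only


class SectionCache:
    """Toy cache keyed by (file, section index) with oldest-first trimming."""

    def __init__(self, max_sections):
        self.max_sections = max_sections
        self.sections = OrderedDict()  # least recently accessed first
        self.hits = 0
        self.misses = 0

    def read(self, path, offset):
        # Two reads of the same file share a cached section if their
        # offsets fall inside the same section.
        key = (path, offset // SECTION_SIZE)
        if key in self.sections:
            self.hits += 1
            self.sections.move_to_end(key)  # a hit makes the section "young" again
        else:
            self.misses += 1  # in real life this is where disk I/O happens
            self.sections[key] = True
            while len(self.sections) > self.max_sections:
                self.sections.popitem(last=False)  # trim the oldest section
        return key


cache = SectionCache(max_sections=2)
cache.read("a.dll", 0)    # miss
cache.read("a.dll", 100)  # hit: same section as the first read
cache.read("b.dll", 0)    # miss
cache.read("a.dll", 0)    # hit: a.dll is now the youngest entry
cache.read("c.dll", 0)    # miss: cache is full, so b.dll is trimmed
print(cache.hits, cache.misses, list(cache.sections))
```

The point of the toy is simply that a warm cache turns repeat reads into hits, which is exactly what a reboot throws away.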
In some rare cases, the OS can suffer performance issues after it has been up for a while – however this is not expected or normal, and a reboot to “resolve” this is just masking a problem that should be investigated.
Performance issues most often come from… again… contention.
CPU contention – this can occur if (for example) a multi-processor system has all but 1 CPU stuck in a spinlock state, putting contention on that single CPU (if all CPUs were spinning then the system would be hung, not slow).
There are pools of worker threads in the kernel which deal with queues of work items – if some of these get into a hung state or the queues are backlogged, some, most or all threads in the system can end up in the wait state for much longer than is normal (allowing the queues to build up as time goes by, compounding the problem).
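The compounding effect of hung workers can be shown with a few lines of arithmetic. This is a simplified sketch under stated assumptions: simulate_backlog and its parameters are hypothetical, work arrives at a steady rate, and each healthy worker finishes exactly one item per tick.

```python
def simulate_backlog(ticks, arrivals_per_tick, workers, hung_workers):
    """Return the queue depth at the end of each tick."""
    healthy = workers - hung_workers
    depth = 0
    history = []
    for _ in range(ticks):
        depth += arrivals_per_tick    # new work items are queued
        depth -= min(depth, healthy)  # each healthy worker completes one item
        history.append(depth)
    return history


print(simulate_backlog(5, arrivals_per_tick=4, workers=4, hung_workers=0))
print(simulate_backlog(5, arrivals_per_tick=4, workers=4, hung_workers=2))
```

With all four workers healthy the queue stays empty. With two of them hung, the depth grows by two items every tick and never recovers, which is the "builds up as time goes by" behaviour described above.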
Memory resources come in different flavours for different purposes – I am not talking just physical vs virtual, but things like page table entries (PTEs), paged pool and nonpaged pool.
A system that runs low or out of PTEs will invariably hang.
A system that runs out of paged pool has likely had a leak of some kind in a driver, or the 3GB switch is in use on a busy server. The result can be performance degradation due to constant trimming, a hang, or possibly even a crash if a driver requests pool with “must succeed”.
Nonpaged pool is similar to paged pool, with the exception that it cannot ever be paged out to disk – typically this is used by drivers at device interrupt level. As with paged pool, running out can result in a severe performance drop, a hang, or a crash in extreme circumstances.
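One hedged way to tell a leak from a system that is merely busy is to look at the trend of pool usage over time rather than any single reading. The sketch below assumes you have already logged samples as (hours since boot, pool bytes) pairs. The growth_per_hour helper and the sample values are hypothetical, and a steady positive slope is only a hint that something is leaking, not proof.

```python
def growth_per_hour(samples):
    """Least-squares slope of pool bytes against hours since boot."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    denominator = sum((x - mean_x) ** 2 for x, _ in samples)
    return numerator / denominator


# Hypothetical readings: pool usage creeping up by roughly 10 MB an hour.
leaky = [(0, 100e6), (1, 110e6), (2, 120e6), (3, 131e6)]
steady = [(0, 100e6), (1, 104e6), (2, 99e6), (3, 101e6)]
print(growth_per_hour(leaky))
print(growth_per_hour(steady))
```

A slope that stays near zero suggests normal churn. A slope that stays stubbornly positive across many hours is worth chasing down to the driver responsible.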
Locks can end up with lots of waiters if the current owner holds one longer than normal, or if 2 or more threads are constantly grabbing and releasing it. Locks are a necessary evil that ensure coordinated access to data structures and maintain data integrity, but they can lead to scalability issues.
In particularly bad cases a deadlock can be encountered – 2 threads that each hold a lock and wait indefinitely for the other’s lock to become available. This can hang some parts of the system, tie up worker threads or slow down the entire system until it hangs.
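A deadlock is, at heart, a cycle in a "who is waiting for whom" graph. The sketch below is a simplified illustration rather than anything the OS exposes: find_deadlock and the thread names are hypothetical, and it assumes each blocked thread waits on exactly one lock owner at a time.

```python
def find_deadlock(waits_for):
    """waits_for maps a blocked thread to the thread that owns the lock it wants.

    Returns the list of threads forming a cycle, or None if there is no cycle.
    """
    for start in waits_for:
        seen = []
        current = start
        # Follow the chain of owners until it ends or loops back on itself.
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            return seen[seen.index(current):]
    return None


# T1 holds lock A and wants lock B; T2 holds lock B and wants lock A.
print(find_deadlock({"T1": "T2", "T2": "T1"}))
# T1 waits on T2, T2 waits on T3, but T3 is running: slow, not deadlocked.
print(find_deadlock({"T1": "T2", "T2": "T3"}))
```

The second case is the long-lock-hold scenario: a chain of waiters that will eventually drain. The first case never will, no matter how long you wait.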
When you start to look at the possible things that can go wrong in an OS, it is more surprising that it works most of the time!
So next time you believe your servers to be going slower than you expect, have a look at the nature of the performance issue:
Is it slow to log on?
- What is the message displayed during the longest delays while you wait for the desktop?
Is it slow to start new processes?
- Is it the same for all processes, or just certain ones?
- Are processes that are already running working at normal speed?
- Does Task Manager show the CPUs are all under constant high load?
- Is something grinding the disk? (If so, is it paging or file I/O?)
Process Explorer is a great tool for identifying where CPU time is being spent, from hardware interrupts down to individual threads in processes.
Resource Monitor (for Vista onwards) is great for real-time analysis of file I/O – which process is incurring what amount of read or write I/O against which file objects.
Performance Monitor (PerfMon) is great for logging performance to view statistical data on memory usage, CPU time, network throughput, etc.
If you use PerfMon, the valuable counters are current (and very occasionally peak), not average – very rarely is an averaged value going to be of use for performance analysis.
Current CPU and disk queue length can show a backlog of “work pending”, while System counters like Pool Paged Bytes and Pool Nonpaged Bytes only have meaning if you know what the maximum is (based on system configuration). Free System PTEs is, however, a useful counter on its own (keeping it over 10,000 is a basic rule of thumb).
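To see why an averaged value is so rarely useful, consider a made-up run of CPU samples where the system was saturated for three consecutive intervals. The summarize helper and the numbers are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean


def summarize(samples, threshold):
    """Compare the average against the peak and the longest saturated run."""
    longest = run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0  # count consecutive busy samples
        longest = max(longest, run)
    return {"average": mean(samples), "peak": max(samples), "longest_run": longest}


cpu_percent = [5, 4, 6, 5, 100, 100, 100, 5, 4, 6]  # hypothetical samples
print(summarize(cpu_percent, threshold=90))
```

The average comes out at 33.5%, which looks perfectly healthy, while the peak and the run length show that the CPU was pegged for three samples in a row. That is the period the user actually noticed.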
The next time your system feels a bit sluggish, take a look at what “sluggish” is to you, and try to identify what is currently saturated rather than reach for the “Restart” button on the Start menu.
Identifying and fixing a problem is better than ignoring or working around it.