Planning and Designing Fault-Tolerant Hardware Solutions

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2

An effective hardware strategy can improve the availability of a system. These strategies can range from adopting commonsense practices to using expensive fault-tolerant equipment.

Using Standardized Hardware

To ensure full compatibility with Windows operating systems, choose hardware from the Windows Server Catalog only. For more information, see the Windows Server Catalog link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources.

When selecting your hardware from the Windows Server Catalog, adopt a single hardware standard and apply it as consistently as possible. To do this, pick one type of computer and use the same kinds of components, such as network cards, disk controllers, and graphics cards, on all your computers. Use this computer type for all applications, even if it is more than you need for some applications. The only parameters that you should vary are the amount of memory, the number of CPUs, and the hard disk configuration.

Standardizing hardware has the following advantages:

  • Having only one platform reduces the amount of testing needed.

  • When testing driver updates or application-software updates, only one test is needed before deploying to all your computers.

  • With only one system type, fewer spare parts are required.

  • Because only one type of system must be supported, support personnel require less training.
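The standardization rule described above (identical components everywhere, with only memory, CPU count, and disk configuration allowed to differ) can be checked against a hardware inventory. The following is a minimal sketch, not a prescribed tool: the field names, the model and part labels, and the inventory layout are all hypothetical and would need to match whatever asset data you actually keep.

```python
# Fields that the guidance allows to differ between otherwise identical servers.
# The field names are hypothetical; adapt them to your own inventory data.
ALLOWED_TO_VARY = {"memory_gb", "cpu_count", "disk_config"}

# Hypothetical standard build: one computer type, one kind of each component.
STANDARD = {
    "model": "ServerModel-X",
    "network_card": "NIC-A",
    "disk_controller": "CTRL-A",
    "graphics_card": "GFX-A",
}

# Hypothetical inventory; print01 uses a non-standard network card.
INVENTORY = {
    "file01": {"model": "ServerModel-X", "network_card": "NIC-A",
               "disk_controller": "CTRL-A", "graphics_card": "GFX-A",
               "memory_gb": 4, "cpu_count": 2, "disk_config": "RAID-5"},
    "print01": {"model": "ServerModel-X", "network_card": "NIC-B",
                "disk_controller": "CTRL-A", "graphics_card": "GFX-A",
                "memory_gb": 2, "cpu_count": 1, "disk_config": "RAID-1"},
}


def find_deviations(standard, inventory):
    """Return {server: {field: (expected, actual)}} for non-standard components."""
    deviations = {}
    for name, spec in inventory.items():
        diff = {
            field: (expected, spec.get(field))
            for field, expected in standard.items()
            if field not in ALLOWED_TO_VARY and spec.get(field) != expected
        }
        if diff:
            deviations[name] = diff
    return deviations


if __name__ == "__main__":
    for server, diff in find_deviations(STANDARD, INVENTORY).items():
        print(server, diff)
```

Differences in memory, CPU count, and disk configuration are ignored on purpose, because the guidance treats those as the only parameters that should vary.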

For help choosing standardized hardware for your file and print servers, see "Designing and Deploying File Servers" and "Designing and Deploying Print Servers" in this book.

Using Spares and Standby Servers

This chapter discusses clustering as a means of providing high availability for your applications and services to your end users. However, there are two alternatives to clustering that provide flexibility or redundancy in your hardware design: spares and standby systems.

Spares

Keep spare parts on-site, and include spares in any hardware budget. One of the advantages of using a standard configuration is the reduced number of spares that must be kept on-site. If all of the hard drives are of the same type and manufacturer, for example, you can keep fewer drives in stock as spares. This reduces the cost and complexity associated with providing spares.

The number of spares that you need to keep on hand varies according to the configuration and the failure conditions that users and operations personnel can tolerate. Another concern is the availability of replacement parts. Some parts, such as memory and CPUs, are easy to find years later. Other parts, such as hard drives, are often difficult to locate after only a few years. For parts that may be hard to find, and where exact matches must be used, plan to buy spares when you buy the equipment. Consider using service companies or vendor contracts to delegate responsibility for stocking spares, or consider keeping one or two of each of the critical components in a central location.
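One rough way to reason about how many spares of a given part to stock is to estimate how many failures are likely while a replacement is on order. The following sketch is an illustration only and is not taken from this guide: it assumes that failures are independent and follow a Poisson distribution, and the function name, the failure rate, and the lead time are hypothetical inputs that you would replace with vendor or operational data.

```python
import math


def spares_needed(units, annual_failure_rate, lead_time_days, confidence=0.95):
    """Smallest spare count that covers failures during one resupply lead time.

    Assumes independent failures modeled as a Poisson process (an assumption
    of this sketch, not a statement from the guide).
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1 (exclusive)")
    # Expected number of failures across all units during the lead time.
    expected = units * annual_failure_rate * lead_time_days / 365.0
    spares = 0
    term = math.exp(-expected)   # probability of exactly 0 failures
    cumulative = term            # probability of at most `spares` failures
    while cumulative < confidence:
        spares += 1
        term *= expected / spares  # probability of exactly `spares` failures
        cumulative += term
    return spares


if __name__ == "__main__":
    # Hypothetical figures: 100 identical drives, 3% annual failure rate,
    # 30 days to obtain a replacement.
    print(spares_needed(100, 0.03, 30))
    print(spares_needed(100, 0.03, 30, confidence=0.99))
```

A stricter tolerance for running out (a higher confidence value) or a longer lead time raises the count, which matches the point above that parts that are hard to find later justify buying spares up front.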

Standby Systems

Consider the possibility of maintaining an entire standby system, possibly even a hot standby to which data is replicated automatically. For file servers, for example, the Windows Server 2003 Distributed File System (DFS) allows you to logically group folders located on different servers by transparently connecting them to one or more hierarchical namespaces. When DFS is combined with File Replication service (FRS), clients can access data even if one of the file servers goes down, because the other servers have identical content. DFS and FRS are discussed in detail in "Designing and Deploying File Servers" in this book.

If the costs of downtime are very high and clustering is not a viable option, you can use standby systems to decrease recovery times. Standby systems are particularly worth considering when a failure carries direct costs, such as lost profits from server downtime or penalties for violating a service level agreement.

A standby system can quickly replace a failed system or, in some cases, act as a source of spare parts. Also, if a system has a catastrophic failure that does not involve the hard drives, it might be possible to move the drives from the failed system to a working system (possibly in combination with using backup media) to restore operations relatively quickly. This scenario does not happen often, but it does happen, in particular with CPU or motherboard component failures. (Note that this transfer of data after a failure is performed automatically in a server cluster.)

One advantage to using standby equipment to recover from an outage is that the failed unit is available for careful after-the-fact diagnosis to determine the cause of the failure. Getting to the root cause of the failure is extremely important in preventing repeated failures.

Standby equipment should be certified and running on a 24-hours-a-day, 7-days-a-week basis, just like the production equipment. If you do not keep the standby equipment operational, you cannot be sure it will be available when you need it.

Using Fault-Tolerant Components

Using fault-tolerant technology can improve both availability and performance. The following sections describe some basic fault-tolerance considerations in two key areas of your deployment: storage and network components. In both cases you should also consult hardware vendors for details specific to each product, especially if you are considering deploying server clusters. For more information about storage options and strategies for server clusters, see "Designing and Deploying Server Clusters" in this book.

Storage Strategies

When planning how to store your data, consider the following points:

  • The type and quantity of information that must be stored. For example, will a particular computer be used to store a large database needing frequent reads and writes?

  • The cost of the equipment. It does not make sense to spend more money on the storage system than you expect to recover in saved time and data if a failure occurs.

  • Specific needs for protecting data or making data constantly available. Do you need to prevent data loss, or do you need to make data constantly available? Or are both necessary? For preventing data loss, a RAID arrangement is recommended. For high availability of an application or service, consider multiple disk controllers, a RAID array, or a Windows clustering solution. (Clustering is discussed later in this chapter.)

  • A good backup and recovery plan is essential. Downtime is inevitable, but a sound and proven backup and recovery plan can minimize the time it takes to restore services to your users. For more information, see "Backing up and recovering data" in Help and Support Center for Windows Server 2003.

  • Physical memory copying, or memory mirroring, provides fault tolerance through memory replication. Memory-mirroring techniques include having two sets of RAM in one computer, each a mirror of the other, or mirroring the entire system state, which includes RAM, CPU, adapter, and bus states. Memory mirroring must be developed and implemented in conjunction with the original equipment manufacturer (OEM).
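The cost consideration in the list above (do not spend more on the storage system than you expect to recover in saved time and data) comes down to simple arithmetic. The following sketch shows one way to frame the comparison; the function name, the parameters, and all of the figures are hypothetical, and a real estimate would use your own failure history and downtime costs.

```python
def storage_spend_is_justified(extra_cost, failures_per_year,
                               hours_saved_per_failure, cost_per_hour,
                               data_loss_avoided_per_failure=0.0, years=1.0):
    """Compare extra storage spending with the savings it is expected to yield.

    Returns (justified, expected_savings). All inputs are estimates supplied
    by the planner; nothing here is measured automatically.
    """
    savings_per_failure = (hours_saved_per_failure * cost_per_hour
                           + data_loss_avoided_per_failure)
    expected_savings = failures_per_year * years * savings_per_failure
    return expected_savings >= extra_cost, expected_savings


if __name__ == "__main__":
    # Hypothetical case: a fault-tolerant array costing 10,000 more, one
    # failure expected every two years, evaluated over a three-year life.
    print(storage_spend_is_justified(
        extra_cost=10000, failures_per_year=0.5,
        hours_saved_per_failure=8, cost_per_hour=2000,
        data_loss_avoided_per_failure=5000, years=3))
```

If the expected savings fall below the extra cost, the guidance above suggests that a less expensive storage option, backed by a proven backup and recovery plan, may be the better choice.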

For more information about storage strategies, see "Planning for Storage" in this book.

Network Components

The network adapter is a potential single point of failure. Fortunately, the network adapter is, on average, very reliable. However, other components outside the computer can fail, causing the same effect that you would experience with the loss of the network adapter. These include the network cable to the computer; the switch or hub; the router; and the Dynamic Host Configuration Protocol (DHCP), Domain Name System (DNS), and Windows Internet Name Service (WINS) systems. A failure in any one of these components can make one or more servers, and potentially all of them, unavailable.
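The effect of a chain of single points of failure, and of duplicating a component in that chain, can be illustrated numerically. The following sketch uses the standard availability formulas for components in series and for redundant copies; it assumes that failures are independent, and the availability figures are hypothetical rather than measured values for any product.

```python
def series_availability(availabilities):
    """Availability of a path in which every component must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availability, copies):
    """Availability of a group in which any one of several copies suffices.

    Assumes the copies fail independently of one another.
    """
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    # Hypothetical path: network adapter, cable, switch, router.
    path = [0.999, 0.999, 0.999, 0.99]
    print(series_availability(path))
    # Same path with the router duplicated.
    path_with_spare_router = [0.999, 0.999, 0.999,
                              redundant_availability(0.99, 2)]
    print(series_availability(path_with_spare_router))
```

Because the availabilities multiply, the least reliable component dominates the result, so it is usually the first candidate for redundancy.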

You can contend with such failures through redundancy in your network design. Many components lend themselves to backup or load-sharing strategies. The following list describes redundancy strategies for the network hardware (hub or switch, network adapter, and wiring), the routers, DHCP, and DNS or WINS.

  • Network hardware: Although hubs, switches, network adapters, and wiring are very reliable, if a service must be guaranteed, it is still important to use redundancy for these components. Consult with the vendors who provide your network hardware and support for recommendations on how to build redundancy into your network. For more information about building redundancy into your network, see "Designing a TCP/IP Network" in Deploying Network Services of this kit.

  • Routers: Routers do not fail frequently, but when they do, entire computer centers can go down. Having redundant routing capability in the computer center is critical. Consult your router vendor for information about how to protect against router failures.

  • DHCP: For the servers on which you must maintain the highest degree of availability, use fixed IP addresses and do not use DHCP. This prevents an outage due to the failure of the DHCP server. Fixed addresses can also improve address resolution by DNS servers that do not handle the dynamic address assignment provided by DHCP. For more information about DHCP, see "Deploying DHCP" in Deploying Network Services of this kit.

  • DNS and WINS: DNS and WINS infrastructure components are easy to replicate. Both were designed to support replication of their name tables and other information. When you use multiple DNS and WINS servers, make sure that you place them on different network segments. For information about WINS and DNS servers and replication, see "Deploying DNS" and "Deploying WINS" in Deploying Network Services of this kit.

For information about replication options for WINS servers, see "Configuring WINS replication" in Help and Support Center for Windows Server 2003. For information about replicating DNS zones, see "DNS zone replication in Active Directory" in Help and Support Center for Windows Server 2003.
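The advice above to place multiple DNS and WINS servers on different network segments can be verified from a list of server addresses and subnet definitions. The following is a minimal sketch using the Python standard library `ipaddress` module; the addresses, the subnets, and the function names are hypothetical, and the sketch treats each subnet as one network segment, which may not match your physical topology.

```python
import ipaddress


def segments_for(servers, segments):
    """Map each server address to the first listed segment that contains it."""
    networks = [ipaddress.ip_network(s) for s in segments]
    placement = {}
    for address in servers:
        ip = ipaddress.ip_address(address)
        # None means the address is not in any of the listed segments.
        placement[address] = next(
            (str(net) for net in networks if ip in net), None)
    return placement


def on_distinct_segments(servers, segments):
    """True only if every server is in a known segment and no two share one."""
    found = list(segments_for(servers, segments).values())
    return None not in found and len(set(found)) == len(found)


if __name__ == "__main__":
    # Hypothetical segments and DNS server addresses.
    segments = ["10.0.1.0/24", "10.0.2.0/24"]
    print(on_distinct_segments(["10.0.1.10", "10.0.2.10"], segments))
    print(on_distinct_segments(["10.0.1.10", "10.0.1.11"], segments))
```

A check like this covers only address planning; it does not confirm that the segments use separate switches, cabling, or routers.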

For more information about network infrastructure, see the Microsoft Systems Architecture link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources.