Stretching Microsoft Server Clusters with Geo-Dispersion
By Martin McClean, Microsoft Enterprise Services
A primary goal for mission-critical businesses today is the delivery of increased server availability, improved network services and dependable redundancy capabilities in the event of hardware and software failures. Many organizations are also seeking to consolidate their infrastructure by eliminating the replication of servers and repeated applications. Microsoft Server Clusters provides a path to achieve these goals. An extension to the Windows operating system, it is a necessary step in achieving mission-critical server availability and scalability across your organization.
On This Page
The Need for Microsoft Server Clusters
Understanding Basic and Geo-Dispersed Clusters
A Real-world Server Clusters Solution
Creating a Server Clusters Solution
Implementing Redundancy for Applications
The Importance of the Hardware Compatibility List (HCL)
Cluster Hardware Considerations
Considerations for Building Cluster Servers
Additional Cluster Hardware Redundancy
Managing Public and Private Cluster Interfaces
The Voyager Topology
Extending the Cluster with SAN Storage
Identifying LUNS in Windows 2000 or NT 4.0
Managing Host Bus Adapters
Managing Physical Disk Resources within the Cluster
Dynamic Disk Support for Server Clusters
Conclusion
The Need for Microsoft Server Clusters
Microsoft Server Clusters provides a managed clustering architecture that keeps server-based applications highly available, regardless of individual component failures. Code-named Wolfpack, it was originally developed as a two-node clustering solution for Windows NT 4.0 Enterprise Edition and has since become known as Windows Server Clusters. It is available as a two-node solution in Windows 2000 Advanced Server and a four-node solution in Windows 2000 Datacenter Server.
Understanding Basic and Geo-Dispersed Clusters
A basic cluster design consists of a group of independent computers that work together to run a common set of applications. They look like a single system to the client and the application, but in reality they are multiple servers. The servers are physically connected by network and storage infrastructure and logically connected by the Cluster Service software. The majority of clusters are housed in the same physical location; often the nodes are separated only by a crossover cable.
In the basic cluster, each Windows NT/2000 server in the cluster is termed a node. The components managed by the Cluster Service are known as resources. For example, a physical disk or an IP address is a resource. Resources are placed into groups, and these groups define the basic unit of failover of services. The Cluster Service tracks the state of the nodes in the cluster so that if there is an application or server failure, it restarts the application or performs a failover to another server cluster node. The cluster can also fail back to the preferred node when that node recovers from the failure condition. A storage architecture (or shared disk array) is required to host the applications and services within the cluster. Common to both cluster nodes, the cluster storage is physically connected to both nodes with either a SCSI bus (2 node Advanced Server Cluster) or Fibre Channel connections (2 node Advanced Server Cluster, 2 or 4 node Datacenter cluster). The entire cluster is usually situated in the same physical location.
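The relationship between nodes, resources and groups can be sketched in a few lines of Python. This is only an illustrative model, not the Cluster Service itself: the class names, node names and group contents are all hypothetical, and the choice of the first surviving node as the failover target is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A resource group: the basic unit of failover (hypothetical model)."""
    name: str
    resources: list[str]
    preferred_node: str
    owner: str = ""

    def __post_init__(self):
        # A group starts out on its preferred node.
        if not self.owner:
            self.owner = self.preferred_node


@dataclass
class Cluster:
    nodes: dict[str, bool]                      # node name -> is the node up?
    groups: list[Group] = field(default_factory=list)

    def node_failed(self, node: str) -> None:
        """Mark a node down and move every group it owned to a surviving node."""
        self.nodes[node] = False
        survivors = [n for n, up in self.nodes.items() if up]
        if not survivors:
            raise RuntimeError("no surviving node to fail over to")
        for group in self.groups:
            if group.owner == node:
                # All resources in the group move together.
                group.owner = survivors[0]

    def node_recovered(self, node: str) -> None:
        """Mark a node up and fail back the groups that prefer it."""
        self.nodes[node] = True
        for group in self.groups:
            if group.preferred_node == node:
                group.owner = node


# Hypothetical two-node cluster with two resource groups.
cluster = Cluster(nodes={"NODE-A": True, "NODE-B": True})
cluster.groups.append(
    Group("File Share Group", ["Disk F:", "IP Address", "Network Name"], "NODE-A"))
cluster.groups.append(
    Group("Print Group", ["Disk P:", "IP Address", "Print Spooler"], "NODE-B"))

cluster.node_failed("NODE-A")
owners_after_failure = {g.name: g.owner for g in cluster.groups}

cluster.node_recovered("NODE-A")
owners_after_failback = {g.name: g.owner for g in cluster.groups}
```

The point of the model is that ownership changes per group, never per individual resource, which is why a disk, an IP address and a network name that belong to one service are normally placed in the same group.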
High availability cluster systems reduce the possibility of single component failure. However, the cluster is still vulnerable because there is no protection from location disasters such as fire, flood or malicious damage. The solution is to configure a geographically dispersed (or multi-site) cluster. In this configuration, the cluster nodes are separated geographically and the quorum disk is synchronously mirrored between sites. The data disks can also be synchronously mirrored between sites. The cluster is unaware of the geographic distance between its nodes, so this must be implemented at the network and storage levels within the infrastructure architecture.
In a geographically dispersed cluster, the public and private network interfaces must still exist in the same network segment and the cluster nodes must still share the same IP subnet. This is because cluster software is unable to determine network topology and because it operates on IP failover which only functions within the same subnet. To accommodate these restrictions for geographic dispersion, organizations can implement VLAN technology. Virtual LANs (VLANs) can be viewed as a group of devices on different physical LAN segments which can communicate with each other as if they were all on the same physical LAN segment.
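A quick way to sanity-check an addressing plan against the same-subnet requirement is to compare the networks that each node's interface address belongs to. The sketch below uses Python's standard `ipaddress` module; the addresses and prefix lengths are made-up examples, not values from the Voyager deployment.

```python
import ipaddress


def same_subnet(addresses_with_prefix: list[str]) -> bool:
    """Return True when every interface address belongs to one IP network."""
    networks = {ipaddress.ip_interface(a).network for a in addresses_with_prefix}
    return len(networks) == 1


# Hypothetical public interface addresses for the two nodes.
public_ok = same_subnet(["10.1.20.11/24", "10.1.20.12/24"])    # one stretched VLAN
public_bad = same_subnet(["10.1.20.11/24", "10.2.20.12/24"])   # routed, separate subnets
```

The same check would be run separately for the private interfaces, since they sit on their own network segment.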
Finally, the storage architecture in a geographically dispersed cluster must provide an arbitration mechanism to ensure that the cluster believes it has only one persistent disk with which to communicate cluster information.
For anyone seeking additional information on Windows Clustering, Microsoft provides online help and support through its public web site. Many people often avoid online help but this would be a serious oversight when working with Server Clusters. The online help contains a wealth of information and is well worth reading. It covers all aspects of Windows Clustering technology including Network Load Balancing.
For additional information go to the following web pages on the Microsoft web site: http://www.microsoft.com/windows2000/en/advanced/help/
A Real-world Server Clusters Solution
A large financial organization, Voyager Financials, wanted to consolidate its large server base and reduce overall server administration while increasing its capacity to provide high availability and scalability of services to its customers. They chose to plan and implement Microsoft Server Clusters on the Windows 2000 Advanced Server platform and to introduce geographic dispersion into the cluster design to ensure ongoing services in the event of a physical disaster at central hub locations.
The solution for Voyager Financials is a variant of a true geographically dispersed cluster. It complies with the requirement to geographically separate the cluster nodes and to geographically separate and replicate real-time synchronous cluster information. However, in this design, the cluster data storage is replicated by a separate process and the cluster is not aware of the remote or mirrored copy of the data. Both nodes are connected to the same SAN located at the primary location. The idea is to manually invoke the secondary cluster storage should a physical disaster occur onsite. Additional hardware redundancy was also built in to the cluster design to cater for any single point of failure.
The purpose of this article is to discuss some of the experiences involved in the deployment of this variant geo-cluster solution for Voyager Financials. The article discusses many of the issues that occurred when implementing additional hardware redundancy, real-time replicated storage and geographically dispersed nodes.
Creating a Server Clusters Solution
The first step in developing the required cluster solution for Voyager Financials was to start with a basic two-node cluster design. Each node was then installed at a separate physical hub location and provided physical disaster redundancy for the other. The cluster nodes were connected by configuring a network infrastructure which defined two separate VLANs to be used by the cluster. A dedicated VLAN was used exclusively for the private cluster interface and a second VLAN connection for the public cluster interface. The public interface was configured with dual network adapters in a teamed configuration to provide redundancy across the cluster's public interface. Redundant private interface connections were unnecessary as the cluster can always default to the public network for heartbeat communications in the event of a private interface failure.
Shared storage on the server cluster was provided by a proprietary Storage Area Network (SAN) solution. Additional redundancy was extended by including a mesh of Fibre channel switches to prevent any single switch failure interrupting service to the SAN. In addition, cluster and user data information were mirrored by synchronous real-time replication to a remote SAN in an alternate physical location. The second SAN environment is reserved for disaster recovery purposes and is not visible to the cluster.
The actual design is illustrated below:
The cluster illustrated above caters for disaster recovery situations by providing real-time synchronous data replication of the cluster quorum, user, application and services data to the remote SAN environment. In this variant of the geographically dispersed cluster for Voyager, the failover of applications to the other cluster node is seamless and does not require any administrator intervention. However, the failover to the duplicated SAN is a manually initiated process. In the event of a location (or SAN) failure at the primary location, the administrator needs to manually activate the remote SAN to allow the remote server to gain access to the data.
Implementing Redundancy for Applications
In the Voyager solution, File Sharing, Print Spooler and DHCP Services were split across both nodes, with each service being operated from a preferred node. All applications and services were configured in an Active/Passive configuration. This meant that File Sharing services would be actively hosted on one cluster node and Print Spooler services on the other node. DHCP Services were to be hosted on the node that demonstrated the smallest load. When any failure conditions arose, any of the services could simply failover to the alternative node.
This configuration provides a good performance balance because both nodes support the cluster resources. However, performance is reduced when failover conditions occur because one node must support all the cluster's resources until the failed node comes back online.
This is illustrated below:
Active/Passive Configuration of Cluster Services
The Importance of the Hardware Compatibility List (HCL)
Microsoft highly recommends that your server cluster configuration uses identical hardware across all cluster nodes. In addition, any hardware used to construct the cluster should employ strict compliance with the Microsoft Hardware Compatibility List (HCL). A geographically dispersed cluster must also be specifically listed on the geographically dispersed cluster HCL list. You should never take any arbitrary cluster and attempt to make it geo-dispersed.
When purchasing or configuring hardware for your cluster, make sure you remember that HCL compliance applies to equipment BIOS levels and driver versions. All drivers should be Windows Hardware Quality Labs (WHQL) certified and signed. You also shouldn't assume that the latest drivers are always the best drivers to use. Often, the latest drivers have not yet been tested and are not yet supported!
Microsoft expects conformity with the above hardware compliance recommendations before a cluster qualification can be fully supported, so save yourself some trouble and ensure your configuration is compliant from the beginning.
For your reference, the Windows 2000 HCL and the Cluster Service HCL are at http://www.microsoft.com/hcl/default.asp.
For additional information, including Cluster Test kits, see this Microsoft website: http://www.microsoft.com/hwdq/hwtest/devices/systems.asp?area=SysSrv-Clstr.
Cluster Hardware Considerations
Selecting identical quality hardware for each server node is highly recommended for returning reliable performance from your Server Cluster. Voyager Financials purchased each server with quad Intel Pentium III Xeon processors and approximately 4 gigabytes of main memory. All the boot disks were configured with hardware RAID 1 (or mirroring). Each node had identical network interface cards to service cluster communications and identical host bus adapter cards to access shared storage. The clusters connected to a fibre channel switch fabric which in turn connected to the Storage Area Network (SAN). The local area network was configured as a gigabit Ethernet environment.
When configuring your cluster hardware, it is good practice to ensure that the hardware drivers are the latest certified drivers available. Voyager experienced some critical errors on the server when using older hardware RAID BIOS levels so it's best to upgrade these to the most recent certified versions before proceeding.
Considerations for Building Cluster Servers
Building and testing a complex cluster solution can often result in the need to rebuild the basic cluster server more than once. This can be very time consuming, especially if you need to return later and add Service Packs, etc. For this reason it is recommended that unattended scripts are used to build basic cluster servers. The Windows 2000 Resource Kit provides an excellent Setup Wizard which generates an unattended setup file as well as a simple batch file that launches the installation. There are also options to include updated drivers when building the server; this is especially useful for complex environments which may require updating several network and host bus adapter drivers. The latest Service Packs can also be integrated into your build source files if required.
With a mix of independent SAN and fibre switch technologies in the Voyager cluster design, there were several delays in getting the SAN, switch fabric and associated disk assignments configured. Voyager had to work across several political and technical boundaries including the storage administrators, network architects and support system administrators. Communication and staff availability levels between these divisions can easily cause implementation delays so make sure you factor this into your project schedule.
Additional Cluster Hardware Redundancy
To ensure the best possible server availability for customers, Voyager incorporated a great deal of redundant hardware into the server cluster design.
Network Adapter Teaming
To increase redundancy across cluster network interfaces, Voyager increased the number of network adapter cards and used network adapter teaming. Teaming gave the cluster design two core technology advantages: adapter load balancing and adapter fault tolerance.
Adapter Load Balancing (ALB) increases any server's network transmission throughput. Adapters are grouped into teams and these cards combine to present an aggregated bandwidth capacity. It is possible to increase the number of adapters but note that all the cards must be linked to the same network switch or to the same network segment. Remember you also need enough space in your server to house multiple cards so good planning - before you order your servers - is required.
In practice, Voyager used four network interface cards in each cluster server. Teaming can only be used to configure the public network adapters. You must never do this for the private network adapters. Two gigabit cards were reserved for the cluster public network adapters and two 100 Megabit cards for the private network adapters.
Voyager configured the public network adapters with Adapter Fault Tolerance (AFT). AFT is a fail-safe approach to increase the reliability of server connectivity. Basically it provides the ability to set up link recovery to the server adapter in case of a cable failure. This recovery also extends to port failures or even network interface card failures. It is possible to use up to eight cards in this teaming configuration but two cards were adequate to cater for network card redundancy and ensure server availability.
When AFT is configured, the two adapters form a single virtual adapter. An additional local area connection will appear in your network settings dialog box on your server. This new virtual connection is the only local area connection that requires any TCP/IP configuration, such as DNS, WINS etc. If you inspect the TCP/IP Protocol settings for each individual card in the team, you will discover they are disabled for manual configuration.
Always rename the new local area connection to something more meaningful as you are going to end up with multiple adapters by the time you have finished configuring the cluster and it really helps to have meaningful names assigned to them. This is especially useful when you install the cluster service and need to identify adapters in the cluster setup wizard. Also make sure you enable the network icon in the system tray. It is useful to see the activity on the cluster interfaces for troubleshooting purposes. The private network adapters were left unaltered. These form the private connection for the cluster and cannot be aggregated in any fashion.
Managing Public and Private Cluster Interfaces
The nodes in a Server Cluster are connected and communicate using one or more independent networks. These cluster interfaces are called private and public interfaces. You may also see them referred to as public or private adapter interfaces. Typically, each cluster node uses one private interface and one public interface.
The Cluster Service keeps track of the state of the service groups within a cluster and decides when a group and its resources should fail over to an alternate node. This communication takes the form of messages that are sent between the two nodes within the cluster. These messages are called heartbeats and occur regularly. When a cluster misses two heartbeats, it registers a failure of service and initiates a failover action for either an application or service.
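The missed-heartbeat rule can be illustrated with a small sketch. The two-heartbeat limit comes from the description above; the one-second interval used in the example is purely a placeholder chosen for easy arithmetic, and the function names are hypothetical rather than anything exposed by the Cluster Service.

```python
def count_missed(last_seen: float, now: float, interval: float) -> int:
    """Number of whole heartbeat intervals that have passed with no heartbeat."""
    return max(0, int((now - last_seen) // interval))


def should_fail_over(last_seen: float, now: float, interval: float,
                     missed_limit: int = 2) -> bool:
    """True once the peer has missed `missed_limit` consecutive heartbeats."""
    return count_missed(last_seen, now, interval) >= missed_limit


# Placeholder interval of 1.0 second; the last heartbeat arrived at t = 0.0.
one_missed = should_fail_over(last_seen=0.0, now=1.5, interval=1.0)
two_missed = should_fail_over(last_seen=0.0, now=2.5, interval=1.0)
```

With these placeholder numbers, a single late heartbeat is tolerated, but a gap of two full intervals is treated as a failure.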
For geographically dispersed clusters, the heartbeat roundtrip time must be guaranteed at less than 500ms. Voyager separated the cluster nodes geographically so there was a concern that this might introduce increased roundtrip delays for the cluster's heartbeat and sporadically initiate cluster failover procedures. I call this "heart palpitations." In practice, the gigabit network did provide sufficient speed to ensure heartbeat communications functioned correctly, but this may not be the case for many standard WAN architectures. Make sure you test and ensure your network can sustain the roundtrip communications requirements. In addition, do not be misled into believing that you can avoid these heartbeat issues by adjusting the heartbeat polling interval as this interval cannot be changed in Server Clusters.
If you are concerned about roundtrip latency issues with your network infrastructure, it can be easily checked. Microsoft has provided a tool to verify the latency of heartbeat messages between your cluster nodes. Basically the tool monitors the round-trip latency of UDP packets between nodes in your cluster. If it discovers that the latency between nodes is above the pre-defined value, it writes a warning message to a log file. For more information, including the latency test tool, see the following Microsoft website and download the Geographically Dispersed Cluster Update: http://www.microsoft.com/hwdq/hwtest/devices/systems.asp?area=SysSrv-Clstr.
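To make the idea concrete, here is a minimal sketch of the same kind of check: time a UDP round trip and flag any sample above a threshold. It is not the Microsoft tool and does not reproduce its packet format or log layout; the function names, the echo helper and the warning text are assumptions for illustration. The 500 ms limit is the figure quoted earlier in this section.

```python
import socket
import threading
import time

THRESHOLD_MS = 500.0  # round-trip limit quoted for geographically dispersed clusters


def echo_once(sock: socket.socket) -> None:
    """Stand-in for the remote node: return one datagram to its sender."""
    data, addr = sock.recvfrom(64)
    sock.sendto(data, addr)


def measure_rtt_ms(target: tuple[str, int], timeout_s: float = 2.0) -> float:
    """Send one UDP datagram and time how long the echo takes to come back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout_s)
        start = time.perf_counter()
        client.sendto(b"ping", target)
        client.recvfrom(64)
        return (time.perf_counter() - start) * 1000.0


def latency_warnings(samples_ms: list[float],
                     threshold_ms: float = THRESHOLD_MS) -> list[str]:
    """Build one warning line for every sample above the threshold."""
    return [
        f"WARNING: round trip {ms:.1f} ms exceeds {threshold_ms:.0f} ms"
        for ms in samples_ms
        if ms > threshold_ms
    ]


def loopback_demo() -> float:
    """Measure one round trip against an echo socket on the local machine."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))          # let the OS pick a free port
    threading.Thread(target=echo_once, args=(server,), daemon=True).start()
    try:
        return measure_rtt_ms(server.getsockname())
    finally:
        server.close()
```

Calling `latency_warnings([12.0, 640.0, 499.9])` yields a single warning for the 640 ms sample. In a real test the echo side would run on the remote node and samples would be collected over an extended period, since a WAN link that is fast on average can still show occasional spikes.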
The Voyager Topology
In the Voyager scenario, the cluster nodes communicate geographically through a meshed gigabit Ethernet topology. This environment was constructed using Gigabit Ethernet switches, routers and associated fibre links connecting remote locations. Recall that clustering is not designed for geographic dispersed networks, so the public and private cluster interfaces still need to believe they are on the same IP network segments. This was achieved by creating virtual network segments for the public and private interfaces using VLANs. The cluster public interface communicates with the general network using a virtual router configured within the VLAN. The virtual router simply acts as an IP gateway for the VLAN just as a standard router routes IP traffic in a standard TCP/IP subnet environment.
This general VLAN configuration is illustrated below:
Virtual LAN Configuration for Geo-Dispersed Cluster
Extending the Cluster with SAN Storage
Storage Area Networks (SANs) have become an affordable means of physically connecting the same storage to multiple computers concurrently. The SAN was originally developed to address the storage issues of centralised management, high availability, scalability and performance. In Voyager's case, the SAN environment is composed of numerous disks, servers and a common switched fibre channel storage fabric. Utilising fibre channel technology allows greater speed and flexibility and SAN data transfers happen in the gigabit range as opposed to the older SCSI interfaces which operate in the megabits range.
Many organizations use SAN architecture for providing managed storage across multiple computer systems. For Voyager, the SAN environment was already in use for mainframe storage. This configuration was altered to include additional disk storage for the Windows 2000 Server Cluster. Since the Windows 2000 operating system requires a dedicated disk to access, LUN masking was used to divide the SAN storage into isolated partitions. In this way, multiple operating systems could continue to access dedicated storage and remain unaware that they are sharing common storage architecture. You can also use zoning to perform a similar operation.
From the SAN perspective, four disk resources were allocated to act as cluster physical disk resources. The first disk was allocated as the Quorum physical disk. The Quorum disk stores cluster configuration information including database checkpoints and log files. It needs to be a minimum of 50 MB in size, but a size of 500 MB is definitely recommended. Additional physical disk resources were allocated from the SAN for cluster file sharing, print spooling and DHCP services. Keep in mind that the DHCP disk does not have to be large, as it will only store the DHCP database. Look at the current DHCP database size to help you plan your disk size limit. The majority of Voyager's disk space went to presenting two physical disks as resources for the clustered file sharing and print spooling functions. Extra storage was reserved for future Exchange 2000 and SQL 2000 applications. Always keep in mind that you will need to determine the amount of disk space you are planning to host across the cluster and that it is wise to allocate spare capacity.
The Voyager configuration is shown below:
Finally, always assign meaningful names and designations to cluster resources, for example Q: as the Quorum disk, to allow for easier management and troubleshooting. As an example, the Voyager server cluster hosted a number of Virtual Servers supporting virtual network adapters and implemented across a number of virtual LANs using virtual routing. Meaningful names became critical to constructing, troubleshooting and administering this environment, so take the time to plan this correctly!
Identifying LUNS in Windows 2000 or NT 4.0
Voyager's SAN is being used to host the storage requirements of several different systems, including mainframe storage requirements. This storage diversity demands specialized configuration of the disks (or logical devices) that are grouped and presented as basic disks to the Windows 2000 operating system.
Voyager found that Windows 2000 must locate a disk assigned as LUN 0 before it will scan and identify any disks with LUNs greater than 0. In order to correlate certain logical devices and LUN schemes, Voyager chose to assign a small disk or logical partition as LUN 0. This allowed Windows 2000 to always identify the other non-zero LUNs.
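The rule Voyager ran into can be expressed as a simple pre-flight check on a LUN assignment plan. The function below is a deliberately simplified model of the behavior described above, not a description of how the Windows storage stack scans a bus; the function name and LUN numbers are hypothetical.

```python
def visible_luns(presented: set[int]) -> set[int]:
    """LUNs the host would discover if LUN 0 must be found before any others."""
    return set(presented) if 0 in presented else set()


# Hypothetical assignment plans for one cluster node.
without_lun0 = visible_luns({1, 2, 3})       # nothing is discovered
with_lun0 = visible_luns({0, 1, 2, 3})       # small placeholder disk at LUN 0
```

Running a check like this against the storage team's assignment sheet before the disks are presented can save a round of troubleshooting on the server side.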
For additional information go to the following web pages on the Microsoft web site (see Microsoft Knowledge Base article 162471): http://support.microsoft.com/default.aspx?scid=kb;en-us;162471&sd=tech.
Managing Host Bus Adapters
Voyager's server cluster computers communicate with the SAN using fibre channel Host Bus Adapters (HBAs). An HBA is an interface card that is installed in the cluster server for managing traffic between the server and the switched storage fabric. Two HBA cards were installed within each cluster server node to cater for single adapter failures.
To manage redundancy with the dual Host Bus Adapters, Voyager implemented third-party software to create redundant paths between the server and the storage area network. Multiple data paths became available through the dual host bus adapters and a switched storage fabric. In the event of a component failure, the redundancy software automatically switched the data (sometimes called I/O path switching) through an alternative data path to ensure server availability.
The server cluster software should never see multiple data paths if the multi-path drivers are working correctly. In addition, any RAID systems you purchase should ship with multi-path drivers that have Windows 2000 certification. This will ensure that they are fully supported in your environment.
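I/O path switching can be pictured as choosing the first healthy route from an ordered list of paths. The sketch below is a generic illustration under that assumption; it does not describe the third-party product Voyager used, and the `Path` type, adapter names and switch names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One route from the server to the SAN (hypothetical model)."""
    hba: str
    switch: str
    healthy: bool = True


def select_path(paths: list[Path]) -> Path:
    """Return the first healthy path, or raise if every path has failed."""
    for path in paths:
        if path.healthy:
            return path
    raise RuntimeError("no healthy path to storage")


# Two HBAs, each cabled to a different fibre channel switch.
paths = [Path("HBA-1", "Switch-A"), Path("HBA-2", "Switch-B")]

first_choice = select_path(paths).hba     # normal operation uses the first path
paths[0].healthy = False                  # simulate an adapter or switch failure
after_failure = select_path(paths).hba    # I/O moves to the alternate path
```

Because the selection happens below the disk layer, the cluster software continues to see a single disk regardless of which path is carrying the I/O.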
The figure below shows the Voyager configuration:
Voyager Configuration for Delivering Multiple Data Paths (in blue)
Managing Physical Disk Resources within the Cluster
SAN disk storage presents some interesting issues with servers in the cluster. In particular, be careful of the effect of running Chkdsk on SAN-supplied disks. If a server file system determines that the disk is dirty, it will start Chkdsk across the disk. This may not be desirable, as some of the disks are hundreds of gigabytes in size and the check would take a long time to complete. However, it is not good operational procedure to disable Chkdsk. It runs for a reason, and disabling Chkdsk on a disk that requires it can lead to data integrity issues.
Windows 2000 includes enhanced disk resource private properties when you are using Server Clusters. These enhancements provide you with the ability to control when Chkdsk is run against a cluster disk. Remember, it is not a good idea to disable Chkdsk unless absolutely necessary.
You can obtain more information from the Microsoft web site at (see Microsoft Knowledge Base article 223023): http://support.microsoft.com/default.aspx?scid=kb;en-us;223023&sd=tech.
Dynamic Disk Support for Server Clusters
Windows 2000 supports dynamic disk capabilities for non-clustered disks. However the server cluster is a different matter. When you install the cluster service, only basic disks are available for use as physical disk resources. This limits you from dynamically altering the disk capacity should the business need arise. To allow the cluster service to view and use dynamic disks, Voyager installed Veritas Volume Manager (VVM). Veritas is a third-party product that extends the Windows 2000 Logical Disk Manager and provides increased disk management flexibility.
When implementing dynamic disks for your server cluster, it is good practice to first install and configure your server cluster resource groups using the basic physical disk resources. Once basic cluster operations are verified, identify and remove the basic disk resources that you wish to convert to dynamic disks. There is no need to delete any of the Network Name or IP Address resources from your resource groups; you can use them again when you introduce the new dynamic disks. You can now install your third party software to add the dynamic disk features to your server cluster. Once this is completed, you can recreate the new physical disk resources within your cluster and bring your resources back online.
Voyager chose to use Veritas Volume Manager for dynamic disk support in the server cluster. Dynamic disk conversion was a straightforward process using the Volume Manager Dynamic Upgrade Disk Wizard, and the cluster easily recognized the dynamic disks as new physical disk resources. It was also a simple process to extend the disk sizes using the Extend Volume Wizard. If you choose to use Veritas Volume Manager in your own environment, keep in mind that Veritas should be your first point of contact for cluster support issues.
Conclusion
Ensuring high availability and scalability of applications and network services has become a key deliverable in the provision of information technology services to organizations. This goal can be assisted with the introduction of Microsoft Server Clusters across either the Windows 2000 Advanced Server or Datacenter Server environments. Additional provisions for disaster recovery can be added using geographic dispersion, and redundancy can be enhanced by introducing additional hardware.
The key to a successful deployment using Microsoft Server Clusters is a detailed design. Regardless of the cluster's size, ensure that you dedicate sufficient planning to address the issues of equipment procurement, HCL compliance, introduced network complexities, provision for hardware redundancy and the testing of failure scenarios. The results are well worth the time and effort!
Appreciation is extended to Gerard O'Neill, Alan Watts, and Astrid McClean from Microsoft Consulting Services and to Chris Whitaker from the Microsoft Enterprise Server Product team. Their assistance was invaluable for the writing and technical review of this article and for the real world project on which the article is based.