Chapter 2 - Achieving High Reliability

12/05/2007

No system is perfectly reliable, nor can any system design anticipate all possible modes of failure. For these reasons reliability is usually expressed as an availability percentage such as Five-9s, indicating 99.999 percent availability or an expected aggregate outage of roughly five minutes per year. Microsoft cooperates with hardware vendors to define vendor-specific best practice configurations that can be certified to deliver Five-9s level of service or better, if required.
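The relationship between an availability percentage and the expected outage time is simple arithmetic. The following minimal sketch shows the conversion; the function name and the use of an average 365.25-day year are assumptions made for illustration.

```python
# Convert an availability percentage into expected downtime per year.
# Assumption: an average year of 365.25 days (leap years included).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the expected aggregate outage in minutes per year."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    levels = [("Three-9s", 99.9), ("Four-9s", 99.99), ("Five-9s", 99.999)]
    for label, percent in levels:
        minutes = annual_downtime_minutes(percent)
        print(f"{label} ({percent}%): {minutes:.2f} minutes per year")
```

Under these assumptions, Five-9s works out to a little over five minutes per year, which matches the figure quoted above. Each additional 9 reduces the permitted outage by a factor of ten.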

The Windows Server System was designed to deliver high availability and reliability, and does so through the following means:

Protected application memory space. The Windows Server guarantees that no other application can violate an application's protected memory space. The .NET Framework also imposes constraints on rogue processes and includes mechanisms to proactively validate the correct behavior of processes. (See Chapter 6 for a discussion of the .NET Framework.)
Support for failover processing. The Windows Server supports server clusters of up to eight nodes for any server application, with failure or maintenance requirements triggering other nodes to immediately begin providing failover service. Windows Server 2003 also supports network load balancing (NLB) that balances incoming Internet Protocol (IP) traffic across nodes in a cluster.
Economically feasible redundancy. Although UNIX system vendors offer fully redundant processing configurations, these are often high-cost items. The substantially lower cost of Windows Server Systems means that redundant configurations are often a viable and cost-effective option.
Automated Deployment Services (ADS). This allows for the automated and remote distribution of software and patch upgrades, minimizing the amount of downtime and reducing the risk of human error in maintenance processes.
Policy-based management. This feature enables one-to-many management, allowing the management of very large distributed systems environments on a policy or profile basis, rather than by individual system or user.
Windows Management Instrumentation (WMI). WMI provides access to more than 10,000 system objects through application, scripting, and command line interfaces, allowing them to be finely monitored, controlled, and reported.
Troubleshooting features. These features include built-in performance monitoring, logging, tracing, and system recovery capabilities to enable quick troubleshooting and the resolution of abnormal operating conditions.
Microsoft Operations Manager (MOM). MOM provides knowledge packs for specific Windows Server components and other Microsoft server products that use information regarding product-specific and component-specific events to filter, determine the severity of, and help identify and develop solutions for these events.

A highly reliable system exhibits a variety of positive attributes as listed in Table 2.1.

Table 2.1: Reliability Attributes

Attribute	Description
Resilient	The capability of the computing environment to continue operating in a fully functional state, without degradation caused by hardware or software failure.
Recoverable	In instances where outages do occur, the capability of the environment to be restored to full capacity with minimal loss of data.
Controllable	The ability to provide timely and expected services as required.
Uninterruptible	The ability to implement required changes and maintenance procedures without disruption of services.
Production-ready	As delivered, the software environment contains a minimum number of bugs and requires a limited number of predictable patches or fixes.
Predictable	The behavior of the computing environment is consistent, that is, what worked previously will continue working, and all instances of a service will operate in the same way.

The key to a reliable design is to identify and address single points of failure, those places where the failure of a single component causes the entire system to fail. A production server is a complex system and many factors affect its reliability, including environment, communication links, software, and hardware. Each of these factors can potentially be the source of a single point of failure.

Redundancy is a means to address single points of failure. Hardware manufacturers provide systems with redundant power supplies and interfaces, software can create redundant storage systems, and operational procedures can include contingency plans.

Adding massive redundancy alone does not automatically eliminate all single points of failure. A Web hosting facility may contract with two independent Internet service providers, believing that this will prevent a communication failure. However, the two providers may both carry network traffic to the facility on the same cable. In this instance a single physical disruption will still result in an outage. This example demonstrates that a reliable design must take into account many possible failure modes and scenarios.
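The Internet service provider example can be checked mechanically if each redundant path is recorded as the set of components it depends on. Any component that appears in every path is still a single point of failure. The following is a minimal sketch of that check; the function name and the component names are hypothetical and exist only for illustration.

```python
from typing import Dict, Set


def single_points_of_failure(paths: Dict[str, Set[str]]) -> Set[str]:
    """Return the components that every redundant path depends on."""
    if not paths:
        return set()
    # A component shared by all paths defeats the redundancy between them.
    return set.intersection(*paths.values())


# Hypothetical inventory: two providers that share one physical cable.
internet_paths = {
    "provider_a": {"router_a", "provider_a_backbone", "street_cable_1"},
    "provider_b": {"router_b", "provider_b_backbone", "street_cable_1"},
}

shared = single_points_of_failure(internet_paths)
print("Shared components:", sorted(shared))
```

In this inventory the shared cable is reported even though the two providers are otherwise independent. The check is only as good as the inventory behind it, so it complements a design review rather than replacing one.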

Regardless of the features and functions available, technology alone does not guarantee reliability. Professionalism in continuing operations and the management of an enterprise computing facility are key requirements for high reliability.

For example, when a redundant component fails, some process must recognize the failure and repair or replace the component. Likewise, when the system is upgraded or new software is installed, operations personnel must follow procedures that specify a safe method for conducting the change. When migration to Windows is considered, the disciplines and processes traditionally associated with UNIX or any production system must also be migrated to ensure success and reliability.

Reliability Overview

UNIX system manufacturers provide high-end servers that incorporate redundant hardware features such as power supplies, cooling, network, and other device interfaces, including provisions for handling processor failure in multiprocessor systems. These are generally proprietary mechanisms, and the software to support them is specific to each manufacturer.

Although Microsoft does not manufacture server hardware, the Windows Server System provides the means for hardware manufacturers to support their redundancy features in a common user interface.

For more information on Microsoft Datacenter Server certification for OEMs, refer to:
https://www.microsoft.com/windowsserver2003/datacenter/dcprogram.mspx

Database Journaling for Storage Redundancy

Database servers are one major category of server systems, and database software has provided storage redundancy through transaction journaling for many years. This redundancy defends against simple medium failure or can even be the basis of a remote site disaster recovery plan.

A database system, such as SQL Server or Oracle, commits database updates to a transaction journal and to the main database storage location. The transaction journal and the main database storage occupy physically distinct media. Combined with periodic database backups, this protects against any single storage medium failure. Transaction journals also provide some protection against corruption caused by application software failure by storing a history of changes within the database. Corruption introduced by application errors can often be corrected by rolling transactions back to a point in time before the error.

Transaction journals and backups conveyed to a remote site, together with spare hardware, provide a defense against a complete site failure such as those caused by a natural or man-made disaster.

These basic storage redundancy methods do not by themselves provide high availability because the associated failure-recovery techniques can be time-consuming. Recovery from a medium failure involves restoring a backup and reapplying transaction journals to bring the restored data up to date.
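The recovery procedure described in this section can be illustrated with a toy model. The following sketch is not how SQL Server or Oracle implement journaling; it is an in-memory stand-in, with hypothetical class and function names, that shows a backup being restored and the journal being reapplied, optionally stopping before a known bad update.

```python
import copy


class JournaledStore:
    """Toy key-value store that records every update in a journal."""

    def __init__(self):
        self.data = {}     # stands in for the main database storage
        self.journal = []  # stands in for the journal on separate media

    def put(self, key, value):
        self.journal.append((key, value))  # record the change in the journal
        self.data[key] = value             # apply it to main storage

    def backup(self):
        """Return a snapshot and the journal position it corresponds to."""
        return copy.deepcopy(self.data), len(self.journal)


def recover(snapshot, position, journal, stop_before=None):
    """Restore the backup, then reapply journal entries made after it.

    stop_before limits the replay to entries before that journal index,
    which models rolling back to a point in time before a bad update.
    """
    data = copy.deepcopy(snapshot)
    end = len(journal) if stop_before is None else stop_before
    for key, value in journal[position:end]:
        data[key] = value
    return data


store = JournaledStore()
store.put("balance", 100)
snapshot, position = store.backup()
store.put("balance", 150)
store.put("balance", -1)  # update written by a faulty application
store.data = None         # simulate loss of the main storage medium

up_to_date = recover(snapshot, position, store.journal)
before_error = recover(snapshot, position, store.journal, stop_before=2)
print(up_to_date, before_error)
```

The replay loop is the step that makes recovery slow in practice: the longer the interval since the last backup, the more journal entries must be reapplied before the database is available again.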

Database Replication

Modern database server systems, including Microsoft SQL Server and Oracle, provide features supporting high availability through database replication. Replication may take the form of a warm standby, where a copy of the database occupies a physically remote server and transaction journals are transmitted periodically from the active server. It may also be a fully live system, where each transaction is applied to multiple copies of the database in real time.

For more information on database replication on Microsoft SQL Server, refer to:
https://msdn.microsoft.com/library/default.asp?url=/library/en-us/replsql/replover_694n.asp

Redundant Arrays of Inexpensive Disks (RAID)

UNIX vendors provide both hardware and software implementations of redundant arrays of inexpensive disks (RAID). For example, Sun Microsystems has licensed the Veritas Volume Manager and ships it bundled with its Solaris UNIX system. Enterprise disk arrays are available from a variety of vendors including Sun, IBM, and EMC. These products provide various redundancy and performance options including mirroring and distributed parity.

For customers who require specific features offered by the Veritas, IBM, or EMC products, these vendors provide Windows Server-based versions. However, Windows Server includes an integrated software implementation of mirroring (RAID 1) and distributed parity (RAID 5), which may be more cost-effective. These features provide protection against single medium failures, but incur some processor overhead because of the additional input/output operations required.
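The principle behind parity-based protection can be shown in a few lines. The following sketch is not the Windows Server implementation; it computes an XOR parity block for one stripe of data and uses it to rebuild a lost block. The function names and sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_block(data_blocks):
    """Compute the parity block for one stripe of data blocks."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Reconstruct the single missing data block of a stripe."""
    return xor_blocks(list(surviving_blocks) + [parity])


# One stripe spread across three data disks, plus its parity block.
stripe = [b"ABCD", b"EFGH", b"IJKL"]
parity = parity_block(stripe)

# Simulate losing the second disk and rebuilding its block.
rebuilt = rebuild_missing([stripe[0], stripe[2]], parity)
print(rebuilt == stripe[1])
```

In RAID 5 the parity block for each stripe is placed on a different disk in rotation, which is why the parity is described as distributed. Every write must also update the parity, which accounts for the additional input/output operations noted above.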

Windows Server also supports disk array products from a variety of other vendors including system manufacturers such as Dell and Compaq. The Windows Server disk management tool provides a common interface to control both software RAID and integrated hardware disk arrays.

For more information on RAID levels, refer to:
https://msdn.microsoft.com/library/default.asp?url=/library/en-us/optimsql/odp_tun_1_87jm.asp

Clusters

Protection against hardware failures, whether in data storage, communication, or nodes, is not by itself adequate to provide high availability. According to IBM’s High Availability Cluster Multi-Processing for AIX Concepts and Facilities Guide, hardware failures account for only a small percentage of unplanned outages. Other contributing factors include operator errors, environmental problems, and application and system software errors.

A cluster is a collection of systems that appear to users as a single computing resource. Any member of the cluster can service a client request without the client knowing which system performed the operation. Depending on application architecture, clusters can be configured so that an application fails over from one cluster member to another, or to ensure that the load is shared between operating cluster members.
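The two behaviors described here, load sharing and failover, can be sketched together. The following toy model is not how Windows Server clustering or any vendor's cluster software is implemented; it only illustrates requests being spread across healthy members and redirected when a member fails. The class and node names are hypothetical.

```python
class Cluster:
    """Toy cluster that routes requests round-robin to healthy nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._next = 0  # index of the next node to try

    def mark_failed(self, node):
        self.healthy.discard(node)

    def mark_recovered(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def route(self):
        """Return the node that should serve the next request."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


cluster = Cluster(["node1", "node2", "node3"])
before_failure = [cluster.route() for _ in range(3)]

cluster.mark_failed("node2")  # the remaining nodes absorb its share
after_failure = [cluster.route() for _ in range(3)]
print(before_failure, after_failure)
```

The client calls only `route()` and never learns which member served the request, which is the property that lets a cluster appear as a single computing resource.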

Large-scale UNIX system vendors offer clustering as a means to address all of the potential failures server systems are likely to encounter. Windows Server 2003 provides clustering in both load balancing and failover configurations. Clustering includes redundant hardware, automated monitoring, system error recovery, and well-planned processes for maintaining both hardware and software while keeping the application available. Vendors such as EMC offer remote storage replication products that integrate with clustered systems, providing business continuity for large-scale systems.

For more information on clustering in Windows Server 2003, refer to:
https://www.microsoft.com/technet/prodtechnol/windowsserver2003/technologies/clustering/default.mspx

For more information on IBM's High Availability Cluster Multi-Processing for AIX Concepts and Facilities Guide, refer to:
https://publib.boulder.ibm.com/epubs/pdf/c2348643.pdf

Windows Server System Reliability

The Windows Server System is designed to deliver high reliability and to support the needs of the IT professionals who manage delivery of the specified reliability service level. The core of the Windows Server System is Windows Server 2003. Each edition of Windows Server 2003 supports the indicated level of load balancing and clustering features:

Windows Server 2003, Standard Edition: NLB
Windows Server 2003, Enterprise Edition: NLB, 8-node Server Clustering
Windows Server 2003, Datacenter Edition: NLB, 8-node Server Clustering
Windows Server 2003, Web Edition: NLB

Windows Server 2003 Reliability

The Windows Server System has the following built-in reliability-related capabilities:

Network Load Balancing. This provides failover support for IP-based applications and services that require high scalability and availability.
Component Load Balancing. This provides dynamic load balancing of middle-tier application components. Component Load Balancing components balance loads among multiple nodes to dramatically enhance the availability and scalability of software applications.
Server Clustering. This provides failover support for applications and services that require high availability, scalability, and reliability. With clustering, organizations can make applications and data available on up to eight servers linked together in a cluster configuration. Back-end applications and services, such as those provided by database servers, are ideal candidates for server clustering.

Typically, all three of these capabilities are used in combination. For example, an e-commerce site may have front-end Web servers using NLB, middle-tier application servers using Component Load Balancing, and back-end database servers using Server Clustering.

The Windows Server System allows full remote creation and configuration of server clusters, and nodes can be added to an existing server cluster from a remote management station. Any resulting drive letter changes and physical disk resource failovers are propagated to Terminal Server client sessions.

In the Windows Server System market, many different hardware vendors offer a variety of high-reliability features and options. This allows the customization of the level of reliability support purchased to range from moderate to ultra-high, on an application-by-application basis.

The Windows Datacenter High Availability Program complements the Windows Server 2003, Datacenter Edition operating system. The Datacenter High Availability Program has a unique support and services model built for customers with mission-critical server requirements and consists of the following components:

Windows Server 2003, Datacenter Edition or Windows 2000 Datacenter Server
Qualified Datacenter Configurations
Certified Datacenter Support Providers
High Availability Support Program

Critical peripherals must also be fault-tolerant. The capability of Windows Server to support the quality and features of peripherals is no different from that of UNIX environments. However, if extra high availability equipment or facilities are required, the ability to justify them is improved when system purchase costs are significantly lower, as they are with the Windows Server System.

For more information on Windows Server 2003, refer to:
https://www.microsoft.com/windowsserver2003/

Storage Reliability

Windows Storage Server is dedicated file server software that provides reliable and highly available Network Attached Storage (NAS) capabilities. Windows Storage Server simplifies storing and managing data, making it available to local and remote users as necessary. It protects important files from loss, disaster, or unauthorized access. Windows Storage Server includes advanced availability features such as point-in-time data copies, replication, and server clustering.

Windows Storage Server also offers support for Storage Area Networks (SAN) and a hybrid solution where NAS devices can be attached to a SAN. In this configuration, a NAS head or gateway (containing the filing functionality) attaches to the Local Area Network (LAN), and behind that lies a back-end SAN consisting of a Fibre Channel network and storage disks.

For more information on Windows Storage Server, refer to:
https://www.microsoft.com/windowsserversystem/wss2003/default.mspx

Database Reliability

For years, Oracle has been the standard relational database management system (RDBMS) for UNIX systems. The Windows Server System equivalent of Oracle is SQL Server 2000, an RDBMS that is capable of supporting some of the largest, most critical, and highest performing applications in the world.

The SQL Server High Availability Series provides prescriptive guidance for IT professionals in overcoming many kinds of barriers to database reliability, including environmental, hardware, communication and connectivity, software, service, process, application design, and staffing barriers.

In addition to very high performance relational transaction services, SQL Server also offers a suite of data analysis tools necessary to interpret data and optimize database performance. These include integrated Online Analytical Processing (OLAP) and Relational OLAP (ROLAP) based data warehousing, business intelligence, data mining, and visualization capabilities.

For more information on Microsoft SQL Server and the High Availability Series, refer to:
https://www.microsoft.com/sql/

Application Reliability

Modern applications are typically delivered through networks, either to the public through the Internet or to internal users through a corporate intranet. Microsoft Internet Information Services (IIS), with its associated .NET development environments in Microsoft Visual Studio®, is the new mainstream delivery platform for enterprise applications.

IIS is a full-featured Web server that provides a reliable and secure method of simplifying the manner in which Web sites are developed and delivered. It includes support for Active Server Pages (ASP), the .NET Framework and Extensible Markup Language (XML) Web Services, and Internet protocol version 6.

IIS offers a dedicated application mode, which runs all application code in an isolated environment. IIS also supports Web gardens, in which a set of equivalent processes on a computer each receives a share of the requests normally served by a single process, achieving better multi-processor scalability.

For more information on Internet Information Services, refer to:
https://www.microsoft.com/windowsserver2003/iis/

Application Center is the Microsoft deployment and management tool for high availability clustered Web applications. Application Center creates Web sites that are scalable, robust, easy to administer, and secure. It reduces application management complexity, replicates any changes made to a server by updating other servers in the cluster, automates the deployment of applications from one server to another, and allows applications to achieve on-demand scalability.

For more information on Application Center, refer to:
https://www.microsoft.com/applicationcenter/

Sources for Detailed Guidance

For information on reliability, refer to:
https://www.microsoft.com/mscorp/twc/reliability/default.mspx

For information on Windows Server 2003 clustering, refer to:
https://www.microsoft.com/windowsserver2003/technologies/clustering
