Chapter 2: Achieving High Reliability

12/05/2007

Microsoft currently works with nearly all major hardware vendors to define vendor-specific best practice configurations that can be certified to deliver 99.999 percent reliability (also known as Five-9s level of service) or better, if required.
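To make the Five-9s figure concrete, the following minimal Python sketch converts an availability percentage into the downtime it permits per year. The function name and the 365-day year are assumptions made for illustration; they are not part of any Microsoft certification procedure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum yearly downtime allowed by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes/year")
# 99.9% -> 525.60 minutes/year
# 99.99% -> 52.56 minutes/year
# 99.999% -> 5.26 minutes/year
```

Under these assumptions, a 99.999 percent target leaves slightly more than five minutes of total downtime per year, planned maintenance included.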

Regardless of the features and functions available, technology alone does not guarantee reliability. The factor that contributes the most to reliability is the culture of professionalism ingrained in the operation and management of an enterprise computing facility. When migration to Windows is considered, the disciplines and processes traditionally associated with the mainframe must also be migrated to ensure success and reliability.

The Windows Server System was designed to deliver high availability and reliability, and does so through the following means:

Protected application memory space. Windows Server guarantees that no other application can violate an application's protected memory space. The .NET Framework also imposes constraints on rogue processes and includes mechanisms to proactively validate the correct behavior of processes.
Support for failover processing. Windows Server supports server clusters of up to eight nodes for any server application; when a node fails or is taken down for maintenance, the other nodes immediately begin providing failover service. Windows Server 2003 also supports Network Load Balancing (NLB), which balances incoming Internet Protocol (IP) traffic across the nodes in a cluster.
Economically feasible redundancy. Although fully redundant processing configurations are often not economically feasible in the mainframe environment, the substantially lower cost of Windows Server Systems means that redundant configurations are often a viable option.
Automated Deployment Services (ADS). This allows for the automated and remote distribution of software and patch upgrades, minimizing the amount of individual downtime and reducing the risk of human error in maintenance processes.
Policy-based management. This feature enables one-to-many management, allowing the management of very large distributed systems environments on a policy or profile basis, instead of by individual system or user.
Windows Management Instrumentation (WMI). WMI provides access to more than 10,000 system objects through application, scripting, and command line interfaces, allowing them to be finely monitored, controlled, and reported.
Troubleshooting features. These features include built-in performance monitoring, logging, tracing, and system recovery capabilities to enable quick troubleshooting and the resolution of abnormal operating conditions.
Microsoft Operations Manager (MOM). MOM provides knowledge packs for specific Windows Server components and for other Microsoft server products. These packs apply product-specific and component-specific knowledge to filter events, determine their severity, help identify their causes, and develop solutions for them.
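The value of the "Economically feasible redundancy" item in the list above can be estimated with a standard reliability formula. The sketch below assumes that the redundant units fail independently and that any single surviving unit can carry the full load; both are simplifying assumptions, and the function name is hypothetical.

```python
def combined_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` redundant servers when one survivor is enough.

    Assumes independent failures, so the configuration is down only when
    every unit is down at the same time.
    """
    if units < 1:
        raise ValueError("at least one unit is required")
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return 1.0 - (1.0 - unit_availability) ** units


for units in (1, 2, 3):
    print(units, f"{combined_availability(0.99, units):.6f}")
# 1 0.990000
# 2 0.999900
# 3 0.999999
```

In this idealized model, each additional 99 percent unit adds roughly two nines, which is why lower per-unit cost makes redundancy an attractive route to high availability. Real configurations share components such as power, network, and storage, so actual gains are smaller.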

A highly reliable system exhibits a variety of positive attributes; these are listed in Table 2.1.

Table 2.1: Reliability Attributes

Attribute	Description
Resilient	The capability of the computing environment to continue operating in a fully functional state, without degradation caused by hardware or software failure.
Recoverable	In instances where outages do occur, the capability of the environment to be restored to full capacity with minimal loss of data.
Controllable	The capability to provide timely and expected services as required.
Uninterruptible	The capability to implement required changes and maintenance procedures without disruption of services.
Production-ready	As delivered, the software environment contains a minimum number of bugs and requires a limited number of predictable patches or fixes.
Predictable	The behavior of the computing environment is consistent, that is, what worked previously will continue working.

This chapter mainly focuses on the "resilience" factor, the attribute most commonly associated with reliability of the mainframe and Windows Server System environments.

Mainframe Reliability

Mainframes have a reputation for being highly reliable. They represent mature technology that is stable and not prone to failure, allowing most computing centers to operate mainframes without active failover mechanisms enabled for noncritical applications that can tolerate modest amounts of downtime.

For applications less tolerant of downtime, multiple processing units can usually fulfill the need for failover. Should a processor unit require maintenance or suffer hardware or software failure, the load carried by that unit is absorbed by other units. This is accomplished by transferring the processing load of less critical applications to another machine or logical partition.

Generally, mainframe operating systems have built-in support for processor-related failover, while DASD-related failures are usually mitigated by software in the control units that manage such drives, and by the high parallelism inherent in disk farms and Redundant Array of Inexpensive Disks (RAID) technology. Applications can thus continue to function while a particular drive is unavailable because of maintenance or because of a failure.

Failures that cause the loss of a complete processor complex, a disk farm, or the physical data center cannot be addressed as easily. In this case, prudent businesses have a contingency plan to continue application processing at another site.

This site may be an additional processing center that is normally used by the business, or may be a shared facility whose use is secured by contract for contingencies only. Extensive planning by IT professionals is required to ensure that the critical workload can be moved to the backup facility when required. Typically this transfer is not immediate, and involves substantial manual processes.

At the highest levels of application criticality, workloads must be capable of being transferred to a backup facility almost instantaneously. To ensure that such a transfer can be made, it is typical practice to alternate production processing between facilities on a regular basis, to exercise failover capabilities.

Part of maintaining high reliability is the continuous collection of information on the business impacts, costs, and threats related to reliability, on application reliability requirements, on reliability experience, and on failure analysis. This information is typically used to develop some form of a reliability plan that provides a cost-effective solution to the real reliability requirements of the business.

The key elements of achieving mainframe reliability are:

Multiple processing units with OS-based load-shifting to reserve reliable processing power for highest priority applications.
Critical peripherals arrayed into assemblies that are fault-tolerant through high parallelism.
Business and infrastructure arrangements that allow the shifting of workload to another facility for low to medium criticality applications.
Multiple fully functional processing facilities with automated failure detection and automated instantaneous transfer of workload for high criticality applications.
Continuous data gathering and regular planning to ensure that the computing infrastructure deployed, and the reliability measures taken, meet the current needs of the business and reduce or eliminate single points of failure.
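The "automated failure detection and automated instantaneous transfer of workload" element in the list above is commonly built on heartbeats. The following Python sketch illustrates the idea only; the class, the function, the facility names, and the timeout value are hypothetical and do not describe any specific mainframe product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time reported by each processing facility."""

    timeout_seconds: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, facility: str, now: float) -> None:
        self.last_seen[facility] = now

    def failed_facilities(self, now: float) -> List[str]:
        # A facility is presumed failed when its heartbeat is overdue.
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


def choose_active(
    preferred_order: List[str], monitor: HeartbeatMonitor, now: float
) -> Optional[str]:
    """Return the first known, healthy facility in priority order, if any."""
    failed = set(monitor.failed_facilities(now))
    for facility in preferred_order:
        if facility in monitor.last_seen and facility not in failed:
            return facility
    return None


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("primary", now=0.0)
monitor.record_heartbeat("backup", now=8.0)
print(choose_active(["primary", "backup"], monitor, now=10.0))  # backup
```

Timestamps are passed in explicitly so that the behavior is deterministic; a production detector would also have to guard against false positives caused by network partitions.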

Windows Server System Reliability

The Windows Server System is designed to deliver high reliability, and to support the needs of the IT professionals who manage delivery of the specified reliability service level. The core of the Windows Server System is Windows Server 2003. Each edition of Windows Server 2003 supports the indicated level of load balancing and clustering features:

Windows Server 2003, Standard Edition: NLB
Windows Server 2003, Enterprise Edition: NLB, 8-node Server Clustering
Windows Server 2003, Datacenter Edition: NLB, 8-node Server Clustering
Windows Server 2003, Web Edition: NLB

Server Reliability

The Windows Server System has the following built-in reliability-related capabilities:

Network Load Balancing. This provides failover support for IP-based applications and services that require high scalability and availability.
Component Load Balancing. This provides dynamic load balancing of middle-tier application components. Component Load Balancing components balance loads over multiple nodes to dramatically enhance the availability and scalability of software applications.
Server Clustering. This provides failover support for applications and services that require high availability, scalability, and reliability. With clustering, organizations can make applications and data available on up to eight servers linked together in a cluster configuration. Back-end applications and services, such as those provided by database servers, are ideal candidates for server clustering.

Typically, all three of these capabilities are used in combination. For example, an e-commerce site may have front-end Web servers using NLB, middle-tier application servers using Component Load Balancing, and back-end database servers using Server Clustering.
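The failover behavior of Server Clustering described above can be illustrated with a small model. This is a sketch under stated assumptions, not the Cluster service's actual algorithm: the class, the node and resource group names, and the "move to the least-loaded surviving node" policy are all hypothetical. Only the eight-node limit comes from the text.

```python
from typing import Dict, List

MAX_NODES = 8  # cluster size limit stated in the text


class ServerCluster:
    """Toy model of resource groups failing over between cluster nodes."""

    def __init__(self, nodes: List[str]) -> None:
        if not 1 <= len(nodes) <= MAX_NODES:
            raise ValueError(f"a cluster needs 1 to {MAX_NODES} nodes")
        self.online: List[str] = list(nodes)
        self.owner: Dict[str, str] = {}  # resource group -> owning node

    def host(self, group: str, node: str) -> None:
        if node not in self.online:
            raise ValueError(f"{node} is not online")
        self.owner[group] = node

    def _load(self, node: str) -> int:
        return sum(1 for owner in self.owner.values() if owner == node)

    def fail_node(self, node: str) -> Dict[str, str]:
        """Take a node offline and move its groups to surviving nodes."""
        self.online.remove(node)
        if not self.online:
            raise RuntimeError("no surviving nodes to fail over to")
        moved: Dict[str, str] = {}
        for group, owner in sorted(self.owner.items()):
            if owner == node:
                # Assumed policy: pick the surviving node with the fewest groups.
                target = min(self.online, key=self._load)
                self.owner[group] = target
                moved[group] = target
        return moved


cluster = ServerCluster(["node1", "node2", "node3"])
cluster.host("sql", "node1")
cluster.host("files", "node1")
cluster.host("mail", "node2")
print(cluster.fail_node("node1"))  # {'files': 'node3', 'sql': 'node2'}
```

The same call models planned maintenance: draining a node before patching it moves its resource groups to the remaining nodes, so service continues while the node is offline.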

The Windows Server System allows full remote creation and configuration of server clusters, and nodes can be added to an existing server cluster from a remote management station. Resulting drive letter changes and physical disk resource failovers are reflected in Terminal Server client sessions.

In the Windows Server System market, many different hardware vendors offer a variety of high-reliability features and options. This allows the customization of the level of reliability support purchased to range from moderate to ultra-high, on an application-by-application basis.

The Windows Datacenter High Availability Program complements the Windows Server 2003, Datacenter Edition operating system. The Datacenter High Availability Program has a unique support and services model built for customers with mission-critical server requirements and consists of the following components:

Windows Server 2003, Datacenter Edition or Windows 2000 Datacenter Server
Qualified Datacenter Configurations
Certified Datacenter Support Providers
High Availability Support Program

Critical peripherals must also be fault-tolerant. In the enterprise-level systems of major hardware vendors, there is no difference between the quality and features of peripherals in the Windows Server and mainframe environments. If extra high availability equipment or facilities are required, the ability to justify them is improved when system purchase costs are significantly lower, as they are with the Windows Server System.

For more information on Windows Server 2003, refer to:

https://www.microsoft.com/windowsserver2003/

Storage Reliability

Server farms are a result of the separation of DASD from the control of the central processor in the mainframe environment, making storage a network resource instead of a processor resource. Windows Storage Server provides a parallel, but more versatile, solution for the Windows Server System.

Windows Storage Server is dedicated file server software that provides reliable and highly available Network Attached Storage (NAS) capabilities. It simplifies storing and managing data, making it available to local and remote users as necessary, and it protects important files from loss, disaster, or unauthorized access. Windows Storage Server includes advanced availability features such as point-in-time data copies, replication, and server clustering.

Windows Storage Server also offers support for Storage Area Networks (SAN) and for a hybrid solution in which NAS devices are attached to a SAN. In this configuration, a NAS head or gateway (containing the filing functionality) attaches to the local area network (LAN), and behind it lies a back-end SAN consisting of Fibre Channel networking and storage disks.

For more information on Windows Storage Server, refer to:

https://www.microsoft.com/windowsserversystem/wss2003/default.mspx

Database Reliability

For years, DB2 has been the standard relational database management system (RDBMS) for mainframes. The Windows Server System equivalent of DB2 is SQL Server 2000, an RDBMS that is capable of supporting some of the largest, most critical, and highest performing applications in the world.

The SQL Server High Availability Series provides prescriptive guidance for IT professionals in overcoming many kinds of barriers to database reliability, including environmental, hardware, communication and connectivity, software, service, process, application design, and staffing barriers.

In addition to very high performance relational transaction services, SQL Server also offers a suite of data analysis tools necessary to interpret data and optimize database performance. These include integrated Online Analytical Processing (OLAP) and Relational Online Analytical Processing (ROLAP) based data warehousing, business intelligence, data mining, and visualization capabilities.

For more information on Microsoft SQL Server and the High Availability Series, refer to:

https://www.microsoft.com/sql/

Application Reliability

Modern applications are typically delivered to the public through the Internet, or to internal users through a corporate intranet. Microsoft Internet Information Services (IIS), with its associated .NET development environments in Microsoft Visual Studio®, is the new mainstream delivery platform for enterprise applications, replacing standard green screen terminals and teleprocessing monitor software such as CICS/TSO and IMS/DC.

IIS is a full-featured Web server that provides a reliable, more secure, and simpler way to develop and deliver Web sites. It includes support for Active Server Pages (ASP), the .NET Framework, Extensible Markup Language (XML) Web Services, and Internet Protocol version 6 (IPv6).

IIS offers a dedicated application mode, which runs all application code in an isolated environment. IIS also supports Web gardens, in which a set of equivalent processes on a computer each receives a share of the requests normally served by a single process, achieving better multiprocessor scalability.
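The Web garden idea, in which several equivalent processes share the requests that one process would otherwise serve, can be sketched as follows. The round-robin assignment and the function name are assumptions made for illustration; the sketch does not reproduce how IIS itself routes requests to worker processes.

```python
from collections import Counter
from itertools import cycle
from typing import Iterable, List, Tuple


def distribute_requests(
    requests: Iterable[str], worker_count: int
) -> List[Tuple[str, int]]:
    """Assign each request to one of `worker_count` equivalent workers in turn."""
    if worker_count < 1:
        raise ValueError("a Web garden needs at least one worker process")
    workers = cycle(range(worker_count))
    return [(request, next(workers)) for request in requests]


assignments = distribute_requests([f"req{i}" for i in range(8)], worker_count=4)
print(Counter(worker for _, worker in assignments))
# Counter({0: 2, 1: 2, 2: 2, 3: 2})
```

Because every worker receives an equal share, a stalled or recycled worker affects only its own portion of the traffic, and the remaining processes can make use of other processors on the same computer.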

For more information on Internet Information Services, refer to:

https://www.microsoft.com/windowsserver2003/iis/

Application Center is the Microsoft deployment and management tool for high availability clustered Web applications. Application Center creates Web sites that are scalable, robust, easy to administer, and more secure.

Application Center reduces application management complexity by allowing administrators to quickly construct logical groupings that include the contents, components, and configuration of applications. It also replicates any changes made to one server by updating the other servers in the cluster, automates the deployment of applications from one server to another, and allows applications to achieve on-demand scalability.

For more information on Application Center, refer to:

https://www.microsoft.com/applicationcenter/