Chapter 6. Cluster Service

Summary of Cluster Service Requirements

Note   This chapter is required for Certification on Windows 2000 Advanced Server and Windows 2000 Datacenter Server (see Endnote 5). Applications that do not meet these requirements are eligible for the Windows 2000 Server certification only.

Rationale

A server cluster is a group of independent servers managed as a single system for higher availability. Cluster Service is a set of system services in Windows 2000 Advanced Server and Windows 2000 Datacenter Server that enables you to form server clusters by connecting multiple servers together, making them appear to network clients as a single, highly available system.

Cluster Service can automatically detect the failure of an application or server, and restart the application, either on the same server if it is still alive, or on another surviving server.

These requirements help ensure that your application will run properly with Cluster Service enabled, so that:

  • Your server application can fail over to other servers

  • The client-side of your application properly handles failure of the server application

Customer benefits

Customers that run your application in a clustered environment can achieve higher availability, because your application can continue to provide service during both planned downtime (such as hardware and software upgrades) and unplanned outages (such as hardware or software failure).

When one of the systems (or nodes) in the cluster fails or becomes unavailable, Cluster Service transfers its workload to another system in the cluster. Users experience only a momentary pause in service. Cluster Service can also be configured to provide failback, so that when the failed server comes back online, the workload is rebalanced across the server cluster.

Requirements

  1. Applications must be able to install on at least two nodes for certification on Windows 2000 Advanced Server and two, three, and four nodes on Windows 2000 Datacenter Server.

  2. Applications must support failover to all cluster nodes

  3. Clients must survive failure of the server application without crashing or affecting the stability of the system

References

Cluster Service Architecture white paper
www.microsoft.com/windows2000/library/howitworks/cluster/clusterarch.asp

Clustering in Windows 2000 Advanced Server:
www.microsoft.com/WINDOWS2000/library/howitworks/cluster/asoverview.asp

White papers on Windows 2000 Clustering:
www.microsoft.com/ntserver/ProductInfo/Enterprise/default.asp

Information on writing resource DLLs:
https://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnmscs/html/msdn_mscs_resource_dlls.asp

General Background:
In Search of Clusters by Gregory F. Pfister, ISBN: 0-13-437625-0. An excellent introduction to clustering technology, including a description of the common programming models.

How to Comply with Cluster Service Requirements

1. Applications Must Be Able To Install on Two Nodes for Certification on Windows 2000 Advanced Server and on Two, Three, and Four Nodes on Windows 2000 Datacenter Server

In order to qualify for Certification on Windows 2000 Advanced Server, applications must install on two nodes.

In order to qualify for Certification on Windows 2000 Datacenter Server, applications must install on two nodes, three nodes, and four nodes.

Note   Your application setup should not make assumptions about the number of nodes in the cluster. It should enumerate all nodes in the cluster and allow installing your application on any node, even if the disks where your application data is stored are not physically located on that node.
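The note above can be illustrated with a minimal sketch of a setup loop that does not hard-code the number of nodes. The callback types, MAX_NODE_NAME, and the demo_* helpers are hypothetical names introduced for illustration only. In a real cluster-aware setup, the enumeration callback would presumably wrap the Cluster API enumeration calls (for example ClusterOpenEnum and ClusterEnum) rather than an in-memory list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_NODE_NAME 64

/* Hypothetical callback: copies the name of node `index` into `name` and
 * returns 1, or returns 0 when there are no more nodes to report. */
typedef int (*next_node_fn)(void *ctx, unsigned index, char *name, size_t cap);

/* Hypothetical callback: installs the application on one node.
 * Returns 0 on success. */
typedef int (*install_fn)(void *ctx, const char *node);

/* Installs on every node the enumerator reports, whatever their number.
 * Returns the count of nodes installed, or -1 on the first failure. */
int install_on_all_nodes(next_node_fn next, install_fn install, void *ctx)
{
    char     name[MAX_NODE_NAME];
    unsigned index = 0;
    int      installed = 0;

    assert(next != NULL && install != NULL);

    while (next(ctx, index, name, sizeof name)) {
        if (install(ctx, name) != 0)
            return -1;
        ++installed;
        ++index;
    }
    return installed;
}

/* Illustration only: an in-memory node list standing in for a cluster. */
typedef struct {
    const char *const *names;
    unsigned           count;
    unsigned           installed;
} demo_cluster;

int demo_next_node(void *ctx, unsigned index, char *name, size_t cap)
{
    demo_cluster *c = ctx;
    if (index >= c->count)
        return 0;
    snprintf(name, cap, "%s", c->names[index]);
    return 1;
}

int demo_install(void *ctx, const char *node)
{
    demo_cluster *c = ctx;
    (void)node;            /* a real installer would target this node */
    c->installed++;
    return 0;
}
```

Because the loop stops only when the enumerator reports no more nodes, the same setup code serves two-node, three-node, and four-node configurations.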

2. Applications Must Support Failover to All Cluster Nodes

When a node in the cluster fails, Cluster Service will move the resource group from that node to a new node. A resource group is a collection of resources that provide services to clients and can depend on each other.

Resources that represent the primary functionality of your application must be able to start up (come online) on any other node in the cluster. After a failover is complete, clients should be able to access all data exposed by primary functions.

Note   Cluster Service operates under a "shared nothing" architecture in which each server owns its own disk resources. In the event of a server failure, ownership of the clustered disk is transferred from one server to another. For applications to properly support failover, the application's data must be stored on the clustered disk.

3. Clients Must Survive Failure of the Server Application Without Crashing or Affecting the Stability of the System

A client that ships with your server application must gracefully handle both cluster node failures and application failures. Cluster and application failures may cause clients to temporarily lose their connection to the server application (see below). Your client must survive both the failure of the server application and node failure as follows:

  • When a connection to the server application is lost, your client application must not crash or compromise the stability of the client operating system.

  • Once the failover is complete and the application is restarted on a cluster node, your client must reconnect to the cluster using either of the following mechanisms:

    1. Reestablish the lost connection without user intervention and with no loss of data,

      - or -

    2. Offer the user a chance to reconnect and retry the operation that failed; for example, prompt the user to refresh the data in the client.

  • If the server application is not able to restart, the client must inform the user that the connection could not be re-established.

Connections to the server application can be lost for any of the following reasons:

  • Application fails and is then restarted on the same node

  • Application fails and is restarted on a new node

  • Node fails, and all resources fail over to a new node

  • Administrator moves the resource group containing the application to a new node

  • Administrator shuts down the server application

  • All nodes in the cluster fail

  • The client's network connection to the cluster is interrupted, even though the cluster and the server application are still running

These failures may be exposed to the client application as application timeouts, invalid handles, network failures and connection timeouts.
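One way to handle these symptoms uniformly is to map each of them onto the client behaviors required above. The following is a minimal sketch under stated assumptions: the enumerations, the function name, and the idea of a fixed attempt limit are hypothetical and are not defined by Cluster Service.

```c
#include <assert.h>

/* Symptoms through which a lost connection may surface in the client. */
typedef enum {
    SYMPTOM_NONE,
    SYMPTOM_APPLICATION_TIMEOUT,
    SYMPTOM_INVALID_HANDLE,
    SYMPTOM_NETWORK_FAILURE,
    SYMPTOM_CONNECTION_TIMEOUT
} symptom;

/* What the client does next. */
typedef enum {
    ACTION_CONTINUE,            /* nothing is wrong */
    ACTION_RECONNECT,           /* reestablish without user intervention */
    ACTION_PROMPT_USER,         /* offer the user a chance to retry */
    ACTION_REPORT_UNAVAILABLE   /* tell the user the server did not return */
} client_action;

/* Every symptom is treated as "connection lost": the client never crashes,
 * it retries (automatically or through a prompt) until the attempt limit
 * is reached, and then informs the user. */
client_action next_client_action(symptom s, int auto_reconnect,
                                 int attempts_made, int max_attempts)
{
    assert(attempts_made >= 0 && max_attempts >= 0);

    if (s == SYMPTOM_NONE)
        return ACTION_CONTINUE;
    if (attempts_made >= max_attempts)
        return ACTION_REPORT_UNAVAILABLE;
    return auto_reconnect ? ACTION_RECONNECT : ACTION_PROMPT_USER;
}
```

Funneling all symptoms through one decision point keeps an unexpected error code, such as an invalid handle, from reaching code paths that assume a live connection.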

Development Guidelines

The guidelines in this section are not requirements that will be tested individually for Certification. However, following these guidelines will help you meet the requirements described above.

1. Use TCP/IP Protocol

Services that communicate with clients (as well as their clients) must use TCP/IP in order to be able to take advantage of IP address failover provided by Cluster Service. Servers that do not communicate with clients need not use TCP/IP.

2. Application Should Use a Virtual Server Name and IP Address To Connect to the Node Hosting the Server Application

Clients communicating with the cluster resources must use the virtual server IP address or virtual server network name to support failover.

If the server application publishes a network name or IP address to clients, it must publish the virtual server IP address or network name. A server application that depends on a computer name or an IP address should use the network name or IP address of the virtual server that clients use to access the application. Your server application should not fail to restart on another node because the computer name on that node is different.

The following code sample illustrates how to set the server application environment as part of your resource DLL Online routine. It is a fragment: the break statement implies an enclosing loop or switch in the Online worker thread that is not shown here.

//
// Create the new environment with the simulated net name, so that the
// service sees the virtual server name when it queries GetComputerName.
//
if ( ! ClusWorkerCheckTerminate( pWorker ) )
{
   nStatus = ResUtilSetResourceServiceEnvironment(
      YOUR_SERVICE_NAME,
      pResourceEntry->hResource,
      g_pfnLogEvent,
      pResourceEntry->hResourceHandle
      );
   if ( nStatus != ERROR_SUCCESS )
   {
      break;
   } // if: error setting the environment for the service
} // if: worker thread has not been asked to terminate

About IP address failover

Client applications use a virtual server IP address to access services running on a Windows 2000 server cluster. A virtual server is a cluster resource group containing an IP address and a network name. A virtual server can be brought online on any node in the cluster; however, it appears to the clients accessing it as the same physical machine.

The IP address of the virtual server must be configured as a cluster resource in the same resource group as the server application. In case of a node failure, all resource groups running on this node are moved to another node in the cluster. The IP address of the virtual server then becomes available on that node, and all connections with the clients can be reestablished.

3. Upon Failure, Clients Must Preserve User Data

The client application must be able to reconnect and resume an operation in the event of a cluster node failure or an application failure. It must either offer the user a chance to retry the connection, or retry the connection automatically until it succeeds or until it can determine that the server application could not be brought online.

In case of a node failure, all resource groups running on the failed node are moved to another node in the cluster. Cluster Service requires some time to bring all resources online and restart services on the other node. The time needed to fail over a server application depends on many factors. The most significant is the time required to restart the application.
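Because failover takes time, a client that retries automatically needs to hold on to the user's unsent data while it waits. The following is a minimal sketch of that idea, assuming a bounded number of attempts with a doubling delay between them. The pending_op type, the send_fn and wait_fn hooks, and the demo_* helpers are hypothetical names; a real client would plug in its own transport and timer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PENDING_MAX 256

/* The user's operation, kept in memory until the server accepts it. */
typedef struct {
    char   data[PENDING_MAX];
    size_t len;
    int    has_pending;
} pending_op;

/* Hypothetical hooks: send returns 0 once the server has accepted the data;
 * wait pauses for the given number of seconds. */
typedef int  (*send_fn)(void *ctx, const char *data, size_t len);
typedef void (*wait_fn)(void *ctx, unsigned seconds);

int pending_store(pending_op *op, const char *data, size_t len)
{
    assert(op != NULL && data != NULL);
    if (len > PENDING_MAX)
        return -1;
    memcpy(op->data, data, len);
    op->len = len;
    op->has_pending = 1;
    return 0;
}

/* Retries up to max_attempts times, doubling the delay (capped at 32 s).
 * Returns 0 when the data was delivered. Returns -1 when the server did not
 * come back; the data stays in `op` so the user can still save or retry. */
int pending_flush(pending_op *op, send_fn send, wait_fn wait, void *ctx,
                  int max_attempts)
{
    unsigned delay = 1;
    int      attempt;

    assert(op != NULL && send != NULL && max_attempts > 0);

    if (!op->has_pending)
        return 0;
    for (attempt = 1; attempt <= max_attempts; ++attempt) {
        if (send(ctx, op->data, op->len) == 0) {
            op->has_pending = 0;
            return 0;
        }
        if (wait != NULL && attempt < max_attempts)
            wait(ctx, delay);
        if (delay < 32)
            delay *= 2;
    }
    return -1;
}

/* Illustration only: a server that rejects the first few attempts. */
typedef struct { int failures_left; int delivered; unsigned waited; } demo_server;

int demo_send(void *ctx, const char *data, size_t len)
{
    demo_server *s = ctx;
    (void)data; (void)len;
    if (s->failures_left > 0) { s->failures_left--; return 1; }
    s->delivered++;
    return 0;
}

void demo_wait(void *ctx, unsigned seconds)
{
    demo_server *s = ctx;
    s->waited += seconds;   /* records the delay instead of sleeping */
}
```

The attempt limit should be chosen so that the total waiting time exceeds the failover time expected for the server application; otherwise the client gives up while the application is still restarting.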

4. Location of Application Data Must Be Configurable

Cluster Service can fail over only disks managed by the cluster that are on the storage bus shared among all nodes in the cluster. Your application setup should allow selecting the drive and installing application data on any drive. Cluster-aware setup should allow installing the application data only on a shared drive managed by the cluster.
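In practice this means the application should derive every data file path from a configured root instead of a hard-coded drive. The following is a minimal sketch; build_data_path is a hypothetical helper, and the "S:\AppData" root used in the usage note is an assumed example of a location on a clustered disk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Joins the configured data root and a file name into `out`.
 * Returns 0 on success, or -1 if the root is empty or the result
 * does not fit in `cap` bytes. */
int build_data_path(const char *data_root, const char *file_name,
                    char *out, size_t cap)
{
    size_t      root_len;
    const char *sep;
    int         n;

    assert(data_root != NULL && file_name != NULL && out != NULL);

    root_len = strlen(data_root);
    if (root_len == 0)
        return -1;

    /* Add a separator only when the root does not already end with one. */
    sep = (data_root[root_len - 1] == '\\' || data_root[root_len - 1] == '/')
              ? "" : "\\";

    n = snprintf(out, cap, "%s%s%s", data_root, sep, file_name);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

For example, with the root configured as "S:\AppData", a call for "orders.db" yields "S:\AppData\orders.db". Pointing the root at a different clustered disk then requires no change to the application code.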

5. Checkpoint State Information Required for a Clean Restart, Either Automatically or Manually

If a server application maintains any state information required for a clean restart, it should checkpoint this state information frequently to a shared disk managed by a cluster. It should use this data to recover quickly after a failure.
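A checkpoint is useful only if a failure in the middle of writing it cannot destroy the previous one. The following is a minimal sketch of one common approach: write the new checkpoint to a temporary file, then rename it over the old one. The checkpoint layout and function names are hypothetical. The sketch also assumes that rename replaces an existing file, which holds on POSIX systems; the Microsoft C runtime's rename does not replace an existing file, so a Windows build would need a replacing move such as MoveFileEx with MOVEFILE_REPLACE_EXISTING instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical state record; a real application defines its own layout. */
typedef struct {
    unsigned long sequence;   /* last work item that was fully processed */
    char          cursor[32]; /* application-defined position marker */
} checkpoint;

/* Writes `cp` to "<path>.tmp" and then renames it to `path`, so that `path`
 * always holds either the old or the new checkpoint, never a partial one.
 * Returns 0 on success, -1 on failure. Note that fclose flushes C library
 * buffers only; it does not force the data to the physical disk. */
int checkpoint_save(const char *path, const checkpoint *cp)
{
    char  tmp[260];
    FILE *f;
    int   n;

    assert(path != NULL && cp != NULL);

    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    if (fwrite(cp, sizeof *cp, 1, f) != 1) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) != 0) {
        remove(tmp);
        return -1;
    }
    /* Assumption: rename replaces an existing destination (POSIX). */
    if (rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}

/* Reads a checkpoint back. Returns 0 on success, -1 if missing or short. */
int checkpoint_load(const char *path, checkpoint *cp)
{
    FILE  *f;
    size_t got;

    assert(path != NULL && cp != NULL);

    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    got = fread(cp, sizeof *cp, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}
```

For failover to work, the path passed to these functions would have to be on a shared disk managed by the cluster, so that the node taking over can read the latest checkpoint.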

6. Upon Failure, Application Can Be Restarted and, If Applicable, Recover to the Last Checkpoint

The server application must recover from a node failure. A sudden node failure, for example a power blackout, should not leave your application in a state where it cannot restart.

After a node failure, Cluster Service moves the server application running on the node to another node, along with other resources it may depend on. The server application must restart, recover, and resume operation within the time you specified in your product literature.
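The restart path can be sketched as follows: on startup, the application reads its last checkpoint, validates it, and falls back to a clean start if the file is missing or damaged, so that a sudden power loss can never prevent a restart. The record layout, the CKPT_MAGIC marker, and the XOR checksum are hypothetical placeholders chosen for this sketch; a real application would use its own format and a stronger integrity check.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CKPT_MAGIC 0x434B5054UL   /* arbitrary marker chosen for this sketch */

typedef struct {
    unsigned long magic;
    unsigned long sequence;   /* last work item that was fully processed */
    unsigned long checksum;   /* placeholder check: magic XOR sequence */
} ckpt_record;

typedef enum { START_CLEAN, START_RESUMED } start_mode;

unsigned long ckpt_checksum(const ckpt_record *r)
{
    return r->magic ^ r->sequence;
}

/* Writes a checkpoint record. Returns 0 on success, -1 on failure. */
int ckpt_write(const char *path, unsigned long sequence)
{
    ckpt_record r;
    FILE       *f;
    int         ok;

    r.magic = CKPT_MAGIC;
    r.sequence = sequence;
    r.checksum = ckpt_checksum(&r);

    f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    ok = fwrite(&r, sizeof r, 1, f) == 1;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Called at startup. With a valid checkpoint, sets *resume_from to the next
 * unprocessed item and returns START_RESUMED. With a missing, truncated, or
 * corrupt checkpoint, sets *resume_from to 0 and returns START_CLEAN. */
start_mode recover_on_startup(const char *path, unsigned long *resume_from)
{
    ckpt_record r;
    FILE       *f;
    size_t      got;

    assert(path != NULL && resume_from != NULL);

    *resume_from = 0;
    f = fopen(path, "rb");
    if (f == NULL)
        return START_CLEAN;
    got = fread(&r, sizeof r, 1, f);
    fclose(f);

    if (got != 1 || r.magic != CKPT_MAGIC || r.checksum != ckpt_checksum(&r))
        return START_CLEAN;

    *resume_from = r.sequence + 1;
    return START_RESUMED;
}
```

The important property is that neither outcome blocks startup: the application either resumes from the checkpoint or starts clean, and it can log which of the two happened.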

7. At Least One Instance of the Application Can Run as a Cluster Resource

Cluster Service manages applications as cluster resources. A cluster resource is a physical or logical entity that can be owned by a node, brought online and taken offline, moved between nodes, and managed as a server cluster object. A resource can only be owned by a single node at any point in time. A resource is associated with, and managed by, a resource type.

If the resource supports it, Cluster Service can manage multiple instances of the same resource, but it is acceptable to support only one instance.

To take advantage of clustering, your application has to be configured as a cluster resource. You should be able to create at least one instance of your application. Your application must function properly as a cluster resource. Cluster Service must be used to start (bring online) and stop (take offline) your application.

8. Can Be Configured at Least as a Generic Service or Application

The monitoring and failover capabilities of Cluster Service can be extended to support any application. Cluster Service uses resource DLLs to extend its failover support to other resource types.

Applications that do not offer an application-specific resource DLL can still take advantage of clustering by using a generic application or generic service resource type. These resource types offer failover protection against most failures, notably, node failure. However, they cannot detect your application failures. If your application crashes or hangs, Cluster Service won't be able to detect this failure and either restart or fail over your application.

It is acceptable to use the generic application or generic service resource type to manage your application as a cluster resource.

How to Pretest Applications for Cluster Service Requirements

How to Pretest That Your Application Is Cluster-Ready

If your application's setup is cluster-aware, use setup to configure all nodes.

If your application's setup is not cluster-aware, install your server application on at least two nodes in the cluster. Use the Cluster Administrator console to create a virtual server and configure your server application as a generic service or application. Use the Cluster Administrator console to move your resource to each node in the cluster. If your application is cluster-ready, it should come online on any node in the cluster. Clients should be able to access the service provided by your application, no matter which node hosts it.

For certification on Datacenter, repeat this procedure for three-node and four-node configurations.

How to Pretest That Your Application Supports Failover

  1. Once you have installed the application on all nodes in the cluster, run functionality tests to verify that the application is fully functional and stable.

  2. Cause the node running your application to fail so that fail-over of the application is triggered. Below are suggested techniques to trigger failure:

    • HW failure: simulate by doing a hard reset.

    • OS failure: simulate by issuing a Ctrl+C command followed by a ".reboot" command from a remote kernel debugger.

    • Application failure: simulate using the End Process feature in Task Manager or Process Viewer (Pview.exe in the Windows SDK).

Note   Normal shutdown of the machine is not a valid test for failover, because the application will have the opportunity to gracefully shut down.

  3. Verify that the application restarts on a new node in the cluster.

  4. Run functionality tests to verify that all functionality is again available on the new node. The application must have access to all data that it previously had access to.

  5. For testing on Datacenter, repeat Steps 2-4 to verify that the application subsequently fails over to each of the remaining nodes.

How to Pretest That Clients You Provide Survive Failure and Subsequent Restart of the Server Application

Cause the server application to fail using each of the following scenarios:

  1. Shut down the server application using the normal shutdown sequence and leave all nodes in the cluster running.

  2. Terminate the application process (do not use the normal shutdown sequence), but leave the node running

  3. Kill the node.

Note   Do not use normal shutdown. The node and application must not have time to exit gracefully. Below are suggested techniques for various failure modes:

    • HW failure: simulate by doing a hard reset.

    • OS failure: simulate by issuing a Ctrl+C command followed by a ".reboot" command from a remote kernel debugger.

    • Application failure: simulate using the End Process feature in Task Manager or Process Viewer (Pview.exe in the Windows SDK).

For each case:

  • Verify that the client does not crash or lose stability when the server application fails.

  • Once the server application is restarted, either on the same node or a different node, verify that the client either:

    • Re-establishes the connection with no user intervention and no loss of user data

      - or -

    • Prompts the user to retry the connection and then establishes the connection.

    Note   If you need to manually configure the client to access the server application, configure it to use the virtual server, not the node name.

How to Pretest That Clients You Provide Survive Failure Without Subsequent Restart of the Server Application

  1. Cause the application to fail in a way that does not allow it to restart on the cluster. To do this, you can take the resource offline, or kill all nodes in the cluster.

  2. Verify that the clients do not crash or lose stability.

  3. Verify that the clients notify the user in a reasonable time that the connection to the server application was lost.

  4. Verify that the client can be closed without crashing or affecting the stability of the client's workstation, and that the user can preserve data, if appropriate.