Step 3: Determine Role Placement and Fault Tolerance

Published: November 12, 2007 | Updated: February 25, 2008

If the SoftGrid instance supports critical applications, it can be deployed using several methods to increase fault tolerance. Different strategies apply to different server roles in the SoftGrid environment. SQL Server, for example, is made highly available by deploying it in an active-passive Microsoft Cluster Services (MSCS) cluster. High availability for the Virtual Application Server (VAS) is based on creating a load-balanced array of VASs. Active Directory has built-in high availability, and the Management Web Service can also be installed on two or more separate servers to provide redundant management points.

A number of services are important to the functionality of the SoftGrid infrastructure. In order to increase the reliability and availability of these services, additional technology can be used to increase each component system’s fault tolerance. Although there are a number of fault tolerance strategies and technologies available, not all are applicable to a given service. Additionally, if SoftGrid roles are combined, certain fault tolerance options may no longer apply due to incompatibilities.

SoftGrid does not currently support the full range of fault-tolerant solutions on the market. For example, there are several additional methods for making Microsoft SQL Server fault tolerant beyond what is given below; however, these additional methods are not applicable to the SoftGrid system. Finally, component-level fault tolerance, such as RAID systems, is not discussed at a SoftGrid role level. Below is a list of the major options for fault tolerance in SoftGrid:

Built-In. The service is designed to provide fault tolerance through a built-in mechanism. Active Directory is an example of such a service. Additional reliability, availability, and scalability are provided by adding additional domain controllers to the environment.
Fault-Tolerant File Replication. Windows-based servers provide two methods for keeping multiple file shares synchronized. Both methods use Distributed File System (DFS) replica sets to define which file shares should be kept in sync. In Windows Server 2003 and earlier, File Replication Service is used to perform the replication. FRS uses a last writer wins algorithm. If two versions of the file are changed on different servers or if two or more files with the same name are added to the replica tree of different servers, the last file written will be kept. The additional changes will be lost. Additionally, FRS cannot enforce file sharing restrictions or file locking between two users who are working on the same file on two different replica set members. Distributed File System Replication replaces FRS on Windows Server 2003 R2 as the preferred replication method. DFS-R provides a multi-master replication service for files between multiple file shares. In addition, it provides an improved management tool and higher performance over FRS. There are additional replication tools, such as Robocopy, which can provide this functionality as well and which do not require DFS to be used.
Network Load Balancing (NLB). NLB provides failover support for IP-based applications and services that require high scalability and availability. Windows Server provides a software-based NLB within the OS. Hardware-based load balancing solutions can be used in place of an NLB cluster. Hardware-based load balancing devices can provide greater scalability and reliability due to more complex load balancing algorithms and specialized hardware. However, at least two hardware load balancers need to be deployed for fault tolerance. If only one is used, then it becomes the single point of failure for the cluster. Web-tier and front-end services are ideal candidates for NLB.
Server Clustering. A server cluster provides failover support for applications and services that require high availability, scalability, and reliability. Server clusters are made up of two or more servers that can assume the load of the other servers in the cluster in the event of a failure. Server clusters based on Microsoft server cluster technology will require additional resources for the shared storage system and private network configuration for the heartbeat.
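
The last-writer-wins behavior described for FRS above can be illustrated with a small sketch. This is not FRS code; the FileVersion type, the resolve_last_writer_wins function, and the server names are hypothetical and serve only to show why a change made on one replica member is lost when a later write arrives from another member.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileVersion:
    """One copy of a file on a replica set member (hypothetical model)."""
    server: str
    content: str
    written_at: datetime


def resolve_last_writer_wins(versions):
    """Return the version with the latest write time; all others are discarded."""
    return max(versions, key=lambda v: v.written_at)


# Two administrators update the same package file on different servers.
a = FileVersion("content-01", "package v2 from site A", datetime(2008, 2, 25, 10, 0))
b = FileVersion("content-02", "package v2 from site B", datetime(2008, 2, 25, 10, 5))

winner = resolve_last_writer_wins([a, b])
lost = [v for v in (a, b) if v is not winner]

print(winner.server)             # content-02: the later write is kept
print([v.server for v in lost])  # ['content-01']: this change is lost
```

Because no merge takes place, a practical consequence is to make changes to the content share from a single server wherever possible.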

Ignoring scaling and fault tolerance requirements, the minimum number of servers needed for a location with connectivity to Active Directory is one. This server will host the Content Storage System, Management Web Service, Microsoft SQL Server, and the Virtual Application Server roles. Because the server roles do not conflict with one another, they can be arranged in any desired combination.

Ignoring scaling requirements, the minimum number of servers necessary to provide a fault-tolerant implementation is four when Active Directory is already present in the environment with multiple domain controllers. The Content Storage System, SQL Server, and Virtual Application Server are all capable of being placed in fault-tolerant configurations. The Management Web Service can be combined with any of the roles, but remains a single point of failure.

Table 3. Compatible Fault Tolerant Role Combinations

Role	NLB	Server Clustering	Minimum Servers
Content Storage System	√	√	*
SQL Server		√	2
Virtual Application Server	√		2

* Can be placed with SQL Server (shared) or VAS (distributed)
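
The four-server minimum quoted above follows directly from Table 3. The sketch below is only an illustration of that arithmetic; the dictionary layout and the function name are assumptions, not part of any SoftGrid tool.

```python
# Minimum servers per role in a fault-tolerant layout, taken from Table 3.
# None means the role has no minimum of its own and is co-located with another role.
MIN_SERVERS = {
    "SQL Server": 2,                  # server cluster
    "Virtual Application Server": 2,  # load-balanced array
    "Content Storage System": None,   # placed with SQL Server or VAS
    "Management Web Service": None,   # combined with any role
}


def minimum_fault_tolerant_servers(min_servers=MIN_SERVERS):
    """Sum the dedicated minimums; co-located roles add no servers."""
    return sum(count for count in min_servers.values() if count is not None)


print(minimum_fault_tolerant_servers())  # 4
```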

Task 1: Active Directory

Active Directory provides group security and access control to applications. When the SoftGrid client connects to a VAS to request the applications that the currently logged-on user can access, it passes group membership information to the VAS in the form of the user’s Windows security token. The VAS in turn uses Active Directory to track permissions on applications and provides access to the applications for which the user has been granted permissions. The client then streams those applications down to the client’s computer. If a user’s permissions are modified to remove the user from a group that is associated with a particular application, access will be denied the next time the user attempts to launch the application. A domain controller should be located near the SoftGrid servers, ideally within the same location, in order to efficiently handle the requests made by the SoftGrid system.

If Active Directory is unavailable while the VAS is still running, clients will be unable to launch applications. If both the VAS and Active Directory are unavailable, clients will enter disconnected mode and can launch any applications they have previously launched successfully. If a VAS service is unable to contact Active Directory during startup, the service will fail to run.
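
The failure cases in the preceding paragraph can be summarized as a small decision function. This is a sketch for planning discussions only; the function name and return strings are hypothetical. The case where the VAS is down but Active Directory is up is not described in this task, so the sketch assumes it behaves like the disconnected-mode case described in the Decision Summary.

```python
def client_launch_behavior(ad_available: bool, vas_running: bool) -> str:
    """Map infrastructure state to the client behavior described in this task."""
    if vas_running and ad_available:
        return "normal: permissions checked, applications streamed"
    if vas_running and not ad_available:
        return "launch denied: VAS cannot verify permissions"
    if not ad_available:
        # Both the VAS and Active Directory are unavailable.
        return "disconnected mode: previously launched applications only"
    # VAS down, Active Directory up: assumption, not stated in this task.
    return "disconnected mode (assumed): previously launched applications only"


for ad, vas in [(True, True), (False, True), (False, False), (True, False)]:
    print(f"AD={ad!s:<5} VAS={vas!s:<5} -> {client_launch_behavior(ad, vas)}")
```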

Fault tolerance for domain controllers can be accomplished by adding additional domain controllers to the infrastructure. Guidance for adding additional domain controllers is outside the scope of this guide.

Decision 2: Content Storage System

If the content storage system is unavailable, clients will be unable to install new applications or update existing applications. The content storage system can be placed locally with the VAS or on a shared storage device, such as Network Attached Storage (NAS) or a file server. If the content is stored in a shared location, then ensuring the availability of the data is critical. Care must be taken to ensure that the network path between the content storage system and the VASs has sufficient bandwidth. The disk I/O subsystem and NIC in the file server, SAN, or NAS that hosts the content must have sufficient I/O throughput to handle several VASs reading concurrently from the content share.
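
A rough way to reason about the throughput requirement above is to compare the combined read rate of all VASs with the slower of the NIC and the disk subsystem on the content host. The sketch below assumes a simple worst case in which every VAS reads at its peak rate at the same time; the function name and all figures are hypothetical and should be replaced with measured values.

```python
def content_share_has_headroom(vas_count, per_vas_read_mbps, nic_mbps, disk_mbps):
    """Return True if the content host can serve all VASs reading at once.

    The slower of the NIC and the disk subsystem is treated as the bottleneck.
    """
    required = vas_count * per_vas_read_mbps
    available = min(nic_mbps, disk_mbps)
    return required <= available


# Hypothetical figures: each VAS reads up to 200 Mbps from the share.
print(content_share_has_headroom(4, 200, nic_mbps=1000, disk_mbps=900))  # True
print(content_share_has_headroom(6, 200, nic_mbps=1000, disk_mbps=900))  # False
```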

If the content is stored locally to the VAS or if the virtual application packages need to be shared across locations, file replication solutions are recommended for keeping the content shares synchronized.

The directory or share that is used to store the SoftGrid-enabled application packages is referred to as the SoftGrid Content directory.

Option 1: Built-In

If the content storage system is Network Attached Storage (NAS), these systems typically provide multiple methods for ensuring data availability, reliability, and scalability. They can provide multiple hardware paths to the data, low-level RAID capabilities, and fault-tolerant hardware. From a client system point of view, NAS storage appears as a remote share. For a Storage Area Network (SAN) to be fault tolerant, the file server hosting the share needs to be made fault tolerant as well.

Option 2: Fault-Tolerant File Replication

If each VAS hosts a local Content Storage System within a single location, then file replication can be used to keep the multiple storage systems synchronized. Likewise, if the information in the content storage system needs to remain the same across locations, file replication can be used as well.

If DFS is being used, then the availability of the data is increased because DFS is able to redirect requests to another copy of the share if the targeted location is off-line.

It is important to note that FRS and DFS-R are not supported on a Server Cluster although DFS is supported.

Option 3: Server Clustering

Server clustering can be used to increase the fault tolerance of a single content storage system file share. The file share becomes a clustered resource running on a cluster with two or more computers. If the computer hosting the file share fails, the file share will move to a remaining active node.

Although a share hosted in a server cluster can become part of a DFS namespace, the content of the share cannot be replicated using FRS or DFS-R.

Evaluating the Characteristics

Complexity
Built-In	Many devices, such as NAS, are automatically configured to handle fault tolerance.	Low
Fault-Tolerant File Replication	Configuring the DFS and implementing file replication between the shares can be moderately difficult. Microsoft provides guidance for implementing this form of file protection using DFS and FRS or DFS-R.	Medium
Server Clustering	Server clustering tends to be extremely complex to set up due to the interaction between networks, shared storage, and specialized hardware and software configurations.	High

Cost
Built-In	The additional hardware for NAS can increase the cost of the implementation moderately.	Medium
Fault-Tolerant File Replication	If using existing file servers, then fault-tolerant file replication is fairly low in cost as it’s built into the OS.	Low
Server Clustering	Server clustering is costly due to the requirements of additional servers and shared storage.	High

Task 3: Management Web Service

Any server may host the Management Web Service as long as it can communicate with the SoftGrid database and Active Directory. The Management Web Service reads and writes configuration data to the SoftGrid database as well as querying Active Directory for group membership information. Typically, this Web service will also be installed on the VAS in smaller installations. The Management Console can be placed on the same management server or may be placed on an administrator’s workstation.

An important consideration when the Management Web Service is placed on the VAS is the negative performance impact that report generation can have in large environments. For this reason, a dedicated server is recommended to host the SoftGrid Management Web Service in large environments that will be running reports.

The Management Web Service is only used to configure the SoftGrid environment. If the Web service fails, the SoftGrid system will continue to function normally with the exception of SoftGrid management changes and reporting.

In the event of a Management Web Service failure, the Management Console can be used to redirect the system to use another instance of the Management Web Service in the environment.

Although multiple instances of the Management Web Service can be run in a single SoftGrid instance, no testing has been done with providing fault tolerance to the Management Web Service, and therefore it is not officially supported at this time.

Task 4: Microsoft SQL Server

SoftGrid requires SQL Server 2000 or SQL Server 2005. Application data, license management data, and report data are kept in the SQL Server database. In locations where fault tolerance is not required, SQL Server can be installed on the same server as the Management Web Service and VAS.

SQL Server provides a number of mechanisms for fault tolerance. These include Database Mirroring, Log Shipping, Server Clustering, and Peer-to-Peer Replication. Although all of these provide some form of increased fault tolerance for a database, the only supported method for SoftGrid today is server clustering.

If the SoftGrid database is unavailable, no configuration changes can be made to the SoftGrid system. VASs that are currently running will continue to service clients. However, the VAS service will fail to run if the database is unavailable during startup.

Clustering SQL Server-based servers will increase the complexity of the environment. Creating a new cluster using Microsoft Cluster Services will require additional servers with the appropriate hardware to support the cluster service. MSCS will also require a shared storage device that can be locally attached to the servers running SQL Server, thus increasing the costs of deploying SoftGrid.

Because the load introduced by SoftGrid is extremely low, an existing SQL Server-based server cluster can be used to host the SoftGrid configuration database, providing fault tolerance at minimal cost.

Decision 5: Virtual Application Server

VASs perform a critical role in the SoftGrid infrastructure. They are the servers that have direct connectivity to the client workstations; they are also responsible for streaming applications to the clients. The VAS role must be deployed in the same location and, if possible, on the same fast LAN as the SQL Server role in order to ensure good connectivity between the VAS and the SoftGrid configuration information that is stored in the SQL Server database. In locations where fault tolerance is not required, the VAS can be deployed to the same server as SQL Server and the Management Web Service.

Fault tolerance is achieved by load balancing VASs. Some load balancing solutions will provide fault tolerance at the machine level. That is, if the entire machine fails, the load balancing cluster will no longer send requests to that system. However, application failures are not recognized, so client requests will still be sent to the affected server. Other load balancing solutions will provide higher levels of fault tolerance by recognizing when the application layer has stopped responding. This will ensure that the remaining servers will continue to handle the streaming functionality in the event that a server fails. N+1 or greater VAS redundancy is required to provide fault tolerance.
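
The N+1 requirement above can be turned into a simple sizing calculation: work out how many VASs are needed to carry the peak load, then add at least one spare so that the array still carries that load after a single server failure. The function name and the capacity figures below are hypothetical; actual per-server capacity must come from testing.

```python
import math


def vas_servers_needed(peak_clients, clients_per_vas, spare=1):
    """Return the VAS count for N+spare redundancy.

    N is the number of servers needed to carry the peak load on their own.
    """
    if peak_clients <= 0 or clients_per_vas <= 0 or spare < 0:
        raise ValueError("peak_clients and clients_per_vas must be positive; spare must not be negative")
    base = math.ceil(peak_clients / clients_per_vas)
    return base + spare


# Hypothetical figures: 2,500 concurrent clients, 1,000 clients per VAS.
print(vas_servers_needed(2500, 1000))           # 4 (N = 3, plus one spare)
print(vas_servers_needed(2500, 1000, spare=2))  # 5 (N+2)
```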

The VAS service can use a Network Load Balancing system to provide additional fault tolerance to the system. The VAS is not cluster aware, nor has it been tested on a server cluster, so running the VAS on a server cluster is not supported at this time. There are two network load balancing options available: software-based NLB and a hardware load balancer.

Option 1: Software-based NLB

NLB is a cost-effective method for providing load balancing as well as a basic level of fault tolerance and scalability. NLB does not query the health of the Real Time Streaming Protocol (RTSP) service on the VAS. This can lead to a situation where the VAS appears healthy because the NLB heartbeat is detected, even though the VAS service is down and will not answer client requests.
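
An application-level check of the kind that NLB lacks could be run by a separate monitoring script. The sketch below sends a generic RTSP OPTIONS request, as defined by the RTSP specification, and treats any RTSP status line as a sign of life. It is a minimal illustration under stated assumptions: that the VAS answers a plain OPTIONS request, and that it listens on port 554 (the registered RTSP port). Neither is confirmed by this guide, and the function name is hypothetical.

```python
import socket


def rtsp_responds(host, port=554, timeout=3.0):
    """Return True if the host answers an RTSP OPTIONS request with an RTSP status line.

    A refused connection, a timeout, or a non-RTSP reply all count as unhealthy.
    """
    request = b"OPTIONS * RTSP/1.0\r\nCSeq: 1\r\n\r\n"
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(request)
            reply = conn.recv(1024)
    except OSError:
        # Covers connection refused, unreachable hosts, and socket timeouts.
        return False
    return reply.startswith(b"RTSP/1.0 ")
```

A monitoring job might call rtsp_responds for each VAS in the array and raise an alert, or remove the node from the NLB cluster, when it returns False.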

Although up to 32 systems can be placed in a single software-based NLB cluster using Microsoft NLB, it has been observed in production that the effective performance of the system drops for cluster groups containing more than six members, so independent verification testing should be conducted if needed.

Option 2: Hardware Load Balancer

To provide access to the VAS array of servers and to automatically recognize when a VAS has stopped responding to requests, a hardware load-balancing solution that supports Hypertext Transfer Protocol (HTTP) and RTSP is required. This level of configuration adds complexity to the overall deployment of the SoftGrid servers. Hardware load balancers also add cost to the SoftGrid solution. Two or more hardware load balancers are necessary; if only one is implemented, it becomes the single point of failure. Because client connections are handled by specialized hardware, hardware load balancers tend to scale to more concurrent client sessions than software-based load balancers.

Evaluating the Characteristics

Complexity
NLB	NLB is simple to implement.	Low
Hardware Load Balancer	Hardware load balancers tend to increase the complexity of the environment because at least two devices are needed and additional knowledge is required to operate them.	Medium

Cost
NLB	NLB is available in all editions of Windows Server 2003 and later. As a best practice, an additional network interface card is added to all nodes in the cluster to create a private network for the cluster heartbeat.	Low
Hardware Load Balancer	Two or more hardware load balancers are needed to ensure fault tolerance. If only one hardware load balancer is used, it becomes the single point of failure.	High

Fault Tolerance
NLB	NLB only provides fault tolerance at the machine level. If the application layer fails, NLB will not detect it.	→
Hardware Load Balancer	Hardware load balancers are able to detect application layer failures.	↑

Security
NLB	NLB does not affect security if properly implemented.	→
Hardware Load Balancer	Hardware load balancers can increase security of the infrastructure due to rudimentary packet screening features.	↑

Validating with the Business

What are the service level agreements in place for the applications being virtualized? If applications must remain available, implementing a load-balanced and redundant solution may be a requirement of the deployment. It is important to understand which applications are critical to the enterprise and how virtualization and streaming may affect their availability to the parties that rely on them. Fault tolerance for SoftGrid protects against failures with application deployment and updates. Applications that have previously been deployed will run in disconnected mode if an infrastructure failure occurs.

Decision Summary

This step should be repeated for each SoftGrid instance required. At this point, the requirements around fault tolerance will have been identified as well as the implementation to meet those requirements for a given SoftGrid instance.

Fault tolerance for SoftGrid in Connected Mode provides a system that is able to service client requests for new applications or updates. Applications that have previously been cached will run in a disconnected mode in the event of an infrastructure failure.

Additional Reading

An Overview of Windows Clustering Technologies: Server Clusters and Network Load Balancing: https://technet2.microsoft.com/windowsserver/en/library/c35dd48b-4fbc-4eee-8e5c-2a9a35cf63b21033.mspx?mfr=true
Planning Server Deployments: https://technet2.microsoft.com/windowsserver/en/library/cd6dd855-c25a-42e9-a0b1-861989aeac741033.mspx?mfr=true
Configuring Network Load-balancing: https://support.microsoft.com/kb/240997

This accelerator is part of a larger series of tools and guidance from Solution Accelerators.
