SharePoint Clustering Techniques for High Availability SharePoint

“Clustering” is a keyword that’s often thrown around when discussing SharePoint architecture. The problem is that it’s rarely clear what it means in the context of SharePoint, so here’s a quick article on what it might mean, so we can all be on the same page when we use the term.

Clustering Techniques Available for High Availability SharePoint

So here’s how we can “cluster SharePoint” to make it more highly available. There are several options; in no particular order:

SQL Server Clustering for SharePoint

One of the first meanings I assume for clustering is SQL Server clustering for the SharePoint database back-end, of which there are two types: failover and AlwaysOn. Believe it or not, both can be used at the same time.

Failover clustering just provides a single logical SQL Server instance over one active server and X passive servers (1 active and 3 passive, for example). The idea is that when the active server dies for any reason whatsoever, another server will pick up where the previously active server left off. Automatic failover is key to the whole idea, and data is shared between all nodes. This is the more traditional kind of SQL Server clustering; it’s quite common to see, in fact.
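To make the active/passive idea concrete, here is a minimal Python sketch of it. It is only a toy model, not how Windows Server Failover Clustering is implemented; the names `Node`, `FailoverCluster`, `check_and_failover` and the server names are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """One physical server in the cluster (hypothetical model)."""
    name: str
    healthy: bool = True


@dataclass
class FailoverCluster:
    """One logical instance name in front of 1 active and X passive nodes."""
    instance_name: str
    nodes: List[Node]
    active_index: int = 0

    @property
    def active(self) -> Node:
        return self.nodes[self.active_index]

    def check_and_failover(self) -> Node:
        """Keep the active node if it is healthy, else promote a passive one."""
        if self.active.healthy:
            return self.active
        # Walk the remaining nodes in order and promote the first healthy one.
        for offset in range(1, len(self.nodes)):
            candidate = (self.active_index + offset) % len(self.nodes)
            if self.nodes[candidate].healthy:
                self.active_index = candidate
                return self.active
        raise RuntimeError("no healthy node left in the cluster")


# 1 active and 3 passive servers behind one logical instance name.
cluster = FailoverCluster(
    "SQLSP01", [Node("sql-a"), Node("sql-b"), Node("sql-c"), Node("sql-d")]
)
cluster.nodes[0].healthy = False          # the active server dies
new_active = cluster.check_and_failover() # a passive node takes over
```

The point to notice is that `instance_name` never changes: whatever connects to the logical instance doesn't need to know which node is currently active.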

SQL Server AlwaysOn clustering is a bit more complicated. It’s X logical SQL Server instances for which a single logical interface (the listener) may or may not exist, and the primary node may or may not automatically fail over between the instances. Some people use it like failover clustering, as a single instance with automatic failover; other SharePoint admins use it to back up data to a separate site (for disaster recovery, for example), and you can even combine uses. It’s sort of failover clustering and mirroring rolled into one.

So that’s SQL Server clustering. SharePoint has been able to use any or all of these setups since SharePoint Server 2013, but ultimately SharePoint still just connects to a single logical data source.

SharePoint Web-Front-End Clustering

This is otherwise known as network load balancing (NLB). SharePoint is very commonly used in this configuration: two or more servers are dedicated to just serving web pages to users, whose requests come through a network load balancer.

“But an NLB isn’t a true cluster!” I hear you exclaim. Well actually it is, albeit outside the core functionality of SharePoint itself. A cluster is just a collection of servers that service a single end-point, and an NLB is a perfect example of that. You can take a server out of the NLB (to reboot the server, for example) and outside traffic will carry on quite happily, albeit now with more load on the remaining active nodes.
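As a rough illustration of taking a server out of rotation, here is a small round-robin sketch in Python. Real load balancers use health probes and various distribution algorithms; the `LoadBalancer` class, its `drain`/`restore` methods and the `wfe1`-style server names are hypothetical and only there to show the idea.

```python
class LoadBalancer:
    """Toy round-robin balancer in front of several web front-ends."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._in_rotation = set(self._servers)
        self._next = 0

    def drain(self, server):
        """Take a server out of rotation, e.g. before rebooting it."""
        self._in_rotation.discard(server)

    def restore(self, server):
        """Put a known server back into rotation."""
        if server in self._servers:
            self._in_rotation.add(server)

    def route(self):
        """Return the next server in rotation, skipping drained ones."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next % len(self._servers)]
            self._next += 1
            if server in self._in_rotation:
                return server
        raise RuntimeError("no web front-end in rotation")


lb = LoadBalancer(["wfe1", "wfe2", "wfe3"])
lb.drain("wfe2")                       # reboot wfe2
served_by = [lb.route() for _ in range(4)]
# Traffic carries on, now shared between wfe1 and wfe3 only.
```

Users keep hitting the same end-point throughout; only the set of servers behind it changes.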

SharePoint Application Server Clustering

SharePoint needs application roles to handle service-application requests. As that statement is both obvious and slightly dry, here’s a real-life example.

Take web-part audience filters: they use user-profile data to work out whether the audience setting applies to the current user, and therefore whether to show the web-part or not. This type of query requires the user profile service-application, which in turn sends requests to application servers running the “user profile” service. These calls will work fine (and therefore so will the page load with the web-part) as long as just one of the servers responds to the user profile service request, so always have at least two servers for each type of service. If there’s only one server in the list of “user profile servers” and it doesn’t respond, then that’s a fatal error for the page rendering.
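Here is a minimal sketch of that behaviour in Python. SharePoint’s own service-application load balancing is internal to the product, so this is only a conceptual model: `call_service`, the two fake servers and the audience data are all invented for the example.

```python
def call_service(servers, request):
    """Try each application server in turn; return the first good response."""
    errors = []
    for server in servers:
        try:
            return server(request)
        except ConnectionError as exc:
            # This server didn't respond, so move on to the next one.
            errors.append(exc)
    # Nobody answered: this is the "fatal error for the page" case.
    raise RuntimeError(f"no application server responded: {errors}")


def app1(user):
    """Hypothetical app server that is currently down."""
    raise ConnectionError("app1 not responding")


def app2(user):
    """Hypothetical app server running the user profile service."""
    return {"user": user, "audiences": ["Sales"]}


# Two servers: the page still renders even though app1 is down.
profile = call_service([app1, app2], "alice")
show_web_part = "Sales" in profile["audiences"]
```

With `[app1]` alone as the server list, the same call raises an error, which is why a single server per service type is a risk.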

Again, this type of invisible failover ability is technically “clustering”, and actually is very handy at dealing with unexpected outages. More on this architecture here.

Related to application-server clustering are search clustering and AppFabric clustering.

Virtual Machine (VM) Clustering

Some people like to cluster the machines that run SharePoint Server (or some other dependent server). If the host running the VM dies, the VM is failed over to another host, sometimes even in another data-centre. Hyper-V can do this quite nicely and so can other hypervisors.

SharePoint keeps running and life goes on blissfully unaware of the disaster that just happened.

Important: This isn't supported with any version of SharePoint as there's no guarantee of data parity between VM replicas. The only thing that we do support for SharePoint is failing over to Azure via Azure Site Recovery.

SharePoint Farm Clustering/Disaster Recovery (DR)

SharePoint DR is pretty simple; it’s just having two SharePoint farms that share the same content. The primary farm is in read/write mode until we move users to a passive farm, which then becomes the new primary. Every farm shares the same content, so each should look identical to the user, albeit with its own separate configuration & service application databases.
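The primary/passive swap can be sketched in a few lines of Python. This is only a model of the roles, assuming the passive farm is not writable until it is promoted; `Farm`, `FarmPair`, `fail_over` and the farm names are hypothetical, and the sketch says nothing about how content actually gets replicated between farms.

```python
from dataclasses import dataclass


@dataclass
class Farm:
    """One SharePoint farm in the DR pair (hypothetical model)."""
    name: str
    read_write: bool  # True only for the current primary farm


class FarmPair:
    """Two farms sharing the same content, one primary and one passive."""

    def __init__(self, primary: Farm, passive: Farm):
        self.primary = primary
        self.passive = passive

    def fail_over(self) -> Farm:
        """Move users to the passive farm, which becomes the new primary."""
        self.primary.read_write = False
        self.passive.read_write = True
        self.primary, self.passive = self.passive, self.primary
        return self.primary


pair = FarmPair(Farm("farm-a", read_write=True), Farm("farm-b", read_write=False))
new_primary = pair.fail_over()  # farm-b now serves users in read/write mode
```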

The idea is that an entire SharePoint farm can die or go offline for some reason and SharePoint users will still be able to use SharePoint. More on SharePoint disaster-recovery here.

Again, this is also clustering because we’ve doubled up the SharePoint servers to provide the same service: running a whole SharePoint farm. It’s unlikely anyone will mean SharePoint DR when they say “SharePoint clustering”, but it’s worth knowing about just in case.

Which Clustering/High-Availability Techniques to Use for SharePoint?

Good question. All of them if there’s enough budget for it :)

If I had to prioritise though, having a disaster-recovery farm is pretty high on the list in my opinion. After that, SQL Server AlwaysOn gives you two replicas of the same data, and having the SharePoint WFEs & app-servers clustered too will give you a pretty resilient SharePoint farm.

Hyper-V clustering is probably last on my list of priorities because it’s not easy to set up and it’s just for guaranteeing uptime for virtual machine instances. My preferred designs, on the other hand, assume failures will occur at a virtual-machine level and just make sure SharePoint isn’t affected by them. Failures will always happen; it’s how we handle them that counts for high availability.

Wrap-Up

So I hope that’s cleared up what might be meant by the term “SharePoint clustering”. In short, it can mean all sorts of things, so hopefully this will help clarify exactly what’s meant, and at what level. All of these techniques are useful for making SharePoint highly available, so they should all be considered if you care about SharePoint uptime.

Cheers,

// Sam Betts