Why Disaster Recovery Farms are Essential for High-Availability SharePoint
It’s not uncommon for customers to plan a highly-available SharePoint (HA-SP) installation to satisfy the uptime requirements of the business. It is uncommon, though, for those designs to be architected in the best way for staying online, and that is what this post is about.
The Normal High-Availability SharePoint (HA-SP) Strategy
The normal approach to SharePoint high-availability is to add servers to an existing farm so that no one role is a single point of failure. A web front-end or application/search server can die and things will carry on without any intervention, assuming it’s set up properly.
This isn’t a bad idea, and it should be implemented too, if only to allow SharePoint servers to reboot for Windows patches without causing problems in the farm. But even with multiple servers you still have multiple single points of failure. See below for a list of points of failure that could still ruin your day as a SharePoint admin.
The Better HA-SP Strategy
Better than having a boat-load of servers in one farm for uptime is having a secondary hot-standby farm ready to go in case of any failure or problem on farm #1. It doesn’t have to be as scaled out or as highly available as the primary farm, but a backup farm that can take users if there’s any problem at all on the primary site is very valuable indeed.
We keep content in sync between the sites with transaction-log-shipping and basically run two parallel & identical farms.
That’s not to say there’s no value in having service redundancy within any one farm; there is. The point is just that service redundancy shouldn’t be the only high-availability strategy used.
What could go wrong on the primary farm then?
Lots of things really. Most things that go wrong are things we never imagined could or would, but some of the things I’ve seen happen are:
- Configuration database corruption/error.
  - Configuration changes are made all the time, and a SharePoint farm is one complex machine. Having another one on standby is useful when configuration errors strike, whether user/admin-caused or not.
- Problematic service application.
  - Search problem – issues with indexes or crawls for whatever reason.
  - User profile issue – any number of issues; import issues, site-collection sync issues.
  - X/Y/Z service application unexpected behaviour.
- Patching problems and delays.
  - Bad or failed Windows/ASP.NET patches – it’s very rare, but platform patches have been known to inadvertently cause issues with SharePoint and/or custom code, either because of quality issues or, much more commonly, because the installation failed for whatever reason. This can cause all sorts of havoc with SharePoint.
  - SharePoint patching & updates.
    - This can be a slow & complicated process, depending on what’s got to be updated.
    - You can expect to have to patch your farm at some point, if only to remain on a supported configuration.
- Infrastructure/platform failures.
  - Any unexpected failure of the dependent technologies that SharePoint relies on – SQL Server, AD, DNS, networking, disk errors, etc.
- More things that we haven’t thought of. Normally it’s the things that haven’t been thought of that get us, so just imagine this list is twice as long.
Any of those would have you calling us if you couldn’t work out how to fix the issue and, if you only had one farm, it could be very urgent to resolve. Sometimes these issues are our fault but often not; who cares though? When there’s a fire, the most urgent thing is to put it out.
In a Parallel SharePoint Universe…
Now just imagine you could switch to another entire farm while you resolve the problem on the 1st site. That’s actually pretty easy to set up, and it takes a lot of pressure off the network/SharePoint engineers, who can look at the problem calmly while none of the users are any the wiser.
While users are unknowingly using farm 2, we’re fixing issues back on farm 1 and everyone is happy. Yes, we’ll have to commit the content changes back to farm #1 if we want users there again at some point, but the point is we have a very quick way of handling a disaster. Hence the name “disaster recovery”.
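The decision to move users across can be sketched in a few lines of Python. This is only an illustration of the idea, not a monitoring tool: the `FarmHealth` class, the component names and the `choose_active_farm` function are all made up for this example, and a real check would have to probe the actual services.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FarmHealth:
    """Health-check results for one farm (hypothetical structure)."""
    name: str
    checks: Dict[str, bool]  # component name -> True if the check passed

    def is_healthy(self) -> bool:
        # A farm with no check results is treated as unhealthy, and a
        # single failed component is enough to fail the whole farm.
        return bool(self.checks) and all(self.checks.values())


def choose_active_farm(primary: FarmHealth, standby: FarmHealth) -> str:
    """Return the name of the farm that should be serving users."""
    if primary.is_healthy():
        return primary.name
    if standby.is_healthy():
        return standby.name
    raise RuntimeError("Neither farm is healthy; manual intervention needed")


if __name__ == "__main__":
    # Assumed example data: search has broken on the primary farm.
    farm1 = FarmHealth("farm1", {"config_db": True, "search": False, "sql": True})
    farm2 = FarmHealth("farm2", {"config_db": True, "search": True, "sql": True})
    print(choose_active_farm(farm1, farm2))
```

The primary is always preferred when it is healthy, so users drift back to farm 1 only once every check passes again, which in practice should also wait until content changes have been committed back.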
OK, So How Do I Get Another SharePoint Farm Then?
It’s actually not so difficult – you basically need one or more separate SharePoint servers, another SQL Server, and a bunch of new databases for the new farm. The content databases can be synchronised from the primary with log-shipping.
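To give a feel for what log-shipping boils down to, here is a minimal Python sketch that only builds the T-SQL statements involved: a log backup on the primary, a restore of that backup on the secondary, and the final recovery step at failover. The database name and file path are placeholders invented for this example, and the use of `WITH NORECOVERY` is an assumption (a read-only standby would use `STANDBY` instead). A real setup would normally use SQL Server’s own log-shipping jobs rather than hand-built statements.

```python
def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_path(path: str) -> str:
    """Render a file path as a T-SQL Unicode string literal."""
    return "N'" + path.replace("'", "''") + "'"


def backup_log_sql(database: str, backup_file: str) -> str:
    """Statement to run on the primary SQL Server."""
    return f"BACKUP LOG {quote_name(database)} TO DISK = {quote_path(backup_file)};"


def restore_log_sql(database: str, backup_file: str) -> str:
    """Statement to run on the secondary SQL Server.

    NORECOVERY leaves the database in a restoring state so that
    later log backups can still be applied.
    """
    return (
        f"RESTORE LOG {quote_name(database)} "
        f"FROM DISK = {quote_path(backup_file)} WITH NORECOVERY;"
    )


def bring_online_sql(database: str) -> str:
    """Statement to run on the secondary at failover time."""
    return f"RESTORE DATABASE {quote_name(database)} WITH RECOVERY;"


if __name__ == "__main__":
    # Placeholder names, not taken from any real farm.
    db = "WSS_Content"
    trn = r"\\backupshare\logship\WSS_Content_0001.trn"
    print(backup_log_sql(db, trn))
    print(restore_log_sql(db, trn))
    print(bring_online_sql(db))
```

Once the recovery statement has been run the secondary copy stops accepting further log restores, so it belongs in the failover procedure and not in the regular sync schedule.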
Obviously the preferable option is to have two equally sized farms, both with their own service redundancy built in. Even better is the ability to switch between the two on a regular basis, so that we can be assured any unplanned failover will work. But if money is tight and you can only afford a 2nd-class farm for disaster recovery, a lesser secondary is still better than nothing.
Read more about setting up a disaster recovery (DR) farm here.