Virtual Machine Snapshots for SharePoint
…or perhaps better titled, “why you really shouldn’t consider using snapshots with SharePoint (but fine, if you insist…)”.
A common question for SharePoint is how to snapshot and roll back virtualised SharePoint farms, usually asked by worried admins who have to install updates to the farm and want a contingency plan. Unless you’re crystal-clear on how to do this, don’t take snapshots, and definitely don’t roll them back, of production environments especially. If you must, though, this is what you need to know.
First of all, our documentation makes it clear that this is not recommended. That means if you break your farm because of virtual-machine snapshotting then it’s your own fault; we can help you fix it of course, but don’t say we didn’t warn you it might break.
SharePoint Farms are an “All or Nothing” Game
For any SharePoint server that’s joined to a farm, you can’t just restore that machine on its own. If you have restored a single machine then you’ve probably broken it: remove the rolled-back server from the farm & re-add it if you want to be in a supported configuration. Said server might even appear to work, but it’s only a matter of time before disaster strikes.
For servers that are unattached to any SharePoint farm, snapshot away to your heart’s content.
Now, if you’re willing to jump through all the hoops to get a rollback facility & take the risk of stuff breaking, here’s how to take a snapshot of a SharePoint farm.
There’s no clever way around this; you have to roll back every SharePoint server and the SQL databases together, all to the same point in time.
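To make the all-or-nothing rule concrete, here is a minimal sketch of a sanity check on a rollback plan. The machine names and the `missing_from_rollback` helper are hypothetical; the only point is that any farm member left out of the plan, SQL included, makes the rollback unsafe.

```python
def missing_from_rollback(farm_machines, rollback_plan):
    """Return the farm machines (SharePoint and SQL) left out of a planned rollback."""
    return sorted(set(farm_machines) - set(rollback_plan))


# Hypothetical farm: two SharePoint servers plus the SQL Server holding the databases.
farm = ["sp-app-01", "sp-wfe-01", "sql-01"]
plan = ["sp-app-01", "sp-wfe-01"]  # someone forgot the SQL Server

missing = missing_from_rollback(farm, plan)
if missing:
    print("Unsafe rollback, not included:", ", ".join(missing))
```

An empty result is the only acceptable answer before anything gets rolled back.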
Taking Snapshots of a SharePoint Farm
The only supported way of snapshotting a SharePoint farm is to have all SharePoint services stopped first. You don’t necessarily need to shut down the machines to do this, but given it’s easy to forget which services are running, I’d highly recommend shutting down all the SharePoint servers to be sure everything is stopped.
Are all SPServices shut down (IIS included)? Sure? OK, now take a snapshot. If you’re wrong you might break the farm when you roll back, hence I’d recommend the farm-wide shutdown just to be sure.
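If you’d rather check than trust your memory, the question above can be sketched as a small pre-snapshot check. This assumes you have already collected, by whatever means you like, a list of running Windows services per server; the server names are made up, and the service names below (SharePoint timer, administration and tracing services plus IIS) are an illustrative list, not a complete one.

```python
# Illustrative, not exhaustive: services that should be stopped before a snapshot.
MUST_BE_STOPPED = {"SPTimerV4", "SPAdminV4", "SPTraceV4", "W3SVC"}


def servers_not_ready(running_by_server):
    """Map each server to the blocking services it is still running."""
    not_ready = {}
    for server, running in running_by_server.items():
        blockers = sorted(MUST_BE_STOPPED & set(running))
        if blockers:
            not_ready[server] = blockers
    return not_ready


# Hypothetical inventory of running services per server.
inventory = {
    "sp-app-01": ["Dnscache", "SPTimerV4"],
    "sp-wfe-01": ["Dnscache"],
}
print(servers_not_ready(inventory))
```

Anything other than an empty result means the snapshot should wait, which is exactly why the farm-wide shutdown is the simpler option.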
Now snapshot the farm (here’s a simple farm I use to hack around break/fix tests with)…
And that’s it; I could roll back all the machines to this point and this particular environment would survive just fine. But…
Things That Can Go Wrong With VM Snapshots & Rollbacks
There’s lots that can go wrong. The above example is a 4-machine setup: 1 AD server, 1 SQL Server and 2 SharePoint servers. That’s not very realistic for a real production network though, which is where we start to run into trouble. Lots of trouble, potentially; here are some of the breaking possibilities.
Epic Breaking Possibility 1: Active Directory Doom
Normally in production networks there’s a whole bunch of AD servers, which SharePoint uses for, well, absolutely loads of things; pretty much every service account in fact. Often, the network in question will use AD for other things too: user logins, other applications, and general network runtime.
So in other words, given how AD is used, it’s unlikely the AD servers can be snapshotted along with the farm. What does this mean then? Well, we’re assuming that on rollback, AD is in more or less the same state it was in when the snapshot was taken. That’s quite an assumption, and I probably don’t need to explain what potential pitfalls this would imply.
In short, for a SharePoint farm rollback to work, all service & machine accounts in Active Directory need to be in the same state when the farm is rolled back as they were when the snapshot was taken. If that’s not the case because accounts have been updated or changed since then, you now have a mammoth reconfiguration task on your hands and probably a lot of downtime too.
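One way to picture that requirement is as a before/after comparison of the accounts the farm depends on. The sketch below assumes you recorded a few attributes per account when the snapshot was taken and can read them again before rolling back. The account names and values are invented, and picking `pwdLastSet` and `userAccountControl` as the attributes to watch is my assumption, not an official checklist.

```python
def drifted_accounts(at_snapshot, current):
    """Return accounts whose recorded attributes no longer match (or that have vanished)."""
    drifted = []
    for account, recorded in at_snapshot.items():
        if current.get(account) != recorded:
            drifted.append(account)
    return sorted(drifted)


# Hypothetical attribute records for a service account and a machine account.
at_snapshot = {
    "svc-spfarm": {"pwdLastSet": 131000000000000000, "userAccountControl": 512},
    "SP-WFE-01$": {"pwdLastSet": 131000000500000000, "userAccountControl": 4096},
}
current = {
    "svc-spfarm": {"pwdLastSet": 131200000000000000, "userAccountControl": 512},
    "SP-WFE-01$": {"pwdLastSet": 131000000500000000, "userAccountControl": 4096},
}
print(drifted_accounts(at_snapshot, current))
```

Any account in that list is one that the rolled-back farm will no longer agree with AD about.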
Finally, snapshotting anything except all domain-controllers at once is an exercise in futility & pain – Active Directory is a distributed database that you really (really) don’t want to risk upsetting by rolling-back some of the domain-controllers for it. You’ve been warned!
Epic Breaking Possibility 2: SharePoint Server State Out of Sync with Farm Configuration Database
On a simpler note, a common problem that comes up is simply that the SharePoint servers are out of sync with the configuration database.
What does out of sync mean? Well, the configuration database holds the state & (strangely enough) the configuration of all elements of the farm, amen. As that might imply, it can also grow pretty big, so to avoid each SharePoint server hitting SQL Server every time it wants to know about the farm setup, each server keeps a local file cache of the same data. In short, if that local cache isn’t in lock-step with the configuration database then weird things can start to happen: services won’t work, and generally it’ll cause a lot of, you guessed it, pain, doom, etc.
This is pretty well known by now just on account of how often this problem arises (thanks to VM snapshotting, funnily enough). The solution is easy though: clear the configuration cache on each server.
So how do we make sure we don’t lose sync during snapshotting? Either restore the configuration database to the same point in time as the snapshot, or include the SQL Server(s) state & data in the snapshot too.
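For reference, the cache clear can be sketched as below. This follows the commonly documented manual procedure as I understand it: with the SharePoint Timer service stopped, delete the cached XML files in the server’s configuration cache folder (typically a GUID-named folder under the SharePoint Config directory in ProgramData), reset `cache.ini` to 1, then start the timer service again. The function name is hypothetical, and stopping and starting the service is deliberately left to you.

```python
from pathlib import Path


def clear_config_cache(cache_dir):
    """Delete cached XML objects and reset cache.ini; return how many files were removed.

    Assumes the SharePoint Timer service has already been stopped by the caller.
    """
    cache_dir = Path(cache_dir)
    removed = 0
    for xml_file in cache_dir.glob("*.xml"):
        xml_file.unlink()  # only the XML files go; the folder itself stays
        removed += 1
    # Resetting the counter makes the server rebuild its cache from the database.
    (cache_dir / "cache.ini").write_text("1")
    return removed
```

Run it once per SharePoint server against that server’s own cache folder, then restart the timer service and let the cache repopulate.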
Epic Breaking Possibility 3: SharePoint Binary/Database Patch Compatibility
Depending on what you snapshot, it’s possible you might end up with databases that won’t work with the binary versions of SharePoint.
Example scenario: we patch SharePoint and update the databases too, then something goes horribly wrong with the new version (maybe some new incompatibility), so some genius decides to roll back the SharePoint machines to the previous state in order to resolve the problem. Now we have a compatibility mismatch, as old binaries are trying to work with fully updated SharePoint databases, and everything blows up properly this time. See more about this potential version mismatch problem here.
Again, more wailing & gnashing of teeth, thanks in part to VM snapshotting.
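The mismatch in that scenario boils down to comparing two build numbers. Here is a minimal sketch, assuming you can read the build of the installed binaries and the build recorded for the databases as dotted version strings; the version numbers and function names are invented for illustration.

```python
def parse_build(version):
    """Turn a dotted build string such as '15.0.4569.1000' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def binaries_behind_database(binary_version, database_version):
    """True when the binaries are older than the schema the databases were upgraded to."""
    return parse_build(binary_version) < parse_build(database_version)


# Hypothetical builds: servers rolled back to pre-patch, databases left upgraded.
print(binaries_behind_database("15.0.4420.1017", "15.0.4569.1000"))
```

Comparing tuples of integers rather than raw strings matters here, because a plain string comparison would order "15.0.999.0" after "15.0.4569.0".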
So if you’re having second thoughts about snapshotting your production SharePoint farm, then good, my work here is done. There are ways of keeping SharePoint online if you’re worried about uptime, for production especially: running a separate contingency farm, for example.
Yup, maybe. That all said though, SharePoint VM rollbacks are still technically possible if you really have to do them. Just be careful!