SharePoint vs. Snapshots

Note: for the purposes of this article I'm focusing exclusively on virtualization- and storage-based snapshots. SQL snapshots are a different matter altogether and, if available, are used frequently by SharePoint itself. The two should not be confused.

One of the many joys of our heavily virtualized world is the use of snapshots. Whether they're in the storage system or in the virtualization tier, snapshots can provide an invaluable form of insurance against some risky or irreversible action that you're about to take in your environment.

Just not for SharePoint. SharePoint HATES snapshots. They go against everything that SharePoint does to manage itself. They create inconsistency and conflict. They change the rules behind SharePoint's back. They are, simply, evil.

And yet, they're so darned useful!

So… we want to use snapshots. We want to leverage the powers of the tools we've paid for. We want to have a safety net that can help us back out of bad decisions. We want to be able to turn back time and recover something we might have lost. We want the power that we've paid for!

It is possible to use snapshots with SharePoint. First, we have to understand a little about SharePoint, a little about snapshots, a little about networking, and a minor surprise about Active Directory domain membership.

Issue #1: The SharePoint Timer Service
Our first issue is the Timer service that should always be running on every SharePoint server. Always. It is the responsibility of the Timer service to check for work that needs to be done, do it, and then report back that the work has been completed. This can range from basic maintenance tasks, such as cleaning up a few rows in a database, to large tasks, such as creating a new web application in the farm. These jobs can run in response to an end-user request, such as deploying a SharePoint farm-based solution, or on a scheduled basis, such as starting a search crawl or synchronizing information between databases and service applications. There are many of them, and they can start at any time (per their configuration) and can run for a significant length of time. You can think of each timer job as a ball that SharePoint has to keep in the air. Some balls need to be thrown up more frequently, some take longer to return to earth, and some get thrown at SharePoint from nowhere. In all instances, it is the responsibility of the Timer service to keep all of the balls up.

In order to keep all of these balls in the air, the Timer service on a SharePoint server checks the Configuration database once per minute and asks if there is any work that it is responsible for. That is, every server checks in once per minute to find out if it should be working on something, and each one is checking in at a different point within that minute due to various factors. So, Server 1 might check at :12 seconds into the minute… Server 2 at :22 seconds, Server 3 at :55, and Server 4 at :56 seconds. There's no real way of knowing… all you know is that at least once per minute, every one of your servers is attempting to do work.
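To make that check-in pattern a little more concrete, here's a toy sketch in Python. It is not how the Timer service is implemented; it simply assumes each server polls once per minute at the example offsets mentioned above, and the names `OFFSETS`, `check_ins`, and `longest_quiet_gap` are made up for illustration.

```python
# Toy model only: the offsets are the example values from the text,
# not anything read from a real farm.
OFFSETS = {"Server 1": 12, "Server 2": 22, "Server 3": 55, "Server 4": 56}


def check_ins(offsets, minutes):
    """Return every (elapsed_seconds, server) poll in the window, in time order."""
    events = [
        (minute * 60 + offset, server)
        for minute in range(minutes)
        for server, offset in offsets.items()
    ]
    return sorted(events)


def longest_quiet_gap(events):
    """Longest stretch, in seconds, between two consecutive polls."""
    times = [t for t, _ in events]
    return max(later - earlier for earlier, later in zip(times, times[1:]))


events = check_ins(OFFSETS, minutes=3)
print(len(events), "check-ins in 3 minutes")
print("longest quiet gap:", longest_quiet_gap(events), "seconds")
```

Even with only four servers, the quiet windows between check-ins are short, and in a real farm you don't know where they fall.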

Example: You ask SharePoint to create a new web application. In order to do this, SharePoint must configure a new IIS site on all SharePoint servers that host the Web Application Server role. So, the Central Admin server creates a new timer job assigned to each web front end (WFE), informing that WFE that it is to go through any necessary steps to support that new web application. This is what is happening when you see the "Please wait" dialog in SharePoint; Central Administration is waiting for all of the timer jobs it has created to report as being completed successfully. All is well.

However, you decide that something should have been done differently when you deployed the web application, and you, being knowledgeable in virtualization, took a snapshot of the SharePoint servers before you started. "Simple solution!" you say, and revert the snapshots to the previous state.

Unfortunately, all is not well. Although you did remove the IIS site and SharePoint configuration, the configuration database itself firmly believes that the new web application was created, and believes that it should be maintained. The servers will not retry creating it because the timer jobs that were assigned to them say that the work was done successfully, yet all of the timer jobs targeted at that new web application continue to run. And fail. Constantly. Once per minute per server.
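The mismatch can be sketched with a toy model. This is an illustration under simplifying assumptions, not SharePoint's actual data model: `ToyFarm`, its dictionaries, and the "succeeded" job status are all hypothetical. The one thing it mirrors from the scenario is that the configuration database lives outside the server snapshots.

```python
import copy


class ToyFarm:
    """Hypothetical farm: server state can be reverted, the config DB cannot."""

    def __init__(self, servers):
        # Lives on SQL, so it is NOT part of the server snapshots.
        self.config_db = {"web_apps": set(), "jobs": {}}
        self.servers = {name: {"iis_sites": set()} for name in servers}

    def create_web_app(self, app):
        self.config_db["web_apps"].add(app)
        for name, state in self.servers.items():
            state["iis_sites"].add(app)                      # local IIS site
            self.config_db["jobs"][(name, app)] = "succeeded"

    def snapshot_servers(self):
        return copy.deepcopy(self.servers)

    def revert_servers(self, snap):
        self.servers = copy.deepcopy(snap)                   # config_db untouched

    def timer_tick(self):
        """One polling pass; returns the (server, app) pairs whose jobs fail."""
        failures = []
        for name, state in self.servers.items():
            for app in sorted(self.config_db["web_apps"]):
                if self.config_db["jobs"].get((name, app)) != "succeeded":
                    state["iis_sites"].add(app)              # provision once
                    self.config_db["jobs"][(name, app)] = "succeeded"
                elif app not in state["iis_sites"]:
                    failures.append((name, app))             # never retried
        return failures


farm = ToyFarm(["WFE1", "WFE2"])
snap = farm.snapshot_servers()
farm.create_web_app("intranet")
before_revert = farm.timer_tick()
farm.revert_servers(snap)
after_revert = farm.timer_tick()
print(before_revert, after_revert)
```

In this model every later call to `timer_tick()` reports the same failures, because nothing ever tells the servers to provision the site again.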

"Oh! Well, I'll just revert the SQL server too!" you say, and you do. Unfortunately, while your SQL server's OS state was reverted, you're using a SAN for the actual data files… and those files are unaffected by your virtualization snapshots. Still no good. Problems abound.

Issue #2: The SAN
"Fine, I'll just do all of them… SharePoint, SQL, and the SAN!" you'll say. However, despite your wizard-like speed with a mouse and keyboard, you just can't click all of the buttons at once… and different kinds of snapshots take different amounts of time to begin the snapshotting process and to complete the creation of the snapshot. Each tier ends up capturing a slightly different moment in time.
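Here's a small sketch of why that timing matters. The write timeline and the snapshot start times are invented numbers, and the function names are hypothetical; the point is only that tiers snapshotted at different instants see different sets of completed writes.

```python
def state_at(writes, t):
    """Set of write ids committed at or before time t (in seconds)."""
    return {write_id for timestamp, write_id in writes if timestamp <= t}


def captured_states(writes, snapshot_times):
    """What each tier's snapshot would contain, given when it was taken."""
    return {tier: state_at(writes, t) for tier, t in snapshot_times.items()}


def is_consistent(states):
    """True only if every tier captured exactly the same set of writes."""
    values = list(states.values())
    return all(value == values[0] for value in values)


# Invented example: three writes land while you click through three consoles.
writes = [(0.0, "a"), (1.5, "b"), (3.2, "c")]
snapshot_times = {"SharePoint VM": 1.0, "SQL VM": 2.0, "SAN": 4.0}

states = captured_states(writes, snapshot_times)
print(states)
print("consistent:", is_consistent(states))
```

Only if all three snapshots were taken at the same instant, with no writes in flight, would `is_consistent` come back true in this model.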

Issue #3: The Network
Even if you do manage to get the SAN and VM snapshots perfectly lined up, none of those technologies currently capture the state of the network. While servers are talking to each other, to clients, and to databases, packets of information are whizzing over the wires, and network infrastructure is judiciously trying to deliver those packets. If we use our "ball in the air" analogy, this would be the equivalent of one server throwing a ball at another server that has previously agreed to catch it. Your timing, though, is impeccable, and you manage to take the snapshot at just the moment when one server has sent the ball but the other has not yet received it. Plus, while your environment is performing its snapshot, the network that is ferrying the ball delivers it to the intended receiver anyway. Picture a baseball player frozen mindlessly as the ball falls directly onto that player's forehead and then to the ground. Fortunately, most network protocols in place today are relatively stateful… and when you revert the snapshot (again, assuming your synchronization was perfect to the nanosecond, which it couldn't be), they'll retry. More likely, though, something got lost in translation… and it was all your fault.

Issue #4: Domain Membership
Did you know your computer has an account in Active Directory? Did you know that, by default, it changes its password automatically, without telling you, every 30 days? Did you know that if the machine changes its password and then you revert the machine to just before it initiated its own password change request, you may never be able to log in to that machine using a domain logon again? Now you do.
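A deliberately oversimplified sketch of that failure follows. `ToyDirectory` and `ToyMachine` are hypothetical classes, and the model ignores the details of the real machine-account protocol; it only shows the core problem, which is that a snapshot reverts the machine's copy of the secret but not the directory's.

```python
class ToyDirectory:
    """Stand-in for the directory: remembers one secret per machine account."""

    def __init__(self):
        self.machine_secrets = {}

    def set_secret(self, machine, secret):
        self.machine_secrets[machine] = secret

    def verify(self, machine, secret):
        return self.machine_secrets.get(machine) == secret


class ToyMachine:
    """Stand-in for a domain member that rotates its own account password."""

    def __init__(self, name, directory, secret):
        self.name = name
        self.directory = directory
        self.secret = None
        self.rotate(secret)

    def rotate(self, new_secret):
        # The machine and the directory are updated together.
        self.secret = new_secret
        self.directory.set_secret(self.name, new_secret)

    def snapshot(self):
        return self.secret

    def revert(self, snap):
        # Only the machine goes back in time; the directory does not.
        self.secret = snap

    def can_authenticate(self):
        return self.directory.verify(self.name, self.secret)


directory = ToyDirectory()
server = ToyMachine("SP-WFE1", directory, secret="password-v1")
snap = server.snapshot()
server.rotate("password-v2")        # the automatic change happens
ok_before_revert = server.can_authenticate()
server.revert(snap)                 # back to just before the change
ok_after_revert = server.can_authenticate()
print(ok_before_revert, ok_after_revert)
```

In this model the reverted machine presents a secret the directory no longer recognizes, which is the heart of the problem.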

So… How can you safely use either SAN or VM snapshots in SharePoint?

Tune in next time to find out! ;)

The follow-up article, SharePoint vs. Snapshots (Part 2), is now available!