Graceful SharePoint AppFabric Restarts
Many people have asked how to cleanly restart an AppFabric server so that data in the cache isn’t lost, and some have tried and found they couldn’t get it to work. It’s a good question, and I hope to go some way towards answering it here, partly because the official commands don’t actually work so well by default.
Update: the TechNet documentation has since been updated with a script that should work just as well as the guide below. The rest of this post explains what that script is doing.
First, a quick test to demonstrate the cache working, so we can easily see when it breaks. Here I’ve got a small app I made that publishes a new post to my own social feed every 2 seconds.
Pretty simple concept. The key part to this test is that social feeds only show up in SharePoint when AppFabric is working nicely:
If AppFabric dies unexpectedly and cache-data is lost, then quite simply social feeds won’t appear correctly.
Side-point; you’ll see lots of people complaining about social data not appearing, incidentally. It’s because there’s nothing in the cache, probably because the cache is broken and needs to be repaired.
But I digress; the newsfeed makes for a nice visual test for when we’ve broken our test AppFabric cluster.
The AppFabric Server Restart Tests
So what we’re going to do is run two tests to show the right & wrong way of restarting an AppFabric machine: what happens when AppFabric breaks, and why. My environment has two AppFabric servers. We’ll reboot one and see what happens; then we’ll do the same again, but this time with a graceful shutdown first.
Breaking AppFabric Test – The Norm
So first let’s run a test that’ll do horrible things to our AppFabric cluster; otherwise known as just rebooting an AppFabric machine like you would normally. Here’s the healthy cluster state; all servers are online and servicing caching requests:
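For reference, here is a minimal sketch of how the cluster state can be listed. It assumes a SharePoint 2013 Management Shell opened on one of the cache hosts, where the AppFabric caching cmdlets are available:

```powershell
# Point the AppFabric cmdlets at the cluster this machine belongs to.
Use-CacheCluster

# List every cache host in the cluster with its port and service status.
Get-CacheHost
```

In a healthy cluster every host should report a service status of UP.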
Now to reboot our victim server “search-idx” just like you would on any other day.
Bang! Adios, AppFabric (until it can get its knickers untwisted again, which it eventually will). Let’s look at the damage.
Get-CacheHost reports the machine offline and generally gets quite confused:
…and lo, the social feed is apparently “collecting”.
Interestingly, you’ll see lots of cache fractions suddenly unallocated if you run Get-CacheClusterHealth:
Unallocated named cache fractions
NamedCache = blah
Unallocated fraction = 5.12
…and so on.
Unallocated fractions are basically cache segments that don’t have a server (because the server they did have dropped off the cluster suddenly), so our social feed and anything else that needed that cache will have to load it again. Eventually, of course, the cache will rebuild and the unallocated fractions will disappear, but it takes a while, depending on the amount of data in there, the number of servers to copy to, their speed, system load, etc.
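To check for unallocated fractions yourself, something like the following should do. Again, this is a sketch that assumes a SharePoint Management Shell on a cache host:

```powershell
# Set the cluster context first; the health cmdlet needs it.
Use-CacheCluster

# Report per-host and per-named-cache health, including any unallocated fractions.
Get-CacheClusterHealth
```

Re-running it every few minutes after an unplanned reboot should show the unallocated figures fall away as the cache rebuilds.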
Now let’s see how to restart an AppFabric server without causing any hiccups.
A Graceful AppFabric Restart
This time we’ll run a graceful shutdown before rebooting. First, run this command (changing the hostname, of course):
Stop-CacheHost -HostName sp15-search-idx.sfb-testnet.local -CachePort 22233 -Graceful
Now I bet that surprised you; in the official SharePoint/AppFabric documentation we’re told to run “Stop-SPDistributedCacheServiceInstance -Graceful”. However for reasons too complicated to go into here, let’s just say for now that the official stop command is far from graceful – the service is in fact dropped like a hot potato and anything on that host in AppFabric goes with it.
More on that another day (it’s not a simple subject); for now though running Stop-CacheHost will work as expected instead and will give you something like this output for “Get-CacheHost” once it’s executed:
Edit: the updated documentation @ https://technet.microsoft.com/en-us/library/jj219613.aspx#graceful has a nice script to automate this graceful shutdown.
Notice the “SHUTTING DOWN” service status against our server in question. It means it’s offloading cache fragments to everyone else in the cluster still “UP”, and not adding any new fragments either, but still effectively online for cache queries. Very graceful indeed.
Monitoring AppFabric Graceful Shutdowns
As the node is shutting down, Get-CacheClusterHealth will show that node’s “healthy” object count shrink while the counts on the other server(s) increase. It can take a while (15-20 mins) and is frankly quite boring to watch, but if you must wait until it’s done then re-running the command will show the numbers slowly shift towards the nodes that aren’t shutting down.
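If you’d rather not sit refreshing by hand, a small polling loop can do the waiting for you. This is only a sketch: the hostname is the one from the example above, and the 15-second interval and 20-minute timeout are my own illustrative values, not anything from the official script.

```powershell
# Assumption: run from a SharePoint Management Shell on a cache host.
# Hostname, interval and timeout are illustrative; change them for your farm.
$cacheHost = 'sp15-search-idx.sfb-testnet.local'
$cachePort = 22233
$deadline  = (Get-Date).AddMinutes(20)

Use-CacheCluster

do {
    # Ask the cluster for the current status of the host being drained.
    $hostInfo = Get-CacheHost -HostName $cacheHost -CachePort $cachePort
    Write-Host ("{0:T}  {1} is {2}" -f (Get-Date), $cacheHost, $hostInfo.Status)

    if ("$($hostInfo.Status)" -eq 'Down') { break }
    Start-Sleep -Seconds 15
} while ((Get-Date) -lt $deadline)

if ("$($hostInfo.Status)" -ne 'Down') {
    Write-Warning "$cacheHost did not reach DOWN before the timeout; check Get-CacheClusterHealth before rebooting."
}
```

The loop only watches the status; it doesn’t stop anything itself, so it’s safe to cancel and re-run at any point.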
You’ll know it’s ready when Get-CacheHost eventually shows the shutting-down host as just “DOWN”. When the server status is “down”, you can finally run the other command in the documented process to complete the server shutdown prep:
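The command itself isn’t reproduced in this copy of the post. Going by the description that follows and the documented process, it is presumably the standard SharePoint cmdlet below, run on the server that’s about to be rebooted:

```powershell
# Assumption: this is the "other command" from the documented process.
# Run it locally on the cache host that is being taken out of the cluster.
Remove-SPDistributedCacheServiceInstance
```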
This will remove the server from the farm topology for any AppFabric requests, which in reality just means that SharePoint is kept on the same page AppFabric is about what machines should be servicing cache requests, but does little else.
Now you’re finally ready to reboot your server with the peace of mind your AppFabric cache won’t be impacted!
Once the machine has rebooted, re-add the server to the AppFabric cache cluster again with:
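Again, the command is missing from this copy. The counterpart in the documented process is presumably this one, run locally on the rebooted server:

```powershell
# Assumption: the documented counterpart to Remove-SPDistributedCacheServiceInstance.
# Registers this server as a cache host again and starts the service on it.
Add-SPDistributedCacheServiceInstance

# Optional check that the host has rejoined and reports UP.
Use-CacheCluster
Get-CacheHost
```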
What Happens If AppFabric Cache Doesn’t Shut Down Gracefully?
Not much, as it happens. In short, no data is lost beyond any immediately uncommitted “likes” or “follows” or whatever’s going on in the baked-in social functionality (as opposed to Yammer, which frankly is better). Logon tokens may be lost, which may result in one or two users being politely asked to log in again, but SPLife will basically go on without much drama.
As mentioned before though, the biggest victim tends to be the social feeds if you use them, which may die out depending on which cache chunks were lost. Social can be repopulated pretty much instantly with:
Conclusion: temporarily losing the cache isn’t, by definition, a big drama in SharePoint-land, assuming AppFabric is otherwise working under normal conditions.
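The original command isn’t shown in this copy of the post. A likely candidate is the built-in microblog cache repopulation cmdlet, sketched here on the assumption that the farm has a single User Profile service application proxy that can be found by its type name:

```powershell
# Assumption: the proxy is selected by its display type name; adjust the filter if
# your farm has more than one User Profile service application proxy.
$proxy = Get-SPServiceApplicationProxy |
    Where-Object { $_.TypeName -eq 'User Profile Service Application Proxy' }

# Rebuild the last-modified-time cache that the newsfeeds rely on.
Update-SPRepopulateMicroblogLMTCache -ProfileServiceApplicationProxy $proxy
```

There is also Update-SPRepopulateMicroblogFeedCache for refreshing one particular user’s or site’s feed, which additionally needs an account name or site URL.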
That’s it for now. I know some may be asking why the official guidelines on graceful shutdowns don’t work by default; let’s just say we’re looking at it.
In the meantime though, this workaround will work nicely if you absolutely need AppFabric smooth running for whatever reason. I hope this has helped – feedback is always welcome!
// Sam Betts