Understanding SharePoint - Part 2 - The Infamous Query Plan Bug and The Origins of SPSiteManager

This is the second in a series of posts regarding "Not being misled by what you're seeing :)"

In the next post in this series, I'll talk about SharePoint index sizes vs. the size of your databases, since you can't assume that a larger database means an equally larger index. I'll show you how it can actually be the complete opposite: the index for a larger database can be smaller than the index of content on a much smaller database.

DISCLAIMER: This post shows using Query Analyzer to issue statements against your SharePoint content databases. By no means does this mean that you should change anything in the database. This is simply for "READING" values, and even that should be done during off-peak hours.


Holy SQL Spikes Batman!

A long time ago, in a cube far, far away (Ok, across the hall from where I'm at now, so maybe not so far away)....I worked on an issue with a customer that:

  • On one hand, literally sucked the life right out of me :)

  • On the other, gave me the opportunity to come up with SPSiteManager (See Codin' till the fingers go numb)

Due to a recent occurrence of another customer hitting this, and the fact that I realized I hadn't shared this before, I felt this posting was warranted.

First off, thanks to Denzil Ribeiro, a SQL Escalation Engineer who originally worked with me internally to help narrow this down when I was an Escalation Engineer for SharePoint.


The Symptoms and Problem

  • You have a SharePoint site or Portal where you really haven't granted a lot of individual user permissions, but you do provide access via, say, a security group.

  • You don't have a lot of documents or list items, and are well within capacity planning guidelines.

  • 99% of your sites have a relatively low number of users associated with them:

    • There have never been more than, say, a few hundred unique users ever hitting the site; or perhaps they are mainly My Sites, with one or two users (the My Site owner, and a friend or two); or perhaps they are many, many team sites with low activity.
  • 1% of your sites, or perhaps just one single site collection, such as a Portal site collection, has > 10,000 users associated with it.

    • You've provided access to the site/portal with domain groups, or even a single domain group such as NT Authority\Authenticated Users
  • Yet sometimes performance suffers, or access to your SharePoint environment is completely blocked.

  • The processors on your SQL Server spike and sustain a high CPU utilization percentage, bringing everything else to a grinding halt.

    • In the case where we first hit this, the customer had plenty of web front ends and was running on an active/active cluster with 16 processors per node; all 32 procs would spike to 100%. They had one content database containing a Portal site collection and > 9,000 team and My Sites.

The Reason

When any SQL statement is executed in SQL Server, the relational engine first looks through the procedure cache to verify that an execution plan for the same SQL statement already exists. SQL Server reuses any existing plan it finds, saving the overhead of recompiling the SQL statement. If no execution plan exists, SQL Server generates a new one for the query. (Based on the data, the indexes, the number of joins, etc., it determines the best approach for retrieving the data and presenting it back to the caller.) This execution plan is then cached for later use. This is especially important if the statement is heavily parameterized, such as the queries we use to generate result sets for document library and list views.
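
If you want to see this cache in action, you can peek at it from Query Analyzer. This is a minimal, read-only sketch; 'WSS_Content' is a placeholder for your actual content database name, and per the disclaimer above, do this during off-peak hours:

    -- Read-only peek at the procedure cache (SQL Server 2000/2005)
    SELECT cacheobjtype,  -- e.g. 'Compiled Plan'
           objtype,       -- 'Prepared' for parameterized statements
           usecounts,     -- how many times this cached plan has been reused
           sql            -- the (truncated) statement text
    FROM master.dbo.syscacheobjects
    WHERE dbid = DB_ID('WSS_Content')  -- placeholder database name
    ORDER BY usecounts DESC

A high usecounts value on one of the big parameterized view queries is exactly the reuse described above.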

In this case, the initial statement is compiled with one set of parameters, and 99% of the time the plan the SQL optimizer produced was a good query plan for that set of data, so the statement executes in negligible time. When this same query is executed again with a different set of parameters, since it was a parameterized query, we look up the cached plan, find it, and use it. The cached plan is good for the first set of parameters but BAD for the second set, and can thus cause your SQL Server to consume the CPU trying to process the results based on the plan it generated previously.
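
To make that concrete, here is a hypothetical illustration of the mechanics. The statement and parameter names below are placeholders, not the actual SharePoint view queries:

    -- Both executions below share ONE cached plan, shaped by whichever
    -- @SiteId happened to run first. (Illustrative placeholders only.)
    DECLARE @stmt nvarchar(500), @params nvarchar(100)
    SET @stmt   = N'SELECT COUNT(*) FROM UserInfo WHERE tp_SiteID = @SiteId'
    SET @params = N'@SiteId uniqueidentifier'

    -- First call: a small team site; the plan is tuned for very few rows.
    EXEC sp_executesql @stmt, @params, @SiteId = '11111111-1111-1111-1111-111111111111'

    -- Second call: the big portal site; the SAME cached plan is reused,
    -- even though a different plan would suit this much larger row set.
    EXEC sp_executesql @stmt, @params, @SiteId = '22222222-2222-2222-2222-222222222222'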

The reason the plan was bad for the second set of parameters, in this case, was the data distribution in the UserInfo/UserData tables. (For more information on how users are cached in the UserInfo and UserData tables, see my posting here: https://blogs.msdn.com/krichie/archive/2006/02/18/534767.aspx ) Because this single site has many users recorded against it via the domain group, the likelihood of hitting this problem increases. The majority of the sites have very few rows in the UserInfo table, while the minority have thousands of rows for their sites. Hence, the execution plan chosen for a site with very few rows was optimal for any site that had few rows, but when that query plan is used against a site that has a TON of rows, it is not optimal at all.
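
You can see this skew for yourself with a read-only query. The table and column names below assume the WSS v2 content database schema (UserInfo keyed by tp_SiteID); verify against your own schema before relying on it:

    -- How are user rows distributed across site collections?
    SELECT tp_SiteID,
           COUNT(*) AS UserRows
    FROM UserInfo
    GROUP BY tp_SiteID
    ORDER BY UserRows DESC  -- the problem site floats to the top

In the scenario above, you would see one site with tens of thousands of rows, and thousands of sites with a handful each.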

I want you to realize that parameterized queries and query execution plans ARE NOT BAD in general. In fact, they are good in most cases where the data distribution is even, yet they can cause problems if the data distribution is uneven, or if the parameters passed when the plan is first formed are not representative of later executions.

Oh dear, whaddya gonna do....

I filed a bug with the SharePoint product group on this, as it wasn't a problem with SQL; it was just how SQL works. The SharePoint product group and sustaining engineering teams both agreed we needed to fix this; it's just that the proper fix needed to be thoroughly investigated, regression tested, stamped with approval, etc. We all knew this was going to take a really long time to get the best fix created. (Surprisingly, this didn't take as long as the orphan cleanup fixes I worked with many folks on to push forward :)) Due to timing, we didn't make SP2, so this became a post-SP2 hotfix for Windows SharePoint Services, which I HIGHLY encourage and recommend that you deploy. You can get the fix here: https://support.microsoft.com/?id=900929

In the meantime, I had to help this customer. The solution was to move the single site with the large number of users associated with it to a separate content database, what I call repartitioning, or site leveling. Query execution plans are database specific, so this would prevent the reuse of a small-site plan on the large site.

Ah, that's a snap, right? Well, had the site in question been a single Windows SharePoint Services team site, I could have used STSADM -o backup/delete/restore to the target database to take care of this one site and be done with it all... but guess what, it wasn't a team site that was the problem, but rather the Portal site collection. That was the one with the large user dataset. And guess what, you can't use STSADM for that :) You can't use SPSBackup for this either, because SPSBackup backs up the entire content database, not just the Portal site collection.

What's the only other way to do this? Repartition the other 9,999 team sites out of this content database, and into another.

So just how would you do that? Ah shucks, that's easy... In order to repartition site collections in and out of content databases, you have to use the following steps:

  1. If you have not done so already, create a new content database to place the backed-up site into.
  2. Read/write lock the site so that it cannot be accessed during the backup operation:
    a. Navigate to the Windows SharePoint Services Central Admin page.
    b. Select Manage Quotas and Locks.
    c. Select Manage Site Collection Quotas and Locks.
    d. Type in the URL of the site you need to lock and click the “View Data” button.
    e. Select the “No Access” radio button in the “Site Lock Information” section and choose OK.
  3. STSADM –o backup (to back up the site collection to disk).
  4. STSADM –o deletesite (to delete the site from the current content database).
  5. Visit the Content Database management page from the Windows SharePoint Services Central Admin pages:
    a. Select the first content database and either:
      i. Set its status to Offline, or
      ii. Set its max-sites and warning-sites counts equal to the number of current sites in that content database. (Be sure to have a notebook handy to write down all of the original values so that you can reset them in a later step.)
    b. Repeat step 5.a for each content database EXCEPT the one you are restoring to. (This prevents the WSS logic from load balancing across the existing content databases when determining which content database to restore the site into.)
  6. STSADM –o restore (to restore the site into the new targeted database).
  7. Read/write unlock the site to allow access to the site again:
    a. Navigate to the Windows SharePoint Services Central Admin page.
    b. Select Manage Quotas and Locks.
    c. Select Manage Site Collection Quotas and Locks.
    d. Type in the URL of the site you need to unlock and click the “View Data” button.
    e. Select the “Not Locked” or “Adding content prevented” radio button in the “Site Lock Information” section, depending on the site's previous state, and choose OK.
  8. Visit the Content Database management page from the Windows SharePoint Services Central Admin pages:
    a. Select the first content database and either:
      i. Set its status back to Online (if it was online to begin with), or
      ii. Set its max-sites and warning-sites counts back to their previous values (refer to the notes taken in step 5.a.ii).
    b. Repeat step 8.a for each content database.
  9. Repeat steps 2-8 for each site you need to move (unless you need to create more content databases, in which case you restart at step 1).

In other words, do these steps ~10,000 times. Let's see, would YOU want to do those steps manually? Ah shucks, that's easy... NOT!

As you can see, this is a very labor-intensive and time-consuming process. If you had to do this for many sites, it becomes impractical to do it all manually.
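
If you do go down this road, a quick read-only sanity check of how many site collections actually live in each content database (run before and after the moves) helps confirm your progress. This assumes the WSS 'Sites' table in the content database schema; verify it against your own schema first:

    -- Read-only sanity check: site collections in this content database
    -- (assumes the WSS 'Sites' table; verify against your schema)
    SELECT COUNT(*) AS SiteCollections FROM Sites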

The Solution, and birth of SPSiteManager

I had to come up with a solution for my customer. Otherwise, I would not be able to sleep... It was my issue, my customer, and they were in some serious pain; besides, my job at the time required me to work the issue to closure. They were experiencing the issue all the time, and the outage sustained for 3 minutes for every user who hit this one tiny little document library on this VERY popular portal area :). So what would happen is users would notice the problem, and then keep trying to click on this document library... As soon as it would render, another user would come along and hit it, thus... down again. They were quite literally down, completely down, until they prevented access to that document library. I knew that we could perform the same steps as in the manual process above via the SharePoint Object Model, so off I went, Codin' till the fingers go numb.

While I was writing the repartitioning logic, I knew we would have other customers hit this before the fix, and that it would also be a great tool to have around in case you needed to move sites around, bug or no bug. I also wrote a simple detection algorithm to analyze and detect which content databases may be in this state, which is the reason you see the lowusercount= and highusercount= attributes in the <database> element in SPSiteManager's Site Distribution Document. I worked with MSIT to get it run on our own internal Microsoft SharePoint servers, to look for occurrences of the problem. The product group even gave me a mirror of their internal server (a really big, heavy portal server) to work against, to ensure we got the logic flawless for solving this type of problem. (And if I haven't told you enough PG, MSIT, etc.....Thanks!!!!)

Heck, since I had to go examine the sites, webs, etc., I thought, what the heck, I'll even add in checks against the capacity planning guidelines we have for SharePoint. The result was yet another little tool that went "Boom".

In the end, when the customer ran SPSiteManager (in batches) to repartition their sites out, they completed it in about 10 man-hours. That's ~10,000 site collections moved in 10 man-hours... Now imagine the amount of time it would have taken to do the manual steps above.

The ultimate fix

Now before you go off running and telling everyone that they must repartition their sites with large user associations immediately, STOP NOW. You do NOT have to do this. As I noted above, a post-SP2 hotfix is available to remedy this problem. See https://support.microsoft.com/?id=900929 .

I will tell you right now, though, that:

  • The KB does not necessarily call this out, and in fact is worded improperly.

    It's currently worded as:

"When you run an ad hoc query on a Windows SharePoint Services site, the CPU usage on the server that is running Microsoft SQL Server consumes 100 percent of the CPU time. Therefore, users cannot access the server."

    • This is VERY misleading. It by NO means whatsoever implies that you should go run ad hoc queries on your database.

    • We are working to get the official KB updated with the proper wording.

  • It IS RECOMMENDED, however, that you keep a Portal site collection in its own database. With a portal, it is very common to have a large set of users associated with it, and it can get quite large; and if you need to move it later, guess what, you can't. You'll have to repartition all those team sites out of its database rather than relocate the Portal site collection to a new one. What I would recommend is that when you create a new Portal, you set its "Max Number of Sites" to 1 in the Manage Content Databases page, then create a separate content database for your team sites.

In Conclusion

If you feel you may be experiencing this problem, and you do not have the post-SP2 hotfix installed as noted above, a couple of simple tests can be executed:

  • While experiencing the problem, issue the following statement from Query Analyzer (note that this clears the procedure cache for the whole SQL Server instance, so do it judiciously).

DBCC FREEPROCCACHE

    • If your hang clears up after this, chances are you are being impacted by this issue.
  • Use SPSiteManager to perform a simple analysis (without the need to do a deep-level analysis) and look at the user associations on your content databases:

            SPSiteManager -o analyze -allvs

    • If you see a lowusercount <= 1,000 users and a highusercount >= 9,000 to 10,000 users, chances are you are being impacted by this issue. (If you'd rather eyeball it by hand, see the read-only sketch below.)
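
For the curious, here is a rough, read-only approximation of that lowusercount/highusercount check. This is NOT SPSiteManager's actual query, just an illustrative sketch that assumes the WSS v2 UserInfo schema (tp_SiteID):

    -- Rough equivalent of the lowusercount/highusercount analysis
    -- (illustrative only; assumes the WSS v2 UserInfo schema)
    SELECT MIN(UserRows) AS lowusercount,
           MAX(UserRows) AS highusercount
    FROM (SELECT tp_SiteID, COUNT(*) AS UserRows
          FROM UserInfo
          GROUP BY tp_SiteID) AS PerSite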

To put it bluntly, install the fix now (or the most recent rollup hotfix, which will include this). :)

Hope this helps!!

 - Keith


Previous Posts on this series:

Understanding SharePoint - Part 1 - Understanding the SharePoint Portal Server Indexer
https://blogs.msdn.com/krichie/archive/2006/07/20/672755.aspx

Additional References:

Description of the Windows SharePoint Services post-Service Pack 2 hotfix package: November 15, 2005
https://support.microsoft.com/?id=900929

SQL Server Architecture: Execution Plan Caching and Reuse
https://msdn.microsoft.com/library/default.asp?url=/library/en-us/architec/8_ar_sa_4azp.asp

SQL Server Books Online: DBCC FREEPROCCACHE
https://msdn2.microsoft.com/en-us/library/ms174283.aspx