Alert Management Tuning: The SQL Management Packs
Disclaimer: Due to changes in the MSFT corporate blogging policy, I’m moving all of my content to the following location. Please reference all future content from that location. Thanks.
I recently had the privilege of participating in a discussion in regards to some of the alerts generated by the SQL management pack, In so doing, that gave me an idea for a set of articles in regards to tuning specific management packs. I don’t plan on covering each and every MP that has been authored, but I do see a lot of the same issues whenever I start working at a new customer site. As such, I hope to provide a list of common alerts that you see when implement a new MP along with some guidance on how to deal with them.
In that same discussion, Kevin Holman had an amusing, yet very accurate observation:
“The problem with the SQL MP – or any MP – is not the MP itself. It is how customers implement it….. this is reality:
- The don’t read the guide.
- They import the MP.
- The SCOM team sets up a subscription to email the SQL team for all SQL related alerts.
- The SCOM team doesn’t partner with the SQL team to properly set up RunAS accounts in order to discover and monitor SQL.
- The SQL MP discovers a LOT of SQL servers that the SQL team doesn’t even own or know about.
- The SQL servers have all kinds of SQL Agent jobs failing because the SQL team ignores them because they don’t care…. “oh those are not important… why cant you just alert me about the important servers and jobs?”
- The SQL servers go nuts with page life and buffer cache monitor alerts, because the customer has virtualized SQL and is squeezing the nickel until it screams… aka memory pressure.
- The SQL team gets bombarded with emails, and they complain to the SCOM team.
- The SCOM team realizes they don’t understand RunAS, but they don’t want to admit this to the SQL team or their own management.
- The SCOM team blames the SQL DBA’s for a crappy environment.
- The SQL team creates a rule to auto-delete all alerts from SCOM.”
His story, while amusing, is sadly true. When it comes to alert management, I find that some variation of this process plays out with many of the management packs that we make, and this is one of the big reasons why I discourage every customer from crafting an alert management process that centers around SCOM sending emails. It is very difficult to take advantage of SCOM’s monitoring capability if the main plan is to have it send an email. SQL in particular presents a bit more difficulty, in large part due to the complexity of the application, its wide spread use, and the simple fact that out of the box, SCOM will not have rights to a SQL environment.
Let’s start with the first issue: You will discover far more SQL installations than what you thought existed in your environment. There are two reasons for this. The first is our management pack design. SQL workflows will be kicked off by SQL express installations, SQL management tool installations, and one off SQL installations because that custom application you own came with some flavor of SQL that the engineer installed on the local server. In all of these scenarios, your first clue to this will be failures related to discovery or SCOM workflows. These include:
- OpsMgr failed to start a process/run a script/wmi failure/etc. These are the yellow alerts that just about everyone ignores (and they should not). A closer look at the alert data and you’ll see failures of various SQL workflows. These failures are either due to permissions or because you don’t have a full version of SQL on the machine in question, causing the failure.
- Discovery failure alerts – the new SQL management packs are kind of enough to give you a red discovery failure alert. I’m not sure exactly which version of the SQL MP instituted this, but as of 6.7.2, you’ll find SQL discovery alerts each time a discovery fails.
The solution to this is relatively easy, but it can be a bit time consuming tracking down every SQL installation in the environment. Add to it that some Microsoft products are complicit in this. Skype for Business, for instance, creates an RTCLocal and LyncLocal SQL express instance on some of its servers. At some point, a discussion needs to be had with the enterprise DBA team as to what they do and do not want to monitor. Usually, I find that most DBAs do not want to concern themselves with SQL express, and often times they aren’t very interested in many of the one off SQL installs that exist in their environment. If that’s the case, we need to turn this off. Kevin Holman provides a nice step through on how to disable SQL express discovery. That same discovery can be disabled on specific machines as well. I find myself creating a SCOM group (dynamic if possible) at simply disables this discovery for that group. I’d note that sometimes a remove-disabledscomclassinstance command will also need to be run once this is done.
That brings us to the second issue: You won’t have permissions to monitor your SQL environment. The easy solution is to give the Local System Account sys admin rights in SQL, but often times, the easy solution is not the right solution, as it completely defeats the application level security that is built into SQL. There is a guide out there for SQL low privilege configuration. Personally, I think it is well worth investing the time to setup SQL for low privilege monitoring. When you have problems with permissions, you’ll see the following alerts in SCOM:
- The alerts listed in the previous section can show due to insufficient access.
- RunAS account alerts will also be found. This could be due to an account that is not distributed properly or a run as account defined by SCOM that hasn’t been given rights to the SQL instance.
- In some cases, I’ve seen Page Life expectancy and Buffer hit cache alerts fire off due to permissions. The easy way to check this is look up the value of these counters in their alert description. If this value is always zero, fires within 15 minutes of initialization, and never goes green, it is possible that SCOM does not have the rights to read the Perfmon counters, though I don’t see this very often.
Once we get through these issues, there are a number of SQL alerts that will be fired, all of which should be addressed.
- Page Life Expectancy/Buffer hit cache ratio. This is typically an issue where the SQL server does not have the resources that it needs.
- Unable to authenticate using Kerberos. I find this one ignored a lot, yet it is straight forward. SQL isn’t authenticating via Kerberos, but instead via NTLM. This is both slower and less secure. All that usually needs to be done is to register an SPN in active directory, and the alert will tell you exactly what SPN needs to be added, removed, or changed. Likewise, it can be fixed in less than a minute.
- Database Backup failed to complete – this too is a favorite of mine, especially when I see a four or five figure repeat count, and it’s about as straight forward as it gets. Your SQL backups are failing.
I like to write a lot to alert management, but management pack tuning encompasses more than that. Most our management packs by default collect A LOT of performance counters. In smaller environments, this is not too big of a deal as the management servers have no problems keeping up with demand and the SCOM DW doesn’t grow too big. However, large environments are not so lucky, as large SCOM environments will see significant performance degradation if the default management pack settings are kept. As such, I encourage my customers review what they are collecting and make a decision as to what to keep. Performance collection in particular is used for reporting, and if reports are not important, then I’d strongly suggest turning a lot of that off. To quickly determine what we use, I recommend using a site called systemcentercore.com. It’s very useful in letting someone quickly see pretty much everything associated with our management packs. I’ve included a link to the SQL MPs here. To get you an idea how big this can be, I had one environment where we turned off more than 600 SQL related performance counters, and that was not all of them. These can always be turned back on, so unless there’s a business requirement to keep this data, I strongly recommend reviewing and disabling the non-essential counters.