Bad OM Performance because State Change Events are not groomed out

Article
03/15/2013

IMPORTANT: Always perform a FULL Backup of the database before doing anything to it !!!

ALSO: It is advised that you open a case at Microsoft before doing this - directly editing the database is not supported and you may find yourself in an unsupported state if anything goes wrong.

So I have encountered another interesting issue where we had a bad performance in SC Operations Manager 2012 (also applies to 2007 R2) because of the fact that the StateChangeEvent table was very large - the largest table actually in the database.

First of all every investigation should start with tuning the "flip-flopping" Monitors and manually grooming the "old" entries: https://blogs.technet.com/b/kevinholman/archive/2009/12/21/tuning-tip-do-you-have-monitors-constantly-flip-flopping.aspx

The important thing to know here from the start is the following: the "standard" OM Grooming Job p_StateChangeEventGrooming will *only* delete the StateChangeEvent entries which are older than the configured State Grooming Threshold *AND* for which the corresponding Monitors are currently in Healthy state.

The modified procedure from Kevin's Article is actually the "same" procedure p_StateChangeEventGrooming *but* it has a small, but big impact change, to delete *ALL* StateChangeEvents which are older that the configured State Grooming Threshold regardless of what state the corresponding Monitors are in.

Ok, should be basically pretty straight forward right? Good! But there are some situations when you tune and you tune and then all of the sudden you can observe that you have no more "noisy" Monitors and we have a "stable" normal number of StateChangeEvents per day. However, after a couple of days we can see that even with these tuning actions which we have done, we can still see that the StateChangeEvent table is again the biggest or one of the biggest top tables in the database. But why is that?!

1. Let's start our investigation by running this SQL Query on the OperationsManager database to see how many StateChangeEvent entries we have, how many should get cleaned out by the "standard" Grooming and how many will *not* be cleaned out by the "standard" Grooming even if they are older than the State Grooming Threshold:

NOTE: Don't worry if you also get results for StateChangeEvents that should get groomed out by the standard Grooming - this means that on the next standard Grooming run, these *will* get cleaned out - you can even force this by running this Query: EXEC p_StateChangeEventGrooming.

  DECLARE<br>    @GroomingThresholdLocal DATETIME,<br>    @GroomingThresholdUTC DATETIME<br> SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())<br> FROM dbo.PartitionAndGroomingSettings<br> WHERE ObjectName = 'StateChangeEvent'<br> EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT<br> SELECT 'Should be cleaned', COUNT(*) AS 'Count'<br> FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)<br> INNER JOIN dbo.State AS S WITH(NOLOCK)<br>    ON SCE.StateId = S.StateId<br> WHERE<br>    SCE.TimeGenerated < @GroomingThresholdUTC AND<br>    S.HealthState = 1<br> UNION<br> SELECT 'Should NOT be cleaned', COUNT(*) AS 'Count'<br> FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)<br> INNER JOIN dbo.State AS S WITH(NOLOCK)<br>    ON SCE.StateId = S.StateId<br> WHERE<br>    SCE.TimeGenerated < @GroomingThresholdUTC AND<br>    S.HealthState != 1<br> UNION<br> SELECT 'Total StateChanges', COUNT(*) AS 'Count'<br> FROM StateChangeEvent

2. WOW! So from the above we could see that about 80% (2.255.838 to be exact) of the existing StateChangeEvent entries are *older* than the configured State Grooming Threshold but are still not cleaned-out by the standard Grooming and will not be - the only reason for this is that it seams that the corresponding Monitors are in Warning or Critical state - to get an idea of which these Monitors are, please run this Query:

  DECLARE<br>    @GroomingThresholdLocal DATETIME,<br>    @GroomingThresholdUTC DATETIME<br> SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())<br> FROM dbo.PartitionAndGroomingSettings<br> WHERE ObjectName = 'StateChangeEvent'<br> EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT<br> SELECT<br>    DSV.DisplayName<br> FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)<br> INNER JOIN dbo.State AS S WITH(NOLOCK)<br>    ON SCE.StateId = S.StateId<br> INNER JOIN Monitor AS M<br>    ON S.MonitorId = M.MonitorId<br> INNER JOIN BaseManagedEntity AS BME<br>    ON S.BaseManagedEntityId = BME.BaseManagedEntityId<br> INNER JOIN DisplayStringView AS DSV<br>    ON M.MonitorId = DSV.LTStringId<br> WHERE<br>    SCE.TimeGenerated < @GroomingThresholdUTC AND<br>    S.HealthState != 1 AND<br>    DSV.LanguageCode = 'ENU' AND<br>    M.IsUnitMonitor = 1<br> GROUP BY DSV.DisplayName

3. Now we encounter a pretty common issue - we should check for this - usually I bet that with this very high number of "unhealthy" Monitors we should have a looooot of open Alerts which in a production environment would not look pretty good so we can bet here that the corresponding Alerts were closed manually (or automated somehow) *BUT* the corresponding Monitors were *NOT* - to check how many these are, run this Query:

  DECLARE<br>    @GroomingThresholdLocal DATETIME,<br>    @GroomingThresholdUTC DATETIME<br> SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())<br> FROM dbo.PartitionAndGroomingSettings<br> WHERE ObjectName = 'StateChangeEvent'<br> EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT<br> SELECT COUNT(*)<br> FROM Alert AS A<br> WHERE A.RuleId IN (<br>    SELECT<br>       M.MonitorId<br>    FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)<br>    INNER JOIN dbo.State AS S WITH(NOLOCK)<br>    ON SCE.StateId = S.StateId<br>    INNER JOIN Monitor AS M<br>    ON S.MonitorId = M.MonitorId<br>    INNER JOIN BaseManagedEntity AS BME<br>       ON S.BaseManagedEntityId = BME.BaseManagedEntityId<br>    WHERE<br>       SCE.TimeGenerated < @GroomingThresholdUTC AND<br>       S.HealthState != 1 AND<br>    M.IsUnitMonitor = 1<br> ) AND A.ResolutionState = 255

4. WOW! So in this case we had about 2000 results! So to solve this issue, we can execute this PowerShell Script on a Management Server under OM PowerShell:

NOTE: Keep in mind that after running this PS Script you will have get additional new StateChangeEvent entries because a Monitor Reset is also a state change, these will be groomed out as soon as they are older than the configured State Grooming Threshold.

   $alerts = Get-SCOMAlert -ResolutionState 255 | where { $_.IsMonitorAlert -eq $true }<br> foreach ($alert in $alerts) {<br>    $monitor = Get-SCOMMonitor | where { $_.Id -eq $alert.MonitoringRuleId }<br>    $monitoringobject = Get-SCOMMonitoringobject -Id $alert.MonitoringObjectId | where { $_.HealthState -ne 0 -and $_.HealthState -ne 1 }<br>    if ($monitoringobject -ne $null) { $monitoringobject.ResetMonitoringState($monitor) }<br> }

5. After this, we should also run Kevin's modified Grooming procedure again to clean this up:

NOTE: If you are curious about the change in the original, standard Grooming procedure, I have highlighted the change in the Query below.

  SET ANSI_NULLS ON<br> GO<br> SET QUOTED_IDENTIFIER ON<br> GO<br> BEGIN<br>    SET NOCOUNT ON<br>    DECLARE<br>    @Err INT,<br>    @Ret INT,<br>    @DaysToKeep TINYINT,<br>    @GroomingThresholdLocal DATETIME,<br>    @GroomingThresholdUTC DATETIME,<br>    @TimeGroomingRan DATETIME,<br>    @MaxTimeGroomed DATETIME,<br>    @RowCount INT<br> SET @TimeGroomingRan = GETUTCDATE()<br> SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())<br> FROM dbo.PartitionAndGroomingSettings<br> WHERE ObjectName = 'StateChangeEvent'<br> EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT<br> SET @Err = @@ERROR<br> IF (@Err <> 0) BEGIN<br>    GOTO Error_Exit<br> END<br> SET @RowCount = 1<br> SELECT @MaxTimeGroomed = MAX(TimeGenerated)<br> FROM dbo.StateChangeEvent<br> WHERE TimeGenerated < @GroomingThresholdUTC<br> IF @MaxTimeGroomed IS NULL<br>    GOTO Success_Exit<br> DELETE MJS<br> FROM dbo.MonitoringJobStatus MJS<br> JOIN dbo.StateChangeEvent SCE<br>    ON SCE.StateChangeEventId = MJS.StateChangeEventId<br> JOIN dbo.State S WITH(NOLOCK)<br>    ON SCE.[StateId] = S.[StateId]<br> WHERE SCE.TimeGenerated < @GroomingThresholdUTC<br> AND S.[HealthState] in (0, 1, 2, 3)<br> SELECT @Err = @@ERROR<br> IF (@Err <> 0) BEGIN<br>    GOTO Error_Exit<br> END<br> WHILE (@RowCount > 0) BEGIN<br>    DELETE TOP (10000) SCE<br>    FROM dbo.StateChangeEvent SCE<br>    JOIN dbo.State S WITH(NOLOCK)<br>       ON SCE.[StateId] = S.[StateId]<br>    WHERE TimeGenerated < @GroomingThresholdUTC<br>    AND S.[HealthState] in (0,1,2,3)<br>    SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT<br>    IF (@Err <> 0) BEGIN<br>       GOTO Error_Exit<br>    END<br> END<br> UPDATE dbo.PartitionAndGroomingSettings<br> SET GroomingRunTime = @TimeGroomingRan,<br>    DataGroomedMaxTime = @MaxTimeGroomed <br> WHERE ObjectName = 'StateChangeEvent'<br> SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT<br> IF (@Err <> 0) BEGIN<br>    GOTO Error_Exit<br> END<br>    GOTO Success_Exit<br> Error_Exit:<br>    PRINT 'ERROR!!!'<br> Success_Exit:<br> END

To avoid this in the future:

Make sure that you always solve a problem before "closing" the issue (close Alert and Reset Monitor) because if you don't, then you will end up with another Alert and another unhealthy state of that Monitor.
If you don't want to solve an issue and usually just "ignore" the state of a Monitor or it's Alert, then it would be best to disable that Monitor
For Monitors which don't have AutoResolve option or which don't have a "Healthy State Check" always make sure that you reset the Monitor properly: https://technet.microsoft.com/en-us/library/hh212816.aspx
Basically make sure somehow, that you *ALWAYS* also reset a Monitor if you close it's Alert - a good idea to constantly monitor this and make sure you don't have to worry about that, is to implement some kind of automation for this, like in this great "AS IS" example using SC Orchestrator : https://blog.scomfaq.ch/2012/05/05/reset-monitor-using-scom-2012-and-orchestrator-a-must-have-runbook/

Try to keep your environment healthy and always be with your eyes open for new tuning! ;)

Bad OM Performance because State Change Events are not groomed out

Additional resources