Case Study: Slow Datasource Server Killed Subscriptions on My Reporting Services Server
Hey there - I recently had an experience with our production server that I thought I'd share with you: what happened, what we looked at to figure it out, and what I did to solve it.
To give you an overview, I am the Report Server test lead on the Reporting Services team, and I'm the primary contact for the Reporting Services server that supports the SQL Business Unit here at Microsoft. If you are not already familiar with the mechanics of how subscriptions work, please review my other "Troubleshooting Subscriptions" blog postings; the information below will make much more sense.
When I came to work one Monday morning, I noticed that I hadn't received the reports I normally get every morning. Not the best way to start the week - I knew something was up. I have a report I created to monitor subscriptions on my server, similar to the info in my other blog postings, and it showed that there were 8,000 rows in the Event table in our ReportServer database. The first indication here was that SQL Server Agent was running correctly, because it is the source of those 8k records; but since there were no rows in the Notifications table, I knew that Reporting Services was not processing them as fast as they were coming in. One of my developer peers was helping me with the problem; he looked into the trace log files and noticed that there were entries similar to:
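A quick way to see this kind of backlog for yourself is to compare the row counts of the two tables directly. This is just a sketch against a default ReportServer catalog database (the database name assumes an unmodified install):

```sql
-- Compare queued events to notifications in flight.
-- A large Event count alongside an empty Notifications table
-- means events are arriving faster than the server is turning
-- them into notifications.
select 'QueuedEvents'  = (select count(*) from ReportServer.dbo.Event),
       'Notifications' = (select count(*) from ReportServer.dbo.Notifications)
```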
w WARN: Schedules, events, and subscriptions processing may be delayed because processing queue reached the maximum item count = N. Items currently being processed:
This confirmed the problem I'd suspected, but I believe this warning is only logged in SQL Server 2008 R2 and later.
Note: although I use the term "subscription" alone here, there were also "cache refresh" events in the table and in my queries below. These come from the scheduled shared dataset cache refresh feature in SQL Server 2008 R2, but the process for identifying and solving them is the same, so I'll just say "subscription" from here on out.
There is one database that is used by most of the reports hosted on our report server, and many of those reports have one or more subscriptions to deliver reports via email. That database server had recently been upgraded (and had caused all kinds of other problems :), so I was suspicious that it was the source of the problem here. To narrow it down, we had to learn a bit more about which entries were in the Event table, and we used the first query below to see which reports were potentially causing the subscription backlog. After running the reports manually, there were a few that never seemed to return, and they were early in the queue, so they were the smoking gun.
Ultimately, the problem did turn out to be related to the database upgrade. The database the reports were originally written against had numerous custom indexes added by peers of mine with great query optimization skills. The upgrade involved low-level table schema changes; there were views based on those tables that were used by the report queries, and many of the custom indexes were not recreated during the upgrade. As a result, many report queries were now running slower by an order of magnitude, and this caused the logjam in our subscription processing. There were also many reports that were unaffected by the upgrade, or that were written against another database entirely, but they were caught up in the backlog behind the subscriptions for the slow reports.
At this point, all of the available polling threads in the Report Server were busy processing the subscriptions based on the reports with slow queries, and new events were being loaded into the Event table faster than the Report Server could process them. It was clear that the report server would not be able to recover on its own, and it was only a matter of time before executive emails started going around asking what was going on. So there were a few things that needed to be done to get the system back into production.
Steps to Recover
Disable the subscriptions causing the problems in the first place
With 8k events waiting to be processed, adding more new events to the table would not help. And even after the current queuing problem was solved, it would only come back if the same subscriptions continued to fire. To get an idea which reports had slowed down the most after the upgrade, I created a report based on the second query below to compare the average execution time of the reports before and after the upgrade. Thankfully there were only 20 or so reports causing the worst problems, and quite a few of them turned out to be invalid or outdated due to people leaving the team and so on. For each of these, I went to the report in Report Manager and either set the subscription's schedule to expire the previous day, or deleted it outright if it appeared to be old or invalid.
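If you want to see the subscriptions attached to a suspect report before touching anything in Report Manager, a lookup along these lines works; the report path here is a made-up example:

```sql
-- List the subscriptions defined on a given report,
-- with owner and last-run status.
-- The path '/Sales/DailySummary' is a hypothetical example.
select s.SubscriptionID, s.Description, s.LastStatus, s.LastRunTime,
       'Owner' = u.UserName
from ReportServer.dbo.Subscriptions s
join ReportServer.dbo.Catalog c on c.ItemID = s.Report_OID
join ReportServer.dbo.Users u on u.UserID = s.OwnerID
where c.Path = '/Sales/DailySummary'
```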
Clear out the backlog of queued events
Now that I was comfortable the offending report subscriptions were disabled, the Report Server still needed to crank through all of the queued events. I knew I had to purge them, since many of the events were days old and would no longer provide value. Reporting Services does not expose features for this purpose, so I'd have to roll up my sleeves and attack it at the database level.
Disclaimer: we do not officially support updating the Reporting Services tables mentioned below. Do so at your own risk, and only in extreme production stoppage situations like the one I describe here.
Now that I had a pretty good list of the offending reports/subscriptions, I manually deleted the rows for the slow reports from the Event table with a query similar to:
DELETE FROM dbo.Event WHERE EventData = '<guidvalue>'
The rows associated with a subscription have the subscription's SubscriptionID uniqueidentifier value in the EventData column, and those associated with a scheduled cache refresh have the ReportID in the EventData column.
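Before deleting anything, you can check which kind each queued event is by seeing whether its EventData matches a subscription or a catalog item. A sketch, reusing the same joins as the first query at the end of this post:

```sql
-- Classify queued events: does EventData point at a
-- subscription, or at a report (scheduled cache refresh)?
select e.EventType, e.EventData, e.TimeEntered,
       'Kind' = case
                  when s.SubscriptionID is not null then 'Subscription'
                  when c.ItemID is not null then 'Cache refresh'
                  else 'Unknown'
                end
from ReportServer.dbo.Event e
left join ReportServer.dbo.Subscriptions s on s.SubscriptionID = e.EventData
left join ReportServer.dbo.Catalog c on c.ItemID = e.EventData
```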
Now that I had deleted all of the Events from the offending subscriptions and reports, I removed the rows that had been entered prior to that morning with a query similar to:
DELETE FROM dbo.Event WHERE TimeEntered < '9/20/2010 12:00:00'
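Given the disclaimer above, it's worth previewing what a delete like this will remove before committing it. One cautious pattern (my suggestion, not a product requirement) is to wrap it in a transaction:

```sql
-- Preview the affected rows, then delete inside a transaction
-- so you can still back out if the count looks wrong.
begin tran

select 'RowsToDelete' = count(*)
from ReportServer.dbo.Event
where TimeEntered < '9/20/2010 12:00:00'

delete from ReportServer.dbo.Event
where TimeEntered < '9/20/2010 12:00:00'

-- Inspect the counts, then run either:
--   commit tran
-- or
--   rollback tran
```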
This reduced the queued events to around 2,000 rows, and after the Reporting Services server was restarted to reset the polling threads, it took only about 30 minutes to process all of those events.
As a Reporting Services administrator, I'm lucky to be in the product group that has the experts to back me up in cases like this; the diagnosis and solution above were reviewed and understood before I acted. This is also the first and only time I've ever modified the database directly myself. My hope in providing this is that it will help you get out of a similar situation, but I also hope it doesn't encourage you to make other types of updates to the data in your Reporting Services database, or to apply this solution for creative custom features involving the Event table. Much of the data in the database is highly interrelated, and as the test lead of this area I can guarantee that we cannot test and verify that the Reporting Services server can handle all the various changes that could be made directly to the database tables. So please: be careful, know your queries, and run them at your own risk!
And finally, this experience did kick off a discussion between me and my dev/PM peers about how we can improve Reporting Services to handle this better in the future.
I hope this helps. Thanks!
Event Table Query
-- Get the queued events associated with subscriptions
select s.SubscriptionID, e.EventData, e.EventType,
    'Subscription' = s.Description,
    'SubscriptionOwner' = uo.UserName,
    'ReportPath' = c.Path,
    'ReportModBy' = uc.UserName,
    'MinEntered' = min(dateadd(hh, DATEDIFF(hh, GETUTCDATE(), GETDATE()), e.TimeEntered)),
    'EventCount' = COUNT(*)
from ReportServer.dbo.Event e
join ReportServer.dbo.Subscriptions s on s.SubscriptionID = e.EventData -- EventData has the subscription id
join ReportServer.dbo.Catalog c on c.ItemID = s.Report_OID
join ReportServer.dbo.Users uo on uo.UserID = s.OwnerID
join ReportServer.dbo.Users uc on uc.UserID = c.ModifiedByID
group by s.SubscriptionID, e.EventData, e.EventType, s.Description, uo.UserName, c.Path, uc.UserName
union all
-- Get the queued events associated with cache refresh plans on reports
select 'SubscriptionId' = null, e.EventData, e.EventType,
    'Subscription' = 'n/a',
    'SubscriptionOwner' = 'n/a',
    'ReportPath' = c.Path,
    'ReportModBy' = uc.UserName,
    'MinEntered' = min(dateadd(hh, DATEDIFF(hh, GETUTCDATE(), GETDATE()), e.TimeEntered)),
    'EventCount' = COUNT(*)
from ReportServer.dbo.Event e
join ReportServer.dbo.Catalog c on c.ItemID = e.EventData -- EventData has the report id
join ReportServer.dbo.Users uc on uc.UserID = c.ModifiedByID
group by e.EventData, e.EventType, c.Path, uc.UserName
order by EventCount desc
Query to show the reports whose execution was much slower after the upgrade
select 'ItemId' = post.ItemID, 'Path' = post.Path, 'LastModBy' = post.LastModBy,
    'preExecCount' = pre.ExecCount, 'postExecCount' = post.ExecCount,
    'preAvgData' = pre.AvgData, 'postAvgData' = post.AvgData,
    'pcntDiff' = case when pre.AvgData = 0 then 0.0
                      else ((post.AvgData - pre.AvgData) * 1.0) / (pre.AvgData * 1.0) end
from
(
    select c.ItemID, c.Path, 'LastModBy' = um.UserName, 'ExecCount' = COUNT(*),
        'MinData' = MIN(TimeDataRetrieval / 1000),
        'MaxData' = MAX(TimeDataRetrieval / 1000),
        'AvgData' = AVG(TimeDataRetrieval / 1000)
    from ReportServer.dbo.ExecutionLogStorage el
    join Catalog c on c.ItemID = el.ReportID
    join Users um on um.UserID = c.ModifiedByID
    where el.ReportAction = 1 -- Render
      and el.Source = 1 -- Live
      and el.TimeStart between '9/16/2010' and '9/25/2010'
    group by c.ItemID, c.Path, um.UserName
) as post
join
(
    select c.ItemID, c.Path, 'LastModBy' = um.UserName, 'ExecCount' = COUNT(*),
        'MinData' = MIN(TimeDataRetrieval / 1000),
        'MaxData' = MAX(TimeDataRetrieval / 1000),
        'AvgData' = AVG(TimeDataRetrieval / 1000)
    from ReportServer.dbo.ExecutionLogStorage el
    join Catalog c on c.ItemID = el.ReportID
    join Users um on um.UserID = c.ModifiedByID
    where el.ReportAction = 1 -- Render
      and el.Source = 1 -- Live
      and el.TimeStart between '9/9/2010' and '9/15/2010'
    group by c.ItemID, c.Path, um.UserName
) as pre on pre.ItemID = post.ItemID
where case when pre.AvgData = 0 then 0.0
           else ((post.AvgData - pre.AvgData) * 1.0) / (pre.AvgData * 1.0) end > 1.0
order by case when pre.AvgData = 0 then 0.0
              else ((post.AvgData - pre.AvgData) * 1.0) / (pre.AvgData * 1.0) end desc