Service Manager Data Warehouse Troubleshooting
There are some great articles out there which talk about troubleshooting the Data Warehouse. In this article I will be going into explaining the internals of the Data Warehouse along with troubleshooting information.
It is important to first of all understand how the DW Jobs work. There are some Jobs that are only visible by looking into the database, others may be seen in PowerShell only and the most basic ones can be seen in the Console. From the perspective of the databases, the DW Jobs are actually called Processes. From now on, I will refer to the DW Jobs as Processes in this article.
The DW Processes are actually categorized like this:
- Process Categories - this is top level way of categorizing Processes (Jobs)
- Processes - these are the actual Processes that exist (ex. DWMaintenance, MPSyncJob, Extract, Transform, Load, etc.)
- ProcessModules - each Process has several ProcessModules which it will actually execute in order to "finish" - there is also a certain order in which these will run depending on dependencies on one another and the level of the modules
Now what you need to know is that each Process will be executed via a Schedule which is actually a Rule (Workflow that runs on a schedule and starts its owning Process). When a Process is scheduled (for execution) it will actually get a new Batch created. Think of a Process as a class and of a Batch as an instance of that class that is actually doing something. It is the same for a ProcessModule, each will get a WorkItem when it is scheduled for execution. Because a Process can have from one to many ProcessModules, when a Batch is created, it also gets all the WorkItems associated with it (ProcessModules of the Process).
So, to recap, because we will be using these a lot in this article:
- Batch = an instance of a Process (which in the Console or PowerShell is a Job)
- WorkItem = an instance of a ProcessModule of a Process
Important to know and also the way the processing/execution will depend on and work, is that each Batch, as well as each WorkItem (of a Batch) will have a certain Status :
- Success (1) - the Batch/WorkItem was successful and at that point the actual Status gets changed to Complete
- Failed (2) - there was an error in the Batch/WorkItem and it will be retried on the next execution
- Not Started (3) - the Batch/WorkItem was scheduled for execution and it will be executed on the next interval (schedule) - this is a good Status, it means there was no error and it will run the next time it needs to
- Running (4) - the Batch/WorkItem is currently running
- Stopped (5) - the Batch/WorkItem was stopped (by the user from the Console - Suspend or PowerShell Stop-SCDWJob )
- Completed (6) - the Batch/WorkItem has successfully completed without any errors
- Waiting (7) - the Batch/WorkItem is waiting for other Batches/WorkItems to finish first on which they are dependent
It is important to know that the actual Status is important for the WorkItems, because the Batch Status depends directly on the Statuses of its WorkItems - here are some examples:
- A Batch can have the Status set to Running while all its WorkItems still have the Status of Not Started => this is not a bad thing, it just means that the certain Batch has received a "start request" from the schedule, but the (first) WorkItem(s) cannot run yet because they are waiting on another Batch/WorkItem (of a different Process) to finish first
- If one of the WorkItems of a Batch will fail and have a Status of Failed, then the Batch itself will get the Status of Failed => all WorkItems need to complete successfully and thus have a Status of Completed for the Batch to also get the Status of Completed
When a Process gets created the first time (on installation or DW registration), a new Batch will be created with some BatchId (the newest one possible - starting from 1 - for the first Process). Then for each ProcessModule of this Process, a new WorkItem with a certain WorkItemId will get created for this corresponding Batch. This will all initially have the Status of Not Started and when the corresponding Schedule Rule will reach it schedule it will send a "start request" for it's Batch. This will set the Status of that Batch to Running. The actual way how these run is that there is a Rule associated with each Process and these Rules run every 30 seconds. As soon as such a Rule runs, if it sees that its corresponding Process has the (latest) Batch with a Status of Running, will take the first/next WorkItem that has a Status of Not Started (depending on dependencies and level) and will try to execute that WorkItem. If it can execute and it is successful, it will get the Status of Completed once it finishes. If it fails, then it will get the Status of Failed. It is important to know here, that it can and will get a Status of Failed also without there being an actual error - but rather it cannot start/run right now because a dependent WorkItem is not finished yet and thus the "error" is actually "waiting on workitems to finish".
Another very important thing to know is that there is also a certain synchronization method between the different Processes, so that they don't interfere with each others work. This is needed because there are some actions which change the database(s) schema, add/remove columns from tables, indexes, primary/foreign keys, etc. We would not want the ETL (Extract, Transform, Load) Processes to run and copy/transform data while we are doing such changes. Because of this, we have these implementations:
- When the DWMaintenance and/or MPSyncJob Processes start running, the first thing they will do, is disable the ETL (Extract, Transform, Load) Processes => it is very important to note here that if either of the ETL Processes will be running when they will get disabled, these will *still* show the Status of Running, but in reality they will not run because they get disabled.
- Because DWMaintenance and MPSyncJob Processes should also not be running at the same time, these use a locking mechanism - whichever of them gets this lock first will be allowed to run until it finishes (Status of Completed) while the other one will remain waiting for it to finish, even if the Status will still show as Running and even if it will *not* be directly disabled => there is a table in the DWStagingAndConfig database used for holding the lock which you can look at - it is called LockDetails.
I have talked about some "invisible" Jobs which can only be seen by querying the database. There are actually a lot of those, but the "real" way to talk about them, is to go back to the Process Categories. There is a very important Process Category here which needs to be mentioned, which is the Deployment category. This is invisible because the MPSyncJob actually takes care handing work over this this category. MPSyncJob will associate Management Packs from one data source to another (ex. from CMDB to the DW). Each Management Pack that got associated will get a new Job (Process) created for it which falls under the Deployment Process Category. We can also refer to this as the "Deployment Job" if you will - and if fact, this is how you would usually hear about this. These are responsible for *actually* deploying (or installing if you will) the Management Packs which were synchronized over - without this, nothing would work because we would have no extended information in the Data Warehouse about what Classes, Relationships, etc. we have in the CMDB (and other managed data sources which we can register to the Data Warehouse).
If either of the deployment "jobs" fail or are not yet finished, then you can be more or less sure that nothing will work properly and you will 99% get errors related to either of the ETL (Extract, Transform, Load) Processes and maybe not only those - depending on where deployment is at that point.
While any of the Deployment Processes will follow the same rules as the others, that is get a Batch and one to many WorkItems, it also has another way of actually making the deployment. Each such Deployment Process (for each Management Pack) will have a new DeploySequence created (with a DeploySequenceId). Because each Management Pack has one to many "items" (Classes, Relationships, etc.) which need to be deployed, for each DeploySequence we will have one to many DeployItems created. This is important information for when you get into the actual troubleshooting part in the database.
In case you are not familiar with the basics of troubleshooting the DW Jobs from PowerShell, I suggest first starting with this article: http://technet.microsoft.com/en-us/library/hh542403.aspx
Also, if you are not familiar with what the ETL (Extract, Transform, Load) Jobs are and how these work, I suggest reading this article as well: http://blogs.technet.com/b/servicemanager/archive/2009/06/04/data-warehouse-anatomy-of-extract-transform-load-etl.aspx
Remember to always start by checking if there are any Management Packs where they Deployment Status shows as Failed in the SvcMgr Console in the Date Warehouse tab under the Management Packs view. For any such Management Pack which is Failed, you should run the "Restart Deployment" task from the Console in the tasks pane. If you are lucky, it was just some timeout or deadlock and it will succeed this time. If not, then you can always see errors about this failure as well as any other failures in the DW Jobs by looking into the Operations Manager event log directly on the SvcMgr Data Warehouse Management Server (filter by sources: Data Warehouse and Deployment).
Additionally, you can try to force reset the DW Jobs and run them in a certain order by using the script from this article: http://blogs.technet.com/b/mihai/archive/2013/07/03/resetting-and-running-the-service-manager-data-warehouse-jobs-separately.aspx
The Data Warehouse database which we are interested in when troubleshooting the DW Jobs is the DWStagingAndConfig database. Here is a list of tables of interest when troubleshoot the DW Jobs and at the end, also a file attached with useful sql queries:
1. The Infra.ProcessCategory table is where all the existing Process Categories are stored. It has a column called IsEnabled which needs to be 1 in order for any Process under this Process Category to be able to run - this is only modified (0 and 1) by the DWMaintenance and/or MPSyncJobs when they run - this is how they disable the other Processes.
2. The Infra.Process table is where all existing Processes are available and classified on Process Categories via the ProcessCategoryId column. These also have a IsEnabled column which should always be 1 in order for them to be able to execute. The only reason why either of these would have IsEnabled set to 0 is if anyone explicitly disabled these by using the Disable-SCDWJob CMDLet, which should never be done. To disable a Process (Job) always disable only its Schedule by using Disable-SCDWJobSchedule . In the screenshot below, I have explicitly filtered for only the "common" Processes, but if you will query the entire table, you will see all of them.
3. The Infra. ProcessModule table is where all existing ProcessModules are located and are classified on each Process via the ProcessId column. That is what you should use in a where clause of a sql query to see all ProcessModules that belong to a certain Process. Here is an example in the screenshot below for all ProcessModules of the DWMaintenance Process. These are not "ordered" in this result - if you want to figure out the "order" in which they would run, you need to check out the ModuleLevel column along with each dependency for each module which can be seen in the Infra.ModuleTriggerCondition table.
4. The Infra. Batch table is where all the Batches will be found (current/previous and next - check out Infra.BatchHistory for a history of these). This is where you will be able to see the BatchIds for the Processes and you can view them for a specific Process by using the ProcessId in a sql where clause. In the example screenshot below, we can see the Batches of the time the screenshot was taken for the Process with ProcessId = 1 (which in this case is the DWMaintenance Process).
5. The Infra. WorkItem table is where all the WorkItems of a Batch will be found. The most important part here is that this table is where we can see any errors for failed WorkItems of a Batch (that is an instance of a certain Process). In the screenshot below, an example of the WorkItems of the Batch with BatchId 3098 that is an instance of the DWMaintenance Process.
6. The DeploySequence table where we have an entry for each DeploySequence (so each MP that is or was deployed). You should know that in this process, a staging table is used which is called DeploySequenceStaging and if deployment finished and was successful, then this staging table should be *empty*. Here is a screenshot (not all results) of how this looks like.
7. The DeployItems table where we store an entry for each DeploySequence (so each MP that is or was deployed). You should know that in this process, a staging table is used which is called DeployItemStaging and if deployment finished and was successful, then this staging table should be *empty*. Here is a screenshot with the list of DeployItems belonging to the DeploySequence of the System.WorkItem.Incident.Library Management Pack.
In addition to that, you can always use SQL Server Profiler to get detailed information about what queries are being executed, details into different errors that might happen and of course having the queries, also a possibility of understanding the "why" behind them. The most useful column on which you can filter on is the ApplicationName column a the trace. These are the various application names used by the DW Jobs processing modules:
- DW Jobs scheduling, execution and processing: SC DAL--Orchestration and SC DAL--SCDW
- DWMaintenance job: SC DAL--Maintenance
- MPSync job: SC DAL--MP Sync
- Extract jobs: SC DAL--SCDW Extract Module
- Transform job: SC DAL--SCDW Transform Module
- Load jobs: SC DAL--SCDW Load Module
- Cube related jobs: Microsoft SQL Server Analysis Services (and general troubleshooting of SQL Analysis Services – Application event log and SQL Analysis Services tracing)
A cool template I usually use for the SQL Profiler Trace which is in 99% of the cases enough, has these settings (events and columns):
NOTE: Attached, you will also find a file (DWJobs_sql_queries.sql) that contains (commented) useful sql queries which you can and should use when troubleshooting the DW Jobs.
Good hunting! :)