Adventures in SharePoint 2010 Workflow Troubleshooting
I recently had a customer who was encountering issues with the SharePoint 2010 Workflow timer job. The issue would cause the Workflow timer job to hang. Other timer jobs would execute successfully, but the Workflow timer job would remain in either a “running” or a “pausing” state. Recycling the timer service would temporarily resolve the issue, but it would inevitably return within hours. Troubleshooting workflow issues can be very painful. There is a lot of good workflow troubleshooting guidance already published, but the steps we used (bottom of this post) were not articulated anywhere. Hopefully this post can save you some of that pain.
To better understand the issue, let’s review some workflow basics. Workflows in SharePoint are not designed to be “real time”. A certain number of workflows, based on some throttling controls, will be executed within the web app’s worker process (w3wp). If the throttle limits are exceeded, the workflow execution will be deferred for batch execution by the timer service (specifically, the “Workflow” timer job). Any workflow that hits a dehydration point (such as a step requiring user action) will be placed into the timer job’s queue for batch execution. Workflows should be actively managed by site administrators, meaning that faulting workflows should be reviewed and remediated frequently. The number of workflow associations for a site/list/content type (seen on the workflow settings page) should be kept under 20, as a recommended practice (see this article for more recommended practices). To do this, workflow managers should remove previous versions of each workflow when a new version is published (assuming that there are no valid running instances of that workflow). For an in-depth discussion of the workflow infrastructure, see this three-part blog post on the SharePoint Developer team blog.
Internally, SharePoint tracks the state of each workflow instance using a combination of values, meaning that there are states other than simply “running”, “completed”, and “canceled”. If you were to query the workflow instance(s) associated with a web or item via PowerShell, you would see that the InternalState property contains values such as “Running, Faulting, Terminated”. That particular status is one of several that indicates a non-recoverable error has been encountered in a workflow instance. Workflow instances in this state (and its sister state of “Canceled, Faulting, Terminated”) should be removed, since they are potential reasons for the Workflow timer job to hang. These instances can be removed (assuming they are not orphaned) by removing the entire workflow association or terminating the workflow instance in the SharePoint UI. Alternatively, they can be removed with PowerShell following the construct below.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$site = Get-SPSite -Identity <url or GUID>
$item = <get list item, either directly or by iterating through a list>
$itemWf = <get item workflow(s) either directly with the InstanceId or by iterating through the $item.Workflow enumeration>
# This will cancel the workflow, which is equivalent to the "Terminate this workflow" option in the UI
# This will remove the workflow instance entirely. Use this to remove a workflow instance that has been canceled, but is stuck in a "Canceled, Faulting, Terminated" state
In most cases of workflow failure, the failing instance will be returned to the workflow queue for re-execution. This behavior allows workflows with recoverable errors to automatically resume and complete successfully. Unfortunately, there are instances in which non-recoverable failures will continue to be retried, thus consuming resources attempting operations that will always fail.
In our case, we started with the standard troubleshooting steps. We increased logging on the Workflow Infrastructure events, pared down Tasks and Workflow History lists to align with recommended limits, disabled security components such as intrusion protection and anti-virus scanning, and terminated workflows with non-recoverable statuses. Still the Workflow timer job continued to hang. We found the ULS information to be hit-or-miss and were unable to track back to the specific workflow instance(s) that triggered the hang condition. Our next step was to use DebugDiag to capture memory dumps of the owstimer service. Analysis of these dumps provided us with workflow associations that were involved, but not with the workflow instance details. We would remove a workflow association, hit the issue again, capture another memory dump, analysis. Rinse. Repeat. This approached was time consuming, but effective. Effective, that is, until we hit a scenario where the business process owner could not justify the removal of an entire association with several hundred running instances. At this point, we were at an impasse. The ULS logs were not pointing to a specific workflow instance and the workflow in question had no instances with a “known bad” internal state.
The solution to finding the failing workflow instances turned out to be fairly simple. We increased the logging on the “SharePoint Foundation” à “General” and “SharePoint Foundation” à “Workflow Infrastructure” categories to Verbose. This quickly fills up ULS logs, so it is not a configuration that you want to enable and forget about. In our case, the Workflow timer job had a pattern of hanging every half hour, one server at a time. This enabled us to crank up the logging a few minutes prior to the expected failure time, let a server’s Workflow timer job hang, then reduce the logging. Finding the workflow that hung the Workflow job was a matter of performing the following steps.
Open ULS log on a server that has a Workflow timer job that hung during the period of increased logging. To confirm that the timer job did actually hang…
…search for “job-workflow)”
…filter the log by the correlation ID associated with the Workflow timer job (use ULSViewer to make your life easier)
…find the last log entry for that correlation ID. If the Workflow timer job is still “running” or “pausing” (check the CA UI for this), but no entries have been recorded for a period (generally one minute of no entries is plenty for confirmation), you have a hung Workflow timer job.
If you start at the bottom of the log and work up, you should find a “Workflow Infrastructure” log entry that contains the workflow instance ID and the list ID (assuming this is a list-level or content type-level workflow). Note both IDs.
Continuing up the log, you should find an entry that looks similar to this: Verbose GetUriScheme (/<relative path/<item name>) . The first entry that you find with this information is most likely your culprit.
Navigate to the item found in step #3 and review the workflow instances for the instance ID (step #2). You will find an in-progress instance that is faulting. Terminate this workflow.
The process above is very much a rinse/repeat process. If you are finding that the faulting instances are coming from the same workflow association (version), you are probably better off removing the entire association. You will need to recycle the timer service on the impacted server after performing the actions above and terminating the faulting workflow.