Estimate performance and capacity planning for workflow in SharePoint Server 2010

 

Applies to: SharePoint Server 2010

This performance and capacity planning article provides guidance on the effect that the use of workflow has on topologies that run Microsoft SharePoint Server 2010.

For general information about capacity planning for SharePoint Server 2010, see Performance and capacity management (SharePoint Server 2010).

Contents

  • Test farm characteristics

  • Test results

  • Recommendations

  • Troubleshooting

Test farm characteristics

The following sections describe the characteristics of the test farm:

  • Dataset

  • Workload

  • Hardware, settings, and topology

Dataset

To get benchmarks, most tests ran on a default Team Site on a single site collection in the farm. The manual start tests started workflows by using a list that has 8,000 items.

Workload

Testing for this scenario helps develop estimates of how different farm configurations respond to changes to the following variables:

  • Effect of the number of front-end servers on throughput for manually starting declarative workflows across multiple computers

  • Effect of the number of front-end servers on throughput for automatically starting declarative workflows on item creation across multiple computers

  • Effect of the number of front-end servers on throughput for completing tasks across multiple computers

The specific capacity and performance figures presented in this article will differ from the figures in real-world environments. The figures presented are intended to provide a starting point for the design of an appropriately scaled environment. After you complete the initial system design, test the configuration to determine whether it will support the factors in your environment.

This section defines the test scenarios and discusses the test process that was used for each scenario. Detailed information, such as test results and specific parameters, is given in each test result section later in this article.

Test name Test description

Throughput for starting workflows manually

  1. Associate the included MOSS Approval workflow, configured to create one task, with a list.

  2. Populate the list with list items.

  3. Call the StartWorkflow Web service method on Workflow.asmx against the items in the list for five minutes.

  4. Calculate throughput by looking at the number of workflows in progress.

Throughput for starting workflows automatically when an item is created

  1. Associate the included MOSS Approval workflow, configured to create one task and to start automatically when an item is created, with a list.

  2. Create items in the list for five minutes.

  3. Calculate throughput by looking at the number of workflows in progress.

Throughput for completing workflow tasks

  1. Associate the included MOSS Approval workflow, configured to create one task and to start automatically when an item is created, with a list.

  2. Create items in the list.

  3. Call the AlterToDo Web service method on Workflow.asmx against the items in the task list that was created by the workflows that started.

  4. Calculate throughput by looking at the number of workflows completed.
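The throughput calculation in the final step of each test can be sketched as follows. This is a minimal illustration only, assuming that throughput is the workflow count divided by the test duration in seconds. The function name and the sample count are hypothetical and are not taken from the test harness.

```python
def throughput_per_second(workflow_count: int, duration_minutes: float) -> float:
    """Return workflows per second for a timed test run."""
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    return workflow_count / (duration_minutes * 60)


# Hypothetical example: 4,305 workflows counted after a five-minute run.
print(round(throughput_per_second(4305, 5), 2))
```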

Hardware, settings, and topology

Topologies that were used for these tests use a single computer for the content database and from one to four front-end computers that have the default installation for SharePoint Server 2010. Although the workflows that were used in these tests are not available in Microsoft SharePoint Foundation 2010, the results can be used to estimate similar scenarios on those deployments. The dataset that was used for these tests contains a single site collection with a single site that is based on the Team Site template on a single content database.

To provide a high level of test-result detail, several farm configurations were used for testing. Farm configurations ranged from one to four Web servers and a single computer that is running Microsoft SQL Server 2008. Testing was performed with one client computer. The database server and all Web servers were 64-bit, and the client computer was 32-bit.

The following table lists the specific hardware that was used for testing.

| | Web server | Database server |
|---|---|---|
| Processor | 2px4c@2.33GHz | 4px4c@2.4GHz |
| RAM | 4 GB | 16 GB |
| Operating system | Windows Server 2008 R2 x64 | Windows Server 2008 R2 x64 |
| Storage | 680 GB | 4.2 TB |
| Number of network adapters | 2 | 2 |
| Network adapter speed | 1 gigabit | 1 gigabit |
| Authentication | NTLM | NTLM |
| Software version | 4747 | SQL Server 2008 R1 |
| Number of SQL Server instances | 1 | 1 |
| Load balancer type | No load balancer | No load balancer |
| ULS logging level | Medium | Medium |

[Figure: Workflow capacity planning topology]

Test results

The following tables show the test results for workflow in SharePoint Server 2010. For each group of tests, only certain specific variables are changed to show the progressive effect on farm performance.

All the tests reported in this article were conducted without think time, a natural delay between consecutive operations. In a real-world environment, each operation is followed by a delay as the user performs the next step in the task. By contrast, in the test, each operation was immediately followed by the next operation, which resulted in a continual load on the farm. This load can cause database contention and other factors that can adversely affect performance.

Effect of scaling the Web server on throughput

The following throughput tests were run by using the Approval workflow that is included with SharePoint Server 2010. The workflow association assigns one task, and all instances are run on a single list. Each instance of this workflow creates the following in the content database:

  • An entry in the Workflows table to store workflow status

  • Five secondary list items (one task and four history items)

  • Four event receivers to handle events on the workflow's parent item and task
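The per-instance content database entries listed above add up quickly under sustained load. The following sketch multiplies them out for a given number of workflow instances. The class and function names are illustrative assumptions. Only the per-instance counts (one Workflows table entry, five secondary list items, four event receivers) come from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceFootprint:
    """Content database entries created by one Approval workflow instance."""
    workflow_rows: int = 1         # entry in the Workflows table
    secondary_list_items: int = 5  # one task and four history items
    event_receivers: int = 4       # on the parent item and the task


def footprint_for(instances: int,
                  per_instance: InstanceFootprint = InstanceFootprint()) -> dict:
    """Return the total number of entries created by the given instance count."""
    return {
        "workflow_rows": instances * per_instance.workflow_rows,
        "secondary_list_items": instances * per_instance.secondary_list_items,
        "event_receivers": instances * per_instance.event_receivers,
    }


# Entries created by 8,000 workflow instances.
print(footprint_for(8000))
```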

Workflow Postpone Threshold was set to be very large so that workflow operations would never get queued. Each test was run five times, and each test ran for five minutes.

Manual start throughput

The test in the following table shows how the addition of front-end servers affects the throughput of starting workflows synchronously through the Web service. This test was run with a user load of 25 concurrent users continuously calling the StartWorkflow method on Workflow.asmx and no other load on the farm. The user load was the optimal load before dropped Web requests occurred. The list is prepopulated with up to 8,000 items.

| Topology | Approval workflow maximum RPS |
|---|---|
| 1x1 | 14.35 |
| 2x1 | 24.08 |
| 3x1 | 29.7 |
| 4x1 | 30.77 |

The following graph shows how throughput changes. Adding front-end servers does not increase farm throughput linearly. Instead, throughput levels off at around three to four front-end servers. In summary, the maximum throughput for manually starting workflows is around 30 workflows per second, and adding more than four front-end servers will likely have an insignificant effect.
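The diminishing return from each additional front-end server can be read directly from the manual start figures. The following sketch computes the gain in RPS contributed by each added server. The variable and function names are illustrative, and the data values are the measured results reported for this test.

```python
# Manual start throughput (RPS) by number of front-end servers.
manual_start_rps = {1: 14.35, 2: 24.08, 3: 29.7, 4: 30.77}


def marginal_gains(rps_by_servers: dict) -> dict:
    """Return the RPS gained by adding each server beyond the first."""
    servers = sorted(rps_by_servers)
    return {
        current: round(rps_by_servers[current] - rps_by_servers[previous], 2)
        for previous, current in zip(servers, servers[1:])
    }


for count, gain in marginal_gains(manual_start_rps).items():
    print(f"Server {count} adds {gain} RPS")
```

Each added server contributes less than the one before it, which matches the flattening of the curve at three to four front-end servers.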

[Figure: Manual start throughput]

Throughput for automatically starting workflows when items are created

The test in the following table shows how the addition of front-end servers affects the throughput of starting workflows automatically when items are created. This test was run with a user load of 150 concurrent users continuously calling the list Web service to create new list items in a single list and no other operations on the server. The list started as an empty list.

| Topology | Approval workflow maximum RPS |
|---|---|
| 1x1 | 13.0 |
| 2x1 | 25.11 |
| 3x1 | 32.11 |
| 4x1 | 32.18 |

The following graph shows how throughput changes. Throughput is very close to that of the manual start test. As in the manual start test, throughput levels off at approximately three or four front-end servers, at a maximum of approximately 32 workflows per second. Adding more than three or four front-end servers will have an insignificant effect.

[Figure: Autostart workflow throughput]

Task completion throughput

The test in the following table shows how the addition of front-end servers affects the throughput of completing workflow tasks. The task list that was used by autostart workflows in the previous test was the list that was used to complete tasks. This test was run with a user load of 25 concurrent users continuously calling the AlterToDo method on Workflow.asmx and no other operations on the server. The list started as an empty list.

| Topology | Approval workflow maximum RPS |
|---|---|
| 1x1 | 13.5 |
| 2x1 | 23.86 |
| 3x1 | 27.06 |
| 4x1 | 27.14 |

The following graph shows how throughput changes. As in the manual start test, throughput levels off at approximately three or four front-end servers, here at a maximum of approximately 27 task completions per second. Adding more than three front-end servers will have an insignificant effect.

[Figure: Task completion throughput]

Effect of list size and number of workflow instances on throughput

The test in the following table shows how throughput changes as list size and the number of workflows increase. Data population was done by running the autostart workflow test continuously until 1 million items were created in the list, and stopping at different checkpoints throughout the test to perform throughput measurements as we did with the core throughput tests. Tests were performed on a 4x1 topology.

To maintain reliability during data population, we had to keep workflow queuing turned on to avoid reaching the maximum number of connections on the database server. If no connections are available and a workflow operation cannot connect to the content database, the operation will be unable to run. See Recommendations for more information about workflow queuing.

| Number of items or workflows | Baseline solution maximum (RPS) |
|---|---|
| 0 | 32.18 |
| 10 | 32 |
| 1,000 | 28.67 |
| 10,000 | 27.16 |
| 100,000 | 16.98 |
| 1,000,000 | 9.27 |

[Figure: Autostart throughput as number of items and workflows increases]

For a single list and single task and history list, throughput decreases steadily between 1,000 and 100,000 items. However, the rate of degradation reduces after that point. We attribute degradation of throughput to many factors.
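To make the degradation easier to compare across checkpoints, the measured values can be expressed as a percentage of the empty-list baseline. The following sketch does this with the reported figures. The function and variable names are illustrative assumptions.

```python
# Autostart throughput (RPS) keyed by number of items or workflows in the list.
autostart_rps = {
    0: 32.18,
    10: 32,
    1_000: 28.67,
    10_000: 27.16,
    100_000: 16.98,
    1_000_000: 9.27,
}


def percent_of_baseline(rps_by_items: dict) -> dict:
    """Express each measurement as a percentage of the smallest-list result."""
    baseline = rps_by_items[min(rps_by_items)]
    return {
        items: round(100 * rps / baseline, 1)
        for items, rps in rps_by_items.items()
    }


for items, percent in percent_of_baseline(autostart_rps).items():
    print(f"{items:>9,} items: {percent}% of baseline throughput")
```

At 1 million items, throughput is a little under 30 percent of the empty-list figure, which is a useful planning bound for long-lived lists that accumulate workflows.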

One factor is the number of rows added to many tables in the content database per instance. As mentioned earlier, workflows create several list items in addition to event receivers that each workflow instance registers. As table sizes grow large in different scopes, adding rows becomes slower, and the aggregate slowdown for these additions becomes a more significant degradation than what is experienced with only list item creation.

Task list size contributes additional overhead. When we compared throughput for workflows run on new lists with throughput for workflows run on new task lists, task list size had the bigger effect on performance. This is because task lists register for more event receivers than the parent list items. The following table describes the differences.

| Throughput with different list configurations (workflows started per second) | Million item task list | Empty task list |
|---|---|---|
| Million item list | 9.27 | 12 |
| Empty item list | 9.3 | 13 |

If you know that you will have to run lots of workflows against large lists and need more throughput than what your tests show you can get, consider whether your task lists can be separated between workflow associations.

Recommendations

This section provides general performance and capacity recommendations. Use these recommendations to determine the capacity and performance characteristics of the starting topology that you created to decide whether you have to scale out or scale up the starting topology.

For specific information about minimum and recommended system requirements, see Hardware and software requirements (SharePoint Server 2010).

Scaled-out topologies

You can increase workflow throughput by scaling out to up to four Web servers. Beyond four Web servers, any additional increase will be insignificant. Workflow throughput can be restricted by performance-related workflow settings. These settings are described in more detail in Workflow queuing and performance-related settings.

Estimating throughput targets

Many factors can affect throughput. These factors include the number of users, and the type, complexity, and frequency of user operations. More complex workflows that perform many operations against the content database or register for more events will run slower and consume more resources than other workflows.

The workflow used in this test creates several entries in the content database that are built in to the task activities. If you expect to have workflows with small numbers of tasks, you can expect similar throughput characteristics. If most workflows contain very lightweight operations, throughput may be increased. If your workflows will consist of lots of tasks or intense back-end operations or processing power, you can expect throughput to decrease.

In addition to understanding what the workflows will do, consider where the workflows will run and whether they will run against large lists, on which throughput will decrease over time.

SharePoint Server 2010 can be deployed and configured in many ways. As a result, there is no simple way to estimate how many users can be supported by a given number of servers. Therefore, make sure that you conduct testing in your own environment before you deploy SharePoint Server 2010 in a production environment.

Workflow queuing and performance-related settings

Workflow uses a queuing system to control workflow-related stress on farm resources and the content database. By using this system, when the number of workflows executing against a database reaches an administrator-configured threshold, successive workflow operations are added to the queue to be run by the Workflow Timer service. By default, this service receives a batch of workflow work items through timer jobs every five minutes.

Several farm administrator settings directly and indirectly related to the queuing mechanism affect the performance and scaling for workflow. The following sections describe what these settings do and how to adjust them to meet performance requirements.

Understanding the basic queue settings

Farm administrators can adjust the following settings to configure basic characteristics of the queuing system:

  • Workflow Postpone Threshold (Set-SPFarmConfig –WorkflowPostponeThreshold <integer>)

    The maximum number of workflows that can execute against a single content database before additional requests and operations are queued. Queued workflows show a status of Starting. This is a farm-wide setting that has a default value of 15. This represents the number of workflow operations that are being processed at a time, not the maximum number of workflows that can be in progress. As workflow operations are completed, successive operations will be able to run.

  • Workflow Event Delivery Batch Size (Set-SPFarmConfig –WorkflowBatchSize <integer>)

    The Workflow Timer service is an exception to the postpone threshold limit and will retrieve batches of items from the queue and execute them one at a time. These batches can be larger than the postpone threshold. The number of work items that the service receives per run is set by using the BatchSize property. The BatchSize property can be set one time per service instance. The default value is 100. When running on application servers that are not configured to be front-end servers, the Workflow Timer service requires workflow configuration settings in Web.config to be set in the configuration database. This must be done through a script that calls UpdateWorkflowConfigurationSettings() on the SPWebApplication object, which will copy the Web.config settings from a front-end server.

  • Workflow Timer Job Frequency (Set-SPTimerJob job-workflow –schedule <string>)

    The frequency with which the Workflow Timer service runs can be adjusted through timer job settings. By default, the service is set to run every five minutes. This means that there can be a five-minute delay before the work items at the top of the queue are processed.

    Note

    Scheduled work items such as task due date expirations are also picked up by the same timer mechanism. Therefore, they will be delayed by the same time interval.

The Workflow Timer service can be turned off on each server by using Shared Services Administration in Central Administration. By default, it will run on every front-end server in the farm. Each job will iterate through all the Web applications and content databases in the farm.

The combination of the postpone threshold, batch size, and timer frequency can be used to limit workflow operations against the database. Maximum throughput will be affected by how quickly operations get queued and processed from the queue.

For example, with the default settings, a single timer service, and a single content database, if there are 1,000 items in the queue, it will take ten timer job runs, or 50 minutes, to execute them all. However, if you set the batch size to 1,000 and set the timer job to run every minute, the operations would all begin execution after a minute. If you set the postpone threshold higher, more operations will run synchronously, reducing the number of requests that get queued and reducing the total time that is required to process those workflows.
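The arithmetic in this example can be sketched as a small helper. It is a simplified model, assuming a single timer service, a single content database, that every run dequeues a full batch, and that the first run happens one interval after the items are queued. The function name is illustrative.

```python
import math


def minutes_to_dequeue_all(queued_items: int, batch_size: int,
                           interval_minutes: int) -> int:
    """Return the minutes until every queued item has been picked up."""
    if batch_size <= 0 or interval_minutes <= 0:
        raise ValueError("batch_size and interval_minutes must be positive")
    runs = math.ceil(queued_items / batch_size)
    return runs * interval_minutes


# Default settings: batch size 100, timer job every five minutes.
print(minutes_to_dequeue_all(1000, batch_size=100, interval_minutes=5))

# Tuned settings: batch size 1,000, timer job every minute.
print(minutes_to_dequeue_all(1000, batch_size=1000, interval_minutes=1))
```

The model ignores the time that each work item takes to execute, so treat the result as a lower bound when you choose batch size and timer frequency.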

Note

We recommend setting the postpone threshold no larger than 200, because concurrent workflow instances run in their own threads and will each open new SQL Server connections, potentially hitting the maximum connection limits on the database server over time.

If you do not want workflows running on front-end servers and know that operations do not have to occur immediately, you can isolate the Workflow Timer service to run on select application servers, set the postpone threshold to a very low number to force workflows to usually run in the timer service, and set the batch size large so that it receives items more quickly and frequently. If you want to make sure workflows run more synchronously across the system, set the postpone threshold larger so that workflows are not postponed often and have a more immediate effect.

Modify these settings to optimize for how you want workflows to operate. We recommend experimenting with different settings and testing them to optimize them for your environments and requirements.

Adjusting settings for queuing

If the farm will sustain heavy workflow load for long periods of time or there will be many delay events queued from workflows in the system, the number of queued workflow operations will grow. In addition to the basic queue settings, you may have to tune the following settings to keep the queue in good health:

  • Work Item Event Delivery Batchsize

    The table that workflow uses for queued events is a general work item table that is shared with other non-workflow features in SharePoint Server 2010. Thus, there is another timer job that dequeues non-workflow work items. Similar to the workflow event delivery batch size, the work item event delivery batch size specifies the number of non-workflow work items that are dequeued at a time.

  • Workflow Failover Timer Job Frequency

    In rare circumstances, if workflow events cannot be delivered to a workflow instance, the event delivery will be scheduled on the queue as a failover work item to be retried later (starting with 5 minutes later, and then 10 minutes if it fails again, and then 20 minutes, and so on). A workflow failover timer job dequeues failover work items, and this setting adjusts the frequency at which the failover timer will run. By default, this runs every 15 minutes.

  • Workflow Failover Batchsize

    Similar to the workflow and work item batch size settings, this setting controls the number of failover events that each failover timer job will dequeue.

    Because there are many timer jobs that operate on the same table, lots of queued items can cause database contention and reduce throughput and reliability. To reduce contention, we recommend the following:

    • Balance Postpone Threshold and Workflow Batchsize so that batch size is small enough or timer job frequency high enough that a timer job can be completed before the next timer job starts in order to avoid building up too many parallel timer job runs that cannot finish.

    • To avoid table locks, do not set either of the two batch size settings larger than 5,000.
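The failover retry schedule described under Workflow Failover Timer Job Frequency (5 minutes, then 10, then 20, and so on) can be sketched as follows. This assumes that the delay keeps doubling on every failed attempt, which the article implies but does not state explicitly. The function name is illustrative.

```python
def failover_retry_delays(attempts: int, first_delay_minutes: int = 5) -> list:
    """Return the delay in minutes before each retry, doubling each time."""
    # Assumption: "and so on" means the doubling pattern continues unchanged.
    return [first_delay_minutes * 2 ** attempt for attempt in range(attempts)]


delays = failover_retry_delays(5)
print(delays)
print(f"Total wait before the fifth retry: {sum(delays)} minutes")
```

Because the failover timer job itself runs on a schedule, the actual retry time is also bounded by when the next failover job runs after the delay expires.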

Tip

Offset the frequency of the workflow, work item, and failover timer jobs so that they are not always executing at the same times. In our data population scripts for a large list that has workflows, running the workflow timer job every four minutes and the failover timer job every six minutes worked well.
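One way to see why offset frequencies help is to count how often two timer jobs start in the same minute. The following sketch assumes that both jobs start together at minute zero and run on fixed periods. Real timer job start times may differ, so treat this as an approximation. The function names are illustrative.

```python
from math import gcd


def lcm(a: int, b: int) -> int:
    """Return the least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def share_of_overlapping_runs(workflow_period: int, failover_period: int) -> float:
    """Fraction of failover runs that start in the same minute as a workflow run."""
    return failover_period / lcm(workflow_period, failover_period)


# Default frequencies: workflow every 5 minutes, failover every 15 minutes.
print(share_of_overlapping_runs(5, 15))

# Offset frequencies from the tip: workflow every 4 minutes, failover every 6.
print(share_of_overlapping_runs(4, 6))
```

Under these assumptions, every failover run coincides with a workflow run at the default frequencies, but only half of them do with the four-minute and six-minute frequencies.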

Improving scaling for task and history lists

Workflows generate many tasks and history items per instance. By default, these lists are indexed to help with scaling, but as these lists grow, performance will always decrease. To reduce the rate of the decrease, keep separate history and task lists for different workflow associations, and periodically change these lists in the workflow association settings as lists become large.

Workflow also has a daily timer job (job-workflow-autoclean) that will automatically delete workflow instances and tasks for instances that have been finished for more than 60 days. Leave this timer job on to keep the task lists and events on the task list as clean as possible. If data must be preserved, write the data to other lists or archive locations. Workflow history items are not deleted by this timer job. If you have to clean these up, this should be done with a script or manually through the list user interface.
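The 60-day rule can be used to estimate which finished workflow instances the cleanup job would remove on a given day. The following sketch shows the date comparison only. It is an assumption about how the rule is applied, not the timer job's actual implementation, and the names are illustrative.

```python
from datetime import datetime, timedelta

# Retention period stated in this article for finished workflow instances.
RETENTION = timedelta(days=60)


def eligible_for_autoclean(finished_at: datetime, now: datetime) -> bool:
    """Return True if the instance finished more than 60 days before 'now'."""
    return now - finished_at > RETENTION


today = datetime(2010, 6, 30)
print(eligible_for_autoclean(datetime(2010, 4, 1), today))   # finished 90 days ago
print(eligible_for_autoclean(datetime(2010, 6, 1), today))   # finished 29 days ago
```

A check like this can help you decide which task data to copy to another list or archive location before it is deleted.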

Other considerations

Removing columns on lists causes a database operation proportional to the number of items in the list. Removing workflow associations will remove the workflow status column from the list. This causes a large operation on large lists. If you know that a list has millions of items, set the workflow to No New Instances instead of removing workflows.

Troubleshooting

Bottleneck: Database contention (locks)

Cause: Database locks prevent multiple users from making conflicting modifications to a set of data. When a set of data is locked by a user or process, no other user or process can change that same set of data until the first user or process finishes changing the data and relinquishes the lock.

Resolution: To help reduce the incidence of database locks, you can do the following:

  • Distribute workflows to more document libraries.

  • Scale up the database server.

  • Tune the database server hard disk for read/write.

Methods exist to circumvent the database locking system in SQL Server 2005, such as the NOLOCK parameter. However, we do not recommend or support use of this method because of the possibility of data corruption.

Bottleneck: Database server disk I/O

Cause: When the number of I/O requests to a hard disk exceeds the disk's I/O capacity, the requests will be queued. As a result, the time to complete each request increases.

Resolution: Distributing data files across multiple physical drives allows for parallel I/O. The blog SharePoint Disk Allocation and Disk I/O (https://go.microsoft.com/fwlink/p/?LinkId=129557) contains useful information about resolving disk I/O issues.

Bottleneck: Web server CPU utilization

Cause: When a Web server is overloaded with user requests, average CPU utilization will approach 100 percent. This prevents the Web server from responding to requests quickly and can cause timeouts and error messages on client computers.

Resolution: This issue can be resolved in one of two ways. You can add Web servers to the farm to distribute user load, or you can scale up the Web server or servers by adding faster processors.

Web servers

The following table shows performance counters and processes to monitor for Web servers in a farm.

| Performance counter | Apply to object | Notes |
|---|---|---|
| Processor time | Total | Shows the percentage of elapsed time that this thread used the processor to execute instructions. |
| Memory utilization | Application pool | Shows the average utilization of system memory for the application pool. You must determine the correct application pool to monitor. The basic guideline is to determine peak memory utilization for a given Web application, and assign that number plus 10 to the associated application pool. |

Database servers

The following table shows performance counters and processes to monitor for database servers in your farm.

| Performance counter | Apply to object | Notes |
|---|---|---|
| Average disk queue length | Hard disk that contains SharedServices.mdf | Average values larger than 1.5 per spindle indicate that the write times for that hard disk are insufficient. |
| Processor time | SQL Server process | Average values larger than 80 percent indicate that processor capacity on the database server is insufficient. |
| Processor time | Total | Shows the percentage of elapsed time that this thread used the processor to execute instructions. |
| Memory utilization | Total | Shows the average utilization of system memory. |
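The two database server thresholds given above (an average disk queue length of 1.5 per spindle and 80 percent processor time for the SQL Server process) can be checked against collected counter averages. The following sketch is a minimal illustration. The function name, parameter names, and sample values are assumptions, and gathering the counter values themselves is outside its scope.

```python
def database_server_warnings(avg_disk_queue_length: float, spindles: int,
                             sql_processor_percent: float) -> list:
    """Return warnings for counters that exceed the documented thresholds."""
    if spindles <= 0:
        raise ValueError("spindles must be positive")
    warnings = []
    if avg_disk_queue_length / spindles > 1.5:
        warnings.append("Disk write times are insufficient")
    if sql_processor_percent > 80:
        warnings.append("Processor capacity is insufficient")
    return warnings


# Hypothetical sample: queue length 9 across 4 spindles, SQL Server at 65 percent.
print(database_server_warnings(9.0, spindles=4, sql_processor_percent=65.0))
```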

See Also

Other Resources

Workflow Scalability and Performance in Windows SharePoint Services 3.0 (https://go.microsoft.com/fwlink/p/?LinkId=207353)