Compute Resource Consolidation Pattern

Article
08/26/2015

Consolidate multiple tasks or operations into a single computational unit. This pattern can increase compute resource utilization, and reduce the costs and management overhead associated with performing compute processing in cloud-hosted applications.

Context and Problem

A cloud application frequently implements a variety of operations. In some solutions it may make sense initially to follow the design principle of separation of concerns, and divide these operations into discrete computational units that are hosted and deployed individually (for example, as separate roles in a Microsoft Azure Cloud Service, separate Azure Web Sites, or separate Virtual Machines). However, although this strategy can help to simplify the logical design of the solution, deploying a large number of computational units as part of the same application can increase runtime hosting costs and make management of the system more complex.

As an example, Figure 1 shows the simplified structure of a cloud-hosted solution that is implemented using more than one computational unit. Each computational unit runs in its own virtual environment. Each function has been implemented as a separate task (labeled Task A through Task E) running in its own computational unit.

Figure 1 - Running tasks in a cloud environment by using a set of dedicated computational units

Each computational unit consumes chargeable resources, even when it is idle or lightly used. Therefore, this approach may not always be the most cost-effective solution.

In Azure, this concern applies to roles in a Cloud Service, Web Sites, and Virtual Machines. These items execute in their own virtual environment. Running a collection of separate roles, web sites, or virtual machines that are designed to perform a set of well-defined operations, but that need to communicate and cooperate as part of a single solution, may be an inefficient use of resources.

Solution

To help reduce costs, increase utilization, improve communication speed, and ease the management effort it may be possible to consolidate multiple tasks or operations into a single computational unit.

Tasks can be grouped according to a variety of criteria based on the features provided by the environment, and the costs associated with these features. A common approach is to look for tasks that have a similar profile concerning their scalability, lifetime, and processing requirements. Grouping these items together allows them to scale as a unit. The elasticity provided by many cloud environments enables additional instances of a computational unit to be started and stopped according to the workload. For example, Azure provides autoscaling that you can apply to roles in a Cloud Service, Web Sites, and Virtual Machines. For more information, see Autoscaling Guidance.

As a counter example to show how scalability can be used to determine which operations should probably not be grouped together, consider the following two tasks:

Task 1 polls for infrequent, time-insensitive messages sent to a queue.
Task 2 handles high-volume bursts of network traffic.

The second task requires elasticity that may involve starting and stopping a large number of instances of the computational unit. Applying the same scaling to the first task would simply result in more tasks listening for infrequent messages on the same queue, and is a waste of resources.

In many cloud environments it is possible to specify the resources available to a computational unit in terms of the number of CPU cores, memory, disk space, and so on. Generally, the more resources specified, the greater the cost. For financial efficiency, it is important to maximize the amount of work an expensive computational unit performs, and not let it become inactive for an extended period.

If there are tasks that require a great deal of CPU power in short bursts, consider consolidating these into a single computational unit that provides the necessary power. However, it is important to balance this need to keep expensive resources busy against the contention that could occur if they are over-stressed. Long-running, compute-intensive tasks should probably not share the same computational unit, for example.

Issues and Considerations

Consider the following points when implementing this pattern:

Scalability and Elasticity. Many cloud solutions implement scalability and elasticity at the level of the computational unit by starting and stopping instances of units. Avoid grouping tasks that have conflicting scalability requirements in the same computational unit.
Lifetime. The cloud infrastructure may periodically recycle the virtual environment that hosts a computational unit. When executing many long-running tasks inside a computational unit, it may be necessary to configure the unit to prevent it from being recycled until these tasks have finished. Alternatively, design the tasks by using a check-pointing approach that enables them to stop cleanly, and continue at the point at which they were interrupted when the computational unit is restarted.
Release Cadence. If the implementation or configuration of a task changes frequently, it may be necessary to stop the computational unit hosting the updated code, reconfigure and redeploy the unit, and then restart it. This process will also require that all other tasks within the same computational unit are stopped, redeployed, and restarted.
Security. Tasks in the same computational unit may share the same security context and be able to access the same resources. There must be a high degree of trust between the tasks, and confidence that that one task is not going to corrupt or adversely affect another. Additionally, increasing the number of tasks running in a computational unit may increase the attack surface of the computational unit; each task is only as secure as the one with the most vulnerabilities.
Fault Tolerance. If one task in a computational unit fails or behaves abnormally, it can affect the other tasks running within the same unit. For example, if one task fails to start correctly it may cause the entire startup logic for the computational unit to fail, and prevent other tasks in the same unit from running.
Contention. Avoid introducing contention between tasks that compete for resources in the same computational unit. Ideally, tasks that share the same computational unit should exhibit different resource utilization characteristics. For example, two compute-intensive tasks should probably not reside in the same computational unit, and neither should two tasks that consume large amounts of memory. However, mixing a compute intensive task with a task that requires a large amount of memory may be a viable combination.

Note

You should consider consolidating compute resources only for a system that has been in production for a period of time so that operators and developers can monitor the system and create a heat map that identifies how each task utilizes differing resources. This map can be used to determine which tasks are good candidates for sharing compute resources.
**Complexity.****Combining multiple tasks into a single computational unit adds complexity to the code in the unit, possibly making it more difficult to test, debug, and maintain.
Stable Logical Architecture. Design and implement the code in each task so that it should not need to change, even if the physical environment in which task runs does change.
**Other Strategies.****Consolidating compute resources is only one way to help reduce costs associated with running multiple tasks concurrently. It requires careful planning and monitoring to ensure that it remains an effective approach. Other strategies may be more appropriate, depending on the nature of the work being performed and the location of the users on whose behalf these tasks are running. For example, functional decomposition of the workload (as described by the Compute Partitioning Guidance) may be a better option.

When to Use this Pattern

Use this pattern for tasks that are not cost effective if they run in their own computational units. If a task spends much of its time idle, running this task in a dedicated unit can be expensive.

This pattern might not be suitable for tasks that perform critical fault-tolerant operations, or tasks that process highly-sensitive or private data and require their own security context. These tasks should run in their own isolated environment, in a separate computational unit.

Example

When building a cloud service on Azure, it’s possible to consolidate the processing performed by multiple tasks into a single role. Typically this is a worker role that performs background or asynchronous processing tasks.

Note

In some cases it may be possible to include background or asynchronous processing tasks in the web role. This technique can help to reduce costs and simplify deployment, although it can impact the scalability and responsiveness of the public-facing interface provided by the web role. The article Combining Multiple Azure Worker Roles into an Azure Web Role contains a detailed description of implementing background or asynchronous processing tasks in a web role.

The role is responsible for starting and stopping the tasks. When the Azure fabric controller loads a role, it raises the Start event for the role. You can override the OnStart method of the WebRole or WorkerRole class to handle this event, perhaps to initialize the data and other resources on which the tasks in this method depend.

When the OnStart method completes, the role can start responding to requests. You can find more information and guidance about using the OnStart and Run methods in a role in theApplication Startup Processes section in the patterns & practices guide Moving Applications to the Cloud.

Note

Keep the code in the OnStart method as concise as possible. Azure does not impose any limit on the time taken for this method to complete, but the role will not be able to start responding to network requests sent to it until this method completes.

When the OnStart method has finished, the role executes the Run method. At this point, the fabric controller can start sending requests to the role.

Place the code that actually creates the tasks in the Run method. Note that the Run method effectively defines the lifetime of the role instance. When this method completes, the fabric controller will arrange for the role to be shut down.

When a role shuts down or is recycled, the fabric controller prevents any more incoming requests being received from the load balancer and raises the Stop event. You can capture this event by overriding the OnStop method of the role and perform any tidying up required before the role terminates.

Note

Any actions performed in the OnStop method must be completed within five minutes (or 30 seconds if you are using the Azure emulator on a local computer); otherwise the Azure fabric controller assumes that the role has stalled and will force it to stop.

Figure 2 illustrates the lifecycle of a role, and the tasks and resources that it hosts. The tasks are started by the Run method, which then waits for the tasks to complete. The tasks themselves, which implement the business logic of the cloud service, can respond to messages posted to the role through the Azure load balancer.

Figure 2 - The lifecycle of tasks and resources in a role in a Azure cloud service

The WorkerRole.cs file in the ComputeResourceConsolidation.Worker project shows an example of how you might implement this pattern in a Azure cloud service.

Note

The ComputeResourceConsolidation.Worker project is part of the ComputeResourceConsolidation solution that is available for download with this guidance.

In the worker role, code that runs when the role is initialized creates the required cancellation token and a list of tasks to run.

public class WorkerRole: RoleEntryPoint{  // The cancellation token source used to cooperatively cancel running tasks.  private readonly CancellationTokenSource cts = new CancellationTokenSource ();  // List of tasks running on the role instance.  private readonly List<Task> tasks = new List<Task>();  // List of worker tasks to run on this role.  private readonly List<Func<CancellationToken, Task>> workerTasks                          = new List<Func<CancellationToken, Task>>    {      MyWorkerTask1,      MyWorkerTask2    };
  
  ...
}

The MyWorkerTask1 and the MyWorkerTask2 methods are provided to illustrate how to perform different tasks within the same worker role. The following code shows MyWorkerTask1. This is a simple task that sleeps for 30 seconds and then outputs a trace message. It repeats this process indefinitely until the task is cancelled. The code in MyWorkerTask2 is very similar.

// A sample worker role task.private static async Task MyWorkerTask1(CancellationToken ct){  // Fixed interval to wake up and check for work and/or do work.  var interval = TimeSpan.FromSeconds(30);  try  {    while (!ct.IsCancellationRequested)    {      // Wake up and do some background processing if not canceled.      // TASK PROCESSING CODE HERE      Trace.TraceInformation("Doing Worker Task 1 Work");      // Go back to sleep for a period of time unless asked to cancel.      // Task.Delay will throw an OperationCanceledException when canceled.      await Task.Delay(interval, ct);    }  }  catch (OperationCanceledException)  {    // Expect this exception to be thrown in normal circumstances or check    // the cancellation token. If the role instances are shutting down, a    // cancellation request will be signaled.    Trace.TraceInformation("Stopping service, cancellation requested");    // Re-throw the exception.    throw;  }}

Note

The approach shown by the sample code is a common implementation of a background process. In a real world application you can follow this same structure, except that you should place your own processing logic in the body of the loop that waits for the cancellation request.

After the worker role has initialized the resources it uses, the Run method starts the two tasks concurrently, as shown here.

...
// RoleEntry Run() is called after OnStart().  // Returning from Run() will cause a role instance to recycle.public override void Run(){  // Start worker tasks and add them to the task list.  foreach (var worker in workerTasks)    tasks.Add(worker(cts.Token));  Trace.TraceInformation("Worker host tasks started");  // The assumption is that all tasks should remain running and not return,   // similar to role entry Run() behavior.  try  {    Task.WaitAny(tasks.ToArray());  }  catch (AggregateException ex)  {    Trace.TraceError(ex.Message);    // If any of the inner exceptions in the aggregate exception     // are not cancellation exceptions then re-throw the exception.    ex.Handle(innerEx => (innerEx is OperationCanceledException));  }  // If there was not a cancellation request, stop all tasks and return from Run()  // An alternative to cancelling and returning when a task exits would be to   // restart the task.  if (!cts.IsCancellationRequested)  {    Trace.TraceInformation("Task returned without cancellation request");    Stop(TimeSpan.FromMinutes(5));  }}
...

In this example, the Run method waits for tasks to be completed. If a task is canceled, the Run method assumes that the role is being shut down and waits for the remaining tasks to be canceled before finishing (it waits for a maximum of five minutes before terminating). If a task fails due to an expected exception, the Run method cancels the task.

Note

Note that you could implement more comprehensive monitoring and exception handling strategies in the Run method such as restarting tasks that have failed, or including code that enables the role to stop and start individual tasks.

The Stop method shown in the following code is called when the fabric controller shuts down the role instance (it is invoked from the OnStop method). The code stops each task gracefully by cancelling it. If any task takes more than five minutes to complete, the cancellation processing in the Stop method ceases waiting and the role is terminated.

// Stop running tasks and wait for tasks to complete before returning // unless the timeout expires.private void Stop(TimeSpan timeout){  Trace.TraceInformation("Stop called. Canceling tasks.");  // Cancel running tasks.  cts.Cancel();  Trace.TraceInformation("Waiting for canceled tasks to finish and return");  // Wait for all the tasks to complete before returning. Note that the   // emulator currently allows 30 seconds and Azure allows five  // minutes for processing to complete.  try  {    Task.WaitAll(tasks.ToArray(), timeout);  }  catch (AggregateException ex)  {    Trace.TraceError(ex.Message);    // If any of the inner exceptions in the aggregate exception     // are not cancellation exceptions then re-throw the exception.    ex.Handle(innerEx => (innerEx is OperationCanceledException));  }}

The following patterns and guidance may also be relevant when implementing this pattern:

Autoscaling Guidance. Autoscaling can be used to start and stop instances of service hosting computational resources, depending on the anticipated demand for processing.
Compute Partitioning Guidance. This guidance describes how to allocate the services and components in a cloud service in a way that helps to minimize running costs while maintaining the scalability, performance, availability, and security of the service.

More Information

The blog article Combining Multiple Azure Worker Roles into an Azure Web Role.
The section Application Startup Processes in the patterns & practices guide Moving Applications to the Cloud on MSDN.

This pattern has a sample application associated with it. You can download the "Cloud Design Patterns – Sample Code" from the Microsoft Download Center at https://aka.ms/cloud-design-patterns-sample.

Next Topic | Previous Topic | Home | Community

Compute Resource Consolidation Pattern

Context and Problem

Solution

Issues and Considerations

When to Use this Pattern

Example

Related Patterns and Guidance

More Information

Additional resources