Cloud challenges: Scheduling

The effectiveness of a distributed program hinges on how its constituent tasks are scheduled across distributed machines. This scheduling is usually divided into two classes: job scheduling and task scheduling. Tasks are the finest unit of execution granularity, and a job can encompass one or many tasks. When multiple users submit numerous jobs simultaneously for execution on a cluster, a job scheduler decides which job runs next. Hadoop MapReduce, for instance, uses a first-in, first-out (FIFO) job scheduler, whereby jobs run in order of receipt and a scheduled job occupies the whole cluster until it has no more tasks to schedule. Hadoop MapReduce also offers other job schedulers, such as the Capacity Scheduler and the Fair Scheduler.

After a job is admitted to the cluster, the scheduling decision becomes how to place the job's component tasks. Tasks can be scheduled either close to the data they will process or anywhere; when tasks are scheduled near their data, the scheduler is said to exploit locality. For example, Hadoop MapReduce incorporates two types of tasks, map tasks and reduce tasks. Map tasks are scheduled in the vicinity of their uniform-sized input HDFS blocks, while reduce tasks are scheduled at any cluster node (anywhere), irrespective of where their input data resides. Pregel and GraphLab, on the other hand, do not exploit locality when scheduling tasks.
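To make the distinction concrete, the sketch below places each map task on a node that holds a replica of its input block whenever that node has a free slot, and falls back to any available node otherwise; reduce-task placement would simply skip the locality check. The data structures and function names are illustrative assumptions, not Hadoop's actual scheduler code.

```python
# Illustrative sketch only: locality-aware placement of map tasks.
# block_locations maps an HDFS block to the nodes storing its replicas;
# free_slots tracks the remaining task slots per node.

def schedule_map_tasks(tasks, block_locations, free_slots):
    """tasks: list of (task_id, block_id). Returns a dict task_id -> node."""
    placement = {}
    for task_id, block_id in tasks:
        local = [n for n in block_locations.get(block_id, ()) if free_slots[n] > 0]
        # Prefer a data-local node; otherwise accept any node with capacity.
        candidates = local or [n for n, s in free_slots.items() if s > 0]
        if not candidates:
            break  # cluster is full; remaining tasks wait for free slots
        node = max(candidates, key=lambda n: free_slots[n])
        placement[task_id] = node
        free_slots[node] -= 1
    return placement

# Example: m1 and m2 read block b1 (replicated on node1 and node2),
# m3 reads block b2 (stored only on node3).
blocks = {"b1": {"node1", "node2"}, "b2": {"node3"}}
slots = {"node1": 1, "node2": 1, "node3": 1}
print(schedule_map_tasks([("m1", "b1"), ("m2", "b1"), ("m3", "b2")], blocks, slots))
```

Production schedulers refine this preference order (for example, Hadoop also considers rack-level locality), but the core idea is the one described above: node-local first, then anywhere.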

To avoid significant performance degradation, task schedulers must also account for heterogeneity in the underlying cloud. Similar tasks belonging to the same job can end up scheduled on nodes of differing speeds, which introduces load imbalance and makes the job progress at the pace of its slowest tasks, its stragglers. Strategies such as Hadoop MapReduce's speculative execution can mitigate such problems by launching backup copies of slow tasks on other nodes.
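The following toy function shows the intuition behind straggler detection in speculative execution: tasks whose progress rate falls well below that of their peers become candidates for a backup copy elsewhere. The threshold and progress model are simplified assumptions, not Hadoop's actual heuristics.

```python
# Illustrative sketch: flag "straggler" tasks whose progress rate lags far
# behind the average of their sibling tasks. slowdown_factor is a made-up
# threshold, not Hadoop's actual heuristic.

def find_stragglers(progress, elapsed, slowdown_factor=0.5):
    """progress: task_id -> fraction complete (0..1); elapsed: task_id -> seconds."""
    rates = {t: progress[t] / elapsed[t] for t in progress if elapsed[t] > 0}
    if not rates:
        return []
    avg_rate = sum(rates.values()) / len(rates)
    return [t for t, r in rates.items() if r < slowdown_factor * avg_rate]

# Example: m3 runs on a slow node and lags behind its siblings, so it becomes
# a candidate for a speculative (backup) copy on another node.
progress = {"m1": 0.90, "m2": 0.85, "m3": 0.20}
elapsed = {"m1": 60, "m2": 60, "m3": 60}
print(find_stragglers(progress, elapsed))   # -> ['m3']
```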

In addition, task schedulers must seek to enhance system utilization and task parallelism. The objective is to distribute tasks uniformly across cluster machines so that the available resources are used fairly and parallelism increases, but this goal conflicts with other priorities. First, distributing tasks evenly across machines can undermine locality. Machines in a Hadoop cluster, for instance, can hold different numbers of HDFS blocks. If one machine stores significantly more blocks than the others, strict locality would place most map tasks at that machine, leaving the other machines lightly loaded and underutilized. Concentrating so many tasks on a single machine also reduces task parallelism, as illustrated below.
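The toy numbers below (entirely made up) quantify this tension: honoring locality strictly piles work onto the block-heavy machine, while spreading tasks evenly forces some of them to read their input blocks remotely.

```python
# Hypothetical cluster: node1 stores far more HDFS blocks than its peers.
block_counts = {"node1": 8, "node2": 1, "node3": 1}   # blocks stored per node
num_tasks = 10                                        # one map task per block

# Strict locality: every map task runs where its block lives.
local_load = dict(block_counts)

# Even spreading: ignore locality and balance the ten tasks across three nodes.
even_load = {n: num_tasks // len(block_counts) for n in block_counts}
even_load["node1"] += num_tasks % len(block_counts)
# Tasks beyond a node's locally stored blocks must fetch their input remotely
# (ignoring HDFS replication for simplicity).
remote_reads = sum(max(0, even_load[n] - block_counts[n]) for n in block_counts)

print("strict locality  ->", local_load)   # node1 carries 8 of the 10 tasks
print("even spreading   ->", even_load)
print("remote block reads:", remote_reads)
```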

If locality is relaxed somewhat, utilization can be enhanced, loads across machines can be balanced, and task parallelism can be increased. However, relaxing locality means moving data toward tasks, and if done injudiciously it can raise communication overheads, impeding scalability and potentially degrading performance. In fact, with datacenters hosting thousands of machines, frequently moving data toward distant tasks can become a major bottleneck. To improve performance and reduce cost, an ideal task scheduler would strike a balance among system utilization, load balancing, task parallelism, communication overhead, and scalability. Unfortunately, in practice this ideal is hard to realize, and most task schedulers optimize one objective and overlook the others.
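One common compromise, similar in spirit to the delay-scheduling policy used by Hadoop's Fair Scheduler, is to let a task wait a bounded number of scheduling rounds for a local slot before accepting a remote one. The sketch below is illustrative; the function, its parameters, and the round-based model are assumptions, not any engine's actual API.

```python
# Illustrative sketch: relax locality only after a bounded wait.

def assign_slot(task_waits, task_id, local_free, remote_free, max_wait=3):
    """Decide 'local', 'remote', or 'wait' for one scheduling round."""
    if local_free:
        task_waits.pop(task_id, None)
        return "local"
    waited = task_waits.get(task_id, 0)
    if remote_free and waited >= max_wait:
        task_waits.pop(task_id, None)
        return "remote"  # locality relaxed: the block is read over the network
    task_waits[task_id] = waited + 1
    return "wait"

# Example: no local slot frees up, so after three rounds the task runs remotely.
waits = {}
for round_no in range(4):
    print(round_no, assign_slot(waits, "m7", local_free=False, remote_free=True))
```

Choosing max_wait trades locality against scheduling delay: a small value favors utilization and parallelism, while a large value favors data-local execution.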

Another major challenge when scheduling jobs and tasks is to meet what are called service-level objectives (SLOs), which reflect the performance expectations of end users. Cloud providers have identified SLO violations as a major cause of user dissatisfaction [1]. An SLO might be expressed, for example, as a maximum acceptable latency for allocating desired resources to a job, a soft/hard deadline to finish a job, or GPU preferences for certain tasks. In multitenant, heterogeneous clusters, SLOs are hard to achieve, especially when new jobs arrive while others are executing. This situation could require suspending currently running tasks and allowing the new ones to proceed in order to meet their own specified SLOs. The capability to suspend and resume tasks is called task elasticity. Unfortunately, most distributed analytics engines, including Hadoop MapReduce, Pregel, and GraphLab, do not yet support task elasticity. Enabling elasticity is quite challenging and requires identifying safe points at which a task can be suspended without affecting its correctness and such that its committed work need not be repeated on resumption. Clearly, this capability resembles context switching in modern operating systems.
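As a rough illustration of what task elasticity would require, the sketch below checkpoints a task's progress at record boundaries (a plausible "safe point") so the task can be suspended and later resumed without repeating committed work. All names, the checkpoint format, and the suspension signal are hypothetical.

```python
import json, os

def run_task(records, ckpt_path, should_suspend):
    """Process records in order, resuming from the last checkpointed offset."""
    start = 0
    if os.path.exists(ckpt_path):                # resume after a suspension
        with open(ckpt_path) as f:
            start = json.load(f)["offset"]

    for offset in range(start, len(records)):
        process(records[offset])                 # committed work for this record
        if should_suspend():                     # safe point: a record boundary
            with open(ckpt_path, "w") as f:
                json.dump({"offset": offset + 1}, f)
            return "suspended"

    if os.path.exists(ckpt_path):                # task done: clear the checkpoint
        os.remove(ckpt_path)
    return "finished"

def process(record):
    print("processed", record)

# Example: the scheduler suspends the task after two records (say, to admit a
# job with a tight SLO), then resumes it; "c" is processed exactly once.
signals = iter([False, True])
print(run_task(["a", "b", "c"], "task.ckpt", lambda: next(signals, False)))
print(run_task(["a", "b", "c"], "task.ckpt", lambda: False))
```

The analogy to context switching is visible here: the checkpoint plays the role of the saved process state, and the safe point determines when suspending is cheap and correct.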


References

  1. J. Hamilton, "The Cost of Latency," 2009.

Check your knowledge

1. What are the considerations for an ideal scheduler for distributed programs?

2. Why is meeting SLOs a challenge in cloud computing?