Cloud challenges: Heterogeneity

Cloud datacenters comprise a variety of components, including computers, networks, operating systems (OSs), code libraries, and programming languages. In principle, if datacenter components vary, the cloud is referred to as a heterogeneous cloud; otherwise, it is denoted as a homogeneous cloud. In practice, homogeneity does not always hold, mainly for two reasons:

  • Cloud providers typically maintain multiple generations of IT resources, purchased over different time frames.
  • Cloud providers are increasingly applying virtualization technology on their clouds to consolidate servers, enhance system utilization, and simplify management. Public clouds are primarily virtualized datacenters. Even on private clouds, virtualized environments are expected to become the norm.1

Heterogeneity is a direct result of virtualized environments: even colocating virtual machines (VMs) on similar physical machines can cause it. Consider, for example, two identical physical machines, A and B. Even assuming identical VMs running the same programs, placing one VM on machine A and 10 VMs on machine B will stress the second machine more than the first. On the cloud, dissimilar VMs running diverse, demanding programs are even more likely, which makes the situation worse. A compelling example is Azure, which offers more than 30 VM types5 to millions of users running different programs. Clearly, this situation creates even more heterogeneity. In short, heterogeneity already is, and will continue to be, the norm on the cloud.

Heterogeneity poses multiple challenges for running distributed programs on the cloud. Distributed programs must be designed to mask the heterogeneity of the underlying hardware, networks, OSs, and programming languages. The illusion of homogeneity allows distributed tasks to communicate; without it, message passing and the whole concept of distributed programs fail. Consider the data representation problem: messages exchanged between tasks usually contain primitive data types, such as integers. Unfortunately, not all computers store the bytes of an integer in the same order. Some computers use big-endian order, in which the most significant byte comes first, while others use little-endian order, in which the most significant byte comes last. Floating-point representations can also differ across computer architectures. Another issue is the set of codes used to represent characters: some systems use ASCII characters, while others use the Unicode standard. In short, distributed programs must resolve such heterogeneity to function. The software layer that distributed programs incorporate to handle this heterogeneity is commonly referred to as middleware. Fortunately, most middleware is implemented over the internet protocols, which themselves mask the differences in the underlying networks. Simple Object Access Protocol (SOAP)2 is an example of middleware. SOAP defines a scheme for using Extensible Markup Language (XML), a textual self-describing format, to represent the contents of messages and allow distributed tasks on diverse machines to interact. Another example is Representational State Transfer, or REST.
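The data representation problem described above can be illustrated with a minimal Python sketch. It is not how SOAP or any particular middleware works; it only assumes a made-up wire format (a 32-bit integer, a 32-bit length, then UTF-8 text) in which both sides agree on network (big-endian) byte order. The function names `encode_message` and `decode_message` are hypothetical.

```python
import struct
from typing import Tuple

value = 1025  # 0x00000401

# The same integer laid out in the two byte orders.
big = struct.pack(">I", value)     # most significant byte first
little = struct.pack("<I", value)  # most significant byte last
print(big.hex())     # 00000401
print(little.hex())  # 01040000

# A receiver that assumes the wrong byte order reads a different number.
misread = struct.unpack("<I", big)[0]
print(misread)  # 17039360


def encode_message(n: int, text: str) -> bytes:
    """Hypothetical wire format: integer and length in network byte order, then UTF-8 text."""
    payload = text.encode("utf-8")  # fixed character encoding on the wire
    return struct.pack("!II", n, len(payload)) + payload


def decode_message(data: bytes) -> Tuple[int, str]:
    """Inverse of encode_message; works regardless of the host's native byte order."""
    n, length = struct.unpack("!II", data[:8])
    return n, data[8:8 + length].decode("utf-8")


print(decode_message(encode_message(value, "héllo")))  # (1025, 'héllo')
```

Because the byte order and character encoding are fixed by the format rather than by the host, a sender and receiver with different native conventions still agree on the message contents.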

In general, code suitable for one machine might not be suitable for another machine on the cloud, especially when instruction-set architectures (ISAs) vary across machines. Ironically, virtualization technology, which induces heterogeneity, can also help solve this problem. VMs can be launched for a user cluster and mapped to physical machines with different underlying ISAs. The hypervisor then emulates any differences between the ISAs of the provisioned VMs and those of the underlying physical machines. From a user's perspective, all emulation occurs transparently. Finally, users can always install their own OSs and libraries on system VMs, thus ensuring homogeneity at the OS and library levels.

Another serious heterogeneity problem that requires attention from distributed programmers is performance variation3, 4 on the cloud. Performance variation describes the situation in which running the same distributed program twice on the same cluster can result in different execution times. For example, execution times can vary by a factor of five for the same application on the same private cluster.4 Performance variation is due mostly to cloud heterogeneity, imposed by virtualized environments, and to spikes and lulls in resource demand over time. As a consequence, cloud VMs rarely carry out work at the same speed, which prevents tasks from making progress at (approximately) constant rates. This situation can create a load imbalance and degrade overall performance, because load imbalance makes a program's performance contingent on its slowest task. Distributed programs can attempt to provide relief by detecting slow tasks and scheduling corresponding speculative tasks on fast VMs so that the latter finish earlier. Specifically, two tasks with the same responsibility can compete by running on two different VMs, with the one that finishes earlier getting committed and the other killed. Hadoop MapReduce follows a similar strategy, called speculative execution, to solve the same problem. Unfortunately, distinguishing between slow and fast tasks/VMs is challenging on the cloud. A VM running a task might be temporarily passing through a demand spike, or it might simply be faulty. In theory, not every node that is detected to be slow is faulty, and differentiating between faulty and slow nodes is hard.6 Because of this problem, Hadoop MapReduce does not perform well in heterogeneous environments.5, 7 Reasons for that and details on Hadoop's speculative execution are presented in the Hadoop MapReduce section.
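The idea of two competing attempts can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hadoop's implementation: VMs are simulated with threads, the VM names and delays are invented, and the functions `makespan`, `run_task`, and `run_speculatively` are hypothetical. A real framework would also have to decide when a task counts as slow, which this sketch does not attempt.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def makespan(task_times):
    """A job finishes only when its slowest task finishes."""
    return max(task_times)


def run_task(vm_name, delay_s, x):
    """Simulated task: the delay stands in for how loaded the VM is."""
    time.sleep(delay_s)
    return vm_name, x * x


def run_speculatively(x, attempts):
    """Run the same task on several simulated VMs and commit the first result."""
    pool = ThreadPoolExecutor(max_workers=len(attempts))
    futures = [pool.submit(run_task, name, delay, x) for name, delay in attempts]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for future in pending:
        # Best effort only: a thread that is already running cannot be killed,
        # so its late result is simply ignored. A real framework would kill
        # the losing attempt.
        future.cancel()
    pool.shutdown(wait=False)
    return next(iter(done)).result()


# Three tasks, one of them on a slow VM: the job takes as long as the slowest.
print(makespan([1.0, 1.1, 5.0]))  # 5.0

# Original attempt on a slow VM, speculative copy on a fast one.
winner, result = run_speculatively(7, [("slow-vm", 0.3), ("fast-vm", 0.02)])
print(winner, result)  # fast-vm 49
```

The sketch launches both attempts at once for simplicity; in practice the speculative copy would be started only after the original attempt is judged to be lagging, which is exactly the hard part the paragraph above describes.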



References

  1. M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica (2008). Improving MapReduce Performance in Heterogeneous Environments. OSDI.
  2. G. Coulouris, J. Dollimore, T. Kindberg, and G. Blair (May 2011). Distributed Systems: Concepts and Design. Addison-Wesley.
  3. B. Farley, V. Varadarajan, K. Bowers, A. Juels, T. Ristenpart, and M. Swift (2012). More for Your Money: Exploiting Performance Heterogeneity in Public Clouds. SOCC.
  4. M. S. Rehman and M. F. Sakr (Nov. 2010). Initial Findings for Provisioning Variation in Cloud Computing. CloudCom.
  5. Sizes for Linux virtual machines in Azure.
  6. A. S. Tanenbaum and M. van Steen (October 12, 2006). Distributed Systems: Principles and Paradigms, Second Edition. Prentice Hall.
  7. T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, and R. F. Freund (June 2001). A Comparison of Eleven Static Heuristics for Mapping a Class of Independent Tasks onto Heterogeneous Distributed Computing Systems. JPDC.