Introduction


MapReduce is a data-parallel framework, originally created by Google,¹ for processing big data on large clusters of computers. The system attempts to reduce interprocess communication by moving computation toward data. It also transparently tolerates data and computation faults, both of which are highly probable in large-scale cloud settings. Developers consequently recognize MapReduce for its scalability, fault tolerance, and elasticity. Today, many implementations of the MapReduce programming model are available to cloud users, and many of them rely on Apache Hadoop.

Since its debut, MapReduce has been frequently associated with Hadoop,³, ⁴ an open-source implementation of MapReduce. Academic, government, and industrial use of Hadoop is growing rapidly.² For instance, Yahoo! uses it for around 80% to 90% of its jobs.⁵ Others, such as Facebook and Microsoft, have also advocated for the framework.⁶ Academia currently uses Hadoop for seismic simulation, natural-language processing, and web data mining, among other applications.⁷, ⁸

Hadoop MapReduce performs most of the labor involved in implementing cloud programs through three main strategies:

It automatically breaks down jobs into distributed tasks to effectively exploit task parallelism.
It considers data locality and variations in overall system workload to schedule jobs and tasks efficiently on participating cluster nodes.
It transparently tolerates both data and task failures.
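
To make the first and third strategies concrete, the following is a minimal single-process sketch of the MapReduce model applied to a word count, written in plain Python. It is not the Hadoop API: the names `map_task`, `reduce_task`, `run_with_retry`, and `run_job`, along with the `split_size` and `max_attempts` parameters, are hypothetical and chosen only for illustration. Data locality and cluster scheduling are not modeled here, since everything runs in one process.

```python
from collections import defaultdict


def map_task(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def reduce_task(key, values):
    """Sum all counts emitted for one key."""
    return key, sum(values)


def run_with_retry(task, args, max_attempts=3):
    """Re-run a task that raises RuntimeError, up to max_attempts times.

    This loosely imitates transparent task-failure handling; a real
    framework would reschedule the task on another node instead.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(*args)
        except RuntimeError:
            if attempt == max_attempts:
                raise


def run_job(lines, split_size=2):
    """Run a word-count job over a list of text lines."""
    # 1. Break the job into independent map tasks, one per input split.
    splits = [lines[i:i + split_size] for i in range(0, len(lines), split_size)]
    map_outputs = [run_with_retry(map_task, (split,)) for split in splits]

    # 2. Shuffle: group the intermediate values by key.
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)

    # 3. Run one reduce call per key and collect the final counts.
    return dict(
        run_with_retry(reduce_task, (key, values))
        for key, values in grouped.items()
    )


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_job(sample))
```

Because each map task reads only its own split and each reduce call sees only one key, the tasks are independent of one another. That independence is what would let a real framework run them in parallel on different nodes and re-execute any one of them after a failure.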

Learning objectives

In this module, you will:

Identify the underlying distributed programming model of MapReduce
Explain how MapReduce can exploit data parallelism
Identify the input and output of map and reduce tasks
Define task elasticity, and indicate its importance for effective job scheduling
Explain the map and reduce task-scheduling strategies in Hadoop MapReduce
List the elements of the YARN architecture, and identify the role of each element
Summarize the lifecycle of a MapReduce job in YARN
Compare and contrast the architectures and the resource allocators of YARN and the previous Hadoop MapReduce
Indicate how job and task scheduling differ in YARN as opposed to the previous Hadoop MapReduce

Prerequisites

Understand what cloud computing is, including cloud service models and common cloud providers
Know the technologies that enable cloud computing
Understand how cloud service providers pay for and bill for the cloud
Know what datacenters are and why they exist
Know how datacenters are set up, powered, and provisioned
Understand how cloud resources are provisioned and metered
Be familiar with the concept of virtualization
Know the different types of virtualization
Understand CPU virtualization
Understand memory virtualization
Understand I/O virtualization
Know about the different types of data and how they're stored
Be familiar with distributed file systems and how they work
Be familiar with NoSQL databases and object storage, and how they work
Know what distributed programming is and why it's useful for the cloud

References

1. J. Dean and S. Ghemawat (Dec. 2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI.
2. H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu (2011). Starfish: A Self-Tuning System for Big Data Analytics. CIDR.
3. Z. Fadika and M. Govindaraju (Dec. 2010). LEMO-MR: Low Overhead and Elastic MapReduce Implementation Optimized for Memory and CPU-Intensive Applications. IEEE 2nd Int. Conf. on Cloud Computing Technology and Science (CloudCom).
4. Hadoop.
5. N. B. Rizvandi, A. Y. Zomaya, A. J. Boloori, and J. Taheri (2011). Preliminary Results: Modeling Relation between Total Execution Time of MapReduce Applications and Number of Mappers/Reducers. Tech. Rep. 679, The University of Sydney.
6. S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, and L. Qi (Dec. 2010). LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud. CloudCom.
7. M. Hammoud and M. F. Sakr (2011). Locality-Aware Reduce Task Scheduling for MapReduce. CloudCom.
8. M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica (2008). Improving MapReduce Performance in Heterogeneous Environments. OSDI.
