Distributed computing on the cloud: Spark
Spark is an open-source cluster-computing framework whose strengths differ from those of MapReduce. Learn how Spark works.
Learning objectives
In this module, you will:
- Recall the features of an iterative programming framework
- Describe the architecture and job flow in Spark
- Recall the role of resilient distributed datasets (RDDs) in Spark
- Describe the properties of RDDs in Spark
- Compare and contrast RDDs with distributed shared-memory systems
- Describe fault-tolerance mechanics in Spark
- Describe the role of lineage in RDDs for fault tolerance and recovery
- Understand the different types of dependencies between RDDs
- Understand the basic operations on Spark RDDs
- Step through a simple iterative Spark program
- Recall the various Spark libraries and their functions
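Several of the objectives above center on RDDs: lazy transformations, actions, and lineage-based recovery. As a preview of those ideas, here is a minimal plain-Python sketch (not real Spark; the `MiniRDD` class is hypothetical and invented for illustration) showing how transformations such as `map` and `filter` can be recorded lazily as a lineage and replayed only when an action such as `collect` or `reduce` is called:

```python
from functools import reduce

# Illustrative sketch only -- not the Spark API. Transformations are
# recorded lazily in a lineage list; an action replays the lineage
# over the base data, which is also how an RDD partition can be
# recomputed after a failure.
class MiniRDD:
    def __init__(self, data, lineage=None):
        self._data = list(data)
        self._lineage = lineage or []  # recorded transformations

    def map(self, f):
        # Lazy: just extend the lineage, compute nothing yet.
        return MiniRDD(self._data, self._lineage + [("map", f)])

    def filter(self, p):
        return MiniRDD(self._data, self._lineage + [("filter", p)])

    def _compute(self):
        out = self._data
        for kind, fn in self._lineage:  # replay lineage over base data
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

    def collect(self):  # action: triggers evaluation
        return self._compute()

    def reduce(self, f):  # action: fold the evaluated results
        return reduce(f, self._compute())


rdd = MiniRDD(range(1, 6))
squares = rdd.map(lambda x: x * x)            # lazy: nothing computed yet
evens = squares.filter(lambda x: x % 2 == 0)  # still lazy
print(evens.collect())                        # [4, 16]
print(squares.reduce(lambda a, b: a + b))     # 55
```

The module covers how real Spark generalizes this pattern: partitioned datasets, narrow and wide dependencies between RDDs, and lineage as the basis for fault tolerance.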
In partnership with Dr. Majd Sakr and Carnegie Mellon University.
Prerequisites
- Understand what cloud computing is, including cloud service models and common cloud providers
- Know the technologies that enable cloud computing
- Understand how cloud service providers charge and bill for cloud services
- Know what datacenters are and why they exist
- Know how datacenters are set up, powered, and provisioned
- Understand how cloud resources are provisioned and metered
- Be familiar with the concept of virtualization
- Know the different types of virtualization
- Understand CPU virtualization
- Understand memory virtualization
- Understand I/O virtualization
- Know about the different types of data and how they're stored
- Be familiar with distributed file systems and how they work
- Be familiar with NoSQL databases and object storage, and how they work
- Know what distributed programming is and why it's useful for the cloud
- Understand MapReduce and how it enables big data computing