Introduction to Azure HDInsight, the Hadoop technology stack, and Hadoop clusters
This article provides an introduction to Azure HDInsight, a cloud distribution of the Hadoop technology stack. It also covers what a Hadoop cluster is and when you would use it.
What is HDInsight and the Hadoop technology stack?
Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Apache Hadoop was the original open-source framework for distributed processing and analysis of big data sets on clusters of computers.
The Hadoop technology stack includes related software and utilities, including Apache Hive, HBase, Spark, Kafka, and many others. To see available Hadoop technology stack components on HDInsight, see Components and versions available with HDInsight. To read more about Hadoop in HDInsight, see the Azure features page for HDInsight.
What is a Hadoop cluster, and when do you use it?
In HDInsight, Hadoop is also a cluster type, which provides:
- YARN for job scheduling and resource management
- MapReduce for parallel processing
- The Hadoop distributed file system (HDFS)
Hadoop clusters are most often used for batch processing of stored data. Other kinds of clusters in HDInsight have additional capabilities. For example, Spark has grown in popularity because of its faster, in-memory processing. See Cluster types on HDInsight for details.
What is big data?
Big data describes any large body of digital information, such as:
- Sensor data from industrial equipment
- Customer activity collected from a website
- A Twitter newsfeed
Big data is being collected in escalating volumes, at higher velocities, and in a greater variety of formats. It can be historical (meaning stored) or real time (meaning streamed from the source).
Cluster types in HDInsight
HDInsight includes specific cluster types and cluster customization capabilities, such as adding components, utilities, and languages.
Spark, Kafka, Interactive Query, HBase, customized, and other cluster types
HDInsight offers the following cluster types:
- Apache Hadoop: Uses HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel.
- Apache Spark: A parallel processing framework that supports in-memory processing to boost the performance of big-data analysis applications. Spark works for SQL, streaming data, and machine learning. See What is Apache Spark in HDInsight?
- Apache HBase: A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns. See What is HBase on HDInsight?
- Microsoft R Server: A server for hosting and managing parallel, distributed R processes. It provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. See Overview of R Server on HDInsight.
- Apache Storm: A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
- Apache Interactive Query preview (AKA: Live Long and Process): In-memory caching for interactive and faster Hive queries. See Use Interactive Query in HDInsight.
- Apache Kafka: An open-source platform used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. See Introduction to Apache Kafka on HDInsight.
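Kafka's publish/subscribe model can be illustrated with a minimal in-process sketch. This is plain Python rather than the Kafka client API, and the topic and subscriber names are made up; it shows only the core idea that each topic is an append-only log and each subscriber tracks its own read position, as Kafka consumers do:

```python
from collections import defaultdict

class MiniBroker:
    """Toy publish/subscribe broker: an append-only log per topic, with each
    subscriber keeping its own read offset (the pattern Kafka consumers use)."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> ordered message log
        self.offsets = defaultdict(int)   # (topic, subscriber) -> next offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, subscriber):
        """Return every message this subscriber has not yet seen."""
        start = self.offsets[(topic, subscriber)]
        messages = self.topics[topic][start:]
        self.offsets[(topic, subscriber)] = len(self.topics[topic])
        return messages

broker = MiniBroker()
broker.publish("sensor-readings", {"device": "a1", "temp": 21.5})
broker.publish("sensor-readings", {"device": "a2", "temp": 19.8})

first = broker.poll("sensor-readings", "dashboard")   # both messages
second = broker.poll("sensor-readings", "dashboard")  # nothing new
```

Because offsets are tracked per subscriber, a second subscriber polling the same topic would independently receive the full stream.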
You can also configure clusters using the following methods:
- Domain-joined clusters preview: A cluster joined to an Active Directory domain so that you can control access and provide governance for data.
- Custom clusters with script actions: Clusters with scripts that run during provisioning and install additional components.
Example cluster customization scripts
Script actions are Bash scripts on Linux that run during cluster provisioning, and that can be used to install additional components on the cluster.
The following example scripts are provided by the HDInsight team:
- Hue: A set of web applications used to interact with a cluster. Linux clusters only.
- Giraph: Graph processing to model relationships between things or people.
- Solr: An enterprise-scale search platform that allows full-text search on data.
For information on developing your own Script Actions, see Script Action development with HDInsight.
Components and utilities on HDInsight clusters
The following components and utilities are included on HDInsight clusters:
- Ambari: Cluster provisioning, management, monitoring, and utilities.
- Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment.
- Hive & HCatalog: SQL-like querying, and a table and storage management layer.
- Mahout: For scalable machine learning applications.
- MapReduce: Legacy framework for Hadoop distributed processing and resource management. See YARN.
- Oozie: Workflow management.
- Phoenix: Relational database layer over HBase.
- Pig: Simpler scripting for MapReduce transformations.
- Sqoop: Data import and export.
- Tez: Allows data-intensive processes to run efficiently at scale.
- YARN: Resource management that is part of the Hadoop core library.
- ZooKeeper: Coordination of processes in distributed systems.
For specific component and version information, see Hadoop components and versions in HDInsight.
Ambari
Apache Ambari is for provisioning, managing, and monitoring Apache Hadoop clusters. It includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters. HDInsight clusters on Linux provide both the Ambari web UI and the Ambari REST API. Ambari Views on HDInsight clusters allow plug-in UI capabilities. See Manage HDInsight clusters using Ambari and Apache Ambari API reference.
Avro (Microsoft .NET Library for Avro)
The Microsoft .NET Library for Avro implements the Apache Avro compact binary data interchange format for serialization in the Microsoft .NET environment. Avro defines a language-agnostic schema, so data serialized in one language can be read in another. Detailed information on the format can be found in the Apache Avro Specification (http://avro.apache.org/docs/current/spec.html). The Avro file format supports the distributed MapReduce programming model: files are "splittable," meaning you can seek to any point in a file and start reading from a particular block. To find out how, see Serialize data with the Microsoft .NET Library for Avro. Support for Linux-based clusters is forthcoming.
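As a small taste of the binary format itself, the Apache Avro specification encodes int and long values using zig-zag mapping followed by variable-length base-128 bytes, which keeps small magnitudes (positive or negative) to one byte. A minimal Python sketch of that encoding, written directly from the specification rather than taken from any Avro library:

```python
def encode_long(n):
    """Encode a 64-bit integer the way Avro does: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)      # zig-zag: 0->0, -1->1, 1->2, -2->3, 2->4, ...
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Values through +/-63 fit in a single byte; 64 spills into two.
encodings = [encode_long(v) for v in (0, -1, 1, 64)]
```

These byte values match the worked examples in the Avro specification's section on primitive-type binary encoding.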
HDFS
Hadoop Distributed File System (HDFS) is a file system that, with YARN and MapReduce, is the core of Hadoop technology. It's the standard file system for Hadoop clusters on HDInsight. See Query data from HDFS-compatible storage.
Hive & HCatalog
Apache Hive is data warehouse software built on Hadoop that allows you to query and manage large datasets in distributed storage by using a SQL-like language called HiveQL. Hive, like Pig, is an abstraction on top of MapReduce, and it translates queries into a series of MapReduce jobs. Hive is closer to a relational database management system than Pig, and is used with more structured data. For unstructured data, Pig is the better choice. See Use Hive with Hadoop in HDInsight.
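To make the "HiveQL queries become MapReduce jobs" idea concrete, here is a sketch of what a simple aggregate query conceptually compiles down to. The table and column names are hypothetical, and the Python is an in-process illustration, not anything Hive actually emits:

```python
from collections import Counter

# Hypothetical HiveQL:
#   SELECT country, COUNT(*) FROM visits GROUP BY country;
visits = [{"country": "US"}, {"country": "DE"}, {"country": "US"}]

# Map phase: emit a (country, 1) key-value pair per row.
pairs = [(row["country"], 1) for row in visits]

# Shuffle + reduce phase: group by key and sum the values.
counts = Counter()
for key, value in pairs:
    counts[key] += value
```

Hive's value is that you write only the one-line query; the framework plans and runs the distributed map, shuffle, and reduce stages for you.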
Apache HCatalog is a table and storage management layer for Hadoop that presents you with a relational view of data. In HCatalog, you can read and write files in any format that works for a Hive SerDe (serializer-deserializer).
Mahout
Apache Mahout is a library of machine learning algorithms that run on Hadoop. Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior. See Generate movie recommendations using Mahout on Hadoop.
MapReduce
MapReduce is the legacy software framework for Hadoop for writing applications to batch process big data sets in parallel. A MapReduce job splits large datasets and organizes the data into key-value pairs for processing. MapReduce jobs run on YARN. See MapReduce in the Hadoop Wiki.
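The split/map/shuffle/reduce flow can be sketched in a few lines of plain Python. This is an in-process illustration of the programming model with made-up input, not the Hadoop API, where each phase would run distributed across the cluster:

```python
from itertools import groupby

documents = ["big data on hadoop", "hadoop processes big data"]

# Map: turn each input record into (word, 1) key-value pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: bring pairs with the same key together (Hadoop sorts and
# routes these between worker nodes).
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the values for each key.
word_counts = {key: sum(v for _, v in group)
               for key, group in groupby(mapped, key=lambda kv: kv[0])}
```

Because each map call sees one record and each reduce call sees one key's values, both phases parallelize naturally across machines.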
Oozie
Apache Oozie is a workflow coordination system that manages Hadoop jobs. It is integrated with the Hadoop stack and supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop. It can also be used to schedule jobs specific to a system, like Java programs or shell scripts. See Use Oozie with Hadoop.
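A workflow coordinator's core job, running actions only after their dependencies finish, can be sketched with Python's standard-library topological sorter. The action names are hypothetical, and this is not Oozie's XML workflow format, just the dependency-ordering idea behind it:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each action lists the actions it depends on, like nodes in a workflow DAG:
# import data, then clean it with Pig, then build a Hive report.
workflow = {
    "sqoop-import": [],
    "pig-cleanup": ["sqoop-import"],
    "hive-report": ["pig-cleanup"],
}

run_order = list(TopologicalSorter(workflow).static_order())
```

A real coordinator adds what a sorter cannot: launching each action on the cluster, waiting on its completion status, and handling retries and failure paths.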
Phoenix
Apache Phoenix is a relational database layer over HBase. Phoenix includes a JDBC driver that allows you to query and manage SQL tables directly. Phoenix translates queries and other statements into native NoSQL API calls - instead of using MapReduce - thus enabling faster applications on top of NoSQL stores. See Use Apache Phoenix and SQuirreL with HBase clusters.
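From the client's perspective, Phoenix looks like any SQL driver. The sketch below uses Python's built-in sqlite3 purely as a stand-in to show the shape of that interaction (with Phoenix you would connect through its JDBC driver instead, and the table and data here are made up):

```python
import sqlite3

# Stand-in connection; the point is the familiar SQL pattern, which Phoenix
# preserves while translating statements into native HBase calls underneath.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, temp REAL)")
conn.execute("INSERT INTO readings VALUES ('a1', 21.5), ('a2', 19.8)")
rows = conn.execute("SELECT device FROM readings WHERE temp > 20").fetchall()
```

The appeal is that existing SQL tooling and skills carry over, even though the storage engine underneath is a NoSQL store.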
Pig
Apache Pig is a high-level platform that allows you to perform complex MapReduce transformations on large datasets by using a simple scripting language called Pig Latin. Pig translates the Pig Latin scripts so they’ll run within Hadoop. You can create User-Defined Functions (UDFs) to extend Pig Latin. See Use Pig with Hadoop.
Tez
Apache Tez is an application framework built on Hadoop YARN that executes complex directed acyclic graphs (DAGs) of general data-processing activities. It's a more flexible and powerful successor to the MapReduce framework that allows data-intensive processes, such as Hive, to run more efficiently at scale. See "Use Apache Tez for improved performance" in Use Hive and HiveQL.
YARN
Apache YARN is the next generation of MapReduce (MapReduce 2.0, or MRv2) and supports data processing scenarios beyond MapReduce batch processing with greater scalability and real-time processing. YARN provides resource management and a distributed application framework. MapReduce jobs run on YARN. See Apache Hadoop NextGen MapReduce (YARN).
ZooKeeper
Apache ZooKeeper coordinates processes in large distributed systems using a shared hierarchical namespace of data registers (znodes). Znodes contain small amounts of meta information needed to coordinate processes: status, location, configuration, and so on. See an example of ZooKeeper with an HBase cluster and Apache Phoenix.
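The shared hierarchical namespace can be pictured as a tree of small data registers addressed by slash-separated paths. This is a toy model of the data model only, not the ZooKeeper client API, and the paths and data shown are made up:

```python
class ZNodeTree:
    """Toy model of ZooKeeper's namespace: slash-separated paths map to
    small pieces of coordination data (status, location, configuration)."""
    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:   # like ZooKeeper, parents must exist first
            raise KeyError("parent znode does not exist: " + parent)
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

tree = ZNodeTree()
tree.create("/hbase", b"")
tree.create("/hbase/master", b"headnode0:16000")
location = tree.get("/hbase/master")
```

The real service adds what makes coordination work: replication across an ensemble, ordered atomic updates, ephemeral nodes, and watches that notify clients when a znode changes.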
Programming languages on HDInsight
HDInsight clusters - Spark, HBase, Kafka, Hadoop, and other clusters - support many programming languages, but some aren't installed by default. For libraries, modules, or packages not installed by default, use a script action to install the component.
Default programming language support
By default, HDInsight clusters support the following languages:
- Java
- Python
Additional languages can be installed using script actions.
Java virtual machine (JVM) languages
Many languages other than Java can run on a Java virtual machine (JVM); however, running some of these languages may require additional components installed on the cluster.
These JVM-based languages are supported on HDInsight clusters:
- Clojure
- Jython (Python for Java)
- Scala
HDInsight clusters support the following languages that are specific to the Hadoop technology stack:
- Pig Latin for Pig jobs
- HiveQL for Hive jobs and SparkSQL
HDInsight Standard and HDInsight Premium
HDInsight provides big data cloud offerings in two categories, Standard and Premium. HDInsight Standard provides an enterprise-scale cluster that organizations can use to run their big data workloads. HDInsight Premium builds on Standard capabilities and provides advanced analytical and security capabilities for an HDInsight cluster. For more information, see Azure HDInsight Premium.
Microsoft business intelligence and HDInsight
Familiar business intelligence (BI) tools retrieve, analyze, and report data integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver:
- Connect Excel to Hadoop with Power Query: Learn how to connect Excel to the Azure Storage account that stores the data from your HDInsight cluster by using Microsoft Power Query for Excel. Windows workstation required.
- Connect Excel to Hadoop with the Microsoft Hive ODBC Driver: Learn how to import data from HDInsight with the Microsoft Hive ODBC Driver. Windows workstation required.
- Microsoft Cloud Platform: Learn about Power BI for Office 365, download the SQL Server trial, and set up SharePoint Server 2013 and SQL Server BI.
- SQL Server Analysis Services
- SQL Server Reporting Services
Get started with HDInsight
- Get started with Hadoop in HDInsight: A quick-start tutorial for provisioning HDInsight Hadoop clusters and running sample Hive queries.
- Get started with Spark in HDInsight: A quick-start tutorial for creating a Spark cluster and running interactive Spark SQL queries.
- Use R Server on HDInsight: Start using R Server in HDInsight Premium.
- Provision HDInsight clusters: Learn how to provision an HDInsight Hadoop cluster through the Azure portal, Azure CLI, or Azure PowerShell.