Introduction to Azure HDInsight, the Hadoop technology stack, and Hadoop clusters

This article provides an introduction to Azure HDInsight, a cloud distribution of the Hadoop technology stack. It also covers what a Hadoop cluster is and when you would use it.

What is HDInsight and the Hadoop technology stack?

Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Apache Hadoop was the original open-source framework for distributed processing and analysis of big data sets on clusters of computers.

The Hadoop technology stack includes related software and utilities, such as Apache Hive, HBase, Spark, Kafka, and many others. To see which Hadoop technology stack components are available on HDInsight, see Components and versions available with HDInsight. To read more about Hadoop in HDInsight, see the Azure features page for HDInsight.

What is a Hadoop cluster, and when do you use it?

Hadoop is also a cluster type that has:

  • YARN for job scheduling and resource management
  • MapReduce for parallel processing
  • The Hadoop distributed file system (HDFS)

Hadoop clusters are most often used for batch processing of stored data. Other kinds of clusters in HDInsight have additional capabilities. For example, Spark has grown in popularity because of its faster, in-memory processing. See Cluster types on HDInsight for details.

What is big data?

Big data describes any large body of digital information, such as:

  • Sensor data from industrial equipment
  • Customer activity collected from a website
  • A Twitter newsfeed

Big data is being collected in escalating volumes, at higher velocities, and in a greater variety of formats. It can be historical (meaning stored) or real time (meaning streamed from the source).

Cluster types in HDInsight

HDInsight includes specific cluster types and cluster customization capabilities, such as adding components, utilities, and languages.

Spark, Kafka, Interactive Query, HBase, customized, and other cluster types

HDInsight offers the following cluster types:

You can also configure clusters using the following methods:

Example cluster customization scripts

Script actions are Bash scripts on Linux that run during cluster provisioning, and that can be used to install additional components on the cluster.

The following example scripts are provided by the HDInsight team:

  • Hue: A set of web applications used to interact with a cluster. Linux clusters only.
  • Giraph: Graph processing to model relationships between things or people.
  • Solr: An enterprise-scale search platform that allows full-text search on data.

For information on developing your own Script Actions, see Script Action development with HDInsight.

Components and utilities on HDInsight clusters

The following components and utilities are included on HDInsight clusters:

  • Ambari: Cluster provisioning, management, monitoring, and utilities.
  • Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment.
  • Hive & HCatalog: SQL-like querying, and a table and storage management layer.
  • Mahout: For scalable machine learning applications.
  • MapReduce: Legacy framework for Hadoop distributed processing and resource management. See YARN.
  • Oozie: Workflow management.
  • Phoenix: Relational database layer over HBase.
  • Pig: Simpler scripting for MapReduce transformations.
  • Sqoop: Data import and export.
  • Tez: Allows data-intensive processes to run efficiently at scale.
  • YARN: Resource management that is part of the Hadoop core library.
  • ZooKeeper: Coordination of processes in distributed systems.

Note

For information about specific components and their versions, see Hadoop components and versions in HDInsight.

Ambari

Apache Ambari is for provisioning, managing, and monitoring Apache Hadoop clusters. It includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters. HDInsight clusters on Linux provide both the Ambari web UI and the Ambari REST API. Ambari Views on HDInsight clusters allow plug-in UI capabilities. See Manage HDInsight clusters using Ambari and Apache Ambari API reference.
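
As a rough illustration of calling the Ambari REST API from a client, the following Python sketch builds an authenticated request without sending it. The URL pattern (`https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters/CLUSTERNAME`), the `/services` resource, the use of HTTP basic authentication, and the helper name `build_ambari_request` are assumptions made for this example and are not taken from this article. Check the Ambari documentation for your cluster before relying on them.

```python
import base64
import urllib.request


def build_ambari_request(cluster_name, user, password, resource=""):
    """Build (but do not send) a GET request for an Ambari REST resource.

    Assumption: the cluster exposes Ambari at
    https://<cluster>.azurehdinsight.net/api/v1/clusters/<cluster>
    and accepts HTTP basic authentication with the cluster login.
    """
    url = (
        f"https://{cluster_name}.azurehdinsight.net"
        f"/api/v1/clusters/{cluster_name}{resource}"
    )
    # Basic authentication: base64-encode "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


if __name__ == "__main__":
    # Hypothetical cluster name and credentials, for illustration only.
    request = build_ambari_request("mycluster", "admin", "secret", "/services")
    print(request.full_url)
    # To send it: urllib.request.urlopen(request) and parse the JSON response.
```

The request is only constructed here, so the sketch can be inspected without a live cluster.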

Avro (Microsoft .NET Library for Avro)

The Microsoft .NET Library for Avro implements the Apache Avro compact binary data interchange format for serialization for the Microsoft .NET environment. It defines a language-agnostic schema so that data serialized in one language can be read in another. Detailed information on the format can be found in the Apache Avro Specification (http://avro.apache.org/docs/current/spec.html). The format of Avro files supports the distributed MapReduce programming model: files are “splittable”, meaning you can seek to any point in a file and start reading from a particular block. To find out how, see Serialize data with the Microsoft .NET Library for Avro. Linux-based cluster support to come.

HDFS

Hadoop Distributed File System (HDFS) is a file system that, with YARN and MapReduce, is the core of Hadoop technology. It's the standard file system for Hadoop clusters on HDInsight. See Query data from HDFS-compatible storage.

Hive & HCatalog

Apache Hive is data warehouse software built on Hadoop that allows you to query and manage large datasets in distributed storage by using a SQL-like language called HiveQL. Hive, like Pig, is an abstraction on top of MapReduce, and it translates queries into a series of MapReduce jobs. Hive is closer to a relational database management system than Pig, and is used with more structured data. For unstructured data, Pig is the better choice. See Use Hive with Hadoop in HDInsight.

Apache HCatalog is a table and storage management layer for Hadoop that presents you with a relational view of data. In HCatalog, you can read and write files in any format that works for a Hive SerDe (serializer-deserializer).

Mahout

Apache Mahout is a library of machine learning algorithms that run on Hadoop. Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior. See Generate movie recommendations using Mahout on Hadoop.

MapReduce

MapReduce is the legacy software framework for Hadoop for writing applications to batch process big data sets in parallel. A MapReduce job splits large datasets and organizes the data into key-value pairs for processing. MapReduce jobs run on YARN. See MapReduce in the Hadoop Wiki.
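
To make the key-value model concrete, here is a minimal single-process Python sketch of the map, shuffle, and reduce steps for a word count. It only imitates the programming model. It is not Hadoop's API, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on Hadoop", "Hadoop runs on YARN"]))
```

On a real cluster, the map and reduce functions run in parallel on different nodes over separate splits of the input, and the framework performs the shuffle across the network.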

Oozie

Apache Oozie is a workflow coordination system that manages Hadoop jobs. It is integrated with the Hadoop stack and supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop. It can also be used to schedule jobs specific to a system, like Java programs or shell scripts. See Use Oozie with Hadoop.

Phoenix

Apache Phoenix is a relational database layer over HBase. Phoenix includes a JDBC driver that allows you to query and manage SQL tables directly. Instead of using MapReduce, Phoenix translates queries and other statements into native NoSQL API calls, which enables faster applications on top of NoSQL stores. See Use Apache Phoenix and SQuirreL with HBase clusters.

Pig

Apache Pig is a high-level platform that allows you to perform complex MapReduce transformations on large datasets by using a simple scripting language called Pig Latin. Pig translates the Pig Latin scripts so they’ll run within Hadoop. You can create User-Defined Functions (UDFs) to extend Pig Latin. See Use Pig with Hadoop.

Sqoop

Apache Sqoop is a tool that transfers bulk data between Hadoop and relational databases, such as SQL Server, or other structured data stores, as efficiently as possible. See Use Sqoop with Hadoop.

Tez

Apache Tez is an application framework built on Hadoop YARN that executes complex, acyclic graphs of general data processing. It's a more flexible and powerful successor to the MapReduce framework that allows data-intensive processes, such as Hive, to run more efficiently at scale. See "Use Apache Tez for improved performance" in Use Hive and HiveQL.
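
The idea of executing a directed acyclic graph of processing steps can be sketched in a few lines of Python. This is not the Tez API (Tez is a Java framework that runs on YARN). The task names and the `run_dag` helper are hypothetical, and the example assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run every task after all of its upstream tasks have finished.

    tasks: maps a task name to a function that takes a list of upstream results.
    dependencies: maps a task name to the list of task names it depends on.
    """
    graph = {name: dependencies.get(name, []) for name in tasks}
    results, order = {}, []
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        inputs = [results[dep] for dep in graph[name]]
        results[name] = tasks[name](inputs)
        order.append(name)
    return results, order


def scan_orders(_inputs):
    return [("c1", 10), ("c2", 5), ("c1", 7)]


def scan_customers(_inputs):
    return {"c1": "Alice", "c2": "Bob"}


def join(inputs):
    orders, customers = inputs
    return [(customers[cid], amount) for cid, amount in orders]


def aggregate(inputs):
    totals = {}
    for name, amount in inputs[0]:
        totals[name] = totals.get(name, 0) + amount
    return totals


TASKS = {
    "scan_orders": scan_orders,
    "scan_customers": scan_customers,
    "join": join,
    "aggregate": aggregate,
}
DEPENDENCIES = {"join": ["scan_orders", "scan_customers"], "aggregate": ["join"]}

if __name__ == "__main__":
    results, order = run_dag(TASKS, DEPENDENCIES)
    print(order)
    print(results["aggregate"])
```

Expressing the whole query as one graph lets intermediate results pass directly between steps, rather than being written out between a chain of separate MapReduce jobs.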

YARN

Apache YARN is the next generation of MapReduce (MapReduce 2.0, or MRv2). It supports data processing scenarios beyond MapReduce batch processing, offering greater scalability and support for real-time processing. YARN provides resource management and a distributed application framework. MapReduce jobs run on YARN. See Apache Hadoop NextGen MapReduce (YARN).

ZooKeeper

Apache ZooKeeper coordinates processes in large distributed systems using a shared hierarchical namespace of data registers (znodes). Znodes contain small amounts of meta information needed to coordinate processes: status, location, configuration, and so on. See an example of ZooKeeper with an HBase cluster and Apache Phoenix.
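
The hierarchical namespace can be pictured with a toy in-memory model. The following Python sketch is not a ZooKeeper client and does not talk to a server. The `ZNodeTree` class and the paths and values used are illustrative assumptions.

```python
class ZNodeTree:
    """Toy in-memory model of a znode namespace (not a ZooKeeper client)."""

    def __init__(self):
        # The root znode always exists.
        self._data = {"/": b""}

    def create(self, path, data=b""):
        """Create a znode; its parent must already exist."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError(f"parent znode {parent} does not exist")
        if path in self._data:
            raise KeyError(f"znode {path} already exists")
        self._data[path] = data

    def get(self, path):
        """Return the small piece of data stored at a znode."""
        return self._data[path]

    def children(self, path):
        """List the names of the direct children of a znode."""
        prefix = path.rstrip("/") + "/"
        names = []
        for candidate in self._data:
            if candidate.startswith(prefix):
                rest = candidate[len(prefix):]
                if rest and "/" not in rest:
                    names.append(rest)
        return sorted(names)


if __name__ == "__main__":
    tree = ZNodeTree()
    tree.create("/hbase")
    tree.create("/hbase/master", b"host1:16000")  # hypothetical location data
    tree.create("/hbase/rs")
    print(tree.children("/hbase"))
    print(tree.get("/hbase/master"))
```

In a real deployment, processes read, write, and watch znodes like these through a ZooKeeper client library to agree on status, location, and configuration.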

Programming languages on HDInsight

HDInsight clusters (Spark, HBase, Kafka, Hadoop, and other clusters) support many programming languages, but some aren't installed by default. For libraries, modules, or packages not installed by default, use a script action to install the component.

Default programming language support

By default, HDInsight clusters support:

  • Java
  • Python

Additional languages can be installed using script actions.

Java virtual machine (JVM) languages

Many languages other than Java can run on a Java virtual machine (JVM); however, running some of these languages may require additional components installed on the cluster.

These JVM-based languages are supported on HDInsight clusters:

  • Clojure
  • Jython (Python for Java)
  • Scala

Hadoop-specific languages

HDInsight clusters support the following languages that are specific to the Hadoop technology stack:

  • Pig Latin for Pig jobs
  • HiveQL for Hive jobs and SparkSQL

HDInsight Standard and HDInsight Premium

HDInsight provides big data cloud offerings in two categories: Standard and Premium. HDInsight Standard provides an enterprise-scale cluster that organizations can use to run their big data workloads. HDInsight Premium builds on Standard capabilities and provides advanced analytical and security capabilities for an HDInsight cluster. For more information, see Azure HDInsight Premium.

Microsoft business intelligence and HDInsight

Familiar business intelligence (BI) tools retrieve, analyze, and report data integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver:

Next steps