Introduction to Azure HDInsight, the Hadoop and Spark technology stack
This article provides an introduction to Azure HDInsight, a fully managed, full-spectrum, open-source analytics service for enterprises. You can use open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more.
Apache Hadoop was the original open-source framework for distributed processing and analysis of big data sets on clusters. The Hadoop technology stack includes related software and utilities, including Apache Hive, HBase, Spark, Kafka, and many others. To see available Hadoop technology stack components on HDInsight, see Components and versions available with HDInsight. To read more about Hadoop in HDInsight, see the Azure features page for HDInsight.
Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. To read more about Spark in HDInsight, see Introduction to Spark on Azure HDInsight.
What is HDInsight and the Hadoop technology stack?
Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. You can use the most popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more to enable a broad range of scenarios such as extract, transform, and load (ETL); data warehousing; machine learning; and IoT.
What is big data?
Big data is being collected in escalating volumes, at higher velocities, and in a greater variety of formats. It can be historical (meaning stored) or real time (meaning streamed from the source). See Scenarios for using HDInsight to learn about the most common use cases for big data.
Why should I use HDInsight?
This section lists the capabilities of Azure HDInsight.
|Capability|Description|
|---|---|
|Cloud native|Azure HDInsight enables you to create optimized clusters for Hadoop, Spark, Interactive Query (LLAP), Kafka, Storm, HBase, and R Server on Azure. HDInsight also provides an end-to-end SLA on all your production workloads.|
|Low-cost and scalable|HDInsight enables you to scale workloads up or down. You can reduce cost by creating clusters on demand and paying only for what you use. You can also build data pipelines to operationalize your jobs. Decoupled compute and storage provide better performance and flexibility.|
|Secure and compliant|HDInsight enables you to protect your enterprise data assets using Azure Virtual Network, encryption, and integration with Azure Active Directory. HDInsight also meets the most popular industry and government compliance standards.|
|Monitoring|Azure HDInsight integrates with Azure Log Analytics to provide a single interface to monitor all your clusters.|
|Global availability|HDInsight is available in more regions than any other big data analytics offering. Azure HDInsight is also available in Azure Government, China, and Germany, which allows you to meet your enterprise needs in key sovereign areas.|
|Productivity|Azure HDInsight enables you to use rich productivity tools for Hadoop and Spark with your preferred development environment, such as Visual Studio, Eclipse, and IntelliJ, with support for Scala, Python, R, Java, and .NET. Data scientists can also collaborate using popular notebooks such as Jupyter and Zeppelin.|
|Extensibility|You can extend HDInsight clusters by installing components (Hue, Presto, and so on) using script actions, by adding edge nodes, or by integrating with other big data certified applications. HDInsight enables seamless integration with the most popular big data solutions with a one-click deployment.|
Scenarios for using HDInsight
Azure HDInsight can be used for a variety of big data processing use cases. Big data is being collected in escalating volumes, at higher velocities, and in a greater variety of formats. It can be historical data (data already collected and stored) or real-time data (data streamed directly from the source). The use cases for processing such data can be summarized in the following categories:
Batch processing (ETL)
Extract, transform, and load (ETL) is a process where unstructured or structured data is extracted from heterogeneous data sources, transformed into a structured format, and loaded into a data store. The transformed data can be used for data science or data warehousing.
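The three ETL stages can be illustrated with a minimal, single-machine sketch. This is not HDInsight-specific code: the sample data, the table name `readings`, and the function names are all hypothetical, and an in-memory SQLite database stands in for the data store. On a cluster, the same stages would typically run as Hive, Pig, or Spark jobs over much larger inputs.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: device readings with stray whitespace and a missing value.
RAW_CSV = """device_id,temperature_f,recorded_at
sensor-1, 71.6 ,2024-01-01T00:00:00
sensor-2,,2024-01-01T00:00:00
sensor-1,73.4,2024-01-01T01:00:00
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a reading and convert Fahrenheit to Celsius."""
    cleaned = []
    for row in rows:
        raw_value = row["temperature_f"].strip()
        if not raw_value:
            continue  # skip incomplete records
        celsius = round((float(raw_value) - 32) * 5 / 9, 1)
        cleaned.append((row["device_id"].strip(), celsius, row["recorded_at"]))
    return cleaned


def load(records, connection):
    """Load: write the structured records into a table in the data store."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(device_id TEXT, temperature_c REAL, recorded_at TEXT)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    connection.commit()


def run_etl(text, connection):
    """Run all three stages and return the loaded rows in time order."""
    load(transform(extract(text)), connection)
    query = "SELECT device_id, temperature_c FROM readings ORDER BY recorded_at"
    return connection.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    for device_id, temperature_c in run_etl(RAW_CSV, conn):
        print(device_id, temperature_c)
```

Keeping each stage in its own function mirrors how ETL pipelines are usually organized: the transform step can be tested on its own, and the load target can be swapped without touching the parsing logic.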
Internet of Things (IoT)
You can use HDInsight to process streaming data received in real time from a variety of devices. For more information, read this blog.
Data science
You can use HDInsight to build applications that extract critical insights from data. You can also use Azure Machine Learning on top of that to predict future trends for your business. For more information, read this customer story.
Data warehousing
You can use HDInsight to perform interactive queries at petabyte scales over structured or unstructured data in any format. You can also build models and connect them to BI tools. For more information, read this customer story.
Hybrid
You can use HDInsight to extend your existing on-premises big data infrastructure to Azure to leverage the advanced analytics capabilities of the cloud.
Cluster types in HDInsight
HDInsight includes specific cluster types and cluster customization capabilities, such as adding components, utilities, and languages.
Spark, Kafka, Interactive Query, HBase, customized, and other cluster types
HDInsight offers the following cluster types:
- Apache Hadoop: Uses HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel.
- Apache Spark: A parallel processing framework that supports in-memory processing to boost the performance of big-data analysis applications. Spark works for SQL, streaming data, and machine learning. See What is Apache Spark in HDInsight?
- Apache HBase: A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns. See What is HBase on HDInsight?
- Microsoft R Server: A server for hosting and managing parallel, distributed R processes. It provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. See Overview of R Server on HDInsight.
- Apache Storm: A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
- Apache Interactive Query preview (also known as LLAP, or Live Long and Process): In-memory caching for interactive and faster Hive queries. See Use Interactive Query in HDInsight.
- Apache Kafka: An open-source platform used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. See Introduction to Apache Kafka on HDInsight.
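The MapReduce programming model mentioned for the Apache Hadoop cluster type can be sketched in a few lines. The example below is a single-process illustration of the map, shuffle, and reduce phases using word count. It does not use the Hadoop API, and the function names are hypothetical. On a real cluster, the map and reduce phases run in parallel across nodes, and the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on Hadoop", "Big clusters process data"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on its own key, both phases can be distributed across machines without coordination. That independence is what lets the model process batch data in parallel.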
Open source components in HDInsight
Azure HDInsight enables you to create clusters with open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, HBase, and R. By default, these clusters also include other open-source components such as Ambari, Avro, Hive, HCatalog, Mahout, MapReduce, YARN, Phoenix, Pig, Sqoop, Tez, Oozie, and ZooKeeper.
Programming languages on HDInsight
HDInsight clusters - Spark, HBase, Kafka, Hadoop, and other clusters - support many programming languages, but some aren't installed by default. For libraries, modules, or packages not installed by default, use a script action to install the component.
Default programming language support
By default, HDInsight clusters support:
Additional languages can be installed using script actions.
Java virtual machine (JVM) languages
Many languages other than Java can run on a Java virtual machine (JVM); however, running some of these languages may require additional components installed on the cluster.
These JVM-based languages are supported on HDInsight clusters:
- Jython (Python for Java)
HDInsight clusters support the following languages that are specific to the Hadoop technology stack:
- Pig Latin for Pig jobs
- HiveQL for Hive jobs and SparkSQL
Business intelligence on HDInsight
Familiar business intelligence (BI) tools retrieve, analyze, and report data integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver:
- Apache Spark BI using data visualization tools with Azure HDInsight
- Visualize Hive data with Microsoft Power BI in Azure HDInsight
- Connect Excel to Hadoop with Power Query: Learn how to connect Excel to the Azure Storage account that stores the data from your HDInsight cluster by using Microsoft Power Query for Excel. Windows workstation required.
- Connect Excel to Hadoop with the Microsoft Hive ODBC Driver: Learn how to import data from HDInsight with the Microsoft Hive ODBC Driver. Windows workstation required.
- Microsoft Cloud Platform: Learn about Power BI for Office 365, download the SQL Server trial, and set up SharePoint Server 2013 and SQL Server BI.
- SQL Server Analysis Services
- SQL Server Reporting Services
- Get started with Hadoop in HDInsight
- Get started with Spark in HDInsight
- Get started with Kafka on HDInsight
- Get started with Storm on HDInsight
- Get started with HBase on HDInsight
- Get started with Interactive Query (LLAP) on HDInsight
- Get started with R Server on HDInsight
- Manage HDInsight clusters
- Secure your HDInsight clusters
- Monitor HDInsight clusters