What is Apache Spark in Azure HDInsight

Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. Apache Spark in Azure HDInsight is the Microsoft implementation of Apache Spark in the cloud. HDInsight makes it easier to create and configure a Spark cluster in Azure. Spark clusters in HDInsight are compatible with Azure Storage and Azure Data Lake Storage, so you can use HDInsight Spark clusters to process your data stored in Azure. For component and version information, see Apache Hadoop components and versions in Azure HDInsight.

Spark: a unified framework

What is Apache Spark?

Spark provides primitives for in-memory cluster computing. A Spark job can load and cache data into memory and query it repeatedly. In-memory computing is much faster than disk-based applications such as Hadoop, which shares data through the Hadoop Distributed File System (HDFS). Spark also integrates into the Scala programming language to let you manipulate distributed data sets like local collections. There's no need to structure everything as map and reduce operations.
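The load-once, query-repeatedly pattern that in-memory caching enables can be pictured with a toy pure-Python sketch. This is an illustration of the idea only, not Spark API code; in real Spark you would call `cache()` on an RDD or DataFrame and then run several actions against it.

```python
# Toy illustration of the cache-then-query-repeatedly idea behind Spark's
# in-memory model. NOT Spark API code: real Spark uses rdd.cache()/df.cache().

class CachedDataset:
    def __init__(self, load_fn):
        self._load_fn = load_fn      # expensive source read (e.g., from HDFS)
        self._data = None            # in-memory cache, filled on first use
        self.loads = 0               # how many times the source was actually read

    def _materialize(self):
        if self._data is None:       # read the slow source only once
            self._data = list(self._load_fn())
            self.loads += 1
        return self._data

    def query(self, predicate):
        # Repeated queries hit the cached copy, not the slow source.
        return [row for row in self._materialize() if predicate(row)]

ds = CachedDataset(lambda: range(10))
evens = ds.query(lambda x: x % 2 == 0)   # first query materializes the data
big = ds.query(lambda x: x > 7)          # second query reuses the cache
# Both queries share one in-memory load: ds.loads == 1
```

A disk-based MapReduce-style job would instead re-read the source for each query, which is what makes repeated interactive queries so much slower there.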

Traditional MapReduce vs. Spark

Spark clusters in HDInsight offer a fully managed Spark service. Benefits of creating a Spark cluster in HDInsight include the following.

| Feature | Description |
| --- | --- |
| Ease of creation | You can create a new Spark cluster in HDInsight in minutes using the Azure portal, Azure PowerShell, or the HDInsight .NET SDK. See Get started with Apache Spark cluster in HDInsight. |
| Ease of use | Spark clusters in HDInsight include Jupyter and Apache Zeppelin notebooks. You can use these notebooks for interactive data processing and visualization. |
| REST APIs | Spark clusters in HDInsight include Apache Livy, a REST-API-based Spark job server used to remotely submit and monitor jobs. See Use Apache Spark REST API to submit remote jobs to an HDInsight Spark cluster. |
| Support for Azure Data Lake Storage | Spark clusters in HDInsight can use Azure Data Lake Storage as either the primary storage or additional storage. For more information on Data Lake Storage, see Overview of Azure Data Lake Storage. |
| Integration with Azure services | Spark clusters in HDInsight come with a connector to Azure Event Hubs. You can build streaming applications using Event Hubs, in addition to Apache Kafka, which is already available as part of Spark. |
| Support for ML Server | Support for ML Server in HDInsight is provided as the ML Services cluster type. You can set up an ML Services cluster to run distributed R computations at the speeds promised by a Spark cluster. For more information, see Get started using ML Server in HDInsight. |
| Integration with third-party IDEs | HDInsight provides several IDE plugins for creating applications and submitting them to an HDInsight Spark cluster. For more information, see Use Azure Toolkit for IntelliJ IDEA, Use HDInsight for VSCode, and Use Azure Toolkit for Eclipse. |
| Concurrent queries | Spark clusters in HDInsight support concurrent queries. This capability enables multiple queries from one user, or multiple queries from various users and applications, to share the same cluster resources. |
| Caching on SSDs | You can choose to cache data either in memory or in SSDs attached to the cluster nodes. Caching in memory provides the best query performance but can be expensive. Caching in SSDs improves query performance without requiring a cluster large enough to fit the entire dataset in memory. |
| Integration with BI tools | Spark clusters in HDInsight provide connectors for BI tools such as Power BI for data analytics. |
| Pre-loaded Anaconda libraries | Spark clusters in HDInsight come with Anaconda libraries pre-installed. Anaconda provides close to 200 libraries for machine learning, data analysis, visualization, and more. |
| Scalability | HDInsight allows you to change the number of cluster nodes. Also, Spark clusters can be dropped with no loss of data, because all the data is stored in Azure Storage or Data Lake Storage. |
| SLA | Spark clusters in HDInsight come with 24/7 support and an SLA of 99.9% uptime. |
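As a concrete illustration of the Apache Livy job server described above, the snippet below builds the JSON body for a Livy batch submission (`POST /batches`). The storage path, class name, and cluster URL are hypothetical placeholders; a real submission would send this payload over HTTPS to `https://<cluster>.azurehdinsight.net/livy/batches` with cluster credentials.

```python
import json

def livy_batch_payload(jar_path, class_name, args=()):
    """Build the JSON body for a Livy batch submission (POST /batches).

    The values passed in below are illustrative placeholders, not
    values from a real cluster.
    """
    return json.dumps({
        "file": jar_path,          # application JAR in cluster storage
        "className": class_name,   # main class of the Spark application
        "args": list(args),        # command-line arguments for the job
    })

payload = livy_batch_payload(
    "wasbs:///example/jars/my-spark-app.jar",   # hypothetical storage path
    "com.example.SparkApp",                     # hypothetical main class
    ["10"],
)
# A real submission (not run here) would look roughly like:
# requests.post("https://<cluster>.azurehdinsight.net/livy/batches",
#               data=payload,
#               headers={"Content-Type": "application/json"},
#               auth=("admin", password))
```

Livy then returns a batch ID that you can poll (`GET /batches/{id}`) to monitor the job remotely, which is what makes it convenient for submitting work without an SSH session to the cluster.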

By default, Apache Spark clusters in HDInsight include the following components.

Spark clusters in HDInsight also provide an ODBC driver for connectivity to Spark clusters in HDInsight from BI tools such as Microsoft Power BI.

Spark cluster architecture

The architecture of HDInsight Spark

It's easy to understand the components of Spark by understanding how Spark runs on HDInsight clusters.

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

The SparkContext can connect to several types of cluster managers, which allocate resources across applications. These cluster managers include Apache Mesos, Apache Hadoop YARN, and the Spark cluster manager. In HDInsight, Spark runs using the YARN cluster manager. Once connected, Spark acquires executors on worker nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

The SparkContext runs the user's main function and executes the various parallel operations on the worker nodes. Then, the SparkContext collects the results of the operations. The worker nodes read and write data from and to the Hadoop Distributed File System. The worker nodes also cache transformed data in memory as Resilient Distributed Datasets (RDDs).

The SparkContext connects to the Spark master and is responsible for converting an application to a directed acyclic graph (DAG) of individual tasks that get executed within an executor process on the worker nodes. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.
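The driver-side planning described above can be pictured with a toy lazy pipeline: transformations only record steps in a plan (conceptually, the DAG), and nothing executes until an action runs the recorded chain. This is a pure-Python sketch of the idea, not Spark's actual scheduler or API.

```python
# Toy model of Spark's lazy evaluation: transformations build up a plan and
# only an action like collect() executes it. Illustration only, not real Spark.

class LazyPipeline:
    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps          # recorded transformations, not yet run

    def map(self, fn):               # transformation: just record the step
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, fn):            # transformation: just record the step
        return LazyPipeline(self._source, self._steps + (("filter", fn),))

    def collect(self):               # action: now execute the whole plan
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

plan = LazyPipeline(range(6)).map(lambda x: x * x).filter(lambda x: x > 5)
result = plan.collect()   # executes map then filter: [9, 16, 25]
```

In real Spark, this deferred plan is what lets the driver break the work into stages and tasks, and ship those tasks to executors across the worker nodes.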

Spark in HDInsight use cases

Spark clusters in HDInsight enable the following key scenarios:

  • Interactive data analysis and BI

    Apache Spark in HDInsight stores data in Azure Storage or Azure Data Lake Storage. Business experts and key decision makers can analyze and build reports over that data, and use Microsoft Power BI to build interactive reports from the analyzed data. Analysts can start from unstructured/semi-structured data in cluster storage, define a schema for the data using notebooks, and then build data models using Microsoft Power BI. Spark clusters in HDInsight also support a number of third-party BI tools, such as Tableau, making it easier for data analysts, business experts, and key decision makers.

    Tutorial: Visualize Spark data using Power BI

  • Spark Machine Learning

    Apache Spark comes with MLlib, a machine learning library built on top of Spark that you can use from a Spark cluster in HDInsight. Spark clusters in HDInsight also include Anaconda, a Python distribution with a variety of packages for machine learning. Couple this with built-in support for Jupyter and Zeppelin notebooks, and you have an environment for creating machine learning applications.

    Tutorial: Predict building temperatures using HVAC data
    Tutorial: Predict food inspection results

  • Spark streaming and real-time data analysis

    Spark clusters in HDInsight offer rich support for building real-time analytics solutions. While Spark already has connectors to ingest data from many sources like Kafka, Flume, Twitter, ZeroMQ, or TCP sockets, Spark in HDInsight adds first-class support for ingesting data from Azure Event Hubs. Event Hubs is the most widely used queuing service on Azure. Having out-of-the-box support for Event Hubs makes Spark clusters in HDInsight an ideal platform for building real-time analytics pipelines.
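Spark Streaming's processing model groups an incoming stream into small micro-batches and processes each batch with the regular Spark engine. The toy sketch below imitates that micro-batch idea in plain Python; it illustrates the concept only and is not the Spark Streaming or Event Hubs API.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield fixed-size micro-batches from a stream, mimicking the
    discretized-stream model of Spark Streaming (toy illustration only)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Simulate a stream of sensor readings and aggregate each micro-batch,
# the way a streaming job would compute a per-interval summary.
readings = [3, 5, 2, 8, 6, 1, 9, 4]
batch_sums = [sum(b) for b in micro_batches(readings, 3)]
# One aggregate per micro-batch: [10, 15, 13]
```

In a real pipeline, the source would be a live feed such as Event Hubs or Kafka rather than an in-memory list, and each micro-batch result would be written to a sink for downstream analytics.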

Where do I start?

You can use the following articles to learn more about Apache Spark in HDInsight:

Next steps

This overview gave you a basic understanding of Apache Spark in Azure HDInsight. Advance to the next article to learn how to create an HDInsight Spark cluster and run some Spark SQL queries: