What is Apache Storm on Azure HDInsight?

Apache Storm is a distributed, fault-tolerant, open-source computation system. You can use Storm with Apache Hadoop to process streams of data in real time. Storm solutions can also provide guaranteed processing of data, with the ability to replay data that was not successfully processed the first time.

Why use Apache Storm on HDInsight?

Storm on HDInsight provides the following features:

  • 99% Service Level Agreement (SLA) on Storm uptime: For more information, see the SLA information for HDInsight document.

  • Supports easy customization by running scripts against a Storm cluster during or after creation. For more information, see Customize HDInsight clusters using script action.

  • Create solutions in multiple languages: You can write Storm components in the language of your choice, such as Java, C#, and Python.

    • Integrates Visual Studio with HDInsight for the development, management, and monitoring of C# topologies. For more information, see Develop C# Storm topologies with the HDInsight Tools for Visual Studio.

    • Supports the Trident Java interface. You can create Storm topologies that support exactly-once processing of messages, transactional datastore persistence, and a set of common stream analytics operations.

  • Dynamic scaling: You can add or remove worker nodes with no impact to running Storm topologies.

    Note

    You must deactivate and reactivate running topologies to take advantage of new nodes added through scaling operations.

  • Create streaming pipelines using multiple Azure services: Storm on HDInsight integrates with other Azure services such as Event Hubs, SQL Database, Azure Storage, and Azure Data Lake Storage.

    For an example solution that integrates with Azure services, see Process events from Event Hubs with Apache Storm on HDInsight.

For a list of companies that are using Apache Storm for their real-time analytics solutions, see Companies using Apache Storm.

To get started using Storm, see Get started with Apache Storm on HDInsight.

How does Apache Storm work?

Storm runs topologies instead of the Apache Hadoop MapReduce jobs that you might be familiar with. Storm topologies are composed of multiple components that are arranged in a directed acyclic graph (DAG). Data flows between the components in the graph. Each component consumes one or more data streams, and can optionally emit one or more streams. The following diagram illustrates how data flows between components in a basic word-count topology:

Diagram: example of how components are arranged in a Storm topology

  • Spout components bring data into a topology. They emit one or more streams into the topology.

  • Bolt components consume streams emitted from spouts or other bolts. Bolts might optionally emit streams into the topology. Bolts are also responsible for writing data to external services or storage, such as HDFS, Kafka, or HBase. A minimal wiring sketch for a topology like this appears after this list.
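
As a rough Java sketch, and only as an illustration, the word-count components above might be wired together as follows. The spout and bolt class names (RandomSentenceSpout, SplitSentenceBolt, WordCountBolt) are hypothetical stand-ins, not part of the HDInsight documentation:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// Hypothetical word-count wiring: a spout emits sentences, a splitter bolt breaks
// them into words, and fieldsGrouping routes each word to the same counter task.
public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new RandomSentenceSpout(), 2);
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));

        StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
}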

Reliability

Apache Storm guarantees that each incoming message is always fully processed, even when the data analysis is spread over hundreds of nodes.

The Nimbus node provides functionality similar to the Apache Hadoop JobTracker, and it assigns tasks to other nodes in a cluster through Apache ZooKeeper. ZooKeeper nodes provide coordination for a cluster and facilitate communication between Nimbus and the Supervisor process on the worker nodes. If one processing node goes down, the Nimbus node is informed, and it assigns the task and associated data to another node.

The default configuration for Apache Storm clusters is to have only one Nimbus node. Storm on HDInsight provides two Nimbus nodes. If the primary node fails, the Storm cluster switches to the secondary node while the primary node is recovered. The following diagram illustrates the task flow configuration for Storm on HDInsight:

Diagram of Nimbus, ZooKeeper, and the Supervisors

Ease of creation

You can create a new Storm cluster on HDInsight in minutes. For more information on creating a Storm cluster, see Get started with Storm on HDInsight.

Ease of use

  • Secure Shell (SSH) connectivity: You can access the head nodes of your Storm cluster over the internet by using SSH. You can run commands directly on your cluster by using SSH.

    For more information, see Use SSH with HDInsight.

  • Web connectivity: All HDInsight clusters provide the Ambari web UI. You can easily monitor, configure, and manage services on your cluster by using the Ambari web UI. Storm clusters also provide the Storm UI. You can monitor and manage running Storm topologies from your browser by using the Storm UI.

    For more information, see the Manage HDInsight using the Apache Ambari Web UI and Monitor and manage using the Apache Storm UI documents.

  • Azure PowerShell and Azure Classic CLI: PowerShell and the classic CLI both provide command-line utilities that you can use from your client system to work with HDInsight and other Azure services.

  • Visual Studio integration: Azure Data Lake Tools for Visual Studio include project templates for creating C# Storm topologies by using the SCP.NET framework. Data Lake Tools also provide tools to deploy, monitor, and manage solutions with Storm on HDInsight.

    For more information, see Develop C# Storm topologies with the HDInsight Tools for Visual Studio.

Integration with other Azure services

Support

Storm on HDInsight comes with full enterprise-level continuous support. Storm on HDInsight also has an SLA of 99.9 percent. That means Microsoft guarantees that a Storm cluster has external connectivity at least 99.9 percent of the time.

For more information, see Azure support.

Apache Storm use cases

The following are some common scenarios for which you might use Storm on HDInsight:

  • Internet of Things (IoT)
  • Fraud detection
  • Social analytics
  • Extraction, transformation, and loading (ETL)
  • Network monitoring
  • Search
  • Mobile engagement

For information about real-world scenarios, see the How companies are using Apache Storm document.

Development

.NET developers can design and implement topologies in C# by using Data Lake Tools for Visual Studio. You can also create hybrid topologies that use Java and C# components.

For more information, see Develop C# topologies for Apache Storm on HDInsight using Visual Studio.

You can also develop Java solutions by using the IDE of your choice. For more information, see Develop Java topologies for Apache Storm on HDInsight.

Python can also be used to develop Storm components. For more information, see Develop Apache Storm topologies using Python on HDInsight.

Common development patterns

Guaranteed message processing

Apache Storm can provide different levels of guaranteed message processing. For example, a basic Storm application can guarantee at-least-once processing, and Trident can guarantee exactly-once processing.

For more information, see Guarantees on data processing at apache.org.
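
As a minimal sketch of what at-least-once processing looks like in a plain bolt (assuming the Storm 1.x org.apache.storm APIs; the bolt and field names are hypothetical): the emitted tuple is anchored to the input, and the input is acked or failed so the spout can replay it if processing fails.

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt that anchors its output to the input tuple and then acks it.
public class AnchoringBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String word = input.getStringByField("word");
            // Anchoring: passing `input` ties the new tuple to the original message tree.
            collector.emit(input, new Values(word.toUpperCase()));
            collector.ack(input);   // report the input as fully processed
        } catch (Exception e) {
            collector.fail(input);  // ask Storm to replay the input tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}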

IBasicBolt

The pattern of reading an input tuple, emitting zero or more tuples, and then acknowledging the input tuple immediately at the end of the execute method is common. Storm provides the IBasicBolt interface to automate this pattern.
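
A minimal sketch of that pattern, using Storm's BaseBasicBolt convenience class (which implements IBasicBolt). The emitted tuple is automatically anchored to the input, and the input is acked when execute returns; the bolt and field names here are hypothetical:

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt: anchoring and acking are handled by the IBasicBolt contract,
// so execute only reads the input and emits the result.
public class UppercaseBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        collector.emit(new Values(input.getStringByField("word").toUpperCase()));
        // The input tuple is acked automatically after execute returns.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}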

Joins

How data streams are joined varies between applications. For example, you can join each tuple from multiple streams into one new stream, or you can join only batches of tuples for a specific window. Either way, joining can be accomplished by using fieldsGrouping. Field grouping is a way of defining how tuples are routed to bolts.

In the following Java example, fieldsGrouping is used to route tuples that originate from components "1", "2", and "3" to the MyJoiner bolt:

builder.setBolt("join", new MyJoiner(), parallelism) .fieldsGrouping("1", new Fields("joinfield1", "joinfield2")) .fieldsGrouping("2", new Fields("joinfield1", "joinfield2")) .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));

Batches

Apache Storm provides an internal timing mechanism known as a "tick tuple." You can set how often a tick tuple is emitted in your topology.

For an example of using a tick tuple from a C# component, see PartialBoltCount.cs.
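
For a rough Java equivalent, a bolt can request tick tuples in its component configuration and check for them in execute. This is only a sketch; the bolt name, the 10-second interval, and the commented-out helper methods are hypothetical:

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Hypothetical batching bolt: buffers tuples and flushes them whenever a tick tuple arrives.
public class BatchingBolt extends BaseBasicBolt {
    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);  // ask Storm for a tick tuple every 10 seconds
        return conf;
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        boolean isTick = Constants.SYSTEM_COMPONENT_ID.equals(input.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(input.getSourceStreamId());
        if (isTick) {
            // flushBatch(collector);  // hypothetical helper that emits the buffered batch
        } else {
            // addToBatch(input);      // hypothetical helper that buffers the tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}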

Caches

In-memory caching is often used as a mechanism for speeding up processing because it keeps frequently used assets in memory. Because a topology is distributed across multiple nodes, and multiple processes within each node, you should consider using fieldsGrouping. Use fieldsGrouping to ensure that tuples containing the fields that are used for cache lookup are always routed to the same process. This grouping functionality avoids duplication of cache entries across processes.
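
For example, following the style of the join snippet above, a hypothetical enrichment bolt that caches reference data keyed by a customerId field could be wired so that every tuple with the same key reaches the same task:

// All tuples with the same "customerId" value go to the same EnrichmentBolt task,
// so its in-memory cache entry for that customer is not duplicated in other processes.
builder.setBolt("enrich", new EnrichmentBolt(), parallelism)
       .fieldsGrouping("orders", new Fields("customerId"));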

Stream "top N"

When your topology depends on calculating a top N value, calculate the top N values in parallel. Then merge the output from those calculations into a global value. This operation can be done by using fieldsGrouping to route by field for parallel processing. Then you can route to a bolt that globally determines the top N value.

For an example of calculating a top N value, see the RollingTopWords example.
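
A hypothetical wiring for that two-stage pattern might look like the following, in the same fragment style as the earlier join example; the ranking bolt classes are stand-ins:

// Partial top-N rankings are computed in parallel, partitioned by word (fieldsGrouping),
// then merged into one global top-N by a single bolt instance (globalGrouping).
builder.setBolt("partialRank", new PartialRankingsBolt(topN), parallelism)
       .fieldsGrouping("count", new Fields("word"));
builder.setBolt("globalRank", new GlobalRankingsBolt(topN), 1)
       .globalGrouping("partialRank");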

Logging

Storm uses Apache Log4j 2 to log information. By default, a large amount of data is logged, and it can be difficult to sort through the information. You can include a logging configuration file as part of your Storm topology to control logging behavior.

For an example topology that demonstrates how to configure logging, see the Java-based WordCount example for Storm on HDInsight.
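
As a small sketch, topology components typically write log entries through the SLF4J API, which Storm routes to Log4j 2, so a component's verbosity can be tuned from the logging configuration file rather than from code. The bolt below is hypothetical:

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical bolt that logs each tuple it processes through SLF4J.
public class LoggingBolt extends BaseBasicBolt {
    private static final Logger LOG = LoggerFactory.getLogger(LoggingBolt.class);

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        LOG.debug("Processing tuple: {}", input);  // verbosity is controlled by the Log4j 2 configuration
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}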

Next steps

Learn more about real-time analytics solutions with Apache Storm on HDInsight: