
What is Apache Storm on Azure HDInsight?

Apache Storm is a distributed, fault-tolerant, open-source computation system. You can use Storm to process streams of data in real time with Apache Hadoop. Storm solutions can also provide guaranteed processing of data, with the ability to replay data that was not successfully processed the first time.

Why use Apache Storm on HDInsight?

Storm on HDInsight provides the following features:

  • 99.9% Service Level Agreement (SLA) on Storm uptime: For more information, see the SLA information for HDInsight document.

  • Supports easy customization by running scripts against a Storm cluster during or after creation. For more information, see Customize HDInsight clusters using script action.

  • Create solutions in multiple languages: You can write Storm components in the language of your choice, such as Java, C#, and Python.

    • Integrates Visual Studio with HDInsight for the development, management, and monitoring of C# topologies. For more information, see Develop C# Storm topologies with the HDInsight Tools for Visual Studio.

    • Supports the Trident Java interface. You can create Storm topologies that support exactly-once processing of messages, transactional datastore persistence, and a set of common stream analytics operations.

  • Dynamic scaling: You can add or remove worker nodes with no impact on running Storm topologies.

    • You must deactivate and reactivate running topologies to take advantage of new nodes added through scaling operations.
  • Create streaming pipelines using multiple Azure services: Storm on HDInsight integrates with other Azure services such as Event Hubs, SQL Database, Azure Storage, and Azure Data Lake Storage.

    For an example solution that integrates with Azure services, see Process events from Event Hubs with Apache Storm on HDInsight.

For a list of companies that are using Apache Storm for their real-time analytics solutions, see Companies using Apache Storm.

To get started using Storm, see Create and monitor an Apache Storm topology in Azure HDInsight.

How does Apache Storm work?

Storm runs topologies instead of the Apache Hadoop MapReduce jobs that you might be familiar with. Storm topologies are composed of multiple components that are arranged in a directed acyclic graph (DAG). Data flows between the components in the graph. Each component consumes one or more data streams, and can optionally emit one or more streams. The following diagram illustrates how data flows between components in a basic word-count topology (a minimal wiring sketch of this topology appears after the component list below):

Example of how components are arranged in a Storm topology

  • Spout components bring data into a topology. They emit one or more streams into the topology.

  • Bolt components consume streams emitted from spouts or other bolts. Bolts might optionally emit streams into the topology. Bolts are also responsible for writing data to external services or storage, such as HDFS, Kafka, or HBase.
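
The following is a minimal Java wiring sketch of a word-count topology like the one in the diagram, assuming the org.apache.storm packages used by Storm 1.x and later (older releases used backtype.storm). SentenceSpout, SplitSentenceBolt, and WordCountBolt are hypothetical component classes used only for illustration; they are not part of the Storm distribution.

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // The spout brings sentences into the topology as a stream of tuples.
            builder.setSpout("sentences", new SentenceSpout(), 1);

            // Split each sentence into words. shuffleGrouping distributes the
            // sentence tuples evenly across this bolt's tasks.
            builder.setBolt("split", new SplitSentenceBolt(), 2)
                   .shuffleGrouping("sentences");

            // fieldsGrouping on "word" routes the same word to the same counter
            // task, so each task keeps an accurate running count for its words.
            builder.setBolt("count", new WordCountBolt(), 2)
                   .fieldsGrouping("split", new Fields("word"));

            // Run the topology in-process for a short time, then shut down.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", new Config(), builder.createTopology());
            Thread.sleep(30000);
            cluster.shutdown();
        }
    }

The same builder code is used when you submit to a cluster; you would replace the LocalCluster calls with StormSubmitter.submitTopology.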

Reliability

Apache Storm guarantees that each incoming message is always fully processed, even when the data analysis is spread over hundreds of nodes.

The Nimbus node provides functionality similar to the Apache Hadoop JobTracker, and it assigns tasks to other nodes in a cluster through Apache ZooKeeper. ZooKeeper nodes provide coordination for a cluster and facilitate communication between Nimbus and the Supervisor process on the worker nodes. If one processing node goes down, the Nimbus node is informed, and it assigns the task and associated data to another node.

The default configuration for Apache Storm clusters is to have only one Nimbus node. Storm on HDInsight provides two Nimbus nodes. If the primary node fails, the Storm cluster switches to the secondary node while the primary node is recovered. The following diagram illustrates the task flow configuration for Storm on HDInsight:

Diagram of Nimbus, ZooKeeper, and Supervisor nodes

Ease of creation

You can create a new Storm cluster on HDInsight in minutes. For more information on creating a Storm cluster, see Create Apache Hadoop clusters using the Azure portal.

Ease of use

  • Secure Shell (SSH) connectivity: You can access the head nodes of your Storm cluster over the Internet by using SSH. You can run commands directly on your cluster by using SSH.

    For more information, see Use SSH with HDInsight.

  • Web connectivity: All HDInsight clusters provide the Ambari web UI. You can easily monitor, configure, and manage services on your cluster by using the Ambari web UI. Storm clusters also provide the Storm UI. You can monitor and manage running Storm topologies from your browser by using the Storm UI.

    For more information, see the Manage HDInsight using the Apache Ambari Web UI and Monitor and manage using the Apache Storm UI documents.

  • Azure PowerShell and Azure Classic CLI: PowerShell and the classic CLI both provide command-line utilities that you can use from your client system to work with HDInsight and other Azure services.

  • Visual Studio integration: Azure Data Lake Tools for Visual Studio include project templates for creating C# Storm topologies by using the SCP.NET framework. Data Lake Tools also provide tools to deploy, monitor, and manage solutions with Storm on HDInsight.

    For more information, see Develop C# Storm topologies with the HDInsight Tools for Visual Studio.

Integration with other Azure services

Storm on HDInsight integrates with other Azure services, such as Event Hubs, SQL Database, Azure Storage, and Azure Data Lake Storage, as described earlier in this article.

Support

Storm on HDInsight comes with full enterprise-level continuous support. Storm on HDInsight also has an SLA of 99.9 percent. That means Microsoft guarantees that a Storm cluster has external connectivity at least 99.9 percent of the time.

For more information, see Azure support.

Apache Storm use cases

The following are some common scenarios for which you might use Storm on HDInsight:

  • Internet of Things (IoT)
  • Fraud detection
  • Social analytics
  • Extraction, transformation, and loading (ETL)
  • Network monitoring
  • Search
  • Mobile engagement

For information about real-world scenarios, see the How companies are using Apache Storm document.

Development

.NET developers can design and implement topologies in C# by using Data Lake Tools for Visual Studio. You can also create hybrid topologies that use Java and C# components.

For more information, see Develop C# topologies for Apache Storm on HDInsight using Visual Studio.

You can also develop Java solutions by using the IDE of your choice. For more information, see Develop Java topologies for Apache Storm on HDInsight.

Python can also be used to develop Storm components. For more information, see Develop Apache Storm topologies using Python on HDInsight.

Common development patterns

Guaranteed message processing

Apache Storm can provide different levels of guaranteed message processing. For example, a basic Storm application can guarantee at-least-once processing, and Trident can guarantee exactly-once processing.

For more information, see Guarantees on data processing at apache.org.
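
As an illustration of at-least-once processing, the following is a minimal Java sketch of a bolt that anchors each emitted tuple to the input tuple and then explicitly acks or fails the input. The UppercaseBolt class and the "word" field name are assumptions made for this example; only the Storm API calls themselves (emit with an anchor, ack, fail) come from Storm.

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class UppercaseBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                String word = input.getStringByField("word");
                // Anchoring: passing the input tuple links the emitted tuple to it in
                // the tuple tree, so a downstream failure makes the spout replay it.
                collector.emit(input, new Values(word.toUpperCase()));
                collector.ack(input);
            } catch (Exception e) {
                // Failing the tuple asks the spout to replay it (at-least-once).
                collector.fail(input);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }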

IBasicBolt

The pattern of reading an input tuple, emitting zero or more tuples, and then acknowledging the input tuple immediately at the end of the execute method is common. Storm provides the IBasicBolt interface to automate this pattern.
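
For comparison with the anchoring example above, this is a minimal sketch of the same kind of bolt written by extending BaseBasicBolt, Storm's convenience base class for IBasicBolt. The class and field names are illustrative assumptions.

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class UppercaseBasicBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // Tuples emitted here are anchored to the input automatically, and the
            // input tuple is acked when this method returns (or failed if it throws
            // a FailedException).
            String word = input.getStringByField("word");
            collector.emit(new Values(word.toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }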

Joins

How data streams are joined varies between applications. For example, you can join each tuple from multiple streams into one new stream, or you can join only batches of tuples for a specific window. Either way, joining can be accomplished by using fieldsGrouping. Field grouping is a way of defining how tuples are routed to bolts.

In the following Java example, fieldsGrouping is used to route tuples that originate from components "1", "2", and "3" to the MyJoiner bolt:

builder.setBolt("join", new MyJoiner(), parallelism) .fieldsGrouping("1", new Fields("joinfield1", "joinfield2")) .fieldsGrouping("2", new Fields("joinfield1", "joinfield2")) .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));

Batches

Apache Storm provides an internal timing mechanism known as a "tick tuple." You can set how often a tick tuple is emitted in your topology.

For an example of using a tick tuple from a C# component, see PartialBoltCount.cs.
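
In Java, a bolt can request its own tick interval by overriding getComponentConfiguration. The following is a minimal sketch of that approach; the 5-second interval and the BatchingBolt name are illustrative assumptions, and the batching logic itself is left as comments.

    import java.util.Map;

    import org.apache.storm.Config;
    import org.apache.storm.Constants;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    public class BatchingBolt extends BaseBasicBolt {
        @Override
        public Map<String, Object> getComponentConfiguration() {
            // Ask Storm to send this bolt a tick tuple every 5 seconds.
            Config conf = new Config();
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
            return conf;
        }

        private static boolean isTickTuple(Tuple tuple) {
            return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                    && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
        }

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            if (isTickTuple(input)) {
                // A tick arrived: flush whatever has been batched since the last tick.
            } else {
                // A normal tuple: add it to the current batch.
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This sketch does not emit any streams.
        }
    }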

Caches

In-memory caching is often used as a mechanism for speeding up processing because it keeps frequently used assets in memory. Because a topology is distributed across multiple nodes, and multiple processes within each node, you should consider using fieldsGrouping. Use fieldsGrouping to ensure that tuples containing the fields that are used for cache lookup are always routed to the same process. This grouping functionality avoids duplication of cache entries across processes.
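
The following is a minimal Java sketch of this pattern: a bolt keeps a local in-memory cache, and the topology is expected to route tuples to it with fieldsGrouping on the cache key. The "userId" field, the UserEnrichmentBolt name, and lookupUserProfile (a stand-in for a slow external lookup) are all assumptions made for illustration.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class UserEnrichmentBolt extends BaseBasicBolt {
        // Local cache. Safe to keep per task because fieldsGrouping on "userId"
        // guarantees that all tuples for a given user reach this same task.
        private final Map<String, String> profileCache = new HashMap<>();

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String userId = input.getStringByField("userId");
            String profile = profileCache.computeIfAbsent(userId, this::lookupUserProfile);
            collector.emit(new Values(userId, profile));
        }

        // Placeholder for an expensive lookup against an external store.
        private String lookupUserProfile(String userId) {
            return "profile-for-" + userId;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("userId", "profile"));
        }
    }

In the topology wiring, this bolt would be connected with .fieldsGrouping(upstreamComponent, new Fields("userId")) so that the grouping field matches the cache key.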

Stream "top N"

When your topology depends on calculating a top N value, calculate the top N value in parallel. Then merge the output from those calculations into a global value. This operation can be done by using fieldsGrouping to route by field for parallel processing. Then you can route to a bolt that globally determines the top N value.

For an example of calculating a top N value, see the RollingTopWords example.
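
The following is a minimal Java wiring sketch of that pattern, modeled loosely on the storm-starter RollingTopWords example. The spout and bolt classes, and the "word" field, are placeholders you would implement or adapt yourself; they are not provided by Storm.

    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class TopNWiringExample {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            builder.setSpout("words", new WordSpout(), 2);

            // Count words in parallel; fieldsGrouping gives each task its own
            // slice of the key space.
            builder.setBolt("rolling-count", new RollingCountBolt(), 4)
                   .fieldsGrouping("words", new Fields("word"));

            // Each task ranks only the words it has counted.
            builder.setBolt("intermediate-ranking", new IntermediateRankingsBolt(), 4)
                   .fieldsGrouping("rolling-count", new Fields("word"));

            // A single task merges all partial rankings into the global top N.
            builder.setBolt("total-ranking", new TotalRankingsBolt(), 1)
                   .globalGrouping("intermediate-ranking");

            // Submission is omitted here; see the word-count sketch earlier.
        }
    }

The key point is the parallelism: the counting and intermediate-ranking bolts run with multiple tasks, while the total-ranking bolt runs as a single task fed by globalGrouping so it sees every partial ranking.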

Logging

Storm uses Apache Log4j 2 to log information. By default, a large amount of data is logged, and it can be difficult to sort through the information. You can include a logging configuration file as part of your Storm topology to control logging behavior.

For an example topology that demonstrates how to configure logging, see Java-based WordCount example for Storm on HDInsight.
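
As a minimal illustration, a Java component typically logs through SLF4J, which Storm routes to Log4j 2 on the cluster; the logger named after the class can then be given its own level in the Log4j 2 configuration file you package with the topology. The LoggingBolt class and the message here are illustrative only.

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class LoggingBolt extends BaseBasicBolt {
        private static final Logger LOG = LoggerFactory.getLogger(LoggingBolt.class);

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // Use parameterized messages; the string is only built if DEBUG is enabled.
            LOG.debug("Received tuple: {}", input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This sketch does not emit any streams.
        }
    }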

Next steps

Learn more about real-time analytics solutions with Apache Storm on HDInsight: