Introduction to Apache Storm on HDInsight: Real-time analytics for Hadoop

You can create distributed, real-time analytics solutions by using Apache Storm on Azure HDInsight.

Storm is a distributed, fault-tolerant, open-source computation system. You can use Storm to process data in real time with Hadoop. Storm solutions can also provide guaranteed processing of data, with the ability to replay data that was not successfully processed the first time.

How does Storm work?

Storm runs topologies instead of the MapReduce jobs that you might be familiar with in HDInsight or Hadoop. Topologies are composed of multiple components that are arranged in a directed acyclic graph (DAG). The following diagram illustrates how data flows between components in a basic word-count topology:

Example of how components are arranged in a Storm topology

  • Spout components bring data into a topology. They emit one or more streams into the topology.

  • Bolt components consume streams emitted from spouts or other bolts. Bolts might optionally emit new streams into the topology. Bolts are also responsible for writing data to persistent storage, such as HDFS or HBase.

Why use Storm on HDInsight?

Storm on HDInsight provides the following key benefits:

  • Performs as a managed service with an SLA of 99.9 percent uptime.

  • Supports easy customization by running scripts against a cluster during or after creation. For more information, see Customize HDInsight clusters using script action.

  • Uses various languages. You can write Storm components in the language of your choice, such as Java, C#, and Python.

    • Integrates Visual Studio with HDInsight for the development, management, and monitoring of C# topologies. For more information, see Develop C# Storm topologies with the HDInsight Tools for Visual Studio.

    • Supports the Trident Java interface. You can create Storm topologies that support exactly once processing of messages, transactional datastore persistence, and a set of common stream analytics operations.

  • Scales clusters up and down easily. You can add or remove worker nodes with no impact to running Storm topologies.

  • Integrates with the following Azure services:

    • Azure Event Hubs

    • Azure Virtual Network

    • Azure SQL Database

    • Azure Storage

    • Azure DocumentDB

  • Securely combines the capabilities of multiple HDInsight clusters by using Virtual Network. You can create analytic pipelines that use HDInsight, HBase, or Hadoop clusters.

For a list of companies that are using Storm for their real-time analytics solutions, see Companies using Apache Storm.

To get started using Storm, see Get started with Storm on HDInsight.

Ease of creation

You can provision a new Storm cluster on HDInsight in minutes. For more information on creating a Storm cluster, see Get started with Storm on HDInsight.

Ease of use

  • Secure Shell (SSH) connectivity: You can access the head nodes of your HDInsight cluster over the Internet by using SSH. You can run commands directly on your cluster by using SSH.

    For more information, see Use SSH with HDInsight.

  • Web connectivity: HDInsight clusters provide the Ambari web UI. You can easily monitor, configure, and manage services on your cluster by using the Ambari web UI. Storm on HDInsight also provides the Storm UI. You can monitor and manage running Storm topologies from your browser by using the Storm UI.

    For more information, see Manage HDInsight using the Ambari Web UI and Monitor and manage using the Storm UI.

  • Azure PowerShell and Azure CLI: PowerShell and CLI both provide command-line utilities that you can use from your client system to work with HDInsight and other Azure services.

  • Visual Studio integration: Azure Data Lake Tools for Visual Studio include project templates for creating C# Storm topologies by using the SCP.Net framework. Data Lake Tools also provide tools to deploy, monitor, and manage solutions with Storm on HDInsight.

    For more information, see Develop C# Storm topologies with the HDInsight Tools for Visual Studio.

Integration with other Azure services

Reliability

Storm guarantees that each incoming message is always fully processed, even when the data analysis is spread over hundreds of nodes.

The Nimbus node provides functionality similar to the Hadoop JobTracker, and it assigns tasks to other nodes in a cluster through Zookeeper. Zookeeper nodes provide coordination for a cluster and facilitate communication between Nimbus and the Supervisor process on the worker nodes. If one processing node goes down, the Nimbus node is informed, and it assigns the task and associated data to another node.

The default configuration for Storm is to have only one Nimbus node. Storm on HDInsight runs two Nimbus nodes. If the primary node fails, the HDInsight cluster switches to the secondary node while the primary node is recovered. The following diagram illustrates the task flow configuration for Storm on HDInsight:

Diagram of nimbus, zookeeper, and supervisor

Scale

Although you can specify the number of nodes in your cluster during creation, you might want to grow or shrink the cluster to match workload. You can change the number of nodes in all HDInsight clusters, even while processing data.

Note

To take advantage of new nodes added through scaling, you need to rebalance topologies started before the cluster size was increased.

Support

Storm on HDInsight comes with full enterprise-level continuous support. Storm on HDInsight also has an SLA of 99.9 percent. That means we guarantee that a cluster has external connectivity at least 99.9 percent of the time.

For more information, see Azure support.

Common use cases

The following are some common scenarios for which you might use Storm on HDInsight.

  • Internet of Things (IoT)
  • Fraud detection
  • Social analytics
  • Extraction, transformation, and loading (ETL)
  • Network monitoring
  • Search
  • Mobile engagement

For information about real-world scenarios, see the How companies are using Storm document.

Development

.NET developers can design and implement topologies in C# by using Data Lake Tools for Visual Studio. You can also create hybrid topologies that use Java and C# components.

For more information, see Develop C# topologies for Storm on HDInsight using Visual Studio.

You can also develop Java solutions by using the IDE of your choice. For more information, see Develop Java topologies for Storm on HDInsight.

Python can also be used to develop Storm components. For more information, see Develop Storm topologies using Python on HDInsight.

Common development patterns

Guaranteed message processing

Storm can provide different levels of guaranteed message processing. For example, a basic Storm application can guarantee at-least-once processing, and Trident can guarantee exactly once processing.

For more information, see Guarantees on data processing at apache.org.

IBasicBolt

The pattern of reading an input tuple, emitting zero or more tuples, and then acking the input tuple immediately at the end of the execute method is common. Storm provides the IBasicBolt interface to automate this pattern.

Joins

How data streams are joined varies between applications. For example, you can join each tuple from multiple streams into one new stream, or you can join only batches of tuples for a specific window. Either way, joining can be accomplished by using fieldsGrouping, which is a way of defining how tuples are routed to bolts.

In the following Java example, fieldsGrouping is used to route tuples that originate from components "1", "2", and "3" to the MyJoiner bolt:

builder.setBolt("join", new MyJoiner(), parallelism) .fieldsGrouping("1", new Fields("joinfield1", "joinfield2")) .fieldsGrouping("2", new Fields("joinfield1", "joinfield2")) .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));

Batches

Storm provides an internal timing mechanism known as a "tick tuple," which can be used to emit a batch every X seconds.

For an example of using a tick tuple from a C# component, see PartialBoltCount.cs.

Trident is based on processing batches of tuples.

Caches

In-memory caching is often used as a mechanism for speeding up processing because it keeps frequently used assets in memory. Because a topology is distributed across multiple nodes, and multiple processes within each node, you should consider using fieldsGrouping. Use fieldsGrouping to ensure that tuples containing the fields that are used for cache lookup are always routed to the same process. This grouping functionality avoids duplication of cache entries across processes.

Stream "top N"

When your topology depends on calculating a top N value, calculate the top N value in parallel. Then merge the output from those calculations into a global value. This operation can be done by using fieldsGrouping to route by field for parallel processing, and then route to a bolt that globally determines the top N value.

For an example of calculating a top N value, see the RollingTopWords example.

Logging

Storm uses Apache Log4j to log information. By default, a large amount of data is logged, and it can be difficult to sort through the information. You can include a logging configuration file as part of your Storm topology to control logging behavior.

For an example topology that demonstrates how to configure logging, see Java-based WordCount example for Storm on HDInsight.

Next steps

Learn more about real-time analytics solutions with Storm on HDInsight: