Tutorial: Write to Apache Hadoop HDFS from Apache Storm on Azure HDInsight
This tutorial demonstrates how to use Apache Storm to write data to the HDFS-compatible storage used by Apache Storm on HDInsight. HDInsight can use both Azure Storage and Azure Data Lake Storage as HDFS-compatible storage. Storm provides an HdfsBolt component that writes data to HDFS. This document provides information on writing to either type of storage from the HdfsBolt.
The example topology used in this document relies on components that are included with Storm on HDInsight. It may require modification to work with Azure Data Lake Storage when used with other Apache Storm clusters.
In this tutorial, you learn how to:
- Configure the cluster with script action
- Build and package the topology
- Deploy and run the topology
- View output data
- Stop the topology
Prerequisites
Apache Maven properly installed according to Apache. Maven is a project build system for Java projects.
An SSH client. For more information, see Connect to HDInsight (Apache Hadoop) using SSH.
The URI scheme for your cluster's primary storage. This would be `wasb://` for Azure Storage, `abfs://` for Azure Data Lake Storage Gen2, or `adl://` for Azure Data Lake Storage Gen1. If secure transfer is enabled for Azure Storage, the URI would be `wasbs://`. See also, secure transfer.
Example configuration
The following YAML is an excerpt from the resources/writetohdfs.yaml file included in the example. This file defines the Storm topology using the Flux framework for Apache Storm.
```yaml
components:
  - id: "syncPolicy"
    className: "org.apache.storm.hdfs.bolt.sync.CountSyncPolicy"
    constructorArgs:
      - 1000

  # Rotate files when they hit 5 MB
  - id: "rotationPolicy"
    className: "org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy"
    constructorArgs:
      - 5
      - "MB"

  - id: "fileNameFormat"
    className: "org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat"
    configMethods:
      - name: "withPath"
        args: ["${hdfs.write.dir}"]
      - name: "withExtension"
        args: [".txt"]

  - id: "recordFormat"
    className: "org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat"
    configMethods:
      - name: "withFieldDelimiter"
        args: ["|"]

# spout definitions
spouts:
  - id: "tick-spout"
    className: "com.microsoft.example.TickSpout"
    parallelism: 1

# bolt definitions
bolts:
  - id: "hdfs-bolt"
    className: "org.apache.storm.hdfs.bolt.HdfsBolt"
    configMethods:
      - name: "withConfigKey"
        args: ["hdfs.config"]
      - name: "withFsUrl"
        args: ["${hdfs.url}"]
      - name: "withFileNameFormat"
        args: [ref: "fileNameFormat"]
      - name: "withRecordFormat"
        args: [ref: "recordFormat"]
      - name: "withRotationPolicy"
        args: [ref: "rotationPolicy"]
      - name: "withSyncPolicy"
        args: [ref: "syncPolicy"]
```
This YAML defines the following items:
- `syncPolicy`: Defines when files are synched/flushed to the file system. In this example, every 1000 tuples.
- `fileNameFormat`: Defines the path and file name pattern to use when writing files. In this example, the path is provided at runtime using a filter, and the file extension is `.txt`.
- `recordFormat`: Defines the internal format of the files written. In this example, fields are delimited by the `|` character.
- `rotationPolicy`: Defines when to rotate files. In this example, files are rotated when they reach 5 MB in size.
- `hdfs-bolt`: Uses the previous components as configuration parameters for the `HdfsBolt` class.
For more information on the Flux framework, see https://storm.apache.org/releases/current/flux.html.
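To make the `recordFormat` setting concrete, the following is a minimal Java sketch of what a `|`-delimited record looks like for one tuple. This is illustrative only: the class, method name, and sample field values are hypothetical, not part of the sample project or the storm-hdfs API.

```java
import java.util.List;

public class DelimitedRecordSketch {
    // Mimics a delimited record format: fields joined by "|",
    // each record terminated by a newline.
    static String formatRecord(List<Object> tupleValues) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tupleValues.size(); i++) {
            if (i > 0) sb.append('|');
            sb.append(tupleValues.get(i));
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        // Hypothetical tuple values, for illustration only.
        System.out.print(formatRecord(List.of(4, "four", "4!!")));
        // prints: 4|four|4!!
    }
}
```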
Configure the cluster
By default, Storm on HDInsight does not include the components that HdfsBolt uses to communicate with Azure Storage or Data Lake Storage in Storm's classpath. Use the following script action to add these components to the extlib directory for Storm on your cluster:
| Property | Value |
|---|---|
| Script type | - Custom |
| Bash script URI | https://hdiconfigactions.blob.core.windows.net/linuxstormextlibv01/stormextlib.sh |
| Node type(s) | Nimbus, Supervisor |
| Parameters | None |
For information on using this script with your cluster, see the Customize HDInsight clusters using script actions document.
Build and package the topology
Download the example project from https://github.com/Azure-Samples/hdinsight-storm-azure-data-lake-store to your development environment.
From a command prompt, terminal, or shell session, change directories to the root of the downloaded project. To build and package the topology, use the following command:

```bash
mvn compile package
```

Once the build and packaging completes, there is a new directory named `target` that contains a file named `StormToHdfs-1.0-SNAPSHOT.jar`. This file contains the compiled topology.
Deploy and run the topology
1. Use the following command to copy the topology to the HDInsight cluster. Replace `CLUSTERNAME` with the name of the cluster.

   ```bash
   scp target\StormToHdfs-1.0-SNAPSHOT.jar sshuser@CLUSTERNAME-ssh.azurehdinsight.net:StormToHdfs-1.0-SNAPSHOT.jar
   ```

2. Once the upload completes, use the following to connect to the HDInsight cluster using SSH. Replace `CLUSTERNAME` with the name of the cluster.

   ```bash
   ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
   ```

3. Once connected, use the following command to create a file named `dev.properties`:

   ```bash
   nano dev.properties
   ```

4. Use the following text as the contents of the `dev.properties` file. Revise as needed based on your URI scheme.

   ```
   hdfs.write.dir: /stormdata/
   hdfs.url: wasbs:///
   ```

   To save the file, use Ctrl + X, then Y, and finally Enter. The values in this file set the storage URL and the directory name that data is written to.
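When the topology is submitted with a properties file as a filter, Flux substitutes `${key}` placeholders in the YAML (such as `${hdfs.write.dir}` and `${hdfs.url}` above) with the corresponding values. The following is a minimal Java sketch of that substitution idea; it is illustrative only and does not reflect Flux's actual implementation.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FluxFilterSketch {
    // Matches ${key} placeholders in a template string.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${key} with its value from the properties map;
    // unknown keys are left as-is.
    static String substitute(String template, Map<String, String> props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = props.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of(
            "hdfs.write.dir", "/stormdata/",
            "hdfs.url", "wasbs:///");
        System.out.println(substitute("args: [\"${hdfs.write.dir}\"]", props));
        // prints: args: ["/stormdata/"]
    }
}
```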
Use the following command to start the topology:
```bash
storm jar StormToHdfs-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote -R /writetohdfs.yaml --filter dev.properties
```

This command starts the topology using the Flux framework by submitting it to the Nimbus node of the cluster. The topology is defined by the `writetohdfs.yaml` file included in the jar. The `dev.properties` file is passed as a filter, and the values contained in the file are read by the topology.
View output data
To view the data, use the following command:
```bash
hdfs dfs -ls /stormdata/
```
A list of the files created by this topology is displayed. The following list is an example of the data returned by the previous command:

```
Found 23 items
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-0-1561407909895.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-1-1561407915577.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-10-1561407943327.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-11-1561407946312.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-12-1561407949320.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-13-1561407952662.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-14-1561407955502.txt
```
Stop the topology
Storm topologies run until stopped, or the cluster is deleted. To stop the topology, use the following command:
```bash
storm kill hdfswriter
```
Clean up resources
To clean up the resources created by this tutorial, you can delete the resource group. Deleting the resource group also deletes the associated HDInsight cluster, and any other resources associated with the resource group.
To remove the resource group using the Azure portal:
- In the Azure portal, expand the menu on the left side to open the menu of services, and then choose Resource Groups to display the list of your resource groups.
- Locate the resource group to delete, and then right-click the More button (...) on the right side of the listing.
- Select Delete resource group, and then confirm.
Next steps
In this tutorial, you learned how to use Apache Storm to write data to the HDFS-compatible storage used by Apache Storm on HDInsight.
Discover other Apache Storm examples for HDInsight