Tutorial: Write to Apache Hadoop HDFS from Apache Storm on Azure HDInsight

This tutorial demonstrates how to use Apache Storm to write data to the HDFS-compatible storage used by Apache Storm on HDInsight. HDInsight can use both Azure Storage and Azure Data Lake Storage as HDFS-compatible storage. Storm provides an HdfsBolt component that writes data to HDFS. This document provides information on writing to either type of storage from the HdfsBolt.

The example topology used in this document relies on components that are included with Storm on HDInsight. It may require modification to work with Azure Data Lake Storage when used with other Apache Storm clusters.

In this tutorial, you learn how to:

  • Configure the cluster with script action
  • Build and package the topology
  • Deploy and run the topology
  • View output data
  • Stop the topology

Prerequisites

To complete this tutorial, you need:

  • An Apache Storm cluster on HDInsight, which the topology is deployed to and whose default storage it writes to.
  • An SSH client; the steps below use ssh and scp to copy and run the topology on the cluster.
  • A Java JDK and Apache Maven installed in your development environment, used to build and package the topology.

Example configuration

The following YAML is an excerpt from the resources/writetohdfs.yaml file included in the example. This file defines the Storm topology using the Flux framework for Apache Storm.

components:
  - id: "syncPolicy"
    className: "org.apache.storm.hdfs.bolt.sync.CountSyncPolicy"
    constructorArgs:
      - 1000

  # Rotate files when they hit 5 MB
  - id: "rotationPolicy"
    className: "org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy"
    constructorArgs:
      - 5
      - "MB"

  - id: "fileNameFormat"
    className: "org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat"
    configMethods:
      - name: "withPath"
        args: ["${hdfs.write.dir}"]
      - name: "withExtension"
        args: [".txt"]

  - id: "recordFormat"
    className: "org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat"
    configMethods:
      - name: "withFieldDelimiter"
        args: ["|"]

# spout definitions
spouts:
  - id: "tick-spout"
    className: "com.microsoft.example.TickSpout"
    parallelism: 1


# bolt definitions
bolts:
  - id: "hdfs-bolt"
    className: "org.apache.storm.hdfs.bolt.HdfsBolt"
    configMethods:
      - name: "withConfigKey"
        args: ["hdfs.config"]
      - name: "withFsUrl"
        args: ["${hdfs.url}"]
      - name: "withFileNameFormat"
        args: [ref: "fileNameFormat"]
      - name: "withRecordFormat"
        args: [ref: "recordFormat"]
      - name: "withRotationPolicy"
        args: [ref: "rotationPolicy"]
      - name: "withSyncPolicy"
        args: [ref: "syncPolicy"]

This YAML defines the following items:

  • syncPolicy: Defines when files are synced/flushed to the file system. In this example, every 1000 tuples.
  • fileNameFormat: Defines the path and file name pattern to use when writing files. In this example, the path is provided at runtime through a filter file, and the file extension is .txt.
  • recordFormat: Defines the internal format of the files written. In this example, fields are delimited by the | character.
  • rotationPolicy: Defines when to rotate files. In this example, files are rotated once they reach 5 MB, per the constructor arguments shown above.
  • hdfs-bolt: Uses the previous components as configuration parameters for the HdfsBolt class.

For more information on the Flux framework, see https://storm.apache.org/releases/current/flux.html.
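
For reference, the same bolt can be wired up directly in Java rather than through Flux. The following sketch is not part of the example project; it uses the storm-hdfs builder API with the literal values that the YAML above resolves through the filter file at runtime:

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;

// Mirrors the Flux components: sync every 1000 tuples, rotate files at 5 MB,
// write pipe-delimited .txt files under the configured directory.
HdfsBolt bolt = new HdfsBolt()
        .withConfigKey("hdfs.config")
        .withFsUrl("wasbs:///")                  // value of ${hdfs.url}
        .withFileNameFormat(new DefaultFileNameFormat()
                .withPath("/stormdata/")         // value of ${hdfs.write.dir}
                .withExtension(".txt"))
        .withRecordFormat(new DelimitedRecordFormat()
                .withFieldDelimiter("|"))
        .withRotationPolicy(new FileSizeRotationPolicy(5.0f, Units.MB))
        .withSyncPolicy(new CountSyncPolicy(1000));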

Configure the cluster

By default, the classpath of Storm on HDInsight does not include the components that HdfsBolt uses to communicate with Azure Storage or Data Lake Storage. Use the following script action to add these components to the extlib directory for Storm on your cluster:

Property         Value
Script type      Custom
Bash script URI  https://hdiconfigactions.blob.core.windows.net/linuxstormextlibv01/stormextlib.sh
Node type(s)     Nimbus, Supervisor
Parameters       None

For information on using this script with your cluster, see the Customize HDInsight clusters using script actions document.
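
If you prefer to script this step, the Azure CLI can apply a script action to a running cluster. The following is a hedged sketch: RESOURCEGROUP and CLUSTERNAME are placeholders, and the role mapping assumes that on HDInsight the Nimbus service runs on head nodes and Supervisor on worker nodes:

az hdinsight script-action execute \
    --resource-group RESOURCEGROUP \
    --cluster-name CLUSTERNAME \
    --name stormextlib \
    --script-uri https://hdiconfigactions.blob.core.windows.net/linuxstormextlibv01/stormextlib.sh \
    --roles headnode workernode \
    --persist-on-success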

Build and package the topology

  1. Download the example project from https://github.com/Azure-Samples/hdinsight-storm-azure-data-lake-store to your development environment.

  2. From a command prompt, terminal, or shell session, change directories to the root of the downloaded project. To build and package the topology, use the following command:

    mvn compile package
    

    Once the build and packaging complete, a new directory named target is created, containing a file named StormToHdfs-1.0-SNAPSHOT.jar. This file contains the compiled topology.
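
    Optionally, you can confirm that the Flux topology definition was packaged at the root of the jar, which is where the storm command later looks for it. Assuming the JDK's jar tool is on your path:

    jar tf target/StormToHdfs-1.0-SNAPSHOT.jar | grep writetohdfs.yaml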

Deploy and run the topology

  1. Use the following command to copy the topology to the HDInsight cluster. Replace CLUSTERNAME with the name of the cluster.

    scp target\StormToHdfs-1.0-SNAPSHOT.jar sshuser@CLUSTERNAME-ssh.azurehdinsight.net:StormToHdfs-1.0-SNAPSHOT.jar
    
  2. Once the upload completes, use the following command to connect to the HDInsight cluster using SSH. Replace CLUSTERNAME with the name of the cluster.

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    
  3. Once connected, use the following command to create a file named dev.properties:

    nano dev.properties
    
  4. Use the following text as the contents of the dev.properties file. Revise it as needed for your URI scheme.

    hdfs.write.dir: /stormdata/
    hdfs.url: wasbs:///
    

    To save the file, use Ctrl + X, then Y, and finally Enter. The values in this file set the storage URL and the directory name that data is written to.
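
    The wasbs:/// URL targets the cluster's default Azure Storage account. If your cluster uses Azure Data Lake Storage Gen2 as its default storage instead, the URL takes the abfs scheme; CONTAINER and ACCOUNT below are placeholders for your file system and storage account names:

    hdfs.write.dir: /stormdata/
    hdfs.url: abfs://CONTAINER@ACCOUNT.dfs.core.windows.net/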

  5. Use the following command to start the topology:

    storm jar StormToHdfs-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote -R /writetohdfs.yaml --filter dev.properties
    

    This command uses the Flux framework to start the topology by submitting it to the Nimbus node of the cluster. The topology is defined by the writetohdfs.yaml file included in the jar. The dev.properties file is passed as a filter, and the values it contains are read by the topology.

View output data

To view the data, use the following command:

hdfs dfs -ls /stormdata/

A list of the files created by this topology is displayed. The following list is an example of the data returned by the previous command:

Found 23 items
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-0-1561407909895.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-1-1561407915577.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-10-1561407943327.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-11-1561407946312.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-12-1561407949320.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-13-1561407952662.txt
-rw-r--r--   1 storm supergroup    5242880 2019-06-24 20:25 /stormdata/hdfs-bolt-3-14-1561407955502.txt
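
Each file is 5242880 bytes (5 MB), matching the rotation policy configured earlier. To sample the pipe-delimited records themselves, a command along these lines works (the glob assumes the default .txt extension set in the topology):

hdfs dfs -cat /stormdata/*.txt | head -n 5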

Stop the topology

Storm topologies run until stopped, or the cluster is deleted. To stop the topology, use the following command:

storm kill hdfswriter
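
Here hdfswriter is the topology name, which comes from the name field of the writetohdfs.yaml definition (not shown in the excerpt above). If you're unsure of the name, storm list prints every topology running on the cluster:

storm list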

Clean up resources

To clean up the resources created by this tutorial, you can delete the resource group. Deleting the resource group also deletes the associated HDInsight cluster, and any other resources associated with the resource group.

To remove the resource group using the Azure portal:

  1. In the Azure portal, expand the menu on the left side to open the menu of services, and then choose Resource Groups to display the list of your resource groups.
  2. Locate the resource group to delete, and then right-click the More button (...) on the right side of the listing.
  3. Select Delete resource group, and then confirm.

Next steps

In this tutorial, you learned how to use Apache Storm to write data to the HDFS-compatible storage used by Apache Storm on HDInsight.