在 HDInsight 上使用 Python 開發 Apache Storm 拓撲Develop Apache Storm topologies using Python on HDInsight

了解如何建立使用 Python 元件的 Apache Storm (英文) 拓撲。Learn how to create an Apache Storm topology that uses Python components. Apache Storm 支援多種語言,甚至可讓您將數種語言的元件結合成一個拓撲。Apache Storm supports multiple languages, even allowing you to combine components from several languages in one topology. Flux (英文) 架構 (隨 Storm 0.10.0 一起引進) 可讓您輕鬆建立使用 Python 元件的解決方案。The Flux framework (introduced with Storm 0.10.0) allows you to easily create solutions that use Python components.

重要

本文件中的資訊已使用 Storm on HDInsight 3.6 進行測試。The information in this document was tested using Storm on HDInsight 3.6.

此專案的程式碼位於 https://github.com/Azure-Samples/hdinsight-python-storm-wordcountThe code for this project is available at https://github.com/Azure-Samples/hdinsight-python-storm-wordcount.

必要條件Prerequisites

  • Python 2.7 或更新版本Python 2.7 or higher

  • Java JDK 1.8 或更新版本Java JDK 1.8 or higher

  • Apache Maven 3Apache Maven 3

  • (選擇性) 本機 Storm 開發環境。(Optional) A local Storm development environment. 只有當您想要在本機執行拓撲時,才需要本機 Storm 環境。A local Storm environment is only needed if you want to run the topology locally. 如需詳細資訊,請參閱設定開發環境For more information, see Setting up a development environment.

Storm 多語言支援Storm multi-language support

Apache Storm 專門用來搭配以任何程式設計語言撰寫的元件。Apache Storm was designed to work with components written using any programming language. 這些元件必須了解如何使用 Storm 的 Thrift 定義The components must understand how to work with the Thrift definition for Storm. 在 Python 中,Apache Storm 專案隨附一個模組,可讓您輕鬆地與 Strom 互動。For Python, a module is provided as part of the Apache Storm project that allows you to easily interface with Storm. 您可以在 https://github.com/apache/storm/blob/master/storm-multilang/python/src/main/resources/resources/storm.py 找到此模組。You can find this module at https://github.com/apache/storm/blob/master/storm-multilang/python/src/main/resources/resources/storm.py.

Storm 是在 Java 虛擬機器 (JVM) 上執行的 Java 程序。Storm is a Java process that runs on the Java Virtual Machine (JVM). 以其他語言撰寫的元件會以子流程執行。Components written in other languages are executed as subprocesses. Storm 會使用透過 stdin/stdout 傳送的 JSON 訊息,與這些子流程進行通訊。The Storm communicates with these subprocesses using JSON messages sent over stdin/stdout. 如需各元件之間通訊的詳細資訊,請參閱 多語言通訊協定 文件。More details on communication between components can be found in the Multi-lang Protocol documentation.

採用 Flux 架構的 PythonPython with the Flux framework

Flux 架構可讓您獨立於元件之外定義 Storm 拓撲。The Flux framework allows you to define Storm topologies separately from the components. Flux 架構會使用 YAML 來定義 Storm 拓撲。The Flux framework uses YAML to define the Storm topology. 下列文字是如何在 YAML 文件中參考 Python 元件的範例:The following text is an example of how to reference a Python component in the YAML document:

# Spout definitions
spouts:
  - id: "sentence-spout"
    className: "org.apache.storm.flux.wrappers.spouts.FluxShellSpout"
    constructorArgs:
      # Command line
      - ["python", "sentencespout.py"]
      # Output field(s)
      - ["sentence"]
    # parallelism hint
    parallelism: 1

類別 FluxShellSpout 用來啟動實作 Spout 的 sentencespout.py 指令碼。The class FluxShellSpout is used to start the sentencespout.py script that implements the spout.

Flux 要求 Python 指令碼位於拓撲所在之 jar 檔案內的 /resources 目錄。Flux expects the Python scripts to be in the /resources directory inside the jar file that contains the topology. 因此,這個範例會將 Python 指令碼儲存在 /multilang/resources 目錄。So this example stores the Python scripts in the /multilang/resources directory. pom.xml 使用下列 XML 來包含此檔案:The pom.xml includes this file using the following XML:

<!-- include the Python components -->
<resource>
    <directory>${basedir}/multilang</directory>
    <filtering>false</filtering>
</resource>

如先前所述,有一個 storm.py 檔案實作 Storm 的 Thrift 定義。As mentioned earlier, there is a storm.py file that implements the Thrift definition for Storm. 建置專案時,Flux 架構會自動包含 storm.py,因此您不必擔心要包含它。The Flux framework includes storm.py automatically when the project is built, so you don't have to worry about including it.

建置專案Build the project

從專案根目錄中,使用下列命令︰From the root of the project, use the following command:

mvn clean compile package

此命令會建立 target/WordCount-1.0-SNAPSHOT.jar 檔案,其中包含已編譯的拓撲。This command creates a target/WordCount-1.0-SNAPSHOT.jar file that contains the compiled topology.

在本機測試拓撲Run the topology locally

若要在本機執行拓撲,請使用下列命令:To run the topology locally, use the following command:

storm jar WordCount-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux -l -R /topology.yaml

注意

此命令需要本機 Storm 開發環境。This command requires a local Storm development environment. 如需詳細資訊,請參閱設定開發環境For more information, see Setting up a development environment

拓撲啟動之後,就會將類似下列文字的資訊發出至本機主控台︰Once the topology starts, it emits information to the local console similar to the following text:

24302 [Thread-25-sentence-spout-executor[4 4]] INFO  o.a.s.s.ShellSpout - ShellLog pid:2436, name:sentence-spout Emiting the cow jumped over the moon
24302 [Thread-30] INFO  o.a.s.t.ShellBolt - ShellLog pid:2438, name:splitter-bolt Emitting the
24302 [Thread-28] INFO  o.a.s.t.ShellBolt - ShellLog pid:2437, name:counter-bolt Emitting years:160
24302 [Thread-17-log-executor[3 3]] INFO  o.a.s.f.w.b.LogInfoBolt - {word=the, count=599}
24303 [Thread-17-log-executor[3 3]] INFO  o.a.s.f.w.b.LogInfoBolt - {word=seven, count=302}
24303 [Thread-17-log-executor[3 3]] INFO  o.a.s.f.w.b.LogInfoBolt - {word=dwarfs, count=143}
24303 [Thread-25-sentence-spout-executor[4 4]] INFO  o.a.s.s.ShellSpout - ShellLog pid:2436, name:sentence-spout Emiting the cow jumped over the moon
24303 [Thread-30] INFO  o.a.s.t.ShellBolt - ShellLog pid:2438, name:splitter-bolt Emitting cow
24303 [Thread-17-log-executor[3 3]] INFO  o.a.s.f.w.b.LogInfoBolt - {word=four, count=160}

若要停止拓撲,請使用 Ctrl+CTo stop the topology, use Ctrl + C.

在 HDInsight 上執行 Storm 拓撲Run the Storm topology on HDInsight

  1. 使用下列命令將 WordCount-1.0-SNAPSHOT.jar檔案複製到 Storm on HDInsight 叢集:Use the following command to copy the WordCount-1.0-SNAPSHOT.jar file to your Storm on HDInsight cluster:

    scp target\WordCount-1.0-SNAPSHOT.jar sshuser@mycluster-ssh.azurehdinsight.net
    

    sshuser 取代為叢集的 SSH 使用者。Replace sshuser with the SSH user for your cluster. mycluster 取代為叢集名稱。Replace mycluster with the cluster name. 系統可能會提示您輸入 SSH 使用者的密碼。You may be prompted to enter the password for the SSH user.

    如需有關使用 SSH 和 SCP 的詳細資訊,請參閱搭配 HDInsight 使用 SSHFor more information on using SSH and SCP, see Use SSH with HDInsight.

  2. 上傳檔案後,使用 SSH 連線至叢集:Once the file has been uploaded, connect to the cluster using SSH:

    ssh sshuser@mycluster-ssh.azurehdinsight.net
    
  3. 從 SSH 工作階段中,使用下列命令啟動叢集上的拓撲:From the SSH session, use the following command to start the topology on the cluster:

    storm jar WordCount-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux -r -R /topology.yaml
    
  4. 您可以使用 Storm UI 來檢視叢集上的拓撲。You can use the Storm UI to view the topology on the cluster. Storm UI 位於 https://mycluster.azurehdinsight.net/stormuiThe Storm UI is located at https://mycluster.azurehdinsight.net/stormui. mycluster 取代為您的叢集名稱。Replace mycluster with your cluster name.

注意

Storm 拓撲啟動之後會一直執行到停止為止。Once started, a Storm topology runs until stopped. 若要停止拓撲,請使用下列其中一種方法:To stop the topology, use one of the following methods:

  • 從命令列執行 storm kill TOPOLOGYNAME 命令The storm kill TOPOLOGYNAME command from the command line
  • Storm UI 中的 [終止] 按鈕。The Kill button in the Storm UI.

後續步驟Next steps

請參閱下列文件,了解搭配使用 Python 與 HDInsight 的其他方式。See the following documents for other ways to use Python with HDInsight: