Use Script Action to install external Python packages for Jupyter notebooks in Apache Spark clusters on HDInsight
Learn how to use Script Actions to configure an Apache Spark cluster on HDInsight (Linux) to use external, community-contributed Python packages that are not included in the cluster out of the box.
You can also configure a Jupyter notebook to use external packages with the %%configure magic. For instructions, see Use external packages with Jupyter notebooks in Apache Spark clusters on HDInsight.
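As a sketch of that alternative, a notebook cell can pass Spark configuration with the %%configure magic; the package coordinates below are illustrative, following the pattern from that article:

```
%%configure -f
{ "conf": { "spark.jars.packages": "com.databricks:spark-csv_2.10:1.4.0" } }
```

The -f flag restarts the Livy session so the new configuration takes effect; any variables defined in earlier cells are lost.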
You can search the Python Package Index for the complete list of packages that are available. You can also install packages made available through other sources, such as Anaconda or conda-forge.
In this article, you learn how to install the TensorFlow package on your cluster by using a Script Action, and how to use it from a Jupyter notebook.
You must have the following:
- An Azure subscription. See Get Azure free trial.
- An Apache Spark cluster on HDInsight. For instructions, see Create Apache Spark clusters in Azure HDInsight. If you do not already have a Spark cluster on HDInsight Linux, you can run script actions during cluster creation. Visit the documentation on how to use custom script actions.
Use external packages with Jupyter notebooks
From the Azure portal, on the startboard, click the tile for your Spark cluster (if you pinned it to the startboard). You can also navigate to your cluster under Browse All > HDInsight Clusters.
From the Spark cluster blade, click Script Actions in the left pane. Run the custom action that installs TensorFlow on the head nodes and the worker nodes. The bash script can be referenced from https://hdiconfigactions.blob.core.windows.net/linuxtensorflow/tensorflowinstall.sh. Visit the documentation on how to use custom script actions.
There are two Python installations in the cluster. Spark uses the Anaconda Python installation located at
/usr/bin/anaconda/bin. Reference that installation in your custom script actions.
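For example, a custom script action that installs a package against that installation might look like the following sketch (the pip path follows the Anaconda location above; the package name is illustrative):

```shell
#!/usr/bin/env bash
# Sketch: install an external Python package into the Anaconda
# installation that Spark uses, rather than the system Python.
/usr/bin/anaconda/bin/pip install tensorflow
```

Using the full path to the Anaconda pip ensures the package is visible to the Python that Spark executors run, not the other system Python.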
Open a PySpark Jupyter notebook
A new notebook is created and opened with the name Untitled.ipynb. Click the notebook name at the top, and enter a friendly name.
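Before importing the package, you can confirm which Python interpreter the notebook kernel is using; on the cluster you would expect a path under /usr/bin/anaconda/bin (on other machines the path differs):

```python
import sys

# Print the interpreter the current kernel is running; on an
# HDInsight Spark cluster this should point at the Anaconda install.
print(sys.executable)
```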
You will now import tensorflow and run a hello world example.
Code to copy:

```python
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
```
The output shows the Hello, TensorFlow! message.
Scenarios
- Spark with BI: Perform interactive data analysis using Spark in HDInsight with BI tools
- Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data
- Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results
- Spark Streaming: Use Spark in HDInsight for building real-time streaming applications
- Website log analysis using Spark in HDInsight
Create and run applications
Tools and extensions
- Use external packages with Jupyter notebooks in Apache Spark clusters on HDInsight
- Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Spark Scala applications
- Use HDInsight Tools Plugin for IntelliJ IDEA to debug Spark applications remotely
- Use Zeppelin notebooks with a Spark cluster on HDInsight
- Kernels available for Jupyter notebook in Spark cluster for HDInsight
- Install Jupyter on your computer and connect to an HDInsight Spark cluster