Known issues for Apache Spark cluster on HDInsight
This document keeps track of all the known issues for the HDInsight Spark public preview.
Livy leaks interactive session
When Livy is restarted (from Ambari or because of a headnode 0 virtual machine reboot) while an interactive session is still alive, an interactive job session is leaked. As a result, new jobs can get stuck in the Accepted state and cannot be started.
Use the following procedure to work around the issue:
SSH into the headnode. For information, see Use SSH with HDInsight.
Run the following command to find the application IDs of the interactive jobs started through Livy.
yarn application -list
The default job name is Livy if the job was started from a Livy interactive session with no explicit name specified. For a Livy session started by a Jupyter notebook, the job name starts with remotesparkmagics_*.
Run the following command to kill those jobs.
yarn application -kill <Application ID>
New jobs will start running.
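The lookup and kill steps above can be combined in a small shell sketch. It assumes the tab-separated layout that `yarn application -list` normally prints (application ID in the first column, name in the second) and that the leaked sessions carry the default names described above. `livy_app_ids` is a hypothetical helper name, not part of YARN or Livy, and matching any name that begins with "livy" (case-insensitive) is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `yarn application -list` output on stdin and print
# the IDs of applications whose name looks like a Livy interactive session.
# Assumption: columns are tab-separated, ID first and name second.
livy_app_ids() {
  awk -F'\t' '
    {
      id = $1; name = $2
      gsub(/^[ ]+|[ ]+$/, "", id)     # strip the padding around each column
      gsub(/^[ ]+|[ ]+$/, "", name)
    }
    id ~ /^application_/ && (tolower(name) ~ /^livy/ || name ~ /^remotesparkmagics_/) {
      print id
    }
  '
}

# Usage on the headnode (review the list before killing anything):
#   yarn application -list | livy_app_ids
#   yarn application -list | livy_app_ids | while read -r id; do
#     yarn application -kill "$id"
#   done
```

Printing the IDs first and killing in a separate step makes it easier to avoid stopping a session that is still in use.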
Spark History Server not started
Spark History Server is not started automatically after a cluster is created.
Manually start the history server from Ambari.
Permission issue in Spark log directory
When hdiuser submits a job with spark-submit, there is an error java.io.FileNotFoundException: /var/log/spark/sparkdriver_hdiuser.log (Permission denied) and the driver log is not written.
- Add hdiuser to the Hadoop group.
- Provide 777 permissions on /var/log/spark after cluster creation.
- Update the spark log location using Ambari to be a directory with 777 permissions.
- Run spark-submit as sudo.
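Before applying one of the workarounds above, it can help to confirm that the log directory is really the problem. The following is a minimal sketch: `can_write_log_dir` is a hypothetical helper, and the commented commands assume the group is literally named `hadoop`, which the text above does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the directory exists and the current
# user can create files in it.
can_write_log_dir() {
  [ -d "$1" ] && [ -w "$1" ]
}

LOG_DIR="/var/log/spark"   # path taken from the error message

if can_write_log_dir "$LOG_DIR"; then
  echo "OK: $(id -un) can write to $LOG_DIR"
else
  echo "Cannot write to $LOG_DIR as $(id -un); apply one of the workarounds"
fi

# Possible workarounds, run manually (the group name 'hadoop' is an assumption):
#   sudo usermod -a -G hadoop hdiuser
#   sudo chmod 777 /var/log/spark
```

A group membership change typically takes effect only in a new login session, so log out and back in before retrying spark-submit.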
Spark-Phoenix connector is not supported
Currently, the Spark-Phoenix connector is not supported with an HDInsight Spark cluster.
You must use the Spark-HBase connector instead. For instructions, see How to use Spark-HBase connector.
Issues related to Jupyter notebooks
Following are some known issues related to Jupyter notebooks.
Notebooks with non-ASCII characters in filenames
Jupyter notebooks used in Spark HDInsight clusters should not have non-ASCII characters in their filenames. If you try to upload a file with a non-ASCII filename through the Jupyter UI, the upload fails silently (that is, Jupyter won’t let you upload the file, but it won’t show a visible error either).
Error while loading notebooks of larger sizes
You might see an "Error loading notebook" error when you load large notebooks.
If you get this error, it does not mean your data is corrupt or lost. Your notebooks are still on disk in /var/lib/jupyter, and you can SSH into the cluster to access them. For information, see Use SSH with HDInsight.
Once you have connected to the cluster using SSH, you can copy your notebooks from your cluster to your local machine (using SCP or WinSCP) as a backup to prevent the loss of any important data in the notebook. You can then SSH tunnel into your headnode at port 8001 to access Jupyter without going through the gateway. From there, you can clear the output of your notebook and re-save it to minimize the notebook’s size.
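The backup and tunnel steps might look like the following sketch. The SSH user name `sshuser`, the cluster name `mycluster`, and the `<cluster>-ssh.azurehdinsight.net` endpoint are placeholders and assumptions; substitute the values for your cluster. The script only prints the commands so they can be reviewed before being run from your local machine.

```shell
#!/usr/bin/env bash
# Placeholders: replace with your own SSH user and cluster name.
SSH_USER="sshuser"
CLUSTER="mycluster"
HOST="${CLUSTER}-ssh.azurehdinsight.net"   # assumed SSH endpoint naming

# Print (not run) the command that copies the notebook directory locally.
backup_cmd() {
  echo "scp -r ${SSH_USER}@${HOST}:/var/lib/jupyter ./jupyter-backup"
}

# Print (not run) the command that forwards local port 8001 to port 8001 on
# the headnode; -N keeps the tunnel open without starting a remote shell.
tunnel_cmd() {
  echo "ssh -N -L 8001:localhost:8001 ${SSH_USER}@${HOST}"
}

backup_cmd
tunnel_cmd
```

With the tunnel open, Jupyter should be reachable at http://localhost:8001 in a local browser, assuming it listens on that port on the headnode as described above.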
To prevent this error from happening in the future, you must follow some best practices:
- It is important to keep the notebook size small. Any output from your Spark jobs that is sent back to Jupyter is persisted in the notebook. It is a best practice with Jupyter in general to avoid running .collect() on large RDDs or dataframes; instead, if you want to peek at an RDD’s contents, consider running .sample() so that your output doesn’t get too big.
- Also, when you save a notebook, clear all output cells to reduce the size.
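To see which notebooks are at risk, one option is to list those above a size threshold. This is a sketch only: `large_notebooks` is a hypothetical helper, and the 1024 KiB default is an arbitrary example value, not a documented limit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list .ipynb files under a directory that are larger
# than a threshold given in KiB (default 1024, an arbitrary example value).
large_notebooks() {
  local dir="$1"
  local min_kib="${2:-1024}"
  find "$dir" -type f -name '*.ipynb' -size "+${min_kib}k" | sort
}

# Usage on the headnode, where the notebooks are stored:
#   large_notebooks /var/lib/jupyter
#   large_notebooks /var/lib/jupyter 512
```

Notebooks that show up in this list are good candidates for clearing output cells before the next save.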
Notebook initial startup takes longer than expected
The first code statement in a Jupyter notebook using Spark magic could take more than a minute.
This happens because running the first code cell initiates session configuration in the background and sets the Spark, SQL, and Hive contexts. The first statement runs only after these contexts are set, which gives the impression that the statement itself took a long time to complete.
Jupyter notebook timeout in creating the session
When the Spark cluster is out of resources, the Spark and PySpark kernels in the Jupyter notebook will time out trying to create the session.
Free up some resources in your Spark cluster, then retry:
- Stop other Spark notebooks by going to the Close and Halt menu or clicking Shutdown in the notebook explorer.
- Stop other Spark applications from YARN.
- Restart the notebook you were trying to start up. Enough resources should now be available for you to create a session.
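To judge whether the cluster is short of resources, it can help to count YARN applications by state; several applications waiting in ACCEPTED suggests that the running ones are holding the resources. This is a sketch, assuming the tab-separated layout that `yarn application -list` normally prints, with the state in the sixth column. `count_by_state` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `yarn application -list` output on stdin and print
# one "STATE COUNT" line per application state.
# Assumption: columns are tab-separated and the sixth column is the state.
count_by_state() {
  awk -F'\t' '
    {
      id = $1; state = $6
      gsub(/^[ ]+|[ ]+$/, "", id)      # strip the padding around each column
      gsub(/^[ ]+|[ ]+$/, "", state)
    }
    id ~ /^application_/ { n[state]++ }
    END { for (s in n) print s, n[s] }
  ' | sort
}

# Usage on the headnode:
#   yarn application -list -appStates ACCEPTED,RUNNING | count_by_state
```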
- Spark with BI: Perform interactive data analysis using Spark in HDInsight with BI tools
- Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data
- Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results
- Spark Streaming: Use Spark in HDInsight for building real-time streaming applications
- Website log analysis using Spark in HDInsight
Create and run applications
Tools and extensions
- Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Spark Scala applications
- Use HDInsight Tools Plugin for IntelliJ IDEA to debug Spark applications remotely
- Use Zeppelin notebooks with a Spark cluster on HDInsight
- Kernels available for Jupyter notebook in Spark cluster for HDInsight
- Use external packages with Jupyter notebooks
- Install Jupyter on your computer and connect to an HDInsight Spark cluster