Problem: Library Unavailability Causing Job Failures
This article explains an ImportError that you may encounter when you launch jobs that import external libraries.
When a job causes a node to restart, the job fails with the following error message:
ImportError: No module named XXX
The Cluster Manager is the part of the Azure Databricks service that manages customer Apache Spark clusters. When it restarts a node, it sends commands to install Python and R libraries on that node. Library installation, or the download of artifacts from the internet, can sometimes take longer than expected, either because of network latency or because the library being attached to the cluster has many dependencies.
The library installation mechanism is meant to guarantee that a notebook can import installed libraries once it attaches to a cluster. However, when library installation through PyPI takes too long, the notebook can attach to the cluster before the installation completes. In this case, the notebook is unable to import the library and the job fails.
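One way to confirm that a job is hitting this race is to check, at the top of the notebook, which of the expected modules cannot be imported yet. The following is a minimal sketch in Python. The helper name missing_modules and the example module names are hypothetical, and the check is intended for top-level module names only.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found yet."""
    # find_spec returns None when no finder can locate a top-level module;
    # it does not execute the module's code.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical usage: replace the names with the libraries attached to the cluster.
print(missing_modules(["json", "module_that_is_not_installed"]))
```

An empty list suggests that the libraries are already visible to the notebook's Python process. A non-empty list is consistent with an installation that has not finished.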
Use notebook-scoped library installation commands in the notebook. You can enter the following commands in one cell, which ensures that all of the specified libraries are installed.
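A minimal sketch of such a cell is shown below. It assumes a Databricks Runtime version that still provides the dbutils.library.installPyPI and dbutils.library.restartPython utilities; newer runtimes favor the %pip install magic command instead. The wrapper function install_notebook_scoped is a hypothetical helper introduced only for illustration, and the library name is a placeholder.

```python
def install_notebook_scoped(dbutils, packages):
    """Install each PyPI package for the current notebook, then restart Python."""
    for package in packages:
        # Request a notebook-scoped install of the package from PyPI.
        dbutils.library.installPyPI(package)
    # Restart the Python process so the newly installed packages can be imported.
    dbutils.library.restartPython()


# In a notebook cell, pass the built-in dbutils object:
# install_notebook_scoped(dbutils, ["<library-name>"])
```

Because restarting Python resets the notebook's Python state, the import statements for these libraries belong in a later cell.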
To avoid delay in downloading the libraries from the internet repositories, you can cache the libraries in DBFS or Azure Blob Storage.
For example, you can download the wheel or egg file for a Python library to a DBFS or Azure Blob Storage location. You can use the REST API or cluster-scoped init scripts to install libraries from DBFS or Azure Blob Storage.
First, download the wheel or egg file from the internet to the DBFS or Azure Blob Storage location. This can be performed in a notebook as follows:
%sh
cd /dbfs/mnt/library
wget <whl/egg file location from the pypi repository>
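The same download can also be sketched in a Python cell, which may be convenient when several files need to be cached in a loop. This is an illustrative alternative to the shell commands, not a required step. The helper name download_artifact is hypothetical, and the sketch assumes that /dbfs/mnt/library is a writable DBFS mount on the driver.

```python
import os
import shutil
import urllib.parse
import urllib.request


def download_artifact(url, dest_dir):
    """Download the file at `url` into `dest_dir` and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    # Keep the original file name, since wheel file names encode version and platform tags.
    file_name = os.path.basename(urllib.parse.urlparse(url).path)
    if not file_name:
        raise ValueError(f"cannot derive a file name from {url!r}")
    dest_path = os.path.join(dest_dir, file_name)
    # Stream the response to disk instead of loading it into memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# In a notebook cell:
# download_artifact("<whl/egg file location from the pypi repository>", "/dbfs/mnt/library")
```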
After the wheel or egg file download completes, you can install the library to the cluster using the REST API, UI, or init script commands.
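For the REST API route, the request could look roughly like the sketch below. It assumes the Libraries API 2.0 endpoint /api/2.0/libraries/install, a personal access token sent as a bearer token, and a library path of the form dbfs:/mnt/library/<file name>. The function name build_install_request and all argument values are hypothetical placeholders, so check the request shape against the API reference for your workspace before relying on it.

```python
import json
import urllib.request


def build_install_request(workspace_url, token, cluster_id, dbfs_path):
    """Build a Libraries API request that installs one wheel or egg on a cluster."""
    # The library spec is keyed by artifact type: "whl" for wheels, "egg" for eggs.
    if dbfs_path.endswith(".whl"):
        kind = "whl"
    elif dbfs_path.endswith(".egg"):
        kind = "egg"
    else:
        raise ValueError(f"expected a .whl or .egg file, got {dbfs_path!r}")
    payload = {"cluster_id": cluster_id, "libraries": [{kind: dbfs_path}]}
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/libraries/install",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


# To send the request:
# urllib.request.urlopen(build_install_request("<workspace-url>", "<token>", "<cluster-id>", "<dbfs-path>"))
```

Building the request separately from sending it makes it easy to inspect the payload before any call reaches the cluster.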