Hail is supported in all releases of Databricks Runtime 6.x for Genomics and in Databricks Runtime 7.4 for Genomics and above.
Create a Hail cluster
To create a cluster with Hail installed:
Set the following environment variable:
This environment variable causes the cluster to launch with Hail 0.2, its dependencies, and Python 3.6 installed.
Use Hail in a notebook
For the most part, Hail 0.2 code works in Azure Databricks as described in the Hail documentation. However, the Azure Databricks environment requires a few modifications.
When initializing Hail, pass in the pre-created SparkContext and mark the initialization as idempotent. This setting enables multiple Azure Databricks notebooks to use the same Hail context.
Set skip_logging_configuration to save logs to the rolling driver log4j output. This setting is only supported in Databricks Runtime 6.6 for Genomics and above.
import hail as hl
hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
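Because skip_logging_configuration is only supported in Databricks Runtime 6.6 for Genomics and above, a notebook shared across runtime versions may want to add that argument conditionally. The following is a minimal sketch, not part of the Hail or Databricks APIs: the helper name hail_init_kwargs is hypothetical, and it assumes you supply the runtime version yourself as a string such as "6.6".

```python
def hail_init_kwargs(runtime_version):
    """Build keyword arguments for hl.init from a runtime version string.

    Hypothetical helper: runtime_version is assumed to look like "6.6" or "7.4".
    """
    parts = runtime_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0

    # Arguments used in the initialization call shown above.
    kwargs = {"idempotent": True, "quiet": True}

    # skip_logging_configuration is documented as supported only in
    # Databricks Runtime 6.6 for Genomics and above.
    if (major, minor) >= (6, 6):
        kwargs["skip_logging_configuration"] = True
    return kwargs


# In a notebook (requires Hail and the pre-created SparkContext sc):
# import hail as hl
# hl.init(sc, **hail_init_kwargs("7.4"))
```

On a runtime older than 6.6 the helper omits the logging argument, so the same notebook cell can be reused without editing the hl.init call.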
Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Azure Databricks. To display a Bokeh plot generated by Hail, you can run a command like:
from bokeh.embed import components, file_html
from bokeh.resources import CDN
plot = hl.plot.histogram(mt.DP, range=(0,30), bins=30, title='DP Histogram', legend='DP')
html = file_html(plot, CDN, "Chart")
displayHTML(html)
See Bokeh for more information.
- When Hail support is enabled, your cluster uses Python 3.6, so notebooks written against different versions of Python may not work.
- When Hail support is enabled, fewer Python libraries are installed by default. You can still use the Libraries feature to install new libraries.
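Since a Hail-enabled cluster runs Python 3.6, it can help to check the interpreter version at the top of a notebook that was written elsewhere. The following is a small sketch using only the standard library; the function name python_version_mismatch and the wording of the message are illustrative assumptions, not part of Azure Databricks.

```python
import sys

# Python version stated above for clusters with Hail support enabled.
EXPECTED_PYTHON = (3, 6)


def python_version_mismatch(actual=None, expected=EXPECTED_PYTHON):
    """Return a warning string if the interpreter version differs from
    the expected (major, minor) pair, otherwise None."""
    if actual is None:
        actual = sys.version_info[:2]
    actual = tuple(actual)[:2]
    if actual == tuple(expected):
        return None
    return "This interpreter is Python {}.{}; Hail-enabled clusters use Python {}.{}.".format(
        actual[0], actual[1], expected[0], expected[1]
    )


# Print a warning only when the versions differ.
message = python_version_mismatch()
if message:
    print(message)
```

Running the check locally before moving code to the cluster gives an early hint that syntax or library features newer than Python 3.6 may need to be revisited.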
After you’ve set up a Hail cluster, try out the Hail overview notebook.