Hail 0.2

Hail is a library built on Apache Spark for analyzing large genomic datasets. Hail 0.2 is integrated into Databricks Runtime for Genomics.

Create a Hail cluster

To create a cluster with Hail installed, set the following environment variable in the cluster configuration:

    ENABLE_HAIL=true
    

    This environment variable causes the cluster to launch with Hail 0.2, its dependencies, and Python 3.6 installed.
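If you create clusters programmatically rather than through the UI, the same setting can be expressed in a cluster specification. The following is a minimal sketch, assuming the `spark_env_vars` field of the Clusters REST API. The helper name `hail_cluster_spec`, the cluster name, and the runtime version and node type placeholders are hypothetical and are not taken from this article.

```python
import json


def hail_cluster_spec(cluster_name, spark_version, node_type_id, num_workers):
    """Build a cluster specification that enables Hail through ENABLE_HAIL.

    Hypothetical helper for illustration only. It builds the request body
    and does not call any API.
    """
    return {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # The environment variable described above.
        "spark_env_vars": {"ENABLE_HAIL": "true"},
    }


# Placeholder values: substitute a real genomics runtime version and node type.
spec = hail_cluster_spec(
    cluster_name="hail-demo",
    spark_version="<genomics-runtime-version>",
    node_type_id="<node-type>",
    num_workers=2,
)
print(json.dumps(spec, indent=2))
```

The printed JSON is only the request body. How you submit it (REST call, CLI, or other tooling) depends on your workspace setup.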

Use Hail in a notebook

For the most part, Hail 0.2 code in Azure Databricks works as described in the Hail documentation. However, the Azure Databricks environment requires a few modifications.

Initialization

When initializing Hail, pass in the pre-created SparkContext (available in notebooks as sc) and mark the initialization as idempotent. Idempotent initialization enables multiple Azure Databricks notebooks to use the same Hail context.

import hail as hl
hl.init(sc, idempotent=True)

Plotting

Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Azure Databricks. To display a Bokeh plot generated by Hail, render it to HTML and pass the result to displayHTML, for example:

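from bokeh.embed import file_html
from bokeh.resources import CDN

# mt is an existing Hail MatrixTable with a DP field
plot = hl.plot.histogram(mt.DP, range=(0, 30), bins=30, title='DP Histogram', legend='DP')
# Render the plot as a standalone HTML page that loads Bokeh from the CDN
html = file_html(plot, CDN, "Chart")
displayHTML(html)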

See Bokeh in Python Notebooks for more information.

Limitations

  • When Hail support is enabled, your cluster uses Python 3.6, so notebooks written for other Python versions might not work.
  • When Hail support is enabled, fewer Python libraries are installed by default. You can still use the Libraries feature to install new libraries.
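
Before running an existing notebook on a Hail cluster, it can help to check both limitations up front. The following is a rough sketch that uses only the Python standard library. The function names and the `required` list are hypothetical examples, and the module check only handles top-level module names.

```python
import importlib.util
import sys


def python_version_matches(expected=(3, 6)):
    """Return True if the interpreter's (major, minor) version equals expected."""
    return tuple(sys.version_info[:2]) == tuple(expected)


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list: replace with the modules your notebook imports.
required = ["json", "hypothetical_missing_pkg"]

print("Running Python 3.6:", python_version_matches())
print("Modules to install:", missing_modules(required))
```

Any module reported as missing can then be installed with the Libraries feature.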

After you’ve set up a Hail cluster, try out the Hail overview notebook.
