Glow

Glow is an open source project created in collaboration between Databricks and the Regeneron Genetics Center. For information on features in Glow, see the Glow documentation.

Sync Glow notebooks to your workspace

  1. Fork the Glow GitHub repository.
  2. Clone your fork to your Databricks workspace using Repos.
  3. The notebooks are located under docs/source/_static.

Set up a Glow environment

Install Glow on an Azure Databricks cluster via Docker with Databricks Container Services.

You can find containers on the ProjectGlow Dockerhub page. These containers set up environments with Glow and the other libraries that were in Databricks Runtime for Genomics (deprecated). Use projectglow/databricks-glow:<databricks_runtime_version>, replacing the tag with an available Databricks Runtime version.

Or install both of these cluster libraries:

  • Maven: io.projectglow:glow-spark3_2.12:<version>
  • PyPI: glow.py==<version>

Important

  • If you install Glow as a stand-alone PyPI package, install it as a cluster library rather than as a notebook-scoped library with the %pip magic command.
  • Ensure that both the Maven coordinates and the PyPI package are installed on the cluster, and that their versions match.
  • Install the latest version of Glow on Databricks Runtime, not Databricks Runtime for Genomics (deprecated), which has Glow v0.6 installed by default.
  • Do not install Hail on a cluster with Glow, except when extracting genotypes from a Hail Matrix Table.
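
The version-matching requirement above can be enforced by generating both library entries from a single version string. The following is a minimal sketch, not an official helper: the function name glow_cluster_libraries is hypothetical, and the dictionary layout assumes the maven/pypi library specification used by the Databricks Libraries API.

```python
def glow_cluster_libraries(version):
    """Return Maven and PyPI library specs that share one Glow version.

    Assumption: the returned dictionaries follow the library specification
    of the Databricks Libraries API ("maven.coordinates", "pypi.package").
    """
    # Reject empty values and unreplaced placeholders such as "<version>".
    if not version or version.startswith("<"):
        raise ValueError("Pass a concrete Glow version string.")
    return [
        {"maven": {"coordinates": f"io.projectglow:glow-spark3_2.12:{version}"}},
        {"pypi": {"package": f"glow.py=={version}"}},
    ]
```

Because both entries are built from the same argument, the Maven and PyPI versions cannot drift apart.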

Get started with Glow

Databricks recommends that you run the test notebooks on the test data provided by the notebooks before moving on to real data. These notebooks are tested nightly with the latest version of the Glow Docker container.

Important

  • Checkpoint to Delta Lake after you ingest or transform genotype data.
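
As an illustration of the checkpoint recommendation above, the following sketch reads a VCF file with Glow's vcf data source, writes the result to Delta Lake, and continues from the Delta copy. It assumes a cluster where Glow is already installed and a SparkSession named spark is available, as in a Databricks notebook. The function names, the paths, and the choice of overwrite mode are illustrative assumptions, not part of Glow.

```python
def checkpoint_to_delta(spark, df, delta_path):
    """Write df to a Delta table and return a DataFrame read back from it."""
    # "overwrite" is an assumption for this sketch; choose the mode that
    # matches your pipeline.
    df.write.format("delta").mode("overwrite").save(delta_path)
    # Downstream steps read from Delta instead of re-parsing the source files.
    return spark.read.format("delta").load(delta_path)


def ingest_vcf(spark, vcf_path, delta_path):
    """Read variant data with Glow's "vcf" data source, then checkpoint it."""
    variants = spark.read.format("vcf").load(vcf_path)
    return checkpoint_to_delta(spark, variants, delta_path)
```

In a notebook, a call might look like ingest_vcf(spark, "<path-to-vcf>", "<path-to-delta-table>"), with both paths replaced by your own storage locations. The same checkpoint_to_delta call can follow any later transformation of the genotype data.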

Set up automated jobs

After you run the sample notebooks and apply the code to real data, you can automate the steps in your pipeline by using jobs.

Important

  • Start small. Experiment on individual variants, samples or chromosomes.
  • Different steps in your pipeline might require different cluster configurations, depending on the type of computation performed.

Tip

  • Use compute-optimized virtual machines to read variant data from cloud object stores.
  • Use Delta Cache accelerated virtual machines to query variant data.
  • Use memory-optimized virtual machines for genetic association studies.
    • Clusters with small machines have a better price-performance ratio than clusters with large machines.
  • The Glow Pipe Transformer supports parallelization of deep learning tools that run on GPUs.

The following example cluster configuration runs a genetic association study on a single chromosome. Edit the notebook_path and <databricks_runtime_version> as needed.

databricks jobs create --json-file glow-create-job.json

glow-create-job.json:

{
  "name": "glow_gwas",
  "notebook_task": {
    "notebook_path" : "/Users/<user@organization.com>/glow/docs/source/_static/notebooks/tertiary/gwas-quantitative",
    "base_parameters": {
      "allele_freq_cutoff": 0.01
    }
  },
  "new_cluster": {
    "spark_version": "<databricks_runtime_version>.x-scala2.12",
    "azure_attributes": {
      "first_on_demand": 1,
      "availability": "SPOT_WITH_FALLBACK_AZURE",
      "spot_bid_max_price": -1
    },
    "node_type_id": "Standard_E8s_v3",
    "num_workers": 32,
    "spark_conf": {
      "spark.sql.execution.arrow.maxRecordsPerBatch": 100
    },
    "docker_image": {
      "url": "projectglow/databricks-glow:<databricks_runtime_version>"
    }
  }
}
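
Because the runtime version appears in both spark_version and the Docker image tag, it can be convenient to generate glow-create-job.json from a short script instead of editing it by hand. The following is a minimal sketch that reproduces the configuration above. The function names build_glow_job and to_json are hypothetical, and the defaults simply mirror the example values.

```python
import json


def build_glow_job(notebook_path, runtime_version, num_workers=32):
    """Build the example job definition shown above as a Python dictionary."""
    return {
        "name": "glow_gwas",
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": {"allele_freq_cutoff": 0.01},
        },
        "new_cluster": {
            # The same runtime version is used here and in the Docker tag.
            "spark_version": f"{runtime_version}.x-scala2.12",
            "azure_attributes": {
                "first_on_demand": 1,
                "availability": "SPOT_WITH_FALLBACK_AZURE",
                "spot_bid_max_price": -1,
            },
            "node_type_id": "Standard_E8s_v3",
            "num_workers": num_workers,
            "spark_conf": {
                "spark.sql.execution.arrow.maxRecordsPerBatch": 100
            },
            "docker_image": {
                "url": f"projectglow/databricks-glow:{runtime_version}"
            },
        },
    }


def to_json(job):
    """Serialize the job definition for use with the --json-file option."""
    return json.dumps(job, indent=2)
```

Save the output of to_json(build_glow_job(...)) as glow-create-job.json, passing your own notebook path and an available Databricks Runtime version, then create the job with the databricks jobs create command shown above.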