Pre-packaged SnpEff annotation pipeline

Note

Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule, see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime have been released as Docker containers, which you can find on the ProjectGlow Docker Hub page.

Setup

Run SnpEff (v4.3) as an Azure Databricks job. Most likely, an Azure Databricks solutions architect will set up the initial job for you. The necessary details are:

Benchmarks

The pipeline has been tested on 85.2 million variant sites from the 1000 Genomes project using the following cluster configuration:

  • Driver: Standard_DS13_v2
  • Workers: Standard_D32s_v3 * 7 (224 cores)
  • Runtime: 2.5 hours
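As a rough way to read these figures, the arithmetic below derives throughput from the numbers listed above. It is only a back-of-the-envelope sketch: it assumes the 2.5 hours is the total wall-clock time for the run, and dividing by core count assumes the work spreads evenly across cores, which the benchmark itself does not state.

```python
# Figures taken from the benchmark listed above.
variant_sites = 85.2e6   # variant sites annotated
workers = 7              # Standard_D32s_v3 worker nodes
cores_per_worker = 32    # 224 cores / 7 workers
runtime_hours = 2.5      # assumed to be total wall-clock time

# Derived quantities (worker cores only; the driver is not counted).
total_cores = workers * cores_per_worker
sites_per_hour = variant_sites / runtime_hours
sites_per_core_hour = variant_sites / (total_cores * runtime_hours)

print(f"total worker cores:  {total_cores}")
print(f"sites per hour:      {sites_per_hour:,.0f}")
print(f"sites per core-hour: {sites_per_core_hour:,.0f}")
```

These numbers are useful only for rough sizing; actual run time on a different cluster or data set may not scale linearly.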

Reference genomes

You must configure the reference genome using environment variables. To use GRCh37, set the environment variable:

refGenomeId=grch37

To use GRCh38 instead, set the environment variable:

refGenomeId=grch38
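The source does not say where the variable is set. One plausible place is the cluster definition, where the Clusters API accepts environment variables in a `spark_env_vars` field. The sketch below builds that fragment of a cluster specification; the helper name `genome_env_vars` is hypothetical, and the rest of the cluster specification is omitted.

```python
import json

# The two reference genome IDs named in this section.
SUPPORTED_GENOMES = {"grch37", "grch38"}


def genome_env_vars(ref_genome_id: str) -> dict:
    """Return the environment-variable mapping for the chosen reference genome.

    Hypothetical helper: it normalizes case and rejects IDs that this
    section does not mention.
    """
    normalized = ref_genome_id.lower()
    if normalized not in SUPPORTED_GENOMES:
        raise ValueError(
            f"Unsupported reference genome {ref_genome_id!r}; "
            f"expected one of {sorted(SUPPORTED_GENOMES)}"
        )
    return {"refGenomeId": normalized}


# Fragment of a cluster specification (assumption: the variable is set
# at the cluster level through `spark_env_vars`).
cluster_spec_fragment = {"spark_env_vars": genome_env_vars("GRCh38")}
print(json.dumps(cluster_spec_fragment, indent=2))
```

Validating the ID up front makes a typo fail when the cluster is defined instead of partway through an annotation run.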

Parameters

The pipeline accepts a number of parameters that control its behavior. The most important and commonly changed parameters are documented here; the rest can be found in the SnpEff Annotation pipeline notebook. After importing the notebook and setting it as a job task, you can set these parameters for all runs or per-run.

Parameter              Default  Description
inputVariants          n/a      Path of input variants (VCF or Delta Lake).
output                 n/a      The path where pipeline output should be written.
exportVCF              false    If true, the pipeline writes results in VCF as well as Delta Lake.
exportVCFAsSingleFile  false    If true, the pipeline exports the VCF as a single file.
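To illustrate setting these parameters per run, the sketch below assembles the body of a Jobs API run-now request, which carries notebook parameters as a string-to-string map in `notebook_params`. The job ID and paths are placeholders, the helper name `build_notebook_params` is hypothetical, and no request is actually sent.

```python
import json


def build_notebook_params(
    input_variants: str,
    output: str,
    export_vcf: bool = False,
    export_vcf_as_single_file: bool = False,
) -> dict:
    """Build the notebook parameter map for one pipeline run.

    Notebook parameters are passed as strings, so booleans are
    rendered as "true" / "false".
    """
    if not input_variants or not output:
        raise ValueError("inputVariants and output have no default and must be set")
    return {
        "inputVariants": input_variants,
        "output": output,
        "exportVCF": str(export_vcf).lower(),
        "exportVCFAsSingleFile": str(export_vcf_as_single_file).lower(),
    }


# Placeholder job ID and paths, for illustration only.
run_now_payload = {
    "job_id": 123,
    "notebook_params": build_notebook_params(
        "dbfs:/example/input.vcf",
        "dbfs:/example/snpeff-output",
        export_vcf=True,
    ),
}
print(json.dumps(run_now_payload, indent=2))
```

Parameters that should apply to every run can instead be set once on the job's notebook task.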

Output

The annotated variants are written out to Delta tables inside the provided output directory. If you configured the pipeline to export to VCF, they’ll appear under the output directory as well.

output
|---annotations
    |---Delta files
|---annotations.vcf
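Given the layout above, downstream code can derive the result locations from the `output` parameter. The following is a minimal sketch: the helper names are hypothetical, the example path is a placeholder, and `load_annotations` assumes it is called with an active `SparkSession` on a cluster that has Delta Lake available.

```python
import posixpath


def output_paths(output: str) -> dict:
    """Map the pipeline's output directory to the locations shown above."""
    return {
        "annotations_delta": posixpath.join(output, "annotations"),
        "annotations_vcf": posixpath.join(output, "annotations.vcf"),
    }


def load_annotations(spark, output: str):
    """Load the annotated variants Delta table as a Spark DataFrame.

    `spark` is an existing SparkSession (for example, the one a
    Databricks notebook provides).
    """
    return spark.read.format("delta").load(output_paths(output)["annotations_delta"])


# Placeholder output directory, for illustration only.
paths = output_paths("dbfs:/example/snpeff-output")
print(paths["annotations_delta"])
print(paths["annotations_vcf"])
```

The VCF location is only populated when the run was configured with exportVCF set to true.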

SnpEff annotation pipeline notebook
