SAIGE 和 SAIGE-基因(使用管道 转换器) SAIGE and SAIGE-GENE (using Pipe Transformer)

SAIGE是一个命令行工具,用 R 编写,用于对 biobank 数据执行遗传关联测试。SAIGE is a command-line tool written in R for performing genetic association tests on biobank-scale data. 使用发光的管道变压器对基因组学运行单变量关联测试方法 SAIGE基于区域的关联测试方法 SAIGE-Databricks Runtime 基因Run the single-variant association test method SAIGE and the region-based association test method SAIGE-GENE on Databricks Runtime for Genomics using Glow’s Pipe Transformer.

创建 SAIGE 群集 Create a SAIGE cluster

使用init 脚本在群集中的每个节点上安装HTSLib以及 SAIGE 包、其依赖项和关联的输入文件。Install HTSLib as well as the SAIGE package, its dependencies, and associated input files on every node in the cluster using an init script. 建议创建装入点,以便使用保险丝装载从云存储中存储和下载这些文件。We recommend creating a mount point to store and download these files from cloud storage with the FUSE mount. 将脚本中的值替换为装入点和本地目录。Replace the values in the script with your mount point and local directory.

#!/usr/bin/env bash

set -ex
set -o pipefail

# Install SKAT for SAIGE-GENE
sudo apt-get -y install libboost-all-dev autoconf
Rscript -e 'install.packages("SKAT", repos="https://cran.us.r-project.org")'

# Install SAIGE
cd /opt
git clone https://github.com/weizhouUMICH/SAIGE
cd SAIGE
Rscript extdata/install_packages.R
R CMD INSTALL *.tar.gz

# Install HTSLIB
cd /opt
git clone https://github.com/samtools/htslib
cd htslib
autoheader
autoconf
./configure
make
make install

# Sync SAIGE input files
cp -r <mount-point>/* <local-dir>

运行关联测试Run association tests

若要并行化关联测试,请使用管道转换器。To parallelize the association test, use the Pipe Transformer.

  • 命令:包含SAIGESAIGE-基因脚本的 JSON 数组Command: A JSON array containing a SAIGE or SAIGE-GENE script
  • 输入格式化程序: .VCFInput formatter: VCF
  • 输出格式化程序: CSVOutput formatter: CSV
    • 标头标志: trueHeader flag: true
    • 分隔符: spaceDelimiter: space
import glow
import json

cmd = json.dumps(["bash", "-c", saige_script])
output_df = glow.transform("pipe", input_df, cmd=cmd, input_formatter='vcf', in_vcf_header=input_vcf,
                           output_formatter='csv', out_header='true', out_delimiter=' ')

可以直接从 .VCF 行或使用 .VCF 行的增量表中读取输入数据帧。You can read the input DataFrame directly from a VCF or from a Delta table with VCF rows.

单变量关联测试(SAIGE)Single-variant association test (SAIGE)

以下单元格显示要进行管道传输的示例 SAIGE 脚本。The following cell shows an example SAIGE script to be piped. 单个变体测试文档中介绍了这些选项。The options are explained in the single-variant association test docs.

#!/bin/sh
set -e

export tmpdir=$(mktemp -d -t vcf.XXXXXX)
cat - > ${tmpdir}/input.vcf
bgzip -c ${tmpdir}/input.vcf > ${tmpdir}/input.vcf.gz
tabix -p vcf ${tmpdir}/input.vcf.gz

Rscript /opt/SAIGE/extdata/step2_SPAtests.R \
    --vcfFile=${tmpdir}/input.vcf.gz \
    --vcfFileIndex=${tmpdir}/input.vcf.gz.tbi \
    --SAIGEOutputFile=${tmpdir}/output.txt \
    --sampleFile=<local-dir>/input/sampleIds.txt \
    --GMMATmodelFile=<local-dir>/step-1/output.rda \
    --varianceRatioFile=<local-dir>/step-1/output.varianceRatio.txt \
    --vcfField=GT \
    --chrom=22 \
    --minMAF=0.0001 \
    --minMAC=1 \
    --numLinesOutput=2 \
    >&2

cat ${tmpdir}/output.txt
rm -rf ${tmpdir}

基于区域的关联测试(SAIGE- 基因) Region-based association test (SAIGE-GENE)

在管道处理之前,按区域或基因对输入数据帧进行分区,其中每个分区都包含已排序、不重复和相关的 .VCF 行集。Before piping, partition the input DataFrame by region or gene, with each partition containing the sorted, distinct, and relevant set of VCF rows. 区域和对应的变量应与 SAIGE-基因脚本中引用的组文件中的区域和对应的变量匹配。The regions and corresponding variants should match those in the group file referenced in the SAIGE-GENE script. 区域可以从任何源派生,如SnpEff 管道中的基因批注。Regions can be derived from any source, such as gene annotations from the SnpEff pipeline.

input_df = genotype_df.join(region_df, ["contigName", "start", "end"]) \
  .repartition("geneName") \
  .select("contigName", "start", "end", "referenceAllele", "alternateAlleles", "genotypes") \
  .sortWithinPartitions("contigName", "start")

以下单元格显示了一个要进行管道传输的 SAIGE 基因脚本示例。The following cell shows an example SAIGE-GENE script to be piped. 基于区域的关联测试文档中介绍了这些选项。The options are explained in the region-based association test docs.

#!/bin/sh
set -e

export tmpdir=$(mktemp -d -t vcf.XXXXXX)
uniq -w 100 - > ${tmpdir}/input.vcf
bgzip -c ${tmpdir}/input.vcf > ${tmpdir}/input.vcf.gz
tabix -p vcf ${tmpdir}/input.vcf.gz

Rscript /opt/SAIGE/extdata/step2_SPAtests.R \
    --vcfFile=${tmpdir}/input.vcf.gz \
    --vcfFileIndex=${tmpdir}/input.vcf.gz.tbi \
    --SAIGEOutputFile=${tmpdir}/output.txt \
    --groupFile=${tmpdir}/groupFile.txt \
    --sampleFile=<local-dir>/input/sampleIds.txt \
    --GMMATmodelFile=<local-dir>/step-1/output.rda \
    --varianceRatioFile=<local-dir>/step-1/output.varianceRatio.txt \
    --sparseSigmaFile=<local-dir>/step-1/output.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx \
    --vcfField=GT \
    --chrom=22 \
    --minMAF=0 \
    --minMAC=0.5 \
    --maxMAFforGroupTest=0.5 \
    --numLinesOutput=2 \
    >&2

cat ${tmpdir}/output.txt
rm -rf ${tmpdir}

在管道之后,消除 output 数据帧中每个基因的关联结果的歧义。After piping, disambiguate the association results for each gene in the output DataFrame.

from pyspark.sql import Window

window_spec = Window.partitionBy("Gene")
assoc_df = output_df.withColumn("numMarkerIDs", size(split("markerIDs", ";"))) \
  .withColumn("maxNumMarkerIDs", max("numMarkerIDs").over(window_spec)) \
  .filter("numMarkerIDs = maxNumMarkerIDs") \
  .drop("numMarkerIDs", "maxNumMarkerIDs") \
  .drop_duplicates(["Gene"])

示例笔记本 Example notebooks

空逻辑混合模型适合笔记本Null logistic mixed model fit notebook

获取笔记本Get notebook

SAIGE 笔记本SAIGE notebook

获取笔记本Get notebook

SAIGE-基因笔记本SAIGE-GENE notebook

获取笔记本Get notebook